Prometheus, the open source monitoring system for Docker-style containers running in cloud architectures, has formally released a 2.0 version with major architectural changes to improve its performance.
Among the changes that have landed since the release of version 1.6 earlier this year:
- An entirely new storage format for the data accumulated by Prometheus.
- A new way for Prometheus to handle “staleness,” i.e. problems resulting when data reported by Prometheus doesn’t match the actual state of the cluster.
- A method for taking efficient snapshot backups of the entire database.
Most of the changes shouldn’t force experienced Prometheus users to retool their environments. The new features are meant to work under the hood, without significantly altering workflow, although there are a few breaking changes (documented here).
New in Prometheus 2.0: More efficient time-series database storage format
Under the hood, Prometheus is a time-series database—a system for gathering statistics about running containers and storing them in a way that’s indexed by timestamps. Because time-series data arrives at high speed and from many sources, it’s hard to aggregate properly. Writing the data to disk becomes a major bottleneck.
Prometheus 2.0 addresses this by partitioning the data by ranges of time, rather than by data source. The result is far less CPU and disk usage, more manageable latency for queries, and a better mechanism for mopping up data that isn’t needed anymore.
Again, the vast majority of Prometheus deployments won’t need to do anything to leverage these improvements, other than deploy Prometheus 2.0.
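To make the time-range partitioning idea concrete, here is a minimal toy sketch in Python. It is not Prometheus's actual storage code: the class name `TimePartitionedStore`, the two-hour block width, and the in-memory dictionaries are all assumptions chosen for illustration. It shows why grouping samples by time window makes range queries and retention cheap: a query only touches the blocks that overlap its range, and old data is discarded a whole block at a time.

```python
from collections import defaultdict

BLOCK_SECONDS = 2 * 60 * 60  # hypothetical block width (two hours), for illustration only


class TimePartitionedStore:
    """Toy in-memory store that groups samples into fixed-width time blocks."""

    def __init__(self, block_seconds=BLOCK_SECONDS):
        self.block_seconds = block_seconds
        # block start time -> series name -> list of (timestamp, value)
        self.blocks = defaultdict(lambda: defaultdict(list))

    def _block_start(self, timestamp):
        # Round the timestamp down to the start of its block.
        return timestamp - (timestamp % self.block_seconds)

    def append(self, series, timestamp, value):
        self.blocks[self._block_start(timestamp)][series].append((timestamp, value))

    def query(self, series, start, end):
        """Return samples for one series with start <= timestamp <= end.

        Results come back in time order as long as samples were appended
        in time order within each block.
        """
        result = []
        for block_start in sorted(self.blocks):
            block_end = block_start + self.block_seconds
            if block_end <= start or block_start > end:
                continue  # block lies entirely outside the queried range
            for ts, value in self.blocks[block_start].get(series, []):
                if start <= ts <= end:
                    result.append((ts, value))
        return result

    def drop_before(self, cutoff):
        """Retention: delete every block whose data all precedes cutoff."""
        expired = [b for b in self.blocks if b + self.block_seconds <= cutoff]
        for b in expired:
            del self.blocks[b]
        return len(expired)
```

In this sketch, cleaning up expired data never requires scanning individual series; `drop_before` simply removes whole blocks, which is the kind of cheap cleanup a time-partitioned layout allows.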
New in Prometheus 2.0: Better handling of stale data from containers
Another problem Prometheus users have run into is the system’s trouble handling stale data. For instance, users sometimes get bombarded with alerts about a service being down, even after that service has come back up. In addition, if a resource disappears from monitoring and then reappears within a certain time frame, it can be counted twice and produce misleading statistics.
Prometheus 2.0 addresses this by applying more explicit rules to events from sources that have gone stale. The logic is surprisingly complex (see this slide deck for details), but end users are shielded from the vast majority of it.
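As a rough illustration of what "explicit rules" for stale sources can look like, the Python sketch below marks the point where a series stops being reported and refuses to return a value past that marker. This is a simplified model, not the project's real logic: the `STALE` sentinel, the `value_at` function, and the 300-second lookback window are all assumptions made for the example.

```python
STALE = object()  # hypothetical sentinel standing in for an explicit staleness marker


def value_at(samples, t, lookback=300):
    """Return the value of a series visible at time t, or None if there is none.

    samples: list of (timestamp, value) pairs sorted by timestamp;
             a value may be the STALE sentinel.
    lookback: assumed maximum age, in seconds, of a sample that still counts.
    """
    latest = None
    for ts, value in samples:
        if ts > t:
            break  # samples are sorted, so nothing later can apply
        latest = (ts, value)

    if latest is None:
        return None  # no sample at or before t
    ts, value = latest
    if value is STALE:
        return None  # the source was explicitly marked as gone
    if t - ts > lookback:
        return None  # the last sample is too old to trust
    return value
```

With an explicit marker, a series that vanished at a known time stops contributing to queries immediately, rather than lingering until the lookback window runs out. When the source reappears and reports again, new samples after the marker are visible as usual.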
New in Prometheus 2.0: Full database snapshot backups
The new storage engine in Prometheus 2.0 makes it possible to take efficient point-in-time snapshots of the database. Triggering a snapshot is as simple as hitting a specific Prometheus API endpoint.
According to Prometheus developer Fabian Reinartz, those snapshots are small—a fractional percentage of the size of the whole database—and can be copied somewhere for safekeeping. “On disk failure or other scenarios, new Prometheus servers can be started with the snapshot backup with minimal data loss,” says Reinartz.
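For readers who want to script the snapshot call, here is a hedged Python sketch. The article does not name the endpoint, so the path used here is an assumption: later 2.x documentation lists `POST /api/v1/admin/tsdb/snapshot`, available only when the server is started with the admin API enabled, and the 2.0 release itself may expose the call under a different path. The helper names and the expected JSON response shape (`status` plus `data.name`) are likewise assumptions to check against the documentation for your version.

```python
import json
import urllib.request

# Assumed path, taken from later 2.x documentation; verify it for your release.
SNAPSHOT_PATH = "/api/v1/admin/tsdb/snapshot"


def build_snapshot_request(base_url, path=SNAPSHOT_PATH):
    """Build (but do not send) the POST request that asks for a snapshot."""
    url = base_url.rstrip("/") + path
    return urllib.request.Request(url, data=b"", method="POST")


def parse_snapshot_name(body):
    """Extract the snapshot directory name from the assumed JSON response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"snapshot failed: {payload}")
    return payload["data"]["name"]


def take_snapshot(base_url, timeout=30):
    """Send the request to a running server and return the snapshot name."""
    request = build_snapshot_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_snapshot_name(response.read())
```

Calling `take_snapshot("http://localhost:9090")` against a server with the admin API enabled should return the name of a new snapshot directory, which can then be copied elsewhere for safekeeping.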
Precompiled binaries and Docker images are available for download from the official Prometheus project page. Source code for the project, and all its related subprojects, is available on GitHub.