Prometheus is our monitoring and trending system. It collects metrics from all TPA-managed hosts and external services, and sends alerts when out-of-bound conditions occur.

Prometheus also supports basic graphing capabilities although those are limited enough that we use a separate graphing layer on top (see Grafana).

This page also documents auxiliary services connected to Prometheus like the Karma alerting dashboard and IRC bots.

[[TOC]]

Tutorial

If you're just getting started with Prometheus, you might want to follow the training course or see the web dashboards section.

Training course plan

Web dashboards

The main Prometheus web interface is available at:

https://prometheus.torproject.org

It's protected by the same "web password" as Grafana, see the basic authentication in Grafana for more information.

A simple query you can try is to pick any metric in the list and click Execute. For example, this link will show the 5-minute load over the last two weeks for the known servers.
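
For instance, assuming the standard node exporter metric name for the 5-minute load average, the underlying query would be something like:

node_load5{job="node"}

... with the range in the Graph tab set to something like 2w to cover the last two weeks.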

The Prometheus web interface is crude: it's better to use Grafana dashboards for most purposes other than debugging.

It also shows alerts, but for that, there are better dashboards, see below.

Note that the "classic" dashboard has been deprecated upstream and, starting from Debian 13, has been failing at some tasks. We're slowly replacing it with Grafana and Fabric scripts, see tpo/tpa/team#41790 for progress.

For general queries in particular, use the prometheus.query-to-series task, for example:

fab prometheus.query-to-series --expression 'up!=1'

... will show jobs that are "down".

Alerting dashboards

There are a couple of web interfaces to see alerts in our setup:

  • Karma dashboard - our primary view on currently firing alerts. The alerts are grouped by labels.
    • This web interface only shows what's current, not any form of alert history.
    • Shows links to "run books" related to alerts.
    • Useful view: @state!=suppressed hides silenced alerts from the dashboard by default.
  • Grafana availability dashboard - drills down into alerts and, more importantly, shows their past values.
  • Prometheus' Alerts dashboard - shows all alerting rules and which file they are from.
    • Also contains links to graphs based on alerts' PromQL expressions.

Normally, all rules are defined in the [prometheus-alerts.git repository][]. Another view of this is the rules configuration dump which also shows when the rule was last evaluated and how long it took.

Each alert should have a URL to a "run book" in its annotations, typically a link to this very wiki, in the "Pager playbook" section, which shows how to handle any particular outage. If it's not present, it's a bug and can be filed as such.

Silencing alerts

With Alertmanager, you can stop alerts from sending notifications by creating a "silence". A silence is an expression matching alerts by labels and other values, with a start and an end time. Silences can have an optional author name and description, and we strongly recommend setting them so that others can refer to you if they have questions.

The main method for managing silences is via the Karma dashboard. You can also manage them on the command line via fabric.

Silencing an alert in advance

Say you are planning some service maintenance and expect an alert to trigger, but you don't want things to be screaming everywhere.

For this, you want to create a "silence", which technically resides in the Alertmanager, but we manage them through the Karma dashboard.

Here is how to set an alert to silence notifications in the future:

  1. Head for the Karma dashboard
  2. Click on the "bell" on the top right
  3. Enter a label name and value matching the expected alert, typically you would pick alertname as a key and the name as the value (e.g. JobDown for a reboot)

    You will also likely want to select an alias to match a specific host.

  4. Pick the duration: this can be done through a duration (e.g. one hour, the default) or start and end times
  5. Enter your name
  6. Enter a comment describing why this silence is there, preferably pointing at an issue describing the work.
  7. Click Preview
  8. It will likely say "No alerts matched", ignore that and click Submit

When submitting a silence, Karma is quite terse: it only shows a green checkbox and a UUID, which is the unique identifier for this silence, as a link to the Alertmanager. Don't click that link, as it doesn't work, and everything we need to do with silences can be done in Karma anyway.

Silencing active alerts

Silencing active alerts is slightly easier than planning one in advance. You can just:

  1. Head for the Karma dashboard
  2. Click on the "hamburger menu"
  3. Select "Silence this group"
  4. Change the comment to link to the incident or who's working on this
  5. Click Preview
  6. It will show which alerts are affected, click Submit

As when silencing in advance, Karma is quite terse when submitting: it only shows a green checkbox and a UUID linking to the Alertmanager. That link doesn't work, and everything we need can be done in Karma anyway.

Note that you can replace steps 2 and 3 above with a series of manipulations to get a filter in the top bar that corresponds to what you want to silence (for example clicking on a label in alerts, or manually entering new filtering criteria) and then clicking on the bell icon at the top, just right of the filter bar. This method can help you create a silence for more than just one alert at a time.

Adding and updating silences with fabric

You can use Fabric to manage silences from the command line or via scripts. This is mostly useful for automatically adding a silence from some other, higher-level task, but you can also use the Fabric task directly.

Here's an example for adding a new silence for all backup alerts for the host idle-dal-02.torproject.org with author "wario" and a comment:

fab silence.create --comment="machine waiting for first backup" \
  --matchers job=bacula --matchers alias=idle-dal-02.torproject.org \
  --ends-at "in 5 days" --created-by "wario"

The author is optional and defaults to the local username. Make sure you have a valid user set in your configuration, and set a meaningful --comment so that others can understand the goal of the silence and can refer to you with questions. The user comes from the getpass.getuser Python function; see that documentation on how to override the default from the environment.

The matchers option can be specified multiple times. All values of the matchers option must match for the silence to find alerts (the values have an "and" boolean relationship).

The --starts-at option is not specified in the example above and that implies that the silence starts from "now". You can use --starts-at for example for planning a silence that will only take effect at the start of a planned maintenance window in the future.

The --starts-at and --ends-at options both accept either ISO 8601 formatted dates or textual dates accepted by the dateparser Python module.
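
For example, both of these forms should be accepted (the dates are, of course, just an illustration):

fab silence.create --comment="planned maintenance" \
  --matchers alias=idle-dal-02.torproject.org \
  --starts-at "2030-01-15T20:00:00Z" --ends-at "2030-01-15 22:00"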

Finally, if you want to update a silence, the command is slightly different but the arguments are the same, except for one addition, --silence-id, which specifies the ID of the silence that needs to be modified:

fab silence.update --silence-id=9732308d-3390-433e-84c9-7f2f0b2fe8fa \
  --comment="machine waiting for first backup - tpa/tpa/team#12345678" \
  --matchers job=bacula --matchers alias=idle-dal-02.torproject.org \
  --ends-at "in 7 days" --created-by "wario"

Adding metrics to applications

If you want your service to be monitored by Prometheus, you need to reuse an existing exporter or write your own. Writing an exporter is more involved, but still fairly easy, and might be necessary if you are the maintainer of an application not already instrumented for Prometheus.

The actual documentation is fairly good, but basically: a Prometheus exporter is a simple HTTP server which responds to a specific HTTP URL (/metrics, by convention, but it can be anything). It responds with a key/value list of entries, one on each line, in a simple text format more or less following the OpenMetrics standard.

Each "key" is a simple string with an arbitrary list of "labels" enclosed in curly braces. The value is a float or integer.

For example, here's how the "node exporter" exports CPU usage:

# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 948736.11
node_cpu_seconds_total{cpu="0",mode="iowait"} 1659.94
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 516.23
node_cpu_seconds_total{cpu="0",mode="softirq"} 16491.47
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 35893.84
node_cpu_seconds_total{cpu="0",mode="user"} 67711.74

Note that the HELP and TYPE lines look like comments, but they are actually important, and misusing them will lead to the metric being ignored by Prometheus.

Also note that Prometheus's actual support for OpenMetrics varies across the ecosystem. It's better to rely on Prometheus' documentation than OpenMetrics when writing metrics for Prometheus.

You don't necessarily have to write all that logic yourself, however: there are client libraries (see the Golang guide, Python demo or C documentation for examples) that do most of the job for you.

In any case, you should be careful about the names and labels of the metrics. See the metric and label naming best practices.
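
As an illustration, here is a minimal sketch of an exporter using the Python prometheus_client library (assuming it is installed; the metric names and port below are made up, not registered anywhere):

# minimal Prometheus exporter sketch using the official Python client
# (python3-prometheus-client / pip install prometheus-client)
import random
import time

from prometheus_client import start_http_server, Counter, Gauge

# a counter with a label, and a gauge, following the naming best practices
REQUESTS = Counter(
    "myapp_requests_total", "Total requests handled by myapp", ["method"]
)
QUEUE_SIZE = Gauge("myapp_queue_length", "Number of jobs waiting in the queue")

if __name__ == "__main__":
    # serve the /metrics endpoint on port 9999 (hypothetical port)
    start_http_server(9999)
    while True:
        # in a real application, these would be updated by the request handlers
        REQUESTS.labels(method="get").inc()
        QUEUE_SIZE.set(random.randint(0, 10))
        time.sleep(5)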

Once you have an exporter endpoint (say at http://example.com:9090/metrics), make sure it works:

curl http://example.com:9090/metrics

This should return a number of metrics that change (or not) at each call. Note that there's a registry of official Prometheus exporter port numbers that should be respected, but it's full (oops).

From there on, provide that endpoint to the sysadmins (or someone with access to the external monitoring server), who will follow the procedure below to add the metric to Prometheus.

Once the exporter is hooked into Prometheus, you can browse the metrics directly at: https://prometheus.torproject.org. Graphs should be available at https://grafana.torproject.org, although those need to be created and committed into git by sysadmins to persist, see the [grafana-dashboards.git repository][] for more information.

Adding scrape targets

"Scrape targets" are remote endpoints that Prometheus "scrapes" (or fetches content from) to get metrics.

There are two ways of adding metrics, depending on whether or not you have access to the Puppet server.

Adding metrics through the git repository

People outside of TPA without access to the Puppet server can contribute targets through a repo called [prometheus-alerts.git][]. To add a scrape target:

  1. Clone the repository, if not done already:

    git clone https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/
    cd prometheus-alerts
    
  2. Assuming you're adding a node exporter, to add the target:

    cat > targets.d/node_myproject.yaml <<EOF
    # scrape the external node exporters for project Foo
    ---
    - targets:
      - targetone.example.com
      - targettwo.example.com
    EOF

  3. Add, commit, and push:

    git checkout -b myproject
    git add targets.d
    git commit -m"add node exporter targets for my project"
    git push origin -u myproject
    

The last push command should show you the URL where you can submit your merge request.

After being merged, the changes should propagate within 4 to 6 hours. Prometheus automatically reloads those rules when they are deployed.

See also the [targets.d documentation in the git repository][].

Adding metrics through Puppet

TPA-managed services should define their scrape jobs, and thus targets, via puppet profiles.

To add a scrape job in a puppet profile, you can use the prometheus::scrape_job defined type, or one of the defined types which are convenience wrappers around that.

Here is, for example, how the GitLab runners are scraped:

# tell Prometheus to scrape the exporter
@@prometheus::scrape_job { "gitlab-runner_${facts['networking']['fqdn']}_9252":
  job_name => 'gitlab_runner',
  targets  => [ "${facts['networking']['fqdn']}:9252" ],
  labels   => {
    'alias' => $facts['networking']['fqdn'],
    'team'  => 'TPA',
  },
}

The job_name (gitlab_runner above) needs to be added to the profile::prometheus::server::internal::collect_scrape_jobs list in hiera/common/prometheus.yaml, for example:

profile::prometheus::server::internal::collect_scrape_jobs:
  # [...]
  - job_name: 'gitlab_runner'
  # [...]

Note that you will likely need a firewall rule to poke a hole for the exporter:

# grant Prometheus access to the exporter, activated with the
# listen_address parameter above
Ferm::Rule <<| tag == 'profile::prometheus::server-gitlab-runner-exporter' |>>

That rule, in turn, is defined with the profile::prometheus::server::rule define, in profile::prometheus::server::internal, like so:

profile::prometheus::server::rule {
  # [...]
  'gitlab-runner': port => 9252;
  # [...]
}

Targets for scrape jobs defined in Hiera are, however, not managed by Puppet: they are defined through files in the [prometheus-alerts.git repository][], which is cloned in /etc/prometheus-alerts on the Prometheus servers, and the scrape jobs read those target files from disk. See the section below for more details on how things are maintained there.

Note: we currently have a handful of blackbox_exporter-related targets for TPA services, namely for the HTTP checks. We intend to move those into puppet profiles whenever possible.

Manually adding targets in Puppet

Normally, services configured in Puppet SHOULD automatically be scraped by Prometheus (see above). If, however, you need to manually configure a service, you may define extra jobs in the $scrape_configs array, in the profile::prometheus::server::internal Puppet class.

For example, because the GitLab setup is not fully managed by Puppet (e.g. [gitlab#20][], but other similar issues remain), we cannot use this automatic setup, so manual scrape targets are defined like this:

  $scrape_configs =
  [
    {
      'job_name'       => 'gitaly',
      'static_configs' => [
        {
          'targets' => [
            'gitlab-02.torproject.org:9236',
          ],
          'labels'  => {
            'alias' => 'Gitaly-Exporter',
          },
        },
      ],
    },
    [...]
  ]

But ideally those would be configured with automatic targets, below.

Metrics for the internal server are scraped automatically if the exporter is configured by the [puppet-prometheus][] module. This is done almost automatically, apart from the need to open a firewall port in our configuration.

Take the apache_exporter as an example: in profile::prometheus::apache_exporter, we include the prometheus::apache_exporter class from the upstream Puppet module, then we open the port to the Prometheus server on the exporter with:

Ferm::Rule <<| tag == 'profile::prometheus::server-apache-exporter' |>>

Those rules are declared on the server, in profile::prometheus::server::internal.

Adding a blackbox target

Most exporters are pretty straightforward: a service binds to a port and exposes metrics through HTTP requests on that port, generally on the /metrics URL.

The blackbox exporter is a special case for exporters: it is scraped by Prometheus via multiple scrape jobs and each scrape job has targets defined.

Each scrape job represents one type of check (e.g. TCP connections, HTTP requests, ICMP ping, etc) that the blackbox exporter is launching and each target is a host or URL or other "address" that the exporter will try to reach. The check will be initiated from the host running the blackbox exporter to the target at the moment the Prometheus server is scraping the exporter.

The blackbox exporter is rather peculiar and counter-intuitive, see the how to debug the blackbox exporter for more information.

Scrape jobs

From Prometheus's point of view, two pieces of information are needed:

  • The address and port of the host where Prometheus can reach the blackbox exporter
  • The target (and possibly the port tested) that the exporter will try to reach

Prometheus transfers the information above to the exporter via two labels:

  • __address__ is used to determine how Prometheus can reach the exporter. This is standard, but because of how we create the blackbox targets, it will initially contain the address of the blackbox target instead of the exporter's. So we need to shuffle label values around in order for the __address__ label to contain the correct value.
  • __param_target is used by the blackbox exporter to determine what it should contact when running its test, i.e. what is the target of the check. So that's the address (and port) of the blackbox target.

The reshuffling of labels mentioned above is achieved with the relabel_configs option for the scrape job.

For TPA-managed services, we define these scrape jobs in Hiera in hiera/common/prometheus.yaml under keys named collect_scrape_jobs. Jobs in those keys expect targets to be exported by other parts of the puppet code.

For example, here's how the ssh scrape job is configured:

- job_name: 'blackbox_ssh_banner'
  metrics_path: '/probe'
  params:
    module:
      - 'ssh_banner'
  relabel_configs:
    - source_labels:
        - '__address__'
      target_label: '__param_target'
    - source_labels:
        - '__param_target'
      target_label: 'instance'
    - target_label: '__address__'
      replacement: 'localhost:9115'

Scrape jobs for non-TPA services are defined in Hiera under keys named scrape_configs in hiera/common/prometheus.yaml. Jobs in those keys expect to find their targets in files on the Prometheus server, through the prometheus-alerts repository. Here's one example of such a scrape job definition:

profile::prometheus::server::external::scrape_configs:
# generic blackbox exporters from any team
- job_name: blackbox
  metrics_path: "/probe"
  params:
    module:
    - http_2xx
  file_sd_configs:
  - files:
    - "/etc/prometheus-alerts/targets.d/blackbox_*.yaml"
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: localhost:9115

In both of the examples, the relabel_configs starts by copying the target's address into the __param_target label. It also populates the instance label with the same value since that label is used in alerts and graphs to display information. Finally, the __address__ label is overridden with the address where Prometheus can reach the exporter.

Known pitfalls with blackbox scrape jobs

Some tests that can be performed with blackbox exporter can have some pitfalls, cases where the monitoring is not doing what you'd expect and thus we're not receiving the information required for proper monitoring. This is a list of some known issues that you should look out for:

  • With the http module, letting it follow redirections simplifies some checks. However, this has the potential side-effect that the metrics associated with the SSL certificate for that check do not contain information about the certificate of the target's domain name, but rather about the certificate for the domain last visited (after following redirections). So certificate expiration alerts will not be alerting about the right thing! One way around this is sketched below.
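
For example, a module that avoids this pitfall could be defined roughly like this in the blackbox exporter configuration (the module name is made up, and recent blackbox exporter versions use the follow_redirects option under the http prober):

modules:
  http_2xx_no_redirects:
    prober: http
    http:
      follow_redirects: false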

Targets

TPA-managed services use puppet exported resources in the appropriate profiles. The targets parameter is used to convey information about the blackbox exporter target (the host being tested by the exporter).

For example, this is how the ssh scrape jobs (in modules/profile/manifests/ssh.pp) are created:

@@prometheus::scrape_job { "blackbox_ssh_banner_${facts['networking']['fqdn']}":
  job_name => 'blackbox_ssh_banner',
  targets  => [ "${facts['networking']['fqdn']}:22" ],
  labels   => {
    'alias' => $facts['networking']['fqdn'],
    'team'  => 'TPA',
  },
}

For non-TPA services, the targets need to be defined in the prometheus-alerts repository.

The targets defined this way for blackbox exporter look exactly like normal Prometheus targets, except that they define what the blackbox exporter will try to reach. The targets can be hostname:port pairs or URLs, depending on the nature of the type of check being defined.

See the documentation for targets in the repository for more details.

PromQL primer

The upstream documentation on PromQL can be a little daunting, so we provide you with a few examples from our infrastructure.

A query, fundamentally, asks the Prometheus server to query its database for a given metric. For example, this simple query will return the status of all exporters, with a value of 0 (down) or 1 (up):

up

You can use labels to select a subset of those, for example this will only check the [node_exporter][]:

up{job="node"}

You can also match the metric against a value, for example this will list all exporters that are unavailable:

up{job="node"}==0

The up metric is not very interesting because it doesn't change often. It's tremendously useful for availability of course, but typically we use more complex queries.

This, for example, is the number of accesses on the Apache web server, according to the [apache_exporter][]:

apache_accesses_total

In itself, however, that metric is not that useful because it's a constantly incrementing counter. What we actually want is the rate of that counter, for which there is of course a function, rate(). We need to apply it to a range of samples of that metric over a given time period (a "range vector"). This, for example, will give us the access rate over 5 minutes:

rate(apache_accesses_total[5m])

That will give us a lot of results though, one per web server. We might want to regroup those, for example, so we would do something like:

sum(rate(apache_accesses_total[5m])) by (classes)

Which would show you the access rate by "classes" (which is our poorly-named "role" label).

Another similar example is this query, which will give us the number of bytes incoming or outgoing, per second, in the last 5 minutes, across the infrastructure:

sum(rate(node_network_transmit_bytes_total[5m]))
sum(rate(node_network_receive_bytes_total[5m]))

Finally, you should know about the difference between rate and increase. The rate() is always "per second", and can be a little hard to read if you're trying to figure out things like "how many hits did we have in the last month", or "how much data did we actually transfer yesterday". For that, you need increase(), which actually counts the changes over the time period. So for example, to answer those two questions, this is the number of hits in the last month:

sum(increase(apache_accesses_total[30d])) by (classes)

And the data transferred in the last 24h:

sum(increase(node_network_transmit_bytes_total[24h]))
sum(increase(node_network_receive_bytes_total[24h]))

For more complex examples of queries, see the queries cheat sheet, the [prometheus-alerts.git repository][], and the [grafana-dashboards.git repository][].

Writing an alert

Now that you have metrics in your application and those are scraped by Prometheus, you are likely going to want to alert on some of those metrics. Be careful to write alerts that are not too noisy, and alert on user-visible symptoms, not on underlying technical issues you think might affect users; see our Alerting philosophy for a discussion on that.

An alerting rule is a simple YAML file that consists mainly of:

  • A name (say JobDown).
  • A Prometheus query, or "expression" (say up != 1).
  • Extra labels and annotations.

Expressions

The most important part of the alert is the expr field, which is a Prometheus query that should evaluate to "true" (non-zero) for the alert to fire.

Here is, for example, the first alert in the [rules.d/tpa_node.rules file][]:

  - alert: JobDown
    expr: up < 1
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: 'Exporter job {{ $labels.job }} on {{ $labels.instance }} is down'
      description: 'Exporter job {{ $labels.job }} on {{ $labels.instance }} has been unreachable for more than 15 minutes.'
      playbook: "https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/prometheus/#exporter-job-down-warnings"

In the above, Prometheus will generate an alert if the metric up is not equal to 1 for more than 15 minutes, hence up < 1.

See the PromQL primer for more information about queries and the queries cheat sheet for more examples.

Duration

The for field means the alert is not passed down to the Alertmanager until that time has passed. It is useful to avoid flapping and temporary conditions.

Here are some typical for delays we use, as a rule of thumb:

  • 0s: checks that already have a built-in time threshold in their expression (see below), or critical conditions requiring immediate action and immediate notification (default). Examples: AptUpdateLagging (checks for apt update not running for more than 24h), RAIDDegraded (a failed disk won't come back on its own in 15m)
  • 15m: availability checks, designed to ignore transient errors. Examples: JobDown, DiskFull
  • 1h: consistency checks, things an operator might have deployed incorrectly but that could recover on their own. Examples: OutdatedLibraries, as needrestart might recover at the end of the upgrade job, which could take more than 15m
  • 1d: daily consistency check. Examples: PackagesPendingTooLong (upgrades are supposed to run daily)

Try to align with those delays, but don't obsess over them. If an alert is better suited to a for delay that differs from the above, simply add a comment to the alert to explain why that period is being used.
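
For example, such a comment could look like this (the alert and metric here are made up):

  - alert: ExampleConsistencyCheck
    # 6h instead of the usual 1h: the underlying job only runs every
    # 4 hours, so a shorter delay would flap
    expr: example_consistency_check_failed > 0
    for: 6h
    labels:
      severity: warning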

Grouping

At this point, Prometheus effectively generates a message that it passes along to the Alertmanager with the annotations and the labels defined in the alerting rule (severity="warning"). It also passes along all other labels that might be attached to the up metric, which is important, as the query can modify which labels are visible. For example, the up metric typically looks like this:

up{alias="test-01.torproject.org",classes="role::ldapdb",instance="test-01.torproject.org:9100",job="node",team="TPA"} 1

Also note that this single expression will generate multiple alerts for multiple matches. For example, if two hosts are down, the metric would look like this:

up{alias="test-01.torproject.org",classes="role::ldapdb",instance="test-01.torproject.org:9100",job="node",team="TPA"} 0
up{alias="test-02.torproject.org",classes="role::ldapdb",instance="test-02.torproject.org:9100",job="node",team="TPA"} 0

This will generate two alerts. This matters, because it can create a lot of noise and confusion on the other end. A good way to deal with this is to use aggregation operators. For example, here is the DRBD alerting rule, which often fires for multiple disks at once because we're mass-migrating instances in Ganeti:

  - alert: DRBDDegraded
    expr: count(node_drbd_disk_state_is_up_to_date != 1) by (job, instance, alias, team)
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "DRBD has {{ $value }} out of date disks on {{ $labels.alias }}"
      description: "Found {{ $value }} disks that are out of date on {{ $labels.alias }}."
      playbook: "https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/drbd#resyncing-disks"

The expression, here, is:

count(node_drbd_disk_state_is_up_to_date != 1) by (job, instance, alias, team)

This matters because otherwise this would create a lot of alerts, one per disk! For example, on fsn-node-01, there are 52 drives:

count(node_drbd_disk_state_is_up_to_date{alias=~"fsn-node-01.*"}) == 52

So we use the count() function to count the number of matching drives per machine. Technically, we count by (job, instance, alias, team), but typically, those 4 labels will be the same for each alert. We still have to specify all of them because otherwise they get dropped by the aggregation function.

Note that the Alertmanager does its own grouping as well, see the group_by setting.
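
For reference, that setting lives in the Alertmanager route configuration and looks something like this (a generic sketch, not necessarily our exact settings):

route:
  group_by: ['alertname', 'team']
  group_wait: 30s
  group_interval: 5m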

Labels

As mentioned above, labels typically come from the metrics used in the alerting rule itself. It's the job of the exporter and the Prometheus configuration to attach most necessary labels to the metrics for the Alertmanager to function properly. In conjunction with the metrics that come from the exporter, we expect the following labels to be produced by either the exporter, the Prometheus scrape configuration, or the alerting rule:

| Label    | Syntax                   | Normal example                 | Backup example                         | Blackbox example          |
|----------|--------------------------|--------------------------------|----------------------------------------|---------------------------|
| job      | name of the job          | node                           | bacula                                 | blackbox_https_2xx_or_3xx |
| team     | name of the team         | TPA                            | TPA                                    | TPA                       |
| severity | warning or critical      | warning                        | warning                                | warning                   |
| instance | host:port                | web-fsn-01.torproject.org:9100 | bacula-director-01.torproject.org:9133 | localhost:9115            |
| alias    | host                     | web-fsn-01.torproject.org      | web-fsn-01.torproject.org              | web-fsn-01.torproject.org |
| target   | target used by blackbox  | not produced                   | not produced                           | www.torproject.org        |

Some notes about the lines of the table above:

  • team: which group to contact for this alert, which affects how alerts get routed. See List of team names.

  • severity: affects alert routing. Use warning unless the alert absolutely needs immediate attention. TPA-RFC-33 defines the alert levels as:

    • warning (new): non-urgent condition, requiring investigation and fixing, but not immediately, no user-visible impact; example: server needs to be rebooted

    • critical: serious condition with disruptive user-visible impact which requires prompt response; example: donation site returns 500 errors

  • instance: host name and port that Prometheus used for scraping.

    For example, for the node exporter it is port 9100 on the monitored host, but for other exporters, it might be another host running the exporter.

    Another example: for the blackbox exporter, it is port 9115 on the blackbox exporter (localhost by default, but there's a blackbox exporter running to monitor the Redis tunnel on the donate service).

    For backups, the exporter is running on the Bacula director, so the instance is bacula-director-01.torproject.org:9133, where the bacula exporter runs.

  • alias: FQDN of the host concerned by the scraped metrics.

    For example, for a blackbox check, this would be the host that serves an HTTPS website we're getting information about. For backups, this would be the FQDN of the machine that is getting backed up.

    This is not the same as "instance without a port", as this does not point to the exporter.

  • target: in the case of a blackbox alert, the actual target being checked. This can be, for example, the full URL, or the SMTP host name and port, etc.

    Note that for URLs, we rely on the blackbox module to determine the scheme that's used for HTTP/HTTPS checks, so we set the target without the scheme prefix (e.g. no https:// prefix). This lets us link HTTPS alerts to HTTP ones in alert inhibitions.

Annotations

Annotations are another field that's part of the alert generated by Prometheus. Those are used to generate messages for the users, depending on the Alertmanager routing. The summary field ends up in the Subject field of outgoing email, and the description is the email body, for example.

Those fields are Golang templates with variables accessible with curly braces. For example, {{ $value }} is the actual value of the metric in the expr query. The list of available variables is somewhat obscure, but some of it is visible in the Prometheus template reference and the Alertmanager template reference. The Golang template system also comes with its own limited set of built-in functions.
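
For example, a hypothetical annotation combining labels, the alert value, and one of the built-in functions could look like this:

summary: "Disk on {{ $labels.alias }} has only {{ $value | humanize }} bytes free"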

Writing a playbook

Every alert in Prometheus must have a playbook annotation. This is (if done well) a URL pointing at a service page like this one, typically in the Pager playbook section, that explains how to deal with the alert.

The playbook must include those things:

  1. The actual code name of the alert (e.g. JobDown or DiskWillFillSoon).

  2. An example of the alert output (e.g. Exporter job gitlab_runner on tb-build-02.torproject.org:9252 is down).

  3. Why this alert triggered, what is its impact.

  4. Optionally, how to reproduce the issue.

  5. How to fix it.

How to reproduce the issue is optional, but important. Think of yourself in the future, tired and panicking because things are broken:

  • Where do you think the error will be visible?
  • Can we curl something to see it happening?
  • Is there a dashboard where you can see trends?
  • Is there a specific Prometheus query to run live?
  • Which log file can we inspect?
  • Which systemd service is running it?

The "how to fix it" can be a simple one line, or it can go into a multiple case example of scenarios that were found in the wild. It's the hard part: sometimes, when you make an alert, you don't actually know how to handle the situation. If so, explicitly state that problem in the playbook, and say you're sorry, and that it should be fixed.

If the playbook becomes too complicated, consider making a Fabric script out of it.

A good example of a proper playbook is the text file collector errors playbook here. It has all the above points, including actual fixes for different actual scenarios.

Here's a template to get started:

### Foo errors

The `FooDegraded` alert looks like this:

    Service Foo has too many errors on test.torproject.org

It means that the service Foo is having some kind of trouble. [Explain
why this happened, and what the impact is, what means for which
users. Are we losing money, data, exposing users, etc.]

[Optional] You can tell this is a real issue by going to place X and
trying Y.

[Ideal] To fix this issue, [inverse the polarity of the shift inverter
in service Foo].

[Optional] We do not yet know exactly how to fix this issue, sorry. Please
document here how you fix this next time.

Alerting rule template

Here is an alert template that has most fields you should be using in your alerts.

  - alert: FooDegraded
    expr: sum(foo_error_count) by (job, instance, alias, team)
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "Service Foo has too many errors on {{ $labels.alias }}"
      description: "Found {{ $value }} errors in service Foo on {{ $labels.alias }}."
      playbook: "https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/foo#too-many-errors"

Adding alerting rules to Prometheus

Now that you have an alert, you need to deploy it. The Prometheus servers regularly pull the [prometheus-alerts.git repository][] for alerting rule and target definitions. Alert rules can be added through the repository by adding a file in the rules.d directory, see [rules.d][] directory for more documentation on that.

Note the top of the .rules file; for example, in the above tpa_node.rules sample we didn't include this header:

groups:
- name: tpa_node
  rules:

That structure just serves to declare the rest of the alerts in the file. However, consider that "rules within a group are run sequentially at a regular interval, with the same evaluation time" (see the recording rules documentation). So avoid putting all alerts inside the same file. In TPA, we group alerts by exporter, so we have (above) tpa_node for alerts pertaining to the [node_exporter][], for example.
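
Putting the group header and a rule together, a complete file in rules.d therefore looks roughly like this (reusing the JobDown alert shown earlier):

groups:
- name: tpa_node
  rules:
  - alert: JobDown
    expr: up < 1
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: 'Exporter job {{ $labels.job }} on {{ $labels.instance }} is down'
      description: 'Exporter job {{ $labels.job }} on {{ $labels.instance }} has been unreachable for more than 15 minutes.'
      playbook: "https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/prometheus/#exporter-job-down-warnings"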

After being merged, the changes should propagate within 4 to 6 hours. Prometheus does not automatically reload those rules by itself, but Puppet should handle reloading the service as a consequence of the file changes. TPA members can accelerate this by running Puppet on the Prometheus servers, or pulling the code and reloading the Prometheus server with:

git -C /etc/prometheus-alerts/ pull
systemctl reload prometheus

Other expression examples

The AptUpdateLagging alert is a good example of an expression with a built-in threshold:

(time() - apt_package_cache_timestamp_seconds)/(60*60) > 24

What this does is calculate the age of the package cache (given by the apt_package_cache_timestamp_seconds metric) by subtracting it from the current time. That gives us a number of seconds, which we convert to hours (/3600) and then check against our threshold (> 24). This gives us a value (in this case, in hours) we can reuse in our annotation. In general, the formula looks like:

(time() - metric_seconds)/$tick > $threshold

Where $tick is the conversion factor for the order of magnitude (60 for minutes, 60*60 for hours, 24*60*60 for days, etc.) matching the threshold's unit. Note that operator precedence here requires putting the 60*60 tick in parentheses.

The DiskWillFillSoon alert does a linear regression to try to predict if a disk will fill in less than 24h:

  (node_filesystem_readonly != 1)
  and (
    node_filesystem_avail_bytes
    / node_filesystem_size_bytes < 0.2
  )
  and (
    predict_linear(node_filesystem_avail_bytes[6h], 24*60*60)
    < 0
  )

The core of the logic is the magic predict_linear function, but also note how it restricts its checks to file systems with less than 20% space left, to avoid warning about normal write spikes.

How-to

Accessing the web interface

Access to Prometheus is granted in the same way as for Grafana. To obtain access to the Prometheus web interface and to the Karma alert dashboard, follow the instructions for accessing Grafana.

Queries cheat sheet

This section collects PromQL queries we find interesting.

Those are useful but more complex queries that we had to recreate a few times before writing them down.

If you're looking for more basic information about PromQL, see our PromQL primer.

Availability

Those are almost all visible from the availability dashboard.

Unreachable hosts (technically, unavailable node exporters):

up{job="node"} != 1

Currently firing alerts:

ALERTS{alertstate="firing"}

[How much time was the given service (node job, in this case) up in the past period (30d)][]:

avg(avg_over_time(up{job="node"}[30d]))

How many hosts are online at any given point in time:

sum(count(up==1))/sum(count(up)) by (alias)

How long did an alert fire over a given period of time, in seconds per day:

sum_over_time(ALERTS{alertname="MemFullSoon"}[1d:1s])

HTTP status code associated with blackbox probe failures:

sort((probe_success{job="blackbox_https_200"} < 1) + on (alias) group_right probe_http_status_code)

The latter is an example of vector matching, which allows you to "join" multiple metrics together, in this case failed probes (probe_success < 1) with their status code (probe_http_status_code).

Inventory

Those are visible in the main Grafana dashboard.

Number of machines:

count(up{job="node"})

Number of machine per OS version:

count(node_os_info) by (version_id, version_codename)

Number of machines per exporters, or technically, number of machines per job:

sort_desc(sum(up{job=~"$job"}) by (job))

Number of CPU cores, memory size, file system and LVM sizes:

count(node_cpu_seconds_total{classes=~"$class",mode="system"})
sum(node_memory_MemTotal_bytes{classes=~"$class"}) by (alias)
sum(node_filesystem_size_bytes{classes=~"$class"}) by (alias)
sum(node_volume_group_size{classes=~"$class"}) by (alias)

See also the CPU, memory, and disk dashboards.

Uptime, in days:

round((time() - node_boot_time_seconds) / (24*60*60))

Disk usage

This is a less strict version of the [DiskWillFillSoon alert][], see also the disk usage dashboard.

Find disks that will be full in 6 hours:

predict_linear(node_filesystem_avail_bytes[6h], 24*60*60) < 0

Running commands on hosts matching a PromQL query

Say you have an alert or situation (e.g. high load) affecting multiple servers. Say, for example, that you have some issue that you fixed in Puppet that will clear such an alert, and want to run Puppet on all affected servers.

You can use the Prometheus JSON API to return the host list of the hosts matching the query (in this case up < 1) and run commands (in this case, Puppet, or patc) with Cumin:

cumin "$(curl -sSL --data-urlencode='up < 1' 'https://$HTTP_USER@prometheus.torproject.org/api/v1/query | jq -r .data.result[].metric.alias | grep -v '^null$' | paste -sd,)" 'patc'

Make sure to populate the HTTP_USER environment variable to authenticate with the Prometheus server.

Alert debugging

We are now using Prometheus for alerting for TPA services. Here's a basic overview of how things interact around alerting:

  1. Prometheus is configured to create alerts on certain conditions on metrics.
  2. When the PromQL expression produces a result, an alert is created in state pending.
  3. If the PromQL expression keeps producing a result for the whole for duration configured in the alert, the alert changes to state firing and Prometheus sends it to one or more Alertmanager instances.
  4. Alertmanager receives alerts from Prometheus and is responsible for routing the alert to the appropriate channels, for example:
     • A team's or service operator's email address
     • TPA's IRC channel for alerts, #tor-alerts
  5. Karma and Grafana read alert data from Alertmanager and display it in a way that can be used by humans.

Currently, the secondary Prometheus server (prometheus2) reproduces this setup specifically for sending out alerts to other teams with metrics that are not made public.

This section details how the alerting setup mentioned above works.

In general, the upstream documentation for alerting starts from the Alerting Overview but it can be lacking at times. This tutorial can be quite helpful in better understanding how things are working.

Note that Grafana also has its own alerting system but we are not using that, see the Grafana for alerting section of the TPA-RFC-33 proposal.

Diagnosing alerting failures

Normally, alerts should fire on the Prometheus server and be sent out to the Alertmanager server, and be visible in Karma. See also the alert routing details reference.

If you're not sure alerts are working, head to the Prometheus dashboard and look at the /alerts and /rules pages (e.g. https://prometheus.torproject.org/alerts and https://prometheus.torproject.org/rules).

In principle, the Alertmanager address (currently http://localhost:9093, but to be exposed) should also be useful to manage the Alertmanager, but in practice the Debian package does not ship the web interface, so it's of limited use in that regard. See the amtool section below for more information.

Note that the [/api/v1/targets][] URL is also useful to diagnose problems with exporters, in general, see also the troubleshooting section below.

If you can't access the dashboard at all or if the above seems too complicated, Grafana can be used as a debugging tool for metrics as well. In the Explore section, you can input Prometheus metrics, with auto-completion, and inspect the output directly.

There's also the Grafana availability dashboard, see the Alerting dashboards section for details.

Managing alerts with amtool

Since the Alertmanager web UI is not available in Debian, you need to use the [amtool][] command. A few useful commands:

  • amtool alert: show firing alerts
  • amtool silence add --duration=1h --author=anarcat --comment="working on it" ALERTNAME: silence alert ALERTNAME for an hour, with some comments
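
A few other subcommands can help when inspecting or cleaning up silences (depending on the local configuration, you may need to pass --alertmanager.url=http://localhost:9093 explicitly):

amtool silence query
amtool silence expire SILENCE_ID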

Checking alert history

Note that all alerts sent through the Alertmanager are dumped in system logs, through a first "fall through" web hook route:

  routes:
    # dump *all* alerts to the debug logger
    - receiver: 'tpa_http_post_dump'
      continue: true

The receiver is configured below:

  - name: 'tpa_http_post_dump'
    webhook_configs:
      - url: 'http://localhost:8098/'

This URL, in turn, runs a simple Python script that just dumps to a JSON log file all POST requests it receives, which provides us with a history of all notifications sent through the Alertmanager.

All logged entries since last boot can be seen with:

journalctl -u tpa_http_post_dump.service -b

This includes other status logs, so if you want to parse the actual alerts, it's easier to use the logfile in /var/log/prometheus/tpa_http_post_dump.json.

For example, you can see a prettier version of today's entries with the jq command, for example:

jq -C . < /var/log/prometheus/tpa_http_post_dump.json | less -r

Or to follow updates in real time:

tail -F /var/log/prometheus/tpa_http_post_dump.json | jq .

The top-level objects are logging objects, you can also restrict the output to only the alerts being sent with:

tail -F /var/log/prometheus/tpa_http_post_dump.json | jq .args

... which are actually alert groups, which is how Alertmanager dispatches alerts. To see individual alerts inside such a group, you want:

tail -F /var/log/prometheus/tpa_http_post_dump.json | jq .args.alerts[]

Logs are automatically rotated every day by the script itself, and kept for 30 days. That configuration is hardcoded in the script's source code.

See tpo/tpa/team#42222 for improvements on retention and more lookup examples.

Testing alerts

Prometheus can run unit tests for your defined alerts. See upstream unit test documentation.

We managed to build a minimal unit test for an alert. Note that for a unit test to succeed, the test must match all the labels and annotations of the alerts that are expected, including ones that are added by label rewriting in Prometheus:

root@hetzner-nbg1-02:~/tests# cat tpa_system.yml
rule_files:
  - /etc/prometheus-alerts/rules.d/tpa_system.rules

evaluation_interval: 1m

tests:
  # NOTE: interval is *necessary* here. contrary to what the documentation
  #  shows, leaving it out will not default to the evaluation_interval set
  #  above
  - interval: 1m
    # Set of fixtures for the tests below
    input_series:
      - series: 'node_reboot_required{alias="NetworkHealthNodeRelay",instance="akka.0x90.dk:9100",job="relay",team="network"}'
        # this means "one sample set to the value 60" or, as a Python
        # list: [1, 1, 1, 1, ..., 1] or [1 for _ in range(60)]
        #
        # in general, the notation here is 'a+bxn' which turns into
        # the list [a, a+b, a+(2*b), ..., a+(n*b)], or as a list
        # comprehension [a+i*b for i in range(n)]. b defaults to zero,
        # so axn is equivalent to [a for i in range(n)]
        #
        # see https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/#series
        values: '1x60'

    alert_rule_test:
        # NOTE: eval_time is the offset from 0s at which the alert should be
        #  evaluated. if it is shorter than the alert's `for` setting, you will
        #  have some missing values for a while (which might be something you
        #  need to test?). You can play with the eval_time in other test
        #  entries to evaluate the same alert at different offsets in the
        #  timeseries above.
        #
        # Note that the `time()` function returns zero when the evaluation
        # starts, and increments by `interval` until `eval_time` is
        # reached, which differs from how things work in reality,
        # where time() is the number of seconds since the
        # epoch.
        #
        # in other words, this means the simulation starts at the
        # Epoch and stops (here) an hour later.
        - eval_time: 60m
          alertname: NeedsReboot
          exp_alerts:
              # Alert 1.
              - exp_labels:
                    severity: warning
                    instance: akka.0x90.dk:9100
                    job: relay
                    team: network
                    alias: "NetworkHealthNodeRelay"
                exp_annotations:
                    description: "Found pending kernel upgrades for host NetworkHealthNodeRelay"
                    playbook: "https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/reboots"
                    summary: "Host NetworkHealthNodeRelay needs to reboot"

The success result:

root@hetzner-nbg1-01:~/tests# promtool test rules tpa_system.yml
Unit Testing:  tpa_system.yml
  SUCCESS

A failing test will show you what alerts were obtained and how they compare to what your failing test was expecting:

root@hetzner-nbg1-02:~/tests# promtool test rules tpa_system.yml
Unit Testing:  tpa_system.yml
  FAILED:
    alertname: NeedsReboot, time: 10m,
        exp:[
            0:
              Labels:{alertname="NeedsReboot", instance="akka.0x90.dk:9100", job="relay", severity="warning", team="network"}
              Annotations:{}
            ],
        got:[]

The above allows us to confirm that, under a specific set of circumstances (the defined series), a specific query will generate a specific alert with a given set of labels and annotations.

Those labels can then be fed into amtool to test routing. For example, the above alert can be tested against the Alertmanager configuration with:

amtool config routes test alertname="NeedsReboot" instance="akka.0x90.dk:9100" job="relay" severity="warning" team="network"

Or really, what matters in most cases are severity and team, so this also works, and gives out the proper route:

amtool config routes test severity="warning" team="network" ; echo $?

Example:

root@hetzner-nbg1-02:~/tests# amtool config routes test alertname="NeedsReboot" instance="akka.0x90.dk:9100" job="relay" severity="warning" team="network"
network team

Ignore the warning, it's the difference between testing the live server and the local configuration. Naturally, you can test what happens if the team label is missing or incorrect, to confirm default route errors:

root@hetzner-nbg1-02:~/tests# amtool config routes test severity="warning" team="networking"
fallback

The above, for example, confirms that networking is not the correct team name (it should be network).

Note that you can also deliver an alert to a web hook receiver synthetically. For example, this will deliver an empty message to the IRC relay:

curl --header "Content-Type: application/json" --request POST --data "{}" http://localhost:8098

Checking for targets changes

If you are making significant changes to the way targets are discovered by Prometheus, you might want to make sure you are not missing anything.

There used to be a targets web interface but it might be broken (1108095) or even retired altogether (tpo/tpa/team#41790) and besides, visually checking for this is error-prone.

It's better to do a stricter check. For that, you can use the API endpoint and diff the resulting JSON, after some filtering. Here's an example.

  1. fetch the targets before the change:

    curl localhost:9090/api/v1/targets > before.json
    
  2. make the change (typically by running Puppet):

    pat
    
  3. fetch the targets after the change:

    curl localhost:9090/api/v1/targets > after.json
    
  4. diff the two; you'll notice this is way too noisy because the scrape times have changed. You might also get changed paths that you should ignore:

    diff -u before.json after.json
    

    Files might be sorted differently as well.

  5. so instead, create a filtered and sorted JSON file:

    jq -S '.data.activeTargets| sort_by(.scrapeUrl)' < before.json  | grep -v -e lastScrape -e 'meta_filepath' > before-subset.json
    jq -S '.data.activeTargets| sort_by(.scrapeUrl)' < after.json  | grep -v -e lastScrape -e 'meta_filepath' > after-subset.json
    
  6. then diff the filtered views:

    diff -u before-subset.json after-subset.json
    

Metric relabeling

The blackbox target documentation uses a technique called "relabeling" to have the blackbox exporter actually provide useful labels. This is done with the relabel_configs configuration, which changes labels before the scrape is performed, so that the blackbox exporter is scraped instead of the configured target, and that the configured target is passed to the exporter.

The site relabeler.promlabs.com can be extremely useful to learn how to use and iterate more quickly over those configurations. It takes in a set of labels and a set of relabeling rules and will output a diff of the label set after each rule is applied, showing you in detail what's going on.

There are other uses for this. In the bacula job, for example, we relabel the alias label so that it points at the host being backed up instead of the host where backups are stored:

  - job_name: 'bacula'
    metric_relabel_configs:
      # the alias label is what's displayed in IRC summary lines. we want to
      # know which backup jobs failed alerts, not which backup host contains the
      # failed jobs.
      - source_labels:
          - 'alias'
        target_label: 'backup_host'
      - source_labels:
          - 'bacula_job'
        target_label: 'alias'

The above takes the alias label (e.g. bungei.torproject.org) and copies it to a new label, backup_host. It then takes the bacula_job label and uses that as an alias label. This has the effect of turning a metric like this:

bacula_job_last_execution_end_time{alias="bacula-director-01.torproject.org",bacula_job="alberti.torproject.org",instance="bacula-director-01.torproject.org:9133",job="bacula",team="TPA"}

into that:

bacula_job_last_execution_end_time{alias="alberti.torproject.org",backup_host="bacula-director-01.torproject.org",bacula_job="alberti.torproject.org",instance="bacula-director-01.torproject.org:9133",job="bacula",team="TPA"}

This configuration is different from the blackbox exporter because it operates after the scrape, and therefore affects labels coming out of the exporter (which plain relabel_configs can't do).

This can be really tricky to get right. The equivalent change, for the Puppet reporter, initially caused problems because it dropped the alias label on all node metrics. This was the incorrect configuration:

  - job_name: 'node'
    metric_relabel_configs:
      - source_labels: ['host']
        target_label: 'alias'
        action: 'replace'
      - regex: '^host$'
        action: 'labeldrop'

That destroyed the alias label because the first block matched even if the host label was empty. The fix was to match something (anything!) in the host label, making sure it was present, by changing the regex field:

  - job_name: 'node'
    metric_relabel_configs:
      - source_labels: ['host']
        target_label: 'alias'
        action: 'replace'
        regex: '(.+)'
      - regex: '^host$'
        action: 'labeldrop'

Those configurations were done to make it possible to inhibit alerts based on common labels. Before those changes, the alias field (for example) was not common between (say) the Puppet metrics and the normal node exporter, which made it impossible to (say) avoid sending alerts about a catalog being stale in Puppet because a host is down. See tpo/tpa/team#41642 for a full discussion on this.

Note that this is not the same as recording rules, which we do not currently use.

Debugging the blackbox exporter

The upstream documentation has some details that can help. We also have examples above for how to configure it in our setup.

In addition to knowing how it's configured, it's useful to know how to debug it. You can query the exporter from localhost to get more information; if you are using this method for debugging, you'll most probably want to include debugging output. For example, to run an ICMP test on host pauli.torproject.org:

curl 'http://localhost:9115/probe?target=pauli.torproject.org&module=icmp&debug=true'

Note that the above trick can be used for any target, not just for ones currently configured in the blackbox exporter. So you can also use this to test things before creating the final configuration for the target.

Tracing a metric to its source

If you have a metric (say gitlab_workhorse_http_request_duration_seconds_bucket) and you don't know where it's coming from, try getting the full metric with its labels, and look at the job label. This can be done in the Prometheus web interface or with Fabric, for example with:

fab prometheus.query-to-series --expression gitlab_workhorse_http_request_duration_seconds_bucket

For our sample metric, it shows:

anarcat@angela:~/s/t/fabric-tasks> fab prometheus.query-to-series --expression gitlab_workhorse_http_request_duration_seconds_bucket | head
INFO: sending query gitlab_workhorse_http_request_duration_seconds_bucket to https://prometheus.torproject.org/api/v1/query
gitlab_workhorse_http_request_duration_seconds_bucket{alias="gitlab-02.torproject.org",backend_id="rails",code="200",instance="gitlab-02.torproject.org:9229",job="gitlab-workhorse",le="0.005",method="get",route_id="default",team="TPA"} 162
gitlab_workhorse_http_request_duration_seconds_bucket{alias="gitlab-02.torproject.org",backend_id="rails",code="200",instance="gitlab-02.torproject.org:9229",job="gitlab-workhorse",le="0.025",method="get",route_id="default",team="TPA"} 840

The details of those metrics don't matter; what matters is the job label here:

job="gitlab-workhorse"

This corresponds to a job field in the Prometheus configuration. On the prometheus1 server, for example, we can see this in /etc/prometheus/prometheus.yml:

- job_name: gitlab-workhorse
  static_configs:
  - targets:
    - gitlab-02.torproject.org:9229
    labels:
      alias: gitlab-02.torproject.org
      team: TPA

Then you can go on gitlab-02 and see what listens on port 9229:

root@gitlab-02:~# lsof -n -i :9229
COMMAND    PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
gitlab-wo 1282  git    3u  IPv6   14159      0t0  TCP *:9229 (LISTEN)
gitlab-wo 1282  git  561u  IPv6 2450737      0t0  TCP [2620:7:6002:0:266:37ff:feb8:3489]:9229->[2a01:4f8:c2c:1e17::1]:59922 (ESTABLISHED)

... which is:

root@gitlab-02:~# ps 1282
    PID TTY      STAT   TIME COMMAND
   1282 ?        Ssl    9:56 /opt/gitlab/embedded/bin/gitlab-workhorse -listenNetwork unix -listenUmask 0 -listenAddr /var/opt/gitlab/gitlab-workhorse/sockets/s

So that's the GitLab Workhorse proxy, in this case.

In other cases, you'll more typically find it's the node job, which usually means the node exporter. But rather exotic metrics can show up there too: typically, those are written by an external job to /var/lib/prometheus/node-exporter, also known as the "textfile collector". To find what generates such a file, you need to either watch the file change or grep for the filename in Puppet.
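For example, assuming the metric comes from a file named tpa_backuppg.prom (an arbitrary example name here), these two approaches can help: searching a tor-puppet.git checkout for the code that writes it, or watching the collector directory for writes (requires the inotify-tools package):

# search the Puppet code base for the file name (tpa_backuppg is just an example)
grep -r tpa_backuppg modules/ hiera/

# on the affected host, watch for writes to files in the collector directory
inotifywait -m -e close_write /var/lib/prometheus/node-exporter/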

Advanced metrics ingestion

This section documents more advanced metrics injection topics that we rarely need or use.

Back-filling

Starting from version 2.24, Prometheus supports back-filling. This is untested on our side, but this guide might provide a good tutorial.

Push metrics to the Pushgateway

The Pushgateway is setup on the secondary Prometheus server (prometheus2). Note that you might not need to use the Pushgateway, see the article about pushing metrics before going down this route.

The Pushgateway is fairly particular: it listens on port 9091 and gets data through a fairly simple curl-friendly command line API. We have found that, once installed, this command just "does the right thing", more or less:

echo 'some_metrics{foo="bar"} 3.14' | curl --data-binary @- http://localhost:9091/metrics/job/jobtest/instance/instancetest

To confirm the data was ingested by the Pushgateway, run:

curl localhost:9091/metrics | head

The Pushgateway is scraped, like other Prometheus jobs, every minute, with metrics kept for a year, at the time of writing. This is configured, inside Puppet, in profile::prometheus::server::external.

Note that it's not possible to push timestamps into the Pushgateway, so it's not useful to ingest past historical data.
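Stale metric groups can also be removed from the Pushgateway by hand, since pushed metrics stay there until overwritten or deleted. A sketch, reusing the test job and instance names pushed above:

curl -X DELETE http://localhost:9091/metrics/job/jobtest/instance/instancetest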

Deleting metrics

Deleting metrics can be done through the Admin API. That first needs to be enabled in /etc/default/prometheus, by adding --web.enable-admin-api to the ARGS list, then Prometheus needs to be restarted:

service prometheus restart

WARNING: make sure there is authentication in front of Prometheus because this could expose the server to more destruction.

Then you need to issue a special query through the API. This, for example, will wipe all metrics associated with the given instance:

curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="gitlab-02.torproject.org:9101"}'

The same, but only for about an hour, good for testing that only the wanted metrics are destroyed:

curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="gitlab-02.torproject.org:9101"}&start=2021-10-25T19:00:00Z&end=2021-10-25T20:00:00Z'

To match only a job on a specific instance:

curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="gitlab-02.torproject.org:9101"}&match[]={job="gitlab"}'
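To double-check that the deletion worked, the series endpoint can be queried with the same matcher; an empty data list means no series match anymore (samples may still sit on disk until compaction, as explained below):

curl -sg 'http://localhost:9090/api/v1/series?match[]={instance="gitlab-02.torproject.org:9101"}' | jq .data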

Deleted metrics are not necessarily immediately removed from disk but are "eligible for compaction". Changes should show up in queries immediately, however. The "Clean Tombstones" endpoint should be used to remove samples from disk, if that's absolutely necessary:

curl -XPOST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones

Make sure to disable the Admin API when done.

Pager playbook

This section documents alerts and issues with the Prometheus service itself. Do NOT document every alert possibly generated by Prometheus here! Document those in the individual service pages, and link to them in the alert's playbook annotation.

Only alerts that truly don't have any other place to go, or that are completely generic to any service (e.g. JobDown belongs here), should be documented in this section. Generic operating system issues like "disk full" must be documented elsewhere, typically in incident-response.

Troubleshooting missing metrics

If metrics do not correctly show up in Grafana, it might be worth checking in the Prometheus dashboard itself for the same metrics. Typically, if they do not show up in Grafana, they won't show up in Prometheus either, but it's worth a try, even if only to see the raw data.

Then, if data truly isn't present in Prometheus, you can track down the "target" (the exporter) responsible for it in the [/api/v1/targets][] listing. If the target is "unhealthy", it will be marked as "down" and an error message will show up.

This will show all down targets with their error messages:

curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up") | {instance: .labels, scrapeUrl, health, lastError}'

If it returns nothing, it means all targets are healthy. Here's an example of a probe that has not completed yet:

root@hetzner-nbg1-01:~# curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up") | {instance: .labels, scrapeUrl, health, lastError}'
{
  "instance": "gitlab-02.torproject.org:9188",
  "health": "unknown",
  "lastError": ""
}

... and, after a while, an error might come up:

root@hetzner-nbg1-01:~# curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up") | {instance: .labels, scrapeUrl, health, lastError}'
{
  "instance": {
    "alias": "gitlab-02.torproject.org",
    "instance": "gitlab-02.torproject.org:9188",
    "job": "gitlab",
    "team": "TPA"
  },
  "scrapeUrl": "http://gitlab-02.torproject.org:9188/metrics",
  "health": "down",
  "lastError": "Get \"http://gitlab-02.torproject.org:9188/metrics\": dial tcp [2620:7:6002:0:266:37ff:feb8:3489]:9188: connect: connection refused"
}

In that case, there was a typo in the port number: the correct port was 9187 and, once changed, the target was scraped properly. You can directly verify a given target with this jq incantation:

curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.instance == "gitlab-02.torproject.org:9187") | {instance: .labels, health, lastError}'

For example:

root@hetzner-nbg1-01:~# curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.instance == "gitlab-02.torproject.org:9187") | {instance: .labels, health, lastError}'
{
  "instance": {
    "alias": "gitlab-02.torproject.org",
    "instance": "gitlab-02.torproject.org:9187",
    "job": "gitlab",
    "team": "TPA"
  },
  "health": "up",
  "lastError": ""
}
{
  "instance": {
    "alias": "gitlab-02.torproject.org",
    "classes": "role::gitlab",
    "instance": "gitlab-02.torproject.org:9187",
    "job": "postgres",
    "team": "TPA"
  },
  "health": "up",
  "lastError": ""
}

Note that the above is an example of a mis-configuration: in this case, the target was scraped twice, once from Puppet (the classes label is a good hint of that) and once from the static configuration. The latter was removed.

If the target is marked healthy, the next step is to scrape the metrics manually. This, for example, will scrape the Apache exporter from the host gayi:

curl -s http://gayi.torproject.org:9117/metrics | grep apache

In the case of this bug, the metrics were not showing up at all:

root@hetzner-nbg1-01:~# curl -s http://gayi.torproject.org:9117/metrics | grep apache
# HELP apache_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, and goversion from which apache_exporter was built.
# TYPE apache_exporter_build_info gauge
apache_exporter_build_info{branch="",goversion="go1.7.4",revision="",version=""} 1
# HELP apache_exporter_scrape_failures_total Number of errors while scraping apache.
# TYPE apache_exporter_scrape_failures_total counter
apache_exporter_scrape_failures_total 18371
# HELP apache_up Could the apache server be reached
# TYPE apache_up gauge
apache_up 0

Notice, however, the apache_exporter_scrape_failures_total, which was incrementing. From there, we reproduced the work the exporter was doing manually and fixed the issue, which involved passing the correct argument to the exporter.

Slow startup times

If Prometheus takes a long time to start, and floods logs with lines like this every second:

Nov 01 19:43:03 hetzner-nbg1-02 prometheus[49182]: level=info ts=2022-11-01T19:43:03.788Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=30182 maxSegment=30196

It's somewhat normal. At the time of writing, Prometheus2 takes over a minute to start because of this problem. When it's done, it will show the timing information, which is currently:

Nov 01 19:43:04 hetzner-nbg1-02 prometheus[49182]: level=info ts=2022-11-01T19:43:04.533Z caller=head.go:722 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=314.859946ms wal_replay_duration=1m16.079474672s total_replay_duration=1m16.396139067s

The solution for this is to use the memory-snapshot-on-shutdown feature flag, but that is available only from 2.30.0 onward (not in Debian bullseye), and there are critical bugs in the feature flag before 2.34 (see PR 10348), so tread carefully.

In other words, this is frustrating, but expected for older releases of Prometheus. Newer releases may have optimizations for this, but they need a restart to apply.

Pushgateway errors

The Pushgateway web interface provides some basic information about the metrics it collects, and allows you to view the pending metrics before they get scraped by Prometheus, which may be useful to troubleshoot issues with the gateway.

To pull metrics by hand, you can pull directly from the Pushgateway:

curl localhost:9091/metrics

If you get this error while pulling metrics from the exporter:

An error has occurred while serving metrics:

collected metric "some_metric" { label:<name:"instance" value:"" > label:<name:"job" value:"some_job" > label:<name:"tag" value:"val1" > counter:<value:1 > } was collected before with the same name and label values

It's because similar metrics were sent twice into the gateway, which corrupts the state of the Pushgateway, a known problem in earlier versions that was fixed in 0.10 (Debian bullseye and later). A workaround is simply to restart the Pushgateway (and clear the storage, if persistence is enabled, see the --persistence.file flag).

Running out of disk space

In #41070, we encountered a situation where disk usage on the main Prometheus server was growing linearly even if the number of targets didn't change. This is a typical problem in time series like this where the "cardinality" of metrics grows without bound, consuming more and more disk space as time goes by.

The first step is to confirm the diagnosis by looking at the Grafana graph showing Prometheus disk usage over time. This should show a "sawtooth wave" pattern where compactions happen regularly (about once every three weeks), but without growing much over longer periods of time. In the above ticket, the usage was growing despite compactions. There are also shorter-term (~4h) and smaller compactions happening. This information is also available in the normal disk usage graph.
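If Grafana is unreachable, a rough equivalent of that disk usage trend can be pulled from Prometheus' own metrics; a sketch, assuming the server scrapes itself and exposes the usual TSDB metrics:

fab prometheus.query-to-series --expression 'prometheus_tsdb_storage_blocks_bytes'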

We then headed for the self-diagnostics Prometheus provides at:

https://prometheus.torproject.org/classic/status

The "Most Common Label Pairs" section will show us which job is responsible for the most number of metrics. It should be job=node, as that collects a lot of information for all the machines managed by TPA. About 100k pairs is expected there.

It's also expected that the "Highest Cardinality Labels" entry is __name__, at around 1600 entries.

We haven't implemented it yet, but the upstream Storage documentation has some interesting tips, including advice on long-term storage which suggests tweaking the storage.local.series-file-shrink-ratio.

This guide from Alexandre Vazquez also had some useful queries and tips we didn't fully investigate. For example, this reproduces the "Highest Cardinality Metric Names" panel in the Prometheus dashboard:

topk(10, count by (__name__)({__name__=~".+"}))

The api/v1/status/tsdb endpoint also provides equivalent statistics. Here are the equivalent fields:

  • Highest Cardinality Labels: labelValueCountByLabelName
  • Highest Cardinality Metric Names: seriesCountByMetricName
  • Label Names With Highest Cumulative Label Value Length: memoryInBytesByLabelName
  • Most Common Label Pairs: seriesCountByLabelValuePair
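For example, this reproduces the "Highest Cardinality Metric Names" view from the command line, on the Prometheus server itself:

curl -s http://localhost:9090/api/v1/status/tsdb | jq .data.seriesCountByMetricName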

Default route errors

If you get an email like:

Subject: Configuration error - Default route: [FIRING:1] JobDown

It's because an alerting rule fired with an incorrect configuration. Instead of being routed to the proper team, it fell through the default route.

This is not an emergency in the sense that it's a normal alert, but it just got routed improperly. It should be fixed, in time. If in a rush, open a ticket for the team likely responsible for the alerting rule.

Finding the responsible party

So the first step, even if just filing a ticket, is to find the responsible party.

Let's take this email for example:

Date: Wed, 03 Jul 2024 13:34:47 +0000
From: alertmanager@hetzner-nbg1-01.torproject.org
To: root@localhost
Subject: Configuration error - Default route: [FIRING:1] JobDown


CONFIGURATION ERROR: The following notifications were sent via the default route node, meaning
that they had no team label matching one of the per-team routes.

This should not be happening and it should be fixed. See:
https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#reference

Total firing alerts: 1



## Firing Alerts

-----
Time: 2024-07-03 13:34:17.366 +0000 UTC
Summary:  Job mtail@rdsys-test-01.torproject.org is down
Description:  Job mtail on rdsys-test-01.torproject.org has been down for more than 5 minutes.

-----

In the above, the mtail job on rdsys-test-01 "has been down for more than 5 minutes" and the notification was routed to root@localhost.

The more likely target for that rule would probably be TPA, which manages the mtail service and jobs, even though the services on that host are managed by the anti-censorship team service admins. If the host was not managed by TPA or this was a notification about a service operated by the team, then a ticket should be filed there.

In this case, #41667 was filed.

Fixing routing

To fix this issue, you must first reproduce the query that triggered the alert. This can be found in the Prometheus alerts dashboard, if the alert is still firing. In this case, we see this:

Labels State Active Since Value
alertname="JobDown" alias="rdsys-test-01.torproject.org" classes="role::rdsys::backend" instance="rdsys-test-01.torproject.org:3903" job="mtail" severity="warning" Firing 2024-07-03 13:51:17.36676096 +0000 UTC 0

In this case, we can see there's no team label on that metric, which is the root cause.

If we can't find the alert anymore (say it fixed itself), we can still try to look for the matching alerting rule. Grep for the alertname above in prometheus-alerts.git. In this case, we find:

anarcat@angela:prometheus-alerts$ git grep JobDown
rules.d/tpa_system.rules:  - alert: JobDown

and the following rule:

  - alert: JobDown
    expr: up < 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: 'Job {{ $labels.job }}@{{ $labels.alias }} is down'
      description: 'Job {{ $labels.job }} on {{ $labels.alias }} has been down for more than 5 minutes.'
      playbook: "TODO"

The query, in this case, is therefore up < 1. But since the alert has resolved, we can't run the exact same query and expect to find the same host; instead, we need to broaden the query by dropping the conditional (so just up) and adding the right labels. In this case this should do the trick:

up{instance="rdsys-test-01.torproject.org:3903",job="mtail"}

which, when we query Prometheus directly, gives us the following metric:

up{alias="rdsys-test-01.torproject.org",classes="role::rdsys::backend",instance="rdsys-test-01.torproject.org:3903",job="mtail"}
0

There you can see all the labels associated with the metric. Those match the alerting rule labels, but that may not always be the case, so that step can be helpful to confirm root cause.

So, in this case, the mtail job doesn't have the right team label. The fix was to add the team label to the scrape job:

commit 68e9b463e10481745e2fd854aa657f804ab3d365
Author: Antoine Beaupré <anarcat@debian.org>
Date:   Wed Jul 3 10:18:03 2024 -0400

    properly pass team label to postfix mtail job

    Closes: tpo/tpa/team#41667

diff --git a/modules/mtail/manifests/postfix.pp b/modules/mtail/manifests/postfix.pp
index 542782a33..4c30bf563 100644
--- a/modules/mtail/manifests/postfix.pp
+++ b/modules/mtail/manifests/postfix.pp
@@ -8,6 +8,11 @@ class mtail::postfix (
   class { 'mtail':
     logs       => '/var/log/mail.log',
     scrape_job => $scrape_job,
+    scrape_job_labels => {
+      'alias'   => $::fqdn,
+      'classes' => "role::${pick($::role, 'undefined')}",
+      'team'    => 'TPA',
+    },
   }
   mtail::program { 'postfix':
     source => 'puppet:///modules/mtail/postfix.mtail',

See also testing alerts to drill down into queries and alert routing, in case the above doesn't work.
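To check routing without waiting for the alert to fire again, amtool can simulate which receiver a given label set would reach; a sketch, run on the Alertmanager host, assuming the Debian default configuration path:

amtool config routes test --config.file=/etc/prometheus/alertmanager.yml team=TPA severity=warning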

Exporter job down warnings

If you see an error like:

Exporter job gitlab_runner on tb-build-02.torproject.org:9252 is down

That is because Prometheus cannot reach the exporter at the given address. The right way forward is to look at the targets listing and see why Prometheus is failing to scrape the target.

Service down

The simplest and most obvious case is that the service is just down. For example, Prometheus has this to say about the above gitlab_runner job:

Get "http://tb-build-02.torproject.org:9252/metrics": dial tcp [2620:7:6002:0:3eec:efff:fed5:6c40]:9252: connect: connection refused

In this case, the gitlab-runner was just not running (yet). It was being configured and had been added to Puppet, but wasn't yet correctly set up.

In another scenario, however, the service might have stopped or crashed. Use curl to confirm Prometheus' view, testing both IPv4 and IPv6:

curl -4 http://tb-build-02.torproject.org:9252/metrics
curl -6 http://tb-build-02.torproject.org:9252/metrics

Try this from the server itself as well.

If you know which service it is (and the job name should be a good hint), check the service on the server, in this case:

systemctl status gitlab-runner

Invalid exporter output

In another case:

Exporter job civicrm@crm.torproject.org:443 is down

Prometheus was failing with this error:

expected value after metric, got "INVALID"

That means there's a syntax error in the metrics output, in this case no value was provided for a metric, like this:

# HELP civicrm_torcrm_resque_processor_status_up Resque processor status
# TYPE civicrm_torcrm_resque_processor_status_up gauge
civicrm_torcrm_resque_processor_status_up

See [web/civicrm#149][] for further details on this outage.

Forbidden errors

Another example might be:

server returned HTTP status 403 Forbidden

In which case there's a permission issue on the exporter endpoint. Try to reproduce the issue by pulling the endpoint directly, on the Prometheus server, with, for example:

curl -sSL https://donate.torproject.org:443/metrics

Or whatever URL is visible in the targets listing above. This could be a web server configuration issue or a lack of matching credentials in the exporter configuration. Look at the profile::prometheus::server::internal::collect_scrape key in hiera/common/prometheus.yaml, in tor-puppet.git, where credentials should be defined (although they should actually be stored in Trocla).

Apache exporter scraping failed

If you get the error Apache Exporter cannot monitor web server on test.example.com (ApacheScrapingFailed), Apache is up, but the Apache exporter cannot pull its metrics from there.

That means the exporter cannot pull the URL http://localhost/server-status/?auto. To reproduce, pull the URL with curl from the affected server, for example:

root@test.example.com:~# curl http://localhost/server-status/?auto

This is a typical configuration error in Apache where the /server-status host is not available to the exporter because the "default virtual host" was disabled (apache2::default_vhost in Hiera).

There is normally a workaround for this in the profile::prometheus::apache_exporter class, which configures a localhost virtual host to answer properly on this address. Verify that it's present and consider using apache2ctl -S to inspect the virtual host configuration.

See also the Apache web server diagnostics in the incident response docs for broader issues with web servers.

Text file collector errors

The NodeTextfileCollectorErrors alert looks like this:

Node exporter textfile collector errors on test.torproject.org

It means that the text file collector is having trouble parsing one or many of the files in its --collector.textfile.directory (defaults to /var/lib/prometheus/node-exporter).

The error should be visible in the node exporter logs; run the following command to see it:

journalctl -u prometheus-node-exporter -e

Here's a list of issues found in the wild, but your particular issue might be different.

Wrong permissions

Sep 24 20:56:53 bungei prometheus-node-exporter[1387]: ts=2024-09-24T20:56:53.280Z caller=textfile.go:227 level=error collector=textfile msg="failed to collect textfile data" file=tpa_backuppg.prom err="failed to open textfile data file \"/var/lib/prometheus/node-exporter/tpa_backuppg.prom\": open /var/lib/prometheus/node-exporter/tpa_backuppg.prom: permission denied"

In this case, the file was created as a temporary file and moved into place without fixing the permissions (Python's tempfile library creates files readable only by their owner). The fix was to create the file without the tempfile library, using a .tmp suffix instead, and then move it into place.

Garbage in a text file

Sep 24 21:14:41 perdulce prometheus-node-exporter[429]: ts=2024-09-24T21:14:41.783Z caller=textfile.go:227 level=error collector=textfile msg="failed to collect textfile data" file=scheduled_shutdown_metric.prom err="failed to parse textfile data from \"/var/lib/prometheus/node-exporter/scheduled_shutdown_metric.prom\": text format parsing error in line 3: expected '\"' at start of label value, found 'r'"

This was an experimental metric designed in #41734 to keep track of scheduled reboot times, but it was formatted incorrectly. The entire file content was:

# HELP node_shutdown_scheduled_timestamp_seconds time of the next scheduled reboot, or zero
# TYPE node_shutdown_scheduled_timestamp_seconds gauge
node_shutdown_scheduled_timestamp_seconds{kind=reboot} 1725545703.588789

It was missing quotes around reboot; the proper output would have been:

# HELP node_shutdown_scheduled_timestamp_seconds time of the next scheduled reboot, or zero
# TYPE node_shutdown_scheduled_timestamp_seconds gauge
node_shutdown_scheduled_timestamp_seconds{kind="reboot"} 1725545703.588789

But the file was simply removed in this case.
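This kind of formatting mistake can be caught before dropping a file in the collector directory: promtool (shipped with the prometheus package) includes a linter for the exposition format that reads metrics on standard input. A sketch, using the file name from the example above:

promtool check metrics < /var/lib/prometheus/node-exporter/scheduled_shutdown_metric.prom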

Disaster recovery

If a Prometheus/Grafana is destroyed, it should be completely re-buildable from Puppet. Non-configuration data should be restored from backup, with /var/lib/prometheus/ being sufficient to reconstruct history. If even backups are destroyed, history will be lost, but the server should still recover and start tracking new metrics.

Reference

Installation

Puppet implementation

Every TPA server is configured with a node exporter through the roles::monitored class that is included everywhere. The role might eventually be expanded to cover alerting and other monitoring resources as well. This role, in turn, includes the profile::prometheus::client class which configures each client correctly with the right firewall rules.

The firewall rules are exported from the server, defined in profile::prometheus::server. We hacked around limitations of the upstream Puppet module to install Prometheus using backported Debian packages. The monitoring server itself is defined in roles::monitoring.

The Prometheus Puppet module was heavily patched to allow scrape job collection and use of Debian packages for installation, among many other patches sent by anarcat.

Much of the initial Prometheus configuration was also documented in ticket 29681 and especially ticket 29388 which investigates storage requirements and possible alternatives for data retention policies.

Pushgateway

The Pushgateway was configured on the external Prometheus server to allow the metrics team to push their data into Prometheus without having to write a Prometheus exporter inside Collector.

This was done directly inside the profile::prometheus::server::external class, but could be moved to a separate profile if it needs to be deployed internally. It is assumed that the gateway script will run directly on prometheus2 to avoid setting up authentication and/or firewall rules, but this could be changed.

Alertmanager

The Alertmanager is configured on the Prometheus servers and is used to send alerts over IRC and email.

It is installed through Puppet, in profile::prometheus::server::external, but could be moved to its own profile if it is deployed on more than one server.

Note that Alertmanager only dispatches alerts, which are actually generated on the Prometheus server side of things. Make sure the following block exists in the prometheus.yml file:

alerting:
  alert_relabel_configs: []
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093
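After editing prometheus.yml (for example to add the alerting block above), the configuration can be validated before restarting the daemon; a quick sanity check, assuming the standard Debian path:

promtool check config /etc/prometheus/prometheus.yml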

Manual node configuration

External services can be monitored by Prometheus, as long as they comply with the OpenMetrics protocol, which is simply to expose metrics such as this over HTTP:

metric{label="label_val"} value

A real-life (simplified) example:

node_filesystem_avail_bytes{alias="alberti.torproject.org",device="/dev/sda1",fstype="ext4",mountpoint="/"} 16160059392

The above says that the node alberti has the device /dev/sda1 mounted on /, formatted as an ext4 file system which has 16160059392 bytes (~16GB) free.
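To experiment with the exposition format before writing a real exporter, any web server able to serve a static file at /metrics will do; a throwaway sketch (all names and the port are made up) using Python's built-in HTTP server:

mkdir /tmp/fake-exporter
printf 'test_metric{label="value"} 42\n' > /tmp/fake-exporter/metrics
cd /tmp/fake-exporter && python3 -m http.server 8000
# then, from another terminal:
curl http://localhost:8000/metrics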

System-level metrics can easily be monitored by the secondary Prometheus server. This is usually done by installing the "node exporter", with the following steps:

  • On Debian Buster and later:

    apt install prometheus-node-exporter
    
  • On Debian stretch:

    apt install -t stretch-backports prometheus-node-exporter
    

This assumes that backports is already configured; if it isn't, a line like the following in /etc/apt/sources.list.d/backports.debian.org.list should suffice, followed by an apt update:

    deb https://deb.debian.org/debian/  stretch-backports   main contrib non-free

The firewall on the machine needs to allow traffic on the exporter port from the server prometheus2.torproject.org. Then open a ticket for TPA to configure the target. Make sure to mention:

  • The host name for the exporter
  • The port of the exporter (varies according to the exporter, 9100 for the node exporter)
  • How often to scrape the target, if non-default (default: 15 seconds)

Then TPA needs to hook those as part of a new node job in the scrape_configs, in prometheus.yml, from Puppet, in profile::prometheus::server.
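Before or after filing that ticket, it's worth confirming the exporter actually answers from outside the host; for example, from the Prometheus server (the host name below is a placeholder):

curl -s http://new-host.torproject.org:9100/metrics | head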

See also Adding metrics to applications, above.

Upgrades

Upgrades are automatically handled by official Debian packages everywhere, except for Grafana, which is managed through upstream packages, and Karma, which is managed through a container; both are still upgraded automatically.

SLA

Prometheus is currently not doing alerting so it doesn't have any sort of guaranteed availability. It should, hopefully, not lose too many metrics over time so we can do proper long-term resource planning.

Design and architecture

Here is, from the Prometheus overview documentation, the basic architecture of a Prometheus site:

[A drawing of Prometheus' architecture, showing the push gateway and exporters adding metrics, service discovery through file_sd and Kubernetes, alerts pushed to the Alertmanager and the various UIs pulling from Prometheus]

As you can see, Prometheus is somewhat tailored towards Kubernetes but it can be used without it. We're deploying it with the file_sd discovery mechanism, where Puppet collects all exporters into the central server, which then scrapes those exporters every scrape_interval (by default 15 seconds).

The diagram does not show that Prometheus can federate across multiple instances and that the Alertmanager can be configured for high availability. We have a monolithic server setup right now; high availability is planned for TPA-RFC-33-C.

Metrics types

In monitoring distributed systems, Google defines 4 "golden signals", categories of metrics that need to be monitored:

  • Latency: time to service a request
  • Traffic: transactions per second or bandwidth
  • Errors: failure rates, e.g. 500 errors in web servers
  • Saturation: full disks, memory, CPU utilization, etc

In the book, they argue all four should issue pager alerts, but we believe warnings are sufficient for saturation, except in extreme cases ("disk actually full").

Alertmanager

The Alertmanager is a separate program that receives notifications generated by Prometheus servers through an API, groups, and deduplicates notifications before sending them by email or other mechanisms.

Here's what the internal design of the Alertmanager looks like:

[Internal architecture of the Alertmanager, showing how it gets alerts from Prometheus through an API and internally pushes them through various storage queues and deduplicating notification pipelines, along with a clustered gossip protocol]

The first deployments of the Alertmanager at TPO do not feature a "cluster", or high availability (HA) setup.

The Alertmanager has its own web interface to see and silence alerts but it's not deployed in our configuration, we use Karma (previously Cloudflare's unsee) instead.

Alerting philosophy

In general, when working on alerting, keep in mind the "My Philosophy on Alerting" paper from a Google engineer (now the Monitoring distributed systems chapter of the Site Reliability Engineering O'Reilly book).

Alert timing details

Alert timing can be a hard topic to understand in Prometheus alerting, because there are many components associated with it, and Prometheus documentation is not great at explaining how things work clearly. This is an attempt at explaining various parts of it as I (anarcat) understand it as of 2024-09-19, based on the latest documentation available on https://prometheus.io and the current Alertmanager git HEAD.

First, there might be a time vector involved in the Prometheus query. For example, take the query:

increase(django_http_exceptions_total_by_type_total[5m]) > 0

Here, the "vector range" is 5m or five minutes. You might think this will fire only after 5 minutes have passed. I'm not actually sure. In my observations, I have found this fires as soon as an increase is detected, but will stop after the vector range has passed.

Second, there's the for: parameter in the alerting rule. Say this was set to 5 minutes again:

- alert: DjangoExceptions
  expr: increase(django_http_exceptions_total_by_type_total[5m]) > 0
  for: 5m

This means that the alert will be considered only pending for that period. Prometheus will not send an alert to the Alertmanager at all unless increase() was sustained for the period. If that happens, then the alert is marked as firing and Alertmanager will start getting the alert.

(Alertmanager might be getting the alert in the pending state, but that makes no difference to our discussion: it will not send alerts before that period has passed.)

Third, there's another setting, keep_firing_for, that will make Prometheus keep firing the alert even after the query evaluates to false. We're ignoring this for now.

At this point, the alert has reached Alertmanager and it needs to make a decision of what to do with it. More timers are involved.

Alerts will be evaluated against the alert routes, thus aggregated into a new group or added to an existing group according to that route's group_by setting, and then Alertmanager will evaluate the timers set on the particular route that was matched. An alert group is created when an alert is received and no other alerts already match the same values for the group_by criteria. An alert group is removed when all alerts in a group are in state inactive (e.g. resolved).

Fourth, there's the group_wait setting (defaults to 5 seconds, can be customized by route). This will keep Alertmanager from routing any alerts for a while thus allowing it to group the first alert notification for all alerts in the same group in one batch. It implies that you will not receive a notification for a new alert before that timer has elapsed. See also the too short documentation on grouping.

(The group_wait timer is initialized when the alerting group is created, see [dispatch/dispatch.go, line 415, function newAggrGroup][].)

Now, more alerts might be sent by Prometheus if more metrics match the above expression. They are different alerts because they have different labels (say, another host might have exceptions, above, or, more commonly, other hosts require a reboot). Prometheus will then relay that alert to the Alertmanager, and another timer comes in.

Fifth, before relaying that new alert that's already part of a firing group, Alertmanager will wait group_interval (defaults to 5m) before re-sending a notification to a group.

When Alertmanager first creates an alert group, a thread is started for that group and the route's group_interval acts like a time ticker. Notifications are only sent when the group_interval period repeats.

So new alerts merged in a group will wait up to group_interval before being relayed.

(The group_interval timer is also initialized [in dispatch.go, line 460, function aggrGroup.run()][]. It's done after that function waits for the previous timer which is normally based on the group_wait value, but can be switched to group_interval after that very iteration, of course.)

So, conclusions:

  • If an alert flaps because it pops in and out of existence, consider tweaking the query to cover a longer vector, by increasing the time range (e.g. switch from 5m to 1h), or by comparing against a moving average

  • If an alert triggers too quickly due to a transient event (say network noise, or someone messing up a deployment but you want to give them a chance to fix it), increase the for: timer.

  • Inversely, if you fail to detect transient outages, reduce the for: timer, but be aware this might pick up more noise.

  • If alerts come too soon and you get a flood of alerts when an outage starts, increase group_wait.

  • If alerts come in slowly but fail to be grouped because they don't arrive at the same time, increase group_interval.

This analysis was done in response to a mysterious failure to send notifications for a particularly flappy alert.

Another issue with alerting in Prometheus is that you can only silence warnings for a certain amount of time, then you get a notification again. The kthxbye bot works around that issue.

Alert routing details

Once Prometheus has created an alert, it sends it to one or more instances of Alertmanager. This one in turn is responsible for routing the alert to the right communication channel.

That is, assuming Alertmanager is correctly configured in the alerting section of prometheus.yml; see the Installation section.

Alert routes are set as a hierarchical tree in which the first route that matches gets to handle the alert. The first-matching route may decide to ask Alertmanager to continue processing with other routes so that the same alert can match multiple routes. This is how TPA receives emails for critical alerts and also IRC notifications for both warning and critical.

Each route needs to have one or more receivers set.

Receivers and routes are defined in Hiera, in hiera/common/prometheus.yaml.

Receivers

Receivers are set in the key prometheus::alertmanager::receivers and look like this:

- name: 'TPA-email'
  email_configs:
    - to: 'recipient@example.com'
      require_tls: false
      text: '{{ template "email.custom.txt" . }}'
      headers:
        subject: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " -- " }}'

Here we've configured an email recipient. Alertmanager can send alerts through a bunch of other communication channels. For example, to send IRC notifications, we have a daemon binding to localhost on the Prometheus server waiting for web hook calls, and the corresponding receiver has a webhook_configs section instead of email_configs.

Routes

Alert routes are set in the key prometheus::alertmanager::route in Hiera. The default route, the one set at the top level of that key, uses the receiver fallback and some default options for other routes.

The default route should not be explicitly used by alerts. We always want to explicitly match on a set of labels to send alerts to the correct destination. Thus, the default recipient uses a different message template that explicitly says there is a configuration error. This way we can more easily catch what's been wrongly configured.

The default route has a key routes. This is where additional routes are set.

A route needs to set a receiver and then can match on certain label values, using the matchers list. Here's an example for the TPA IRC route:

- receiver: 'irc-tor-admin'
  matchers:
    - 'team = "TPA"'
    - 'severity =~ "critical|warning"'
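When changing receivers or routes in Hiera, the resulting Alertmanager configuration can be sanity-checked on the server once Puppet has applied it; a minimal check, assuming the Debian default path:

amtool check-config /etc/prometheus/alertmanager.yml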

Pushgateway

The Pushgateway is a separate server from the main Prometheus server that is designed to "hold" onto metrics for ephemeral jobs that would otherwise not be around long enough for Prometheus to scrape their metrics. We use it as a workaround to bridge Metrics data with Prometheus/Grafana.

Configuration

The Prometheus server is currently configured mostly through Puppet, where modules define exporters and "export resources" that get collected on the central server, which then scrapes those targets.

The [prometheus-alerts.git repository][] contains all alerts and some non-TPA targets, specified in the targets.d directory for all teams.

Services

Prometheus is made of multiple components:

  • Prometheus: a daemon with an HTTP API that scrapes exporters and targets for metrics, evaluates alerting rules and sends alerts to the Alertmanager
  • Alertmanager: another daemon with HTTP APIs that receives alerts from one or more Prometheus daemons, gossips with other Alertmanagers to deduplicate alerts, and send notifications to receivers
  • Exporters: HTTP endpoints that expose Prometheus metrics, scraped by Prometheus
  • Node exporter: a specific exporter to expose system-level metrics like memory, CPU, disk usage and so on
  • Text file collector: a directory read by the node exporter where other tools can drop metrics

So almost everything happens over HTTP or HTTPS.

Many services expose their metrics by running cron jobs or systemd timers that write to the node exporter text file collector.

Monitored services

Those are the actual services monitored by Prometheus.

Internal server (prometheus1)

The "internal" server scrapes all hosts managed by Puppet for TPA. Puppet installs a [node_exporter][] on all servers, which takes care of metrics like CPU, memory, disk usage, time accuracy, and so on. Then other exporters might be enabled on specific services, like email or web servers.

Access to the internal server is fairly public: the metrics there are not considered security-sensitive and are protected by authentication only to keep bots away.

External server (prometheus2)

The "external" server, on the other hand, is more restrictive and does not allow public access. This is out of concern that specific metrics might lead to timing attacks against the network and/or leak sensitive information. The external server also explicitly does not scrape TPA servers automatically: it only scrapes certain services that are manually configured by TPA.

Those are the services currently monitored by the external server:

  • [bridgestrap][]
  • [rdsys][]
  • OnionPerf external nodes' node_exporter
  • Connectivity test on (some?) bridges (using the [blackbox_exporter][])

Note that this list might become out of sync with the actual implementation, look into Puppet in profile::prometheus::server::external for the actual deployment.

This separate server was actually provisioned for the anti-censorship team (see this comment for background). The server was setup in July 2019 following #31159.

Other possible services to monitor

Many more exporters could be configured. A non-exhaustive list was built in ticket #30028 around launch time. Here we can document more such exporters we find along the way:

  • Prometheus Onion Service Exporter - "Export the status and latency of an onion service"
  • [hsprober][] - similar, but also with histogram buckets, multiple attempts, warm-up and error counts
  • [haproxy_exporter][]

There's also a list of third-party exporters in the Prometheus documentation.

Storage

Prometheus stores data in its own custom "time-series database" (TSDB).

Metrics are held for about a year or less, depending on the server. Look at this dashboard for current disk usage of the Prometheus servers.

The actual disk usage depends on:

  • N: the number of exporters
  • X: the number of metrics they expose
  • 1.3 bytes: the size of a sample
  • P: the retention period (currently 1 year)
  • I: scrape interval (currently one minute)

The formula to compute disk usage is this:

N x X x 1.3 bytes x P / I

For example, in ticket 29388, we compute that a simple node exporter setup with 2500 metrics, with 80 nodes, will end up with about 127GiB of disk usage:

> 1.3byte/minute * year * 2500 * 80 to Gibyte

  (1,3 * (byte / minute)) * year * 2500 * 80 = approx. 127,35799 gibibytes

Back then, we configured Prometheus to keep only 30 days of samples, but that proved to be insufficient for many cases, so it was raised to one year in 2020, in issue 31244.

In the retention section of TPA-RFC-33, there is a detailed discussion on retention periods. We're considering multi-year retention periods for the future.

Queues

There are a couple of places where things happen automatically on a schedule in the monitoring infrastructure:

  • Prometheus schedules scrape jobs (pulling metrics) according to rules that can differ for each scrape job. Each job can define its own scrape_interval. The default is to scrape every 15 seconds, but some jobs are currently configured to scrape once per minute (see the example below this list).
  • Each Prometheus alerting rule can define its own evaluation interval and a for delay before triggering. See Adding alerts
  • Prometheus can automatically discover scrape targets through different means. We currently don't fully use the auto-discovery feature since we create targets through files created by Puppet, so any interval for this feature does not affect our setup.
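To see the effective intervals (including per-job overrides) mentioned in the first item, the loaded configuration can be dumped from the running server; a sketch to run on the Prometheus host:

curl -s http://localhost:9090/api/v1/status/config | jq -r .data.yaml | grep scrape_interval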

Interfaces

This system has multiple interfaces. Let's take them one by one.

Long term trends are visible in the Grafana dashboards, which taps into the Prometheus API to show graphs for history. Documentation on that is in the Grafana wiki page.

Alerting: Karma

The main alerting dashboard is the Karma dashboard, which shows the currently firing alerts, and allows users to silence alerts.

Technically, alerts are generated by the Prometheus server and relayed through the Alertmanager server, then Karma taps into the Alertmanager API to show those alerts. Karma provides those features:

  • Silencing alerts
  • Showing alert inhibitions
  • Aggregate alerts from multiple alert managers
  • Alert groups
  • Alert history
  • Dead man's switch (an alert always firing that signals an error when it stops firing)

Notifications: Alertmanager

We aggressively restrict the kind and number of alerts that will actually send notifications. This was done mainly by creating two different alerting levels ("warning" and "critical", above), and drastically limiting the number of critical alerts.

The basic idea is that the dashboard (Karma) has "everything": alerts (both with "warning" and "critical" levels) show up there, and it's expected that it is "noisy". Operators are expected to look at the dashboard while on rotation for tasks to do. A typical example is pending reboots, but anomalies like high load on a server or a partition to expand in a few weeks are also expected.

All notifications are also sent over the IRC channel (#tor-alerts on OFTC) and logged through the tpa_http_post_dump.service. It is expected that operators look at their emails or the IRC channels regularly and will act upon those notifications promptly.

IRC notifications are handled by the [alertmanager-irc-relay][].

Command-line

Prometheus has a [promtool][] utility that allows you to query the server from the command-line, but there's also an HTTP API that we can use with curl. For example, this shows the hosts with pending upgrades:

curl -sSL --data-urlencode query='apt_upgrades_pending>0' \
  "https://$HTTP_USER@prometheus.torproject.org/api/v1/query" \
  | jq -r '.data.result[].metric.alias' \
  | grep -v '^null$' | paste -sd,

The output can be passed to a tool like Cumin, for example. This is actually used in the fleet.pending-upgrades task to show an inventory of the pending upgrades across the fleet.

Alertmanager also has an amtool command-line tool which can be used to inspect alerts and issue silences. It's used in our test suite.

Authentication

Web-based authentication is shared with Grafana, see the Grafana authentication documentation.

Polling from the Prometheus servers to the exporters on servers is permitted by IP address specifically just for the Prometheus server IPs. Some more sensitive exporters require a secret token to access their metrics.

Implementation

Prometheus and Alertmanager are coded in Go and released under the Apache 2.0 license. We use the versions provided by the Debian package archives in the current stable release.

By design, no other service is required. Emails get sent out for some notifications and that might depend on Tor email servers, depending on which addresses receive the notifications.

Issues

There is no issue tracker specifically for this project; file or search for issues in the team issue tracker with the ~Prometheus label.

Known issues

Those are major issues that are worth knowing about Prometheus in general, and our setup in particular:

A workaround is to shut down the previous host to force Prometheus to check the new one during a rotation, or to reduce the number of keep-alive requests allowed on the server (keepalive_requests on Nginx, MaxKeepAliveRequests on Apache)

See 41902 for further information.

In general, the service is still being launched, see TPA-RFC-33 for the full deployment plan.

Resolved issues

No major issue resolved so far is worth mentioning here.

Maintainers

The Prometheus services have been setup and are managed by anarcat inside TPA.

Users

The internal Prometheus server is mostly used by TPA staff to diagnose issues. The external Prometheus server is used by various TPO teams for their own monitoring needs.

Upstream

The upstream Prometheus projects are diverse and generally active as of early 2021. Since Prometheus is used as an ad-hoc standard in the new "cloud native" communities like Kubernetes, it has seen an upsurge of development and interest from various developers, and companies. The future of Prometheus should therefore be fairly bright.

The individual exporters, however, can be hit and miss. Some exporters are "code dumps" from companies and not very well maintained. For example, Digital Ocean dumped the bind_exporter on GitHub, but it was salvaged by the Prometheus community.

Another important layer is the large amount of Puppet code that is used to deploy Prometheus and its components. This is all part of a big Puppet module, [puppet-prometheus][], managed by the Voxpupuli collective. Our integration with the module is not yet complete: we have a lot of glue code on top of it to correctly make it work with Debian packages. A lot of that integration work has been done by anarcat, but some still remains, see upstream issue 32 for details.

Monitoring and metrics

Prometheus is, of course, all about monitoring and metrics. It is the thing that monitors everything and keeps metrics over the long term.

The server monitors itself for system-level metrics but also application-specific metrics. There's a long-term plan for high-availability in TPA-RFC-33-C.

See also storage for retention policies.

Tests

The prometheus-alerts.git repository has tests that run in GitLab CI, see the Testing alerts section on how to write those.

When doing major upgrades, the Karma dashboard should be visited to make sure it works correctly.

There is a test suite in the upstream Prometheus Puppet module as well, but it's not part of our CI.

Logs

Prometheus servers typically do not generate many logs, except when errors and warnings occur. They should hold very little PII. The web frontends collect logs in accordance with our regular policy.

Actual metrics may contain PII, although it's quite unlikely: typically, data is anonymized and aggregated at collection time. It might still be possible to deduce some activity patterns from the metrics collected by Prometheus and use them for side-channel attacks, which is why access to the external Prometheus server is restricted.

Alerts themselves are retained in the systemd journal, see Checking alert history.
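For example, a rough way to look back at the last week of notifications relayed by that service (the grep pattern is a guess at the payload format, adjust as needed):

journalctl -u tpa_http_post_dump.service --since '7 days ago' | grep -i alertname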

Backups

Prometheus servers should be fully configured through Puppet and require little in terms of backups. The metrics themselves are kept in /var/lib/prometheus2 and should be backed up along with our regular backup procedures.

WAL (write-ahead log) files are ignored by the backups, which can lead to an extra 2-3 hours of data loss since the last backup in the case of a total failure, see #41627 for the discussion. This should eventually be mitigated by a high availability setup (#41643).

Other documentation

Discussion

Overview

The Prometheus and Grafana services were setup after anarcat realized that there was no "trending" service setup inside TPA after Munin had died (ticket 29681). The "node exporter" was deployed on all TPA hosts in mid-march 2019 (ticket 29683) and remaining traces of Munin were removed in early April 2019 (ticket 29682).

Resource requirements were researched in ticket 29388 and it was originally planned to retain 15 days of metrics. This was expanded to one year in November 2019 (ticket 31244) with the hope this could eventually be expanded further with a down-sampling server in the future.

Eventually, a second Prometheus/Grafana server was setup to monitor external resources (ticket 31159) because there were concerns about mixing internal and external monitoring on TPA's side. There were also concerns on the metrics team about exposing those metrics publicly.

It was originally thought Prometheus could completely replace Nagios as well (issue 29864), but this turned out to be more difficult than planned.

The main difficulty is that Nagios checks come with built-in thresholds of acceptable performance, while Prometheus metrics are just that: metrics, without thresholds. This made it more difficult to replace Nagios because a ton of alerts had to be rewritten to replace the existing ones.

This was performed in TPA-RFC-33, over the course of 2024 and 2025.

Security and risk assessment

There has been no security review yet.

The shared password for accessing the web interface is a challenge. We intend to replace this soon with individual users.

No risk assessment has been done yet.

Technical debt and next steps

In progress projects:

  • merging external and internal monitoring servers
  • reimplementing some of the alerts that were in icinga

Proposed Solutions

TPA-RFC-33

TPA's monitoring infrastructure has been originally setup with Nagios and Munin. Nagios was eventually removed from Debian in 2016 and replaced with Icinga 1. Munin somehow "died in a fire" some time before anarcat joined TPA in 2019.

At that point, the lack of trending infrastructure was seen as a serious problem, so Prometheus and Grafana were deployed in 2019 as a stopgap measure.

A secondary Prometheus server (prometheus2) was set up with stronger authentication for service admins. The rationale was that those services were more privacy-sensitive and the primary TPA setup (prometheus1) was too open to the public, which could allow for side-channel attacks.

Those tools have been used for trending ever since, while keeping Icinga for monitoring.

During the March 2021 hack week, Prometheus' Alertmanager was deployed on the secondary Prometheus server to provide alerting to the Metrics and Anti-Censorship teams.

Munin replacement

The primary Prometheus server was decided on in the Brussels 2019 developer meeting, before anarcat joined the team (ticket 29389). The secondary Prometheus server was approved in meeting/2019-04-08. Storage expansion was approved in meeting/2019-11-25.

Other alternatives

We considered retaining Nagios/Icinga as an alerting system, separate from Prometheus, but ultimately decided against it in TPA-RFC-33.

Alerting rules in Puppet

Alerting rules are currently stored in an external [prometheus-alerts.git repository][] that holds not only TPA's alerts, but also those of other teams. So the rules are not directly managed by puppet -- although puppet will ensure that the repository is checked out with the most recent commit on the Prometheus servers.

The rationale is that rule definitions should appear only once and we already had the above-mentioned repository that could be used to configure alerting rules.

We were concerned we would potentially have multiple sources of truth for alerting rules. We already have that for scrape targets, but that doesn't seem to be an issue. It did feel, however, critical for the more important alerting rules to have a single source of truth.

PuppetDB integration

Prometheus 2.31 and later added support for PuppetDB service discovery, through the puppetdb_sd_config parameter. The sample configuration file shows a bit of what's possible.

This approach was considered during the bookworm upgrade but ultimately rejected because it introduces a dependency on PuppetDB, which becomes a possible single point of failure for the monitoring system.

We also have a lot of code in Puppet to handle the exported resources necessary for this, and it would take a lot of work to convert over.

Mobile notifications

Like others, we do not intend to have an on-call rotation yet, and will not ring people on their mobile devices at first. After all exporters have been deployed (priority "C", "nice to have") and alerts properly configured, we will evaluate the number of notifications that get sent out. If levels are acceptable (say, once a month or so), we might implement push notifications during business hours to consenting staff.

We have been advised to avoid Signal notifications as that setup is often brittle, with signal.org frequently changing their API, leading to silent failures. We might implement alerts over Matrix depending on what messaging platform gets standardized in the Tor project.

Migrating from Munin

Here's a quick cheat sheet for people used to Munin and switching to Prometheus:

| What | Munin | Prometheus |
|------|-------|------------|
| Scraper | munin-update | Prometheus |
| Agent | munin-node | Prometheus, node-exporter and others |
| Graphing | munin-graph | Prometheus or Grafana |
| Alerting | munin-limits | Prometheus, Alertmanager |
| Network port | 4949 | 9100 and others |
| Protocol | TCP, text-based | HTTP, text-based |
| Storage format | RRD | Custom time series database |
| Down-sampling | Yes | No |
| Default interval | 5 minutes | 15 seconds |
| Authentication | No | No |
| Federation | No | Yes (can fetch from other servers) |
| High availability | No | Yes (alert-manager gossip protocol) |

Basically, Prometheus is similar to Munin in many ways:

  • It "pulls" metrics from the nodes, although it does it over HTTP (to http://host:9100/metrics) instead of a custom TCP protocol like Munin

  • The agent running on the nodes is called prometheus-node-exporter instead of munin-node. It scrapes only a set of built-in parameters like CPU, disk space and so on; different exporters are necessary for different applications (like prometheus-apache-exporter) and any application can easily implement an exporter by exposing a Prometheus-compatible /metrics endpoint

  • Like Munin, the node exporter doesn't have any form of authentication built-in. We rely on IP-level firewalls to avoid leakage

  • The central server is simply called prometheus and runs as a daemon that wakes up on its own, instead of munin-update, which is called from munin-cron, itself triggered by cron

  • Graphs are generated on the fly through the crude Prometheus web interface or by frontends like Grafana, instead of being constantly regenerated by munin-graph

  • Samples are stored in a custom "time series database" (TSDB) in Prometheus instead of the (ad-hoc) RRD standard

  • Prometheus performs no down-sampling, unlike RRD; it relies on compression to save disk space, but still uses more than Munin

  • Prometheus scrapes samples much more aggressively than Munin by default, but that interval is configurable

  • Prometheus can scale horizontally (by sharding different services to different servers) and vertically (by aggregating different servers to a central one with a different sampling frequency) natively - munin-update and munin-graph can only run on a single (and same) server

  • Prometheus can act as a high availability alerting system thanks to its alertmanager that can run multiple copies in parallel without sending duplicate alerts - munin-limits can only run on a single server
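
To make the pull model concrete, here is a minimal sketch of a static scrape configuration pointing at a node exporter; the target host is hypothetical, since our real targets are generated by Puppet:

```yaml
scrape_configs:
  - job_name: node
    # Prometheus default; Munin's equivalent would be 5 minutes
    scrape_interval: 15s
    static_configs:
      # hypothetical target; our real targets are exported by Puppet
      - targets: ['host.example.org:9100']
```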

Migrating from Nagios/Icinga

Near the end of 2024, Icinga was replaced by Prometheus and Alertmanager, as part of TPA-RFC-33.

The project was split into three phases from A to C.

Before Icinga was retired, we performed an audit of the notifications Icinga sent about our services (#41791) to see whether we were missing coverage of anything critical.

Overall, phase A covered most of the critical alerts we were worried about, but it also left out key components that are not currently covered by monitoring.

In phase B we implemented more alerts, integrated additional metrics needed for some of the new alerts, and did a lot of work to ensure we wouldn't get duplicate alerts for the same problem. Merging the external monitoring server is also planned in this phase.

Phase C covers setting up high availability between two Prometheus servers, each with its own Alertmanager instance, and finalizing the remaining alerts.

Prometheus equivalence for Icinga/Nagios checks

This is an equivalence table between Nagios checks and their equivalent Prometheus metric, for checks that have been explicitly converted into Prometheus alerts and metrics as part of phase A.

| Name | Command | Metric | Severity | Note |
|------|---------|--------|----------|------|
| disk usage - * | check_disk | node_filesystem_avail_bytes | warning / critical | Critical when less than 24h to full |
| network service - nrpe | check_tcp!5666 | up | warning | |
| raid - DRBD | dsa-check-drbd | node_drbd_out_of_sync_bytes, node_drbd_connected | warning | |
| raid - sw raid | dsa-check-raid-sw | node_md_disks / node_md_state | warning | Not warning about array synchronization |
| apt - security updates | dsa-check-statusfile | apt_upgrades_* | warning | Incomplete |
| needrestart | needrestart -p | kernel_status, microcode_status | warning | Required patching upstream |
| network service - sshd | check_ssh --timeout=40 | probe_success | warning | Sanity check, overlaps with systemd check, but better be safe |
| network service - smtp | check_smtp | probe_success | warning | Incomplete, needs end-to-end deliverability checks, scheduled for phase B |
| network service - submission | check_smtp_port!587 | probe_success | warning | |
| network service - smtps | dsa_check_cert!465 | probe_success | warning | |
| network service - http | check_http | probe_http_duration_seconds | warning | See also #40568 for phase B |
| network service - https | check_https | Idem | warning | Idem, see also #41731 for exhaustive coverage of HTTPS sites |
| https cert and smtps | dsa_check_cert | probe_ssl_earliest_cert_expiry | warning | Checks for cert expiry for all sites, this is about "renewal failed" |
| backup - bacula - * | dsa-check-bacula | bacula_job_last_good_backup | warning | Based on WMF's [check_bacula.py][] |
| redis liveness | Custom command | probe_success | warning | Checks that the Redis tunnel works |
| postgresql backups | dsa-check-backuppg | tpa_backuppg_last_check_timestamp_seconds | warning | Built on top of the NRPE check for now, see TPA-RFC-65 for the long term |

Actual alerting rules can be found in the [prometheus-alerts.git repository][].
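
As an example, the "critical when less than 24h to full" behaviour for disk usage is usually expressed with predict_linear(); the following is only an illustrative sketch with made-up thresholds, not a copy of our actual rule:

```yaml
groups:
  - name: disk-example
    rules:
      - alert: DiskWillFillIn24Hours
        # extrapolate the last 6 hours of data: will the filesystem
        # run out of space within the next 24 hours?
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[6h], 24 * 3600) < 0
        for: 1h
        labels:
          severity: critical
```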

High priority missing checks, phase B

Those checks are all scheduled in phase B and are considered high priority, or at least have specific due dates set in issues to make sure we don't miss (for example) the next certificate expiry dates.

| Name | Command | Metric | Severity | Note |
|------|---------|--------|----------|------|
| DNS - DS expiry | dsa-check-statusfile | TBD | warning | Drop DNSSEC? See #41795 |
| Ganeti - cluster | check_ganeti_cluster | [ganeti-exporter][] | warning | Runs a full verify, costly, was already disabled |
| Ganeti - disks | check_ganeti_instances | Idem | warning | Was timing out and already disabled |
| Ganeti - instances | check_ganeti_instances | Idem | warning | Currently noisy: warns about retired hosts waiting for destruction, drop? |
| SSL cert - LE | dsa-check-cert-expire-dir | TBD | warning | Exhaustively check all certs, see #41731, possibly with critical severity for actual prolonged downtimes |
| SSL cert - db.torproject.org | dsa-check-cert-expire | TBD | warning | Checks the local CA for expiry, on disk, /etc/ssl/certs/thishost.pem and db.torproject.org.pem on each host, see #41732 |
| puppet - * catalog run(s) | check_puppetdb_nodes | [puppet-exporter][] | warning | |
| system - all services running | systemctl is-system-running | node_systemd_unit_state | warning | Sanity check, checks for failing timers and services |

Those checks are covered by the priority "B" ticket (#41639), unless otherwise noted.
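
To illustrate the last row of the table above, an alert on node_systemd_unit_state could look roughly like this once implemented; the expression and delays here are hypothetical:

```yaml
groups:
  - name: systemd-example
    rules:
      - alert: SystemdUnitFailed
        # requires the node exporter's systemd collector to be enabled
        expr: node_systemd_unit_state{state="failed"} == 1
        for: 30m
        labels:
          severity: warning
```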

Low priority missing checks, phase B

Unless otherwise mentioned, most of those checks are noisy and generally do not indicate an actual failure, so they were not considered priorities at all.

| Name | Command | Metric | Severity | Note |
|------|---------|--------|----------|------|
| DNS - delegation and signature expiry | dsa-check-zone-rrsig-expiration-many | [dnssec-exporter][] | warning | |
| DNS - key coverage | dsa-check-statusfile | TBD | warning | |
| DNS - security delegations | dsa-check-dnssec-delegation | TBD | warning | |
| DNS - zones signed properly | dsa-check-zone-signature-all | TBD | warning | |
| DNS SOA sync - * | dsa_check_soas_add | TBD | warning | Never actually failed |
| PING | check_ping | probe_success | warning | |
| load | check_load | node_pressure_cpu_waiting_seconds_total | warning | Sanity check, replace with the better pressure counters |
| mirror (static) sync - * | dsa_check_staticsync | TBD | warning | Never actually failed |
| network service - ntp peer | check_ntp_peer | node_ntp_offset_seconds | warning | |
| network service - ntp time | check_ntp_time | TBD | warning | Unclear how that differs from check_ntp_peer |
| setup - ud-ldap freshness | dsa-check-udldap-freshness | TBD | warning | |
| swap usage - * | check_swap | node_memory_SwapFree_bytes | warning | |
| system - filesystem check | dsa-check-filesystems | TBD | warning | |
| unbound trust anchors | dsa-check-unbound-anchors | TBD | warning | |
| uptime check | dsa-check-uptime | node_boot_time_seconds | warning | |

Those are also covered by the priority "B" ticket (#41639), unless otherwise noted. In particular, all DNS issues are covered by issue #41794.
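
As a sketch of the pressure-counter idea mentioned in the "load" row above, an alert based on the node exporter's pressure metrics might look like this; the threshold is made up for illustration:

```yaml
groups:
  - name: pressure-example
    rules:
      - alert: HighCPUPressure
        # fraction of time tasks were stalled waiting for CPU, over 5 minutes
        expr: rate(node_pressure_cpu_waiting_seconds_total[5m]) > 0.8
        for: 30m
        labels:
          severity: warning
```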

Retired checks

| Name | Command | Rationale |
|------|---------|-----------|
| users | check_users | Who has logged-in users?? |
| processes - zombies | check_procs -s Z | Useless |
| processes - total | check_procs 620 700 | Too noisy, needed exclusions for builders |
| processes - * | check_procs $foo | Better to check systemd |
| unwanted processes - * | check_procs $foo | Basically the opposite of the above, useless |
| LE - chain | Checks for flag file | See #40052 |
| CPU - intel ucode | dsa-check-ucode-intel | Overlaps with needrestart check |
| unexpected sw raid | Checks for /proc/mdstat | Needlessly noisy, just means an extra module is loaded, who cares |
| unwanted network service - * | dsa_check_port_closed | Needlessly noisy, if we really want this, use [lzr][] |
| network - v6 gw | dsa-check-ipv6-default-gw | Useless, see #41714 for analysis |

check_procs, in particular, was generating a lot of noise in Icinga: we were checking dozens of different processes, which would all fire alerts at once when a host went down without Icinga noticing that the host itself was down.

Service admin checks

The following checks were not audited by TPA but checked by the respective team's service admins.

| Check | Team |
|-------|------|
| bridges.tpo web service | Anti-censorship |
| "mail queue" | Anti-censorship |
| tor_check_collector | Network health |
| tor-check-onionoo | Network health |

Other Alertmanager receivers

Alerts are typically sent over email, but Alertmanager also has built-in support for several other receivers; see the upstream Alertmanager documentation for the full list.

There's also a generic webhook receiver, which is typically used to send notifications; many other endpoints are implemented through that webhook.

And that is only what was available at the time of writing; the [alertmanager-webhook][] and [alertmanager tags][] on GitHub might list more.
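
As a sketch, wiring a generic webhook receiver into the Alertmanager configuration looks roughly like this; the receiver name and URL are made up:

```yaml
route:
  receiver: example-webhook
receivers:
  - name: example-webhook
    webhook_configs:
      # hypothetical endpoint; Alertmanager POSTs a JSON payload here
      - url: http://localhost:8080/alert
        send_resolved: true
```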

The Alertmanager web interface is not shipped with the Debian package because it depends on the Elm compiler, which is not in Debian. It can be built by hand using the debian/generate-ui.sh script, but only in newer, post-buster versions. Another alternative to consider is Crochet.