Prometheus is our monitoring and trending system. It collects metrics from all TPA-managed hosts and external services, and sends alerts when out-of-bound conditions occur.
Prometheus also supports basic graphing capabilities although those are limited enough that we use a separate graphing layer on top (see Grafana).
This page also documents auxiliary services connected to Prometheus like the Karma alerting dashboard and IRC bots.
[[TOC]]
Tutorial
If you're just getting started with Prometheus, you might want to follow the training course or see the web dashboards section.
Training course plan
- Where can I find documentation? In the wiki: the Prometheus service page (this page), but also the Grafana service page
- Where do I reach the different web sites for the monitoring service? See the web dashboards section
- Where do I watch for alerts? Join the #tor-alerts IRC channel! See also how to access alerting history
- How can we use silences to prevent some alerts from firing? See Silencing an alert in advance and following
- Architecture overview
- Alerting philosophy
- Where are we with TPA-RFC-33? Show the various milestones:
- %"TPA-RFC-33-A: emergency Icinga retirement"
- %"TPA-RFC-33-B: Prometheus server merge, more exporters"
- %"TPA-RFC-33-C: Prometheus high availability, long term metrics, other exporters"
- If time permits...
- PromQL primer
- (last time we did this training, we crossed the 1h mark here)
- Adding metrics
- Adding alerts
- Alert debugging:
- Alert unit tests
- Alert routing tests
- Ensuring the tags required for routing are there
- Link to prom graphs from prom's alert page
Web dashboards
The main Prometheus web interface is available at:
https://prometheus.torproject.org
It's protected by the same "web password" as Grafana, see the basic authentication in Grafana for more information.
A simple query you can try is to pick any metric in the list and click
Execute. For example, this link will show the 5-minute load
over the last two weeks for the known servers.
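If that link is unavailable, a query along these lines (assuming the node exporter's standard `node_load5` metric) shows the same thing once the graph range is set to two weeks:

```
node_load5{job="node"}
```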
The Prometheus web interface is crude: it's better to use Grafana dashboards for most purposes other than debugging.
It also shows alerts, but for that, there are better dashboards, see below.
Note that the "classic" dashboard has been deprecated upstream and, starting from Debian 13, has been failing at some tasks. We're slowly replacing it with Grafana and Fabric scripts, see tpo/tpa/team#41790 for progress.
For general queries, in particular, use the
prometheus.query-to-series task, for example:
fab prometheus.query-to-series --expression 'up!=1'
... will show jobs that are "down".
Alerting dashboards
There are a couple of web interfaces to see alerts in our setup:
- Karma dashboard - our primary view on currently firing alerts. The alerts are grouped by labels.
- This web interface only shows what's current, not some form of alert history.
- Shows links to "run books" related to alerts
- Useful view: `@state!=suppressed` hides silenced alerts from the dashboard by default.
- Grafana availability dashboard - drills down into alerts and, more importantly, shows their past values.
- Prometheus' Alerts dashboard - shows all alerting rules and which file they are from
- Also contains links to graphs based on alerts' PromQL expressions
Normally, all rules are defined in the [prometheus-alerts.git
repository][]. Another view of this is the rules configuration
dump which also shows when the rule was last evaluated and how long
it took.
Each alert should have a URL to a "run book" in its annotations, typically a link to this very wiki, in the "Pager playbook" section, which shows how to handle any particular outage. If it's not present, it's a bug and can be filed as such.
Silencing alerts
With Alertmanager, you can stop alerts from sending notifications by creating a "silence". A silence is an expression that matches alerts by labels and other values, with a start and an end time. Silences can have an optional author name and description, and we strongly recommend setting both so that others can refer to you if they have questions.
The main method for managing silences is via the Karma dashboard. You can also manage them on the command line via fabric.
Silencing an alert in advance
Say you are planning some service maintenance and expect an alert to trigger, but you don't want things to be screaming everywhere.
For this, you want to create a "silence", which technically resides in the Alertmanager, but we manage them through the Karma dashboard.
Here is how to set an alert to silence notifications in the future:
1. Head for the Karma dashboard
2. Click on the "bell" on the top right
3. Enter a label name and value matching the expected alert: typically you would pick `alertname` as a key and the alert name as the value (e.g. `JobDown` for a reboot). You will also likely want to select an `alias` to match a specific host.
4. Pick the duration: this can be done through a duration (e.g. one hour is the default) or a start and end time
5. Enter your name
6. Enter a comment describing why this silence is there, preferably pointing at an issue describing the work
7. Click `Preview`
8. It will likely say "No alerts matched"; ignore that and click `Submit`
When submitting a silence, Karma is quite terse: it only shows a green checkbox and a UUID, which is the unique identifier for this silence, as a link to the Alertmanager. Don't click that link: it doesn't work, and anyway everything we do with silences can be done in Karma.
Silencing active alerts
Silencing active alerts is slightly easier than planning one in advance. You can just:
- Head for the Karma dashboard
- Click on the "hamburger menu"
- Select "Silence this group"
- Change the comment to link to the incident or who's working on this
- Click `Preview`
- It will show which alerts are affected; click `Submit`
As when creating a silence in advance, Karma is quite terse on submission: it only shows a green checkbox and the silence's UUID as a link to the Alertmanager. Don't click that link, as it doesn't work, and everything we do with silences can be done in Karma.
Note that you can replace steps 2 and 3 above with a series of manipulations to get a filter in the top bar that corresponds to what you want to silence (for example clicking on a label in alerts, or manually entering new filtering criteria) and then clicking on the bell icon at the top, just right of the filter bar. This method can help you create a silence for more than just one alert at a time.
Adding and updating silences with fabric
You can use Fabric to manage silences from the command line or via scripts. This is mostly useful for automatically adding a silence from some other, higher-level tasks. But you can use the fabric task either directly or in other scripts if you'd like.
Here's an example for adding a new silence for all backup alerts for the host idle-dal-02.torproject.org with author "wario" and a comment:
fab silence.create --comment="machine waiting for first backup" \
--matchers job=bacula --matchers alias=idle-dal-02.torproject.org \
--ends-at "in 5 days" --created-by "wario"
The author is optional and defaults to the local username. Make sure
you have a valid user set in your configuration and to set a correct
--comment so that others can understand the goal of the silence and
can refer to you for questions. The user comes from the
getpass.getuser Python function, see that documentation on how
to override defaults from the environment.
The matchers option can be specified multiple times. All matchers must match for the silence to apply to an alert (they are combined with a boolean "and").
The --starts-at option is not specified in the example above and
that implies that the silence starts from "now". You can use
--starts-at for example for planning a silence that will only take
effect at the start of a planned maintenance window in the future.
The --starts-at and --ends-at options both accept either ISO 8601
formatted dates or textual dates accepted by the dateparser
Python module.
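For example, here is a hedged sketch of planning a silence for a future maintenance window (the host, dates and comment are hypothetical placeholders):

```
fab silence.create --comment="planned reboot of idle-dal-02" \
    --matchers alias=idle-dal-02.torproject.org \
    --starts-at "2030-01-15T20:00:00Z" --ends-at "2030-01-15T22:00:00Z" \
    --created-by "wario"
```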
Finally, if you want to update a silence, the command is slightly different but
the arguments are the same, except for one additional option, --silence-id, which
specifies the ID of the silence that needs to be modified:
fab silence.update --silence-id=9732308d-3390-433e-84c9-7f2f0b2fe8fa \
--comment="machine waiting for first backup - tpa/tpa/team#12345678" \
--matchers job=bacula --matchers alias=idle-dal-02.torproject.org \
--ends-at "in 7 days" --created-by "wario"
Adding metrics to applications
If you want your service to be monitored by Prometheus, you need to reuse an existing exporter or write your own. Writing an exporter is more involved, but still fairly easy, and might be necessary if you are the maintainer of an application not already instrumented for Prometheus.
The actual documentation is fairly good, but basically: a
Prometheus exporter is a simple HTTP server which responds to a
specific HTTP URL (/metrics, by convention, but it can be
anything). It responds with a key/value list of entries, one on each
line, in a simple text format more or less following the
OpenMetrics standard.
Each "key" is a simple string with an arbitrary list of "labels" enclosed in curly braces. The value is a float or integer.
For example, here's how the "node exporter" exports CPU usage:
# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 948736.11
node_cpu_seconds_total{cpu="0",mode="iowait"} 1659.94
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 516.23
node_cpu_seconds_total{cpu="0",mode="softirq"} 16491.47
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 35893.84
node_cpu_seconds_total{cpu="0",mode="user"} 67711.74
Note that the HELP and TYPE lines look like comments, but they are
actually important, and misusing them will lead to the metric being
ignored by Prometheus.
Also note that Prometheus's actual support for OpenMetrics varies across the ecosystem. It's better to rely on Prometheus' documentation than OpenMetrics when writing metrics for Prometheus.
You don't necessarily have to write all that logic yourself, however: there are client libraries (see the Golang guide, Python demo or C documentation for examples) that do most of the job for you.
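For instance, here is a minimal sketch using the Python client library (prometheus_client); the metric name, port and loop are placeholders, not an actual TPA exporter:

```
# minimal exporter sketch with the prometheus_client library
from prometheus_client import start_http_server, Counter
import time

REQUESTS = Counter('myapp_requests_total', 'Total requests handled by myapp')

if __name__ == '__main__':
    start_http_server(9090)   # serves the /metrics endpoint on port 9090
    while True:
        REQUESTS.inc()        # in a real application, instrument actual events
        time.sleep(1)
```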
In any case, you should be careful about the names and labels of the metrics. See the metric and label naming best practices.
Once you have an exporter endpoint (say at
http://example.com:9090/metrics), make sure it works:
curl http://example.com:9090/metrics
This should return a number of metrics that change (or not) at each call. Note that there's a registry of official Prometheus exporter port numbers that should be respected, but it's full (oops).
From there on, provide that endpoint to the sysadmins (or someone with access to the external monitoring server), who will follow the procedure below to add the metric to Prometheus.
Once the exporter is hooked into Prometheus, you can browse the
metrics directly at: https://prometheus.torproject.org. Graphs
should be available at https://grafana.torproject.org, although
those need to be created and committed into git by sysadmins to
persist, see the [grafana-dashboards.git repository][] for more
information.
Adding scrape targets
"Scrape targets" are remote endpoints that Prometheus "scrapes" (or fetches content from) to get metrics.
There are two ways of adding metrics, depending on whether or not you have access to the Puppet server.
Adding metrics through the git repository
People outside of TPA without access to the Puppet server can
contribute targets through a repo called
[prometheus-alerts.git][]. To add a scrape target:
- Clone the repository, if not done already:

      git clone https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/
      cd prometheus-alerts

- Assuming you're adding a node exporter, add the target:

      cat > targets.d/node_myproject.yaml <<EOF
      # scrape the external node exporters for project Foo
      ---
      - targets:
          - targetone.example.com
          - targettwo.example.com
      EOF

- Add, commit, and push:

      git checkout -b myproject
      git add targets.d
      git commit -m "add node exporter targets for my project"
      git push origin -u myproject
The last push command should show you the URL where you can submit your merge request.
After being merged, the changes should propagate within 4 to 6 hours. Prometheus automatically reloads those rules when they are deployed.
See also the [targets.d documentation in the git repository][].
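Once deployed, TPA members with shell access to the Prometheus server can verify that the new targets were picked up with something like this (run on the server itself):

```
curl -s localhost:9090/api/v1/targets | jq -r '.data.activeTargets[].labels.instance' | sort -u
```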
Adding metrics through Puppet
TPA-managed services should define their scrape jobs, and thus targets, via puppet profiles.
To add a scrape job in a puppet profile, you can use the
prometheus::scrape_job defined type, or one of the defined types which are
convenience wrappers around that.
Here is, for example, how the GitLab runners are scraped:
# tell Prometheus to scrape the exporter
@@prometheus::scrape_job { "gitlab-runner_${facts['networking']['fqdn']}_9252":
job_name => 'gitlab_runner',
targets => [ "${facts['networking']['fqdn']}:9252" ],
labels => {
'alias' => $facts['networking']['fqdn'],
'team' => 'TPA',
},
}
The job_name (gitlab_runner above) needs to be added to the
profile::prometheus::server::internal::collect_scrape_jobs list in
hiera/common/prometheus.yaml, for example:
profile::prometheus::server::internal::collect_scrape_jobs:
# [...]
- job_name: 'gitlab_runner'
# [...]
Note that you will likely need a firewall rule to poke a hole for the exporter:
# grant Prometheus access to the exporter, activated with the
# listen_address parameter above
Ferm::Rule <<| tag == 'profile::prometheus::server-gitlab-runner-exporter' |>>
That rule, in turn, is defined with the
profile::prometheus::server::rule define, in
profile::prometheus::server::internal, like so:
profile::prometheus::server::rule {
# [...]
'gitlab-runner': port => 9252;
# [...]
}
Targets for scrape jobs defined in Hiera are however not managed by
puppet. They are defined through files in the [prometheus-alerts.git
repository][]. See the section below for more details on how things
are maintained there. In the above example, we can see that targets
are obtained via files on disk. The [prometheus-alerts.git
repository][] is cloned in /etc/prometheus-alerts on the Prometheus
servers.
Note: we currently have a handful of blackbox_exporter-related targets for TPA
services, namely for the HTTP checks. We intend to move those into puppet
profiles whenever possible.
Manually adding targets in Puppet
Normally, services configured in Puppet SHOULD automatically be
scraped by Prometheus (see above). If, however, you need to manually
configure a service, you may define extra jobs in the
$scrape_configs array, in the
profile::prometheus::server::internal Puppet class.
For example, because the GitLab setup is not fully managed by Puppet
(e.g. [gitlab#20][], but other similar issues remain), we
cannot use this automatic setup, so manual scrape targets are defined
like this:
$scrape_configs =
[
{
'job_name' => 'gitaly',
'static_configs' => [
{
'targets' => [
'gitlab-02.torproject.org:9236',
],
'labels' => {
'alias' => 'Gitaly-Exporter',
},
},
],
},
[...]
]
But ideally those would be configured with automatic targets, below.
Metrics for the internal server are scraped automatically if the
exporter is configured by the [puppet-prometheus][] module. This is
done almost automatically, apart from the need to open a firewall port
in our configuration.
Take the apache_exporter as an example: in
profile::prometheus::apache_exporter, we include the
prometheus::apache_exporter class from the upstream Puppet module,
then open the exporter's port to the Prometheus server with:
Ferm::Rule <<| tag == 'profile::prometheus::server-apache-exporter' |>>
Those rules are declared on the server, in profile::prometheus::server::internal.
Adding a blackbox target
Most exporters are pretty straightforward: a service binds to a port and exposes
metrics through HTTP requests on that port, generally on the /metrics URL.
The blackbox exporter is a special case for exporters: it is scraped by Prometheus via multiple scrape jobs and each scrape job has targets defined.
Each scrape job represents one type of check (e.g. TCP connections, HTTP requests, ICMP ping, etc) that the blackbox exporter is launching and each target is a host or URL or other "address" that the exporter will try to reach. The check will be initiated from the host running the blackbox exporter to the target at the moment the Prometheus server is scraping the exporter.
The blackbox exporter is rather peculiar and counter-intuitive, see the how to debug the blackbox exporter for more information.
Scrape jobs
From Prometheus's point of view, two pieces of information are needed:
- The address and port of the host where Prometheus can reach the blackbox exporter
- The target (and possibly the port tested) that the exporter will try to reach
Prometheus transfers the information above to the exporter via two labels:
- `__address__` is used to determine how Prometheus can reach the exporter. This is standard, but because of how we create the blackbox targets, it will initially contain the address of the blackbox target instead of the exporter's. So we need to shuffle label values around in order for the `__address__` label to contain the correct value.
- `__param_target` is used by the blackbox exporter to determine what it should contact when running its test, i.e. what the target of the check is. So that's the address (and port) of the blackbox target.
The reshuffling of labels mentioned above is achieved with the relabel_configs
option for the scrape job.
For TPA-managed services, we define these scrape jobs in Hiera, in
hiera/common/prometheus.yaml, under keys named collect_scrape_jobs. Jobs in those
keys expect targets to be exported by other parts of the Puppet code.
For example, here's how the ssh scrape job is configured:
- job_name: 'blackbox_ssh_banner'
metrics_path: '/probe'
params:
module:
- 'ssh_banner'
relabel_configs:
- source_labels:
- '__address__'
target_label: '__param_target'
- source_labels:
- '__param_target'
target_label: 'instance'
- target_label: '__address__'
replacement: 'localhost:9115'
Scrape jobs for non-TPA services are defined in Hiera under keys named
scrape_configs in hiera/common/prometheus.yaml. Jobs in those keys expect to
find their targets in files on the Prometheus server, through the
prometheus-alerts repository. Here's one example of such a scrape job
definition:
profile::prometheus::server::external::scrape_configs:
# generic blackbox exporters from any team
- job_name: blackbox
metrics_path: "/probe"
params:
module:
- http_2xx
file_sd_configs:
- files:
- "/etc/prometheus-alerts/targets.d/blackbox_*.yaml"
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: localhost:9115
In both of the examples, the relabel_configs starts by copying the target's
address into the __param_target label. It also populates the instance label
with the same value since that label is used in alerts and graphs to display
information. Finally, the __address__ label is overridden with the address
where Prometheus can reach the exporter.
Known pitfalls with blackbox scrape jobs
Some checks performed with the blackbox exporter have pitfalls: cases where the monitoring is not doing what you'd expect and thus not collecting the information required for proper monitoring. Here is a list of known issues you should look out for:
- With the http module, letting it follow redirections simplifies some checks. However, this has the potential side effect that the SSL certificate metrics for that check do not describe the certificate of the target's domain name, but rather the certificate of the domain last visited (after following redirections). So certificate expiration alerts will not be alerting about the right thing!
Targets
TPA-managed services use puppet exported resources in the appropriate profiles.
The targets parameter is used to convey information about the blackbox
exporter target (the host being tested by the exporter).
For example, this is how the ssh scrape jobs (in
modules/profile/manifests/ssh.pp) are created:
@@prometheus::scrape_job { "blackbox_ssh_banner_${facts['networking']['fqdn']}":
job_name => 'blackbox_ssh_banner',
targets => [ "${facts['networking']['fqdn']}:22" ],
labels => {
'alias' => $facts['networking']['fqdn'],
'team' => 'TPA',
},
}
For non-TPA services, the targets need to be defined in the prometheus-alerts
repository.
The targets defined this way for blackbox exporter look exactly like normal
Prometheus targets, except that they define what the blackbox exporter will try
to reach. The targets can be hostname:port pairs or URLs, depending on the
nature of the type of check being defined.
See the documentation for targets in the repository for more details.
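For illustration only (hypothetical file name and hosts; the repository documentation is authoritative), a blackbox HTTP targets file could look like this:

```
# targets.d/blackbox_myproject.yaml
---
- targets:
    - www.example.org     # scheme omitted: the scrape job's module decides it
    - blog.example.org
```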
PromQL primer
The upstream documentation on PromQL can be a little daunting, so we provide you with a few examples from our infrastructure.
A query, fundamentally, asks the Prometheus server to query its database for a given metric. For example, this simple query will return the status of all exporters, with a value of 0 (down) or 1 (up):
up
You can use labels to select a subset of those, for example this will
only check the [node_exporter][]:
up{job="node"}
You can also match the metric against a value, for example this will list all exporters that are unavailable:
up{job="node"}==0
The up metric is not very interesting because it doesn't change
often. It's tremendously useful for availability of course, but
typically we use more complex queries.
This, for example, is the number of accesses on the Apache web server,
according to the [apache_exporter][]:
apache_accesses_total
In itself, however, that metric is not that useful because it's a
constantly incrementing counter. What we want is actually the rate
of that counter, for which there is of course a function, rate(). We
need to apply that to a vector, however, a series of samples
for the above metric, over a given time period, or a time
series. This, for example, will give us the access rate over 5
minutes:
rate(apache_accesses_total[5m])
That will give us a lot of results though, one per web server. We might want to regroup those, for example, so we would do something like:
sum(rate(apache_accesses_total[5m])) by (classes)
Which would show you the access rate by "classes" (which is our poorly-named "role" label).
Another similar example is this query, which will give us the number of bytes incoming or outgoing, per second, in the last 5 minutes, across the infrastructure:
sum(rate(node_network_transmit_bytes_total[5m]))
sum(rate(node_network_receive_bytes_total[5m]))
Finally, you should know about the difference between rate and
increase. The rate() is always "per second", and can be a little
hard to read if you're trying to figure out things like "how many hits
did we have in the last month", or "how much data did we actually
transfer yesterday". For that, you need increase() which will
actually count the changes in the time period. So for example, to
answer those two questions, this is the number of hits in the last
month:
sum(increase(apache_accesses_total[30d])) by (classes)
And the data transferred in the last 24h:
sum(increase(node_network_transmit_bytes_total[24h]))
sum(increase(node_network_receive_bytes_total[24h]))
For more complex examples of queries, see the queries cheat sheet,
the [prometheus-alerts.git repository][], and the
[grafana-dashboards.git repository][].
Writing an alert
Now that you have metrics in your application and those are scraped by Prometheus, you will likely want to alert on some of those metrics. Be careful to write alerts that are not too noisy, and to alert on user-visible symptoms, not on underlying technical issues you think might affect users; see our Alerting philosophy for a discussion of that.
An alerting rule is a simple YAML file that consists mainly of:
- A name (say `JobDown`).
- A Prometheus query, or "expression" (say `up != 1`).
- Extra labels and annotations.
Expressions
The most important part of the alert is the expr field, which is a
Prometheus query that should evaluate to "true" (non-zero) for the
alert to fire.
Here is, for example, the first alert in the [rules.d/tpa_node.rules
file][]:
- alert: JobDown
expr: up < 1
for: 15m
labels:
severity: warning
annotations:
summary: 'Exporter job {{ $labels.job }} on {{ $labels.instance }} is down'
description: 'Exporter job {{ $labels.job }} on {{ $labels.instance }} has been unreachable for more than 15 minutes.'
playbook: "https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/prometheus/#exporter-job-down-warnings"
In the above, Prometheus will generate an alert if the metric up is
not equal to 1 for more than 15 minutes, hence up < 1.
See the PromQL primer for more information about queries and the queries cheat sheet for more examples.
Duration
The for field means the alert is not passed down to the Alertmanager
until the expression has been true for that duration. It is useful to
avoid flapping and temporary conditions.
Here are some typical for delays we use, as a rule of thumb:
- `0s`: checks that already have a built-in time threshold in their expression (see below), or critical conditions requiring immediate action and immediate notification (default). Examples: `AptUpdateLagging` (checks for `apt update` not running for more than 24h), `RAIDDegraded` (a failed disk won't come back on its own in 15m)
- `15m`: availability checks, designed to ignore transient errors. Examples: `JobDown`, `DiskFull`
- `1h`: consistency checks, things an operator might have deployed incorrectly but that could recover on their own. Examples: `OutdatedLibraries`, as `needrestart` might recover at the end of the upgrade job, which could take more than 15m
- `1d`: daily consistency checks. Examples: `PackagesPendingTooLong` (upgrades are supposed to run daily)
Try to align with these, but don't obsess over them. If an alert is better suited
to a for delay that differs from the above, simply add a comment to the alert
explaining why that period is used.
Grouping
At this point, Prometheus effectively generates a message that it
passes along to the Alertmanager, with the annotations and the labels
defined in the alerting rule (severity="warning"). It also passes
along all other labels attached to the up metric, which is important,
as the query can modify which labels are visible. For example, the up
metric typically looks like this:
up{alias="test-01.torproject.org",classes="role::ldapdb",instance="test-01.torproject.org:9100",job="node",team="TPA"} 1
Also note that this single expression will generate multiple alerts for multiple matches. For example, if two hosts are down, the metric would look like this:
up{alias="test-01.torproject.org",classes="role::ldapdb",instance="test-01.torproject.org:9100",job="node",team="TPA"} 0
up{alias="test-02.torproject.org",classes="role::ldapdb",instance="test-02.torproject.org:9100",job="node",team="TPA"} 0
This will generate two alerts. This matters, because it can create a lot of noise and confusion on the other end. A good way to deal with this is to use aggregation operators. For example, here is the DRBD alerting rule, which often fires for multiple disks at once because we're mass-migrating instances in Ganeti:
- alert: DRBDDegraded
expr: count(node_drbd_disk_state_is_up_to_date != 1) by (job, instance, alias, team)
for: 1h
labels:
severity: warning
annotations:
summary: "DRBD has {{ $value }} out of date disks on {{ $labels.alias }}"
description: "Found {{ $value }} disks that are out of date on {{ $labels.alias }}."
playbook: "https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/drbd#resyncing-disks"
The expression, here, is:
count(node_drbd_disk_state_is_up_to_date != 1) by (job, instance, alias, team)
This matters because otherwise this would create a lot of alerts,
one per disk! For example, on fsn-node-01, there are 52 drives:
count(node_drbd_disk_state_is_up_to_date{alias=~"fsn-node-01.*"}) == 52
So we use the count() function to count the number of drives per
machine. Technically, we count by (job, instance, alias, team), but
typically those 4 labels will be the same for each alert. We still
have to specify all of them because otherwise they get dropped by
the aggregation function.
Note that the Alertmanager does its own grouping as well, see the
group_by setting.
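For reference, grouping in the Alertmanager is configured on routes, along these lines (a sketch, not necessarily our actual configuration):

```
route:
  # alerts sharing these label values are batched into a single notification
  group_by: ['alertname', 'team']
  group_wait: 30s      # wait for more alerts before sending the first notification
  group_interval: 5m   # wait before notifying about new alerts added to the group
```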
Labels
As mentioned above, labels typically come from the metrics used in the alerting rule itself. It's the job of the exporter and the Prometheus configuration to attach the necessary labels to the metrics so that the Alertmanager can function properly. We expect the following labels to be produced by the exporter, the Prometheus scrape configuration, or the alerting rule:
| Label | syntax | normal example | backup example | blackbox example |
|---|---|---|---|---|
| `job` | name of the job | `node` | `bacula` | `blackbox_https_2xx_or_3xx` |
| `team` | name of the team | `TPA` | `TPA` | `TPA` |
| `severity` | `warning` or `critical` | `warning` | `warning` | `warning` |
| `instance` | host:port | `web-fsn-01.torproject.org:9100` | `bacula-director-01.torproject.org:9133` | `localhost:9115` |
| `alias` | host | `web-fsn-01.torproject.org` | `web-fsn-01.torproject.org` | `web-fsn-01.torproject.org` |
| `target` | target used by blackbox | not produced | not produced | `www.torproject.org` |
Some notes about the lines of the table above:
- `team`: which group to contact for this alert, which affects how alerts get routed. See the List of team names.
- `severity`: affects alert routing. Use `warning` unless the alert absolutely needs immediate attention. TPA-RFC-33 defines the alert levels as:
  - `warning` (new): non-urgent condition, requiring investigation and fixing, but not immediately, with no user-visible impact; example: a server needs to be rebooted
  - `critical`: serious condition with disruptive user-visible impact which requires prompt response; example: the donation site returns 500 errors
- `instance`: host name and port that Prometheus used for scraping.
For example, for the node exporter it is port 9100 on the monitored host, but for other exporters, it might be another host running the exporter.
Another example, for the blackbox exporter, it is port
9115 on the blackbox exporter (localhost by default, but there's a
blackbox exporter running to monitor the Redis tunnel on the donate
service).
For backups, the exporter is running on the Bacula director, so the
instance is bacula-director-01.torproject.org:9133, where the
bacula exporter runs.
- `alias`: FQDN of the host concerned by the scraped metrics.
For example, for a blackbox check, this would be the host that serves an HTTPS website we're getting information about. For backups, this would be the FQDN of the machine that is getting backed up.
This is not the same as "instance without a port", as this
does not point to the exporter.
- `target`: in the case of a blackbox alert, the actual target being checked. This can be, for example, the full URL, or the SMTP host name and port, etc.
Note that for URLs, we rely on the blackbox module to determine the
scheme that's used for HTTP/HTTPS checks, so we set the target
without the scheme prefix (e.g. no https:// prefix). This lets us
link HTTPS alerts to HTTP ones in alert inhibitions.
Annotations
Annotations are another field that's part of the alert generated by
Prometheus. They are used to generate messages for users, depending
on the Alertmanager routing. The summary field ends up in the Subject
field of outgoing email, and the description is the email body, for
example.
Those fields are Golang templates with variables accessible with
curly braces. For example, {{ $value }} is the actual value of the
metric in the expr query. The list of available variables is
somewhat obscure, but some of it is visible in the Prometheus
template reference and the Alertmanager template reference. The
Golang template system also comes with its own limited set of
built-in functions.
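For example, here is a hedged sketch of annotations using template functions (the labels and the meaning of the value are hypothetical):

```
annotations:
  # $value comes from the expr query; humanizePercentage formats a 0-1 ratio
  summary: "Disk on {{ $labels.alias }} is {{ $value | humanizePercentage }} full"
  description: "Filesystem {{ $labels.mountpoint }} on {{ $labels.alias }} is {{ $value | humanizePercentage }} full."
```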
Writing a playbook
Every alert in Prometheus must have a playbook annotation. This is
(if done well) a URL pointing at a service page like this one,
typically in its Pager playbook section, that explains how to deal
with the alert.
The playbook must include these things:
- The actual code name of the alert (e.g. `JobDown` or `DiskWillFillSoon`).
- An example of the alert output (e.g. `Exporter job gitlab_runner on tb-build-02.torproject.org:9252 is down`).
- Why this alert triggered, and what its impact is.
- Optionally, how to reproduce the issue.
- How to fix it.
How to reproduce the issue is optional, but important. Think of yourself in the future, tired and panicking because things are broken:
- Where do you think the error will be visible?
- Can we `curl` something to see it happening?
- Is there a dashboard where you can see trends?
- Is there a specific Prometheus query to run live?
- Which log file can we inspect?
- Which systemd service is running it?
The "how to fix it" can be a simple one line, or it can go into a multiple case example of scenarios that were found in the wild. It's the hard part: sometimes, when you make an alert, you don't actually know how to handle the situation. If so, explicitly state that problem in the playbook, and say you're sorry, and that it should be fixed.
If the playbook becomes too complicated, consider making a Fabric script out of it.
A good example of a proper playbook is the text file collector errors playbook here. It has all the above points, including actual fixes for different actual scenarios.
Here's a template to get started:
### Foo errors
The `FooDegraded` alert looks like this:
Service Foo has too many errors on test.torproject.org
It means that the service Foo is having some kind of trouble. [Explain
why this happened, and what the impact is, what means for which
users. Are we losing money, data, exposing users, etc.]
[Optional] You can tell this is a real issue by going to place X and
trying Y.
[Ideal] To fix this issue, [inverse the polarity of the shift inverter
in service Foo].
[Optional] We do not yet exactly know how to fix issue, sorry. Please
document here how you fix this next time.
Alerting rule template
Here is an alert template that has most fields you should be using in your alerts.
- alert: FooDegraded
expr: sum(foo_error_count) by (job, instance, alias, team)
for: 1h
labels:
severity: warning
annotations:
summary: "Service Foo has too many errors on {{ $labels.alias }}"
description: "Found {{ $value }} errors in service Foo on {{ $labels.alias }}."
playbook: "https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/foo#too-many-errors"
Adding alerting rules to Prometheus
Now that you have an alert, you need to deploy it. The Prometheus
servers regularly pull the [prometheus-alerts.git repository][] for
alerting rule and target definitions. Alert rules can be added by
committing a file in the rules.d directory of that repository; see the
[rules.d][] directory for more documentation on that.
Note the header at the top of .rules files, which we didn't include in
the tpa_node.rules sample above:
groups:
- name: tpa_node
rules:
That structure just serves to declare the rest of the alerts in the
file. However, consider that "rules within a group are run
sequentially at a regular interval, with the same evaluation time"
(see the recording rules documentation). So avoid putting all
alerts inside the same file. In TPA, we group alerts by exporter, so
we have (above) tpa_node for alerts pertaining to the
[node_exporter][], for example.
After being merged, the changes should propagate within 4 to 6 hours. Prometheus does not automatically reload those rules by itself, but Puppet should handle reloading the service as a consequence of the file changes. TPA members can accelerate this by running Puppet on the Prometheus servers, or pulling the code and reloading the Prometheus server with:
git -C /etc/prometheus-alerts/ pull
systemctl reload prometheus
Other expression examples
The AptUpdateLagging alert is a good example of an expression with a
built-in threshold:
(time() - apt_package_cache_timestamp_seconds)/(60*60) > 24
What this does is calculate the age of the package cache (given by the
apt_package_cache_timestamp_seconds metric) by subtracting it from
the current time. That gives us a number of seconds, which we convert
to hours (dividing by 60*60) and then check against our threshold (> 24).
The result is a value (in this case, in hours) that we can reuse in our
annotation. In general, the formula looks like:
(time() - metric_seconds)/$tick > $threshold
Where $tick is the conversion factor for the order of magnitude (60 for
minutes, 60*60 for hours, 24*60*60 for days, etc.) matching the unit of
$threshold. Note that operator precedence requires putting the 60*60
tick in parentheses.
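To make the precedence issue concrete (a sketch):

```
# correct: parentheses force the conversion from seconds to hours
(time() - apt_package_cache_timestamp_seconds) / (60*60) > 24
# without them, division and multiplication associate left to right,
# i.e. (... / 60) * 60, which leaves the value in seconds
(time() - apt_package_cache_timestamp_seconds) / 60 * 60 > 24
```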
The DiskWillFillSoon alert does a linear regression to try to
predict if a disk will fill in less than 24h:
(node_filesystem_readonly != 1)
and (
node_filesystem_avail_bytes
/ node_filesystem_size_bytes < 0.2
)
and (
predict_linear(node_filesystem_avail_bytes[6h], 24*60*60)
< 0
)
The core of the logic is the magic predict_linear function, but note
how the expression also restricts its checks to file systems with less
than 20% of space left, to avoid warning about normal write spikes.
How-to
Accessing the web interface
Access to Prometheus is granted in the same way as for Grafana. To obtain access to the Prometheus web interface and to the Karma alert dashboard, follow the instructions for accessing Grafana.
Queries cheat sheet
This section collects PromQL queries we find interesting.
These are useful but more complex queries that we had to recreate a few times before writing them down.
If you're looking for more basic information about PromQL, see our PromQL primer.
Availability
Those are almost all visible from the availability dashboard.
Unreachable hosts (technically, unavailable node exporters):
up{job="node"} != 1
Currently firing alerts:
ALERTS{alertstate="firing"}
[How much time was the given service (node job, in this case) up in the past period (30d)][]:
avg(avg_over_time(up{job="node"}[30d]))
How many hosts are online at any given point in time:
sum(count(up==1))/sum(count(up)) by (alias)
How long did an alert fire over a given period of time, in seconds per day:
sum_over_time(ALERTS{alertname="MemFullSoon"}[1d:1s])
HTTP status code associated with blackbox probe failures:
sort((probe_success{job="blackbox_https_200"} < 1) + on (alias) group_right probe_http_status_code)
The latter is an example of vector matching, which allows you to
"join" multiple metrics together, in this case failed probes
(probe_success < 1) with their status code (probe_http_status_code).
Inventory
Those are visible in the main Grafana dashboard.
Number of machines (technically, the number of node exporters):
count(up{job="node"})
Number of machine per OS version:
count(node_os_info) by (version_id, version_codename)
Number of machines per exporters, or technically, number of machines per job:
sort_desc(sum(up{job=~"$job"}) by (job))
Number of CPU cores, memory size, file system and LVM sizes:
count(node_cpu_seconds_total{classes=~"$class",mode="system"})
sum(node_memory_MemTotal_bytes{classes=~"$class"}) by (alias)
sum(node_filesystem_size_bytes{classes=~"$class"}) by (alias)
sum(node_volume_group_size{classes=~"$class"}) by (alias)
See also the CPU, memory, and disk dashboards.
Uptime of each host, in days:
round((time() - node_boot_time_seconds) / (24*60*60))
Disk usage
This is a less strict version of the [DiskWillFillSoon alert][],
see also the disk usage dashboard.
Find disks that will be full in 6 hours:
predict_linear(node_filesystem_avail_bytes[6h], 24*60*60) < 0
Running commands on hosts matching a PromQL query
Say you have an alert or situation (e.g. high load) affecting multiple servers. Say, for example, that you have some issue that you fixed in Puppet that will clear such an alert, and want to run Puppet on all affected servers.
You can use the Prometheus JSON API to return the list of hosts
matching the query (in this case up < 1) and run commands on them (in
this case patc, to run Puppet) with Cumin:
cumin "$(curl -sSL --data-urlencode 'query=up < 1' "https://$HTTP_USER@prometheus.torproject.org/api/v1/query" | jq -r '.data.result[].metric.alias' | grep -v '^null$' | paste -sd,)" 'patc'
Make sure to populate the HTTP_USER environment variable to
authenticate with the Prometheus server.
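For example, assuming the credentials take the usual user:password form (the same "web password" as for Grafana):

```
export HTTP_USER=someuser:somepassword
```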
Alert debugging
We are now using Prometheus for alerting for TPA services. Here's a basic overview of how things interact around alerting:
- Prometheus is configured to create alerts on certain conditions on metrics.
- When the PromQL expression produces a result, an alert is created in state `pending`.
- If the PromQL expression keeps producing a result for the whole `for` duration configured in the alert, the alert changes to state `firing` and Prometheus sends it to one or more Alertmanager instances.
- Alertmanager receives alerts from Prometheus and is responsible for routing them to the appropriate channels. For example:
  - A team's or service operator's email address
  - TPA's IRC channel for alerts, #tor-alerts
- Karma and Grafana read alert data from Alertmanager and display it in a way that can be used by humans.
Currently, the secondary Prometheus server (prometheus2) reproduces this setup
specifically for sending out alerts to other teams with metrics that are not
made public.
This section details how the alerting setup mentioned above works.
In general, the upstream documentation for alerting starts from the Alerting Overview but it can be lacking at times. This tutorial can be quite helpful in better understanding how things are working.
Note that Grafana also has its own alerting system but we are not using that, see the Grafana for alerting section of the TPA-RFC-33 proposal.
Diagnosing alerting failures
Normally, alerts should fire on the Prometheus server and be sent out to the Alertmanager server, and be visible in Karma. See also the alert routing details reference.
If you're not sure alerts are working, head to the Prometheus
dashboard and look at the /alerts, and /rules pages. For example:
- https://prometheus.torproject.org/alerts - should show the configured alerts, and whether they are firing
- https://prometheus.torproject.org/rules - should show the configured rules, and whether they match
Typically, the Alertmanager address (currently
http://localhost:9093, but to be exposed) should also be useful
to manage the Alertmanager, but in practice the Debian package does
not ship the web interface, so it is of limited use in that
regard. See the amtool section below for more information.
Note that the [/api/v1/targets][] URL is also useful to diagnose problems
with exporters in general; see also the troubleshooting section
below.
If you can't access the dashboard at all or if the above seems too complicated, Grafana can be used as a debugging tool for metrics as well. In the Explore section, you can input Prometheus metrics, with auto-completion, and inspect the output directly.
There's also the Grafana availability dashboard, see the Alerting dashboards section for details.
Managing alerts with amtool
Since the Alertmanager web UI is not available in Debian, you need to
use the [amtool][] command. A few useful commands:
- `amtool alert`: show firing alerts
- `amtool silence add --duration=1h --author=anarcat --comment="working on it" ALERTNAME`: silence alert `ALERTNAME` for an hour, with a comment
Checking alert history
Note that all alerts sent through the Alertmanager are dumped in system logs, through a first "fall through" web hook route:
routes:
# dump *all* alerts to the debug logger
- receiver: 'tpa_http_post_dump'
continue: true
The receiver is configured below:
- name: 'tpa_http_post_dump'
webhook_configs:
- url: 'http://localhost:8098/'
This URL, in turn, runs a simple Python script that just dumps to a JSON log file all POST requests it receives, which provides us with a history of all notifications sent through the Alertmanager.
All logged entries since last boot can be seen with:
journalctl -u tpa_http_post_dump.service -b
This includes other status logs, so if you want to parse the actual
alerts, it's easier to use the logfile in
/var/log/prometheus/tpa_http_post_dump.json.
For example, you can see a prettier version of today's entries with
the jq command:
jq -C . < /var/log/prometheus/tpa_http_post_dump.json | less -r
Or to follow updates in real time:
tail -F /var/log/prometheus/tpa_http_post_dump.json | jq .
The top-level objects are logging objects; you can restrict the output to only the alerts being sent with:
tail -F /var/log/prometheus/tpa_http_post_dump.json | jq .args
... which actually shows alert groups, which is how the Alertmanager dispatches alerts. To see the individual alerts inside a group, you want:
tail -F /var/log/prometheus/tpa_http_post_dump.json | jq .args.alerts[]
Logs are automatically rotated every day by the script itself, and kept for 30 days. That configuration is hardcoded in the script's source code.
See tpo/tpa/team#42222 for improvements on retention and more lookup examples.
Testing alerts
Prometheus can run unit tests for your defined alerts. See upstream unit test documentation.
We managed to build a minimal unit test for an alert. Note that for a unit test
to succeed, the test must match all the labels and annotations of the expected
alerts, including ones that are added by relabeling in Prometheus:
root@hetzner-nbg1-02:~/tests# cat tpa_system.yml
rule_files:
- /etc/prometheus-alerts/rules.d/tpa_system.rules
evaluation_interval: 1m
tests:
# NOTE: interval is *necessary* here. contrary to what the documentation
# shows, leaving it out will not default to the evaluation_interval set
# above
- interval: 1m
# Set of fixtures for the tests below
input_series:
- series: 'node_reboot_required{alias="NetworkHealthNodeRelay",instance="akka.0x90.dk:9100",job="relay",team="network"}'
# this means "one sample set to the value 60" or, as a Python
# list: [1, 1, 1, 1, ..., 1] or [1 for _ in range(60)]
#
# in general, the notation here is 'a+bxn' which turns into
# the list [a, a+b, a+(2*b), ..., a+(n*b)], or as a list
      # comprehension [a+i*b for i in range(n)]. b defaults to zero,
# so axn is equivalent to [a for i in range(n)]
#
# see https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/#series
values: '1x60'
alert_rule_test:
# NOTE: eval_time is the offset from 0s at which the alert should be
# evaluated. if it is shorter than the alert's `for` setting, you will
# have some missing values for a while (which might be something you
# need to test?). You can play with the eval_time in other test
# entries to evaluate the same alert at different offsets in the
# timeseries above.
#
# Note that the `time()` function returns zero when the evaluation
# starts, and increments by `interval` until `eval_time` is
# reached, which differs from how things work in reality,
# where time() is the number of seconds since the
# epoch.
#
# in other words, this means the simulation starts at the
# Epoch and stops (here) an hour later.
- eval_time: 60m
alertname: NeedsReboot
exp_alerts:
# Alert 1.
- exp_labels:
severity: warning
instance: akka.0x90.dk:9100
job: relay
team: network
alias: "NetworkHealthNodeRelay"
exp_annotations:
description: "Found pending kernel upgrades for host NetworkHealthNodeRelay"
playbook: "https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/reboots"
summary: "Host NetworkHealthNodeRelay needs to reboot"
The success result:
root@hetzner-nbg1-01:~/tests# promtool test rules tpa_system.yml
Unit Testing: tpa_system.yml
SUCCESS
A failing test will show you what alerts were obtained and how they compare to what your failing test was expecting:
root@hetzner-nbg1-02:~/tests# promtool test rules tpa_system.yml
Unit Testing: tpa_system.yml
FAILED:
alertname: NeedsReboot, time: 10m,
exp:[
0:
Labels:{alertname="NeedsReboot", instance="akka.0x90.dk:9100", job="relay", severity="warning", team="network"}
Annotations:{}
],
got:[]
The above allows us to confirm that, under a specific set of circumstances (the defined series), a specific query will generate a specific alert with a given set of labels and annotations.
Those labels can then be fed into amtool to test routing. For
example, the above alert can be tested against the Alertmanager
configuration with:
amtool config routes test alertname="NeedsReboot" instance="akka.0x90.dk:9100" job="relay" severity="warning" team="network"
Or really, what matters in most cases are severity and team, so
this also works, and gives out the proper route:
amtool config routes test severity="warning" team="network" ; echo $?
Example:
root@hetzner-nbg1-02:~/tests# amtool config routes test alertname="NeedsReboot" instance="akka.0x90.dk:9100" job="relay" severity="warning" team="network"
network team
Ignore the warning, it's the difference between testing the live
server and the local configuration. Naturally, you can test what
happens if the team label is missing or incorrect, to confirm
default route errors:
root@hetzner-nbg1-02:~/tests# amtool config routes test severity="warning" team="networking"
fallback
The above, for example, confirms that networking is not the correct
team name (it should be network).
Note that you can also deliver an alert to a web hook receiver synthetically. For example, this will deliver an empty message to the IRC relay:
curl --header "Content-Type: application/json" --request POST --data "{}" http://localhost:8098
Checking for targets changes
If you are making significant changes to the way targets are discovered by Prometheus, you might want to make sure you are not missing anything.
There used to be a targets web interface, but it might be broken (1108095) or even retired altogether (tpo/tpa/team#41790); besides, visually checking for this is error-prone.
It's better to do a stricter check. For that, you can use the API
endpoint and diff the resulting JSON, after some filtering. Here's
an example.
- Fetch the targets before the change:

      curl localhost:9090/api/v1/targets > before.json

- Make the change (typically by running Puppet):

      pat

- Fetch the targets after the change:

      curl localhost:9090/api/v1/targets > after.json

- Diff the two. You'll notice this is way too noisy because the scrape times have changed, and you might also get changed paths that you should ignore:

      diff -u before.json after.json

  Files might be sorted differently as well.

- So instead, create a filtered and sorted JSON file for each:

      jq -S '.data.activeTargets | sort_by(.scrapeUrl)' < before.json | grep -v -e lastScrape -e 'meta_filepath' > before-subset.json
      jq -S '.data.activeTargets | sort_by(.scrapeUrl)' < after.json | grep -v -e lastScrape -e 'meta_filepath' > after-subset.json

- Then diff the filtered views:

      diff -u before-subset.json after-subset.json
Metric relabeling
The blackbox target documentation uses a technique called
"relabeling" to have the blackbox exporter actually provide useful
labels. This is done with the relabel_configs configuration,
which changes labels before the scrape is performed, so that the
blackbox exporter is scraped instead of the configured target, and
that the configured target is passed to the exporter.
The site relabeler.promlabs.com can be extremely useful to learn how to use and iterate more quickly over those configurations. It takes in a set of labels and a set of relabeling rules and will output a diff of the label set after each rule is applied, showing you in detail what's going on.
There are other uses for this. In the bacula job, for example, we
relabel the alias label so that it points at the host being backed
up instead of the host where backups are stored:
- job_name: 'bacula'
metric_relabel_configs:
# the alias label is what's displayed in IRC summary lines. we want to
# know which backup jobs failed alerts, not which backup host contains the
# failed jobs.
- source_labels:
- 'alias'
target_label: 'backup_host'
- source_labels:
- 'bacula_job'
target_label: 'alias'
The above takes the alias label (e.g. bungei.torproject.org) and
copies it to a new label, backup_host. It then takes the
bacula_job label and uses that as an alias label. This has the
effect of turning a metric like this:
bacula_job_last_execution_end_time{alias="bacula-director-01.torproject.org",bacula_job="alberti.torproject.org",instance="bacula-director-01.torproject.org:9133",job="bacula",team="TPA"}
into that:
bacula_job_last_execution_end_time{alias="alberti.torproject.org",backup_host="bacula-director-01.torproject.org",bacula_job="alberti.torproject.org",instance="bacula-director-01.torproject.org:9133",job="bacula",team="TPA"}
This configuration is different from the blackbox exporter because it
operates after the scrape, and therefore affects labels coming out
of the exporter (which plain relabel_configs can't do).
This can be really tricky to get right. The equivalent change, for the
Puppet reporter, initially caused problems because it dropped the
alias label on all node metrics. This was the incorrect
configuration:
- job_name: 'node'
metric_relabel_configs:
- source_labels: ['host']
target_label: 'alias'
action: 'replace'
- regex: '^host$'
action: 'labeldrop'
That destroyed the alias label because the first block matches even
if the host was empty. The fix was to match something (anything!) in
the host label, making sure it was present, by changing the regex
field:
- job_name: 'node'
metric_relabel_configs:
- source_labels: ['host']
target_label: 'alias'
action: 'replace'
regex: '(.+)'
- regex: '^host$'
action: 'labeldrop'
Those configurations were done to make it possible to inhibit alerts
based on common labels. Before those changes, the alias field (for
example) was not common between (say) the Puppet metrics and the
normal node exporter, which made it impossible to (say) avoid
sending alerts about a catalog being stale in Puppet because a host is
down. See tpo/tpa/team#41642 for a full discussion on this.
Note that this is not the same as recording rules, which we do not currently use.
Debugging the blackbox exporter
The upstream documentation has some details that can help. We also have examples above for how to configure it in our setup.
One thing that's nice to know in addition to how it's configured is how you can
debug it. You can query the exporter from localhost in order to get more
information. If you are using this method for debugging, you'll most probably
want to include debugging output. For example, to run an ICMP test on host
pauli.torproject.org:
curl 'http://localhost:9115/probe?target=pauli.torproject.org&module=icmp&debug=true'
Note that the above trick can be used for any target, not just for ones currently configured in the blackbox exporter. So you can also use this to test things before creating the final configuration for the target.
Tracing a metric to its source
If you have a metric (say
gitlab_workhorse_http_request_duration_seconds_bucket) and you don't
know where it's coming from, try fetching the full metric with its
labels and look at the job label. This can be done in the Prometheus
web interface or with Fabric, for example with:
fab prometheus.query-to-series --expression gitlab_workhorse_http_request_duration_seconds_bucket
For our sample metric, it shows:
anarcat@angela:~/s/t/fabric-tasks> fab prometheus.query-to-series --expression gitlab_workhorse_http_request_duration_seconds_bucket | head
INFO: sending query gitlab_workhorse_http_request_duration_seconds_bucket to https://prometheus.torproject.org/api/v1/query
gitlab_workhorse_http_request_duration_seconds_bucket{alias="gitlab-02.torproject.org",backend_id="rails",code="200",instance="gitlab-02.torproject.org:9229",job="gitlab-workhorse",le="0.005",method="get",route_id="default",team="TPA"} 162
gitlab_workhorse_http_request_duration_seconds_bucket{alias="gitlab-02.torproject.org",backend_id="rails",code="200",instance="gitlab-02.torproject.org:9229",job="gitlab-workhorse",le="0.025",method="get",route_id="default",team="TPA"} 840
The details of those metrics don't matter, what matters is the job
label here:
job="gitlab-workhorse"
This corresponds to a job field in the Prometheus configuration. On
the prometheus1 server, for example, we can see this in
/etc/prometheus/prometheus.yml:
- job_name: gitlab-workhorse
static_configs:
- targets:
- gitlab-02.torproject.org:9229
labels:
alias: gitlab-02.torproject.org
team: TPA
Then you can go on gitlab-02 and see what listens on port 9229:
root@gitlab-02:~# lsof -n -i :9229
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
gitlab-wo 1282 git 3u IPv6 14159 0t0 TCP *:9229 (LISTEN)
gitlab-wo 1282 git 561u IPv6 2450737 0t0 TCP [2620:7:6002:0:266:37ff:feb8:3489]:9229->[2a01:4f8:c2c:1e17::1]:59922 (ESTABLISHED)
... which is:
root@gitlab-02:~# ps 1282
PID TTY STAT TIME COMMAND
1282 ? Ssl 9:56 /opt/gitlab/embedded/bin/gitlab-workhorse -listenNetwork unix -listenUmask 0 -listenAddr /var/opt/gitlab/gitlab-workhorse/sockets/s
So that's the GitLab Workhorse proxy, in this case.
In other cases, you'll typically find the metric belongs to the node job, which usually means the node exporter itself. But more exotic metrics can show up there too: those are usually written by an external job to /var/lib/prometheus/node-exporter, also known as the "textfile collector". To find what generates such a file, you need to either watch the file change or grep for the filename in Puppet.
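For example, to track down what writes a given file (the tpa_backuppg name here is only an example), you can grep the usual places that schedule jobs on the host, or search a checkout of the Puppet repository:
# on the affected host, look for whatever schedules the job
grep -r tpa_backuppg /etc/cron.d /etc/cron.daily /etc/systemd/system 2>/dev/null
# or, in a local checkout of tor-puppet.git
git grep tpa_backuppg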
Advanced metrics ingestion
This section documents more advanced metrics ingestion topics that we rarely need or use.
Back-filling
Since Prometheus 2.24, back-filling is supported. This is untested on our side, but this guide might provide a good tutorial.
Push metrics to the Pushgateway
The Pushgateway is setup on the secondary Prometheus server
(prometheus2). Note that you might not need to use the Pushgateway,
see the article about pushing metrics before going down this
route.
The Pushgateway listens on port 9091 and accepts data through a simple, curl-friendly HTTP API. We have found that, once it is installed, this command just "does the right thing", more or less:
echo 'some_metrics{foo="bar"} 3.14' | curl --data-binary @- http://localhost:9091/metrics/job/jobtest/instance/instancetest
To confirm the data was ingested by the Pushgateway:
curl localhost:9091/metrics | head
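To remove a metric group that was pushed by mistake, the Pushgateway also accepts DELETE requests on the same grouping path (the job and instance names here match the push example above):
curl -X DELETE http://localhost:9091/metrics/job/jobtest/instance/instancetest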
The Pushgateway is scraped, like other Prometheus jobs, every minute,
with metrics kept for a year, at the time of writing. This is
configured, inside Puppet, in profile::prometheus::server::external.
Note that it's not possible to push timestamps into the Pushgateway, so it's not useful to ingest past historical data.
Deleting metrics
Deleting metrics can be done through the Admin API. That first needs
to be enabled in /etc/default/prometheus, by adding
--web.enable-admin-api to the ARGS list, then Prometheus needs to
be restarted:
service prometheus restart
WARNING: make sure there is authentication in front of Prometheus because this could expose the server to more destruction.
Then you need to issue a special query through the API. This, for example, will wipe all metrics associated with the given instance:
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="gitlab-02.torproject.org:9101"}'
The same, but only for about an hour, good for testing that only the wanted metrics are destroyed:
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="gitlab-02.torproject.org:9101"}&start=2021-10-25T19:00:00Z&end=2021-10-25T20:00:00Z'
To match only a job on a specific instance:
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="gitlab-02.torproject.org:9101"}&match[]={job="gitlab"}'
Deleted metrics are not necessarily immediately removed from disk but are "eligible for compaction". The changes should show up immediately in queries, however. The "Clean Tombstones" endpoint can be used to remove samples from disk right away, if that's absolutely necessary:
curl -XPOST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones
Make sure to disable the Admin API when done.
Pager playbook
This section documents alerts and issues with the Prometheus service itself. Do NOT document every alert Prometheus can possibly generate here! Document those in the individual service pages, and link to them in the alert's playbook annotation.
What belongs here are only alerts that truly don't have any other place to go, or that are completely generic to any service (e.g. JobDown belongs here). Generic operating system issues like "disk full" must be documented elsewhere, typically in incident-response.
Troubleshooting missing metrics
If metrics do not correctly show up in Grafana, it might be worth checking in the Prometheus dashboard itself for the same metrics. Typically, if they do not show up in Grafana, they won't show up in Prometheus either, but it's worth a try, even if only to see the raw data.
Then, if data truly isn't present in Prometheus, you can track down
the "target" (the exporter) responsible for it in the
[/api/v1/targets][] listing. If the target is "unhealthy", it will
be marked as "down" and an error message will show up.
This will show all down targets with their error messages:
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up") | {instance: .labels, scrapeUrl, health, lastError}'
If it returns nothing, all targets are healthy. Here's an example of a probe that has not completed yet:
root@hetzner-nbg1-01:~# curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up") | {instance: .labels, scrapeUrl, health, lastError}'
{
"instance": "gitlab-02.torproject.org:9188",
"health": "unknown",
"lastError": ""
}
... and, after a while, an error might come up:
root@hetzner-nbg1-01:~# curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up") | {instance: .labels, scrapeUrl, health, lastError}'
{
"instance": {
"alias": "gitlab-02.torproject.org",
"instance": "gitlab-02.torproject.org:9188",
"job": "gitlab",
"team": "TPA"
},
"scrapeUrl": "http://gitlab-02.torproject.org:9188/metrics",
"health": "down",
"lastError": "Get \"http://gitlab-02.torproject.org:9188/metrics\": dial tcp [2620:7:6002:0:266:37ff:feb8:3489]:9188: connect: connection refused"
}
In that case, there was a typo in the port number: the correct port was 9187 and, once changed, the target was scraped properly. You can directly verify a given target with this jq incantation:
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.instance == "gitlab-02.torproject.org:9187") | {instance: .labels, health, lastError}'
For example:
root@hetzner-nbg1-01:~# curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.instance == "gitlab-02.torproject.org:9187") | {instance: .labels, health, lastError}'
{
"instance": {
"alias": "gitlab-02.torproject.org",
"instance": "gitlab-02.torproject.org:9187",
"job": "gitlab",
"team": "TPA"
},
"health": "up",
"lastError": ""
}
{
"instance": {
"alias": "gitlab-02.torproject.org",
"classes": "role::gitlab",
"instance": "gitlab-02.torproject.org:9187",
"job": "postgres",
"team": "TPA"
},
"health": "up",
"lastError": ""
}
Note that the above is an example of a misconfiguration: the target was scraped twice, once from Puppet (the classes label is a good hint of that) and once from the static configuration. The latter was removed.
If the target is marked healthy, the next step is to scrape the
metrics manually. This, for example, will scrape the Apache exporter
from the host gayi:
curl -s http://gayi.torproject.org:9117/metrics | grep apache
In the case of this bug, the metrics were not showing up at all:
root@hetzner-nbg1-01:~# curl -s http://gayi.torproject.org:9117/metrics | grep apache
# HELP apache_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, and goversion from which apache_exporter was built.
# TYPE apache_exporter_build_info gauge
apache_exporter_build_info{branch="",goversion="go1.7.4",revision="",version=""} 1
# HELP apache_exporter_scrape_failures_total Number of errors while scraping apache.
# TYPE apache_exporter_scrape_failures_total counter
apache_exporter_scrape_failures_total 18371
# HELP apache_up Could the apache server be reached
# TYPE apache_up gauge
apache_up 0
Notice, however, the apache_exporter_scrape_failures_total, which
was incrementing. From there, we reproduced the work the exporter was
doing manually and fixed the issue, which involved passing the correct
argument to the exporter.
Slow startup times
If Prometheus takes a long time to start, and floods logs with lines like this every second:
Nov 01 19:43:03 hetzner-nbg1-02 prometheus[49182]: level=info ts=2022-11-01T19:43:03.788Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=30182 maxSegment=30196
It's somewhat normal. At the time of writing, Prometheus2 takes over a minute to start because of this problem. When it's done, it will show the timing information, which is currently:
Nov 01 19:43:04 hetzner-nbg1-02 prometheus[49182]: level=info ts=2022-11-01T19:43:04.533Z caller=head.go:722 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=314.859946ms wal_replay_duration=1m16.079474672s total_replay_duration=1m16.396139067s
The solution for this is to use the memory-snapshot-on-shutdown feature flag, but that is only available from 2.30.0 onward (not in Debian bullseye), and there are critical bugs in the feature flag before 2.34 (see PR 10348), so tread carefully.
In other words, this is frustrating, but expected for older releases of Prometheus. Newer releases may have optimizations for this, but they need a restart to apply.
Pushgateway errors
The Pushgateway web interface provides some basic information about the metrics it collects, and allows you to view the pending metrics before they get scraped by Prometheus, which may be useful to troubleshoot issues with the gateway.
To pull metrics by hand, you can pull directly from the Pushgateway:
curl localhost:9091/metrics
If you get this error while pulling metrics from the exporter:
An error has occurred while serving metrics:
collected metric "some_metric" { label:<name:"instance" value:"" > label:<name:"job" value:"some_job" > label:<name:"tag" value:"val1" > counter:<value:1 > } was collected before with the same name and label values
It's because similar metrics were sent twice into the gateway, which corrupts the state of the Pushgateway, a known problem in earlier versions that was fixed in 0.10 (Debian bullseye and later). A workaround is simply to restart the Pushgateway (and clear the storage, if persistence is enabled, see the --persistence.file flag).
Running out of disk space
In #41070, we encountered a situation where disk usage on the main Prometheus server was growing linearly even if the number of targets didn't change. This is a typical problem in time series like this where the "cardinality" of metrics grows without bound, consuming more and more disk space as time goes by.
The first step is to confirm the diagnosis by looking at the Grafana graph showing Prometheus disk usage over time. This should show a "sawtooth wave" pattern where compactions happen regularly (about once every three weeks), but without growing much over longer periods of time. In the above ticket, the usage was growing despite compactions. There are also shorter-term (~4h) and smaller compactions happening. This information is also available in the normal disk usage graphic.
We then headed for the self-diagnostics Prometheus provides at:
https://prometheus.torproject.org/classic/status
The "Most Common Label Pairs" section will show us which job is
responsible for the most number of metrics. It should be job=node,
as that collects a lot of information for all the machines managed
by TPA. About 100k pairs is expected there.
It's also expected to see the "Highest Cardinality Labels" to be
__name__ at around 1600 entries.
We haven't implemented it yet, but the upstream Storage
documentation has some interesting tips, including advice on
long-term storage which suggests tweaking the
storage.local.series-file-shrink-ratio.
This guide from Alexandre Vazquez also had some useful queries and tips we didn't fully investigate. For example, this reproduces the "Highest Cardinality Metric Names" panel in the Prometheus dashboard:
topk(10, count by (__name__)({__name__=~".+"}))
The api/v1/status/tsdb endpoint also provides equivalent statistics. Here are the equivalent fields:
- Highest Cardinality Labels: labelValueCountByLabelName
- Highest Cardinality Metric Names: seriesCountByMetricName
- Label Names With Highest Cumulative Label Value Length: memoryInBytesByLabelName
- Most Common Label Pairs: seriesCountByLabelValuePair
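For example, this should pull the top metric names by series count straight from that endpoint (the jq field name follows the current API):
curl -s http://localhost:9090/api/v1/status/tsdb | jq .data.seriesCountByMetricName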
Default route errors
If you get an email like:
Subject: Configuration error - Default route: [FIRING:1] JobDown
It's because an alerting rule fired with an incorrect configuration. Instead of being routed to the proper team, it fell through the default route.
This is not an emergency in the sense that it's a normal alert, but it just got routed improperly. It should be fixed, in time. If in a rush, open a ticket for the team likely responsible for the alerting rule.
Finding the responsible party
So the first step, even if just filing a ticket, is to find the responsible party.
Let's take this email for example:
Date: Wed, 03 Jul 2024 13:34:47 +0000
From: alertmanager@hetzner-nbg1-01.torproject.org
To: root@localhost
Subject: Configuration error - Default route: [FIRING:1] JobDown
CONFIGURATION ERROR: The following notifications were sent via the default route node, meaning
that they had no team label matching one of the per-team routes.
This should not be happening and it should be fixed. See:
https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#reference
Total firing alerts: 1
## Firing Alerts
-----
Time: 2024-07-03 13:34:17.366 +0000 UTC
Summary: Job mtail@rdsys-test-01.torproject.org is down
Description: Job mtail on rdsys-test-01.torproject.org has been down for more than 5 minutes.
-----
In the above, the mtail job on rdsys-test-01 "has been down for more than 5 minutes" and the notification was routed to root@localhost.
The most likely owner for that rule is TPA, which manages the mtail service and jobs, even though the services on that host are managed by the anti-censorship team's service admins. If the host was not managed by TPA, or if this was a notification about a service operated by another team, then a ticket should be filed with that team. In this case, #41667 was filed.
Fixing routing
To fix this issue, you must first reproduce the query that triggered the alert. This can be found in the Prometheus alerts dashboard, if the alert is still firing. In this case, we see this:
| Labels | State | Active Since | Value |
|---|---|---|---|
alertname="JobDown" alias="rdsys-test-01.torproject.org" classes="role::rdsys::backend" instance="rdsys-test-01.torproject.org:3903" job="mtail" severity="warning" |
Firing | 2024-07-03 13:51:17.36676096 +0000 UTC | 0 |
In this case, we can see there's no team label on that metric, which
is the root cause.
If we can't find the alert anymore (say it fixed itself), we can
still try to look for the matching alerting rule. Grep for the
alertname above in prometheus-alerts.git. In this case, we find:
anarcat@angela:prometheus-alerts$ git grep JobDown
rules.d/tpa_system.rules: - alert: JobDown
and the following rule:
- alert: JobDown
expr: up < 1
for: 5m
labels:
severity: warning
annotations:
summary: 'Job {{ $labels.job }}@{{ $labels.alias }} is down'
description: 'Job {{ $labels.job }} on {{ $labels.alias }} has been down for more than 5 minutes.'
playbook: "TODO"
The query, in this case, is therefore up < 1. But since the alert has resolved, we can't just run the exact same query and expect to find the same host; instead, we need to broaden the query by dropping the conditional (so just up) and adding the right labels. In this case, this should do the trick:
up{instance="rdsys-test-01.torproject.org:3903",job="mtail"}
which, when we query Prometheus directly, gives us the following metric:
up{alias="rdsys-test-01.torproject.org",classes="role::rdsys::backend",instance="rdsys-test-01.torproject.org:3903",job="mtail"}
0
There you can see all the labels associated with the metric. Those match the alerting rule labels, but that may not always be the case, so this step can be helpful to confirm the root cause.
So, in this case, the mtail job doesn't have the right team
label. The fix was to add the team label to the scrape job:
commit 68e9b463e10481745e2fd854aa657f804ab3d365
Author: Antoine Beaupré <anarcat@debian.org>
Date: Wed Jul 3 10:18:03 2024 -0400
properly pass team label to postfix mtail job
Closes: tpo/tpa/team#41667
diff --git a/modules/mtail/manifests/postfix.pp b/modules/mtail/manifests/postfix.pp
index 542782a33..4c30bf563 100644
--- a/modules/mtail/manifests/postfix.pp
+++ b/modules/mtail/manifests/postfix.pp
@@ -8,6 +8,11 @@ class mtail::postfix (
class { 'mtail':
logs => '/var/log/mail.log',
scrape_job => $scrape_job,
+ scrape_job_labels => {
+ 'alias' => $::fqdn,
+ 'classes' => "role::${pick($::role, 'undefined')}",
+ 'team' => 'TPA',
+ },
}
mtail::program { 'postfix':
source => 'puppet:///modules/mtail/postfix.mtail',
See also testing alerts to drill down into queries and alert routing, in case the above doesn't work.
Exporter job down warnings
If you see an error like:
Exporter job gitlab_runner on tb-build-02.torproject.org:9252 is down
That is because Prometheus cannot reach the exporter at the given address. The right way forward is to look at the targets listing and see why Prometheus is failing to scrape the target.
Service down
The simplest and most obvious case is that the service is just
down. For example, Prometheus has this to say about the above
gitlab_runner job:
Get "http://tb-build-02.torproject.org:9252/metrics": dial tcp [2620:7:6002:0:3eec:efff:fed5:6c40]:9252: connect: connection refused
In this case, the gitlab-runner was just not running (yet). It was
being configured and had been added to Puppet, but wasn't yet
correctly setup.
In another scenario, the service might be running but unreachable from the Prometheus server. Use curl to confirm Prometheus' view, testing IPv4 and IPv6 separately:
curl -4 http://tb-build-02.torproject.org:9252/metrics
curl -6 http://tb-build-02.torproject.org:9252/metrics
Try this from the server itself as well.
If you know which service it is (and the job name should be a good hint), check the service on the server, in this case:
systemctl status gitlab-runner
Invalid exporter output
In another case:
Exporter job civicrm@crm.torproject.org:443 is down
Prometheus was failing with this error:
expected value after metric, got "INVALID"
That means there's a syntax error in the metrics output; in this case, no value was provided for a metric, like this:
# HELP civicrm_torcrm_resque_processor_status_up Resque processor status
# TYPE civicrm_torcrm_resque_processor_status_up gauge
civicrm_torcrm_resque_processor_status_up
See [web/civicrm#149][] for further details on this
outage.
Forbidden errors
Another example might be:
server returned HTTP status 403 Forbidden
In which case there's a permission issue on the exporter endpoint. Try to reproduce the issue by pulling the endpoint directly, on the Prometheus server, with, for example:
curl -sSL https://donate.torproject.org:443/metrics
Or whatever URL is visible in the targets listing above. This could be a web server configuration issue or a lack of matching credentials in the exporter configuration. Look in tor-puppet.git, at profile::prometheus::server::internal::collect_scrape in hiera/common/prometheus.yaml, where credentials should be defined (although the secrets themselves should actually be stored in Trocla).
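If the endpoint requires credentials, try reproducing the scrape with them; the username and password here are placeholders:
curl -sSL -u 'scraper:SECRET' https://donate.torproject.org/metrics | head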
Apache exporter scraping failed
If you get the error Apache Exporter cannot monitor web server on
test.example.com (ApacheScrapingFailed), Apache is up, but the
Apache exporter cannot pull its metrics from there.
That means the exporter cannot pull the URL
http://localhost/server-status/?auto. To reproduce, pull the URL
with curl from the affected server, for example:
root@test.example.com:~# curl http://localhost/server-status/?auto
This is a typical configuration error in Apache where the
/server-status host is not available to the exporter because the
"default virtual host" was disabled (apache2::default_vhost in
Hiera).
There is normally a workaround for this in the profile::prometheus::apache_exporter class, which configures a localhost virtual host to answer properly on this address. Verify that it's present, and consider using apache2ctl -S to inspect the virtual host configuration.
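For example, a quick check of both the virtual host configuration and the status page itself might look like this:
apache2ctl -S | grep -i localhost
curl -s 'http://localhost/server-status/?auto' | head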
See also the Apache web server diagnostics in the incident response docs for broader issues with web servers.
Text file collector errors
The NodeTextfileCollectorErrors alert looks like this:
Node exporter textfile collector errors on test.torproject.org
It means that the text file collector is having trouble parsing one
or many of the files in its --collector.textfile.directory (defaults
to /var/lib/prometheus/node-exporter).
The error should be visible in the node exporter logs, run the following command to see it:
journalctl -u prometheus-node-exporter -e
Here's a list of issues found in the wild, but your particular issue might be different.
Wrong permissions
Sep 24 20:56:53 bungei prometheus-node-exporter[1387]: ts=2024-09-24T20:56:53.280Z caller=textfile.go:227 level=error collector=textfile msg="failed to collect textfile data" file=tpa_backuppg.prom err="failed to open textfile data file \"/var/lib/prometheus/node-exporter/tpa_backuppg.prom\": open /var/lib/prometheus/node-exporter/tpa_backuppg.prom: permission denied"
In this case, the file was created as a temporary file and moved into place without fixing the permissions. The fix was to create the file without the Python tempfile library (which creates files readable only by their owner), using a .tmp suffix instead, and then move it into place.
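A minimal sketch of the safe pattern, writing the file with world-readable permissions next to its final location before moving it into place (the metric and file names are only examples):
dir=/var/lib/prometheus/node-exporter
printf 'tpa_example_metric 1\n' > "$dir/tpa_example.prom.tmp"
chmod 0644 "$dir/tpa_example.prom.tmp"
mv "$dir/tpa_example.prom.tmp" "$dir/tpa_example.prom"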
Garbage in a text file
Sep 24 21:14:41 perdulce prometheus-node-exporter[429]: ts=2024-09-24T21:14:41.783Z caller=textfile.go:227 level=error collector=textfile msg="failed to collect textfile data" file=scheduled_shutdown_metric.prom err="failed to parse textfile data from \"/var/lib/prometheus/node-exporter/scheduled_shutdown_metric.prom\": text format parsing error in line 3: expected '\"' at start of label value, found 'r'"
This was an experimental metric designed in #41734 to keep track of scheduled reboot times, but it was formatted incorrectly. The entire file content was:
# HELP node_shutdown_scheduled_timestamp_seconds time of the next scheduled reboot, or zero
# TYPE node_shutdown_scheduled_timestamp_seconds gauge
node_shutdown_scheduled_timestamp_seconds{kind=reboot} 1725545703.588789
It was missing quotes around reboot, the proper output would have
been:
# HELP node_shutdown_scheduled_timestamp_seconds time of the next scheduled reboot, or zero
# TYPE node_shutdown_scheduled_timestamp_seconds gauge
node_shutdown_scheduled_timestamp_seconds{kind="reboot"} 1725545703.588789
But the file was simply removed in this case.
Disaster recovery
If a Prometheus or Grafana server is destroyed, it should be completely
re-buildable from Puppet. Non-configuration data should be restored
from backup, with /var/lib/prometheus/ being sufficient to
reconstruct history. If even backups are destroyed, history will be
lost, but the server should still recover and start tracking new
metrics.
Reference
Installation
Puppet implementation
Every TPA server is configured with a node exporter through the
roles::monitored class that is included everywhere. The role might
eventually be expanded to cover alerting and other monitoring
resources as well. This role, in turn, includes the
profile::prometheus::client which configures each client correctly
with the right firewall rules.
The firewall rules are exported from the server, defined in
profile::prometheus::server. We hacked around limitations of the
upstream Puppet module to install Prometheus using backported Debian
packages. The monitoring server itself is defined in
roles::monitoring.
The Prometheus Puppet module was heavily patched to allow scrape job collection and use of Debian packages for installation, among many other patches sent by anarcat.
Much of the initial Prometheus configuration was also documented in ticket 29681 and especially ticket 29388 which investigates storage requirements and possible alternatives for data retention policies.
Pushgateway
The Pushgateway was configured on the external Prometheus server to allow the metrics team to push their data into Prometheus without having to write a Prometheus exporter inside Collector.
This was done directly inside the
profile::prometheus::server::external class, but could be moved to a
separate profile if it needs to be deployed internally. It is assumed
that the gateway script will run directly on prometheus2 to avoid
setting up authentication and/or firewall rules, but this could be
changed.
Alertmanager
The Alertmanager is configured on the Prometheus servers and is used to send alerts over IRC and email.
It is installed through Puppet, in
profile::prometheus::server::external, but could be moved to its own
profile if it is deployed on more than one server.
Note that Alertmanager only dispatches alerts, which are actually
generated on the Prometheus server side of things. Make sure the
following block exists in the prometheus.yml file:
alerting:
alert_relabel_configs: []
alertmanagers:
- static_configs:
- targets:
- localhost:9093
Manual node configuration
External services can be monitored by Prometheus, as long as they comply with the OpenMetrics protocol, which is simply to expose metrics such as this over HTTP:
metric{label=label_val} value
A real-life (simplified) example:
node_filesystem_avail_bytes{alias="alberti.torproject.org",device="/dev/sda1",fstype="ext4",mountpoint="/"} 16160059392
The above says that the node alberti has the device /dev/sda1 mounted on /, formatted as an ext4 file system which has 16160059392 bytes (~16GB) free.
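You can see this format live by scraping a node exporter by hand, for example (assuming it runs on its default port, 9100):
curl -s http://localhost:9100/metrics | grep '^node_filesystem_avail_bytes'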
System-level metrics can easily be monitored by the secondary Prometheus server. This is usually done by installing the "node exporter", with the following steps:
- On Debian Buster and later: apt install prometheus-node-exporter
- On Debian stretch: apt install -t stretch-backports prometheus-node-exporter
This assumes that backports is already configured. If it isn't, a line like this in /etc/apt/sources.list.d/backports.debian.org.list should suffice, followed by an apt update:
deb https://deb.debian.org/debian/ stretch-backports main contrib non-free
The firewall on the machine needs to allow traffic on the exporter
port from the server prometheus2.torproject.org. Then open a
ticket for TPA to configure the target. Make sure to
mention:
- The host name for the exporter
- The port of the exporter (varies according to the exporter, 9100 for the node exporter)
- How often to scrape the target, if non-default (default: 15 seconds)
Then TPA needs to hook those up as a new job in the scrape_configs section of prometheus.yml, from Puppet, in profile::prometheus::server.
See also Adding metrics to applications, above.
Upgrades
Upgrades are automatically handled by official Debian packages everywhere, except for Grafana, which is managed through upstream packages, and Karma, which is managed through a container; both are still automated.
SLA
Prometheus is currently not doing alerting so it doesn't have any sort of guaranteed availability. It should, hopefully, not lose too many metrics over time so we can do proper long-term resource planning.
Design and architecture
Here is, from the Prometheus overview documentation, the basic architecture of a Prometheus site:

As you can see, Prometheus is somewhat tailored towards
Kubernetes but it can be used without it. We're deploying it with
the file_sd discovery mechanism, where Puppet collects all exporters
into the central server, which then scrapes those exporters every
scrape_interval (by default 15 seconds).
The diagram does not show that Prometheus can federate across multiple instances and that the Alertmanager can be configured for high availability. We have a monolithic server setup right now; high availability is planned for TPA-RFC-33-C.
Metrics types
In monitoring distributed systems, Google defines 4 "golden signals", categories of metrics that need to be monitored:
- Latency: time to service a request
- Traffic: transactions per second or bandwidth
- Errors: failure rates, e.g. 500 errors in web servers
- Saturation: full disks, memory, CPU utilization, etc
In the book, they argue all four should page, but we believe warnings are sufficient for saturation, except in extreme cases ("disk actually full").
Alertmanager
The Alertmanager is a separate program that receives notifications generated by Prometheus servers through an API, then groups and deduplicates them before sending them out by email or other mechanisms.
The first deployments of the Alertmanager at TPO do not feature a "cluster", or high availability (HA) setup.
The Alertmanager has its own web interface to see and silence alerts but it's not deployed in our configuration, we use Karma (previously Cloudflare's unsee) instead.
Alerting philosophy
In general, when working on alerting, keep in mind the "My Philosophy on Alerting" paper from a Google engineer (now the Monitoring distributed systems chapter of the Site Reliability Engineering O'Reilly book).
Alert timing details
Alert timing can be a hard topic to understand in Prometheus alerting, because there are many components involved, and the Prometheus documentation is not great at clearly explaining how things work. This is an attempt at explaining various parts of it as I (anarcat) understand it as of 2024-09-19, based on the latest documentation available on https://prometheus.io and the current Alertmanager git HEAD.
First, there might be a time vector involved in the Prometheus query. For example, take the query:
increase(django_http_exceptions_total_by_type_total[5m]) > 0
Here, the "vector range" is 5m or five minutes. You might think this
will fire only after 5 minutes have passed. I'm not actually sure. In
my observations, I have found this fires as soon as an increase is
detected, but will stop after the vector range has passed.
Second, there's the for: parameter in the alerting rule. Say this
was set to 5 minutes again:
- alert: DjangoExceptions
expr: increase(django_http_exceptions_total_by_type_total[5m]) > 0
for: 5m
This means that the alert will only be considered pending for that
period. Prometheus will not send an alert to the Alertmanager at all
unless increase() was sustained for the period. If that happens,
then the alert is marked as firing and Alertmanager will start
getting the alert.
(Alertmanager might be getting the alert in the pending state, but
that makes no difference to our discussion: it will not send alerts
before that period has passed.)
Third, there's another setting, keep_firing_for, that will make
Prometheus keep firing the alert even after the query evaluates to
false. We're ignoring this for now.
At this point, the alert has reached Alertmanager and it needs to make a decision of what to do with it. More timers are involved.
Alerts will be evaluated against the alert routes, thus aggregated
into a new group or added to an existing group according to that
route's group_by setting, and then Alertmanager will evaluate the
timers set on the particular route that was matched. An alert group is
created when an alert is received and no other alerts already match
the same values for the group_by criteria. An alert group is removed
when all alerts in a group are in state inactive (e.g. resolved).
Fourth, there's the group_wait setting (defaults to 5 seconds, can
be customized by route). This will keep Alertmanager from
routing any alerts for a while thus allowing it to group the first
alert notification for all alerts in the same group in one batch. It
implies that you will not receive a notification for a new alert
before that timer has elapsed. See also the too short documentation
on grouping.
(The group_wait timer is initialized when the alerting group is
created, see [dispatch/dispatch.go, line 415, function
newAggrGroup][].)
Now, more alerts might be sent by Prometheus if more metrics match the above expression. They are different alerts because they have different labels (say, another host might have exceptions, above, or, more commonly, other hosts require a reboot). Prometheus will then relay that alert to the Alertmanager, and another timer comes in.
Fifth, before relaying that new alert that's already part of a firing
group, Alertmanager will wait group_interval (defaults to 5m) before
re-sending a notification to a group.
When Alertmanager first creates an alert group, a thread is started
for that group and the route's group_interval acts like a time
ticker. Notifications are only sent when the group_interval period
repeats.
So new alerts merged in a group will wait up to group_interval before
being relayed.
(The group_interval timer is also initialized [in dispatch.go, line
460, function aggrGroup.run()][]. It's done after that function
waits for the previous timer which is normally based on the
group_wait value, but can be switched to group_interval after that
very iteration, of course.)
So, conclusions:
- If an alert flaps because it pops in and out of existence, consider tweaking the query to cover a longer vector, by increasing the time range (e.g. switch from 5m to 1h), or by comparing against a moving average
- If an alert triggers too quickly due to a transient event (say network noise, or someone messing up a deployment but you want to give them a chance to fix it), increase the for: timer.
- Inversely, if you fail to detect transient outages, reduce the for: timer, but be aware this might pick up other noise.
- If alerts come too soon and you get a flood of alerts when an outage starts, increase group_wait.
- If alerts come in slowly but fail to be grouped because they don't arrive at the same time, increase group_interval.
This analysis was done in response to a mysterious failure to send notification in a particularly flappy alert.
Another issue with alerting in Prometheus is that you can only silence warnings for a certain amount of time, then you get a notification again. The kthxbye bot works around that issue.
Alert routing details
Once Prometheus has created an alert, it sends it to one or more instances of Alertmanager. This one in turn is responsible for routing the alert to the right communication channel.
That is, provided Alertmanager is correctly configured in the alerting section of prometheus.yml, see the Installation section.
Alert routes are set as a hierarchical tree in which the first route that matches gets to handle the alert. The first-matching route may decide to ask Alertmanager to continue processing with other routes so that the same alert can match multiple routes. This is how TPA receives emails for critical alerts and also IRC notifications for both warning and critical.
Each route needs to have one or more receivers set.
Receivers and routes are defined in Hiera, in hiera/common/prometheus.yaml.
Receivers
Receivers are set in the key prometheus::alertmanager::receivers and look like
this:
- name: 'TPA-email'
email_configs:
- to: 'recipient@example.com'
require_tls: false
text: '{{ template "email.custom.txt" . }}'
headers:
subject: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " -- " }}'
Here we've configured an email recipient. Alertmanager can send alerts with a
bunch of other communications channels. For example to send IRC notifications,
we have a daemon binding to localhost on the Prometheus server waiting for
web hook calls, and the corresponding receiver has a section webhook_configs
instead of email_configs.
Routes
Alert routes are set in the key prometheus::alertmanager::route in Hiera. The default route, the one set at the top level of that key, uses the fallback receiver and sets some default options for the other routes.
The default route should not be explicitly used by alerts. We always want to explicitly match on a set of labels to send alerts to the correct destination. Thus, the default recipient uses a different message template that explicitly says there is a configuration error. This way we can more easily catch what's been wrongly configured.
The default route has a key routes. This is where additional routes are set.
A route needs to set a recipient and then can match on certain label values,
using the matchers list. Here's an example for the TPA IRC route:
- receiver: 'irc-tor-admin'
matchers:
- 'team = "TPA"'
- 'severity =~ "critical|warning"'
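To check which receiver a given label set would hit, amtool can evaluate the routing tree; this is a sketch, assuming the Alertmanager configuration lives at the usual Debian path:
amtool config routes test --config.file=/etc/prometheus/alertmanager.yml team=TPA severity=warning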
Pushgateway
The Pushgateway is a separate server from the main Prometheus server, designed to "hold" onto metrics for ephemeral jobs that would otherwise not be around long enough for Prometheus to scrape their metrics. We use it as a workaround to bridge Metrics data with Prometheus/Grafana.
Configuration
The Prometheus server is currently configured mostly through Puppet, where modules define exporters and "export resources" that get collected on the central server, which then scrapes those targets.
The [prometheus-alerts.git repository][] contains all alerts and
some non-TPA targets, specified in the targets.d directory for all
teams.
Services
Prometheus is made of multiple components:
- Prometheus: a daemon with an HTTP API that scrapes exporters and targets for metrics, evaluates alerting rules and sends alerts to the Alertmanager
- Alertmanager: another daemon with HTTP APIs that receives alerts from one or more Prometheus daemons, gossips with other Alertmanagers to deduplicate alerts, and sends notifications to receivers
- Exporters: HTTP endpoints that expose Prometheus metrics, scraped by Prometheus
- Node exporter: a specific exporter to expose system-level metrics like memory, CPU, disk usage and so on
- Text file collector: a directory read by the node exporter where other tools can drop metrics
So almost everything happens over HTTP or HTTPS.
Many services expose their metrics by running cron jobs or systemd timers that write to the node exporter text file collector.
Monitored services
Those are the actual services monitored by Prometheus.
Internal server (prometheus1)
The "internal" server scrapes all hosts managed by Puppet for
TPA. Puppet installs a [node_exporter][] on all servers, which
takes care of metrics like CPU, memory, disk usage, time accuracy, and
so on. Then other exporters might be enabled on specific services,
like email or web servers.
Access to the internal server is fairly public: the metrics there are not considered to be security sensitive and are protected by authentication only to keep bots away.
External server (prometheus2)
The "external" server, on the other hand, is more restrictive and does not allow public access. This is out of concern that specific metrics might lead to timing attacks against the network and/or leak sensitive information. The external server also explicitly does not scrape TPA servers automatically: it only scrapes certain services that are manually configured by TPA.
Those are the services currently monitored by the external server:
- [bridgestrap][]
- [rdsys][]
- OnionPerf external nodes' node_exporter
- Connectivity test on (some?) bridges (using the [blackbox_exporter][])
Note that this list might become out of sync with the actual
implementation, look into Puppet in
profile::prometheus::server::external for the actual deployment.
This separate server was actually provisioned for the anti-censorship team (see this comment for background). The server was setup in July 2019 following #31159.
Other possible services to monitor
Many more exporters could be configured. A non-exhaustive list was built in ticket #30028 around launch time. Here we can document more such exporters we find along the way:
- Prometheus Onion Service Exporter - "Export the status and latency of an onion service"
- [hsprober][] - similar, but also with histogram buckets, multiple attempts, warm-up and error counts
- [haproxy_exporter][]
There's also a list of third-party exporters in the Prometheus documentation.
Storage
Prometheus stores data in its own custom "time-series database" (TSDB).
Metrics are held for about a year or less, depending on the server. Look at this dashboard for current disk usage of the Prometheus servers.
The actual disk usage depends on:
- N: the number of exporters
- X: the number of metrics they expose
- 1.3 bytes: the size of a sample
- P: the retention period (currently 1 year)
- I: the scrape interval (currently one minute)
The formula to compute disk usage is this:
N x X x 1.3 bytes x P / I
For example, in ticket 29388, we computed that a simple node exporter setup with 2500 metrics per node and 80 nodes would end up with about 127GiB of disk usage:
> 1.3byte/minute * year * 2500 * 80 to Gibyte
(1,3 * (byte / minute)) * year * 2500 * 80 = approx. 127,35799 gibibytes
Back then, we configured Prometheus to keep only 30 days of samples, but that proved to be insufficient for many cases, so it was raised to one year in 2020, in issue 31244.
In the retention section of TPA-RFC-33, there is a detailed discussion on retention periods. We're considering multi-year retention periods for the future.
Queues
There are a couple of places where things happen automatically on a schedule in the monitoring infrastructure:
- Prometheus schedules scrape jobs (pulling metrics) according to rules that can differ for each scrape job. Each job can define its own scrape_interval. The default is to scrape every 15 seconds, but some jobs are currently configured to scrape once every minute.
- Each alerting rule can define its own evaluation interval and delay before triggering. See Adding alerts.
- Prometheus can automatically discover scrape targets through different means. We currently don't fully use the auto-discovery feature since we create targets through files created by Puppet, so any interval for this feature does not affect our setup.
Interfaces
This system has multiple interfaces. Let's take them one by one.
Trending: Grafana
Long term trends are visible in the Grafana dashboards, which taps into the Prometheus API to show graphs for history. Documentation on that is in the Grafana wiki page.
Alerting: Karma
The main alerting dashboard is the Karma dashboard, which shows the currently firing alerts, and allows users to silence alerts.
Technically, alerts are generated by the Prometheus server and relayed through the Alertmanager server, then Karma taps into the Alertmanager API to show those alerts. Karma provides those features:
- Silencing alerts
- Showing alert inhibitions
- Aggregate alerts from multiple alert managers
- Alert groups
- Alert history
- Dead man's switch (an alert always firing that signals an error when it stops firing)
Notifications: Alertmanager
We aggressively restrict the kind and number of alerts that will actually send notifications. This was done mainly by creating two different alerting levels ("warning" and "critical", above), and drastically limiting the number of critical alerts.
The basic idea is that the dashboard (Karma) has "everything": alerts at both the "warning" and "critical" levels show up there, and it's expected that it is "noisy". Operators are expected to look at the dashboard while on rotation for tasks to do. A typical example is pending reboots, but anomalies like high load on a server or a partition to expand in a few weeks are also expected.
All notifications are also sent over the IRC channel (#tor-alerts on
OFTC) and logged through the tpa_http_post_dump.service. It is
expected that operators look at their emails or the IRC channels
regularly and will act upon those notifications promptly.
IRC notifications are handled by the [alertmanager-irc-relay][].
Command-line
Prometheus has a [promtool][] that allows you to query the server
from the command-line, but there's also a HTTP API that we can
use with curl. For example, this shows the hosts with pending
upgrades:
curl -sSL --data-urlencode query='apt_upgrades_pending>0' \
    "https://$HTTP_USER@prometheus.torproject.org/api/v1/query" \
    | jq -r .data.result[].metric.alias \
    | grep -v '^null$' | paste -sd,
The output can be passed to a tool like Cumin, for example. This
is actually used in the fleet.pending-upgrades task to show an
inventory of the pending upgrades across the fleet.
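For quick checks from the Prometheus server itself, promtool can also run instant queries directly against the local API; a sketch, with an illustrative expression:
promtool query instant http://localhost:9090 'up == 0'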
Alertmanager also ships with amtool, which can be used to inspect alerts and issue silences. It's used in our test suite.
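For example, something like this should add a two-hour silence for a specific alert (the matchers and comment are illustrative, and the URL assumes Alertmanager listens on its default port):
amtool silence add --alertmanager.url=http://localhost:9093 \
    --comment='host maintenance' --duration=2h \
    alertname=JobDown alias=gitlab-02.torproject.org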
Authentication
Web-based authentication is shared with Grafana, see the Grafana authentication documentation.
Polling from the Prometheus servers to the exporters on servers is permitted by IP address specifically just for the Prometheus server IPs. Some more sensitive exporters require a secret token to access their metrics.
Implementation
Prometheus and Alertmanager are coded in Go and released under the Apache 2.0 license. We use the versions provided by the Debian package archive for the current stable release.
Related services
By design, no other service is required. Emails get sent out for some notifications and that might depend on Tor email servers, depending on which addresses receive the notifications.
Issues
There is no issue tracker specifically for this project. File or search for issues in the team issue tracker with the ~Prometheus label.
Known issues
Those are major issues that are worth knowing about Prometheus in general, and our setup in particular:
- Bind mounts generate duplicate metrics, upstream issue: Way to
distinguish bind mounted path?, possible workaround: manually
specify known bind mount points
(e.g.
node_filesystem_avail_bytes{instance=~"$instance:.*",fstype!='tmpfs',fstype!='shm',mountpoint!~"/home|/var/lib/postgresql"}), but that can hide actual, real mount points, possible fix: the node_filesystem_mount_info metric, added in PR 2970 from 2024-07-14, unreleased as of 2024-08-28
- No long-term metrics storage, issue: multi-year metrics storage
- The web user interface is really limited, and is actually deprecated, with the new React-based one not (yet?) packaged, alternatives (like Grafana) are also bloated Golang/Javascript projects
- Alertmanager doesn't send notifications when silenced alerts are resolved (PR pending since 2022)
- Alertmanager doesn't send notifications when silences are posted
- Prometheus uses keep-alive HTTP requests to probe targets. This means that DNS changes might take longer to take effect than expected. In particular, some servers (e.g. Nginx) allow a lot of keep-alive requests (e.g. 1000), which means Prometheus will take a long time to switch to the new host (e.g. 16 hours).
A workaround is to shut down the previous host to force Prometheus to check the new one during a rotation, or to reduce the number of keep-alive requests allowed on the server (keepalive_requests on Nginx, MaxKeepAliveRequests on Apache)
See 41902 for further information.
In general, the service is still being launched, see TPA-RFC-33 for the full deployment plan.
Resolved issues
No major issue resolved so far is worth mentioning here.
Maintainers
The Prometheus services have been setup and are managed by anarcat inside TPA.
Users
The internal Prometheus server is mostly used by TPA staff to diagnose issues. The external Prometheus server is used by various TPO teams for their own monitoring needs.
Upstream
The upstream Prometheus projects are diverse and generally active as of early 2021. Since Prometheus is used as an ad-hoc standard in the new "cloud native" communities like Kubernetes, it has seen an upsurge of development and interest from various developers, and companies. The future of Prometheus should therefore be fairly bright.
The individual exporters, however, can be hit and miss. Some exporters are "code dumps" from companies and not very well maintained. For example, Digital Ocean dumped the bind_exporter on GitHub, but it was salvaged by the Prometheus community.
Another important layer is the large amount of Puppet code that is
used to deploy Prometheus and its components. This is all part of a
big Puppet module, [puppet-prometheus][], managed by the Voxpupuli
collective. Our integration with the module is not yet complete: we have a lot of glue code on top of it to make it work correctly with Debian packages. Much of that work has been done by anarcat, but some remains; see upstream issue 32 for details.
Monitoring and metrics
Prometheus is, of course, all about monitoring and metrics. It is the thing that monitors everything and keeps metrics over the long term.
The server monitors itself for system-level metrics but also application-specific metrics. There's a long-term plan for high-availability in TPA-RFC-33-C.
See also storage for retention policies.
Tests
The prometheus-alerts.git repository has tests that run in GitLab
CI, see the Testing alerts section on how to write those.
When doing major upgrades, the Karma dashboard should be visited to make sure it works correctly.
There is a test suite in the upstream Prometheus Puppet module as well, but it's not part of our CI.
Logs
Prometheus servers typically do not generate many logs, except when errors and warnings occur. They should hold very little PII. The web frontends collect logs in accordance with our regular policy.
Actual metrics may contain PII, although it's quite unlikely: typically, data is anonymized and aggregated at collection time. It would still be possible to deduce some activity patterns from the metrics collected by Prometheus and use them in side-channel attacks, which is why access to the external Prometheus server is restricted.
Alerts themselves are retained in the systemd journal, see Checking alert history.
Backups
Prometheus servers should be fully configured through Puppet and
require little backups. The metrics themselves are kept in
/var/lib/prometheus2 and should be backed up along with our regular
backup procedures.
WAL (write-ahead log) files are ignored by the backups, which can lead to an extra 2-3 hours of data loss since the last backup in the case of a total failure, see #41627 for the discussion. This should eventually be mitigated by a high availability setup (#41643).
Other documentation
- Prometheus home page
- Prometheus documentation
- Prometheus developer blog
- Awesome Prometheus list
- Blue book - interesting guide
- Robust perception consulting has a series of blog posts on Prometheus
Discussion
Overview
The Prometheus and Grafana services were setup after anarcat realized that there was no "trending" service setup inside TPA after Munin had died (ticket 29681). The "node exporter" was deployed on all TPA hosts in mid-march 2019 (ticket 29683) and remaining traces of Munin were removed in early April 2019 (ticket 29682).
Resource requirements were researched in ticket 29388 and it was originally planned to retain 15 days of metrics. This was expanded to one year in November 2019 (ticket 31244) with the hope this could eventually be expanded further with a down-sampling server in the future.
Eventually, a second Prometheus/Grafana server was setup to monitor external resources (ticket 31159) because there were concerns about mixing internal and external monitoring on TPA's side. There were also concerns on the metrics team about exposing those metrics publicly.
It was originally thought Prometheus could completely replace Nagios as well (issue 29864), but this turned out to be more difficult than planned.
The main difficulty is that Nagios checks come with builtin threshold of acceptable performance. But Prometheus metrics are just that: metrics, without thresholds... This made it more difficult to replace Nagios because a ton of alerts had to be rewritten to replace the existing ones.
This was performed in TPA-RFC-33, over the course of 2024 and 2025.
Security and risk assessment
There has been no security review yet.
The shared password for accessing the web interface is a challenge. We intend to replace this soon with individual users.
No risk assessment has been done yet.
Technical debt and next steps
In progress projects:
- merging external and internal monitoring servers
- reimplementing some of the alerts that were in icinga
Proposed Solutions
TPA-RFC-33
TPA's monitoring infrastructure has been originally setup with Nagios and Munin. Nagios was eventually removed from Debian in 2016 and replaced with Icinga 1. Munin somehow "died in a fire" some time before anarcat joined TPA in 2019.
At that point, the lack of trending infrastructure was seen as a serious problem, so Prometheus and Grafana were deployed in 2019 as a stopgap measure.
A secondary Prometheus server (prometheus2) was setup with stronger
authentication for service admins. The rationale was that those
services were more privacy-sensitive and the primary TPA setup
(prometheus1) was too open to the public, which could allow for
side-channel attacks.
Those tools have been used for trending ever since, while keeping Icinga for monitoring.
During the March 2021 hack week, Prometheus' Alertmanager was deployed on the secondary Prometheus server to provide alerting to the Metrics and Anti-Censorship teams.
Munin replacement
The primary Prometheus server was decided on at the Brussels 2019 developer meeting, before anarcat joined the team (ticket 29389). The secondary Prometheus server was approved in meeting/2019-04-08. Storage expansion was approved in meeting/2019-11-25.
Other alternatives
We considered retaining Nagios/Icinga as an alerting system, separate from Prometheus, but ultimately decided against it in TPA-RFC-33.
Alerting rules in Puppet
Alerting rules are currently stored in an external
[prometheus-alerts.git repository][] that holds not only TPA's
alerts, but also those of other teams. So the rules
are not directly managed by puppet -- although puppet will ensure
that the repository is checked out with the most recent commit on the
Prometheus servers.
The rationale is that rule definitions should appear only once and we already had the above-mentioned repository that could be used to configure alerting rules.
We were concerned we would potentially have multiple sources of truth for alerting rules. We already have that for scrape targets, but that doesn't seem to be an issue. It did feel, however, critical for the more important alerting rules to have a single source of truth.
PuppetDB integration
Prometheus 2.31 and later added support for PuppetDB service
discovery, through the puppetdb_sd_config parameter. The
sample configuration file shows a bit what's possible.
This approach was considered during the bookworm upgrade but ultimately rejected because it introduces a dependency on PuppetDB, which becomes a possible single point of failure for the monitoring system.
We also have a lot of code in Puppet to handle the exported resources necessary for this, and it would take a lot of work to convert over.
Mobile notifications
Like others, we do not intend to have an on-call rotation yet, and will not ring people on their mobile devices at first. After all exporters have been deployed (priority "C", "nice to have") and alerts are properly configured, we will evaluate the number of notifications that get sent out. If levels are acceptable (say, once a month or so), we might implement push notifications during business hours to consenting staff.
We have been advised to avoid Signal notifications as that setup is often brittle, with signal.org frequently changing their API, leading to silent failures. We might implement alerts over Matrix
depending on what messaging platform gets standardized in the Tor
project.
Migrating from Munin
Here's a quick cheat sheet from people used to Munin and switching to Prometheus:
| What | Munin | Prometheus |
|---|---|---|
| Scraper | munin-update | Prometheus |
| Agent | munin-node | Prometheus, node-exporter and others |
| Graphing | munin-graph | Prometheus or Grafana |
| Alerting | munin-limits | Prometheus, Alertmanager |
| Network port | 4949 | 9100 and others |
| Protocol | TCP, text-based | HTTP, text-based |
| Storage format | RRD | Custom time series database |
| Down-sampling | Yes | No |
| Default interval | 5 minutes | 15 seconds |
| Authentication | No | No |
| Federation | No | Yes (can fetch from other servers) |
| High availability | No | Yes (alert-manager gossip protocol) |
Basically, Prometheus is similar to Munin in many ways:
- It "pulls" metrics from the nodes, although it does so over HTTP (to http://host:9100/metrics) instead of a custom TCP protocol like Munin
- The agent running on the nodes is called `prometheus-node-exporter` instead of `munin-node`. It scrapes only a set of built-in parameters like CPU, disk space and so on; different exporters are necessary for different applications (like `prometheus-apache-exporter`) and any application can easily implement an exporter by exposing a Prometheus-compatible `/metrics` endpoint
- Like Munin, the node exporter doesn't have any form of authentication built in. We rely on IP-level firewalls to avoid leakage
- The central server is simply called `prometheus` and runs as a daemon that wakes up on its own, instead of `munin-update`, which is called from `munin-cron` and, before that, `cron`
- Graphs are generated on the fly through the crude Prometheus web interface or by frontends like Grafana, instead of being constantly regenerated by `munin-graph`
- Samples are stored in a custom "time series database" (TSDB) in Prometheus instead of the (ad hoc) RRD standard
- Prometheus performs no down-sampling like RRD does; it relies on smart compression to spare disk space, but still uses more than Munin
- Prometheus scrapes samples much more aggressively than Munin by default, but that interval is configurable; a minimal scrape configuration is sketched after this list
- Prometheus can scale horizontally (by sharding different services to different servers) and vertically (by aggregating different servers into a central one with a different sampling frequency) natively; `munin-update` and `munin-graph` can only run on a single (and the same) server
- Prometheus can act as a high availability alerting system thanks to its `alertmanager`, which can run multiple copies in parallel without sending duplicate alerts; `munin-limits` can only run on a single server
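To illustrate the "pull" model, here is a minimal sketch of a static scrape job for the node exporter; the host names are placeholders, and our actual target lists are generated by Puppet rather than written by hand:

```yaml
global:
  scrape_interval: 15s  # aggressive polling, compared to Munin's 5 minutes
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          # placeholder hosts; Prometheus pulls http://<target>/metrics from each
          - "host1.example.org:9100"
          - "host2.example.org:9100"
```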
Migrating from Nagios/Icinga
Near the end of 2024, Icinga was replaced by Prometheus and Alertmanager, as part of TPA-RFC-33.
The project was split into three phases from A to C.
Before Icinga was retired, we performed an audit of the notifications sent from Icinga about our services (#41791) to see if we were missing coverage of anything critical.
Overall, phase A covered most of the critical alerts we were worried about, but it also left out some key components which are not currently covered by monitoring.
In phase B we implemented more alerts, integrated more metrics that were necessary for some of the new alerts, and did a lot of work to ensure we wouldn't get double alerts for the same problem. Merging the external monitoring server is also planned for this phase.
Phase C concerns setting up high availability between two Prometheus servers, each with its own Alertmanager instance, and finalizing the implementation of alerts.
Prometheus equivalence for Icinga/Nagios checks
This table maps Nagios checks to their equivalent Prometheus metrics, for the checks that were explicitly converted into Prometheus alerts and metrics as part of phase A.
| Name | Command | Metric | Severity | Note |
|---|---|---|---|---|
| `disk usage - *` | `check_disk` | `node_filesystem_avail_bytes` | `warning` / `critical` | Critical when less than 24h to full, see the sketch after this table |
| `network service - nrpe` | `check_tcp!5666` | `up` | `warning` | |
| `raid - DRBD` | `dsa-check-drbd` | `node_drbd_out_of_sync_bytes`, `node_drbd_connected` | `warning` | |
| `raid - sw raid` | `dsa-check-raid-sw` | `node_md_disks` / `node_md_state` | `warning` | Not warning about array synchronization |
| `apt - security updates` | `dsa-check-statusfile` | `apt_upgrades_*` | `warning` | Incomplete |
| `needrestart` | `needrestart -p` | `kernel_status`, `microcode_status` | `warning` | Required patching upstream |
| `network service - sshd` | `check_ssh --timeout=40` | `probe_success` | `warning` | Sanity check, overlaps with systemd check, but better be safe |
| `network service - smtp` | `check_smtp` | `probe_success` | `warning` | Incomplete, needs end-to-end deliverability checks, scheduled for phase B |
| `network service - submission` | `check_smtp_port!587` | `probe_success` | `warning` | |
| `network service - smtps` | `dsa_check_cert!465` | `probe_success` | `warning` | |
| `network service - http` | `check_http` | `probe_http_duration_seconds` | `warning` | See also #40568 for phase B |
| `network service - https` | `check_https` | Idem | `warning` | Idem, see also #41731 for exhaustive coverage of HTTPS sites |
| `https cert and smtps` | `dsa_check_cert` | `probe_ssl_earliest_cert_expiry` | `warning` | Checks for cert expiry for all sites, this is about "renewal failed" |
| `backup - bacula - *` | `dsa-check-bacula` | `bacula_job_last_good_backup` | `warning` | Based on WMF's [check_bacula.py][] |
| `redis liveness` | Custom command | `probe_success` | `warning` | Checks that the Redis tunnel works |
| `postgresql backups` | `dsa-check-backuppg` | `tpa_backuppg_last_check_timestamp_seconds` | `warning` | Built on top of the NRPE check for now, see TPA-RFC-65 for the long term |
Actual alerting rules can be found in the [prometheus-alerts.git
repository][].
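As an illustration of the "less than 24h to full" logic in the disk usage row above, a disk space prediction rule can be written with `predict_linear()`; this is only a sketch, the actual alert names and expressions live in the [prometheus-alerts.git repository][]:

```yaml
groups:
  - name: disk
    rules:
      - alert: FilesystemPredictedFull
        # extrapolate the last 6h of free-space samples 24h into the future
        expr: 'predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[6h], 24 * 3600) < 0'
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.mountpoint }} on {{ $labels.instance }} predicted to fill within 24h"
```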
High priority missing checks, phase B
Those checks are all scheduled for phase B and are considered high priority; at the very least, specific due dates have been set in issues to make sure we don't miss (for example) the next certificate expiry dates.
| Name | Command | Metric | Severity | Note |
|---|---|---|---|---|
| `DNS - DS expiry` | `dsa-check-statusfile` | TBD | `warning` | Drop DNSSEC? See #41795 |
| `Ganeti - cluster` | `check_ganeti_cluster` | [ganeti-exporter][] | `warning` | Runs a full verify, costly, was already disabled |
| `Ganeti - disks` | `check_ganeti_instances` | Idem | `warning` | Was timing out and already disabled |
| `Ganeti - instances` | `check_ganeti_instances` | Idem | `warning` | Currently noisy: warns about retired hosts waiting for destruction, drop? |
| `SSL cert - LE` | `dsa-check-cert-expire-dir` | TBD | `warning` | Exhaustively check all certs, see #41731, possibly with critical severity for actual prolonged down times |
| `SSL cert - db.torproject.org` | `dsa-check-cert-expire` | TBD | `warning` | Checks local CA for expiry, on disk, /etc/ssl/certs/thishost.pem and db.torproject.org.pem on each host, see #41732 |
| `puppet - * catalog run(s)` | `check_puppetdb_nodes` | [puppet-exporter][] | `warning` | |
| `system - all services running` | `systemctl is-system-running` | `node_systemd_unit_state` | `warning` | Sanity check, checks for failing timers and services, see the sketch after this table |
Those checks are covered by the priority "B" ticket (#41639), unless otherwise noted.
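For the last check in the table above, a systemd failure alert built on `node_systemd_unit_state` could look roughly like this sketch (not necessarily the rule we end up with, and it assumes the node exporter's systemd collector is enabled):

```yaml
groups:
  - name: systemd
    rules:
      - alert: SystemdUnitFailed
        # any unit (service or timer) that systemd reports as "failed"
        expr: 'node_systemd_unit_state{state="failed"} > 0'
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "unit {{ $labels.name }} failed on {{ $labels.instance }}"
```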
Low priority missing checks, phase B
Unless otherwise mentioned, most of those checks are noisy and generally do not indicate an actual failure, so they were not considered priorities at all.
| Name | Command | Metric | Severity | Note |
|---|---|---|---|---|
| `DNS - delegation and signature expiry` | `dsa-check-zone-rrsig-expiration-many` | [dnssec-exporter][] | `warning` | |
| `DNS - key coverage` | `dsa-check-statusfile` | TBD | `warning` | |
| `DNS - security delegations` | `dsa-check-dnssec-delegation` | TBD | `warning` | |
| `DNS - zones signed properly` | `dsa-check-zone-signature-all` | TBD | `warning` | |
| `DNS SOA sync - *` | `dsa_check_soas_add` | TBD | `warning` | Never actually failed |
| `PING` | `check_ping` | `probe_success` | `warning` | |
| `load` | `check_load` | `node_pressure_cpu_waiting_seconds_total` | `warning` | Sanity check, replace with the better pressure counters |
| `mirror (static) sync - *` | `dsa_check_staticsync` | TBD | `warning` | Never actually failed |
| `network service - ntp peer` | `check_ntp_peer` | `node_ntp_offset_seconds` | `warning` | |
| `network service - ntp time` | `check_ntp_time` | TBD | `warning` | Unclear how that differs from `check_ntp_peer` |
| `setup - ud-ldap freshness` | `dsa-check-udldap-freshness` | TBD | `warning` | |
| `swap usage - *` | `check_swap` | `node_memory_SwapFree_bytes` | `warning` | |
| `system - filesystem check` | `dsa-check-filesystems` | TBD | `warning` | |
| `unbound trust anchors` | `dsa-check-unbound-anchors` | TBD | `warning` | |
| `uptime check` | `dsa-check-uptime` | `node_boot_time_seconds` | `warning` | |
Those are also covered by the priority "B" ticket (#41639), unless otherwise noted. In particular, all DNS issues are covered by issue #41794.
Retired checks
| Name | Command | Rationale |
|---|---|---|
| `users` | `check_users` | Who has logged-in users?? |
| `processes - zombies` | `check_procs -s Z` | Useless |
| `processes - total` | `check_procs 620 700` | Too noisy, needed exclusions for builders |
| `processes - *` | `check_procs $foo` | Better to check systemd |
| `unwanted processes - *` | `check_procs $foo` | Basically the opposite of the above, useless |
| `LE - chain` | Checks for flag file | See #40052 |
| `CPU - intel ucode` | `dsa-check-ucode-intel` | Overlaps with needrestart check |
| `unexpected sw raid` | Checks for `/proc/mdstat` | Needlessly noisy, just means an extra module is loaded, who cares |
| `unwanted network service - *` | `dsa_check_port_closed` | Needlessly noisy, if we really want this, use [lzr][] |
| `network - v6 gw` | `dsa-check-ipv6-default-gw` | Useless, see #41714 for analysis |
`check_procs`, in particular, was generating a lot of noise in Icinga: we were checking dozens of different processes, which would all explode at once when a host went down without Icinga noticing that the host itself was down.
Service admin checks
The following checks were not audited by TPA but checked by the respective team's service admins.
| Check | Team |
|---|---|
| `bridges.tpo web service` | Anti-censorship |
| "mail queue" | Anti-censorship |
| `tor_check_collector` | Network health |
| `tor-check-onionoo` | Network health |
Other Alertmanager receivers
Alerts are typically sent over email, but Alertmanager also has builtin support for other receivers like PagerDuty, Pushover, Slack, OpsGenie, VictorOps and WeChat.
There's also a generic web hook receiver which is typically used to send notifications to systems that are not supported natively (see the configuration sketch after the list below). Many other endpoints are implemented through that web hook, for example:
- Cachet
- Dingtalk
- Discord
- Google Chat
- IRC
- Matrix: [matrix-alertmanager][] (JavaScript) or knopfler (Python), see also #40216
- Mattermost
- Microsoft teams
- Phabricator
- Sachet supports many messaging systems (Twilio, Pushbullet, Telegram, Sipgate, etc)
- Sentry
- Signal (or Signald)
- Splunk
- SNMP
- Telegram: [nopp/alertmanager-webhook-telegram-python][] or [metalmatze/alertmanager-bot][]
- Zabbix: [alertmanager-zabbix-webhook][] or [zabbix-alertmanager][]
And that is only what was available at the time of writing; the [alertmanager-webhook][] and [alertmanager tags][] on GitHub might list more.
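On the Alertmanager side, wiring up such a web hook is done with a `webhook_configs` receiver; the receiver name and URL below are placeholders for illustration, not our configuration:

```yaml
route:
  receiver: irc-bot  # placeholder receiver name
receivers:
  - name: irc-bot
    webhook_configs:
      # a local bridge (for example an IRC relay bot) listens here and forwards alerts
      - url: http://localhost:8080/alerts
        send_resolved: true
```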
The Alertmanager web interface is not shipped with the Debian package, because it depends on the Elm compiler, which is not in Debian. It can be built by hand using the `debian/generate-ui.sh` script, but only in newer, post-buster versions. Another alternative to consider is Crochet.