title: TPA-RFC-68: Idle canary servers
costs: marginal
approval: TPA
affected users: TPA
deadline: 2024-09-19
status: standard
discussion: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41750


[[TOC]]

Summary: provision test servers that sit idle to monitor infrastructure and stage deployments.

Background

In various recent incidents, it became apparent that we don't have a good place to test deployments or observe "normal" behavior on servers.

Examples:

  • While deploying the needrestart package (tpo/tpa/team#41633), we had to deploy on perdulce (AKA people.tpo) and test there. This had no negative impact.

  • While testing a workaround to mini-nag's deprecation (tpo/tpa/team#41734), perdulce was used again, but an operator error destroyed /dev/null, and the operator failed to recreate it. Impact was minor: some errors during a nightly job, which a reboot promptly fixed.

  • While diagnosing a network outage (e.g. tpo/tpa/team#41740), it can be hard to tell whether issues are related to a server's exotic configuration or to our baseline (in that case, single-stack IPv4 vs IPv6).

  • While diagnosing performance issues in Ganeti clusters, we can sometimes suffer from the "noisy neighbor" syndrome, where another VM in the cluster "pollutes" the server and degrades performance.

  • Rescue boxes were set up with insufficient disk space, because we actually have no idea what our minimum space requirements are (tpo/tpa/team#41666).

We previously had an ipv6only.torproject.org server, which was retired in TPA-RFC-23 (tpo/tpa/team#40727) because it was undocumented and was blocking deployment. It also didn't seem to be under any sort of configuration management.

Proposal

Create a pair of "idle canary servers", one per cluster, named idle-fsn-01 and idle-dal-02.

Optionally deploy an idle-dal-ipv6only-03 and idle-dal-ipv4only-04 pair to test single-stack configurations for eventual dual-stack monitoring (tpo/tpa/team#41714).
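
If that pair is deployed, it could also double as a continuous check that each address family keeps working on its own. A minimal sketch, assuming blackbox exporter ICMP probes are pointed at both hosts (the probe module and instance label values here are hypothetical):

# fire when a single-stack canary becomes unreachable over its only
# address family (probe module and instance labels are hypothetical)
probe_success{module="icmp", instance=~"idle-dal-ipv(4|6)only-0[34].*"} == 0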

Server specifications and usage

  • zero configuration in Puppet, unless specifically required for the role (e.g. an IPv4-only or IPv6-only stack might be an acceptable configuration)
  • some test deployments are allowed, but should be reverted as cleanly as possible; on total failure, the host should be reinstalled from scratch instead of being left to drift into unmanaged chaos
  • files in /home and /tmp cleared out automatically on a weekly basis, with the motd clearly stating that fact

Hardware configuration

component    current minimum   proposed spec   note
CPU count    1                 1
RAM          960MiB            512MiB          covers 25% of current servers
Swap         50MiB             100MiB          covers 90% of current servers
Total Disk   10GiB             ~5.6GiB
/            3GiB              5GiB            current median used size
/boot        270MiB            512MiB          /boot often filling up on dal-rescue hosts
/boot/efi    124MiB            N/A             no EFI support in Ganeti clusters
/home        10GiB             N/A             /home on root filesystem
/srv         10GiB             N/A             same

Goals

  • identify "noisy neighbors" in each Ganeti cluster (see the query sketch after this list)
  • keep a long-term "minimum requirements" specification for servers, continuously validated throughout upgrades
  • provide an impact-free testing ground for upgrades, test deployments and environments
  • trace long-term usage trends, for example electric power usage (tpo/tpa/team#40163) or the basic CPU usage cycles of recurring jobs like unattended upgrades (tpo/tpa/team#40934)
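
For the "noisy neighbor" goal, CPU steal time is one plausible signal: since the canaries do nothing by design, any sustained steal time points at another VM hogging the physical CPU. A minimal sketch, assuming the canaries run our standard node exporter under the proposed names; the 5% threshold is an arbitrary starting point:

# sustained steal time on an otherwise idle canary betrays a noisy
# neighbor on the same Ganeti node (threshold is a starting point)
rate(node_cpu_seconds_total{mode="steal", instance=~"idle-(fsn|dal)-.*"}[1h]) > 0.05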

Timeline

No fixed timeline. Those servers can be deployed in our precious free time, but it would be nice to actually have them deployed eventually. No rush.

Appendix

Some observations on current usage:

Memory usage

Sample query (25th percentile):

quantile(0.25, node_memory_MemTotal_bytes -
  node_memory_MemFree_bytes - (node_memory_Cached_bytes +
  node_memory_Buffers_bytes))
≈ 486 MiB

  • the minimum is currently carinatum, at 228MiB; perdulce and ssh-dal are closer to 300MiB
  • a quarter of servers use less than 512MiB of RAM; the median is 1GiB and the 90th percentile 17GB
  • the largest memory use is dal-node-01, at 310GiB (out of 504GiB, 61.5%)
  • the largest used ratio is colchicifolium at 94.2%, followed by gitlab-02 at 68%
  • the largest memory size is ci-runner-x86-03 at 1.48TiB, followed by the dal-node cluster at 504GiB each; the median is 8GiB and the 90th percentile 74GB
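
The used-ratio figures above can be reproduced with a query along these lines, reusing the same definition of used memory as the sample query:

# fraction of RAM used per host, largest first
sort_desc(
  (node_memory_MemTotal_bytes - node_memory_MemFree_bytes
    - (node_memory_Cached_bytes + node_memory_Buffers_bytes))
  / node_memory_MemTotal_bytes
)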

Swap usage

Sample query (median used swap):

quantile(0.5, node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes)
= 0 bytes

  • Median swap usage is zero; in other words, 50% of servers do not touch swap at all
  • the median swap size is 2GiB
  • some servers have large swap spaces (tb-build-02 and -03 have 300GiB, -06 has 100GiB, and gnt-fsn nodes have 64GiB)

Percentile   Usage    Size
50%          0        2GiB
75%          16MiB    4GiB
90%          100MiB   N/A
95%          400MiB   N/A
99%          1.2GiB   N/A
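
The usage column above follows from varying the quantile in the sample query, for example:

# 90th percentile of used swap (the "Usage" column)
quantile(0.90, node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes)

# 75th percentile of provisioned swap (the "Size" column)
quantile(0.75, node_memory_SwapTotal_bytes)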

Disk usage

Sample query (median root partition used space):

quantile(0.5,
  sum(node_filesystem_size_bytes{mountpoint="/"}) by (alias, mountpoint)
  - sum(node_filesystem_avail_bytes{mountpoint="/"}) by (alias,mountpoint)
)
≈ 5GiB

  • 90% of servers fit within 10GiB of root filesystem space; the median usage is around 5GiB
  • median /boot usage is actually much lower than our specification, at 139.4MiB, but the problem is with edge cases: we know we're having trouble at the 2^8MiB (256MiB) boundary, so we're simply doubling that
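
The /boot numbers can be checked with the same pattern as the root filesystem query above, only with a different mountpoint:

quantile(0.5,
  sum(node_filesystem_size_bytes{mountpoint="/boot"}) by (alias, mountpoint)
  - sum(node_filesystem_avail_bytes{mountpoint="/boot"}) by (alias, mountpoint)
)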

CPU usage

Sample query (median percentage with one decimal):

quantile(0.5,
  round(
    sum(
      rate(node_cpu_seconds_total{mode!="idle"}[24h])
    ) by (instance)
    / count(node_cpu_seconds_total{mode="idle"}) by (instance) * 1000)
  /10
)
≈ 2.5%

Servers sorted by CPU usage in the last 7 days:

sort_desc(
  round(
    sum(
       rate(node_cpu_seconds_total{mode!="idle"}[7d])
    ) by (instance)
    / count(node_cpu_seconds_total{mode="idle"}) by (instance) * 1000)
  /10
)

  • Half of the servers used 2.5% of CPU time or less over the last 24h.
  • The median is, perhaps surprisingly, similar over the last 30 days.
  • metricsdb-01 used 76% of a CPU in the last 24h at the time of writing.
  • Over the last week, results vary more, with relay-01 using 45%, colchicifolium and check-01 40%, and metricsdb-01 33%...

Percentile      last 24h usage ratio
50th (median)   2.5%
90th            22%
95th            32%
99th            45%