title: TPA-RFC-65: PostgreSQL backups
costs: 3-4 weeks, +70EUR/mth optional
approval: TPA, optionally accounting for a new server
affected users: TPA
deadline: 2024-06-04
status: proposed
discussion: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40950


Summary: switch to barman for PostgreSQL backups; rebuild or resize bungei as needed to cover the metrics team's storage needs

[[TOC]]

Background

TPA currently uses a PostgreSQL backup system based on point-in-time recovery (PITR). This is really nice because it gives us a full, incremental backup history, along with easy "full" restores at periodic intervals.

Unfortunately, that system is built from a set of scripts used only by TPA and DSA, which are hard to use and to debug.

We want to consider other alternatives and make a plan for that migration. In tpo/tpa/team#41557, we have set up a new backup server in the secondary point of presence; we should use it to back up PostgreSQL servers from the first point of presence, so that we can more easily survive a total site failure as well.

In TPA-RFC-63: Storage server budget, we've already proposed using barman, but didn't mention geographic distribution or a migration plan.

The plan for that server was also to deal with the disk usage explosion from the network health team, which is causing the current storage server to run out of space (tpo/tpa/team#41372). However, we didn't realize the largest PostgreSQL server was in the same location as the new backup server, which means the new server might not actually solve the problem as far as databases are concerned. For this, we might need to replace our existing storage server (bungei), which is in any case getting past its retirement age: it was set up in March 2019, so it is 5 years old at the time of writing.

Proposal

Switch to barman as our new PostgreSQL backup system. Migrate all servers in the gnt-fsn cluster to the new system on the new backup server, then convert the remaining legacy backups on the old backup server.

If necessary, resize disks on the old backup server to make room for the metrics storage, or replace that aging server with a new rental server.
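
To give an idea of the shape of the new system, a per-server Barman configuration might look something like this. This is a sketch only: the hostnames, role names and retention policy below are assumptions to be settled during testing, not decisions.

```ini
; /etc/barman.d/meronense.conf -- hypothetical sketch, not a deployed config
[meronense]
description = "meronense PostgreSQL cluster"
; regular connection, used for metadata and base backups
conninfo = host=meronense.torproject.org user=barman dbname=postgres
; streaming replication connection, used for continuous WAL shipping (PITR)
streaming_conninfo = host=meronense.torproject.org user=streaming_barman
backup_method = postgres
streaming_archiver = on
slot_name = barman
; keep enough WAL and base backups to restore anywhere in the last two weeks
retention_policy = RECOVERY WINDOW OF 14 DAYS
```

The `backup_method = postgres` / `streaming_archiver` combination keeps the PITR property of the current system, while relying on stock PostgreSQL replication protocols instead of custom scripts.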

Goals

Must have

  • geographic redundancy: have database backups in a different provider and geographic location than their primary storage

  • solve space issues: we're constantly having issues with the storage server filling up, we need to solve this in the long term

Nice to have

  • well-established code base: use a more standard backup software not developed and maintained only by us and debian.org

Non-Goals

  • global backup policy review: we're not touching bacula or retention policies

  • high availability: we're not setting up extra database servers for high availability, this is only for backups

Migration plan

We're again pressed for time, so we need to come up with a procedure that gives us some room on the backup server while simultaneously minimizing the risk to backup integrity.

To do this, we're going to migrate a mix of small (at first) and large (more quickly than we'd like) database servers.

Phase I: alpha testing

Migrate the following backups from bungei to backup-storage-01:

  • [ ] weather-01 (12.7GiB)
  • [ ] rude (35.1GiB)
  • [ ] materculae (151.9GiB)
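
For each server, the cutover could then follow the standard Barman onboarding steps, roughly as below. This is a sketch of the usual workflow, to be refined during the testing phase:

```shell
# verify connections, replication slot and WAL archiving are all green
barman check weather-01

# force a WAL switch so there is a fresh WAL file to archive and verify
barman switch-wal --force --archive weather-01

# take the first base backup, waiting for the WAL needed to make it consistent
barman backup --wait weather-01

# confirm the backup is listed and usable
barman list-backup weather-01
```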

Phase II: beta testing

After a week, retire the above backups from bungei, then migrate the following servers:

  • [ ] gitlab-02 (34.9GiB)
  • [ ] polyanthum (20.3GiB)
  • [ ] meronense (505.1GiB)

Phase III: production

After another week, migrate the last backups from bungei:

  • [ ] bacula-director-01 (180.8GiB)

At this point, we should hopefully have enough room on the backup server to survive the holidays.

Phase IV: retire legacy, bungei replacement

At this point, the only backups using the legacy system are the ones from the gnt-dal cluster (4 servers). Rebuild those with the new service. Do not keep a copy of the legacy system on bungei (to save space, particularly for metricsdb-01) but possibly archive a copy of the legacy backups on backup-storage-01:

  • [ ] metricsdb-01 (1.6TiB)
  • [ ] puppetdb-01 (20.2GiB)
  • [ ] survey-01 (5.7GiB)
  • [ ] anonticket-01 (3.9GiB)
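
Archiving the legacy backups could be as simple as a one-shot copy before deletion. A sketch, where the paths are made up for illustration and need to be checked against the real layout:

```shell
# copy the legacy PITR backups for metricsdb-01 off bungei before removing them
rsync -aHAX --info=progress2 \
    bungei.torproject.org:/srv/backups/pg/metricsdb-01/ \
    /srv/backups/legacy-pg/metricsdb-01/
```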

If we still run out of disk space on bungei, consider replacing the server entirely. The server is now 5 years old, which is getting close to our current amortization time (6 years), and it's a rental server, so it's relatively easy to replace: we don't need to buy new hardware.

Alternatives considered

See the alternatives considered in our PostgreSQL documentation.

Costs

Staff estimates (3-4 weeks)

| Task                              | Time     | Complexity | Estimate | Days | Note                                      |
|-----------------------------------|----------|------------|----------|------|-------------------------------------------|
| pgbarman testing and manual setup | 3 days   | high       | 1 week   | 6    |                                           |
| pgbarman puppetization            | 3 days   | medium     | 1 week   | 4.5  |                                           |
| migrate 12 servers                | 3 days   | high       | 1 week   | 4.5  | assuming we can migrate 4 servers per day |
| legacy code cleanup               | 1 day    | low        | ~1 day   | 1.1  |                                           |
| Sub-total                         | 2 weeks  | ~medium    | 3 weeks  | 16.1 |                                           |
| bungei replacement                | 3 days   | low        | ~3 days  | 3.3  | optional                                  |
| bungei resizing                   | 1 day    | low        | ~1 day   | 1.1  | optional                                  |
| Total                             | ~3 weeks | ~medium    | ~4 weeks | 20.5 |                                           |
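
As a quick sanity check on the "Days" column (a throwaway calculation, not part of the plan):

```shell
# cross-check the sub-total and grand total of the day estimates above
awk 'BEGIN {
    s = 6 + 4.5 + 4.5 + 1.1   # testing + puppetization + migration + cleanup
    printf "sub-total: %.1f days, total: %.1f days\n", s, s + 3.3 + 1.1
}'
# sub-total: 16.1 days, total: 20.5 days
```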

Hosting costs (+70EUR/mth, optional)

bungei is an SX132 server, billed monthly at 175EUR. It has the following specifications:

  • Intel Xeon E5-1650 (12 Core, 3.5GHz)
  • RAM: 128GiB DDR4
  • Storage: 10x10TB SAS drives (100TB, HGST HUH721010AL)

A likely replacement would be the SX135 server, at 243EUR and a 94EUR setup fee:

  • AMD Ryzen 9 3900 (12 core, 3.1GHz)
  • RAM: 128GiB
  • Storage: 8x22TB SATA drives (176TB)

There's a cheaper server, the SX65 at 124EUR/mth, but it has less disk space (4x22TB, 88TB). That said, it might be enough if we do not need to grow bungei and simply need to retire it.

References

Appendix

Backups inventory

Here's the list of the PostgreSQL databases currently backed up on the storage server, and their locations:

| server             | location | size     | note                                |
|--------------------|----------|----------|-------------------------------------|
| anonticket-01      | gnt-dal  | 3.9GiB   |                                     |
| bacula-director-01 | gnt-fsn  | 180.8GiB |                                     |
| gitlab-02          | gnt-fsn  | 34.9GiB  | move to gnt-dal considered, #41431  |
| materculae         | gnt-fsn  | 151.9GiB |                                     |
| meronense          | gnt-fsn  | 505.1GiB |                                     |
| metricsdb-01       | gnt-dal  | 1.6TiB   | huge!                               |
| polyanthum         | gnt-fsn  | 20.3GiB  |                                     |
| puppetdb-01        | gnt-dal  | 20.2GiB  |                                     |
| rude               | gnt-fsn  | 35.1GiB  |                                     |
| survey-01          | gnt-dal  | 5.7GiB   |                                     |
| weather-01         | gnt-fsn  | 12.7GiB  |                                     |

gnt-fsn servers

Same, but only for the servers at Hetzner, sorted by size:

| server             | size     |
|--------------------|----------|
| meronense          | 505.1GiB |
| bacula-director-01 | 180.8GiB |
| materculae         | 151.9GiB |
| rude               | 35.1GiB  |
| gitlab-02          | 34.9GiB  |
| polyanthum         | 20.3GiB  |
| weather-01         | 12.7GiB  |
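
Summing the size column gives a rough figure for what phases I-III will move off bungei (a back-of-the-envelope sketch, sizes copied from the table above):

```shell
# add up the gnt-fsn backup sizes (GiB) from the table above
printf '%s\n' 505.1 180.8 151.9 35.1 34.9 20.3 12.7 |
    awk '{ sum += $1 } END { printf "total: %.1f GiB\n", sum }'
# total: 940.8 GiB
```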

gnt-dal

Same for Dallas:

| server        | size    |
|---------------|---------|
| metricsdb-01  | 1.6TiB  |
| puppetdb-01   | 20.2GiB |
| survey-01     | 5.7GiB  |
| anonticket-01 | 3.9GiB  |