title: TPA-RFC-65: PostgreSQL backups
costs: 3-4 weeks, +70EUR/mth optional
approval: TPA, optionally accounting for a new server
affected users: TPA
deadline: 2024-06-04
status: proposed
discussion: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40950


Summary: switch to barman for PostgreSQL backups; rebuild or resize bungei as needed to cover the metrics team's storage needs

[[TOC]]

Background

TPA currently uses a PostgreSQL backup system based on point-in-time recovery (PITR). This is really nice because it gives us a full, incremental backup history, along with easy "full" restores at periodic intervals.

Unfortunately, that system is built from a set of scripts used only by TPA and DSA, which are hard to use and to debug.

We want to consider other alternatives and make a plan for that migration. In tpo/tpa/team#41557, we have set up a new backup server in the secondary point of presence; we should use it to back up PostgreSQL servers from the first point of presence, so that we can more easily survive a total site failure as well.

In TPA-RFC-63: Storage server budget, we've already proposed using barman, but didn't mention geographic distribution or a migration plan.

The plan for that server was also to deal with the disk usage explosion from the network health team, which is causing the current storage server to run out of space (tpo/tpa/team#41372). However, we didn't realize the largest PostgreSQL server was in the same location as the new backup server, which means the new server might not actually solve the problem as far as databases are concerned. For this, we might need to replace our existing storage server (bungei), which is in any case getting past its retirement age: it was set up in March 2019, so it is 5 years old at the time of writing.

Proposal

Switch to barman as our new PostgreSQL backup system. Migrate all servers in the gnt-fsn cluster to the new system on the new backup server, then convert the remaining legacy backups on the old backup server.

If necessary, resize disks on the old backup server to make room for the metrics storage, or replace that aging server with a new rental server.
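
To give an idea of the shape of the new system, a per-server Barman configuration might look something like this. This is a sketch only: the hostnames, role names and retention policy below are assumptions to be settled during testing, not decisions.

```ini
; /etc/barman.d/meronense.conf -- hypothetical sketch, not a deployed config
[meronense]
description = "meronense PostgreSQL cluster"
; regular connection, used for metadata and base backups
conninfo = host=meronense.torproject.org user=barman dbname=postgres
; streaming replication connection, used for continuous WAL shipping (PITR)
streaming_conninfo = host=meronense.torproject.org user=streaming_barman
backup_method = postgres
streaming_archiver = on
slot_name = barman
; keep enough WAL and base backups to restore anywhere in the last two weeks
retention_policy = RECOVERY WINDOW OF 14 DAYS
```

The `backup_method = postgres` / `streaming_archiver` combination keeps the PITR property of the current system, while relying on stock PostgreSQL replication protocols instead of custom scripts.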

Goals

Must have

  • geographic redundancy: have database backups in a different provider and geographic location than their primary storage

  • solve space issues: we're constantly having issues with the storage server filling up, we need to solve this in the long term

Nice to have

  • well-established code base: use a more standard backup software not developed and maintained only by us and debian.org

Non-Goals

  • global backup policy review: we're not touching bacula or retention policies

  • high availability: we're not setting up extra database servers for high availability, this is only for backups

Migration plan

We're again pressed for time, so we need to come up with a procedure that gives us some room on the backup server while simultaneously minimizing the risk to backup integrity.

To do this, we're going to migrate a mix of small (at first) and large (more quickly than we'd like) database servers.

Phase I: alpha testing

Migrate the following backups from bungei to backup-storage-01:

  • [ ] weather-01 (12.7GiB)
  • [ ] rude (35.1GiB)
  • [ ] materculae (151.9GiB)
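
For each server, the cutover could then follow the standard Barman onboarding steps, roughly as below. This is a sketch of the usual workflow, to be refined during the testing phase:

```shell
# verify connections, replication slot and WAL archiving are all green
barman check weather-01

# force a WAL switch so there is a fresh WAL file to archive and verify
barman switch-wal --force --archive weather-01

# take the first base backup, waiting for the WAL needed to make it consistent
barman backup --wait weather-01

# confirm the backup is listed and usable
barman list-backup weather-01
```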

Phase II: beta testing

After a week, retire the above backups from bungei, then migrate the following servers:

  • [ ] gitlab-02 (34.9GiB)
  • [ ] polyanthum (20.3GiB)
  • [ ] meronense (505.1GiB)

Phase III: production

After another week, migrate the last backups from bungei:

  • [ ] bacula-director-01 (180.8GiB)

At this point, we should hopefully have enough room on the backup server to survive the holidays.

Phase IV: retire legacy, bungei replacement

At this point, the only backups using the legacy system are the ones from the gnt-dal cluster (4 servers). Rebuild those with the new service. Do not keep a copy of the legacy system on bungei (to save space, particularly for metricsdb-01) but possibly archive a copy of the legacy backups on backup-storage-01:

  • [ ] metricsdb-01 (1.6TiB)
  • [ ] puppetdb-01 (20.2GiB)
  • [ ] survey-01 (5.7GiB)
  • [ ] anonticket-01 (3.9GiB)
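
Archiving the legacy backups could be as simple as a one-shot copy before deletion. A sketch, where the paths are made up for illustration and need to be checked against the real layout:

```shell
# copy the legacy PITR backups for metricsdb-01 off bungei before removing them
rsync -aHAX --info=progress2 \
    bungei.torproject.org:/srv/backups/pg/metricsdb-01/ \
    /srv/backups/legacy-pg/metricsdb-01/
```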

If we still run out of disk space on bungei, consider replacing the server entirely. The server is now 5 years old, which is getting close to our current amortization time (6 years), and it's a rental server, so it's relatively easy to replace: we don't need to buy new hardware.

Alternatives considered

See the alternatives considered in our PostgreSQL documentation.

Costs

Staff estimates (3-4 weeks)

| Task                              | Time     | Complexity | Estimate | Days | Note                                      |
|-----------------------------------|----------|------------|----------|------|-------------------------------------------|
| pgbarman testing and manual setup | 3 days   | high       | 1 week   | 6    |                                           |
| pgbarman puppetization            | 3 days   | medium     | 1 week   | 4.5  |                                           |
| migrate 12 servers                | 3 days   | high       | 1 week   | 4.5  | assuming we can migrate 4 servers per day |
| legacy code cleanup               | 1 day    | low        | ~1 day   | 1.1  |                                           |
| Sub-total                         | 2 weeks  | ~medium    | 3 weeks  | 16.1 |                                           |
| bungei replacement                | 3 days   | low        | ~3 days  | 3.3  | optional                                  |
| bungei resizing                   | 1 day    | low        | ~1 day   | 1.1  | optional                                  |
| Total                             | ~3 weeks | ~medium    | ~4 weeks | 20.5 |                                           |
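
As a quick sanity check on the "Days" column (a throwaway calculation, not part of the plan):

```shell
# cross-check the sub-total and grand total of the day estimates above
awk 'BEGIN {
    s = 6 + 4.5 + 4.5 + 1.1   # testing + puppetization + migration + cleanup
    printf "sub-total: %.1f days, total: %.1f days\n", s, s + 3.3 + 1.1
}'
# sub-total: 16.1 days, total: 20.5 days
```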

Hosting costs (+70EUR/mth, optional)

bungei is an SX132 server, billed monthly at 175EUR. It has the following specifications:

  • Intel Xeon E5-1650 (12 Core, 3.5GHz)
  • RAM: 128GiB DDR4
  • Storage: 10x10TB SAS drives (100TB, HGST HUH721010AL)

A likely replacement would be the SX135 server, at 243EUR and a 94EUR setup fee:

  • AMD Ryzen 9 3900 (12 core, 3.1GHz)
  • RAM: 128GiB
  • Storage: 8x22TB SATA drives (176TB)

There's a cheaper server, the SX65 at 124EUR/mth, but it has less disk space (4x22TB, 88TB). That said, it might be enough if we do not need to grow bungei and simply need to retire it.

References

Appendix

Backups inventory

Here's the list of the PostgreSQL databases currently backed up on the storage server, and their locations:

| server             | location | size     | note                                |
|--------------------|----------|----------|-------------------------------------|
| anonticket-01      | gnt-dal  | 3.9GiB   |                                     |
| bacula-director-01 | gnt-fsn  | 180.8GiB |                                     |
| gitlab-02          | gnt-fsn  | 34.9GiB  | move to gnt-dal considered, #41431  |
| materculae         | gnt-fsn  | 151.9GiB |                                     |
| meronense          | gnt-fsn  | 505.1GiB |                                     |
| metricsdb-01       | gnt-dal  | 1.6TiB   | huge!                               |
| polyanthum         | gnt-fsn  | 20.3GiB  |                                     |
| puppetdb-01        | gnt-dal  | 20.2GiB  |                                     |
| rude               | gnt-fsn  | 35.1GiB  |                                     |
| survey-01          | gnt-dal  | 5.7GiB   |                                     |
| weather-01         | gnt-fsn  | 12.7GiB  |                                     |

gnt-fsn servers

Same, but only for the servers at Hetzner, sorted by size:

| server             | size     |
|--------------------|----------|
| meronense          | 505.1GiB |
| bacula-director-01 | 180.8GiB |
| materculae         | 151.9GiB |
| rude               | 35.1GiB  |
| gitlab-02          | 34.9GiB  |
| polyanthum         | 20.3GiB  |
| weather-01         | 12.7GiB  |
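
Summing the size column gives a rough figure for what phases I-III will move off bungei (a back-of-the-envelope sketch, sizes copied from the table above):

```shell
# add up the gnt-fsn backup sizes (GiB) from the table above
printf '%s\n' 505.1 180.8 151.9 35.1 34.9 20.3 12.7 |
    awk '{ sum += $1 } END { printf "total: %.1f GiB\n", sum }'
# total: 940.8 GiB
```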

gnt-dal

Same for Dallas:

| server        | size    |
|---------------|---------|
| metricsdb-01  | 1.6TiB  |
| puppetdb-01   | 20.2GiB |
| survey-01     | 5.7GiB  |
| anonticket-01 | 3.9GiB  |