title: TPA-RFC-57: Debian bookworm upgrade schedule
costs: staff, 2-4 weeks+
approval: TPA, service admins
affected users: TPA, service admins
deadline: 2 weeks
status: obsolete
discussion: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41245


[[TOC]]

Summary: bookworm upgrades will start in the first weeks of September 2023, with the majority of servers upgraded by the end of October 2023, and should complete before the end of June 2024. Let us know if your service requires special handling. Beware that this includes a complete Python 2 removal, as announced in TPA-RFC-27.

Background

Debian 12 bookworm was released on June 10th 2023. The previous stable release (Debian 11 bullseye) will be supported until June 2024, so we hope to complete the migration before that date, or sooner.

We typically start upgrading our boxes when testing enters the freeze but, unfortunately, we weren't able to complete the bullseye upgrade in time for the freeze, as complex systems required more attention. See the bullseye post-mortem for a review of that approach.

Some of the new machines that were set up recently have already been installed with bookworm, as the installers were changed shortly after the release (tpo/tpa/team#41244). A few machines were upgraded manually without any ill effects, and we do not consider this upgrade to be risky or dangerous in general.

This work is part of the %Debian 12 bookworm upgrade milestone, itself part of the 2023 roadmap.

Proposal

The proposal, broadly speaking, is to upgrade all servers in three batches. The first two are roughly equal in size and spread over September and October 2023. The remaining upgrades will be announced later, individually, per server, but should happen no later than June 2024.

Affected users

All service admins are affected by this change. If you have shell access on any TPA server, you will want to read this announcement.

Python 2 retirement

Developers still using Python 2 should especially be aware that Debian has completely removed all Python 2 versions from bookworm.

If you are still running code that is not compatible with Python 3, you will need to port your scripts before this upgrade completes. And yes, there are still Python 2 programs out there, including inside TPA. We have already ported some, and the work is generally not hard. See the porting guide for more information.
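For reference, most ports come down to a handful of mechanical changes. The snippet below is a hypothetical before/after illustration of the most common ones, not an actual TPA script:

```python
# Python 2 original (a SyntaxError on Python 3):
#   print "processing %s" % host
#   ratio = done / total          # silently floors when both are ints
#   except IOError, e:
#       ...

def progress(done, total, host):
    """Python 3 equivalent of the Python 2 sketch above."""
    print("processing %s" % host)   # print is now a function
    return done / total             # "/" is true division; use "//" to floor

try:
    open("/nonexistent-path-for-demo")
except IOError as e:                # "except E as e" replaces "except E, e"
    pass
```

Tools like `2to3` (now removed from Python itself) or `pyupgrade` can automate most of these rewrites.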

Debian 12 bookworm ships with Python 3.11. Compared to Debian 11 bullseye's Python 3.9, there are many exciting changes, including exception groups, TOML in the stdlib, the "pipe" (|) syntax for Union types, structural pattern matching, the Self type, variadic generics, and major performance improvements.

Other notable changes

TPA keeps a page detailing notable changes that might be interesting to you, on top of the bookworm release notes, in particular the "known issues" and "what's new" sections.

Upgrade schedule

The upgrade is split in multiple batches:

  • low complexity (mostly TPA services): 34 machines, September 2023 (issue 41251)
  • moderate complexity (service admins): 31 machines, October 2023 (issue 41252)
  • high complexity (hard stuff): 15 machines, to be announced separately, before June 2024 (issue 41321, issue 41254 for gnt-fsn and issue 41253 for gnt-dal)
  • servers to be retired or rebuilt: upgraded like any others
  • already completed upgrades: 4 machines
  • buster machines: high complexity or retirement for cupani (tpo/tpa/team#41217) and vineale (tpo/tpa/team#41218), 6 machines

The free time between the first two batches will also allow us to cover contingencies: upgrades that drag on, and other work that will inevitably need to be performed.

The objective is to do the batches in collective "upgrade parties" that should be "fun" for the team. This approach proved effective during the bullseye upgrade and we are eager to repeat it.

Low complexity, batch 1: September 2023

A first batch of servers will be upgraded around the second or third week of September 2023, when everyone will be back from vacation. Hopefully most fires will be out at that point.

It's also long enough before the Year-End Campaign (YEC) to allow us to recover if critical issues come up during the upgrade.

Those machines are considered somewhat trivial to upgrade, either because they are mostly managed by TPA or because we evaluate that the upgrade will have minimal impact on the service's users.

archive-01.torproject.org
cdn-backend-sunet-02.torproject.org
chives.torproject.org
dal-rescue-01.torproject.org
dal-rescue-02.torproject.org
hetzner-hel1-02.torproject.org
hetzner-hel1-03.torproject.org
hetzner-nbg1-01.torproject.org
hetzner-nbg1-02.torproject.org
loghost01.torproject.org
mandos-01.torproject.org
media-01.torproject.org
neriniflorum.torproject.org
ns3.torproject.org
ns5.torproject.org
palmeri.torproject.org
perdulce.torproject.org
relay-01.torproject.org
static-gitlab-shim.torproject.org
static-master-fsn.torproject.org
staticiforme.torproject.org
submit-01.torproject.org
tb-build-04.torproject.org
tb-build-05.torproject.org
tb-pkgstage-01.torproject.org
tb-tester-01.torproject.org
tbb-nightlies-master.torproject.org
web-dal-07.torproject.org
web-dal-08.torproject.org
web-fsn-01.torproject.org
web-fsn-02.torproject.org

In the first batch of bullseye machines, we estimated this work at 45 minutes per machine, or about 20 hours of work. It ended up taking about one hour per machine, so 27 hours.

The above is 34 machines, so the work is estimated to take 34 hours, or about a full work week for one person. It should be possible to complete in a single work week "party".
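The arithmetic above can be sanity-checked with a throwaway sketch (numbers taken from this section; this is not part of any upgrade tooling):

```python
def batch_hours(machines: int, minutes_per_machine: int) -> float:
    """Total effort for a batch at a given per-machine pace."""
    return machines * minutes_per_machine / 60

# bullseye batch 1: 27 machines, estimated at 45 minutes each,
# observed at roughly 60 minutes each
estimated = batch_hours(27, 45)          # 20.25 hours, "about 20"
observed = batch_hours(27, 60)           # 27.0 hours

# bookworm batch 1: 34 machines at the observed one-hour pace
bookworm_batch_1 = batch_hours(34, 60)   # 34.0 hours, about one work week
```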

Other notable changes: staticiforme is treated as low complexity instead of moderate complexity, and the Tor Browser builders have been moved to moderate complexity as they are managed by service admins.

Feedback and coordination of this batch happens in issue 41251.

Moderate complexity, batch 2: October 2023

The second batch of "moderate complexity" servers will be upgraded in the last week of October 2023. The main difference with the first batch is that the second batch regroups services mostly managed by service admins, who are given a longer heads-up before the upgrades are done.

The date was picked to be far enough away from the first batch to recover from problems with it, but also after the YEC (scheduled for the end of October).

Those are the servers which will be upgraded in that batch:

bacula-director-01.torproject.org
btcpayserver-02.torproject.org
bungei.torproject.org
carinatum.torproject.org
check-01.torproject.org
colchicifolium.torproject.org
collector-02.torproject.org
crm-ext-01.torproject.org
crm-int-01.torproject.org
dangerzone-01.torproject.org
donate-review-01.torproject.org
gayi.torproject.org
gitlab-02.torproject.org
henryi.torproject.org
majus.torproject.org
materculae.torproject.org
meronense.torproject.org
metrics-store-01.torproject.org
nevii.torproject.org
onionbalance-02.torproject.org
onionoo-backend-01.torproject.org
onionoo-backend-02.torproject.org
onionoo-frontend-01.torproject.org
onionoo-frontend-02.torproject.org
polyanthum.torproject.org
probetelemetry-01.torproject.org
rdsys-frontend-01.torproject.org
rude.torproject.org
survey-01.torproject.org
telegram-bot-01.torproject.org
weather-01.torproject.org

That is 31 machines. Like the first batch, the second batch of bullseye upgrades was slightly underestimated, so we assume one hour per machine here as well: about 31 hours, again possible to fit in a work week.

Feedback and coordination of this batch happens in issue 41252.

High complexity, individually done

Those machines are harder to upgrade, due to major upgrades of their core components, and will require individual attention, if not major work.

All of those require individual decision and design, and specific announcements will be made for upgrades once a decision has been made for each service.

Those are the affected servers:

alberti.torproject.org
eugeni.torproject.org
hetzner-hel1-01.torproject.org
pauli.torproject.org

Most of those servers are actually running buster at the moment and are scheduled to be upgraded to bullseye first. As part of that process, they might be simplified and turned into moderate complexity projects.

See issue 41321 to track the bookworm upgrades of the high-complexity servers.

The two Ganeti clusters also fall under the "high complexity" umbrella. Those are the following 11 servers:

dal-node-01.torproject.org
dal-node-02.torproject.org
dal-node-03.torproject.org
fsn-node-01.torproject.org
fsn-node-02.torproject.org
fsn-node-03.torproject.org
fsn-node-04.torproject.org
fsn-node-05.torproject.org
fsn-node-06.torproject.org
fsn-node-07.torproject.org
fsn-node-08.torproject.org

Ganeti cluster upgrades are tracked in issue 41254 (gnt-fsn) and issue 41253 (gnt-dal). We may want to upgrade only one cluster first, possibly the smaller gnt-dal cluster.

Looking at the gnt-fsn upgrade ticket, it seems the previous upgrade took around 12 hours of work, so the estimate here is about two days.

Completed upgrades

Those machines have already been upgraded to (or installed as) Debian 12 bookworm:

forum-01.torproject.org
metricsdb-01.torproject.org
tb-build-06.torproject.org

Buster machines

Those machines are currently running buster and are either considered for retirement or will be "double-upgraded" to bookworm, either as part of the bullseye upgrade process, or separately.

alberti.torproject.org
cupani.torproject.org
eugeni.torproject.org
hetzner-hel1-01.torproject.org
pauli.torproject.org
vineale.torproject.org

In particular:

  • alberti is part of the "high complexity" batch and will be double-upgraded

  • cupani (tpo/tpa/team#41217) and vineale (tpo/tpa/team#41218) will be retired in early 2024, see TPA-RFC-36

  • eugeni is part of the "high complexity" batch; its future is still uncertain, as it depends on the email plan

  • hetzner-hel1-01 (Icinga/Nagios) is possibly going to be retired, see TPA-RFC-33

  • pauli is part of the high complexity batch and should be double-upgraded

There is other work related to the bookworm upgrade that is mentioned in the %Debian 12 bookworm upgrade milestone.

Alternatives considered

Container images

This proposal doesn't cover Docker container image upgrades. Each team is responsible for bumping their image tags in GitLab CI appropriately and is strongly encouraged to keep a close eye on those in general. We may eventually enforce stricter control over container images if self-management proves too chaotic.

Upgrade automation

No specific work is set aside to further automate upgrades.

Retirements or rebuilds

We do not plan on dealing with the bookworm upgrade by retiring or rebuilding servers. That policy did not work well for the bullseye upgrades and has been abandoned.

If a server is scheduled to be retired or rebuilt some time in the future and its turn comes up in a batch, it should either be retired or rebuilt in time, or simply upgraded, unless it is a "high complexity" upgrade.

Costs

The first and second batches of work should take TPA about two weeks of full time work.

The remaining servers are a wild guess: probably a few weeks altogether, possibly more. They depend on other RFCs and their estimates are out of scope here.

Approvals required

This proposal needs approval from TPA team members, but service admins can request additional delay if they are worried about their service being affected by the upgrade.

Comments or feedback can be provided in issues linked above, or the general process can be commented on in issue tpo/tpa/team#41245.

References