Debian 12 bookworm entered freeze on January 19th, 2023. TPA is in the process of studying the procedure and hopes to start immediately after the bullseye upgrade is completed. We have a hard deadline of one year after the stable release, which gives us a few years to complete this process. Typically, however, we try to upgrade during the freeze to report (and contribute to) issues we find during the upgrade, as those are easier to fix during the freeze than after. In that sense, the deadline is more like the third quarter of 2023.

It is an aggressive timeline, which will likely be missed. It is tracked in the GitLab issue tracker under the % Debian 12 bookworm upgrade milestone. Upgrades will be staged in batches; see TPA-RFC-20 for details on how that was performed for bullseye.

As soon as the bullseye upgrade is completed, we hope to phase out the bullseye installers so that new machines are set up with bookworm.

This page aims at documenting the upgrade procedure, known problems and upgrade progress of the fleet.

[[TOC]]

Procedure

This procedure is designed to be applied, in batch, on multiple servers. Do NOT follow this procedure unless you are familiar with the command line and the Debian upgrade process. It has been crafted by and for experienced system administrators that have dozens if not hundreds of servers to upgrade.

In particular, it runs almost completely unattended: the upgrade never prompts for configuration changes and simply does not apply them, which will break services in many cases. We use a clean_conflicts script to handle them all in one shot and shorten the upgrade process (without it, configuration file prompts stop the upgrade at more or less random times). Those changes then get applied after a reboot. And yes, that's even more dangerous.

IMPORTANT: if you are doing this procedure over SSH (I had the privilege of having a console), you may want to upgrade SSH first as it has a longer downtime period, especially if you are on a flaky connection.

See the "conflicts resolution" section below for how to handle clean_conflicts output.

  1. Preparation:

    echo reset to the default locale &&
    export LC_ALL=C.UTF-8 &&
    echo install some dependencies &&
    sudo apt install ttyrec screen debconf-utils deborphan apt-forktracer &&
    echo create ttyrec file with adequate permissions &&
    sudo touch /var/log/upgrade-bookworm.ttyrec &&
    sudo chmod 600 /var/log/upgrade-bookworm.ttyrec &&
    sudo ttyrec -a -e screen /var/log/upgrade-bookworm.ttyrec
    
  2. Backups and checks:

    ( 
      umask 0077 &&
      tar cfz /var/backups/pre-bookworm-backup.tgz /etc /var/lib/dpkg /var/lib/apt/extended_states /var/cache/debconf $( [ -e /var/lib/aptitude/pkgstates ] && echo /var/lib/aptitude/pkgstates ) &&
      dpkg --get-selections "*" > /var/backups/dpkg-selections-pre-bookworm.txt &&
      debconf-get-selections > /var/backups/debconf-selections-pre-bookworm.txt
    ) &&
    : lock down puppet-managed postgresql version &&
    (
      if jq -re '.resources[] | select(.type=="Class" and .title=="Profile::Postgresql") | .title' < /var/lib/puppet/client_data/catalog/$(hostname -f).json; then
      echo "tpa_preupgrade_pg_version_lock: '$(/usr/share/postgresql-common/supported-versions)'" > /etc/facter/facts.d/tpa_preupgrade_pg_version_lock.yaml; fi
    ) &&
    : pre-upgrade puppet run &&
    ( puppet agent --test || true ) &&
    apt-mark showhold &&
    dpkg --audit &&
    echo look for dkms packages and make sure they are relevant, if not, purge. &&
    ( dpkg -l '*dkms' || true ) &&
    echo look for leftover config files &&
    /usr/local/sbin/clean_conflicts &&
    echo make sure backups are up to date in Bacula &&
    printf "End of Step 2\a\n"
    
  3. Enable module loading (for ferm) and test reboots:

    systemctl disable modules_disabled.timer &&
    puppet agent --disable "running major upgrade" &&
    shutdown -r +1 "bookworm upgrade step 3: rebooting with module loading enabled"
    
  4. Perform any pending upgrade and clear out old pins:

    export LC_ALL=C.UTF-8 &&
    sudo ttyrec -a -e screen /var/log/upgrade-bookworm.ttyrec
    
    apt update && apt -y upgrade &&
    echo Check for pinned, on hold, packages, and possibly disable &&
    rm -f /etc/apt/preferences /etc/apt/preferences.d/* &&
    rm -f /etc/apt/sources.list.d/backports.debian.org.list &&
    rm -f /etc/apt/sources.list.d/backports.list &&
    rm -f /etc/apt/sources.list.d/bookworm.list &&
    rm -f /etc/apt/sources.list.d/bullseye.list &&
    rm -f /etc/apt/sources.list.d/*-backports.list &&
    rm -f /etc/apt/sources.list.d/experimental.list &&
    rm -f /etc/apt/sources.list.d/incoming.list &&
    rm -f /etc/apt/sources.list.d/proposed-updates.list &&
    rm -f /etc/apt/sources.list.d/sid.list &&
    rm -f /etc/apt/sources.list.d/testing.list &&
    echo purge removed packages &&
    apt purge $(dpkg -l | awk '/^rc/ { print $2 }') &&
    apt purge '?obsolete' &&
    apt autoremove -y --purge &&
    echo possibly clean up old kernels &&
    dpkg -l 'linux-image-*' &&
    echo look for packages from backports, other suites or archives &&
    echo if possible, switch to official packages by disabling third-party repositories &&
    apt-forktracer &&
    printf "End of Step 4\a\n"
    
  5. Check free space (see this guide to free up space), disable auto-upgrades, and download packages:

    systemctl stop apt-daily.timer &&
    sed -i 's#bullseye-security#bookworm-security#' $(ls /etc/apt/sources.list /etc/apt/sources.list.d/*) &&
    sed -i 's/bullseye/bookworm/g' $(ls /etc/apt/sources.list /etc/apt/sources.list.d/*) &&
    apt update &&
    apt -y -d full-upgrade &&
    apt -y -d upgrade &&
    apt -y -d dist-upgrade &&
    df -h &&
    printf "End of Step 5\a\n"
    
  6. Actual upgrade run:

    echo put server in maintenance &&
    sudo touch /etc/nologin &&
    env DEBIAN_FRONTEND=noninteractive APT_LISTCHANGES_FRONTEND=none APT_LISTBUGS_FRONTEND=none UCF_FORCE_CONFFOLD=y \
        apt full-upgrade -y -o Dpkg::Options::='--force-confdef' -o Dpkg::Options::='--force-confold' &&
    printf "End of Step 6\a\n"
    
  7. Post-upgrade procedures:

    apt-get update --allow-releaseinfo-change &&
    puppet agent --enable &&
    puppet agent -t --noop &&
    printf "Press enter to continue, Ctrl-C to abort." &&
    read -r _ &&
    (puppet agent -t || true) &&
    echo deploy upgrades after possible Puppet sources.list changes &&
    apt update && apt upgrade -y &&
    rm -f /etc/default/bacula-fd.ucf-dist /etc/apache2/conf-available/security.conf.dpkg-dist /etc/apache2/mods-available/mpm_worker.conf.dpkg-dist /etc/default/puppet.dpkg-dist /etc/ntpsec/ntp.conf.dpkg-dist /etc/puppet/puppet.conf.dpkg-dist /etc/apt/apt.conf.d/50unattended-upgrades.dpkg-dist /etc/bacula/bacula-fd.conf.ucf-dist /etc/ca-certificates.conf.dpkg-old /etc/cron.daily/bsdmainutils.dpkg-remove /etc/default/prometheus-apache-exporter.dpkg-dist /etc/default/prometheus-node-exporter.dpkg-dist /etc/ldap/ldap.conf.dpkg-dist /etc/logrotate.d/apache2.dpkg-dist /etc/nagios/nrpe.cfg.dpkg-dist /etc/ssh/ssh_config.dpkg-dist /etc/ssh/sshd_config.ucf-dist /etc/sudoers.dpkg-dist /etc/syslog-ng/syslog-ng.conf.dpkg-dist /etc/unbound/unbound.conf.dpkg-dist /etc/systemd/system/fstrim.timer &&
    printf "\a" &&
    /usr/local/sbin/clean_conflicts &&
    systemctl start apt-daily.timer &&
    echo 'workaround for Debian bug #989720' &&
    sed -i 's/^allow-ovs/auto/' /etc/network/interfaces &&
    rm /etc/nologin &&
    printf "End of Step 7\a\n" &&
    shutdown -r +1 "bookworm upgrade step 7: removing old kernel image"
    
  8. Service-specific upgrade procedures

    If the server is hosting a more complex service, follow the right Service-specific upgrade procedures

  9. Post-upgrade cleanup:

    export LC_ALL=C.UTF-8 &&
    sudo ttyrec -a -e screen /var/log/upgrade-bookworm.ttyrec
    
    apt-mark manual bind9-dnsutils puppet-agent &&
    apt purge apt-forktracer &&
    echo purging removed packages &&
    apt purge $(dpkg -l | awk '/^rc/ { print $2 }') &&
    apt autopurge &&
    apt purge $(deborphan --guess-dummy) &&
    while deborphan -n | grep -q . ; do apt purge $(deborphan -n); done &&
    apt autopurge &&
    echo review obsolete and odd packages &&
    apt purge '?obsolete' && apt autopurge &&
    apt list "?narrow(?installed, ?not(?codename($(lsb_release -c -s | tail -1))))" &&
    apt clean &&
    echo review installed kernels: &&
    dpkg -l 'linux-image*' | less &&
    printf "End of Step 9\a\n" &&
    shutdown -r +1 "bookworm upgrade step 9: testing reboots one final time"
    

IMPORTANT: make sure you test the services at this point, or at least notify the admins responsible for the service so they do so. This will allow new problems that developed due to the upgrade to be found earlier.
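
The suite rewrite in step 5 of the procedure above edits the APT sources in place. The following is a minimal sketch of a dry run, assuming sed, diff and mktemp are available: it applies the same substitution to a scratch copy and prints the diff, so that unexpected matches (third-party repositories, backports lines) can be reviewed first. The function name `preview_suite_rewrite` is hypothetical and is not part of the TPA tooling.

```shell
# Hypothetical helper (not part of the TPA tooling): preview the
# bullseye -> bookworm rewrite from step 5 without touching the real file.
preview_suite_rewrite() {
    # $1: path to a sources.list-style file
    src=$1
    tmp=$(mktemp) || return 1
    # same substitution as step 5, applied to a scratch copy only
    sed 's/bullseye/bookworm/g' "$src" > "$tmp"
    # diff exits with status 1 when the files differ, which is expected here
    diff -u "$src" "$tmp"
    rm -f "$tmp"
}
```

For example, `preview_suite_rewrite /etc/apt/sources.list` prints what would change and leaves the file alone.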

Conflicts resolution

When the clean_conflicts script gets run, it asks you to check each configuration file that was modified locally but that the Debian package upgrade wants to overwrite. You need to make a decision on each file. This section aims to provide guidance on how to handle those prompts.

Those config files should be manually checked on each host:

     /etc/default/grub.dpkg-dist
     /etc/initramfs-tools/initramfs.conf.dpkg-dist

The grub config file, in particular, should be restored to the upstream default and host-specific configuration moved to the grub.d directory.

If other files come up, they should be added in the above decision list, or in an operation in step 2 or 7 of the above procedure, before the clean_conflicts call.

Files that should be updated in Puppet are mentioned in the Issues section below as well.
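
To get a feel for what is pending before running clean_conflicts, the leftover files can be listed by hand. This is a minimal sketch and NOT the clean_conflicts script itself: the function names are hypothetical, and the suffix list is an assumption based on the files removed in step 7 plus the other suffixes dpkg and ucf are known to leave behind.

```shell
# Hypothetical helper, NOT the clean_conflicts script: list leftover
# conffile variants under a directory (default: /etc).
list_conffile_leftovers() {
    dir=${1:-/etc}
    find "$dir" -type f \( \
        -name '*.dpkg-dist' -o -name '*.dpkg-old' -o -name '*.dpkg-new' \
        -o -name '*.ucf-dist' -o -name '*.ucf-old' -o -name '*.ucf-new' \
    \) 2>/dev/null | sort
}

# Show how a leftover differs from the live configuration file.
# $1: path to a leftover such as /etc/default/grub.dpkg-dist
show_conffile_diff() {
    leftover=$1
    # strip the last suffix to get the live configuration file
    live=${leftover%.*}
    diff -u "$live" "$leftover"
}
```

For example, `show_conffile_diff /etc/default/grub.dpkg-dist` compares the local grub configuration with the version shipped by the package.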

Service-specific upgrade procedures

PostgreSQL upgrades

Note: before doing the entire major upgrade procedure, it is worth considering upgrading PostgreSQL to "backports". There are no official "Debian backports" of PostgreSQL, but there is an https://apt.postgresql.org/ repo which is supposedly compatible with the official Debian packages. The only (currently known) problem with that repo is that it doesn't use the tilde (~) version number, so that when you do eventually do the major upgrade, you need to manually upgrade those packages as well.

PostgreSQL is special and needs to be upgraded manually.

  1. make a full backup of the old cluster:

    ssh -tt bungei.torproject.org 'sudo -u torbackup postgres-make-one-base-backup $(grep ^meronense.torproject.org $(which postgres-make-base-backups ))'
    

    The above assumes the host to back up is meronense and the backup server is bungei. See howto/postgresql for details of that procedure.

  2. Once the backup completes, on the database server, possibly stop users of the database, because it will have to be stopped for the major upgrade.

    on the Bacula director, in particular, this probably means waiting for all backups to complete and stopping the director:

    service bacula-director stop
    

    this will mean other things on other servers! failing to stop writes to the database will lead to problems with the backup monitoring system. an alternative is to just stop PostgreSQL altogether:

    service postgresql@13-main stop
    

    This also involves stopping Puppet so that it doesn't restart services:

    puppet agent --disable "PostgreSQL upgrade"
    
  3. On the storage server, move the directory out of the way and recreate it:

    ssh bungei.torproject.org "mv /srv/backups/pg/meronense /srv/backups/pg/meronense-13 && sudo -u torbackup mkdir /srv/backups/pg/meronense"
    
  4. on the database server, do the actual cluster upgrade:

    export LC_ALL=C.UTF-8 &&
    printf "about to stop and destroy cluster main on postgresql-15, press enter to continue" &&
    read _ &&
    port15=$(grep ^port /etc/postgresql/15/main/postgresql.conf  | sed 's/port.*= //;s/[[:space:]].*$//')
    if psql -p "$port15" --no-align --tuples-only \
           -c "SELECT datname FROM pg_database WHERE datistemplate = false and datname != 'postgres';"  \
           | grep .; then
        echo "ERROR: database cluster 15 not empty"
    else
        pg_dropcluster --stop 15 main &&
        pg_upgradecluster -m upgrade -k 13 main &&
        rm -f /etc/facter/facts.d/tpa_preupgrade_pg_version_lock.yaml
    fi
    

    Yes, that implies DESTROYING the NEW version but the point is we then recreate it from the old one.

    TODO: this whole procedure needs to be moved into fabric, for sanity.

  5. run puppet on the server and on the storage server to update backup configuration files; this should also restart any services stopped at step 2

    puppet agent --enable && pat
    ssh bungei.torproject.org pat
    
  6. make a new full backup of the new cluster:

    ssh -tt bungei.torproject.org 'sudo -u torbackup postgres-make-one-base-backup $(grep ^meronense.torproject.org $(which postgres-make-base-backups ))'
    
  7. make sure you check for gaps in the write-ahead log, see tpo/tpa/team#40776 for an example of that problem and the WAL-MISSING-AFTER PostgreSQL playbook for recovery.

  8. purge the old backups directory after 3 weeks:

    ssh bungei.torproject.org "echo 'rm -r /srv/backups/pg/meronense-13/' | at now + 21day"
    

The old PostgreSQL packages will be automatically cleaned up and purged at step 9 of the general upgrade procedure.

It is also wise to read the release notes for the relevant release to see if there are any specific changes that are needed at the application level, for service owners. In general, the above procedure does use pg_upgrade so that's already covered.
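
The port lookup in step 4 is the part of the cluster upgrade that is easiest to get wrong, so it can be worth testing in isolation. This is a minimal sketch wrapping the same grep and sed pipeline in a function; the name `pg_conf_port` is hypothetical.

```shell
# Hypothetical helper: print the port configured in a postgresql.conf,
# using the same grep/sed pipeline as step 4 above.
pg_conf_port() {
    # $1: path to postgresql.conf
    # keep only uncommented "port = ..." lines, then strip the key and
    # anything after the value (trailing whitespace and comments)
    grep '^port' "$1" | sed 's/port.*= //;s/[[:space:]].*$//'
}
```

The result can then be passed to psql's lowercase `-p` (port) option, for example `psql -p "$(pg_conf_port /etc/postgresql/15/main/postgresql.conf)"`. The uppercase `-P` is a different option (`--pset`).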

RT upgrades

Request Tracker was upgraded from version 4.4.6 (bullseye) to 5.0.3. The Debian package is now request-tracker5. To implement this transition, a manual database upgrade was executed, and the Puppet profile was updated to reflect the new package and executable names, and configuration options.

https://docs.bestpractical.com/rt/5.0.3/UPGRADING-5.0.html

Ganeti upgrades

So far it seems there is no significant upgrade on the Ganeti clusters, at least as far as Ganeti itself is concerned. In fact, there hasn't been a release upstream since 2022, which is a bit concerning.

There was a bug with the newer Haskell code in bookworm but the 3.0.2-2 package already has a patch (really a workaround) to fix that. Also, there was a serious regression in the Linux kernel which affected Haskell programs (1036755). The fix for this issue was released to bookworm in July 2023, in kernel 6.1.38.

No special procedure seems to be required for the Ganeti upgrade this time around, follow the normal upgrade procedures.

Puppet server upgrade

In my (anarcat) home lab, I had to apt install postgresql puppetdb puppet-terminus-puppetdb and follow the connect instructions, as I was using the redis terminus before (probably not relevant for TPA).

I also had to adduser puppetdb puppet for it to be able to access the certs, and add the certs to the jetty config. Basically:

certname="$(puppet config print certname)"
hostcert="$(puppet config print hostcert)"
hostkey="$(puppet config print hostprivkey)"
cacert="$(puppet config print cacert)"

adduser puppetdb puppet

cat >>/etc/puppetdb/conf.d/jetty.ini <<-EOF
    ssl-host = 0.0.0.0
    ssl-port = 8081
    ssl-key = ${hostkey}
    ssl-cert = ${hostcert}
    ssl-ca-cert = ${cacert}
EOF

echo "Starting PuppetDB ..."
systemctl start puppetdb

cp /usr/share/doc/puppet-terminus-puppetdb/routes.yaml.example /etc/puppet/routes.yaml
cat >/etc/puppet/puppetdb.conf <<-EOF
    [main]
    server_urls = https://${certname}:8081
EOF

also:

apt install puppet-module-puppetlabs-cron-core puppet-module-puppetlabs-augeas-core puppet-module-puppetlabs-sshkeys-core
puppetserver gem install trocla:0.4.0 --no-document

Notable changes

Here is a list of notable changes from a system administration perspective:

See also the wiki page about bookworm for another list.

New packages

This is a curated list of packages that were introduced in bookworm. There are actually thousands of new packages in the new Debian release, but this is a small selection of projects I found particularly interesting:

  • OpenSnitch - interactive firewall inspired by Little Snitch (on Mac)

Updated packages

This table summarizes package changes that could be interesting for our project.

| Package | Bullseye | Bookworm | Notes |
|---------|----------|----------|-------|
| Ansible | 2.10 | 2.14 | |
| Bind | 9.16 | 9.18 | DoT, DoH, XFR-over-TLS |
| GCC | 10 | 12 | see GCC 11 and GCC 12 release notes |
| Emacs | 27.1 | 28.1 | native compilation, seccomp, better emoji support, 24-bit true color support in terminals, C-x 4 4 to display next command in a new window, xterm-mouse-mode, context-menu-mode, repeat-mode |
| Firefox | 91.13 | 102.11 | 91.13 already in buster-security |
| Git | 2.30 | 2.39 | rebase --update-refs, merge ort strategy, stash --staged, sparse index support, SSH signatures, help.autoCorrect=prompt, maintenance start, clone.defaultRemoteName, git rev-list --disk-usage |
| Golang | 1.15 | 1.19 | generics, fuzzing, SHA-1, TLS 1.0 and 1.1 disabled by default, performance improvements, embed package, Apple ARM support |
| Linux | 5.10 | 6.1 | mainline Rust, multi-generational LRU, KMSAN, KFENCE, maple trees, guest memory encryption, AMD Zen performance improvements, C11, Blake-2 RNG, NTFS write support, Samba 3, Landlock, Apple M1, and much more |
| LLVM | 13 | 15 | see LLVM 14 and LLVM 15 release notes |
| OpenJDK | 11 | 17 | see this list for release notes |
| OpenLDAP | 2.4 | 2.5 | 2FA, load balancer support |
| OpenSSL | 1.1.1 | 3.0 | FIPS 140-3 compliance, MD2 and DES disabled by default, AES-SIV, KDF-SSH, KEM-RSAVE, HTTPS client, Linux KTLS support |
| OpenSSH | 8.4 | 9.2 | scp now uses SFTP, NTRU quantum-resistant key exchange, SHA-1 disabled, EnableEscapeCommandline |
| Podman | 3.0 | 4.3 | GitLab runner, sigstore support, Podman Desktop, volume mount, container clone, pod clone, Netavark network stack rewrite, podman-restart.service to restart all containers, digest support for pull, and lots more |
| Postgresql | 13 | 15 | stats collector optimized out, UNIQUE NULLS NOT DISTINCT, MERGE, zstd/lz4 compression for WAL files, also in pg_basebackup, see also feature matrix |
| Prometheus | 2.24 | 2.42 | keep_firing_for alerts, @ modifier, classic UI removed, promtool check service-discovery command, feature flags which include native histograms, agent mode, snapshot-on-shutdown for faster restarts, generic HTTP service discovery, dark theme, Alertmanager v2 API default |
| Python | 3.9.2 | 3.11 | exception groups, TOML in stdlib, "pipe" for Union types, structural pattern matching, Self type, variadic generics, major performance improvements, Python 2 removed completely |
| Puppet | 5.5.22 | 7.23 | major work from colleagues and myself |
| Rustc | 1.48 | 1.63 | Rust 2021, I/O safety, scoped threads, cargo add, --timings, inline assembly, bare-metal x86, captured identifiers in format strings, binding @ pattern, open range patterns, IntoIterator for arrays, Or patterns, Unicode identifiers, const generics, arm64 tier-1, incremental compilation turned off and on a few times |
| Vim | 8.2 | 9.0 | Vim9 script |

See the official release notes for the full list from Debian.

Removed packages

TODO

Python 2 was completely removed from Debian, a long-term task that had already started with bullseye, but not completed.

See also the noteworthy obsolete packages list.

Deprecation notices

TODO

Issues

See also the official list of known issues.

sudo -i stops working

Note: This issue has been resolved

After upgrading to bookworm, sudo -i started rejecting valid passwords on many machines. This is because bookworm introduced a new /etc/pam.d/sudo-i file. Anarcat fixed this in puppet with a new sudo-i file that TPA vendors.

If you're running into this issue, check that puppet has deployed the correct file in /etc/pam.d/sudo-i.

Pending

  • there's a regression in the bookworm Linux kernel (1036755) which causes crashes in (some?) Haskell programs which should be fixed before we start deploying Ganeti upgrades, in particular

  • Schleuder (and Rails, in general) have issues upgrading between bullseye and bookworm (1038935)

grub-pc failures

On some hosts, grub-pc failed to configure correctly:

Setting up grub-pc (2.06-13) ...
grub-pc: Running grub-install ...
/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_disk-7f3a5ef1-b522-4726 does not exist, so cannot grub-install to it!
You must correct your GRUB install devices before proceeding:

  DEBIAN_FRONTEND=dialog dpkg --configure grub-pc
  dpkg --configure -a
dpkg: error processing package grub-pc (--configure):
 installed grub-pc package post-installation script subprocess returned error exit status 1

The fix is, as described, to run dpkg --configure grub-pc and pick the disk with a partition to install grub on. It's unclear what a preemptive fix for that is.
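
One possible pre-upgrade check, offered only as a sketch: compare the devices grub-pc is configured to install to against what actually exists on the host. This assumes grub-pc stores its targets in the debconf question `grub-pc/install_devices`, as shown by `debconf-show grub-pc`; both function names are hypothetical and this has not been validated as a fix for the failure above.

```shell
# Hypothetical pre-upgrade check, not part of the procedure above.
# Print every device from a comma-separated list that does not exist.
missing_grub_devices() {
    # $1: list such as "/dev/sda, /dev/disk/by-id/..."
    printf '%s\n' "$1" | tr ',' '\n' | while read -r dev; do
        # skip empty entries (empty list or trailing comma)
        [ -z "$dev" ] && continue
        [ -e "$dev" ] || printf '%s\n' "$dev"
    done
}

# Assumption: the install targets live in the debconf question
# grub-pc/install_devices; print its value, if any.
configured_grub_devices() {
    debconf-show grub-pc 2>/dev/null |
        sed -n 's/^.*grub-pc\/install_devices: //p'
}
```

Running `missing_grub_devices "$(configured_grub_devices)"` as root before step 6 prints the stale entries, if any; an empty output means every configured device exists.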

NTP configuration to be ported

We have some slight diffs in our Puppet-managed NTP configuration:

Notice: /Stage[main]/Ntp/File[/etc/ntpsec/ntp.conf]/content:
--- /etc/ntpsec/ntp.conf        2023-09-26 14:41:08.648258079 +0000
+++ /tmp/puppet-file20230926-35001-x7hntz       2023-09-26 14:47:56.547991158 +0000
@@ -4,13 +4,13 @@

 # /etc/ntp.conf, configuration for ntpd; see ntp.conf(5) for help

-driftfile /var/lib/ntpsec/ntp.drift
+driftfile /var/lib/ntp/ntp.drift

 # Leap seconds definition provided by tzdata
 leapfile /usr/share/zoneinfo/leap-seconds.list

 # Enable this if you want statistics to be logged.
-#statsdir /var/log/ntpsec/
+#statsdir /var/log/ntpstats/

 statistics loopstats peerstats clockstats
 filegen loopstats file loopstats type day enable

Notice: /Stage[main]/Ntp/File[/etc/ntpsec/ntp.conf]/content: content changed '{sha256}c5d627a596de1c67aa26dfbd472a4f07039f4664b1284cf799d4e1eb43c92c80' to '{sha256}18de87983c2f8491852390acc21c466611d6660083b0d0810bb6509470949be3'
Notice: /Stage[main]/Ntp/File[/etc/ntpsec/ntp.conf]/mode: mode changed '0644' to '0444'
Info: /Stage[main]/Ntp/File[/etc/ntpsec/ntp.conf]: Scheduling refresh of Exec[service ntpsec restart]
Info: /Stage[main]/Ntp/File[/etc/ntpsec/ntp.conf]: Scheduling refresh of Exec[service ntpsec restart]
Notice: /Stage[main]/Ntp/File[/etc/default/ntpsec]/content:
--- /etc/default/ntpsec 2023-07-29 20:51:53.000000000 +0000
+++ /tmp/puppet-file20230926-35001-d4tltp       2023-09-26 14:47:56.579990910 +0000
@@ -1,9 +1 @@
-NTPD_OPTS="-g -N"
-
-# Set to "yes" to ignore DHCP servers returned by DHCP.
-IGNORE_DHCP=""
-
-# If you use certbot to obtain a certificate for ntpd, provide its name here.
-# The ntpsec deploy hook for certbot will handle copying and permissioning the
-# certificate and key files.
-NTPSEC_CERTBOT_CERT_NAME=""
+NTPD_OPTS='-g'

Notice: /Stage[main]/Ntp/File[/etc/default/ntpsec]/content: content changed '{sha256}26bcfca8526178fc5e0df1412fbdff120a0d744cfbd023fef7b9369e0885f84b' to '{sha256}1bb4799991836109d4733e4aaa0e1754a1c0fee89df225598319efb83aa4f3b1'
Notice: /Stage[main]/Ntp/File[/etc/default/ntpsec]/mode: mode changed '0644' to '0444'
Info: /Stage[main]/Ntp/File[/etc/default/ntpsec]: Scheduling refresh of Exec[service ntpsec restart]
Info: /Stage[main]/Ntp/File[/etc/default/ntpsec]: Scheduling refresh of Exec[service ntpsec restart]
Notice: /Stage[main]/Ntp/Exec[service ntpsec restart]: Triggered 'refresh' from 4 events

Note that this is a "reverse diff", that is Puppet restoring the old bullseye config, so we should apply the reverse of this in Puppet.

sudo configuration lacks limits.conf?

Just noticed this diff on all hosts:

--- /etc/pam.d/sudo     2021-12-14 19:59:20.613496091 +0000
+++ /etc/pam.d/sudo.dpkg-dist   2023-06-27 11:45:00.000000000 +0000
@@ -1,12 +1,8 @@
-##
-## THIS FILE IS UNDER PUPPET CONTROL. DON'T EDIT IT HERE.
-##
 #%PAM-1.0

-# use the LDAP-derived password file for sudo access
-auth    requisite        pam_pwdfile.so pwdfile=/var/lib/misc/thishost/sudo-passwd
+# Set up user limits from /etc/security/limits.conf.
+session    required   pam_limits.so

-# disable /etc/password for sudo authentication, see #6367
-#@include common-auth
+@include common-auth
 @include common-account
 @include common-session-noninteractive

Why don't we have pam_limits set up? Historical oddity? To investigate.

Resolved

libc configuration failure on skip-upgrade

The alberti upgrade failed with:

/usr/bin/perl: error while loading shared libraries: libcrypt.so.1: cannot open shared object file: No such file or directory
dpkg: error processing package libc6:amd64 (--configure):
 installed libc6:amd64 package post-installation script subprocess returned error exit status 127
Errors were encountered while processing:
 libc6:amd64
perl: error while loading shared libraries: libcrypt.so.1: cannot open shared object file: No such file or directory
needrestart is being skipped since dpkg has failed
E: Sub-process /usr/bin/dpkg returned an error code (1)

The solution is:

dpkg -i libc6_2.36-9+deb12u1_amd64.deb libpam0g_1.5.2-6_amd64.deb  libcrypt1_1%3a4.4.33-2_amd64.deb
apt install -f

This happened because I mistakenly followed this procedure instead of the bullseye procedure when upgrading it to bullseye. In other words, I did a "skip upgrade", going directly from buster to bookworm; see this ticket for more context.

Could not enable fstrim.timer

During and after the upgrade to bookworm, this error may be shown during Puppet runs:

Error: Could not enable fstrim.timer
Error: /Stage[main]/Torproject_org/Service[fstrim.timer]/enable: change from 'false' to 'true' failed: Could not enable fstrim.timer:  (corrective)

The solution is to run:

rm /etc/systemd/system/fstrim.timer
systemctl daemon-reload

This removes an obsolete symlink which systemd gets annoyed about.

unable to connect via ssh with nitrokey start token

Connecting to, or via, a bookworm server fails when using a Nitrokey Start token:

sign_and_send_pubkey: signing failed for ED25519 "(none)" from agent: agent refused operation

This is caused by an incompatibility introduced in recent versions of OpenSSH.

The fix is to upgrade the token's firmware. Several workarounds are documented in this ticket: https://dev.gnupg.org/T5931

Troubleshooting

Upgrade failures

Instructions on errors during upgrades can be found in the release notes troubleshooting section.

Reboot failures

If there's any trouble during reboots, you should use some recovery system. The release notes actually have good documentation on that, on top of "use a live filesystem".

References

Fleet-wide changes

The following changes need to be performed once for the entire fleet, generally at the beginning of the upgrade process.

installer changes

The installers need to be changed to support the new release. This includes:

  • the Ganeti installers (add a gnt-instance-debootstrap variant, modules/profile/manifests/ganeti.pp in tor-puppet.git, see commit 4d38be42 for an example)
  • the (deprecated) libvirt installer (modules/roles/files/virt/tor-install-VM, in tor-puppet.git)
  • the wiki documentation:
      • create a new page like this one documenting the process, linked from howto/upgrades
      • make an entry in the data.csv to start tracking progress (see below), copy the Makefile as well, changing the suite name
      • change the Ganeti procedure so that the new suite is used by default
      • change the Hetzner robot install procedure
  • fabric-tasks and the fabric installer (TODO)

Debian archive changes

The Debian archive on db.torproject.org (currently alberti) needs to have a new suite added. This can be (partly) done by editing files in /srv/db.torproject.org/ftp-archive/. Specifically, the two following files need to be changed:

  • apt-ftparchive.config: a new stanza for the suite, basically copy-pasting from a previous entry and changing the suite
  • Makefile: add the new suite to the for loop

But that is not enough: the directory structure needs to be crafted by hand as well. A simple way to do so is to replicate a previous release structure:

cd /srv/db.torproject.org/ftp-archive
rsync -a --include='*/' --exclude='*' archive/dists/bullseye/  archive/dists/bookworm/
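
The rsync filters above copy directories only and skip every file. As a rough equivalent for checking what would be created, here is a minimal sketch using find and mkdir; the function name `replicate_dirs` is hypothetical, and unlike `rsync -a` it does not preserve ownership or permissions.

```shell
# Hypothetical helper: recreate the directory tree of $1 under $2
# without copying any files. Pass $2 as an absolute path.
# Directory names containing newlines are not handled.
replicate_dirs() {
    src=$1
    dst=$2
    # list directories relative to $src, then create each one under $dst
    ( cd "$src" && find . -type d ) | while IFS= read -r d; do
        mkdir -p "$dst/$d"
    done
}
```

For example, `replicate_dirs /srv/db.torproject.org/ftp-archive/archive/dists/bullseye /tmp/bookworm-preview` builds the empty tree in a scratch location for inspection.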

Per host progress

Note that per-host upgrade policy is in howto/upgrades.

When a critical mass of servers have been upgraded and only "hard" ones remain, they can be turned into tickets and tracked in GitLab. In the meantime...

A list of servers to upgrade can be obtained with:

curl -s -G http://localhost:8080/pdb/query/v4 --data-urlencode 'query=nodes { facts { name = "lsbdistcodename" and value != "bookworm" }}' | jq '.[].certname' | sort

Or in Prometheus:

count(node_os_info{version_id!="12"}) by (alias)

Or, by codename, including the codename in the output:

count(node_os_info{version_codename!="bookworm"}) by (alias,version_codename)

[Figure: graph showing planned completion date, currently around July 2024]

The above graphic shows the progress of the migration between major releases. It can be regenerated with the [predict-os](https://gitlab.com/anarcat/predict-os) script. It pulls information from [puppet](howto/puppet) to update a [CSV file](data.csv) to keep track of progress over time. WARNING: the graph may be incorrect or missing as the upgrade procedure ramps up. This graph will be converted into a Grafana dashboard to fix that, see [issue 40512](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40512).