[[TOC]]

Tutorial

Most work on Bacula happens on the director, which is where backups are coordinated. Actual data is stored on the storage daemon, but the director is where commands are issued and jobs are scheduled.

All commands below are run from the bconsole shell, which can be started on the director with:

root@bacula-director-01:~# bconsole 
Connecting to Director bacula-director-01.torproject.org:9101
1000 OK: 103 torproject-dir Version: 9.4.2 (04 February 2019)
Enter a period to cancel a command.
*

Then you end up with a shell with * as a prompt where you can issue commands.

Checking last jobs

To see the last jobs that ran, check the status of the director:

*status director
torproject-dir Version: 9.4.2 (04 February 2019) x86_64-pc-linux-gnu debian 9.7
Daemon started 22-Jul-19 10:30, conf reloaded 23-Jul-2019 12:43:41
 Jobs: run=868, running=1 mode=0,0
 Heap: heap=7,536,640 smbytes=701,360 max_bytes=21,391,382 bufs=4,518 max_bufs=8,576
 Res: njobs=74 nclients=72 nstores=73 npools=291 ncats=1 nfsets=2 nscheds=2

Scheduled Jobs:
Level          Type     Pri  Scheduled          Job Name           Volume
===================================================================================
Full           Backup    15  03-Aug-19 02:10    BackupCatalog      *unknown*
====

Running Jobs:
Console connected using TLS at 02-Aug-19 15:41
 JobId  Type Level     Files     Bytes  Name              Status
======================================================================
107689  Back Full          0         0  chiwui.torproject.org is waiting for its start time (02-Aug 19:32)
====

Terminated Jobs:
 JobId  Level      Files    Bytes   Status   Finished        Name 
====================================================================
107680  Incr      51,879    2.408 G  OK       02-Aug-19 13:16 rouyi.torproject.org
107682  Incr         355    361.2 M  OK       02-Aug-19 13:33 henryi.torproject.org
107681  Diff      12,864    715.9 M  OK       02-Aug-19 13:34 pauli.torproject.org
107683  Incr         274    30.78 M  OK       02-Aug-19 13:50 forrestii.torproject.org
107684  Incr       3,423    2.398 G  OK       02-Aug-19 13:55 meronense.torproject.org
107685  Incr         288    32.24 M  OK       02-Aug-19 14:12 nevii.torproject.org
107686  Incr         341    69.64 M  OK       02-Aug-19 14:51 getulum.torproject.org
107687  Incr         289    26.24 M  OK       02-Aug-19 15:11 dictyotum.torproject.org
107688  Incr         376    57.62 M  OK       02-Aug-19 15:22 kvm5.torproject.org
107690  Incr         238    20.88 M  OK       02-Aug-19 15:32 opacum.torproject.org

====

Here we see that no backups are currently running (one is waiting for its start time), and the last ones completed successfully.

You can also check the status of individual clients with status client.
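For example, to check a single client (the hostname here is just an illustration; note the -fd suffix on client names):

*status client=perdulce.torproject.org-fd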

Checking messages

The messages command shows the latest messages on the bconsole. It's useful to run this command when you start your session as it will flush the (usually quite long) buffer of messages. That way the next time you call the command, you will only see the result of your latest jobs.

Running a backup

Backups are run regularly by a cron job, but if you need to run a backup immediately, this can be done in the bconsole.

The short version is to just run the run command and pick the server to back up.
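If you already know the job name, you can also skip the menu entirely with the run command's keyword arguments; for example (job name illustrative):

*run job=peninsulare.torproject.org level=Incremental yes

This queues the job immediately without the confirmation dialog.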

Longer version:

  1. enter the console on the Bacula director:

    ssh -tt bacula-director-01.torproject.org bconsole
    
  2. enter the run dialog:

    *run
    A job name must be specified.
    The defined Job resources are:
         1: RestoreFiles
         2: alberti.torproject.org
         3: archive-01.torproject.org
         [...]
         59: peninsulare.torproject.org
    
  3. pick a server; for example, peninsulare.torproject.org above is 59, so enter 59 and confirm by entering yes:

    Select Job resource (1-77): 59
    Run Backup job
    JobName:  peninsulare.torproject.org
    Level:    Incremental
    Client:   peninsulare.torproject.org-fd
    FileSet:  Standard Set
    Pool:     poolfull-torproject-peninsulare.torproject.org (From Job resource)
    Storage:  File-peninsulare.torproject.org (From Pool resource)
    When:     2019-10-11 20:57:09
    Priority: 10
    OK to run? (yes/mod/no): yes
    Job queued. JobId=113225
    
  4. Bacula confirms the job is queued. You can see the status of the job with status director, which should show a set of lines like this in the middle:

    JobId  Type Level     Files     Bytes  Name              Status
    ======================================================================
    113226  Back Incr          0         0  peninsulare.torproject.org is running
    
  5. this will take more or less time depending on the size of the server. You can call status director repeatedly to follow progress (for example, with watch "echo status director | bconsole" in another shell), or run the messages command to see new messages as they arrive. When the backup completes, you should see something like this in the status director output:

    Terminated Jobs:
     JobId  Level      Files    Bytes   Status   Finished        Name 
    ====================================================================
    113225  Incr          33    11.67 M  OK       11-Oct-19 20:59 peninsulare.torproject.org
    

That's it, new files were sucked in and you're good to do whatever nasty things you were about to do.

How-to

This section is more in-depth and will explain more concepts as we go. Relax, take a deep breath, it should go fine.

Configure backups on new machines

Backups for new machines should be automatically configured by Puppet using the bacula::client class, included everywhere (through hiera/common.yaml).

There are special configurations required for MySQL and PostgreSQL databases, see the design section for more information on those.

Restore files

Short version:

$ ssh -tt bacula-director-01.torproject.org bconsole
*restore

... and follow instructions.

Reminder: by default, backups are restored to the originating server, as the bacula user, in /var/tmp/bacula-restores. Make sure that directory has enough disk space, and if you need to restore specific ownership settings, you will want to run bacula-fd as root; see the restoring files as root section.
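For example, to check the available space on the client before launching the restore:

df -h /var/tmp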

Use the llist jobid=N and messages commands to follow progress.

The bconsole program has a pretty good interactive restore mode which you can just call with restore. It needs to know which "jobs" you want to restore from. Since a given backup job is typically incremental, you normally need multiple jobs to restore to a given point in time.

The first thing to know is that restores are done from the server to the client, i.e. they are restored directly on the machine that is backed up. Note that by default files will be owned by the bacula user because the file daemon runs as bacula in our configuration. If that's a problem for large backups, the override (in /etc/systemd/system/bacula-fd.service.d/override.conf) can be temporarily disabled by simply removing the file and restarting the service:

rm /etc/systemd/system/bacula-fd.service.d/override.conf
systemctl restart bacula-fd

And then restarting the restore procedure. Note that this file will be re-created by Puppet the next time it runs, so you may also want to temporarily disable Puppet on the client. In this configuration, however, the file daemon can overwrite any file, so be careful.
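For example (the reason string is free-form and shown to other admins):

puppet agent --disable 'to respect the bacula-fd override'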

A simple way of restoring a given client to a given point in time is to use the "Select the most recent backup for a client" option of the restore command. So:

  1. enter bconsole in a shell on the director

  2. call the restore command:

    *restore
    Automatically selected Catalog: MyCatalog
    Using Catalog "MyCatalog"
    
    First you select one or more JobIds that contain files
    to be restored. You will be presented several methods
    of specifying the JobIds. Then you will be allowed to
    select which files from those JobIds are to be restored.
    
  3. you now have a list of possible ways of restoring; choose 5: Select the most recent backup for a client:

    To select the JobIds, you have the following choices:
         1: List last 20 Jobs run
         2: List Jobs where a given File is saved
         3: Enter list of comma separated JobIds to select
         4: Enter SQL list command
         5: Select the most recent backup for a client
         6: Select backup for a client before a specified time
         7: Enter a list of files to restore
         8: Enter a list of files to restore before a specified time
         9: Find the JobIds of the most recent backup for a client
        10: Find the JobIds for a backup for a client before a specified time
        11: Enter a list of directories to restore for found JobIds
        12: Select full restore to a specified Job date
        13: Cancel
    Select item:  (1-13): 5
    
  4. you will see a list of machines; pick the machine you want to restore from by entering its number:

    Defined Clients:
         1: alberti.torproject.org-fd
    [...]
       117: yatei.torproject.org-fd
    Select the Client (1-117): 87
    
  5. you now get dropped into a file browser where you use the mark and unmark commands to select and deselect files for restore. The commands support wildcards like *. Use mark * to mark all files in the current directory; see also the full list of commands:

    Automatically selected FileSet: Standard Set
    +---------+-------+----------+-----------------+---------------------+----------------------------------------------------------+
    | jobid   | level | jobfiles | jobbytes        | starttime           | volumename                                               |
    +---------+-------+----------+-----------------+---------------------+----------------------------------------------------------+
    | 106,348 | F     |  363,125 | 157,545,039,843 | 2019-07-16 09:42:43 | torproject-full-perdulce.torproject.org.2019-07-16_09:42 |
    | 107,033 | D     |    9,136 |     691,803,964 | 2019-07-25 06:30:15 | torproject-diff-perdulce.torproject.org.2019-07-25_06:30 |
    | 107,107 | I     |    4,244 |     214,271,791 | 2019-07-26 06:11:30 | torproject-inc-perdulce.torproject.org.2019-07-26_06:11  |
    | 107,181 | I     |    4,285 |     197,548,921 | 2019-07-27 05:30:51 | torproject-inc-perdulce.torproject.org.2019-07-27_05:30  |
    | 107,257 | I     |    4,273 |     197,739,452 | 2019-07-28 04:52:15 | torproject-inc-perdulce.torproject.org.2019-07-28_04:52  |
    | 107,334 | I     |    4,302 |     218,259,369 | 2019-07-29 04:58:23 | torproject-inc-perdulce.torproject.org.2019-07-29_04:58  |
    | 107,423 | I     |    4,400 |     287,819,534 | 2019-07-30 05:42:09 | torproject-inc-perdulce.torproject.org.2019-07-30_05:42  |
    | 107,504 | I     |    4,278 |     413,289,422 | 2019-07-31 06:11:49 | torproject-inc-perdulce.torproject.org.2019-07-31_06:11  |
    | 107,587 | I     |    4,401 |     700,613,429 | 2019-08-01 07:51:52 | torproject-inc-perdulce.torproject.org.2019-08-01_07:51  |
    | 107,653 | I     |      471 |      63,370,161 | 2019-08-02 06:01:35 | torproject-inc-perdulce.torproject.org.2019-08-02_06:01  |
    +---------+-------+----------+-----------------+---------------------+----------------------------------------------------------+
    You have selected the following JobIds: 106348,107033,107107,107181,107257,107334,107423,107504,107587,107653
    
    Building directory tree for JobId(s) 106348,107033,107107,107181,107257,107334,107423,107504,107587,107653 ...
    ++++++++++++++++++++++++++++++++++++++++++++++
    335,060 files inserted into the tree.
    
    You are now entering file selection mode where you add (mark) and
    remove (unmark) files to be restored. No files are initially added, unless
    you used the "all" keyword on the command line.
    Enter "done" to leave this mode.
    
    cwd is: /
    $ mark etc
    1,921 files marked.
    

    Do not use the estimate command as it can take a long time to run and will freeze the shell.

  6. when done selecting files, call the done command:

    $ done
    [...]
    
  7. this will drop you into a confirmation dialog showing what will happen. Note the Where parameter, which shows where the files will be restored on the Restore Client. Make sure that location has enough space for the restore to complete.

    Bootstrap records written to /var/lib/bacula/torproject-dir.restore.6.bsr
    
    The Job will require the following (*=>InChanger):
       Volume(s)                 Storage(s)                SD Device(s)
    ===========================================================================
    
        torproject-full-perdulce.torproject.org.2019-07-16_09:42 File-perdulce.torproject.org FileStorage-perdulce.torproject.org
        torproject-diff-perdulce.torproject.org.2019-07-25_06:30 File-perdulce.torproject.org FileStorage-perdulce.torproject.org
        torproject-inc-perdulce.torproject.org.2019-07-26_06:11 File-perdulce.torproject.org FileStorage-perdulce.torproject.org
        torproject-inc-perdulce.torproject.org.2019-07-27_05:30 File-perdulce.torproject.org FileStorage-perdulce.torproject.org
        torproject-inc-perdulce.torproject.org.2019-07-29_04:58 File-perdulce.torproject.org FileStorage-perdulce.torproject.org
        torproject-inc-perdulce.torproject.org.2019-07-31_06:11 File-perdulce.torproject.org FileStorage-perdulce.torproject.org
        torproject-inc-perdulce.torproject.org.2019-08-01_07:51 File-perdulce.torproject.org FileStorage-perdulce.torproject.org
        torproject-inc-perdulce.torproject.org.2019-08-02_06:01 File-perdulce.torproject.org FileStorage-perdulce.torproject.org
    
    Volumes marked with "*" are in the Autochanger.
    
    1,921 files selected to be restored.
    
    Using Catalog "MyCatalog"
    Run Restore job
    JobName:         RestoreFiles
    Bootstrap:       /var/lib/bacula/torproject-dir.restore.6.bsr
    Where:           /var/tmp/bacula-restores
    Replace:         Always
    FileSet:         Standard Set
    Backup Client:   perdulce.torproject.org-fd
    Restore Client:  perdulce.torproject.org-fd
    Storage:         File-perdulce.torproject.org
    When:            2019-08-02 16:43:08
    Catalog:         MyCatalog
    Priority:        10
    Plugin Options:  *None*
    
  8. this doesn't restore the backup immediately, but schedules a job that does so:

    OK to run? (yes/mod/no): yes
    Job queued. JobId=107693
    

You can see the status of the jobs on the director with status director, but you can also check the status of a specific job with llist jobid=N, for example:

*llist JobId=107697
           jobid: 107,697
             job: RestoreFiles.2019-08-02_16.43.40_17
            name: RestoreFiles
     purgedfiles: 0
            type: R
           level: F
        clientid: 9
      clientname: dictyotum.torproject.org-fd
       jobstatus: R
       schedtime: 2019-08-02 16:43:08
       starttime: 2019-08-02 16:43:42
         endtime: 
     realendtime: 
        jobtdate: 1,564,764,222
    volsessionid: 0
  volsessiontime: 0
        jobfiles: 0
        jobbytes: 0
       readbytes: 0
       joberrors: 0
 jobmissingfiles: 0
          poolid: 0
        poolname: 
      priorjobid: 0
       filesetid: 0
         fileset: 
         hasbase: 0
        hascache: 0
         comment:

The jobstatus column is an internal database field that will show T ("terminated normally") when completed, R or C when still running or not yet started, and anything else if, well, anything else is happening. The full list of possible statuses is hidden deep in the developer documentation, obviously.
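If you are curious which statuses actually appear in the catalog, a quick read-only query against the job table on the director (connecting with sudo -u postgres psql bacula, as shown further below) gives an overview:

SELECT jobstatus, count(*) FROM job GROUP BY jobstatus ORDER BY count(*) DESC;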

The messages command is also a good way of showing the latest status, although it will flood your terminal if it hasn't been run in a long time. You can hit "enter" to see if there are new messages.

*messages
[...]
02-Aug 16:43 torproject-sd JobId 107697: Ready to read from volume "torproject-inc-perdulce.torproject.org.2019-08-02_06:01" on File device "FileStorage-perdulce.torproject.org" (/srv/backups/bacula/perdulce.torproject.org).
02-Aug 16:43 torproject-sd JobId 107697: Forward spacing Volume "torproject-inc-perdulce.torproject.org.2019-08-02_06:01" to addr=328
02-Aug 16:43 torproject-sd JobId 107697: Elapsed time=00:00:03, Transfer rate=914.8 K Bytes/second
02-Aug 16:43 torproject-dir JobId 107697: Bacula torproject-dir 9.4.2 (04Feb19):
  Build OS:               x86_64-pc-linux-gnu debian 9.7
  JobId:                  107697
  Job:                    RestoreFiles.2019-08-02_16.43.40_17
  Restore Client:         bacula-director-01.torproject.org-fd
  Where:                  /var/tmp/bacula-restores
  Replace:                Always
  Start time:             02-Aug-2019 16:43:42
  End time:               02-Aug-2019 16:43:50
  Elapsed time:           8 secs
  Files Expected:         1,921
  Files Restored:         1,921
  Bytes Restored:         2,528,685 (2.528 MB)
  Rate:                   316.1 KB/s
  FD Errors:              0
  FD termination status:  OK
  SD termination status:  OK
  Termination:            Restore OK

Once the job is done, the files will be present in the chosen location (Where) on the given server (RestoreClient).
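For example, to spot-check the restore on the client (path from the Where parameter above, directory name illustrative):

ls -l /var/tmp/bacula-restores/etc | head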

See the upstream manual for more information about the restore command.

Restoring files as root

Note that the above procedure restores files as the bacula user, which means it cannot create files owned by another user. For basic use cases that doesn't matter: file ownership can be fixed by hand after the restore.

But if you're restoring a full system or a configuration more complex than a single service, that will likely not work, because you'd need to change the ownership of files individually, and that doesn't scale.

So the solution is to run bacula-fd as root, temporarily. Follow this procedure:

  1. Disable Puppet

    systemctl stop puppet-run.timer
    
  2. Remove the configuration file that tells bacula-fd to run as the bacula user:

    rm /etc/systemd/system/bacula-fd.service.d/bacula-fd-groups.conf
    
  3. Reload the daemon:

    systemctl daemon-reload
    systemctl restart bacula-fd
    
  4. On the director, run the restore (follow the above procedure):

    bconsole
    [...]
    
  5. Files in /var/tmp/bacula-restores (or wherever you put them) should now have the right owner

  6. Reset Puppet:

    pat
    

The last step will restart the systemd timer, reset the configuration file, and restart all daemons (in theory).

TODO: Note that this procedure was never tested, but this, in theory, works. Please confirm and update this section after success or failure.

Restore a host that has been offline for a long time

If a host has been offline for a long time, its storage configuration might have been expired by Puppet. You will notice this when trying to restore some file to a different host: you will get the following error after having selected the files:

Unable to find Storage resource for
MediaType "File-<hostname>.torproject.org", needed by the Jobs you 
          selected.

In this case you might have to manually recreate a few configuration files both on bacula-director-01 and on bungei.

On bungei:

cd /etc/bacula/storage-conf.d

Create a configuration file like the following (you can also copy and edit one of the files from the other hosts):

##
## THIS FILE IS UNDER PUPPET CONTROL. DON'T EDIT IT HERE.
## USE: git clone git+ssh://$USER@puppet.debian.org/srv/puppet.debian.org/git/dsa-puppet.git
##

Device {
  Name = "FileStorage-<hostname>.torproject.org"
  Media Type = "File-<hostname>.torproject.org"
  Archive Device = "/srv/backups/bacula/<hostname>.torproject.org"
  LabelMedia = yes;
  Random Access = Yes;
  AutomaticMount = yes;
  RemovableMedia = no;
  AlwaysOpen = no;
}

Disable puppet and restart bacula-sd:

puppet agent --disable 'adding conf for <hostname> manually'
service bacula-sd restart

On bacula-director-01:

cd /etc/bacula/conf.d

Create two configuration files:

<hostname>.torproject.org.conf
<hostname>.torproject.org_storage.conf

You can copy the corresponding files from some other host and edit the hostname.

cd /etc/bacula/storages-list.d

Create a file <hostname>.torproject.org.storage, with the following line:

File-<hostname>.torproject.org

Disable puppet and restart bacula-director:

puppet agent --disable 'adding conf for <hostname> manually'
service bacula-director restart

Go ahead with the restore procedure, it should work now.

Get files without a director

If you want to get to files stored on the Bacula storage host without involving the director, they can be accessed directly as well. Remember that to Bacula everything is a tape, and /srv/backups/bacula is full of directories of tapes. You can see the contents of a tape using bls, that is, bls <file>, with a fully qualified filename, i.e. involving all the paths. bls $(readlink -f <filename>) is a handy way to get that.

root@bungei:/srv/backups/bacula/dictyotum.torproject.org# bls `readlink -f torproject-inc-dictyotum.torproject.org.2019-09-25_11:53` | head
bls: butil.c:292-0 Using device: "/srv/backups/bacula/dictyotum.torproject.org" for reading.
25-Sep 13:48 bls JobId 0: Ready to read from volume "torproject-inc-dictyotum.torproject.org.2019-09-25_11:53" on File device "FileStorage-dictyotum.torproject.org" (/srv/backups/bacula/dictyotum.torproject.org).
bls JobId 0: drwxr-xr-x   4 root     root                   1024 2019-09-07 17:01:03  /boot/
bls JobId 0: drwxr-xr-x  24 root     root                    800 2019-09-25 11:33:53  /run/
bls JobId 0: -rw-r--r--   1 root     root                  12288 2019-09-25 11:51:17  /etc/postfix/debian.db
bls JobId 0: -rw-r--r--   1 root     root                   4732 2019-09-25 11:51:17  /etc/postfix/debian
bls JobId 0: -r--r--r--   1 root     root                  28161 2019-09-25 00:55:50  /etc/ssl/torproject-auto/crls/ca.crl
...

You can then extract files from there with bextract:

bextract /srv/backups/bacula/dictyotum.torproject.org/torproject-inc-dictyotum.torproject.org.2019-09-25_11:53 /var/tmp/restore

This will extract the entire tape to /var/tmp/restore. If you want only a few files, put their names into a file such as include and call bextract with -i:

bextract -i ~/include /srv/backups/bacula/dictyotum.torproject.org/torproject-inc-dictyotum.torproject.org.2019-09-25_11:53 /var/tmp/restore
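The include file is just a list of paths to extract, one per line; for example, to create one (paths illustrative):

cat > ~/include <<EOF
/etc/passwd
/etc/ssh/sshd_config
EOF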

Restore PostgreSQL databases

See the PostgreSQL documentation for restore instructions on PostgreSQL databases.

Restore MySQL databases

MySQL restoration should be fairly straightforward. Install MySQL:

apt install mysql-server

Load the database dump:

zstd -dc /var/backups/mysql/dump.sql.zst | mysql

This restores the entire server. To restore a single database:

zstd -dc /var/backups/mysql/dump.sql.zst | mysql --one-database dbname

Only two backups are kept on the server. To access previous backups, restore them from Bacula first.

See the MySQL backup system documentation for more information.

Restore LDAP databases

See howto/ldap for LDAP-specific procedures.

Listing latest backups from a host

To see when backups have run for a host, use the list job command. For example:

list job=corsicum.torproject.org

This will list all the jobs run on corsicum that are still valid (?). Note that it will also dump all the other jobs, for some reason, but at least the corsicum jobs will be shown last so they can be found more easily.

Listing job details including duration time

The list command (above) shows summary information, but notoriously does not show the duration of a job. For this, you need the llist command. You can run it like the list command:

llist job=corsicum.torproject.org

... but that produces a lot of output. Typically, you would list only the job you are interested in:

llist jobid=183393

But then you still have to compute all those durations (end minus start) in your head, which is annoying. To work around that problem, you can talk directly to the PostgreSQL database. The magic query is:

SELECT level, jobstatus, starttime, endtime,
       (CASE WHEN endtime IS NULL THEN NOW()
       ELSE endtime END)-starttime AS duration,
       jobfiles, pg_size_pretty(jobbytes)
       FROM job
       WHERE name='colchicifolium.torproject.org'
       ORDER by starttime;

Here's an example:

root@bacula-director-01:~# sudo -u postgres psql bacula
could not change directory to "/root": Permission denied
psql (11.14 (Debian 11.14-0+deb10u1))
Type "help" for help.

bacula=# SELECT level, jobstatus, starttime, endtime, (CASE WHEN endtime IS NULL THEN NOW() ELSE endtime END)-starttime AS duration, jobfiles, pg_size_pretty(jobbytes) FROM job WHERE name='colchicifolium.torproject.org' ORDER by starttime;
level | jobstatus |      starttime      |       endtime       |       duration        | jobfiles |   jobbytes   
-------+-----------+---------------------+---------------------+-----------------------+----------+--------------
 I     | f         | 2015-12-03 09:57:02 | 2015-12-03 09:57:02 | 00:00:00              |        0 |            0
 D     | f         | 2017-12-09 00:35:08 | 2017-12-09 00:35:08 | 00:00:00              |        0 |            0
 D     | f         | 2019-03-05 18:15:28 | 2019-03-05 18:15:28 | 00:00:00              |        0 |            0
 F     | f         | 2019-07-22 14:06:13 | 2019-07-22 14:06:13 | 00:00:00              |        0 |            0
 I     | f         | 2019-09-07 20:02:52 | 2019-09-07 20:02:52 | 00:00:00              |        0 |            0
 I     | f         | 2020-12-11 02:06:57 | 2020-12-11 02:06:57 | 00:00:00              |        0 |            0
 F     | T         | 2021-10-30 04:18:48 | 2021-10-31 05:32:59 | 1 day 01:14:11        |  2973523 | 409597402632
 F     | T         | 2021-12-10 06:06:18 | 2021-12-12 01:41:37 | 1 day 19:35:19        |  3404504 | 456273938172
 D     | E         | 2022-01-12 15:03:53 | 2022-01-14 21:57:32 | 2 days 06:53:39       |  5029124 | 123658942337
 D     | T         | 2022-01-15 01:57:38 | 2022-01-17 17:24:20 | 2 days 15:26:42       |  5457677 | 130269432219
 F     | T         | 2022-01-19 22:33:54 | 2022-01-22 14:41:49 | 2 days 16:07:55       |  4336473 | 516207537019
 I     | T         | 2022-01-26 14:12:52 | 2022-01-26 16:25:40 | 02:12:48              |   185016 |   7712392837
 I     | T         | 2022-01-27 14:06:35 | 2022-01-27 16:47:50 | 02:41:15              |   188625 |   8433225061
 D     | T         | 2022-01-28 06:21:56 | 2022-01-28 18:13:24 | 11:51:28              |  1364571 |  28815354895
 I     | T         | 2022-01-29 06:41:31 | 2022-01-29 10:12:46 | 03:31:15              |   178896 |  33790932680
 I     | T         | 2022-01-30 04:46:21 | 2022-01-30 07:10:41 | 02:24:20              |   177074 |   7298789209
 I     | T         | 2022-01-31 04:19:19 | 2022-01-31 13:18:59 | 08:59:40              |   203085 |  37604120762
 I     | T         | 2022-02-01 04:11:16 | 2022-02-01 07:11:08 | 02:59:52              |   195922 |  41592974842
 I     | T         | 2022-02-02 04:30:15 | 2022-02-02 06:39:15 | 02:09:00              |   190243 |   8548513453
 I     | T         | 2022-02-03 02:55:37 | 2022-02-03 06:25:57 | 03:30:20              |   186250 |   6138223644
 I     | T         | 2022-02-04 01:06:54 | 2022-02-04 04:19:46 | 03:12:52              |   187868 |   8892468359
 I     | T         | 2022-02-05 01:46:11 | 2022-02-05 04:09:50 | 02:23:39              |   194623 |   8754299644
 I     | T         | 2022-02-06 01:45:57 | 2022-02-06 08:02:29 | 06:16:32              |   208416 |   9582975941
 D     | T         | 2022-02-06 21:07:00 | 2022-02-11 12:31:37 | 4 days 15:24:37       |  3428690 |  57424284749
 I     | T         | 2022-02-11 12:38:30 | 2022-02-11 18:52:52 | 06:14:22              |   590289 |  18987945922
 I     | T         | 2022-02-12 14:03:10 | 2022-02-12 16:36:49 | 02:33:39              |   190798 |   6760825592
 I     | T         | 2022-02-13 13:45:42 | 2022-02-13 15:34:05 | 01:48:23              |   189130 |   7132469485
 I     | T         | 2022-02-14 15:19:05 | 2022-02-14 18:58:24 | 03:39:19              |   199895 |   6797607219
 I     | T         | 2022-02-15 15:25:05 | 2022-02-15 19:40:27 | 04:15:22              |   199052 |   8115940960
 D     | T         | 2022-02-15 20:24:17 | 2022-02-19 06:54:49 | 3 days 10:30:32       |  4967994 |  77854030910
 I     | T         | 2022-02-19 07:02:32 | 2022-02-19 18:23:59 | 11:21:27              |   496812 |  24270098875
 I     | T         | 2022-02-20 07:45:46 | 2022-02-20 10:45:13 | 02:59:27              |   174086 |   7179666980
 I     | T         | 2022-02-21 06:57:49 | 2022-02-21 11:51:18 | 04:53:29              |   182035 |  15512560970
 I     | T         | 2022-02-22 05:10:39 | 2022-02-22 07:57:01 | 02:46:22              |   172397 |   7210544658
 I     | T         | 2022-02-23 06:36:44 | 2022-02-23 13:17:10 | 06:40:26              |   211809 |  29150059606
 I     | T         | 2022-02-24 05:39:43 | 2022-02-24 09:57:25 | 04:17:42              |   179419 |   7469834934
 I     | T         | 2022-02-25 05:30:58 | 2022-02-25 12:32:09 | 07:01:11              |   202945 |  30792174057
 D     | f         | 2022-02-25 12:33:48 | 2022-02-25 12:33:48 | 00:00:00              |        0 |            0
 D     | R         | 2022-02-27 18:37:53 |                     | 4 days 03:04:58.45685 |        0 |            0
(39 rows)

Here's another query showing the last 25 "Full" jobs regardless of the host:

SELECT name, jobstatus, starttime, endtime,
       (CASE WHEN endtime IS NULL THEN NOW()
       ELSE endtime END)-starttime AS duration,
       jobfiles, pg_size_pretty(jobbytes)
       FROM job
       WHERE level='F'
       ORDER by starttime DESC
       LIMIT 25;

Listing files from backups

To see which files are in a given backup job, you can use:

echo list files jobid=210810 | bconsole > list

Note that sometimes, for some obscure reason, the file list is not actually generated and the job details are listed instead:

*list files jobid=206287
Automatically selected Catalog: MyCatalog
Using Catalog "MyCatalog"
+---------+--------------------------------+---------------------+------+-------+----------+-----------------+-----------+
| jobid   | name                           | starttime           | type | level | jobfiles | jobbytes        | jobstatus |
+---------+--------------------------------+---------------------+------+-------+----------+-----------------+-----------+
| 206,287 | hetzner-nbg1-01.torproject.org | 2022-08-31 12:42:46 | B    | F     |   81,173 | 133,449,382,067 | T         |
+---------+--------------------------------+---------------------+------+-------+----------+-----------------+-----------+
*

It's unclear why this happens. It's possible that inspecting the PostgreSQL database directly would work. Meanwhile, try the latest full backup instead, which, in this case, did work:

root@bacula-director-01:~# echo list files jobid=206287 | bconsole | wc -l 
11
root@bacula-director-01:~# echo list files jobid=210810 | bconsole | wc -l 
81599
root@bacula-director-01:~#

This query will list the jobs having the given file:

SELECT jobid, job.name,type,level,starttime, path.path || filename.name AS path FROM path 
  JOIN file USING (pathid) 
  JOIN filename USING (filenameid) 
  JOIN job USING (jobid)
  WHERE path.path='/var/log/gitlab/gitlab-rails/'
    AND filename.name LIKE 'production_json.log%' 
  ORDER BY starttime DESC
  LIMIT 10;

This would list 10 files out of the backup job 251481:

SELECT jobid, job.name,type,level,starttime, path.path || filename.name AS path FROM path 
  JOIN file USING (pathid) 
  JOIN filename USING (filenameid) 
  JOIN job USING (jobid)
  WHERE jobid=251481
  ORDER BY starttime DESC
  LIMIT 10;

This will list the 10 oldest files backed up on host submit-01.torproject.org:

SELECT jobid, job.name,type,level,starttime, path.path || filename.name AS path FROM path 
  JOIN file USING (pathid) 
  JOIN filename USING (filenameid) 
  JOIN job USING (jobid)
  WHERE job.name='submit-01.torproject.org'
  ORDER BY starttime ASC
  LIMIT 10;

Excluding files from backups

Bacula has a list of files excluded from backups, mostly things like synthetic file systems (/dev, /proc, etc), cached files (e.g. /var/cache/apt), and so on.

Other files or directories can be excluded in two ways:

  1. drop a .nobackup file in a directory to exclude the entire directory (and subdirectories)

  2. add the file(s) to the /etc/bacula/local-exclude configuration file (lines that start with # are comments, one file per line)

The latter is managed by Puppet; use a file_line resource to add entries there. For example, see the profile::discourse class, which does something like:

file_line { "discourse_exclude_logs":
  path => '/etc/bacula/local-exclude',
  line => "/srv/discourse/shared/standalone/logs",
}

The .nobackup file should also be managed by Puppet. Use a .nobackup file when you are deploying a host where you control the directory, and a local-exclude entry when you do not. In the above example, Discourse manages the /srv/discourse/shared/standalone directory, so we cannot assume a .nobackup file would survive upgrades and reconfiguration by Discourse.
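A minimal sketch of managing such a marker file in Puppet (the path is illustrative):

file { '/srv/example/cache/.nobackup':
  ensure => file,
}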

How include/exclude patterns work

The exclude configuration is made in the modules/bacula/templates/bacula-dir.conf.erb Puppet template, deployed in /etc/bacula/bacula-dir.conf on the director.

The files to be included in the backups are basically "any mounted filesystem that is not a bind mount and one of ext{2,3,4}, xfs or jfs". That logic is defined in the modules/bacula/files/bacula-backup-dirs Puppet file, deployed in /usr/local/sbin/bacula-backup-dirs on backup clients.
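As a rough illustration of that logic (not the actual script, and it does not filter out bind mounts), a shell one-liner like this lists mounted filesystems of those types:

findmnt -rn -o TARGET,FSTYPE | awk '$2 ~ /^(ext2|ext3|ext4|xfs|jfs)$/ {print $1}'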

Retiring a client

Clients are managed by Puppet and their configuration will be automatically removed when the host is removed from Puppet. This is normally part of the host retirement procedure.

The procedure also takes care of removing the data from the backup storage server (in /srv/backups/bacula/, currently on bungei), but not PostgreSQL backups or the catalog on the director.

Incredibly, it seems like no one really knows how to remove a client from the catalog on the director, once they are gone. Removing the configuration is one thing, but the client is then still in the database. There are many, many, many, many questions about this everywhere, and everyone gets it wrong or doesn't care. Recommendations range from "doing nothing" (takes a lot of disk space and slows down PostgreSQL) to "dbcheck will fix this" (it didn't), neither of which worked in our case.

Amazingly, the solution is simply to call this command in bconsole:

delete client=$FQDN-fd

For example:

delete client=archeotrichon.torproject.org-fd

This will remove all jobs related to the client, and then the client itself. This is now part of the host retirement procedure.

Pager playbook

Hint: see also the PostgreSQL pager playbook documentation for the backup procedures specific to that database.

Out of disk scenario

The storage server's disk can fill up (and has), which will lead to backup jobs failing. A first sign of this is Prometheus warning about the disk being about to fill up.

Note that the disk can fill up quicker than alerting can pick up. In October 2023, 5TB was filled up in less than 24 hours (tpo/tpa/team#41361), leading to a critical notification.

Then jobs started failing:

Date: Wed, 18 Oct 2023 17:15:47 +0000
From: bacula-service@torproject.org
To: bacula-service@torproject.org
Subject: Bacula: Intervention needed for archive-01.torproject.org.2023-10-18_13.15.43_59

18-Oct 17:15 bungei.torproject.org-sd JobId 246219: Job archive-01.torproject.org.2023-10-18_13.15.43_59 is waiting. Cannot find any appendable volumes.
Please use the "label" command to create a new Volume for:
    Storage:      "FileStorage-archive-01.torproject.org" (/srv/backups/bacula/archive-01.torproject.org)
    Pool:         poolfull-torproject-archive-01.torproject.org
    Media type:   File-archive-01.torproject.org

Eventually, an email with the following first line goes out:

18-Oct 18:15 bungei.torproject.org-sd JobId 246219: Please mount append Volume "torproject-archive-01.torproject.org-full.2023-10-18_18:10" or label a new one for:

At this point, space needs to be made on the backup server. Normally, there's extra space available on the LVM volume group that can be allocated to deal with such a situation. See the output of the vgs command and follow the resize procedures in the LVM docs in that case.
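A hedged sketch of growing the backup filesystem with LVM (the volume group and logical volume names are illustrative; check vgs and lvs first, and follow the LVM docs for the real procedure):

vgs
lvextend --resizefs -L +500G /dev/vgname/srv-backups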

If there isn't any space available on the volume group, it may be acceptable to manually remove old, large files from the storage server, but that is generally not recommended. That said, old archive-01 full backups were purged from the storage server in November 2021, without ill effects (see tpo/tpa/team/-/issues/40477), with a command like:

find /srv/backups/bacula/archive-01.torproject.org-OLD -mtime +40 -delete

Once disk space is available again, there will be pending jobs listed in bconsole's status director:

JobId  Type Level     Files     Bytes  Name              Status
======================================================================
246219  Back Full    723,866    5.763 T archive-01.torproject.org is running
246222  Back Incr          0         0  dangerzone-01.torproject.org is waiting for a mount request
246223  Back Incr          0         0  ns5.torproject.org is waiting for a mount request
246224  Back Incr          0         0  tb-build-05.torproject.org is waiting for a mount request
246225  Back Incr          0         0  crm-ext-01.torproject.org is waiting for a mount request
246226  Back Incr          0         0  media-01.torproject.org is waiting for a mount request
246227  Back Incr          0         0  weather-01.torproject.org is waiting for a mount request
246228  Back Incr          0         0  neriniflorum.torproject.org is waiting for a mount request
246229  Back Incr          0         0  tb-build-02.torproject.org is waiting for a mount request
246230  Back Incr          0         0  survey-01.torproject.org is waiting for a mount request

In the above, the archive-01 job was the one that took up all the free space. That job was restarted and is shown running above, but all the other ones were waiting for a mount request. The solution is to just issue that mount, with their job ID; for example, for the dangerzone-01 job above:

bconsole> mount jobid=246222

This should resume all jobs and eventually fix the warnings from monitoring.

Note that when the available space becomes too low (say less than 10% of the volume size), plans should be made to order new hardware, so once the emergency subsides, a ticket should be created for followup.

Out of date backups

If a job is behaving strangely, you can inspect its job log to see what's going on. First, you'll need to list the latest backups for that host (see listing latest backups from a host above):

list job=FQDN

Then you can list the job log with the command below (bconsole may display JOBID values with commas every 3 digits; you need to remove those when typing the command):

list joblog jobid=JOBID

If this is a new server, it's possible the storage server doesn't know about it. In this case, the jobs will try to run but fail, and you will get warnings by email, see the unavailable storage scenario for details.

See below for more examples.

Slow jobs

Looking at the Bacula director status, it says this:

Console connected using TLS at 10-Jan-20 18:19
 JobId  Type Level     Files     Bytes  Name              Status
======================================================================
120225  Back Full    833,079    123.5 G colchicifolium.torproject.org is running
120230  Back Full  4,864,515    218.5 G colchicifolium.torproject.org is waiting on max Client jobs
120468  Back Diff     30,694    3.353 G gitlab-01.torproject.org is running
====

Which is strange because those JobId numbers are very low compared to (say) the GitLab backup job. To inspect the job log, you use the list command:

*list joblog jobid=120225
+----------------------------------------------------------------------------------------------------+
| logtext                                                                                              |
+----------------------------------------------------------------------------------------------------+
| bacula-director-01.torproject.org-dir JobId 120225: Start Backup JobId 120225, Job=colchicifolium.torproject.org.2020-01-07_17.00.36_03 |
| bacula-director-01.torproject.org-dir JobId 120225: Created new Volume="torproject-colchicifolium.torproject.org-full.2020-01-07_17:00", Pool="poolfull-torproject-colchicifolium.torproject.org", MediaType="File-colchicifolium.torproject.org" in catalog. |
[...]
| bacula-director-01.torproject.org-dir JobId 120225: Fatal error: Network error with FD during Backup: ERR=No data available |
| bungei.torproject.org-sd JobId 120225: Fatal error: append.c:170 Error reading data header from FD. n=-2 msglen=0 ERR=No data available |
| bungei.torproject.org-sd JobId 120225: Elapsed time=00:03:47, Transfer rate=7.902 M Bytes/second     |
| bungei.torproject.org-sd JobId 120225: Sending spooled attrs to the Director. Despooling 14,523,001 bytes ... |
| bungei.torproject.org-sd JobId 120225: Fatal error: fd_cmds.c:225 Command error with FD msg="", SD hanging up. ERR=Error getting Volume info: 1998 Volume "torproject-colchicifolium.torproject.org-full.2020-01-07_17:00" catalog status is Used, but should be Append, Purged or Recycle. |
| bacula-director-01.torproject.org-dir JobId 120225: Fatal error: No Job status returned from FD.     |
[...]
| bacula-director-01.torproject.org-dir JobId 120225: Rescheduled Job colchicifolium.torproject.org.2020-01-07_17.00.36_03 at 07-Jan-2020 17:09 to re-run in 14400 seconds (07-Jan-2020 21:09). |
| bacula-director-01.torproject.org-dir JobId 120225: Error: openssl.c:68 TLS shutdown failure.: ERR=error:14094123:SSL routines:ssl3_read_bytes:application data after close notify |
| bacula-director-01.torproject.org-dir JobId 120225: Job colchicifolium.torproject.org.2020-01-07_17.00.36_03 waiting 14400 seconds for scheduled start time. |
| bacula-director-01.torproject.org-dir JobId 120225: Restart Incomplete Backup JobId 120225, Job=colchicifolium.torproject.org.2020-01-07_17.00.36_03 |
| bacula-director-01.torproject.org-dir JobId 120225: Found 78113 files from prior incomplete Job.     |
| bacula-director-01.torproject.org-dir JobId 120225: Created new Volume="torproject-colchicifolium.torproject.org-full.2020-01-10_12:11", Pool="poolfull-torproject-colchicifolium.torproject.org", MediaType="File-colchicifolium.torproject.org" in catalog. |
| bacula-director-01.torproject.org-dir JobId 120225: Using Device "FileStorage-colchicifolium.torproject.org" to write. |
| bacula-director-01.torproject.org-dir JobId 120225: Sending Accurate information to the FD.          |
| bungei.torproject.org-sd JobId 120225: Labeled new Volume "torproject-colchicifolium.torproject.org-full.2020-01-10_12:11" on File device "FileStorage-colchicifolium.torproject.org" (/srv/backups/bacula/colchicifolium.torproject.org). |
| bungei.torproject.org-sd JobId 120225: Wrote label to prelabeled Volume "torproject-colchicifolium.torproject.org-full.2020-01-10_12:11" on File device "FileStorage-colchicifolium.torproject.org" (/srv/backups/bacula/colchicifolium.torproject.org) |
| bacula-director-01.torproject.org-dir JobId 120225: Max Volume jobs=1 exceeded. Marking Volume "torproject-colchicifolium.torproject.org-full.2020-01-10_12:11" as Used. |
| colchicifolium.torproject.org-fd JobId 120225:      /run is a different filesystem. Will not descend from / into it. |
| colchicifolium.torproject.org-fd JobId 120225:      /home is a different filesystem. Will not descend from / into it. |
+----------------------------------------------------------------------------------------------------+
+---------+-------------------------------+---------------------+------+-------+----------+---------------+-----------+
| jobid   | name                          | starttime           | type | level | jobfiles | jobbytes      | jobstatus |
+---------+-------------------------------+---------------------+------+-------+----------+---------------+-----------+
| 120,225 | colchicifolium.torproject.org | 2020-01-10 12:11:51 | B    | F     |   77,851 | 1,759,625,288 | R         |
+---------+-------------------------------+---------------------+------+-------+----------+---------------+-----------+

So that job failed three days ago, but now it's actually running. In this case, it might be safe to just ignore the warnings from monitoring and hope that the rescheduled backup will eventually go through. The duplicate job is also fine: worst case, it will just run after the first one does, resulting in a bit more I/O than we'd like.

"waiting to reserve a device"

This can happen in two cases: if a job is hung and blocking the storage daemon, or if the storage daemon is not aware of the host to back up.

If the job is repeatedly outputting:

waiting to reserve a device

It's the first, "hung job" scenario.

If you have the error:

Storage daemon didn't accept Device "FileStorage-rdsys-test-01.torproject.org" command.

It's the second, "unavailable storage" scenario.

Hung job scenario

If a job is continuously reporting an error like:

07-Dec 16:38 bungei.torproject.org-sd JobId 146833: JobId=146833, Job colchicifolium.torproject.org.2020-12-07_15.18.44_05 waiting to reserve a device.

It is because the backup volume is already used by a job. Normally our scheduler should avoid overlapping jobs like this, but it can happen that a job is left over when the director is rebooted while jobs are still running.

In this case, we looked at the storage status for more information:

root@bacula-director-01:~# bconsole
Connecting to Director bacula-director-01.torproject.org:9101
1000 OK: 103 bacula-director-01.torproject.org-dir Version: 9.4.2 (04 February 2019)
Enter a period to cancel a command.
*status
Status available for:
     1: Director
     2: Storage
     3: Client
     4: Scheduled
     5: Network
     6: All
Select daemon type for status (1-6): 2
Automatically selected Storage: File-alberti.torproject.org
Connecting to Storage daemon File-alberti.torproject.org at bungei.torproject.org:9103

bungei.torproject.org-sd Version: 9.4.2 (04 February 2019) x86_64-pc-linux-gnu debian 10.5
Daemon started 21-Nov-20 17:58. Jobs: run=1280, running=2.
 Heap: heap=331,776 smbytes=3,226,693 max_bytes=943,958,428 bufs=1,008 max_bufs=5,349,436
 Sizes: boffset_t=8 size_t=8 int32_t=4 int64_t=8 mode=0,0 newbsr=0
 Res: ndevices=79 nautochgr=0

Running Jobs:
Writing: Differential Backup job colchicifolium.torproject.org JobId=146826 Volume="torproject-colchicifolium.torproject.org-diff.2020-12-07_04:52"
    pool="pooldiff-torproject-colchicifolium.torproject.org" device="FileStorage-colchicifolium.torproject.org" (/srv/backups/bacula/colchicifolium.torproject.org)
    spooling=0 despooling=0 despool_wait=0
    Files=585,044 Bytes=69,749,764,302 AveBytes/sec=1,691,641 LastBytes/sec=2,204,539
    FDReadSeqNo=4,517,231 in_msg=3356877 out_msg=6 fd=10
Writing: Differential Backup job corsicum.torproject.org JobId=146831 Volume="torproject-corsicum.torproject.org-diff.2020-12-07_15:18"
    pool="pooldiff-torproject-corsicum.torproject.org" device="FileStorage-corsicum.torproject.org" (/srv/backups/bacula/corsicum.torproject.org)
    spooling=0 despooling=0 despool_wait=0
    Files=2,275,005 Bytes=99,866,623,456 AveBytes/sec=25,966,360 LastBytes/sec=30,624,588
    FDReadSeqNo=15,048,645 in_msg=10505635 out_msg=6 fd=13
Writing: Differential Backup job colchicifolium.torproject.org JobId=146833 Volume="torproject-corsicum.torproject.org-diff.2020-12-07_15:18"
    pool="pooldiff-torproject-colchicifolium.torproject.org" device="FileStorage-colchicifolium.torproject.org" (/srv/backups/bacula/colchicifolium.torproject.org)
    spooling=0 despooling=0 despool_wait=0
    Files=0 Bytes=0 AveBytes/sec=0 LastBytes/sec=0
    FDSocket closed
====

Jobs waiting to reserve a drive:
   3611 JobId=146833 Volume max jobs=1 exceeded on File device "FileStorage-colchicifolium.torproject.org" (/srv/backups/bacula/colchicifolium.torproject.org).
====

[...]

The last line is the error we're getting (in the messages output of the console, but also, more annoyingly, by email). The Running jobs list is more interesting: it's telling us there are three jobs running for the server, two of which are for the same host (JobId=146826 and JobId=146833). We can look at those jobs' logs in more detail to figure out what is going on:

*list joblog jobid=146826
+----------------------------------------------------------------------------------------------------+
| logtext                                                                                              |
+----------------------------------------------------------------------------------------------------+
| bacula-director-01.torproject.org-dir JobId 146826: Start Backup JobId 146826, Job=colchicifolium.torproject.org.2020-12-07_04.45.53_42 |
| bacula-director-01.torproject.org-dir JobId 146826: There are no more Jobs associated with Volume "torproject-colchicifolium.torproject.org-diff.2020-10-13_09:54". Marking it purged. |
| bacula-director-01.torproject.org-dir JobId 146826: New Pool is: poolgraveyard-torproject-colchicifolium.torproject.org |
| bacula-director-01.torproject.org-dir JobId 146826: All records pruned from Volume "torproject-colchicifolium.torproject.org-diff.2020-10-13_09:54"; marking it "Purged" |
| bacula-director-01.torproject.org-dir JobId 146826: Created new Volume="torproject-colchicifolium.torproject.org-diff.2020-12-07_04:52", Pool="pooldiff-torproject-colchicifolium.torproject.org", MediaType="File-colchicifolium.torproject.org" in catalog. |
| bacula-director-01.torproject.org-dir JobId 146826: Using Device "FileStorage-colchicifolium.torproject.org" to write. |
| bacula-director-01.torproject.org-dir JobId 146826: Sending Accurate information to the FD.          |
| bungei.torproject.org-sd JobId 146826: Labeled new Volume "torproject-colchicifolium.torproject.org-diff.2020-12-07_04:52" on File device "FileStorage-colchicifolium.torproject.org" (/srv/backups/bacula/colchicifolium.torproject.org). |
| bungei.torproject.org-sd JobId 146826: Wrote label to prelabeled Volume "torproject-colchicifolium.torproject.org-diff.2020-12-07_04:52" on File device "FileStorage-colchicifolium.torproject.org" (/srv/backups/bacula/colchicifolium.torproject.org) |
| bacula-director-01.torproject.org-dir JobId 146826: Max Volume jobs=1 exceeded. Marking Volume "torproject-colchicifolium.torproject.org-diff.2020-12-07_04:52" as Used. |
| colchicifolium.torproject.org-fd JobId 146826:      /home is a different filesystem. Will not descend from / into it. |
| colchicifolium.torproject.org-fd JobId 146826:      /run is a different filesystem. Will not descend from / into it. |
+----------------------------------------------------------------------------------------------------+
+---------+-------------------------------+---------------------+------+-------+----------+----------+-----------+
| jobid   | name                          | starttime           | type | level | jobfiles | jobbytes | jobstatus |
+---------+-------------------------------+---------------------+------+-------+----------+----------+-----------+
| 146,826 | colchicifolium.torproject.org | 2020-12-07 04:52:15 | B    | D     |        0 |        0 | f         |
+---------+-------------------------------+---------------------+------+-------+----------+----------+-----------+

This job is strange, because it is considered to be running on the storage server, but marked as failed (jobstatus=f) on the director. It doesn't have the normal trailing information job logs get when a job completes, so it was possibly interrupted. And indeed, there was a reboot of the director on that day:

reboot   system boot  4.19.0-13-amd64  Mon Dec  7 15:14   still running

As far as the director is concerned, the job failed and is completed:

*llist  jobid=146826
           jobid: 146,826
             job: colchicifolium.torproject.org.2020-12-07_04.45.53_42
            name: colchicifolium.torproject.org
     purgedfiles: 0
            type: B
           level: D
        clientid: 55
      clientname: colchicifolium.torproject.org-fd
       jobstatus: f
       schedtime: 2020-12-07 04:45:53
       starttime: 2020-12-07 04:52:15
         endtime: 2020-12-07 04:52:15
     realendtime: 
        jobtdate: 1,607,316,735
    volsessionid: 0
  volsessiontime: 0
        jobfiles: 0
        jobbytes: 0
       readbytes: 0
       joberrors: 0
 jobmissingfiles: 0
          poolid: 221
        poolname: pooldiff-torproject-colchicifolium.torproject.org
      priorjobid: 0
       filesetid: 1
         fileset: Standard Set
         hasbase: 0
        hascache: 0
         comment:

That leftover job is what makes the next job hang. We can see the errors in that other job's logs:

*list joblog jobid=146833
+----------------------------------------------------------------------------------------------------+
| logtext                                                                                              |
+----------------------------------------------------------------------------------------------------+
| bacula-director-01.torproject.org-dir JobId 146833: Start Backup JobId 146833, Job=colchicifolium.torproject.org.2020-12-07_15.18.44_05 |
| bacula-director-01.torproject.org-dir JobId 146833: Created new Volume="torproject-colchicifolium.torproject.org-diff.2020-12-07_15:18", Pool="pooldiff-torproject-colchicifolium.torproject.org", MediaType="File-colchicifolium.torproject.org" in catalog. |
| bungei.torproject.org-sd JobId 146833: JobId=146833, Job colchicifolium.torproject.org.2020-12-07_15.18.44_05 waiting to reserve a device. |
| bungei.torproject.org-sd JobId 146833: JobId=146833, Job colchicifolium.torproject.org.2020-12-07_15.18.44_05 waiting to reserve a device. |
| bungei.torproject.org-sd JobId 146833: JobId=146833, Job colchicifolium.torproject.org.2020-12-07_15.18.44_05 waiting to reserve a device. |
| bungei.torproject.org-sd JobId 146833: JobId=146833, Job colchicifolium.torproject.org.2020-12-07_15.18.44_05 waiting to reserve a device. |
| bungei.torproject.org-sd JobId 146833: JobId=146833, Job colchicifolium.torproject.org.2020-12-07_15.18.44_05 waiting to reserve a device. |
| bungei.torproject.org-sd JobId 146833: JobId=146833, Job colchicifolium.torproject.org.2020-12-07_15.18.44_05 waiting to reserve a device. |
| bungei.torproject.org-sd JobId 146833: JobId=146833, Job colchicifolium.torproject.org.2020-12-07_15.18.44_05 waiting to reserve a device. |
| bungei.torproject.org-sd JobId 146833: JobId=146833, Job colchicifolium.torproject.org.2020-12-07_15.18.44_05 waiting to reserve a device. |
| bungei.torproject.org-sd JobId 146833: JobId=146833, Job colchicifolium.torproject.org.2020-12-07_15.18.44_05 waiting to reserve a device. |
| bungei.torproject.org-sd JobId 146833: JobId=146833, Job colchicifolium.torproject.org.2020-12-07_15.18.44_05 waiting to reserve a device. |
| bungei.torproject.org-sd JobId 146833: JobId=146833, Job colchicifolium.torproject.org.2020-12-07_15.18.44_05 waiting to reserve a device. |
| bungei.torproject.org-sd JobId 146833: JobId=146833, Job colchicifolium.torproject.org.2020-12-07_15.18.44_05 waiting to reserve a device. |
| bungei.torproject.org-sd JobId 146833: JobId=146833, Job colchicifolium.torproject.org.2020-12-07_15.18.44_05 waiting to reserve a device. |
| bungei.torproject.org-sd JobId 146833: JobId=146833, Job colchicifolium.torproject.org.2020-12-07_15.18.44_05 waiting to reserve a device. |
| bungei.torproject.org-sd JobId 146833: JobId=146833, Job colchicifolium.torproject.org.2020-12-07_15.18.44_05 waiting to reserve a device. |
| bungei.torproject.org-sd JobId 146833: JobId=146833, Job colchicifolium.torproject.org.2020-12-07_15.18.44_05 waiting to reserve a device. |
| bungei.torproject.org-sd JobId 146833: JobId=146833, Job colchicifolium.torproject.org.2020-12-07_15.18.44_05 waiting to reserve a device. |
+----------------------------------------------------------------------------------------------------+
+---------+-------------------------------+---------------------+------+-------+----------+----------+-----------+
| jobid   | name                          | starttime           | type | level | jobfiles | jobbytes | jobstatus |
+---------+-------------------------------+---------------------+------+-------+----------+----------+-----------+
| 146,833 | colchicifolium.torproject.org | 2020-12-07 15:18:46 | B    | D     |        0 |        0 | R         |
+---------+-------------------------------+---------------------+------+-------+----------+----------+-----------+

Curiously, the fix here is to cancel the job generating the warnings, in bconsole:

cancel 146833

It's unclear why this works: normally, the other blocking job should be stopped and cleaned up. But in this case, canceling the blocked job resolved the problem and the warning went away. It is assumed the problem will not return on the next job run. See issue 40110 for one example of this problem.

Unavailable storage scenario

If you see an error like:

 Storage daemon didn't accept Device "FileStorage-rdsys-test-01.torproject.org" command.

It's because the storage server (currently bungei) doesn't know about the host to back up. Restart the storage daemon on the storage server to fix this:

service bacula-sd restart

Normally, Puppet is supposed to take care of those restarts, but it can happen that the restarts don't work (presumably because the storage server doesn't do a clean restart when a backup is already running).
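
A quick way to check whether a backup is currently using the storage daemon before restarting it is something like the following (bungei being the current storage server; the grep is only there to trim the output):

echo 'status director' | bconsole | grep -A 10 'Running Jobs'
ssh bungei.torproject.org systemctl status bacula-sd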

Job disappeared

Another example is this:

*list job=metricsdb-01.torproject.org
Using Catalog "MyCatalog"
+---------+-----------------------------+---------------------+------+-------+-----------+----------------+-----------+
| jobid   | name                        | starttime           | type | level | jobfiles  | jobbytes       | jobstatus |
+---------+-----------------------------+---------------------+------+-------+-----------+----------------+-----------+
| 277,014 | metricsdb-01.torproject.org | 2024-09-08 09:00:26 | B    | F     |   240,183 | 66,850,988,860 | T         |
[...]
| 286,148 | metricsdb-01.torproject.org | 2024-12-11 19:15:46 | B    | I     |         0 |              0 | R         |
+---------+-----------------------------+---------------------+------+-------+-----------+----------------+-----------+

In this case, the job has supposedly been running since 2024-12-11, but we're a week past that, so it has probably just disappeared.

The first step to fix this is to cancel this job:

cancel jobid=JOBID

This, however, is likely to tell you the disappointing:

*cancel jobid=286148
Warning Job JobId=286148 is not running.

In that case, try to just run a new backup.

This should get rid of the alert, but not of the underlying problem, as the scheduler will still be confused by the stale job. For that you need to do some plumbing in the PostgreSQL database:

root@bacula-director-01:~# sudo -u postgres psql bacula
could not change directory to "/root": Permission denied
psql (15.10 (Debian 15.10-0+deb12u1))
Type "help" for help.
bacula=# BEGIN;
BEGIN
bacula=# update job set jobstatus='A' where name='metricsdb-01.torproject.org' and jobid=286148;
UPDATE 1
bacula=# COMMIT;
COMMIT
bacula=# 
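
If you're not sure which job is the stale one, a query along these lines (assuming the standard Bacula catalog schema) lists everything still marked as running:

bacula=# SELECT jobid, name, starttime, jobstatus FROM job WHERE jobstatus = 'R' ORDER BY starttime;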

Then, in bconsole, you should see the backup job running within a couple minutes at most:

Running Jobs:                                                                                                                                                   
Console connected using TLS at 21-Dec-24 15:52                                                                                                                  
 JobId  Type Level     Files     Bytes  Name              Status                                                                                                
======================================================================                                                                                          
287086  Back Diff          0         0  metricsdb-01.torproject.org is running                                                                                  
====

Bacula GDB traceback / Connection refused / Cannot assign requested address: Retrying

If you get an email from the director stating that it can't connect to the file daemon on a machine:

09-Mar 04:45 bacula-director-01.torproject.org-dir JobId 154835: Fatal error: bsockcore.c:209 Unable to connect to Client: scw-arm-par-01.torproject.org-fd on scw-arm-par-01.torproject.org:9102. ERR=Connection refused

You might also receive an error like this:

From: root@forrestii.torproject.org
To: root@forrestii.torproject.org
Date: Thu, 26 Mar 2020 00:31:44 +0000
Subject: Bacula GDB traceback of bacula-fd on forrestii

/usr/sbin/btraceback: 60: /usr/sbin/btraceback: gdb: not found

In any case, log onto the affected server (in the first case, scw-arm-par-01.torproject.org) and look at the bacula-fd service:

service bacula-fd status

If you see an error like:

Warning: Cannot bind port 9102: ERR=Cannot assign requested address: Retrying ...

It's Bacula being a bit silly and failing to bind on the external interface, possibly because of an incorrect /etc/hosts. This happens particularly "in the cloud", where IP addresses are in the RFC1918 space and change unpredictably.

In the above case, it was simply a matter of adding the IPv4 and IPv6 addresses to /etc/hosts, and restarting bacula-fd:

vi /etc/hosts
service bacula-fd restart
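
For illustration, the /etc/hosts entries added look something like this (the addresses below are placeholders from the documentation ranges, not the machine's real ones):

203.0.113.42    scw-arm-par-01.torproject.org scw-arm-par-01
2001:db8::42    scw-arm-par-01.torproject.org scw-arm-par-01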

The GDB errors were documented in issue 33732.

Disaster recovery

Restoring the director server

If the storage daemon disappears catastrophically, there's nothing we can do: the data is lost. But if the director disappears, we can still restore from backups. Those instructions should cover the case where we need to rebuild the director from backups. The director is, essentially, a PostgreSQL database. Therefore, the restore procedure is to restore that database, along with some configuration.

This procedure can also be used to rotate or replace a still-running director.

  1. if the old director is still running, start a fresh backup of the old database cluster from the storage server:

    ssh -tt bungei sudo -u torbackup postgres-make-base-backups dictyotum.torproject.org:5433 &
    
  2. disable puppet on the old director:

    ssh dictyotum.torproject.org puppet agent --disable 'disabling scheduler -- anarcat 2019-10-10'
    
  3. disable scheduler, by commenting out the cron job, and wait for jobs to complete, then shutdown the old director:

    sed -i '/dsa-bacula-scheduler/s/^/#/' /etc/cron.d/puppet-crontab
    watch -c "echo 'status director' | bconsole "
    service bacula-director stop
    

    TODO: this could be improved: <weasel> it's idle when there are no non-idle 'postgres: bacula bacula' processes and it doesn't have any open tcp connections?

  4. create a new machine (see howto/new-machine) and run howto/puppet with the roles::backup::director class applied to the node, say in hiera/nodes/bacula-director-01.yaml:

    classes:
    - roles::backup::director
    bacula::client::director_server: 'bacula-director-01.torproject.org'
    

    This should restore a basic Bacula configuration with the director acting, weirdly, as its own director.

  5. Run Puppet by hand on the new director and the storage server a few times, so their manifests converge:

    ssh bungei.torproject.org puppet agent -t
    ssh bacula-director-01.torproject.org puppet agent -t
    ssh bungei.torproject.org puppet agent -t
    ssh bacula-director-01.torproject.org puppet agent -t
    ssh bungei.torproject.org puppet agent -t
    ssh bacula-director-01.torproject.org puppet agent -t
    ssh bungei.torproject.org puppet agent -t
    ssh bacula-director-01.torproject.org puppet agent -t
    

    The Puppet manifests will fail because PostgreSQL is not installed. And even if it were, it would still fail because it doesn't have the right passwords. For now, PostgreSQL is configured by hand.

    TODO: Do consider deploying it with Puppet, as discussed in howto/postgresql.

  6. Install the right version of PostgreSQL.

    It might be the case that backups of the director are from an earlier version of PostgreSQL than the version available in the new machine. In that case, an older sources.list needs to be added:

    cat > /etc/apt/sources.list.d/stretch.list <<EOF
    deb https://deb.debian.org/debian/  stretch  main
    deb http://security.debian.org/ stretch/updates  main
    EOF
    apt update
    

    Actually install the server:

    apt install -y postgresql-9.6
    
  7. Once the base backup from step one is completed (or if there is no old director left), restore the cluster on the new host; see the PostgreSQL Backup recovery instructions.

  8. You will also need to restore the file /etc/dsa/bacula-reader-database from backups (see "Getting files without a director", below), as that file is not (currently) managed through howto/puppet (TODO). Alternatively, that file can be recreated by hand, using a syntax like this:

    user=bacula-dictyotum-reader password=X dbname=bacula host=localhost
    

    The matching user will need to have its password modified to match X, obviously:

    sudo -u postgres psql -c '\password bacula-dictyotum-reader'
    
  9. reset the password of the Bacula director, as it changed in puppet:

    grep dbpassword /etc/bacula/bacula-dir.conf | cut -f2 -d\"
    sudo -u postgres psql -c '\password bacula'
    

    same for the torbackup user:

    ssh bungei.torproject.org grep director /home/torbackup/.pgpass
    ssh bacula-director-01 -tt sudo -u postgres psql -c '\password bacula'
    
  10. copy over the pg_hba.conf and postgresql.conf (now conf.d/tor.conf) from the previous director cluster configuration (e.g. /var/lib/postgresql/9.6/main) to the new one (TODO: put in howto/puppet). Make sure that:

    • the cluster name (e.g. main or bacula) is correct in the archive_command (see the illustrative snippet after this list)
    • the ssl_cert_file and ssl_key_file point to valid SSL certs
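
    For illustration only, the relevant lines might look roughly like this; the script name and paths below are hypothetical placeholders, so copy the real values from the old cluster's configuration:

    # hypothetical sketch, not the production values
    archive_command = '/usr/local/bin/pg-archive-wal bacula %p'   # "bacula" must match the cluster name
    ssl_cert_file = '/etc/ssl/torproject/bacula-director-01.torproject.org.crt'
    ssl_key_file = '/etc/ssl/private/bacula-director-01.torproject.org.key'
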
  11. Once you have the PostgreSQL database cluster restored, start the director:

     systemctl start bacula-director
    
  12. Then everything should be fairies and magic and happiness all over again. Check that everything works with:

     bconsole
    

    Run a few of the "Basic commands" above, to make sure we have everything. For example, list jobs should show the latest jobs run on the director. It's normal that status director does not show those, however.

  13. Enable puppet on the director again.

     puppet agent -t
    

    This involves (optionally) keeping a lock on the scheduler so it doesn't immediately start. If you're confident (not tested!), this step might be skipped:

     flock -w 0 -e /usr/local/sbin/dsa-bacula-scheduler sleep infinity
    
  14. to switch a single node, configure its director in tor-puppet/hiera/nodes/$FQDN.yaml where $FQDN is the fully qualified domain name of the machine (e.g. tor-puppet/hiera/nodes/perdulce.torproject.org.yaml):

     bacula::client::director_server: 'bacula-director-01.torproject.org'
    

    Then run puppet on that node, the storage, and the director server:

     ssh perdulce.torproject.org puppet agent -t
     ssh bungei.torproject.org puppet agent -t
     ssh bacula-director-01.torproject.org puppet agent -t
    

    Then test a backup job for that host: in bconsole, call run and pick that server, which should now show up.

  15. switch all nodes to the new director, in tor-puppet/hiera/common.yaml:

     bacula::client::director_server: 'bacula-director-01.torproject.org'
    
  16. run howto/puppet everywhere (or wait for it to run):

     cumin -b 5 -p 0 -o txt '*' 'puppet agent -t'
    

    Then make sure the storage and director servers are also up to date:

     ssh bungei.torproject.org puppet agent -t
     ssh bacula-director-01.torproject.org puppet agent -t
    
  17. if you held a lock on the scheduler, it can be removed:

    killall sleep
    
  18. you will also need to restore the password file for the Nagios check in /etc/nagios/bacula-database

  19. switch the director in /etc/dsa/bacula-reader-database or /etc/postgresql-common/pg_service.conf to point to the new host

The new scheduler and director should now have completely taken over from the old one, and backups should resume. The old server can now be decommissioned, if it's still around, when you feel comfortable the new setup is working.

TODO: some psql users still refer to host-specific usernames like bacula-dictyotum-reader, maybe they should refer to role-specific names instead?

Troubleshooting

If you get this error:

psycopg2.OperationalError: definition of service "bacula" not found

It's probably the scheduler failing to connect to the database server, because the /etc/dsa/bacula-reader-database refers to a non-existent "service", as defined in /etc/postgresql-common/pg_service.conf. Either add something like:

[bacula]
dbname=bacula
port=5433

to that file, or specify the dbname and port manually in the configuration file.
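
For reference, an /etc/dsa/bacula-reader-database that uses the service would contain a single libpq connection string along these lines (user name and password are placeholders):

user=bacula-director-01-reader password=X service=bacula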

If the scheduler is sending you an email every three minutes with this error:

FileNotFoundError: [Errno 2] No such file or directory: '/etc/dsa/bacula-reader-database'

It's because you forgot to create that file, in step 8. Similar errors may occur if you forgot to change that password.

If the director takes a long time to start and ultimately fails with:

oct 10 18:19:41 bacula-director-01 bacula-dir[31276]: bacula-dir JobId 0: Fatal error: Could not open Catalog "MyCatalog", database "bacula".
oct 10 18:19:41 bacula-director-01 bacula-dir[31276]: bacula-dir JobId 0: Fatal error: postgresql.c:332 Unable to connect to PostgreSQL server. Database=bacula User=bac
oct 10 18:19:41 bacula-director-01 bacula-dir[31276]: Possible causes: SQL server not running; password incorrect; max_connections exceeded.

It's because you forgot to reset the director password, in step 9.

Recovering deleted files

This is not specific to the backup server, but could be seen as a (no)backup/restore situation, and besides, it's not clear where else this would fit.

If a file was deleted by mistake and it is gone from the backup server, not all is lost. This is the story of how an entire PostgreSQL cluster was deleted in production, then, 7 days later, from the backup servers. Files were completely gone from the filesystem, both on the production server and on the backup server, see issue 41388.

In the following, we'll assume you're working on files deleted multiple days in the past. For files deleted more recently, you might have better luck with ext4magic, which can tap into the journal to find recently deleted files more easily. Example commands you might try:

umount /srv/backup/pg
extundelete --restore-all /dev/mapper/vg_bulk-backups--pg
ext4magic /dev/vg_bulk/backups-pg -f weather-01-13
ext4magic /dev/vg_bulk/backups-pg -RQ -f weather-01-13
ext4magic /dev/vg_bulk/backups-pg -Lx -f weather-01-13
ext4magic /dev/mapper/vg_bulk-backups--pg -b $(date -d "2023-11-01 12:00:00" +%s) -a $(date -d "2023-10-30 12:00:00" +%s) -l

In this case, we're actually going to scrub the entire "free space" area of the disk to hunt for file signatures.

  1. unmount the affected filesystem:

    umount /srv/backup/pg
    
  2. start photorec, part of the testdisk package:

    photorec /dev/mapper/vg_bulk-backups--pg
    
  3. this will get you into an interactive interface, where you should choose to inspect free space and leave most options as is, although you should probably only select tar and gz files to restore. pick a directory with a lot of free space to restore to.

  4. start the procedure. photorec will inspect the entire disk looking for signatures. in this case we're assuming we will be able to restore the "BASE" backups.

  5. once photorec starts reporting it found .gz files, you can already start inspecting those, for example with this shell rune:

    for file in recup_dir.*/*gz; do
        tar -O -x -z -f "$file" backup_label 2>/dev/null \
            | grep weather && ls -alh "$file"
    done
    

    here we're iterating over all restored files in the current directory (photorec puts files in recup_dir.N directories, where N is some arbitrary-looking integer), trying to decompress each file, ignoring errors because restored files are typically truncated or padded with garbage, extracting only the backup_label file to stdout, and looking for the hostname (in this case weather); if it matches, we list the file size (phew!)

  6. once the recovery is complete, you will end up with a ton of recovered files. using the above pipeline, you might be lucky and find a base backup that makes sense. copy those files over to the actual server (or a new one), e.g. (assuming you set up SSH keys right):

    rsync --progress /srv/backups/bacula/recup_dir.20/f3005349888.gz root@weather-01.torproject.org:/srv
    
  7. then, on the target server, restore that file to a directory with enough disk space:

    mkdir f1959051264
    cd f1959051264/
    tar zfx ../f1959051264.gz
    
  8. inspect the backup to verify its integrity (postgresql backups have a manifest that can be checked):

    /usr/lib/postgresql/13/bin/pg_verifybackup -n .
    

    Here's an example of a working backup, even if gzip and tar complain about the archive itself:

    root@weather-01:/srv# mkdir f1959051264
    root@weather-01:/srv# cd f1959051264/
    root@weather-01:/srv/f1959051264# tar zfx ../f1959051264.gz
    
    gzip: stdin: decompression OK, trailing garbage ignored
    tar: Child returned status 2
    tar: Error is not recoverable: exiting now
    root@weather-01:/srv/f1959051264# cd ^C
    root@weather-01:/srv/f1959051264# du -sch .
    39M .
    39M total
    root@weather-01:/srv/f1959051264# ls -alh ../f1959051264.gz
    -rw-r--r-- 1 root root 3.5G Nov  8 17:14 ../f1959051264.gz
    root@weather-01:/srv/f1959051264# cat backup_label
    START WAL LOCATION: E/46000028 (file 000000010000000E00000046)
    CHECKPOINT LOCATION: E/46000060
    BACKUP METHOD: streamed
    BACKUP FROM: master
    START TIME: 2023-10-08 00:51:04 UTC
    LABEL: bungei.torproject.org-20231008-005104-weather-01.torproject.org-main-13-backup
    START TIMELINE: 1
    
    and it's quite promising, that thing, actually:
    
    root@weather-01:/srv/f1959051264# /usr/lib/postgresql/13/bin/pg_verifybackup -n .
    backup successfully verified
    
  9. disable Puppet. you're going to mess with stopping and starting services and you don't want it in the way:

    puppet agent --disable 'keeping control of postgresql startup -- anarcat 2023-11-08 tpo/tpa/team#41388'
    

TODO split here?

  1. install the right PostgreSQL server (we're entering the actual PostgreSQL restore procedure here, getting out of scope):

    apt install postgresql-13
    
  2. move the cluster out of the way:

    mv /var/lib/postgresql/13/main{,.orig}
    
  3. restore files:

    rsync -a ./ /var/lib/postgresql/13/main/
    chown postgres:postgres /var/lib/postgresql/13/main/
    chmod 750 /var/lib/postgresql/13/main/
    
  4. create a recovery.conf file and tweak the postgres configuration:

    echo "restore_command = 'true'" > /etc/postgresql/13/main/conf.d/recovery.conf
    touch /var/lib/postgresql/13/main/recovery.signal
    rm /var/lib/postgresql/13/main/backup_label
    
    echo max_wal_senders = 0 > /etc/postgresql/13/main/conf.d/wal.conf
    echo hot_standby = no >> /etc/postgresql/13/main/conf.d/wal.conf
    
  5. reset the WAL (Write Ahead Log) since we don't have those (this implies possible data loss, but we're already missing a lot of WALs since we're restoring to a past base backup anyway):

    sudo -u postgres /usr/lib/postgresql/13/bin/pg_resetwal -f /var/lib/postgresql/13/main/
    
  6. cross your fingers, pray to the flying spaghetti monster, and start the server:

    systemctl start postgresql@13-main.service & journalctl -u postgresql@13-main.service -f
    
  7. if you're extremely lucky, it will start and then you should be able to dump the database and restore in the new cluster:

    sudo -u postgres pg_dumpall  -p 5433 | pv > /srv/dump/dump.sql
    sudo -u postgres psql < /srv/dump/dump.sql
    

    DO NOT USE THE DATABASE AS IS! Only dump the content and restore in a new cluster.

  8. if all goes well, clear out the old cluster and re-enable Puppet

Reference

Installation

Upgrades

Bacula is packaged in Debian and automatically upgraded. Major Debian upgrades involve a PostgreSQL upgrade, however.
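
The PostgreSQL side of such an upgrade is typically handled with Debian's pg_upgradecluster, roughly as follows (versions here are illustrative; see howto/postgresql for the full procedure):

pg_lsclusters
pg_upgradecluster 13 main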

SLA

Design and architecture

This section documents how backups are setup at Tor. It should be useful if you wish to recreate or understand the architecture.

Backups are configured automatically by Puppet on all nodes, and use Bacula with TLS encryption over the wire.

Backups are pulled from machines to the backup server, which means a compromise on a machine shouldn't allow an attacker to delete backups from the backup server.

Bacula splits the different responsibilities of the backup system among multiple components, namely:

  • Director (bacula::director in Puppet, currently bacula-director-01, with a PostgreSQL server configured in Puppet), schedules jobs and tells the storage daemon to pull files from the file daemons
  • Storage daemon (bacula::storage in Puppet, currently bungei), pulls files from the file daemons
  • File daemon (bacula::client, on all nodes), serves files to the storage daemon, also used to restore files to the nodes

In our configuration, the Admin workstation, Database server, and Backup server are all on the same machine, the bacula::director.

Servers are interconnected over TCP connections authenticated with TLS client certificates. Each FD, on all servers, regularly pushes backups to the central SD. This works because the FD has a certificate (/etc/ssl/torproject-auto/clientcerts/thishost.crt) signed by the auto-ca TLS certificate authority (in /etc/ssl/torproject-auto/servercerts/ca.crt).
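
To check this on any node, you can inspect the client certificate and verify it against the auto-ca, for example:

openssl x509 -in /etc/ssl/torproject-auto/clientcerts/thishost.crt -noout -subject -issuer -enddate
openssl verify -CAfile /etc/ssl/torproject-auto/servercerts/ca.crt /etc/ssl/torproject-auto/clientcerts/thishost.crt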

Volumes are stored in the storage daemon, in /srv/backups/bacula/. Each client stores its volumes in a separate directory, which makes it easier to purge offline clients and evaluate disk usage.

We do not have a bootstrap file as advised by the upstream documentation, because we do not use tapes or tape libraries, which would make it harder to find volumes. Instead, our catalog is backed up in /srv/backups/bacula/Catalog and each backup contains a single file, the compressed database dump, which is sufficient to re-bootstrap the director.

See the introduction to Bacula for more information on those distinctions.

PostgreSQL backup system

Database backups are handled specially. We use PostgreSQL everywhere apart from a few rare exceptions (currently only CiviCRM) and therefore use postgres-specific configurations to do backups of all our servers.

See the PostgreSQL backups reference for those servers' specific backup/restore instructions.

MySQL backup system

MySQL also requires special handling, and it's done in the mariadb::server Puppet class. It deploys a script (tpa-backup-mysql-simple) which runs every day and calls mysqldump to store plain text copies of all databases in /var/backups/mysql.

Those backups then get included in the normal Bacula backups.
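
The script itself is managed by Puppet; conceptually it boils down to something like the following sketch (the real tpa-backup-mysql-simple may differ in its details):

# rough sketch of a per-database plain-text dump, not the actual script
mkdir -p /var/backups/mysql
for db in $(mysql -NBe 'SHOW DATABASES' | grep -Ev '^(information_schema|performance_schema|sys)$'); do
    mysqldump --single-transaction --routines "$db" | gzip > "/var/backups/mysql/${db}.sql.gz"
done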

A more complicated backup system with multiple generations and expiry was previously implemented, but found to be too complicated, using up too much disk space, and duplicating the retention policies implemented in Bacula. It was retired in tpo/tpa/team#42177, in June 2025.

Scheduler

We do not use the builtin Bacula scheduler because it had issues. Instead, jobs are queued by the dsa-bacula-scheduler started from cron in /etc/cron.d/puppet-crontab.
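
The cron entry lives in /etc/cron.d/puppet-crontab and is managed by Puppet; it is roughly of this shape (the schedule and flock wrapper shown here are illustrative, not the real values):

# illustrative only: dsa-bacula-scheduler run from cron under a lock on its own path
*/5 * * * * root flock -w 0 -e /usr/local/sbin/dsa-bacula-scheduler /usr/local/sbin/dsa-bacula-scheduler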

TODO: expand on the problems with the original scheduler and how ours work.

Volume expiry

There is a /etc/bacula/scripts/volume-purge-action script which runs daily (also from puppet-crontab) which will run the truncate allpools storage=%s command on all mediatype entities found in the media table. TODO: what does that even mean?

Then the /etc/bacula/scripts/volumes-delete-old script (also run daily, also from puppet-crontab) will:

  • delete volumes with errors (volstatus=Error) that were created more than two weeks ago and haven't changed in 6 weeks (sketched as SQL below)
  • delete all volumes in "append" mode (volstatus=Append) which are idle
  • delete purged volumes (volstatus=Purged) without files (volfiles=0 and volbytes<1000), marked to be recycled (recycle=1) and older than 4 months
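
For illustration, the first criterion corresponds roughly to a catalog query like this (assuming the standard media table schema; the actual script may differ):

SELECT volumename FROM media
 WHERE volstatus = 'Error'
   AND firstwritten < now() - interval '2 weeks'
   AND lastwritten < now() - interval '6 weeks';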

It doesn't actually seem to purge old volumes per se: something else seems to be responsible for marking them as Purged. This is (possibly?) accomplished by the Director, thanks to the Volume Retention settings in the storage jobs configurations.

All the above run on the Director. There's also a cron job bacula-unlink-removed-volumes which runs daily on the storage server (currently bungei) and will garbage-collect volumes that are not referenced in the database. Volumes are removed from the storage servers 60 days after they are removed from the director.

This seems to imply that we have a backup retention period of 6 months.

Issues

There is no issue tracker specifically for this project; file or search for issues in the team issue tracker with the ~Backup label.

Maintainer

This service is maintained by TPA, mostly by anarcat.

Monitoring and metrics

Tests

Logs

The Bacula director logs to /var/log/bacula/bacula.log. Logs can take up a lot of space when a restore job fails. If that happens, cancel the job and try to rotate logs with:

logrotate -f /etc/logrotate.d/bacula-common

Backups

This is the backup service, so it's a bit circular to talk about backups. But the Bacula director server is backed up to the storage server like any other server, disaster recovery procedures explain how to restore in catastrophic failure cases.

An improvement to the backup setup would be to have two storage servers, see tpo/tpa/team#41557 for followup.

Other documentation

Discussion

TODO: populate Discussion section.

Overview

Security and risk assessment

Bacula is pretty good, security-wise, as it "pulls" backups from servers. So even if a server is compromised, an attacker cannot move laterally to destroy the backups.

It is, however, vulnerable to a cluster-wide compromise: if, for example, the Puppet or Bacula director servers are compromised, all backups can be destroyed or tampered with, and there's no clear workaround for this problem.

There are concerns about the consistency of backups. During a GitLab incident, it was found that some log files couldn't be restored properly (tpo/tpa/team#41474). It's unclear what the cause of this problem was.

Technical debt and next steps

Bacula has been lagging behind upstream in Debian, where we have been stuck with version 9 for three major releases (buster on 9.4 and bullseye/bookworm on 9.6). Version 13 was uploaded to unstable in January 2024 and may ship with Debian trixie (13). But Bacula 15 is already out, so it's possible we will keep lagging behind.

Bacula was forked in 2013 into a project called BareOS but that was never widely adopted. BareOS is not, for example, packaged in Debian.

We have a significant amount of legacy built on top of Bacula. For example, we have our own scheduler, because the Bacula scheduler was perceived to be inadequate. It might be worth reconsidering this.

Bacula is old software, designed for when the state of the art in backups was tape archival. We do not use tape (see below) and are unlikely ever to. This tape-oriented design makes working with normal disks a bit awkward.

Bacula doesn't deduplicate between archives the way more modern backup software (e.g. Borg, Restic) do, which leads to higher disk usage, particularly when keeping longer retention periods.

Proposed Solution

Other alternatives

Tape medium

Last I (anarcat) checked, the latest (published) LTO tape standard stores a whopping 18TB of data, uncompressed, per cartridge and writes at 400MB/s, which means it takes about 12h30m to fill up one tape.

LTO tapes are pretty cheap, e.g. here is a 12TB LTO8 tape from Fuji for 80$CAD. The LTO tape drives are however prohibitively expensive. For example, an "upgrade kit" for an HP tape library sells for a whopping 7k$CAD here. I can't actually find any LTO-8 tape drives on newegg.ca.

As a comparison, you can get an 18TB Seagate IronWolf drive for 410$CAD, which means that, for the price of that upgrade kit, you can get a whopping 300TB worth of HDDs. And you don't have any actual tape yet: you'd need to shell out another 2k$CAD to get 300TB worth of 12TB tapes.

(Of course, that abstracts away the cost of running those hard drives. You might dodge that issue by pretending you can use HDD "trays" and hot-swap those drives around though, since that is effectively how tapes work. So maybe for the cost of that 2k$ of tapes, you could buy a 4U server with a bunch of slots for the hard drive, which you would still need to do to host the tape drive anyway.)