title: Incident and emergency response: what to do in case of fire
This documentation is for sysadmins to figure out what to do when things go wrong. If you don't have the required access and haven't been trained for such situations, you might be better off trying to wake up someone who can deal with them. See the support documentation instead.
This page lists situations that are not service-specific: generic issues that can happen on any server (or even on your home network). They are, in a sense, the default location for "pager playbooks" that would otherwise live in the service documentation.
Therefore, if the fault concerns a specific service, you will more likely find what you are looking for in the service listing.
[[TOC]]
Incident response procedures
If you are faced with an incident, try to follow this procedure:
1. **Open an issue**

   ... or at least start taking notes of what is happening if GitLab is down. Use the `.gitlab/issue_templates/Incident.md` template from the `tpo/tpa/team` repository, in an Etherpad or a local text file in the worst case.

2. **Get help**

   By default, you're in Command, and responsible for everything. Except in the simplest cases, you should delegate at least Communications and Planning.

   Find someone to help with Comms, at least, to shield you from interruptions and help with updating the status page.

   Eventually, for "all hands on deck" major incidents, this will include delegating Command itself, or handing off Operations and keeping Command.

   Don't hesitate to ping your colleagues over text messages and out of band. Try to be considerate of time zones, but we're a team, and you're not expected to solve everything alone.

3. **Work the problem!**

   Depending on the role (and you accumulate all of those by default):

   - Operations

     Find and fix the problem! Poke at the systems and try to find a solution. Don't hesitate to ask for help, and keep others informed of your progress.

     ONLY ONE ROLE MODIFIES THE SYSTEMS DURING AN INCIDENT.

     This is like flying a plane: when you switch pilot, you say "the plane is yours", and the other pilot confirms and says "the plane is mine".

     Use `tmux` to allow others to see what you're doing, and to survive outages (see the tmux example below).

   - Communications

     Talk internally and possibly externally about this issue:

     - stay in close contact with Operations to know what's going on
     - relay and summarize information about the outage by instant messaging / IRC / Matrix or email
     - update the status site as soon as possible, and keep it up to date
     - for major incidents, consider involving the Communications Director and social media outlets
     - avoid talking to the media; refer them to the Communications Director

   - Planning

     You're the note taker, and you take care of assisting Operations with long-term issues. Your role is:

     - document progress in the issue (e.g. add timeline events for comments made by Operations)
     - file new issues for possible problems / improvements that are uncovered
     - document how much systems diverged from the norm
     - keep track of new tasks that are piling up ("Next steps")
     - order dinner and breaks; you're the time keeper

   - Command

     Tag, you're it! The buck stops here, and you might have all the above roles on your hands. If all roles have been delegated, your role is:

     - keep a cool head
     - act as a tie-breaker to make tough calls
     - remove roadblocks from other administrative boundaries (which makes team leads especially suited for this role)
     - do not micro-manage everyone else
     - all of the above, which will be delegated as incidents grow

4. **Post-mortem**

   Once things have settled down, write and communicate a post-mortem.

   This is typically done by the Planning role, but this can be shifted around.

   The post-mortem, if it becomes too big, can be moved to the status site or even a blog post. See this comment for examples of more exhaustive post-mortems.
Your priority is to fix things, not red tape, but long-term recovery and coordination with others will be easier if you follow this procedure.
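For the Operations role above, a simple way to let others watch or take over is a named, shared tmux session. This is a minimal sketch, assuming the responders all end up as the same user (typically root) on the affected machine:

```
# first responder, on the affected machine
tmux new-session -s incident

# anyone else (logged in as the same user) who wants to watch or take over
tmux attach -t incident
```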
More background
Keep in mind this advice from the Google SRE book:
Best Practices for Incident Management
Prioritize. Stop the bleeding, restore service, and preserve the evidence for root-causing.
Prepare. Develop and document your incident management procedures in advance, in consultation with incident participants.
Trust. Give full autonomy within the assigned role to all incident participants.
Introspect. Pay attention to your emotional state while responding to an incident. If you start to feel panicky or overwhelmed, solicit more support.
Consider alternatives. Periodically consider your options and re-evaluate whether it still makes sense to continue what you’re doing or whether you should be taking another tack in incident response.
Practice. Use the process routinely so it becomes second nature.
Change it around. Were you incident commander last time? Take on a different role this time. Encourage every team member to acquire familiarity with each role.
See TPA-RFC-91 for a discussion on this procedure and the roles.
Specific situations
Server down
If a server is reported as non-responsive, this situation can be caused by:
- a network outage at our provider
  - sometimes the outage is between two of our providers, so make sure to test network reachability from more than one place on the internet
- RAM and swap being full
- the host being offline or crashed
You can first check if it is actually reachable over the network:
ping -4 -c 10 server.torproject.org
ping -6 -c 10 server.torproject.org
ssh server.torproject.org
If it does respond from at least one point on the internet, you can try to diagnose the issue by looking at Prometheus and/or Grafana and analyzing what, exactly, is going on. If you're lucky enough to have SSH access, you can dive deeper into the logs and systemd unit status: for example, it might just be that the node exporter has crashed.
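If you suspect the node exporter specifically, a quick check over SSH might look like this (the unit name below assumes the Debian prometheus-node-exporter package):

```
systemctl status prometheus-node-exporter
journalctl -u prometheus-node-exporter --since "1 hour ago"
```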
If the host does not respond, check whether it's a virtual machine and, if so, which server is hosting it. This information is available in howto/ldap (or the web interface, under the physicalHost field). Then log in to that server to diagnose the issue. If the physical host is a ganeti node, you can use the serial console; if it's not a ganeti node, you can try to access the console on the hosting provider's web site.
Once you have access to the console, look out for signs of errors like OOM kills, disk failures, kernel panics, or network-related errors. If you're still able to log in and investigate, you might be able to bring the machine back online. Otherwise, look in the subsections below for how to perform hard resets.
If the physical host is not responding, or if the physicalHost field is empty (in which case the machine is itself a physical host), you need to file a ticket with the upstream provider. That information is available in LDAP, under the physicalHost field.
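If you prefer the command line over the web interface, an ldapsearch query can show the physicalHost field. This is only a sketch: the server name, base DN, filter attribute and authentication below are assumptions, so adjust them to the actual LDAP layout and your credentials:

```
# hypothetical query: verify the host, base DN and filter before relying on it
ldapsearch -x -ZZ -H ldap://db.torproject.org \
    -b ou=hosts,dc=torproject,dc=org \
    '(hostname=fsn-node-01.torproject.org)' physicalHost
```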
Check if the parent server is online. If you still can't figure out which server that is, use traceroute or mtr to see the hops to the server. Normally, you should see reverse DNS matching one of our points of presence. This will also show you whether or not the upstream routers are responsive. This is an example of a healthy trace to fsn-node-01, hosted at Hetzner Robot, as seen from the other cluster in Dallas:
root@ssh-dal-01:~# mtr -c 10 -w fsn-node-01.torproject.org
Start: 2024-06-19T18:40:03+0000
HOST: ssh-dal-01 Loss% Snt Last Avg Best Wrst StDev
1.|-- gw-01.gnt-dal-01.torproject.org 0.0% 10 0.5 4.2 0.4 35.2 10.9
2.|-- e0-7.switch3.dal2.he.net 90.0% 10 1.7 1.7 1.7 1.7 0.0
3.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
4.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
5.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
6.|-- port-channel9.core2.par3.he.net 0.0% 10 103.5 105.5 102.3 126.5 7.5
7.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
8.|-- hetzner-online.par.franceix.net 10.0% 10 102.0 102.0 101.9 102.2 0.1
9.|-- core12.nbg1.hetzner.com 0.0% 10 120.6 121.4 120.4 125.5 1.6
10.|-- core22.fsn1.hetzner.com 0.0% 10 122.9 123.5 122.7 126.2 1.3
11.|-- 2a01:4f8:0:3::5fe 0.0% 10 122.8 122.8 122.7 123.0 0.1
12.|-- fsn-node-01.torproject.org 0.0% 10 123.1 123.1 122.8 124.0 0.4
In the above, you can see when the packets leave the continent from Dallas (hop 2) to land in Paris (hop 6), although the other hops in the middle are not responding and therefore hidden.
Here's a healthy trace to hetzner-hel1-01, hosted in Hetzner Cloud:
root@ssh-dal-01:~# mtr -c 10 -w hetzner-hel1-01.torproject.org
Start: 2024-06-19T18:41:22+0000
HOST: ssh-dal-01 Loss% Snt Last Avg Best Wrst StDev
1.|-- gw-01.gnt-dal-01.torproject.org 0.0% 10 1.0 0.6 0.4 1.0 0.2
2.|-- e0-7.switch3.dal2.he.net 70.0% 10 1.2 1.2 1.1 1.3 0.1
3.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
4.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
5.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
6.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
7.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
8.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
9.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
10.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
11.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
12.|-- 2a03:5f80:4:2::236:87 0.0% 10 130.2 130.9 129.7 138.9 2.8
13.|-- core32.hel1.hetzner.com 0.0% 10 129.8 129.9 129.7 130.1 0.1
14.|-- spine16.cloud1.hel1.hetzner.com 0.0% 10 128.5 130.6 128.4 145.2 5.3
15.|-- spine2.cloud1.hel1.hetzner.com 0.0% 10 129.4 129.6 129.1 131.4 0.7
16.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
17.|-- 12995.your-cloud.host 0.0% 10 128.6 128.7 128.5 129.1 0.2
18.|-- hetzner-hel1-01.torproject.org 0.0% 10 130.7 131.1 130.3 135.4 1.5
What follows are per-provider instructions:
Hetzner robot (physical servers)
If you're not sure yet whether it's the server or Hetzner, you can use location-specific Hetzner targets:
ash.icmp.hetzner.com
fsn.icmp.hetzner.com
hel.icmp.hetzner.com
nbg.icmp.hetzner.com
... and so on.
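For example, to check whether the Falkenstein location is reachable at all from your vantage point:

```
mtr -c 10 -w fsn.icmp.hetzner.com
```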
If all of that fails, you can try to reset or reboot the server remotely:
- Visit the Hetzner Robot server page (password in `tor-passwords/hosts-extra-info`)
- Select the right server (hostname is the second column)
- Select the "reset" tab
- Select the "Execute an automatic hardware reset" radio button and hit "Send". This is equivalent to hitting the "reset" button on a computer.
- Wait for the server to return for a "few" (2? 5? 10? 20?) minutes, depending on how hopeful you are this simple procedure will work.
- If that fails, select the "Order a manual hardware reset" option and hit "Send". This will send an actual human to attend the server and see if they can bring it back online.
If all else fails, select the "Support" tab and open a support request.
DO NOT file a ticket with support@hetzner.com. That email address
is notoriously slow to get an answer from. See incident 40432 for
a 3+ day delay.
Hetzner Cloud (virtual servers)
- Visit the Hetzner Cloud console (password in `tor-passwords/hosts-extra-info`)
- Select the project (usually "default")
- Select the affected server
- Open the console (the `>_` sign on the top right), and see if there are any error messages and/or if you can log in there (using the root password in `tor-passwords/hosts`)
- If that fails, attempt a "Power cycle" in the "Power" tab (on the left)
- If that fails, you can also try to boot a rescue system by selecting "Enable Rescue & Power Cycle" in the "Rescue" tab
If all else fails, create a support request. The support menu is in the "Person" menu on the top right of the page.
DO NOT file a ticket with support@hetzner.com. That email address
is notoriously slow to get an answer from. See incident 40432 for
a 3+ day delay.
Cymru
Open a ticket by writing to support@cymru.com.
Sunet / safespring
TBD
Intermittent problems
If you have an intermittent problem that takes a while to manifest itself, you can increase the ping count (-c). If that takes too long, you can enable "flood" mode (-f), which decreases the interval between packets, waiting less between each failing probe or sending as soon as a reply is received, up to a certain rate.
Here is, for example, a successful 1000 packet ping executed in 100ms:
root@tb-build-03:~# ping -f -c 1000 dal-node-01.torproject.org
PING dal-node-01.torproject.org(2620:7:6002:0:3eec:efff:fed5:6b2a) 56 data bytes
--- dal-node-01.torproject.org ping statistics ---
1000 packets transmitted, 1000 received, 0% packet loss, time 101ms
rtt min/avg/max/mdev = 0.075/0.086/0.211/0.012 ms, ipg/ewma 0.101/0.095 ms
And here is a failing ping aborted after 14 seconds:
root@tb-build-03:~# ping -f -c 1000 maven.mozilla.org
PING maven.mozilla.org(2600:9000:24f8:fa00:1b:afe8:4000:93a1) 56 data bytes
..............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................^C
--- maven.mozilla.org ping statistics ---
878 packets transmitted, 0 received, 100% packet loss, time 14031ms
(See tpo/tpa/team#41654 for a discussion and further analysis of that specific issue.)
MTR can help diagnose issues in this case. Vary parameters like IPv6
(-6) or TCP (--tcp). In the above case, the problem could be
reproduced with mtr --tcp -6 -c 10 -w maven.mozilla.org.
Tools like curl can also be useful for quick diagnostics, but note that curl implements the Happy Eyeballs standard, so it might hide issues (e.g. with IPv6) that could otherwise be affecting other clients.
Unexpected reboot
If a host reboots without a manual intervention, there might be different causes for the reboot to happen. Identifying exactly what happened after the fact can be challenging or even in some cases impossible since logs might not have been updated with information about the issues.
But in some cases the logs do have some information. Some things that can be investigated:
- syslog: look particularly for disk errors, OOM kill messages close to the reboot, and kernel oops messages
- dmesg from previous boots, e.g. `journalctl -k -b -1` (see `journalctl --list-boots` for a list of available boot IDs)
- `smartctl -t long` and `smartctl -A`, or `nvme device-self-test` and `nvme self-test-log`, on all devices
- `/proc/mdstat` and `/proc/drbd`: make sure that replication is still all right
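For example, a quick first pass over those sources could look like this (the device names are placeholders; adjust them to the actual hardware, and note that /proc/drbd only exists where DRBD is in use):

```
journalctl --list-boots
journalctl -k -b -1 | grep -iE 'error|panic|oom|i/o'
cat /proc/mdstat /proc/drbd
smartctl -a /dev/sda        # or: nvme smart-log /dev/nvme0
```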
Also note that it's possible this is a spurious warning, or that a host took longer than expected to reboot. Normally, our Fabric reboot procedures issue a silence for the monitoring system to ignore those warnings. It's possible those delays are not appropriate for this host, for example, and might need to be tweaked upwards.
Network-level attacks
Firewall blocking
If you are sure that a specific $IP is mounting a Denial of Service
attack on a server, you can block it with:
iptables -I INPUT -s $IP -j DROP
$IP can also be a network in CIDR notation, e.g. the following drops
a whole Google /16 from the host:
iptables -I INPUT -s 74.125.0.0/16 -j DROP
Note that the above inserts (-I) a rule into the rule chain, which puts it before the other rules. This is most likely what you want, as there may already be an existing rule that allows the traffic through, which would make a rule appended (-A) to the chain ineffective.
This only blocks one network or host, and quite brutally, at the network level. From a user's perspective, it will look like an outage. A gentler way is to use -j REJECT, which actually sends a reset packet to let the user know they're blocked.
See also our nftables documentation.
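With nftables, an equivalent one-off block might look like the following. The table and chain names (inet filter / input) are assumptions, so check the actual ruleset on the host first:

```
# find the actual table and chain names used on this host
nft list ruleset | less

# then insert a drop rule at the top of the input chain (not persistent)
nft insert rule inet filter input ip saddr 74.125.0.0/16 drop
```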
Note that those changes are gone after a reboot or a firewall reload; for permanent blocking, see below.
Server blocking
An even "gentler" approach is to block clients at the server level. That way the client application can provide feedback to the user that the connection has been denied, more clearly. Typically, this is done with a web server level block list.
We don't have a uniform way to do this right now. In profile::nginx,
there's a blocked_hosts list that can be used to add CIDR
entries which are passed to the Nginx deny
directive. Typically, you would define an entry in Hiera with
something like this (example from data/roles/gitlab.yaml):
profile::nginx::blocked_hosts:
# alibaba, tpo/tpa/team#42152
- "47.74.0.0/15"
For Apache servers, it's even less standardized. A couple of servers (currently donate and crm) have a blocklist.txt file that's used in a RewriteMap to deny individual IP addresses.
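For illustration, a RewriteMap-based block of that kind can look roughly like the following sketch. The paths and map name here are hypothetical; check the actual vhost configuration on donate or crm for the real setup:

```
RewriteEngine on
# text map: one IP address per line, followed by the value "deny"
RewriteMap blocklist "txt:/etc/apache2/blocklist.txt"
# return 403 Forbidden when the client address maps to "deny"
RewriteCond "${blocklist:%{REMOTE_ADDR}|allow}" "=deny"
RewriteRule "^" "-" [F]
```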
Extracting IP range lists
A command like this will extract the IP addresses from a webserver log file and group them by number of hits:
awk '{print $1}' /var/log/nginx/gitlab_access.log | grep -v '0.0.0.0' | sort | uniq -c | sort -n
This assumes log redaction has been disabled on the virtual host, of course, which can be done in emergencies like this. The most frequent hosts will show up at the bottom of the output.
You can look up which netblock the relevant IP addresses belong to with a command like ip-info (part of the libnet-abuse-utils-perl Debian package) or asn (part of the asn package). Or this can be done by asking the asn.cymru.com service, with, for example:
nc whois.cymru.com 43 <<EOF
begin
verbose
216.90.108.31
192.0.2.1
198.51.100.0/24
203.0.113.42
end
EOF
This can be used to group IP addresses by netblock and AS number, roughly. A much more sophisticated approach is the asncounter project developed by anarcat, which allows AS and CIDR-level counting and can be used to establish a set of networks or entire ASNs to block.
The asncounter(1) manual page has detailed examples for
this. That tool has been accepted in Debian unstable as of 2025-05-28
and should slowly make its way down to stable (probably Debian 14
"forky" or later). It's currently installed on gitlab-02 in
/root/asncounter but may eventually be deployed site-wide through
Puppet.
Filesystem set to readonly
If a filesystem is switched to readonly, it prevents any process from writing to the affected disk, which can have consequences of varying magnitude depending on which volume is readonly.
If Linux automatically changes a filesystem to readonly, it usually indicates that some serious issues were detected with the disk or filesystem. Those can be:
- physical drive errors
- bad sectors or other detected ongoing data corruption
- hard drive driver errors
- filesystem corruption
Look out for disk- or filesystem-related errors in:
- syslog
- dmesg
- physical console (e.g. IPMI console)
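A few commands that can quickly surface the underlying error (the device name is a placeholder; adjust it to the actual disks on the host):

```
# kernel messages about the filesystem being remounted read-only or I/O errors
dmesg -T | grep -iE 'read-only|i/o error|ext4|xfs'
# RAID array health
cat /proc/mdstat
# SMART health of the suspect drive
smartctl -a /dev/sda
```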
In some cases with ext4, running fsck can fix issues. However, watch out for
files disappearing or being moved to lost+found if the filesystem encounters
serious enough inconsistencies.
If the hard disk is showing signs of breakage, usually that disk will get ejected from the RAID array without blocking the filesystem. However, if the disk breakage did impact filesystem consistency and caused it to switch to readonly, migrate the data away from that drive ASAP, for example by moving the instance to its secondary node or by rsync'ing it to another machine.
In such a case, you'll also want to review which other instances are currently using the same drive, and possibly move all of those instances as well before replacing the drive.
Web server down
Apache web server diagnostics
If you get an alert like ApacheDown, that is:
Apache web server down on test.example.com
It means the apache exporter cannot contact the local web server
over its control address
http://localhost/server-status/?auto. First, confirm whether this is
a problem with the exporter or the entire service, by checking the
main service on this host to see if users are affected. If that's the
case, prioritize that.
It's possible, for example, that the webserver has crashed for some reason. The best way to figure that out is to check the service status with:
service apache2 status
You should see something like this if the server is running correctly:
● apache2.service - The Apache HTTP Server
Loaded: loaded (/lib/systemd/system/apache2.service; enabled; preset: enabled)
Active: active (running) since Tue 2024-09-10 14:56:49 UTC; 1 day 5h ago
Docs: https://httpd.apache.org/docs/2.4/
Process: 475367 ExecReload=/usr/sbin/apachectl graceful (code=exited, status=0/SUCCESS)
Main PID: 338774 (apache2)
Tasks: 53 (limit: 4653)
Memory: 28.6M
CPU: 11min 30.297s
CGroup: /system.slice/apache2.service
├─338774 /usr/sbin/apache2 -k start
└─475411 /usr/sbin/apache2 -k start
Sep 10 17:51:50 donate-01 systemd[1]: Reloading apache2.service - The Apache HTTP Server...
Sep 10 17:51:50 donate-01 systemd[1]: Reloaded apache2.service - The Apache HTTP Server.
Sep 10 19:53:00 donate-01 systemd[1]: Reloading apache2.service - The Apache HTTP Server...
Sep 10 19:53:00 donate-01 systemd[1]: Reloaded apache2.service - The Apache HTTP Server.
Sep 11 00:00:01 donate-01 systemd[1]: Reloading apache2.service - The Apache HTTP Server...
Sep 11 00:00:01 donate-01 systemd[1]: Reloaded apache2.service - The Apache HTTP Server.
Sep 11 01:29:29 donate-01 systemd[1]: Reloading apache2.service - The Apache HTTP Server...
Sep 11 01:29:29 donate-01 systemd[1]: Reloaded apache2.service - The Apache HTTP Server.
Sep 11 19:50:51 donate-01 systemd[1]: Reloading apache2.service - The Apache HTTP Server...
Sep 11 19:50:51 donate-01 systemd[1]: Reloaded apache2.service - The Apache HTTP Server.
The first dot (●) should be green, and the Active line should say active (running). If it isn't, the logs should show why the service failed to start.
It's possible you won't see the relevant logs there if the service is stuck in a restart loop. In that case, use this command instead to see the service logs:
journalctl -b -u apache2
That shows the logs for the service since the last boot.
If the main service is online and it's only the exporter having trouble, try to reproduce the issue with curl from the affected server, for example:
root@test.example.com:~# curl http://localhost/server-status/?auto
Normally, this should work, but it's possible Apache is misconfigured and doesn't listen on localhost for some reason. Look at the `apache2ctl -S` output, and the rest of the Apache configuration in /etc/apache2, particularly ports.conf and the Listen directives.
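To double-check which addresses and ports Apache is actually bound to, a couple of quick checks can help (assuming a standard Debian layout):

```
# show listening sockets held by Apache processes
ss -tlnp | grep apache2
# show where Listen directives are configured (usually ports.conf)
grep -rn '^\s*Listen' /etc/apache2/
```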
See also the Apache exporter scraping failed instructions in the Prometheus documentation, a related alert.
Disk is full or nearly full
When a disk is filled to 100% of its capacity, some processes may stop working correctly. For example, PostgreSQL will purposefully exit when that happens, to avoid the risk of data corruption. MySQL is not so graceful, and can end up with data corruption in some of its databases.
The first step is to check how long you have. For this, a good tool is the Grafana disk usage dashboard. Select the affected instance and look at the "change rate" panel; it should show you how much time is left per partition.
To clear up this situation, there are two approaches that can be used in succession:
- find what's using disk space and clear out some files
- grow the disk
The first thing that should be attempted is to identify where disk space is used and remove some big files that occupy too much space. For example, if the root partition is full, this will show you what is taking up space:
ncdu -x /
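If ncdu is not installed and you'd rather not install anything during an incident, plain du can give a similar overview:

```
# largest directories directly under /, staying on this filesystem
du -x -h --max-depth=1 / 2>/dev/null | sort -h | tail -20
```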
Examples
Maybe the syslog grew to ridiculous sizes? Try:
logrotate -f /etc/logrotate.d/syslog-ng
Maybe some users have huge DB dumps lying around in their home directory. After confirming that those files can be deleted:
rm /home/flagada/huge_dump.sql
Maybe the systemd journal has grown too big. This will keep only 500MB:
journalctl --vacuum-size=500M
If in the cleanup phase you can't identify files that can be removed, you'll need to grow the disk. See how to grow disks with ganeti.
Note that a suddenly growing disk might be a symptom of a larger problem, for example bots crawling a website abusively or an attacker running a denial of service attack. This warrants further (and more complex) investigation, of course, but that can be delegated to after the disk usage alert has been handled.
Host clock desynchronized
If a host's clock has drifted and is no longer in sync with the rest of the internet, some really strange things can start happening, like TLS connections failing even though the certificate is still valid.
If a host has time synchronization issues, check that the ntpd service is
still running:
systemctl status ntpd.service
You can gather information about which peer servers are drifting:
ntpq -pun
Logs for this service are sent to syslog, so you can take a look there to see if some errors were mentioned.
If restarting the ntpd service does not work, verify that a firewall is not blocking port 123 UDP.
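To confirm whether the clock is actually off, and by how much, these checks can help (ntpdate may not be installed everywhere; the pool server is just an example):

```
# does systemd consider the system clock synchronized?
timedatectl status
# compare the local clock against a public NTP server (query only, no adjustment)
ntpdate -q 0.debian.pool.ntp.org
```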
Support policies
Please see TPA-RFC-2: support.