title: TPA-RFC-91: Incident response
costs: N/A
approval: TPA
affected users: TPA
deadline: 2 weeks (2025-10-13)
status: proposed
discussion: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40421
Summary: adopt an incident response procedure and templates, and use them more systematically.
[[TOC]]
Background
Since essentially forever, our incident response procedures have been quite informal, based mostly on the hunches and judgement of staff during high-stress situations.
This makes those situations more difficult and stressful than they already are. It's also hard to follow up on issues in a consistent manner.
Last week, we had three more incidents that spurred anarcat into formalizing this process a little bit. The first objective was to make a post-mortem template that could be used to write some notes after an incident, but it grew to describe a more proper incident response procedure.
Proposal
The proposal consists of:
- A template:
This is a GitLab issue template (.gitlab/issue_templates/Incident.md) that gets used when you create an incident in GitLab or when you pick the Incident template when reporting an issue.
It reuses useful ideas from previous incidents like having a list of dashboards to check and a checklist of next steps, but also novel ideas like clearer roles of who does what.
It also includes a full post-mortem template while still trying to keep the whole thing lightweight.
This template is not set in stone by this proposal; we merely state here that we need such a template. Further updates can be made to the template without going through an RFC process, naturally. The first draft of this template is in merge request tpo/tpa/team!1.
- A process:
The process is the companion document to the template. It expands on what each role does, mostly, and spells out general principles. It lives in the howto/incident-response page, which is the generic "in case of fire" entry point in our documentation.
The first draft of this process is in merge request !86 in the wiki. It includes:
- the principle of filing and documenting issues as we go
- getting help
- Operations, Communications, Planning and Commander roles imported from the Google SRE book
- writing a post-mortem for larger incidents
This is made into a formal proposal to bring attention to those new mechanisms, offer a space for discussion, and make sure we at least try to use those procedures during the next incidents, in particular the issue template.
Feedback is welcome either in the above merge requests, in the discussion issue, or by email.
Alternatives considered
Other policies
There are of course many other incident response policies out there. We were inspired at least partly by some of those:
- Google SRE book: roles come from here, general principles quoted directly
- Got game? Secrets of great incident management
- PagerDuty incident response documentation
Other post-mortem examples and ideas
We were also inspired by other examples:
- GitHub - danluu/post-mortems: A collection of postmortems
- GitLab example post-mortem (2017; there might be newer or better examples)
- Cloudflare example post-mortem (2019)
- Galileo post-mortem
- Amazon example post-mortem
- Root cause analysis ideas
We have also considered the following headings for the post-mortem:
- What happened?
- Where did it happen?
- Who was impacted by the incident?
- When did problem and resolution events occur?
- Why did the incident occur?
But we found them more verbose than the current headings, and lacking the "next steps" aspect of the current post-mortem ("What went well?", "What could have gone better?" and "Recommendations and related issues").
No logs, no master, no commander?
A lot of consideration has been given to the title "Commander". The term was adopted as is from the Google SRE book. According to Wikipedia:
Commander [...] is a common naval officer rank as well as a job title in many armies. Commander is also used as a [...] title in other formal organizations, including several police forces. In several countries, this naval rank is termed as a frigate captain.
Commander is also a generic term for an officer commanding any armed forces unit, such as "platoon commander", "brigade commander" and "squadron commander". In the police, terms such as "borough commander" and "incident commander" are used.
We therefore need to acknowledge the fact that the term originally comes from the military, which is not typically how we like to organize our work. This raised a lot of eyebrows in the review of this proposal, as we prefer to work by consensus, leading by example and helping each other.
But we must admit that, in an emergency, deliberation and consensus building might be impossible. We must delegate power to someone who will make the tough decisions, and it's necessary to have a single person at the helm, a bit like you have a single person on "operations", changing the systems at once, or you have a single person driving a car or a bus in real life.
The commander, however, is also useful because they are typically a person already in a situation of authority in relation to other political units, either inside or outside the organisation. This puts the commander in a better position to remove blockers than others. Note that this often means the person for the role is the Team Lead, especially if politics are involved, but we do not want the Team Lead handling all incidents.
In fact, the best person in Operations and Command is likely to be the available person who is the most familiar with the system at hand. It also must be clear that the roles can and should be rotated, especially if the people holding them become tired or seem to be causing more trouble than they are worth, just like an aggressive or dangerous driver should be taken away from the wheel.
Furthermore, it must be understood that Command is not supposed to interface with Operations, once that role has been delegated: this is not a micro-management facility, it's a helper, un-blocker, tie-breaker role.
We have briefly considered using a more modest term like captain of a ship. Having had some experience sailing on ships, anarcat has in particular developed a deeper appreciation of that role in life-threatening situations, where the Captain (or Skipper) not only has authority but also the skills and thorough knowledge of the ship.
Other terms we considered were:
- "lead": excluded because it can too easily be confused with "team lead", and we can't have the team lead be the lead for every incident out there
- "coordinator": can too easily be confused with the Planning role, and hides the fact that the person needs to actually make executive decisions at times
- "facilitator": similar problems to "coordinator", but worse: an even "softer" name that removes essentially all power from the role, while we must delegate some power to the role
But ultimately, we prefer the term Incident Commander because it is well-known terminology used both inside our industry (for example at Google) and outside it (at FEMA, among firefighters, in medical emergencies and so on). The term is therefore not used in its military sense, but in a civil context.
If someone were to onboard in TPA and find the "Incident Command" terminology during an emergency, they would be more likely to understand what is going on than if they found an "Incident Facilitator", "Coordinator", or "Lead", which are site-specific.
The term also maps better to a noun and a verb (a "Commander" is in "Command" and "Commands") than "Captain" (which would map, presumably, to the verb "Captain" and not really to any noun but "Command").