DNS is the Domain Name System. It is what turns a name like
www.torproject.org into an IP address that can be routed over the
Internet. TPA maintains its own DNS servers and this document attempts
to describe how those work.
TODO: mention unbound and a rough overview of the setup here
[[TOC]]
Tutorial
How to
Most operations on DNS happen in the domains repository
(dnsadm@nevii.torproject.org:/srv/dns.torproject.org/repositories/domains). It contains
the master copies of the zone files, stored as (mostly) standard BIND zonefiles
(RFC 1034), but notably without a SOA record.
Tor's DNS support is fully authenticated with DNSSEC, both towards the outside world and internally, where all TPO hosts use DNSSEC in their resolvers.
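To check that validation works on a given host, you can, for example, look for the "ad" (authenticated data) flag in a response from the host's resolver (a quick sketch; the exact flags line varies):
dig www.torproject.org +dnssec | grep flags
If the "ad" flag is missing, the resolver did not validate the answer.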
Editing a zone
Zone records can be added to or modified in a zone by editing the
relevant zonefile in the domains git repository and pushing the change.
Serial numbers are managed automatically by the git repository hooks.
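A typical change therefore looks something like this (a sketch, assuming you have push access to the repository):
git clone dnsadm@nevii.torproject.org:/srv/dns.torproject.org/repositories/domains
cd domains
$EDITOR torproject.org   # add or modify records
git commit -a -m 'add example record'
git push   # the repository hooks bump the serial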
Adding a zone
To add a new zone to our infrastructure, the following procedure must be followed:
1. add zone in the domains repository (dnsadm@nevii.torproject.org:/srv/dns.torproject.org/repositories/domains)
2. add zone in the modules/bind/templates/named.conf.torproject-zones.erb Puppet template for DNS secondaries to pick up the zone
3. also add IP address ranges (if it's a reverse DNS zone file) to modules/torproject_org/misc/hoster.yaml in the tor-puppet.git repository
4. run puppet on DNS servers: cumin 'C:roles::dns_primary or C:bind::secondary' 'puppet agent -t'
5. add zone to modules/postfix/files/virtual, unless it is a reverse zonefile
6. add zone to nagios: copy an existing DNS SOA sync block and adapt
7. add zone to external DNS secondaries (currently Netnod)
8. make sure the zone is delegated by the root servers somehow. for normal zones, this involves adding our nameservers in the registrar's configuration. for reverse DNS, this involves asking our upstreams to delegate the zone to our DNS servers.
Note that this is a somewhat rarer procedure: this happens only when a
completely new domain name (e.g. torproject.net) or IP address
space (so reverse DNS, e.g. 38.229.82.0/24 AKA
82.229.38.in-addr.arpa) is added to our infrastructure.
Removing a zone
- git grep the domain in the tor-nagios git repository
- remove the zone in the domains repository (dnsadm@nevii.torproject.org:/srv/dns.torproject.org/repositories/domains)
- on nevii, remove the generated zonefiles and keys:
  cd /srv/dns.torproject.org/var/
  mv generated/torproject.fr* OLD-generated/
  mv keys/torproject.fr OLD-KEYS/
- remove the zone from the secondaries (Netnod and our own servers). this means visiting the Netnod web interface for that zone, and Puppet (modules/bind/templates/named.conf.torproject-zones.erb) for our own
- the domains will probably be listed in other locations, grep Puppet for Apache virtual hosts and email aliases
- the domains will also probably exist in the letsencrypt-domains repository
DNSSEC key rollover
We no longer rotate DNSSEC keys (KSK, technically) automatically,
but there may still be instances where a manual rollover is
required. This involves new DNSKEY / DS records and requires
manual operation at the registrar (currently https://joker.com).
There are two different scenarios for a manual rollover: (1) the current keys are no longer trusted and need to be disabled as soon as possible, and (2) the current ZSK can fade out along its automated 120-day cycle. An example of scenario 1 could be a compromise of private key material. An example of scenario 2 could be preemptively upgrading to a stronger cipher without indication of compromise.
Scenario 1
First, we create a new ZSK:
cd /srv/dns.torproject.org/var/keys/torproject.org
dnssec-keygen -I +120d -D +150d -a RSASHA256 -n ZONE torproject.org.
Then, we create a new KSK:
cd /srv/dns.torproject.org/var/keys/torproject.org
dnssec-keygen -f KSK -a RSASHA256 -n ZONE torproject.org.
And restart bind.
Run dnssec-dsfromkey on the newly generated KSK to get the corresponding new DS record.
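For example (a sketch; the key tag in the filename will differ):
cd /srv/dns.torproject.org/var/keys/torproject.org
dnssec-dsfromkey -2 Ktorproject.org.+008+57040.key
The -2 flag restricts the output to a SHA-256 digest, which is what we use at Joker.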
Save this DS record to a file and propagate it to all our nodes so that unbound has a new trust anchor:
- transfer (e.g. scp) the file to every node's /var/lib/unbound/torproject.org.key (and no, Puppet doesn't do that, because it has replaces => false on that file); see the sketch after this list
- immediately restart unbound (be quick, because unbound can overwrite this file on its own)
- after the restart, check to ensure that /var/lib/unbound/torproject.org.key has the new DS
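A minimal sketch of that propagation, assuming the new trust anchor is in torproject.org.key and hosts.txt is a (hypothetical) list of all TPO hosts:
for host in $(cat hosts.txt); do
    scp torproject.org.key root@$host:/var/lib/unbound/torproject.org.key
    ssh root@$host systemctl restart unbound
done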
Puppet ships trust anchors for some of our zones to our unbounds, so make sure
you update the corresponding file ( legacy/unbound/files/torproject.org.key )
in the puppet-control.git repository. You can replace it with only
the new DS, removing the old one.
On nevii, add the new DS record to /srv/dns.torproject.org/var/keys/torproject.org/dsset, while
keeping the old DS record there.
Finally, configure it at our registrar.
To do so on Joker, you need to visit joker.com
and authenticate with the password in dns/joker in
tor-passwords.git, along with the 2FA dance. Then:
- click on the "modify" button next to the domain affected (was first a gear but is now a pen-like icon thing)
- find the DNSSEC section
- click the "modify" button to edit records
- click "more" to add a record
Note that there are two keys there: one (the oldest) should already be in Joker; you need to add the new one.
With the above, you would have the following in Joker:
alg: 8 ("RSA/SHA-256", IANA, RFC5702)
digest: ebdf81e6b773f243cdee2879f0d12138115d9b14d560276fcd88e9844777d7e3
type: 2 ("SHA-256", IANA, RFC4509)
keytag: 57040
And click "save".
After a little while, you should be able to check whether the new DS record works on DNSviz.net: for example, the DNSviz.net view of torproject.net should be sane.
After saving the new record, wait one hour for the TTL to expire and delete the old DS record.
Also remove the old DS record in /srv/dns.torproject.org/var/keys/torproject.org/dsset.
Wait another hour before removing the old KSK and ZSKs. To do so:
- stop bind
- remove the keypair files in /srv/dns.torproject.org/var/keys/torproject.org/ and the generated zone data:
  rm /srv/dns.torproject.org/var/generated/torproject.org.signed*
  rm /srv/dns.torproject.org/var/generated/torproject.org.j*
- start bind
That completes the rollover.
Scenario 2
In this scenario, we keep our ZSKs and only create a new KSK:
cd /srv/dns.torproject.org/var/keys/torproject.org
dnssec-keygen -f KSK -a RSASHA256 -n ZONE torproject.org.
And restart bind.
Run dnssec-dsfromkey on the newly generated KSK to get the corresponding new DS record.
Puppet ships trust anchors for some of our zones to our unbounds, so make sure
you update the corresponding file ( legacy/unbound/files/torproject.org.key )
in the puppet control repository. You can replace it with only the new DS.
On nevii, add the new DS record to /srv/dns.torproject.org/var/keys/torproject.org/dsset, while
keeping the old DS record there.
Finally, configure it at our registrar.
To do so on Joker, you need to visit joker.com
and authenticate with the password in dns/joker in
tor-passwords.git, along with the 2FA dance. Then:
- click on the "modify" button next to the domain affected (was first a gear but is now a pen-like icon thing)
- find the DNSSEC section
- click the "modify" button to edit records
- click "more" to add a record
Note that there are two keys there: one (the oldest) should already be in Joker; you need to add the new one.
With the above, you would have the following in Joker:
alg: 8 ("RSA/SHA-256", IANA, RFC5702)
digest: ebdf81e6b773f243cdee2879f0d12138115d9b14d560276fcd88e9844777d7e3
type: 2 ("SHA-256", IANA, RFC4509)
keytag: 57040
And click "save".
After a little while, you should be able to check whether the new DS record works on DNSviz.net: for example, the DNSviz.net view of torproject.net should be sane.
After saving the new record, wait one hour for the TTL to expire and delete the old DS record.
Also remove the old DS record in /srv/dns.torproject.org/var/keys/torproject.org/dsset.
Do not remove any keys yet, unbound needs 30 days (!) to complete slow, RFC5011-style rolling of KSKs.
After 30 days, remove the old KSK. To do so:
- stop bind
- remove the old KSK keypair files in /srv/dns.torproject.org/var/keys/torproject.org/ and the generated zone data:
  rm /srv/dns.torproject.org/var/generated/torproject.org.signed*
  rm /srv/dns.torproject.org/var/generated/torproject.org.j*
- start bind
That completes the rollover.
Special case: RFC1918 zones
The above is for public zones, for which we have Nagios checks that
warn us about impending doom. But we also sign zones for reverse IP
lookups, specifically 30.172.in-addr.arpa. Normally, recursive
nameservers pick up new keys in that zone automatically, thanks to
RFC 5011.
But if a new host gets provisioned, it needs to get bootstrapped somehow. This is done by Puppet, but those records are maintained by hand and will get out of date. This implies that after a while, you will start seeing messages like this for hosts that were installed after the expiration date:
16:52:39 <nsa> tor-nagios: [submit-01] unbound trust anchors is WARNING: Warning: no valid trust anchors found for 30.172.in-addr.arpa.
The solution is to go on the primary nameserver (currently nevii)
and pick the non-revoked DSSET line from this file:
/srv/dns.torproject.org/var/keys/30.172.in-addr.arpa/dsset
... and inject it in Puppet, in:
tor-puppet/modules/unbound/files/30.172.in-addr.arpa.key
Then new hosts will get the right key and bootstrap properly. Old hosts can get the new key by removing the file by hand on the server and re-running Puppet:
rm /var/lib/unbound/30.172.in-addr.arpa.key ; puppet agent -t
Transferring a domain
Joker
To transfer a domain from another registrar to joker.com, you will need the domain name you want to transfer, and an associated "secret" that you get when you unlock the domain at the other registrar, referred to below as the "secret".
Then follow these steps:
- login to joker.com
- in the main view, pick the "Transfer" button
- enter the domain name to be transferred, hit the "Transfer domain" button
- enter the secret in the "Auth-ID" field, then hit the "Proceed" button, ignoring the privacy settings
- pick the hostmaster@torproject.org contact as the "Owner", then for "Billing", uncheck the "Same as" button and pick accounting@torproject.org, then hit the "Proceed" button
- in the "Domain attributes", keep joker.com, then check "Enable DNSSEC" and "take over existing nameserver records (zone)", leave "Automatic renewal" checked and "Whois opt-in" unchecked, then hit the "Proceed" button
- in the "Check Domain Information", review the data then hit "Proceed"
- in "Payment options", pick "Account", then hit "Proceed"
Pager playbook
In general, to debug DNS issues, those tools are useful:
- DNSviz.net, e.g. a DNSSEC Authentication Chain
- dig
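For example, this walks the main parts of the chain of trust by hand with dig (a sketch):
dig +dnssec DS torproject.org        # DS records published in the parent (.org)
dig +dnssec DNSKEY torproject.org    # keys served by the zone itself
dig +dnssec SOA torproject.org @nevii.torproject.org   # what the primary serves
The DS records must match one of the DNSKEYs for validation to succeed.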
unbound trust anchors: Some keys are old
This warning can happen when a host was installed with old keys and unbound wasn't able to rotate them:
20:05:39 <nsa> tor-nagios: [chi-node-05] unbound trust anchors is WARNING: Warning: Some keys are old: /var/lib/unbound/torproject.org.key.
The fix is to remove the affected file and rerun Puppet:
rm /var/lib/unbound/torproject.org.key
puppet agent --test
unbound trust anchors: Warning: no valid trust anchors
So this can happen too:
11:27:49 <nsa> tor-nagios: [chi-node-12] unbound trust anchors is WARNING: Warning: no valid trust anchors found for 30.172.in-addr.arpa.
If this happens on many hosts, you will need to update the key, see the Special case: RFC1918 zones section, above. But if it's a single host, it's possible it was installed during the window where the key was expired, and hasn't been properly updated by Puppet yet.
Try this:
rm /var/lib/unbound/30.172.in-addr.arpa.key ; puppet agent -t
Then the warning should have gone away:
# /usr/lib/nagios/plugins/dsa-check-unbound-anchors
OK: All keys in /var/lib/unbound recent and valid
If not, see the Special case: RFC1918 zones section above.
DNS - zones signed properly is CRITICAL
When adding a new reverse DNS zone, it's possible you get this warning from Nagios:
13:31:35 <nsa> tor-nagios: [global] DNS - zones signed properly is CRITICAL: CRITICAL: 82.229.38.in-addr.arpa
16:30:36 <nsa> tor-nagios: [global] DNS - key coverage is CRITICAL: CRITICAL: 82.229.38.in-addr.arpa
That might be because Nagios thinks this zone should be signed (while it isn't and cannot be). The fix is to add this line to the zonefile:
; ds-in-parent = no
And push the change. Nagios should notice and stop caring about the zone.
In general, this Nagios check provides a good idea of the DNSSEC chain of a zone:
$ /usr/lib/nagios/plugins/dsa-check-dnssec-delegation overview 82.229.38.in-addr.arpa
zone DNSKEY DS@parent DLV dnssec@parent
--------------------------- -------------------- --------------- --- ----------
82.229.38.in-addr.arpa no(229.38.in-addr.arpa), no(38.in-addr.arpa), yes(in-addr.arpa), yes(arpa), yes(.)
Notice how the 38.in-addr.arpa zone is not signed? Our zone below it
can therefore not have a complete DNSSEC chain of trust.
DNS - delegation and signature expiry is WARNING
If you get a warning like this:
13:30:15 <nsa> tor-nagios: [global] DNS - delegation and signature expiry is WARNING: WARN: 1: 82.229.38.in-addr.arpa: OK: 12: unsigned: 0
It might be that the zone is not delegated by upstream. To confirm, run this command on the Nagios server:
$ /usr/lib/nagios/plugins/dsa-check-zone-rrsig-expiration 82.229.38.in-addr.arpa
ZONE WARNING: No RRSIGs found; (0.66s) |time=0.664444s;;;0.000000
On the primary DNS server, you should be able to confirm the zone is signed:
dig @nevii -b 127.0.0.1 82.229.38.in-addr.arpa +dnssec
Check the next DNS server up (use dig -t NS to find it) and see if
the zone is delegated:
dig @ns1.cymru.com 82.229.38.in-addr.arpa +dnssec
If it's not delegated, it's because you forgot step 8 in the zone addition procedure. Ask your upstream or registrar to delegate the zone and run the checks again.
DNS - security delegations is WARNING
This error:
11:51:19 <nsa> tor-nagios: [global] DNS - security delegations is WARNING: WARNING: torproject.net (63619,-53722), torproject.org (33670,-28486)
... will happen after rotating the DNSSEC keys at the registrar. The trick is then simply to remove those keys, at the registrar. See DS records expiry and renewal for the procedure.
DNS SOA sync
An example of this problem is the error:
Nameserver ns5.torproject.org for torproject.org returns 0 SOAs
That is because the target nameserver (ns5 in this case) does not
properly respond for the torproject.org zone. To reproduce the error, you
can run this on the Nagios server:
/usr/lib/nagios/plugins/dsa-check-soas -a nevii.torproject.org torproject.org -v
This happens because the server doesn't correctly transfer the zones
from the master. You can confirm the problem by looking at the logs on
the affected server and on the primary server (e.g. with journalctl
-u named -f). Restarting named on the affected server will trigger a
zone transfer attempt.
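It should also be possible to force a transfer without a full restart, by running this on the affected secondary (assuming the usual rndc setup):
rndc retransfer torproject.org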
Typically, this is because a change in tor-puppet.git was forgotten
(in named.conf.options or named.conf.puppet-shared-keys).
DNS - DS expiry
Example:
2023-08-22 16:34:36 <nsa> tor-nagios: [global] DNS - DS expiry is WARNING: WARN: torproject.com, torproject.net, torproject.org : OK: 4
2023-08-26 16:25:39 <nsa> tor-nagios: [global] DNS - DS expiry is CRITICAL: CRITICAL: torproject.com, torproject.net, torproject.org : OK: 4
Full status information is, for example:
CRITICAL: torproject.com, torproject.net, torproject.org : OK: 4
torproject.com: Key 57040 about to expire.
torproject.net: Key 63619 about to expire.
torproject.org: Key 33670 about to expire.
This is Nagios warning you the DS records are about to expire. They will still be renewed so it's not immediately urgent to fix this, but eventually the DS records expiry and renewal procedure should be followed.
The old records that should be replaced are mentioned by Nagios in the extended status information, above.
DomainExpiring alerts
The DomainExpiring alert looks like:
Domain name tor.network is nearing expiry date
It means the domain (in this case tor.network) is going to expire
soon. It should be renewed at our registrar quickly.
DomainExpiryDataStale alerts
The DomainExpiryDataStale alert looks like:
RDAP information for domain tor.network is stale
The information about a configured list of domain names is normally fetched by a
daily systemd timer (tpa_domain_expiry) running on the Prometheus server. The
metric recording the last RDAP refresh date tells us whether the metrics we
currently hold in Prometheus reflect a current state. We don't want to generate
alerts from outdated data.
If this alert fires, it means that either the job is not running, or the results returned by the RDAP database show issues with the RDAP database itself. We cannot do much about the latter case, but the former we can fix.
Check the status of the job on the Prometheus server with:
systemctl status tpa_domain_expiry
You can try refreshing it with:
systemctl start tpa_domain_expiry
journalctl -e -u tpa_domain_expiry
You can run the query locally with Fabric to check the results:
fab dns.domain-expiry -d tor.network
It should look something like:
anarcat@angela:~/s/t/fabric-tasks> fab dns.domain-expiry -d tor.network
tor.network:
expiration: 2025-05-27T01:09:38.603000+00:00
last changed: 2024-05-02T16:15:48.841000+00:00
last update of RDAP database: 2025-04-30T20:00:08.077000+00:00
registration: 2019-05-27T01:09:38.603000+00:00
transfer: 2020-05-23T17:10:52.960000+00:00
The last update of RDAP database field is the one used in this
alert, and should correspond to the UNIX timestamp in the metric. The
following Python code can convert from the above ISO to the timestamp,
for example:
>>> from datetime import datetime
>>> datetime.fromisoformat("2025-04-30T20:00:08.077000+00:00").timestamp()
1746043208.077
DomainTransferred alerts
The DomainTransferred alert looks something like:
Domain tor.network recently transferred!
This, like the other domain alerts above, is generated by the periodic job (the tpa_domain_expiry timer) that refreshes that data for a list of domains.
If that alert fires, it means the given domain was transferred within the watch window (currently 7 days). Normally, when we transfer domains (which is really rare!), we should silence this alert preemptively to avoid this warning.
Otherwise, if you did mean to transfer this domain, you can silence this alert.
If the domain was really unexpectedly transferred, it's all hands on deck. You need to figure out how to transfer it back under your control, quickly, but even more quickly, you need to make sure the DNS servers recorded for the domain are still ours. If not, this is a real disaster recovery scenario, for which we do not currently have a playbook.
For inspiration, perhaps read the hijacking of perl.com. Knowing people in the registry business can help.
Disaster recovery
Complete DNS breakdown
If DNS completely and utterly fails (for example because of a DS expiry that was mishandled), you will first need to figure out if you can still reach the nameservers.
First diagnostics
Normally, this should give you the list of name servers for the main
.org domain:
dig -t NS torproject.org
If that fails, it means the domain might have expired. Login to the
registrar (currently joker.com) and handle this as a
DomainExpiring alert (above).
If that succeeds, the domain should be fine, but it's possible the DS records are revoked. Check those with:
dig -t DS torproject.org
You can also check popular public resolvers like Google and CloudFlare:
dig -t DS torproject.org @8.8.8.8
dig -t DS torproject.org @1.1.1.1
A DNSSEC error would look like this:
[...]
; EDE: 9 (DNSKEY Missing): (No DNSKEY matches DS RRs of torproject.org)
[...]
;; SERVER: 8.8.4.4#53(8.8.4.4) (UDP)
DNSviz can also help analyze the situation here.
You can also try to enable or disable the DNS-over-HTTPS feature of Firefox to see if your local resolver is affected.
It's possible you don't see an issue while other users (whose resolvers validate DNSSEC) do, so it's important to confirm with the checks above.
Accessing DNS servers without DNS
In any case, the next step is to recover access to the nameservers. For this, you might need to login to the machines over SSH, and that will prove difficult without DNS. There are a few options to recover from that:
- existing SSH sessions. if you already have a shell on another torproject.org server (e.g. people.torproject.org) it might be able to resolve other hosts; try to resolve nevii.torproject.org there first
- SSH known_hosts. you should have a copy of the known_hosts.d/torproject.org database, which has an IP associated with each key. This will do a reverse lookup of all the records associated with a given name:
  grep $(grep nevii ~/.ssh/known_hosts.d/torproject.org | cut -d' ' -f 3 | tail -1) ~/.ssh/known_hosts.d/torproject.org
  Here are, for example, all the ED25519 records for nevii, which show the IP addresses:
  anarcat@angela:~> grep $(grep nevii ~/.ssh/known_hosts.d/torproject.org | cut -d' ' -f 3 | tail -1) ~/.ssh/known_hosts.d/torproject.org
  nevii.torproject.org ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAFMxtFP4h4s+xX5Or5XGjBgCNW+a6t9+ElflLG7eMLL
  2a01:4f8:fff0:4f:266:37ff:fee9:5df8 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAFMxtFP4h4s+xX5Or5XGjBgCNW+a6t9+ElflLG7eMLL
  2a01:4f8:fff0:4f:266:37ff:fee9:5df8 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAFMxtFP4h4s+xX5Or5XGjBgCNW+a6t9+ElflLG7eMLL
  49.12.57.130 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAFMxtFP4h4s+xX5Or5XGjBgCNW+a6t9+ElflLG7eMLL
  49.12.57.130 is nevii's IPv4 address in this case.
- LDAP. if, somehow, you have a dump of the LDAP database, IP addresses are recorded there.
- Hetzner. Some machines are currently hosted at Hetzner, which should still be reachable in case of a DNS-specific outage. The control panel can be used to get console access to the physical host the virtual machine is hosted on (e.g. fsn-node-01.torproject.org) and, from there, the VM.
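Once you have an IP address, you can SSH to it directly; the standard OpenSSH HostKeyAlias option makes the client match the host key stored under the hostname, for example:
ssh -o HostKeyAlias=nevii.torproject.org root@49.12.57.130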
Reference
Installation
Secondary name server
To install a secondary nameserver, you first need to create a new machine, of course. Requirements for this service:
- trusted location, since DNS is typically clear text traffic
- DDoS resistant, since those have happened in the past
- stable location because secondary name servers are registered as "glue records" in our zones and those take time to change
- 2 cores, 2GB of RAM and a few GBs of disk should be plenty for now
In the following example, we setup a new secondary nameserver in the gnt-dal Ganeti cluster:
- create the virtual machine:
  gnt-instance add -o debootstrap+bullseye -t drbd --no-wait-for-sync --net 0:ip=pool,network=gnt-dal-01 --no-ip-check --no-name-check --disk 0:size=10G --disk 1:size=2G,name=swap --backend-parameters memory=2g,vcpus=2 ns3.torproject.org
- follow the rest of the new machine procedure
- add the bind::secondary class to the instance in Puppet, also add it to modules/bind/templates/named.conf.options.erb and modules/bind/templates/named.conf.puppet-shared-keys.erb
- generate a tsig secret on the primary server (currently nevii): tsig-keygen
- add that secret in Trocla with this command on the Puppet server (currently pauli): trocla set tsig-nevii.torproject.org-ns3.torproject.org plain
- add the server to the /srv/dns.torproject.org/etc/dns-helpers.yaml configuration file (!)
- regenerate the zone files: sudo -u dnsadm /srv/dns.torproject.org/bin/update
- run puppet on the new server, then on the primary
- test the new nameserver: at this point, you should be able to resolve names from the secondary server, for example this should work:
  dig torproject.org @ns3.torproject.org
  Test some reverse DNS as well, for example:
  dig -x 204.8.99.101 @ns3.torproject.org
  The logs on the primary server should not have too many warnings:
  journalctl -u named -f
- once the server is behaving correctly, add it to the glue records:
  - login to joker.com
  - go to "Nameserver"
  - "Create a new nameserver" (or, if it already exists, "Change" it)
Nagios should pick up the changes and the new nameserver
automatically. The affected check is DNS SOA sync - torproject.org
and similar, or the dsa_check_soas_add check command.
Upgrades
SLA
Design and architecture
TODO: This needs to be documented better. weasel made a blog post describing parts of the infrastructure on Debian.org, and that is partly relevant to TPO as well.
Most DNS records are managed in LDAP, see the DNS zone file management documentation about that.
Puppet DNS hooks
Puppet can inject DNS records in the torproject.org zonefile with
dnsextras::entry (of which dnsextras::tlsa_record is a
wrapper). For example, this line:
$vhost = 'gitlab.torproject.org'
$algo = 'ed25519'
$hash = 'sha256'
$record = 'SSHFP 4 2 4e6dedc77590b5354fce011e82c877e03bbd4da3d16bb1cdcf56819a831d28bd'
dnsextras::entry { "sshfp-alias-${vhost}-${algo}-${hash}":
zone => 'torproject.org',
content => "${vhost}. IN ${record}",
}
... will create an entry like this (through a Concat resource) on
the DNS server, in
/srv/dns.torproject.org/puppet-extra/include-torproject.org:
; gitlab-02.torproject.org sshfp-alias-gitlab.torproject.org-ed25519-sha256
gitlab.torproject.org. IN SSHFP 4 2 4e6dedc77590b5354fce011e82c877e03bbd4da3d16bb1cdcf56819a831d28bd
Even though the torproject.org zone file in domains.git has an
$INCLUDE directive for that file, you do not see that in the
generated file on disk on the DNS server.
Instead, it is compiled into the final zonefile, through a hook run
from Puppet (Exec[rebuild torproject.org zone]) which runs:
/bin/su - dnsadm -c "/srv/dns.torproject.org/bin/update"
That, among many other things, calls
/srv/dns.torproject.org/repositories/dns-helpers/write_zonefile
which, through dns-helpers/DSA/DNSHelpers.pm, calls the lovely
compile_zonefile() function which essentially does:
named-compilezone -q -k fail -n fail -S fail -i none -m fail -M fail -o $out torproject.org $in
... with temporary files. That eventually renames a temporary file
to /srv/dns.torproject.org/var/generated/torproject.org.
This means the records you write from Puppet will not be exactly the same in the generated file, because they are compiled by named-compilezone(8). For example, a record like:
_25._tcp.gitlab-02.torproject.org. IN TYPE52 \# 35 03010129255408eafcfd811854c89404b68467298d3000781dc2be0232fa153ff3b16b
is rewritten as:
_25._tcp.gitlab-02.torproject.org. 3600 IN TLSA 3 1 1 9255408EAFCFD811854C89404B68467298D3000781DC2BE0232FA15 3FF3B16B
Note that this is a different source of truth than the primary source of truth for DNS records, which is LDAP. See the DNS zone file management section about this in particular.
mini-nag operation
mini-nag is a small Python script that performs monitoring of the mirror system to take mirrors out of rotation when they become unavailable or are scheduled for reboot. This section tries to analyze its mode of operation with the Nagios/NRPE retirement in mind (tpo/tpa/team#41734).
The script is manually deployed on the primary DNS server (currently
nevii). There's a mostly empty class called profile:mini_nag in
Puppet, but otherwise the script is manually configured.
The main entry point for regular operation is in the dnsadm user
crontab (/var/spool/cron/crontabs/dnsadm), which calls mini-nag (in
/srv/dns.torproject.org/repositories/mini-nag/mini-nag) every 2
minutes.
It is called first with the check argument, then with update-bad,
checking the timestamp of the status directory
(/srv/dns.torproject.org/var/mini-nag/status), and if there's a
change, it triggers the zone rebuild script
(/srv/dns.torproject.org/bin/update).
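The crontab entries presumably look something like this (a sketch, not the literal deployed lines):
*/2 * * * * /srv/dns.torproject.org/repositories/mini-nag/mini-nag check
*/2 * * * * /srv/dns.torproject.org/repositories/mini-nag/mini-nag update-bad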
The check command does this (function check()):
- load the auto-dns YAML configuration file /srv/dns.torproject.org/repositories/auto-dns/hosts.yaml
- connect to the database /srv/dns.torproject.org/var/mini-nag/status.db
- in separate threads, run checks in "soft" mode, if configured in the checks field of hosts.yaml:
  - ping-check: local command check_ping -H @@HOST@@ -w 800,40% -c 1500,60% -p 10
  - http-check: local command check_http -H @@HOST@@ -t 30 -w 15
- in separate threads, run checks in "hard" mode, if configured in the checks field of hosts.yaml:
  - shutdown-check: remote NRPE command check_nrpe -H @@HOST@@ -n -c dsa2_shutdown | grep system-in-shutdown
  - debianhealth-check: local command check_http -I @@HOST@@ -u http://debian.backend.mirrors.debian.org/_health -t 30 -w 15
  - debughealth-check: local command check_http -I @@HOST@@ -u http://debug.backend.mirrors.debian.org/_health -t 30 -w 15
- wait for threads to complete, with a 35 second timeout (function join_checks())
- insert results in an SQLite database, a row like (function insert_results()):
  - host: hostname (string)
  - test: check name (string)
  - ts: unix timestamp (integer)
  - soft: if the check failed (boolean)
  - hard: if the check was "hard" and it failed
  - msg: output of the command, or "check timeout" if the timeout was hit
- do some dependency checks between hosts (function dependency_checks()), a noop since we don't have any depends field in hosts.yaml
- commit changes to the database and exit
Currently, only the ping-check, shutdown-check, and http-check
checks are enabled in hosts.yaml.
Essentially, the check command runs some probes and writes the
results in the SQLite database, logging command output, timestamp and
status.
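To inspect those results by hand, something like this should work (a sketch; the host_status table name is taken from the SQL query quoted below):
sqlite3 /srv/dns.torproject.org/var/mini-nag/status.db \
    "SELECT host, test, datetime(ts, 'unixepoch'), soft, hard, msg FROM host_status ORDER BY ts DESC LIMIT 10;"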
The update_bad command does this (function update_bad()):
- find bad hosts from the database (function get_bad()), which does this:
  - cleanup hosts older than an expiry time (900 seconds, function cleanup_bad_in_db())
  - run this SQL query (function get_bad_from_db()):
    SELECT total, soft*1.0/total as soft, hard, host, test FROM (SELECT count(*) AS total, sum(soft) AS soft, sum(hard) AS hard, host, test FROM host_status GROUP BY host, test) WHERE soft*1.0/total > 0.40 OR hard > 0
  - return a dictionary of host => list of checks that have failed, where "failed" is defined as "the test is 'hard'" or, if soft, more than 40% of the checks failed
- cleanup files in the status directory that are not in the bad_hosts list
- for each bad host above, if the host is not already in the status directory:
  - create an empty file with the hostname in the status directory
  - send an email to the secret tor-misccommit alias to send notifications over IRC

In essence, the update_bad command will look in the database to see
if there are hosts with bad check results and will sync the
status directory to reflect that status.
From there, the update command will run the
/srv/dns.torproject.org/repositories/auto-dns/build-services command
from the auto-dns repository which checks the status directory
for the flag file, and skips including that host if the flag is present.
DNSSEC
DNSSEC records are managed automatically by
manage-dnssec-keys in the dns-helpers git repository, through
a cron job in the dnsadm user on the master DNS server (currently
nevii).
There used to be a Nagios hook in
/srv/dns.torproject.org/bin/dsa-check-and-extend-DS that basically
wraps manage-dnssec-keys with some Nagios status codes, but it is
believed this hook is not fired anymore, and only the above cron job
remains.
This is legacy that we aim at converting to BIND's new automation, see tpo/tpa/team#42268.
Services
Storage
mini-nag stores check results in a SQLite database, in
/srv/dns.torproject.org/var/mini-nag/status.db and uses the status
directory (/srv/dns.torproject.org/var/mini-nag/status/) as a
messaging system to auto-dns. Presence of a file there implies the
host is down.
Queues
Interfaces
Authentication
Implementation
Related services
Issues
There is no issue tracker specifically for this project. File or search for issues in the team issue tracker with the label ~DNS.
Maintainer
Users
Upstream
Monitoring and metrics
Tests
Logs
Backups
Other documentation
Discussion
Overview
Security and risk assessment
Technical debt and next steps
Proposed Solution
Other alternatives
Debian registrar scripts
Debian has a set of scripts to automate talking to some providers like Netnod. A YAML file has metadata about the configuration, and pushing changes is as simple as:
publish tor-dnsnode.yaml
That config file would look something like:
---
endpoint: https://dnsnodeapi.netnod.se/apiv3/
base_zone:
endcustomer: "TorProject"
masters:
# nevii.torproject.org
- ip: "49.12.57.130"
tsig: "netnod-torproject-20180831."
- ip: "2a01:4f8:fff0:4f:266:37ff:fee9:5df8"
tsig: "netnod-torproject-20180831."
product: "probono-premium-anycast"
This is not currently in use at TPO and changes are operated manually through the web interface.
zonetool
https://git.autistici.org/ai3/tools/zonetool is a YAML based zone generator with DNSSEC support.
Other resolvers and servers
We currently use bind and unbound as DNS servers and resolvers, respectively. bind, in particular, is a really old codebase and has been known to have security and scalability issues. We've also had experiences with unbound being unreliable, see for example crashes when running out of disk space, but also when used on roaming clients (e.g. anarcat's laptop).
Here are known alternatives:
- hickory-dns: full stack (resolver, server, client), 0.25 (not 1.0) as of 2025-03-27, but used in production at Let's Encrypt, Rust rewrite, packaged in Debian 13 (trixie) and later
- knot: resolver, 3.4.5 as of 2025-03-27, used in production at Riseup and nic.cz, C, packaged in Debian
- dnsmasq: DHCP server and DNS resolver, more targeted at embedded devices, C
- PowerDNS: authoritative server, resolver, database-backed, used by Tails, C++
Previous monitoring implementation
This section details how monitoring of DNS services was implemented in Nagios.
First, simple DNS (as opposed to DNSSEC) wasn't directly monitored per se. It was assumed, we presume, that normal probes would trigger alerts if DNS resolution failed. We did have monitoring for a weird bug in unbound, but this was fixed in Debian trixie and the check wasn't ported to Prometheus.
Most of the monitoring was geared towards the more complex DNSSEC setup.
It consisted of the following checks, as per TPA-RFC-33:
| name | command | note |
|---|---|---|
| DNS SOA sync - * | dsa_check_soas_add | checks that zones are in sync on secondaries |
| DNS - delegation and signature expiry | dsa-check-zone-rrsig-expiration-many | |
| DNS - zones signed properly | dsa-check-zone-signature-all | |
| DNS - security delegations | dsa-check-dnssec-delegation | |
| DNS - key coverage | dsa-check-statusfile | dsa-check-statusfile /srv/dns.torproject.org/var/nagios/coverage on nevii, could be converted as is |
| DNS - DS expiry | dsa-check-statusfile | dsa-check-statusfile /srv/dns.torproject.org/var/nagios/ds on nevii |
That said, this is not much information. Let's dig into each of those checks to see precisely what each does and what we need to replicate in the new monitoring setup.
SOA sync
This was configured in the YAML file as:
- name: DNS SOA sync - torproject.org
  check: "dsa_check_soas_add!nevii.torproject.org!torproject.org"
  hosts: global
- name: DNS SOA sync - torproject.net
  check: "dsa_check_soas_add!nevii.torproject.org!torproject.net"
  hosts: global
- name: DNS SOA sync - torproject.com
  check: "dsa_check_soas_add!nevii.torproject.org!torproject.com"
  hosts: global
- name: DNS SOA sync - 99.8.204.in-addr.arpa
  check: "dsa_check_soas_add!nevii.torproject.org!99.8.204.in-addr.arpa"
  hosts: global
- name: DNS SOA sync - 0.0.0.0.2.0.0.6.7.0.0.0.0.2.6.2.ip6.arpa
  check: "dsa_check_soas_add!nevii.torproject.org!0.0.0.0.2.0.0.6.7.0.0.0.0.2.6.2.ip6.arpa"
  hosts: global
- name: DNS SOA sync - onion-router.net
  check: "dsa_check_soas_add!nevii.torproject.org!onion-router.net"
  hosts: global
And that command defined as:
define command{
command_name dsa_check_soas_add
command_line /usr/lib/nagios/plugins/dsa-check-soas -a "$ARG1$" "$ARG2$"
}
That was a Ruby script written in 2006 by weasel, which did the following:
- parse the command line: -a (--add) is an additional nameserver to check (nevii, in all cases), -n (--no-soa-ns) says to not query the "SOArecord" (sic) for a list of nameservers (the script actually checks the NS records for a list of nameservers, not the SOA)
- fail if -n is specified without -a
- for each domain on the command line (in practice, we always process one domain at a time, so this is irrelevant):
  - fetch the NS records for the domain from the default resolver, add the --add server to the list of servers to check (names are resolved to IP addresses, possibly multiple)
  - for all nameservers, query the SOA record for the checked domain on the given nameserver, raise a warning if resolution fails or we have more or less than one SOA record
  - record the serial number in a de-duplicated list
  - raise a warning if no serial number was found
  - raise a warning if different serial numbers are found
The output looks like:
> ./dsa-check-soas torproject.org
torproject.org is at 2025092316
A failure looks like:
Nameserver ns5.torproject.org for torproject.org returns 0 SOAs
This script should be relatively easy to port to Prometheus, but we need to figure out what metrics might look like.
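For example, a port could expose one serial per zone and nameserver, leaving the comparison to an alerting rule (hypothetical metric names):
dns_soa_serial{zone="torproject.org",nameserver="ns5.torproject.org"} 2025092316
dns_soa_query_success{zone="torproject.org",nameserver="ns5.torproject.org"} 1
An alerting rule could then fire when min(dns_soa_serial) and max(dns_soa_serial) diverge for a zone, or when dns_soa_query_success is zero.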
delegation and signature expiry
The dsa-check-zone-rrsig-expiration-many command was configured as a
NRPE check in the YAML file as:
- name: DNS - delegation and signature expiry
  hosts: global
  remotecheck: "/usr/lib/nagios/plugins/dsa-check-zone-rrsig-expiration-many --warn 20d --critical 7d /srv/dns.torproject.org/repositories/domains"
  runfrom: nevii
That is a Perl script written in 2010 by weasel. Interestingly, the
default warning time in the script is 14d, not 20d. There's a check
timeout set to 45 which we presume to be seconds.
The script uses threads and is a challenge to analyze.
- it parses all files in the given directory (/srv/dns.torproject.org/repositories/domains), which currently contains the files:
  0.0.0.0.2.0.0.6.7.0.0.0.0.2.6.2.ip6.arpa 30.172.in-addr.arpa 99.8.204.in-addr.arpa onion-router.net torproject.com torproject.net torproject.org
- for each zone, it checks if the file has a comment that matches ; wzf: dnssec = 0 (with tolerance for whitespace), in which case the zone is considered "unsigned"
- for "signed" zones, the check-initial-refs command is recorded in a hash keyed by zone
- it does things for "geo" things that we will ignore here
- it creates a thread for each signed zone which will (in check_one) run the dsa-check-zone-rrsig-expiration check with the initial-refs saved above
- it collects and prints the result, grouping the zones by status (OK, WARN, CRITICAL, depending on the thresholds)

Note that only one zone has the initial-refs set, 30.172.in-addr.arpa:
; check-initial-refs = ns1.torproject.org,ns3.torproject.org,ns4.torproject.org,ns5.torproject.org
No zone has the wzf flag to mark a zone as unsigned.
So this is just a thread executor for each zone, in other words, which
delegates to dsa-check-zone-rrsig-expiration, so let's look at how
that works.
That other script is also a Perl script, "downloaded from http://dns.measurement-factory.com/tools/nagios-plugins/check_zone_rrsig_expiration.html on 2010-02-07" by Peter Palfrader, that script being itself from 2008. It is, presumably, a "nagios plugin to check expiration times of RRSIG records. Reminds you if its time to re-sign your zone."
Concretely, it recurses from the root zones to find the NS records
for the zone, warns about lame nameservers and expired RRSIG records
from any nameserver.
Its overall execution is: do_recursion, then do_queries, then do_analyze.
do_recursion fetches the authoritative NS records from the root
servers, this way:
1. iterate randomly over the root servers ([abcdefghijklm].root-servers.net)
2. ask for the NS record for the zone on each, stopping when any response is received, exiting with a CRITICAL status if no server is responding, or a server responds with an error
3. reset the list of servers to the NS records returned, go to 2, unless we hit the zone record, in which case we record the NS records
At this point we have a list of NS servers for the zone to query,
which we do with do_queries:
- for each NS record, query and record the SOA packet on that nameserver, with DNSSEC enabled (equivalent to dig -t SOA example.com +dnssec)
... and then, of course, we do_analyze, which is where you have the
core business logic of the check:
- for each SOA record fetched from the nameservers found in do_queries:
  - warn about lame nameservers: not sure how that's implemented, $pkt->header->ancount? (technically, a lame nameserver is when a nameserver recorded in the parent zone's NS records doesn't answer a SOA request)
  - count the number of nameservers found, warn if none found
  - warn if no RRSIG is found
  - for each RRSIG record found in that packet:
    - check the sigexpiration field, parse it as a UTC (ISO?) timestamp
    - warn/crit if the RRSIG record expires in the past or soon
A single run takes about 12 seconds here; it's pretty slow. It looks like this on success:
> ./dsa-check-zone-rrsig-expiration torproject.org
ZONE OK: No RRSIGs at zone apex expiring in the next 7.0 days; (6.36s) |time=6.363434s;;;0.000000
In practice, I do not remember ever seeing a failure with this.
zones signed properly
This check was defined in the YAML file as:
- name: DNS - zones signed properly
  hosts: global
  remotecheck: "/usr/lib/nagios/plugins/dsa-check-zone-signature-all"
  runfrom: nevii
The dsa-check-zone-signature-all script essentially performs a
dnssec-verify over each zone file transferred with an AXFR:
if dig $EXTRA -t axfr @"$MASTER" "$zone" | dnssec-verify -o "$zone" /dev/stdin > "$tmp" 2>&1; then
... and it counts the number of failures.
This reminds me of tpo/tpa/domains#1, where we want to check
SPF records for validity, which the above likely does not do.
security delegations
This check is configured with:
- name: DNS - security delegations
  hosts: global
  remotecheck: "/usr/lib/nagios/plugins/dsa-check-dnssec-delegation --dir /srv/dns.torproject.org/repositories/domains check-header"
  runfrom: nevii
The dsa-check-dnssec-delegation script was written in 2010 by weasel
and can perform multiple checks, but in practice here it's configured
in check-header mode, which we'll restrict ourselves to here. That
mode is equivalent to check-dlv and check-ds which might mean
"check everything", then.
The script then:
- iterates over all zones
- checks for ; ds-in-parent=yes and dlv-submit=yes in the zone, which can be used to disable checks on some zones
- fetches the DNSKEY records for the zone
- fetches the DS records for the zone, intersects with the DNSKEY records, warns on an empty intersection or superfluous DS records
- also checks DLV records at the ISC, but those have been retired
key coverage
This check is defined in:
- name: DNS - key coverage
  hosts: global
  remotecheck: "/usr/lib/nagios/plugins/dsa-check-statusfile /srv/dns.torproject.org/var/nagios/coverage"
  runfrom: nevii
So it just outsources to a status file that's piped into that generic
wrapper. This check is therefore actually implemented in
dns-helpers/bin/dsa-check-dnssec-coverage-all-nagios-wrap. This, of
course, is a wrapper for dsa-check-dnssec-coverage-all which
iterates through the auto-dns and domains zones and runs
dnssec-coverage like this for auto-dns zones:
dnssec-coverage \
-c named-compilezone \
-K "$BASE"/var/keys/"$zone" \
-r 10 \
-f "$BASE"/var/geodns-zones/db."$zone" \
-z \
-l "$CUTOFF" \
"$zone"
and like this for domains zones:
dnssec-coverage \
-c named-compilezone \
-K "$BASE"/var/keys/"$zone" \
-f "$BASE"/var/generated/"$zone" \
-l "$CUTOFF" \
"$zone"
Now that script (dnssec-coverage) was apparently written in 2013
by the ISC. Like manage-dnssec-keys (below), it has its own Key
representation of a DNSSEC "key". It checks for:
PHASE 1--Loading keys to check for internal timing problems
PHASE 2--Scanning future key events for coverage failures
Concretely, it:
- "ensure[s] that the gap between Publish and Activate is big enough" and in the right order (Publish before Activate)
- "ensure[s] that the gap between Inactive and Delete is big enough" and in the right order, and checks for missing Inactive
- some hairy code checks the sequence of events and raises errors like ERROR: No KSK's are active after this event; it seems to check in the future to see if there are missing active or published keys, and for keys that are both active and published
DS expiry
- name: DNS - DS expiry
  hosts: global
  remotecheck: "/usr/lib/nagios/plugins/dsa-check-statusfile /srv/dns.torproject.org/var/nagios/ds"
  runfrom: nevii
Same, but with dns-helpers/bin/dsa-check-and-extend-DS. As mentioned
above, that script is essentially just a wrapper for:
dns-helpers/manage-dnssec-keys --mode ds-check $zones
... with the output as extra information for the Nagios state file.
It is disabled with the ds-disable-checks = yes (note the
whitespace: it matters) either in auto-dns/zones/$ZONE or
domains/$ZONE.
The manage-dnssec-keys script, in ds-check mode, does the following
(mostly in the KeySet constructor and KeySet.check_ds):
- load the keys from the keydir (defined in /etc/dns-helpers.yaml)
- load the timestamps, presumably from the dsset file
- check the DS record for the zone
- check if the DS keys (keytag, algo, digest) match an on-disk key
- check for expiry, bumping expiry for some entries, against the loaded timestamps
It's unclear if we need to keep implementing this at all if we stop
expiring DS entries. But it might be good to keep it to check for
consistency and, while we're at it, might as well check for expiry.
Summary
So the legacy monitoring infrastructure was checking the following:
- SOA sync, for all zones:
  - check the local resolver for NS records, all IP addresses
  - check all NS records respond
  - check that they all serve the same SOA serial number
- RRSIG check, for all zones:
  - check the root name servers for NS records
  - check the SOA records in DNSSEC mode (which attaches a RRSIG record) on each name server
  - check for lame nameservers
  - check for RRSIG expiration or missing records
- whatever it is that dnssec-verify is doing, unchecked
- DS/DNSKEY match check, for all zones:
  - pull all DS records from the local resolver
  - compare with local DNSKEY records, warn about missing or superfluous keys
- dsset expiration checks:
  - check that event ordering is correct
  - check the DS records in DNS match the ones on disk (again?)
  - check the dsset records for expiration
Implementation ideas
The python3-dns library is already in use in some of the legacy
code.
The prometheus-dnssec-exporter handles the following:
- RRSIG expiry (days left and "earliest expiry")
- DNSSEC resolution is functional

Similarly, the dns exporter only checks whether records resolve, and latency.
We are therefore missing quite a bit here, most importantly:
- SOA sync
- lame nameservers
- missing RRSIG records (although the dnssec exporters somewhat implicitly check that by not publishing a metric, that's an easy thing to misconfigure)
- DS/DNSKEY records match
- local DS record expiration
Considering that the dnssec exporter implements so little, it seems we would need to essentially start from scratch and write an entire monitoring stack for this.
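That said, the core of the most important missing check (SOA sync) is small; a shell sketch of the comparison, to be wrapped in a metrics exporter:
zone=torproject.org
for ns in $(dig +short NS $zone); do
    # the third field of the SOA RDATA is the serial
    echo "$ns $(dig +short SOA $zone @$ns | awk '{print $3}')"
done
All lines should report the same serial; a real implementation would more likely use one of the Python libraries below.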
Multiple Python DNS libraries exist in Debian already:
- python3-aiodns (installed locally on my workstation)
- python3-dns (ditto)
- python3-dnspython (ditto, already used on nevii)
- python3-getdns