DNS is the Domain Name System. It is what turns a name like www.torproject.org into an IP address that can be routed over the Internet. TPA maintains its own DNS servers and this document attempts to describe how those work.

TODO: mention unbound and a rough overview of the setup here

[[TOC]]

Tutorial

How to

Most operations on DNS happen in the domains repository (dnsadm@nevii.torproject.org:/srv/dns.torproject.org/repositories/domains). That repository contains the master copy of the zone files, stored as (mostly) standard Bind zonefiles (RFC 1034), but notably without a SOA.

Tor's DNS is fully authenticated with DNSSEC, both for the outside world and internally: all TPO hosts use DNSSEC validation in their resolvers.

Editing a zone

Zone records can be added or modified by editing the relevant zone file in the domains git repository and pushing the change.

Serial numbers are managed automatically by the git repository hooks.
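
For example, a typical edit might look like this (a rough sketch; the record and commit message are illustrative, the repository path is the one given above):

git clone dnsadm@nevii.torproject.org:/srv/dns.torproject.org/repositories/domains
cd domains
$EDITOR torproject.org                  # add or change records; no SOA or serial to bump
git commit -a -m 'add example CNAME'    # illustrative commit message
git push                                # the repository hooks take care of the serial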

Adding a zone

To add a new zone to our infrastructure, the following procedure must be followed:

  1. add zone in domains repository (dnsadm@nevii.torproject.org:/srv/dns.torproject.org/repositories/domains)
  2. add zone in the modules/bind/templates/named.conf.torproject-zones.erb Puppet template for DNS secondaries to pick up the zone
  3. also add IP address ranges (if it's a reverse DNS zone file) to modules/torproject_org/misc/hoster.yaml in the tor-puppet.git repository
  4. run puppet on DNS servers: cumin 'C:roles::dns_primary or C:bind::secondary' 'puppet agent -t'
  5. add zone to modules/postfix/files/virtual, unless it is a reverse zonefile
  6. add zone to nagios: copy an existing DNS SOA sync block and adapt
  7. add zone to external DNS secondaries (currently Netnod)
  8. make sure the zone is delegated by the root servers somehow. for normal zones, this involves adding our nameservers in the registrar's configuration. for reverse DNS, this involves asking our upstreams to delegate the zone to our DNS servers.

Note that this is a somewhat rarer procedure: this happens only when a completely new domain name (e.g. torproject.net) or IP address space (so reverse DNS, e.g. 38.229.82.0/24 AKA 82.229.38.in-addr.arpa) is added to our infrastructure.
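
Once the zone is added and delegated (step 8 above), a quick sanity check (a sketch, using torproject.net as the example zone) is to confirm our primary serves it and that the parent delegates it:

dig -t SOA torproject.net @nevii.torproject.org
dig +trace -t NS torproject.net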

Removing a zone

  • git grep the domain in the tor-nagios git repository
  • remove the zone in the domains repository (dnsadm@nevii.torproject.org:/srv/dns.torproject.org/repositories/domains)
  • on nevii, remove the generated zonefiles and keys:

    cd /srv/dns.torproject.org/var/
    mv generated/torproject.fr* OLD-generated/
    mv keys/torproject.fr OLD-KEYS/
    
  • remove the zone from the secondaries (Netnod and our own servers): this means visiting the Netnod web interface on their side, and editing Puppet (modules/bind/templates/named.conf.torproject-zones.erb) for our own servers

  • the domains will probably be listed in other locations, grep Puppet for Apache virtual hosts and email aliases
  • the domains will also probably exist in the letsencrypt-domains repository
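
To find those leftovers, a grep like this (reusing the torproject.fr example above) in each of the repositories mentioned usually does the trick:

# run in tor-puppet.git, tor-nagios.git, letsencrypt-domains.git, etc.
git grep -i torproject.fr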

DNSSEC key rollover

We no longer rotate DNSSEC keys (KSKs, technically) automatically, but there may still be instances where a manual rollover is required. This involves new DNSKEY / DS records and requires manual operation at the registrar (currently https://joker.com).

There are two different scenarios for a manual rollover: (1) the current keys are no longer trusted and need to be disabled as soon as possible, and (2) the current ZSK can fade out along its automated 120-day cycle. An example of scenario 1 could be a compromise of private key material. An example of scenario 2 could be a preemptive upgrade to a stronger cipher without any indication of compromise.

Scenario 1

First, we create a new ZSK:

cd /srv/dns.torproject.org/var/keys/torproject.org
dnssec-keygen -I +120d -D +150d -a RSASHA256 -n ZONE torproject.org.

Then, we create a new KSK:

cd /srv/dns.torproject.org/var/keys/torproject.org
dnssec-keygen -f KSK -a RSASHA256 -n ZONE torproject.org.

And restart bind.

Run dnssec-dsfromkey on the newly generated KSK to get the corresponding new DS record.
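
For example (a sketch; the key tag in the filename is illustrative, use the file created by dnssec-keygen above):

cd /srv/dns.torproject.org/var/keys/torproject.org
dnssec-dsfromkey -2 Ktorproject.org.+008+12345.key    # -2 produces a SHA-256 digest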

Save this DS record to a file and propagate it to all our nodes so that unbound has a new trust anchor:

  • transfer (e.g. scp) the file to every node's /var/lib/unbound/torproject.org.key, as shown in the sketch after this list (and no, Puppet doesn't do that because it has replace => false on that file)
  • immediately restart unbound (be quick, because unbound can overwrite this file on its own)
  • after the restart, check to ensure that /var/lib/unbound/torproject.org.key has the new DS
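
A minimal sketch of that distribution step, one host at a time ($HOST and dsset.new are placeholders for the node and the file holding the new DS record):

# copy the new trust anchor in place, restart unbound, then check the file
scp dsset.new root@$HOST:/var/lib/unbound/torproject.org.key
ssh root@$HOST 'systemctl restart unbound && cat /var/lib/unbound/torproject.org.key'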

Puppet ships trust anchors for some of our zones to our unbounds, so make sure you update the corresponding file ( legacy/unbound/files/torproject.org.key ) in the puppet-control.git repository. You can replace it with only the new DS, removing the old one.

On nevii, add the new DS record to /srv/dns.torproject.org/var/keys/torproject.org/dsset, while keeping the old DS record there.

Finally, configure it at our registrar.

To do so on Joker, you need to visit joker.com and authenticate with the password in dns/joker in tor-passwords.git, along with the 2FA dance. Then:

  1. click on the "modify" button next to the domain affected (was first a gear but is now a pen-like icon thing)
  2. find the DNSSEC section
  3. click the "modify" button to edit records
  4. click "more" to add a record

Note that there are two keys there: one (the oldest) should already be in Joker. You need to add the new one.

With the above, you would have the following in Joker:

  • alg: 8 ("RSA/SHA-256", IANA, RFC5702)
  • digest: ebdf81e6b773f243cdee2879f0d12138115d9b14d560276fcd88e9844777d7e3
  • type: 2 ("SHA-256", IANA, RFC4509)
  • keytag: 57040

And click "save".

After a little while, you should be able to check whether the new DS record works on DNSviz.net; for example, the DNSviz.net view of torproject.net should be sane.
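
A command-line spot check can complement DNSviz; delv (shipped with BIND's dnsutils) performs full DNSSEC validation, and a validating public resolver should return the "ad" flag:

delv torproject.org SOA
dig +dnssec -t SOA torproject.org @8.8.8.8    # look for the "ad" flag in the header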

After saving the new record, wait one hour for the TTL to expire and delete the old DS record. Also remove the old DS record in /srv/dns.torproject.org/var/keys/torproject.org/dsset.

Wait another hour before removing the old KSK and ZSKs. To do so:

  • stop bind
  • remove the keypair files in /srv/dns.torproject.org/var/keys/torproject.org/
  • rm /srv/dns.torproject.org/var/generated/torproject.org.signed*
  • rm /srv/dns.torproject.org/var/generated/torproject.org.j*
  • start bind

With that, the rollover is finished.

Scenario 2

In this scenario, we keep our ZSKs and only create a new KSK:

cd /srv/dns.torproject.org/var/keys/torproject.org
dnssec-keygen -f KSK -a RSASHA256 -n ZONE torproject.org.

And restart bind.

Run dnssec-dsfromkey on the newly generated KSK to get the corresponding new DS record.

Puppet ships trust anchors for some of our zones to our unbounds, so make sure you update the corresponding file ( legacy/unbound/files/torproject.org.key ) in the puppet control repository. You can replace it with only the new DS.

On nevii, add the new DS record to /srv/dns.torproject.org/var/keys/torproject.org/dsset, while keeping the old DS record there.

Finally, configure it at our registrar.

To do so on Joker, you need to visit joker.com and authenticate with the password in dns/joker in tor-passwords.git, along with the 2FA dance. Then:

  1. click on the "modify" button next to the domain affected (was first a gear but is now a pen-like icon thing)
  2. find the DNSSEC section
  3. click the "modify" button to edit records
  4. click "more" to add a record

Note that there are two keys there: one (the oldest) should already be in Joker. You need to add the new one.

With the above, you would have the following in Joker:

  • alg: 8 ("RSA/SHA-256", IANA, RFC5702)
  • digest: ebdf81e6b773f243cdee2879f0d12138115d9b14d560276fcd88e9844777d7e3
  • type: 2 ("SHA-256", IANA, RFC4509)
  • keytag: 57040

And click "save".

After a little while, you should be able to check whether the new DS record works on DNSviz.net; for example, the DNSviz.net view of torproject.net should be sane.

After saving the new record, wait one hour for the TTL to expire and delete the old DS record. Also remove the old DS record in /srv/dns.torproject.org/var/keys/torproject.org/dsset.

Do not remove any keys yet, unbound needs 30 days (!) to complete slow, RFC5011-style rolling of KSKs.

After 30 days, remove the old KSK. To do so:

  • stop bind
  • remove the old KSK keypair files in /srv/dns.torproject.org/var/keys/torproject.org/
  • rm /srv/dns.torproject.org/var/generated/torproject.org.signed*
  • rm /srv/dns.torproject.org/var/generated/torproject.org.j*
  • start bind

With that, the rollover is finished.

Special case: RFC1918 zones

The above is for public zones, for which we have Nagios checks that warn us about impending doom. But we also sign zones for reverse IP lookups, specifically 30.172.in-addr.arpa. Normally, recursive nameservers pick up new signatures in that zone automatically, thanks to RFC 5011.

But if a new host gets provisioned, it needs to get bootstrapped somehow. This is done by Puppet, but those records are maintained by hand and will get out of date. This implies that after a while, you will start seeing messages like this for hosts that were installed after the expiration date:

16:52:39 <nsa> tor-nagios: [submit-01] unbound trust anchors is WARNING: Warning: no valid trust anchors found for 30.172.in-addr.arpa.

The solution is to go on the primary nameserver (currently nevii) and pick the non-revoked DSSET line from this file:

/srv/dns.torproject.org/var/keys/30.172.in-addr.arpa/dsset

... and inject it in Puppet, in:

tor-puppet/modules/unbound/files/30.172.in-addr.arpa.key

Then new hosts will get the right key and bootstrap properly. Old hosts can get the new key by removing the file by hand on the server and re-running Puppet:

rm /var/lib/unbound/30.172.in-addr.arpa.key ; puppet agent -t
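
If many old hosts need that refresh, something like cumin (used elsewhere in this document) can do it fleet-wide; this is only a sketch and the host selector is illustrative:

cumin '*' 'rm -f /var/lib/unbound/30.172.in-addr.arpa.key ; puppet agent -t'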

Transferring a domain

Joker

To transfer a domain from another registrar to joker.com, you will need the domain name you want to transfer, and an associated "secret" that you get when you unlock the domain at the other registrar, referred to below as the "secret".

Then follow these steps:

  1. login to joker.com

  2. in the main view, pick the "Transfer" button

  3. enter the domain name to be transferred, hit the "Transfer domain" button

  4. enter the secret in the "Auth-ID" field, then hit the "Proceed" button, ignoring the privacy settings

  5. pick the hostmaster@torproject.org contact as the "Owner", then for "Billing", uncheck the "Same as" button and pick accounting@torproject.org, then hit the "Proceed" button

  6. In the "Domain attributes", keep joker.com then check "Enable DNSSEC", and "take over existing nameserver records (zone)", leave "Automatic renewal" checked and "Whois opt-in" unchecked, then hit the "Proceed" button

  7. In the "Check Domain Information", review the data then hit "Proceed"

  8. In "Payment options", pick "Account", then hit "Proceed"

Pager playbook

In general, the tools used throughout the sections below (dig, DNSviz and the various Nagios plugins) are useful to debug DNS issues.

unbound trust anchors: Some keys are old

This warning can happen when a host was installed with old keys and unbound wasn't able to rotate them:

20:05:39 <nsa> tor-nagios: [chi-node-05] unbound trust anchors is WARNING: Warning: Some keys are old: /var/lib/unbound/torproject.org.key.

The fix is to remove the affected file and rerun Puppet:

rm /var/lib/unbound/torproject.org.key
puppet agent --test

unbound trust anchors: Warning: no valid trust anchors

So this can happen too:

11:27:49 <nsa> tor-nagios: [chi-node-12] unbound trust anchors is WARNING: Warning: no valid trust anchors found for 30.172.in-addr.arpa.

If this happens on many hosts, you will need to update the key, see the Special case: RFC1918 zones section, above. But if it's a single host, it's possible it was installed during the window where the key was expired, and hasn't been properly updated by Puppet yet.

Try this:

rm /var/lib/unbound/30.172.in-addr.arpa.key ; puppet agent -t

Then the warning should have gone away:

# /usr/lib/nagios/plugins/dsa-check-unbound-anchors
OK: All keys in /var/lib/unbound recent and valid

If not, see the Special case: RFC1918 zones section above.

DNS - zones signed properly is CRITICAL

When adding a new reverse DNS zone, it's possible you get this warning from Nagios:

13:31:35 <nsa> tor-nagios: [global] DNS - zones signed properly is CRITICAL: CRITICAL: 82.229.38.in-addr.arpa
16:30:36 <nsa> tor-nagios: [global] DNS - key coverage is CRITICAL: CRITICAL: 82.229.38.in-addr.arpa

That might be because Nagios thinks this zone should be signed (while it isn't and cannot be). The fix is to add this line to the zonefile:

; ds-in-parent = no

And push the change. Nagios should notice and stop caring about the zone.

In general, this Nagios check provides a good idea of the DNSSEC chain of a zone:

$ /usr/lib/nagios/plugins/dsa-check-dnssec-delegation overview 82.229.38.in-addr.arpa
                       zone DNSKEY               DS@parent       DLV dnssec@parent
--------------------------- -------------------- --------------- --- ----------
     82.229.38.in-addr.arpa                                          no(229.38.in-addr.arpa), no(38.in-addr.arpa), yes(in-addr.arpa), yes(arpa), yes(.)

Notice how the 38.in-addr.arpa parent zone is not signed? Our 82.229.38.in-addr.arpa zone therefore cannot have a DS record in its parent, so it cannot be validated with DNSSEC.

DNS - delegation and signature expiry is WARNING

If you get a warning like this:

13:30:15 <nsa> tor-nagios: [global] DNS - delegation and signature expiry is WARNING: WARN: 1: 82.229.38.in-addr.arpa: OK: 12: unsigned: 0

It might be that the zone is not delegated by upstream. To confirm, run this command on the Nagios server:

$ /usr/lib/nagios/plugins/dsa-check-zone-rrsig-expiration  82.229.38.in-addr.arpa
ZONE WARNING: No RRSIGs found; (0.66s) |time=0.664444s;;;0.000000

On the primary DNS server, you should be able to confirm the zone is signed:

dig @nevii  -b 127.0.0.1 82.229.38.in-addr.arpa +dnssec

Check the next DNS server up (use dig -t NS to find it) and see if the zone is delegated:

dig @ns1.cymru.com 82.229.38.in-addr.arpa +dnssec

If it's not delegated, it's because you forgot step 8 in the zone addition procedure. Ask your upstream or registrar to delegate the zone and run the checks again.

DNS - security delegations is WARNING

This error:

11:51:19 <nsa> tor-nagios: [global] DNS - security delegations is WARNING: WARNING: torproject.net (63619,-53722), torproject.org (33670,-28486)

... will happen after rotating the DNSSEC keys at the registrar. The trick is then simply to remove those keys, at the registrar. See DS records expiry and renewal for the procedure.

DNS SOA sync

An example of this problem is the error:

Nameserver ns5.torproject.org for torproject.org returns 0 SOAs

That is because the target nameserver (ns5 in this case) does not properly respond for the torproject.org zone. To reproduce the error, you can run this on the Nagios server:

/usr/lib/nagios/plugins/dsa-check-soas -a nevii.torproject.org torproject.org -v
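
You can also compare the SOA serials by hand between the primary and the affected secondary:

dig +short -t SOA torproject.org @nevii.torproject.org
dig +short -t SOA torproject.org @ns5.torproject.org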

This happens because the server doesn't correctly transfer the zones from the master. You can confirm the problem by looking at the logs on the affected server and on the primary server (e.g. with journalctl -u named -f). Restarting the server will trigger a zone transfer attempt.

Typically, this is because a change in tor-puppet.git was forgotten (in named.conf.options or named.conf.puppet-shared-keys).

DNS - DS expiry

Example:

2023-08-22 16:34:36 <nsa> tor-nagios: [global] DNS - DS expiry is WARNING: WARN: torproject.com, torproject.net, torproject.org : OK: 4
2023-08-26 16:25:39 <nsa> tor-nagios: [global] DNS - DS expiry is CRITICAL: CRITICAL: torproject.com, torproject.net, torproject.org : OK: 4

Full status information is, for example:

CRITICAL: torproject.com, torproject.net, torproject.org : OK: 4
torproject.com: Key 57040 about to expire.
torproject.net: Key 63619 about to expire.
torproject.org: Key 33670 about to expire.

This is Nagios warning you the DS records are about to expire. They will still be renewed so it's not immediately urgent to fix this, but eventually the DS records expiry and renewal procedure should be followed.

The old records that should be replaced are mentioned by Nagios in the extended status information, above.

DomainExpiring alerts

The DomainExpiring alert looks like:

Domain name tor.network is nearing expiry date

It means the domain (in this case tor.network) is going to expire soon. It should be renewed at our registrar quickly.

DomainExpiryDataStale alerts

The DomainExpiryDataStale alert looks like:

RDAP information for domain tor.network is stale

The information about a configured list of domain names is normally fetched by a daily systemd timer (tpa_domain_expiry) running on the Prometheus server. The metric recording the last RDAP refresh date tells us whether the metrics currently held in Prometheus reflect a recent state; we don't want to generate alerts from outdated data.

If this alert fires, it means that either the job is not running, or the results returned by the RDAP database show issues with the RDAP database itself. We cannot do much about the latter case, but the former we can fix.

Check the status of the job on the Prometheus server with:

systemctl status tpa_domain_expiry

You can try refreshing it with:

systemctl start tpa_domain_expiry
journalctl -e -u tpa_domain_expiry

You can run the query locally with Fabric to check the results:

fab dns.domain-expiry -d tor.network

It should look something like:

anarcat@angela:~/s/t/fabric-tasks> fab dns.domain-expiry -d tor.network
tor.network:
   expiration: 2025-05-27T01:09:38.603000+00:00
   last changed: 2024-05-02T16:15:48.841000+00:00
   last update of RDAP database: 2025-04-30T20:00:08.077000+00:00
   registration: 2019-05-27T01:09:38.603000+00:00
   transfer: 2020-05-23T17:10:52.960000+00:00

The last update of RDAP database field is the one used in this alert, and should correspond to the UNIX timestamp in the metric. The following Python code converts the above ISO 8601 date to such a timestamp, for example:

>>> from datetime import datetime
>>> datetime.fromisoformat("2025-04-30T20:00:08.077000+00:00").timestamp()
1746043208.077

DomainTransferred alerts

The DomainTransferred alert looks something like:

Domain tor.network recently transferred!

This, like the other domain alerts above, is generated by the same daily tpa_domain_expiry job that refreshes that data periodically for a list of domains.

If that alert fires, it means the given domain was transferred within the watch window (currently 7 days). Normally, when we transfer domains (which is really rare!), we should silence this alert preemptively to avoid this warning.

Otherwise, if you did mean to transfer this domain, you can silence this alert.

If the domain was really unexpectedly transferred, it's all hands on deck. You need to figure out how to transfer it back under your control, quickly, but even more quickly, you need to make sure the DNS servers recorded for the domain are still ours. If not, this is a real disaster recovery scenario, for which we do not currently have a playbook.
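
One quick way to see what the registry currently publishes about the domain (a sketch; whois output formats vary per TLD):

whois tor.network | grep -iE 'registrar|name server|status'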

For inspiration, perhaps read the hijacking of perl.com. Knowing people in the registry business can help.

Disaster recovery

Complete DNS breakdown

If DNS completely and utterly fails (for example because of a DS expiry that was mishandled), you will first need to figure out if you can still reach the nameservers.

First diagnostics

Normally, this should give you the list of name servers for the main .org domain:

dig -t NS torproject.org

If that fails, it means the domain might have expired. Login to the registrar (currently joker.com) and handle this as a DomainExpiring alert (above).

If that succeeds, the domain should be fine, but it's possible the DS records are revoked. Check those with:

dig -t DS torproject.org

You can also check popular public resolvers like Google and CloudFlare:

dig -t DS torproject.org @8.8.8.8
dig -t DS torproject.org @1.1.1.1

A DNSSEC error would look like this:

[...]

; EDE: 9 (DNSKEY Missing): (No DNSKEY matches DS RRs of torproject.org)

[...]

;; SERVER: 8.8.4.4#53(8.8.4.4) (UDP)

DNSviz can also help analyzing the situation here.

You can also try to enable or disable the DNS-over-HTTPS feature of Firefox to see if your local resolver is affected.

It's possible you don't see an issue but other users (which respect DNSSEC) do, so it's important to confirm the above.

Accessing DNS servers without DNS

In any case, the next step is to recover access to the nameservers. For this, you might need to login to the machines over SSH, and that will prove difficult without DNS. There are a few options to recover from that:

  1. existing SSH sessions. if you already have a shell on another torproject.org server (e.g. people.torproject.org), it might be able to resolve other hosts; try to resolve nevii.torproject.org there first

  2. SSH known_hosts. you should have a copy of the known_hosts.d/torproject.org database, which has an IP associated with each key. The following looks up all the entries sharing the key associated with a given name, which reveals its IP addresses:

    grep $(grep nevii ~/.ssh/known_hosts.d/torproject.org | cut -d' ' -f 3 | tail -1) ~/.ssh/known_hosts.d/torproject.org
    

    Here are, for example, all the ED25519 records for nevii which shows the IP address:

    anarcat@angela:~> grep $(grep nevii ~/.ssh/known_hosts.d/torproject.org | cut -d' ' -f 3 | tail -1) ~/.ssh/known_hosts.d/torproject.org
    nevii.torproject.org ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAFMxtFP4h4s+xX5Or5XGjBgCNW+a6t9+ElflLG7eMLL
    2a01:4f8:fff0:4f:266:37ff:fee9:5df8 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAFMxtFP4h4s+xX5Or5XGjBgCNW+a6t9+ElflLG7eMLL
    2a01:4f8:fff0:4f:266:37ff:fee9:5df8 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAFMxtFP4h4s+xX5Or5XGjBgCNW+a6t9+ElflLG7eMLL
    49.12.57.130 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAFMxtFP4h4s+xX5Or5XGjBgCNW+a6t9+ElflLG7eMLL
    

    49.12.57.130 is nevii's IPv4 address in this case.

  3. LDAP. if, somehow, you have a dump of the LDAP database, IP addresses are recorded there.

  4. Hetzner. Some machines are currently hosted at Hetzner, which should still be reachable in case of a DNS-specific outage. The control panel can be used to get a console access to the physical host the virtual machine is hosted on (e.g. fsn-node-01.torproject.org) and, from there, the VM.

Reference

Installation

Secondary name server

To install a secondary nameserver, you first need to create a new machine, of course. Requirements for this service:

  • trusted location, since DNS is typically clear text traffic
  • DDoS resistant, since those have happened in the past
  • stable location because secondary name servers are registered as "glue records" in our zones and those take time to change
  • 2 cores, 2GB of ram and a few GBs of disk should be plenty for now

In the following example, we setup a new secondary nameserver in the gnt-dal Ganeti cluster:

  1. create the virtual machine:

    gnt-instance add
        -o debootstrap+bullseye
        -t drbd --no-wait-for-sync
        --net 0:ip=pool,network=gnt-dal-01
        --no-ip-check
        --no-name-check
        --disk 0:size=10G
        --disk 1:size=2G,name=swap
        --backend-parameters memory=2g,vcpus=2
        ns3.torproject.org
    
  2. the rest of the new machine procedure

  3. add the bind::secondary class to the instance in Puppet, also add it to modules/bind/templates/named.conf.options.erb and modules/bind/templates/named.conf.puppet-shared-keys.erb

  4. generate a tsig secret on the primary server (currently nevii):

    tsig-keygen
    
  5. add that secret in Trocla with this command on the Puppet server (currently pauli):

    trocla set tsig-nevii.torproject.org-ns3.torproject.org plain
    
  6. add the server to the /srv/dns.torproject.org/etc/dns-helpers.yaml configuration file (!)

  7. regenerate the zone files:

    sudo -u dnsadm /srv/dns.torproject.org/bin/update
    
  8. run puppet on the new server, then on the primary

  9. test the new nameserver:

    At this point, you should be able to resolve names from the secondary server, for example this should work:

    dig torproject.org @ns3.torproject.org
    

    Test some reverse DNS as well, for example:

    dig -x 204.8.99.101 @ns3.torproject.org
    

    The logs on the primary server should not have too many warnings:

    journalctl -u named -f
    
  10. once the server is behaving correctly, add it to the glue records:

    1. login to joker.com
    2. go to "Nameserver"
    3. "Create a new nameserver" (or, if it already exists, "Change" it)

Nagios should pick up the changes and the new nameserver automatically. The affected check is DNS SOA sync - torproject.org and similar, or the dsa_check_soas_add check command.
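
To double-check the delegation and glue records as seen from the parent zone, a trace query is a quick sanity check:

dig +trace -t NS torproject.org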

Upgrades

SLA

Design and architecture

TODO: This needs to be documented better. weasel made a blog post describing parts of the infrastructure on Debian.org, and that is partly relevant to TPO as well.

Most DNS records are managed in LDAP, see the DNS zone file management documentation about that.

Puppet DNS hooks

Puppet can inject DNS records in the torproject.org zonefile with dnsextras::entry (of which dnsextras::tlsa_record is a wrapper). For example, this snippet:

$vhost = 'gitlab.torproject.org'
$algo = 'ed25519'
$hash = 'sha256'
$record = 'SSHFP 4 2 4e6dedc77590b5354fce011e82c877e03bbd4da3d16bb1cdcf56819a831d28bd'
dnsextras::entry { "sshfp-alias-${vhost}-${algo}-${hash}":
  zone => 'torproject.org',
  content => "${vhost}. IN ${record}",
}

... will create an entry like this (through a Concat resource) on the DNS server, in /srv/dns.torproject.org/puppet-extra/include-torproject.org:

; gitlab-02.torproject.org sshfp-alias-gitlab.torproject.org-ed25519-sha256
gitlab.torproject.org. IN SSHFP 4 2 4e6dedc77590b5354fce011e82c877e03bbd4da3d16bb1cdcf56819a831d28bd

Even though the torproject.org zone file in domains.git has an $INCLUDE directive for that file, you do not see that in the generated file on disk on the DNS server.

Instead, it is compiled into the final zonefile, through a hook run from Puppet (Exec[rebuild torproject.org zone]) which runs:

/bin/su - dnsadm -c "/srv/dns.torproject.org/bin/update"

That, among many other things, calls /srv/dns.torproject.org/repositories/dns-helpers/write_zonefile which, through dns-helpers/DSA/DNSHelpers.pm, calls the lovely compile_zonefile() function which essentially does:

named-compilezone -q -k fail -n fail -S fail -i none -m fail -M fail -o $out torproject.org $in

... with temporary files. That eventually renames a temporary file to /srv/dns.torproject.org/var/generated/torproject.org.

This means the records you write from Puppet will not be exactly the same in the generated file, because they are compiled by named-compilezone(8). For example, a record like:

_25._tcp.gitlab-02.torproject.org. IN TYPE52 \# 35  03010129255408eafcfd811854c89404b68467298d3000781dc2be0232fa153ff3b16b

is rewritten as:

_25._tcp.gitlab-02.torproject.org.            3600 IN TLSA      3 1 1  9255408EAFCFD811854C89404B68467298D3000781DC2BE0232FA15 3FF3B16B
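
To sanity-check the generated zone by hand, or to see what that normalization produces, named-checkzone (a close sibling of named-compilezone) can be pointed at the output file:

named-checkzone torproject.org /srv/dns.torproject.org/var/generated/torproject.org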

Note that this is a different source of truth than the primary source of truth for DNS records, which is LDAP. See the DNS zone file management section about this in particular.

mini-nag operation

mini-nag is a small Python script that performs monitoring of the mirror system to take mirrors out of rotation when they become unavailable or are scheduled for reboot. This section tries to analyze its mode of operation with the Nagios/NRPE retirement in mind (tpo/tpa/team#41734).

The script is manually deployed on the primary DNS server (currently nevii). There's a mostly empty class called profile::mini_nag in Puppet, but otherwise the script is manually configured.

The main entry point for regular operation is in the dnsadm user crontab (/var/spool/cron/crontabs/dnsadm), which calls mini-nag (in /srv/dns.torproject.org/repositories/mini-nag/mini-nag) every 2 minutes.

It is called first with the check argument, then with update-bad, checking the timestamp of the status directory (/srv/dns.torproject.org/var/mini-nag/status), and if there's a change, it triggers the zone rebuild script (/srv/dns.torproject.org/bin/update).

The check command does this (function check()):

  1. load the auto-dns YAML configuration file /srv/dns.torproject.org/repositories/auto-dns/hosts.yaml
  2. connect to the database /srv/dns.torproject.org/var/mini-nag/status.db
  3. in separate threads, run checks in "soft" mode, if configured in the checks field of hosts.yaml:
    • ping-check: local command check_ping -H @@HOST@@ -w 800,40% -c 1500,60% -p 10
    • http-check: local command check_http -H @@HOST@@ -t 30 -w 15
  4. in separate threads, run checks in "hard" mode, if configured in the checks field of hosts.yaml:
    • shutdown-check: remote NRPE command check_nrpe -H @@HOST@@ -n -c dsa2_shutdown | grep system-in-shutdown
    • debianhealth-check: local command check_http -I @@HOST@@ -u http://debian.backend.mirrors.debian.org/_health -t 30 -w 15
    • debughealth-check: local command check_http -I @@HOST@@ -u http://debug.backend.mirrors.debian.org/_health -t 30 -w 15
  5. wait for threads to complete, with a 35 second timeout (function join_checks())
  6. insert results in an SQLite database, a row like (function insert_results()):
    • host: hostname (string)
    • test: check name (string)
    • ts: unix timestamp (integer)
    • soft: if the check failed (boolean)
    • hard: if the check was "hard" and it failed
    • msg: output of the command, or check timeout if timeout was hit
  7. do some dependency checks between hosts (function dependency_checks()), a noop since we don't have any depends field in hosts.yaml
  8. commit changes to the database and exit

Currently, only the ping-check, shutdown-check, and http-check checks are enabled in hosts.yaml.

Essentially, the check command runs some probes and writes the results in the SQLite database, logging command output, timestamp and status.
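
To peek at recent results directly, a query like this (a sketch; column names are taken from the description above) can be run against the database:

sqlite3 /srv/dns.torproject.org/var/mini-nag/status.db \
  "SELECT host, test, datetime(ts, 'unixepoch'), soft, hard, msg FROM host_status ORDER BY ts DESC LIMIT 10"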

The update_bad command does this (function update_bad()):

  1. find bad hosts from the database (function get_bad()), which does this:

    1. cleanup old hosts older than an expiry time (900 seconds, function cleanup_bad_in_db())
    2. run this SQL query (function get_bad_from_db()):

    SELECT total, soft*1.0/total as soft, hard, host, test FROM (SELECT count(*) AS total, sum(soft) AS soft, sum(hard) AS hard, host, test FROM host_status GROUP BY host, test) WHERE soft*1.0/total > 0.40 OR hard > 0

    3. return a dictionary of host => list of failed checks, where "failed" means the test is "hard", or, for soft tests, more than 40% of the checks failed
  2. cleanup files in the status directory that are not in the bad_hosts list

  3. for each bad host above, if the host is not already in the status directory:

    1. create an empty file with the hostname in the status directory

    2. send an email to the secret tor-misc commit alias to send notifications over IRC

In essence, the update_bad command will look in the database to see if there are more hosts that have bad check results and will sync the status directory to reflect that status.

From there, the update command will run the /srv/dns.torproject.org/repositories/auto-dns/build-services command from the auto-dns repository which checks the status directory for the flag file, and skips including that host if the flag is present.
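
To see which hosts mini-nag currently considers down, you can simply list the flag files:

ls -l /srv/dns.torproject.org/var/mini-nag/status/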

DNSSEC

DNSSEC records are managed automatically by manage-dnssec-keys in the dns-helpers git repository, through a cron job in the dnsadm user on the master DNS server (currently nevii).
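
To see how and when that job runs (a sketch; run as root on the primary):

crontab -u dnsadm -l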

There used to be a Nagios hook in /srv/dns.torproject.org/bin/dsa-check-and-extend-DS that basically wraps manage-dnssec-keys with some Nagios status codes, but it is believed this hook is not fired anymore, and only the above cron job remains.

This is legacy that we aim at converting to BIND's new automation, see tpo/tpa/team#42268.

Services

Storage

mini-nag stores check results in a SQLite database, in /srv/dns.torproject.org/var/mini-nag/status.db and uses the status directory (/srv/dns.torproject.org/var/mini-nag/status/) as a messaging system to auto-dns. Presence of a file there implies the host is down.

Queues

Interfaces

Authentication

Implementation

Issues

There is no issue tracker specifically for this project. File or search for issues in the team issue tracker with the label ~DNS.

Maintainer

Users

Upstream

Monitoring and metrics

Tests

Logs

Backups

Other documentation

Discussion

Overview

Security and risk assessment

Technical debt and next steps

Proposed Solution

Other alternatives

Debian registrar scripts

Debian has a set of scripts to automate talking to some providers like Netnod. A YAML file has metadata about the configuration, and pushing changes is as simple as:

publish tor-dnsnode.yaml

That config file would look something like:

---
  endpoint: https://dnsnodeapi.netnod.se/apiv3/
  base_zone:
    endcustomer: "TorProject"
    masters:
      # nevii.torproject.org
      - ip: "49.12.57.130"
        tsig: "netnod-torproject-20180831."
      - ip: "2a01:4f8:fff0:4f:266:37ff:fee9:5df8"
        tsig: "netnod-torproject-20180831."
    product: "probono-premium-anycast"

This is not currently in use at TPO; changes are made manually through the web interface.

zonetool

https://git.autistici.org/ai3/tools/zonetool is a YAML based zone generator with DNSSEC support.

Other resolvers and servers

We currently use bind and unbound as DNS servers and resolvers, respectively. bind, in particular, is a really old codebase and has been known to have security and scalability issues. We've also had experiences with unbound being unreliable, see for example crashes when running out of disk space, but also when used on roaming clients (e.g. anarcat's laptop).

Here are known alternatives:

  • hickory-dns: full stack (resolver, server, client), 0.25 (not 1.0) as of 2025-03-27, but used in production at Let's Encrypt, Rust rewrite, packaged in Debian 13 (trixie) and later
  • knot: resolver, 3.4.5 as of 2025-03-27, used in production at Riseup and nic.cz, C, packaged in Debian
  • dnsmasq: DHCP server and DNS resolver, more targeted at embedded devices, C
  • PowerDNS, authoritative server, resolver, database-backed, used by Tails, C++

Previous monitoring implementation

This section details how monitoring of DNS services was implemented in Nagios.

First, simple DNS (as opposed to DNSSEC) wasn't directly monitored per se. It was assumed, we presume, that normal probes would trigger alerts if DNS resolution failed. We did have monitoring of a weird bug in unbound, but this was fixed in Debian trixie and the check wasn't ported to Prometheus.

Most of the monitoring was geared towards the more complex DNSSEC setup.

It consisted of the following checks, as per TPA-RFC-33:

  • DNS SOA sync - *: dsa_check_soas_add (checks that zones are in sync on secondaries)
  • DNS - delegation and signature expiry: dsa-check-zone-rrsig-expiration-many
  • DNS - zones signed properly: dsa-check-zone-signature-all
  • DNS - security delegations: dsa-check-dnssec-delegation
  • DNS - key coverage: dsa-check-statusfile /srv/dns.torproject.org/var/nagios/coverage on nevii (could be converted as is)
  • DNS - DS expiry: dsa-check-statusfile /srv/dns.torproject.org/var/nagios/ds on nevii

That said, this is not much information. Let's dig into each of those checks to see precisely what it does and what we need to replicate in the new monitoring setup.

SOA sync

This was configured in the YAML file as:

  -
    name: DNS SOA sync - torproject.org
    check: "dsa_check_soas_add!nevii.torproject.org!torproject.org"
    hosts: global
  -
    name: DNS SOA sync - torproject.net
    check: "dsa_check_soas_add!nevii.torproject.org!torproject.net"
    hosts: global
  -
    name: DNS SOA sync - torproject.com
    check: "dsa_check_soas_add!nevii.torproject.org!torproject.com"
    hosts: global
  -
    name: DNS SOA sync - 99.8.204.in-addr.arpa
    check: "dsa_check_soas_add!nevii.torproject.org!99.8.204.in-addr.arpa"
    hosts: global
  -
    name: DNS SOA sync - 0.0.0.0.2.0.0.6.7.0.0.0.0.2.6.2.ip6.arpa
    check: "dsa_check_soas_add!nevii.torproject.org!0.0.0.0.2.0.0.6.7.0.0.0.0.2.6.2.ip6.arpa"
    hosts: global
  -
    name: DNS SOA sync - onion-router.net
    check: "dsa_check_soas_add!nevii.torproject.org!onion-router.net"
    hosts: global

And that command defined as:

define command{
    command_name    dsa_check_soas_add
    command_line    /usr/lib/nagios/plugins/dsa-check-soas -a "$ARG1$" "$ARG2$"
}

That was a Ruby script written in 2006 by weasel, which did the following:

  1. parse the commandline, -a (--add) is an additional nameserver to check (nevii, in all cases), -n (--no-soa-ns) says to not query the "SOA record" (sic) for a list of nameservers

    (the script actually checks the NS records for a list of nameservers, not the SOA)

  2. fail if no -n is specified without -a

  3. for each domain on the commandline (in practice, we always process one domain at a time, so this is irrelevant)...

  4. fetch the NS records for the domain from the default resolver, and add the --add server to that list of servers to check (names are resolved to IP addresses, possibly multiple)

  5. for all nameservers, query the SOA record found for the checked domain on the given nameserver, raise a warning if resolution fails or we have more or less than one SOA record

  6. record the serial number in a de-duplicated list

  7. raise a warning if no serial number was found

  8. raise a warning if different serial numbers are found

The output looks like:

> ./dsa-check-soas torproject.org
torproject.org is at 2025092316

A failure looks like:

Nameserver ns5.torproject.org for torproject.org returns 0 SOAs

This script should be relatively easy to port to Prometheus, but we need to figure out what metrics might look like.
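
A very rough sketch of what such a port could look like, as a textfile-collector style shell script; the metric name and labels are made up here:

zone=torproject.org
for ns in $(dig +short -t NS $zone); do
    serial=$(dig +short -t SOA $zone @$ns | awk '{print $3}')
    # one sample per nameserver; alert when the serials diverge or a lookup fails
    echo "dns_soa_serial{zone=\"$zone\",nameserver=\"$ns\"} ${serial:-0}"
done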

delegation and signature expiry

The dsa-check-zone-rrsig-expiration-many command was configured as a NRPE check in the YAML file as:

  -
    name: DNS - delegation and signature expiry
    hosts: global
    remotecheck: "/usr/lib/nagios/plugins/dsa-check-zone-rrsig-expiration-many --warn 20d --critical 7d /srv/dns.torproject.org/repositories/domains"
    runfrom: nevii

That is a Perl script written in 2010 by weasel. Interestingly, the default warning time in the script is 14d, not 20d. There's a check timeout set to 45 which we presume to be seconds.

The script uses threads and is a challenge to analyze.

  1. it parses all files in the given directory (/srv/dns.torproject.org/repositories/domains), which currently contains the files:

    0.0.0.0.2.0.0.6.7.0.0.0.0.2.6.2.ip6.arpa 30.172.in-addr.arpa 99.8.204.in-addr.arpa onion-router.net torproject.com torproject.net torproject.org

  2. For each zone, it checks if the file has a comment that matches ; wzf: dnssec = 0 (with tolerance for whitespace), in which case the zone is considered "unsigned".

  3. For "signed" zones, the check-initial-refs command is recorded in a hash keyed

  4. it does some things for "geo" zones that we will ignore here

  5. it creates a thread for each signed zone which will (in check_one) run the dsa-check-zone-rrsig-expiration check with the initial-refs saved above

  6. it collects and prints the result, grouping the zones by status (OK, WARN, CRITICAL, depending on the thresholds)

Note that only one zone has the initial-refs set:

30.172.in-addr.arpa:; check-initial-refs = ns1.torproject.org,ns3.torproject.org,ns4.torproject.org,ns5.torproject.org

No zone has the wzf flag to mark a zone as unsigned.

In other words, this is just a thread executor per zone that delegates to dsa-check-zone-rrsig-expiration, so let's look at how that works.

That other script is also a Perl script, downloaded from http://dns.measurement-factory.com/tools/nagios-plugins/check_zone_rrsig_expiration.html on 2010-02-07 by Peter Palfrader; that script itself dates from 2008. It is, presumably, a "nagios plugin to check expiration times of RRSIG records. Reminds you if its time to re-sign your zone."

Concretely, it recurses from the root zones to find the NS records for the zone, warns about lame nameservers and expired RRSIG records from any nameserver.

Its overall execution is:

  1. do_recursion
  2. do_queries
  3. do_analyze

do_recursion fetches the authoritative NS records starting from the root servers, this way:

  1. iterate randomly over the root servers ([abcdefghijklm].root-servers.net)
  2. ask for the NS record for the zone on each, stopping when any response is received, exiting with a CRITICAL status if no server is responding, or a server responds with an error
  3. reset the list of servers to the NS records returned and go back to step 2, unless we hit the zone itself, in which case we record its NS records

At this point we have a list of NS servers for the zone to query, which we do with do_queries:

  1. for each NS record
  2. query and record the SOA packet on that nameserver, with DNSSEC enabled (equivalent to dig -t SOA example.com +dnssec)

... and then, of course, we do_analyze, which is where the core business logic of the check lives:

  1. for each SOA record fetched from the nameserver found in do_queries
  2. warn about lame nameservers: not sure how that's implemented, $pkt->header->ancount? (technically, a lame nameserver is when a nameserver recorded in the parent's zone NS records doesn't answer a SOA request)
  3. count the number of nameservers found, warn if none found
  4. warn about if no RRSIG is found
  5. for each RRSIG records found in that packet
  6. check the sigexpiration field, parse it as a UTC (ISO?) timestamp
  7. warn/crit if the RRSIG record expires in the past or soon

A single run takes about 12 seconds here; it's pretty slow. It looks like this on success:

> ./dsa-check-zone-rrsig-expiration  torproject.org
ZONE OK: No RRSIGs at zone apex expiring in the next 7.0 days; (6.36s) |time=6.363434s;;;0.000000

In practice, I do not remember ever seeing a failure with this.
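
For a quick manual equivalent, the RRSIG expiration timestamp is visible directly in a DNSSEC-enabled query against the primary:

dig +dnssec -t SOA torproject.org @nevii.torproject.org | grep RRSIG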

zones signed properly

This check was defined in the YAML file as:

  -
    name: DNS - zones signed properly
    hosts: global
    remotecheck: "/usr/lib/nagios/plugins/dsa-check-zone-signature-all"
    runfrom: nevii

The dsa-check-zone-signature-all script essentially performs a dnssec-verify over each zone file transferred with an AXFR:

    if dig $EXTRA -t axfr @"$MASTER" "$zone" | dnssec-verify -o "$zone" /dev/stdin > "$tmp" 2>&1; then

... and it counts the number of failures.
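
To reproduce the check by hand for a single zone (run somewhere allowed to AXFR from the primary, e.g. nevii itself):

dig -t axfr torproject.org @nevii.torproject.org | dnssec-verify -o torproject.org /dev/stdin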

This reminds me of tpo/tpa/domains#1, where we want to check SPF records for validity, which the above likely does not do.

security delegations

This check is configured with:

  -
    name: DNS - security delegations
    hosts: global
    remotecheck: "/usr/lib/nagios/plugins/dsa-check-dnssec-delegation --dir /srv/dns.torproject.org/repositories/domains check-header"
    runfrom: nevii

The dsa-check-dnssec-delegation script was written in 2010 by weasel and can perform multiple checks, but in practice it's configured in check-header mode, which is what we'll restrict ourselves to. That mode is equivalent to check-dlv plus check-ds, which might mean "check everything", then.

The script then:

  1. iterate over all zones
  2. check for ; ds-in-parent=yes and dlv-submit=yes headers in the zone, which can be used to disable checks on some zones
  3. fetch the DNSKEY records for the zone
  4. fetch the DS records for the zone, intersect with the DNSKEY records, warn about an empty intersection or superfluous DS records
  5. also check DLV records at the ISC, but those have been retired
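
A manual equivalent of that DS / DNSKEY intersection (a sketch; the temporary file path is arbitrary):

dig +short -t DS torproject.org                                      # what the parent publishes
dig -t DNSKEY torproject.org @nevii.torproject.org > /tmp/torproject.org.dnskey
dnssec-dsfromkey -2 -f /tmp/torproject.org.dnskey torproject.org     # what our keys say it should be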

key coverage

This check is defined in:

  -
    name: DNS - key coverage
    hosts: global
    remotecheck: "/usr/lib/nagios/plugins/dsa-check-statusfile /srv/dns.torproject.org/var/nagios/coverage"
    runfrom: nevii

So it just outsources to a status file that's piped into that generic wrapper. This check is therefore actually implemented in dns-helpers/bin/dsa-check-dnssec-coverage-all-nagios-wrap. This, of course, is a wrapper for dsa-check-dnssec-coverage-all which iterates through the auto-dns and domains zones and runs dnssec-coverage like this for auto-dns zones:

dnssec-coverage \
        -c named-compilezone \
        -K "$BASE"/var/keys/"$zone" \
        -r 10 \
        -f "$BASE"/var/geodns-zones/db."$zone" \
        -z \
        -l "$CUTOFF" \
        "$zone"

and like this for domains zones:

dnssec-coverage \
    -c named-compilezone \
    -K "$BASE"/var/keys/"$zone" \
    -f "$BASE"/var/generated/"$zone" \
    -l "$CUTOFF" \
    "$zone"

Now that script (dnssec-coverage) was apparently written in 2013 by the ISC. Like manage-dnssec-keys (below), it has its own Key representation of a DNSSEC "key". It checks for:

PHASE 1--Loading keys to check for internal timing problems
PHASE 2--Scanning future key events for coverage failures

Concretely, it:

  • "ensure that the gap between Publish and Activate is big enough" and in the right order (Publish before Activate)
  • "ensure that the gap between Inactive and Delete is big enough" and in the right order, and for missing Inactive
  • some hairy code checks the sequence of events and raises errors like ERROR: No KSK's are active after this event; it seems to look into the future for missing active or published keys, and for keys that are both active and published

DS expiry

  -
    name: DNS - DS expiry
    hosts: global
    remotecheck: "/usr/lib/nagios/plugins/dsa-check-statusfile /srv/dns.torproject.org/var/nagios/ds"
    runfrom: nevii

Same, but with dns-helpers/bin/dsa-check-and-extend-DS. As mentioned above, that script is essentially just a wrapper for:

dns-helpers/manage-dnssec-keys --mode ds-check $zones

... with the output as extra information for the Nagios state file.

It is disabled with the ds-disable-checks = yes (note the whitespace: it matters) either in auto-dns/zones/$ZONE or domains/$ZONE.

The manage-dnssec-keys script, in ds-check mode, does the following (mostly in the KeySet constructor and KeySet.check_ds):

  1. load the keys from the keydir (defined in /etc/dns-helpers.yaml)
  2. loads the timestamps, presumably from the dsset file
  3. check the DS record for the zone
  4. check if the DS keys (keytag, algo, digest) match an on-disk key
  5. checks for expiry, bumping expiry for some entries, against the loaded timestamps

It's unclear if we need to keep implementing this at all if we stop expiring DS entries. But it might be good to check for consistency and, while we're at it, we might as well check for expiry.

Summary

So the legacy monitoring infrastructure was checking the following:

  • SOA sync, for all zones:
    • check the local resolver for NS records, all IP addresses
    • check all NS records respond
    • check that they all serve the same SOA serial number
  • RRSIG check, for all zones:
    • check the root name servers for NS records
    • check the SOA records in DNSSEC mode (which attaches a RRSIG record) on each name server
    • check for lame nameservers
    • check for RRSIG expiration or missing records
  • whatever it is that dnssec-verify is doing, unchecked
  • DS / DNSKEY match check, for all zones:
    • pull all DS records from the local resolver
    • compare with local DNSKEY records, warn about missing or superfluous keys
  • dsset expiration checks:
    • check that event ordering is correct
    • check that the DS records in DNS match the ones on disk (again?)
    • check the dsset records for expiration

Implementation ideas

The python3-dns library is already in use in some of the legacy code.

The prometheus-dnssec-exporter handles the following:

  • RRSIG expiry (days left and "earliest expiry")
  • DNSSEC resolution is functional

Similarly, the dns exporter only checks whether records resolve, and latency.

We are therefore missing quite a bit here, most importantly:

  • SOA sync
  • lame nameservers
  • missing RRSIG records (the dnssec exporter somewhat implicitly checks that by not publishing a metric, but that's an easy thing to misconfigure)
  • DS / DNSKEY records match
  • local DS record expiration

Considering that the dnssec exporter implements so little, it seems we would need to essentially start from scratch and write an entire monitoring stack for this.

Multiple Python DNS libraries exist in Debian already:

  • python3-aiodns (installed locally on my workstation)
  • python3-dns (ditto)
  • python3-dnspython (ditto, already used on nevii)
  • python3-getdns