TPA uses Puppet to manage all servers it operates. It handles most of the configuration management of the base operating system and some services. It is not designed to handle ad-hoc tasks, for which we favor the use of fabric.
[[TOC]]
Tutorial
This page is long! This first section hopes to get you running with a simple task quickly.
Adding an "message of the day" (motd) on a server
To post announcements to shell users of a server, it might be a good
idea to post a "message of the day" (/etc/motd) that will show up on
login. Good examples are known issues, maintenance windows, or service
retirements.
This change should be fairly inoffensive because it should affect only
a single server, and only the motd, so the worst that can happen
here is a silly motd gets displayed (or nothing at all).
Here is how to make the change:
-
To make any change on the Puppet server, you will first need to clone the git repository:
git clone git@puppet.torproject.org:/srv/puppet.torproject.org/git/tor-puppet
This only needs to be done once.
-
The messages are managed by the motd module, but to easily add an "extra" entry, you should add to the Hiera data storage for the specific host you want to modify. Let's say you want to add a motd on perdulce, the current people.torproject.org server. The file you will need to change (or create!) is hiera/nodes/perdulce.torproject.org.yaml:
$EDITOR hiera/nodes/perdulce.torproject.org.yaml
-
Hiera stores data in YAML. So you need to create a little YAML snippet, like this:
motd::extra: |
  Hello world!
-
Then you can commit this and push:
git commit -m"add a nice friendly message to the motd" && git push -
Then you should login to the host and make sure the code applies correctly, in dry-run mode:
ssh -tt perdulce.torproject.org sudo puppet agent -t --noop
-
If that works, you can do it for real:
ssh -tt perdulce.torproject.org sudo puppet agent -t
On next login, you should see your friendly new message. Do not forget to revert the change!
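If the motd commit is still the latest one on the branch, a quick way to revert is a sketch like this (adjust to your actual history):
git revert HEAD
git push
ssh -tt perdulce.torproject.org sudo puppet agent -t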
The next tutorial is about a more elaborate change, performed on multiple servers.
Adding an IP address to the global allow list
In this tutorial, we will add an IP address to the global allow list, on all firewalls on all machines. This is a big deal! It will allow that IP address to access the SSH servers on all boxes and more. This should be a static IP address on a trusted network.
If you have never used Puppet before or are nervous at all about making such a change, it is a good idea to have a more experienced sysadmin nearby to help you. They can also confirm this tutorial is what is actually needed.
-
To make any change on the Puppet server, you will first need to clone the git repository:
git clone git@puppet.torproject.org:/srv/puppet.torproject.org/git/tor-puppet
This only needs to be done once.
-
The firewall rules are defined in the ferm module, which lives in modules/ferm. The file you specifically need to change is modules/ferm/templates/defs.conf.erb, so open that in your editor of choice:
$EDITOR modules/ferm/templates/defs.conf.erb
-
The code you are looking for is ADMIN_IPS. Add a @def for your IP address and add the new macro to the ADMIN_IPS macro. When you exit your editor, git should show you a diff that looks something like this:
--- a/modules/ferm/templates/defs.conf.erb
+++ b/modules/ferm/templates/defs.conf.erb
@@ -77,7 +77,10 @@ def $TPO_NET = (<%= networks.join(' ') %>);
 @def $linus = ();
 @def $linus = ($linus 193.10.5.2/32); # kcmp@adbc
 @def $linus = ($linus 2001:6b0:8::2/128); # kcmp@adbc
-@def $ADMIN_IPS = ($weasel $linus);
+@def $anarcat = ();
+@def $anarcat = ($anarcat 203.0.113.1/32); # home IP
+@def $anarcat = ($anarcat 2001:DB8::DEAD/128 2001:DB8:F00F::/56); # home IPv6
+@def $ADMIN_IPS = ($weasel $linus $anarcat);
 @def $BASE_SSH_ALLOWED = ();
-
Then you can commit this and push:
git commit -m'add my home address to the allow list' && git push
-
Then you should login to one of the hosts and make sure the code applies correctly:
ssh -tt perdulce.torproject.org sudo puppet agent -t
Puppet shows colorful messages. If nothing is red and it returns correctly, you are done. If that doesn't work, go back to step 2. If that still doesn't work, ask for help from a colleague in the Tor sysadmin team.
If this works, congratulations, you have made your first change across
the entire Puppet infrastructure! You might want to look at the rest
of the documentation to learn more about how to do different tasks and
how things are set up. A key "How to" we recommend is the Progressive
deployment section below, which will teach you how to make a change
like the above while making sure you don't break anything even if it
affects a lot of machines.
How-to
Programming workflow
Using environments
During ordinary maintenance operations, it's appropriate to work directly on the
default production branch, which deploys to the production environment.
However, for more complex changes, such as when deploying a new service or adding a module (see below), it's recommended to start by working on a feature branch which will deploy as a distinct environment on the Puppet server.
To quickly test a different environment, you can switch the
environment used by the Puppet agent using the --environment
flag. For example, this will switch a node from production to
test:
puppet agent --test --environment test
Note that this setting is sticky: further runs will keep the
test environment even if the --environment flag is not set, as the
setting is written in the puppet.conf. To reset to the production
environment, you can simply use that flag again:
puppet agent --test --environment production
A node or group of nodes can be switched to a different environment
using the external node classifier (ENC), by adding an environment:
key, like this in nodes/test.torproject.org.yaml:
---
environment: test
parameters:
  role: test
Once the feature branch is satisfactory, it can then be merged to
production and deleted:
git merge test
git branch -d test
git push -d origin test
Branches are not deleted automatically after merge: make sure you cleanup after yourself.
Because environments aren't totally isolated from each other and a compromised
node could choose to apply an environment other than production, care should
be taken with the code pushed to these feature branches. It's recommended to
avoid overly broad debugging statements, if any, and to generally keep an
active eye on feature branches so as to prevent the accumulation of unreviewed
code.
Finally, note that environments are automatically destroyed (alongside their branch) on the Puppet server after 2 weeks since the last commit to the branch. An email warning about this will be sent to the author of that last commit. This doesn't destroy the mirrored branch on GitLab.
When an environment is removed, Puppet agents will revert back to the
production environment automatically.
Modifying an existing configuration
For new deployments, this is NOT the preferred method. For example,
if you are deploying new software that is not already in use in our
infrastructure, do not follow this guide and instead follow the
Adding a new module guide below.
If you are touching an existing configuration, things are much
simpler however: you simply go to the module where the code already
exists and make changes. You git commit and git push the code,
then immediately run puppet agent -t on the affected node, as shown below.
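A minimal sketch of that workflow (the module path and commit message are only illustrative):
$EDITOR modules/motd/manifests/init.pp
git commit -a -m'tweak the motd logic'
git push
ssh -tt perdulce.torproject.org sudo puppet agent -t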
Look at the File layout section above to find the right piece of
code to modify. If you are making changes that potentially affect more
than one host, you should also definitely look at the Progressive
deployment section below.
Adding a new module
This is a broad topic, but let's take the Prometheus monitoring system as an example which followed the role/profile/module pattern.
First, the Prometheus modules on the Puppet forge were evaluated for quality and popularity. There was a clear winner there: the Prometheus module from Vox Pupuli had hundreds of thousands more downloads than the next option, which was deprecated.
Next, the module was added to the Puppetfile (in
./Puppetfile):
mod 'puppet/prometheus', # 12.5.0
:git => 'https://github.com/voxpupuli/puppet-prometheus.git',
:commit => '25dd701b489fc32c892390fd464e765ebd6f513a' # tag: v12.5.0
Note that:
- Since tpo/tpa/team#41974 we don't import 3rd-party code into our repo and instead deploy the modules dynamically on the server.
- Because of that, modules in the Puppetfile should always be pinned to a Git repo and commit, as that's currently the simplest way to avoid some MITM issues.
- We currently don't have an automated way of managing module dependencies, so you'll have to manually and recursively add dependencies to the Puppetfile (see the example after this list). Sorry!
- Make sure to manually audit the code for each module, by reading each file and looking for obvious security flaws or back doors.
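For example, if the module you are adding depends on puppetlabs/stdlib, a matching pinned entry has to be added by hand; something like this, where the commit value is a placeholder to replace with the tag you actually audited:
mod 'puppetlabs/stdlib',
  :git => 'https://github.com/puppetlabs/puppetlabs-stdlib.git',
  :commit => '<commit sha of the audited release>'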
Then the code was committed into git:
git add Puppetfile
git commit -m'install prometheus module and its dependencies after audit'
Then the module was configured in a profile, in modules/profile/manifests/prometheus/server.pp:
class profile::prometheus::server {
  class { 'prometheus::server':
    # follow prom2 defaults
    localstorage      => '/var/lib/prometheus/metrics2',
    storage_retention => '15d',
  }
}
The above contains our local configuration for the upstream
prometheus::server class. In
particular, it sets a retention period and a different path for the
metrics, so that they follow the new Prometheus 2.x defaults.
Then this profile was added to a role, in
modules/roles/manifests/monitoring.pp:
# the monitoring server
class roles::monitoring {
  include profile::prometheus::server
}
Notice how the role does not refer to any implementation detail, like the fact that the monitoring server uses Prometheus. It looks like a trivial, useless class, but it can actually grow to include multiple profiles.
Then that role is added to the Hiera configuration of the monitoring
server, in hiera/nodes/hetzner-nbg1-01.torproject.org.yaml:
classes:
- roles::monitoring
And Puppet was run on the host, with:
puppet agent --enable ; puppet agent -t --noop ; puppet agent --disable "testing prometheus deployment"
If you need to deploy the code to multiple hosts, see the Progressive
deployment section below. To contribute changes back upstream (and
you should do so), see the section right below.
Contributing changes back upstream
Fork the upstream repository and operate on your fork until the changes are eventually merged upstream.
Then, update the Puppetfile to point at the fork (on GitHub or wherever it is hosted), for example:
mod 'puppet-prometheus',
  :git => 'https://github.com/anarcat/puppet-prometheus.git',
  :commit => '(...)'
Note that the deploy branch here is a merge of all the different
branches proposed upstream in different pull requests, but it could
also be the master branch or a single branch if only a single pull
request was sent.
You'll have to keep a clone of the upstream repository somewhere outside of the
tor-puppet work tree, from which you can push and pull normally with
upstream. When you make a change, you need to commit (and push) the change in
your external clone and update the Puppetfile in the repository.
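A rough sketch of that round trip, assuming the fork is cloned outside the tor-puppet tree (paths and commit messages are illustrative):
cd ~/src/puppet-prometheus      # external clone of the fork
git commit -a -m'fix something'
git push origin
cd ~/src/tor-puppet
$EDITOR Puppetfile              # bump the :commit pin to the new commit
git commit -m'bump puppet-prometheus to pick up local fix' Puppetfile
git push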
Running tests
Ideally, Puppet modules have a test suite. This is done with rspec-puppet and rspec-puppet-facts. This is not very well documented upstream, but it's apparently part of the Puppet Development Kit (PDK). Anyway: assuming tests exist, you will want to run some tests before pushing your code upstream, or at least upstream might ask you for this before accepting your changes. Here's how to get set up:
sudo apt install ruby-rspec-puppet ruby-puppetlabs-spec-helper ruby-bundler
bundle install --path vendor/bundle
This installs some basic libraries, system-wide (Ruby bundler and the
rspec stuff). Unfortunately, required Ruby code is rarely all present
in Debian and you still need to install extra gems. In this case we
set it up within the vendor/bundle directory to isolate them from
the global search path.
Finally, to run the tests, you need to wrap your invocation with
bundle exec, like so:
bundle exec rake test
Validating Puppet code
You SHOULD run validation checks on commit locally before pushing your manifests. To install those hooks, you should clone this repository:
git clone https://github.com/anarcat/puppet-git-hooks
... and deploy it as a pre-commit hook:
ln -s $PWD/puppet-git-hooks/pre-commit tor-puppet/.git/hooks/pre-commit
This hook is deployed on the server and will refuse your push if it fails linting, see issue 31226 for a discussion.
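The same kind of checks can also be run by hand before committing, assuming the puppet and puppet-lint programs are installed locally, for example:
puppet parser validate modules/profile/manifests/prometheus/server.pp
puppet-lint modules/profile/manifests/prometheus/server.pp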
Puppet tricks
Password management
If you need to set a password in a manifest, there are special functions to handle this. We do not want to store passwords directly in Puppet source code, for various reasons: it is hard to erase because code is stored in git, but also, ultimately, we want to publish that source code publicly.
We use Trocla for this purpose, which generates random passwords and stores the hash or, if necessary, the clear-text in a YAML file.
Trocla's man page is not very useful, but you can see a list of subcommands in the project's README file.
With Trocla, each password is generated on the fly from a secure
entropy source (Ruby's SecureRandom module) and stored inside a
state file (in /var/lib/trocla/trocla_data.yml, configured in
/etc/puppet/troclarc.yaml) on the Puppet master.
Trocla can return "hashed" versions of the passwords, so that the plain text password is never visible from the client. The plain text can still be stored on the Puppet master, or it can be deleted once it's been transmitted to the user or another password manager. This makes it possible to have Trocla not keep any secret at all.
This piece of code will generate a bcrypt-hashed password for the Grafana admin, for example:
$grafana_admin_password = trocla('grafana_admin_password', 'bcrypt')
The plain-text for that password will never leave the Puppet master. It will still be stored on the Puppet master, and you can see the value with:
trocla get grafana_admin_password plain
... on the command-line.
A password can also be set with this command:
trocla set grafana_guest_password plain
Note that this might erase other formats for this password, although those will get regenerated as needed.
Also note that trocla get will fail if the particular password or
format requested does not exist. For example, say you generate a
plain-text password and then ask for the bcrypt version:
trocla create test plain
trocla get test bcrypt
The second command will return the empty string instead of the hashed
version. Instead, use trocla create to generate the missing format. In
general, it's safe to use trocla create as it will reuse an existing
password. It's actually how the trocla() function behaves in Puppet
as well.
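In the example above, the missing bcrypt version can therefore be produced explicitly (the test key is just a throwaway example):
trocla create test bcrypt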
TODO: Trocla can provide passwords to classes transparently, without having to do function calls inside Puppet manifests. For example, this code:
class profile::grafana {
  $password = trocla('profile::grafana::password', 'plain')
  # ...
}
Could simply be expressed as:
class profile::grafana(String $password) {
  # ...
}
But this requires a few changes:
- Trocla needs to be included in Hiera
- We need roles to be more clearly defined in Hiera, and use Hiera as an ENC so that we can do per-role passwords (for example), which is not currently possible.
Getting information from other nodes
A common pattern in Puppet is to deploy resources on a given host with information from another host. For example, you might want to grant access to host A from host B. And while you can hardcode host B's IP address in host A's manifest, it's not good practice: if host B's IP address changes, you need to change the manifest, and that practice makes it difficult to introduce host C into the pool...
So we need ways of having a node use information from other nodes in our Puppet manifests. There are 5 methods in our Puppet source code at the time of writing:
- Exported resources
- PuppetDB lookups
- Puppet Query Language (PQL)
- LDAP lookups
- Hiera lookups
This section walks through how each method works, outlining the advantage/disadvantage of each.
Exported resources
Our Puppet configuration supports exported resources, a key component of complex Puppet deployments. Exported resources allow one host to define a configuration that will be exported to the Puppet server and then realized on another host.
These exported resources are not confined by environments: for example,
resources exported by a node assigned to the foo environment will be
available on all resources of the production environment, and vice-versa.
We commonly use this to punch holes in the firewall between nodes. For
example, this manifest in the roles::puppetmaster class:
@@ferm::rule::simple { "roles::puppetmaster-${::fqdn}":
  tag         => 'roles::puppetmaster',
  description => 'Allow Puppetmaster access to LDAP',
  port        => ['ldap', 'ldaps'],
  saddr       => $base::public_addresses,
}
... exports a firewall rule that will, later, allow the Puppet server
to access the LDAP server (hence the port => ['ldap', 'ldaps']
line). This rule doesn't take effect on the host applying the
roles::puppetmaster class, but only on the LDAP server, through this
rather exotic syntax:
Ferm::Rule::Simple <<| tag == 'roles::puppetmaster' |>>
This tells the LDAP server to apply whatever rule was exported with
the @@ syntax and the specified tag. Any Puppet resource can be
exported and realized that way.
Note that there are security implications with collecting exported resources: it delegates the resource specification of a node to another. So, in the above scenario, the Puppet master could decide to open other ports on the LDAP server (say, the SSH port), because it exports the port number and the LDAP server just blindly applies the directive. A more secure specification would explicitly specify the sensitive information, like so:
Ferm::Rule::Simple <<| tag == 'roles::puppetmaster' |>> {
  port => ['ldap'],
}
But then a compromised server could send a different saddr and
there's nothing the LDAP server could do here: it cannot override the
address because it's exactly the information we need from the other
server...
PuppetDB lookups
A common pattern in Puppet is to extract information from host A and use it on host B. The above "exported resources" pattern can do this for files, commands and many more resources, but sometimes we just want a tiny bit of information to embed in a configuration file. This could, in theory, be done with an exported concat resource, but this can become prohibitively complicated for something as simple as an allowed IP address in a configuration file.
For this we use the puppetdbquery module, which allows us to do
elegant queries against PuppetDB. For example, this will extract the
IP addresses of all nodes with the roles::gitlab class applied:
$allow_ipv4 = query_nodes('Class[roles::gitlab]', 'networking.ip')
$allow_ipv6 = query_nodes('Class[roles::gitlab]', 'networking.ip6')
This code, in profile::kgb_bot, propagates those variables into a
template through the allow_addresses variable, which gets expanded
like this:
<% if $allow_addresses { -%>
<% $allow_addresses.each |String $address| { -%>
allow <%= $address %>;
<% } -%>
deny all;
<% } -%>
Note that there is a potential security issue with that approach. The same way that exported resources trust the exporter, we trust that the node exported the right fact. So it's in theory possible that a compromised Puppet node exports an evil IP address in the above example, granting access to an attacker instead of the proper node. If that is a concern, consider using LDAP or Hiera lookups instead.
Also note that this will eventually fail when the node goes down: after a while, resources are expired from the PuppetDB server and the above query will return an empty list. This seems reasonable: we do want to eventually revoke access to nodes that go away, but it's still something to keep in mind.
Keep in mind that the networking.ip fact, in the above example,
might be incorrect in the case of a host that's behind NAT. In that
case, you should use LDAP or Hiera lookups.
Note that this could also be implemented with a concat exported
resource, but it would be much harder because you would need some special case
when no resource is exported (to avoid adding the deny) and take
into account that other configurations might also be needed in the
file. It would have the same security and expiry issues anyways.
Puppet query language
Note that there's also a way to do those queries without a Forge
module, through the Puppet query language and the
puppetdb_query function. The problem with that approach is that the
function is not very well documented and the query syntax is somewhat
obtuse. For example, this is what I came up with to do the equivalent
of the query_nodes call, above:
$allow_ipv4 = puppetdb_query(
['from', 'facts',
['and',
['=', 'name', 'networking.ip'],
['in', 'certname',
['extract', 'certname',
['select_resources',
['and',
['=', 'type', 'Class'],
['=', 'title', 'roles::gitlab']]]]]]])
It seems like I did something wrong, because that returned an empty
array. I could not figure out how to debug this, and apparently I
needed more functions (like map and filter) to get what I wanted
(see this gist). I gave up at that point: the puppetdbquery
abstraction is much cleaner and more usable.
If you are merely looking for a hostname, however, PQL might be a
little more manageable. For example, this is how the
roles::onionoo_frontend class finds its backends to setup the
IPsec network:
$query = 'nodes[certname] { resources { type = "Class" and title = "Roles::Onionoo_backend" } }'
$peer_names = sort(puppetdb_query($query).map |$value| { $value["certname"] })
$peer_names.each |$peer_name| {
$network_tag = [$::fqdn, $peer_name].sort().join('::')
ipsec::network { "ipsec::${network_tag}":
peer_networks => $base::public_addresses
}
}
Note that Voxpupuli has a helpful list of Puppet Query Language examples as well. Those are based on the puppet query command line tool, but it gives good examples of possible queries that can be used in manifests as well.
LDAP lookups
Our Puppet server is hooked up to the LDAP server and has information
about the hosts defined there. Information about the node running the
manifest is available in the global $nodeinfo variable, but there is
also an $allnodeinfo parameter with information about every host
known in LDAP.
A simple example of how to use the $nodeinfo variable is how the
base::public_address and base::public_address6 parameters -- which
represent the IPv4 and IPv6 public address of a node -- are
initialized in the base class:
class base(
Stdlib::IP::Address $public_address = filter_ipv4(getfromhash($nodeinfo, 'ldap', 'ipHostNumber'))[0],
Optional[Stdlib::IP::Address] $public_address6 = filter_ipv6(getfromhash($nodeinfo, 'ldap', 'ipHostNumber'))[0],
) {
$public_addresses = [ $public_address, $public_address6 ].filter |$addr| { $addr != undef }
}
This loads the ipHostNumber field from the $nodeinfo variable, and
uses the filter_ipv4 or filter_ipv6 functions to extract the IPv4
or IPv6 addresses respectively.
A good example of the $allnodeinfo parameter is how the
roles::onionoo_frontend class finds the IP addresses of its
backend. After having loaded the host list from PuppetDB, it then uses
the parameter to extract the IP address:
$backends = $peer_names.map |$name| {
[
$name,
$allnodeinfo[$name]['ipHostNumber'].filter |$a| { $a =~ Stdlib::IP::Address::V4 }[0]
] }.convert_to(Hash)
Such a lookup is considered more secure than going through PuppetDB as LDAP is a trusted data source. It is also our source of truth for this data, at the time of writing.
Hiera lookups
For more security-sensitive data, we should use a trusted data source
to extract information about hosts. We do this through Hiera lookups,
with the lookup function. A good example is how we populate the
SSH public keys on all hosts, for the admin user. In the
profile::ssh class, we do the following:
$keys = lookup('profile::admins::keys', Data, 'hash')
This will look up the profile::admins::keys field in Hiera, which is a
trusted source because it is under the control of the Puppet git repo. This
refers to the following data structure in hiera/common.yaml:
profile::admins::keys:
  anarcat:
    type: "ssh-rsa"
    pubkey: "AAAAB3[...]"
The key point with Hiera is that it's a "hierarchical" data structure, so each host can have its own override. So in theory, the above keys could be overridden per host. Similarly, the IP address information for each host could be stored in Hiera instead of LDAP. But in practice, we do not currently do this and the per-host information is limited.
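As a sketch of what such an override would look like (the node file name is hypothetical, and we don't actually do this today), a host-specific file like hiera/nodes/example-01.torproject.org.yaml could simply redefine the same key, and it would take precedence over common.yaml on that host:
profile::admins::keys:
  anarcat:
    type: "ssh-rsa"
    pubkey: "AAAAB3[...]"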
Looking for facts values across the fleet
This will show you how many hosts per hoster, a fact present on every host:
curl -s -X GET http://localhost:8080/pdb/query/v4/facts \
--data-urlencode 'query=["=", "name", "hoster"]' \
| jq -r .[].value | sort | uniq -c | sort -n
Example:
root@puppetdb-01:~# curl -s -X GET http://localhost:8080/pdb/query/v4/facts --data-urlencode 'query=["=", "name", "hoster"]' | jq -r .[].value | sort | uniq -c | sort -n
1 hetzner-dc14
1 teksavvy
3 hetzner-hel1
3 hetzner-nbg1
3 safespring
38 hetzner-dc13
47 quintex
Such grouping can be done directly in the query language though, for example, this shows the number of hosts per Debian release:
curl -s -G http://localhost:8080/pdb/query/v4/fact-contents \
--data-urlencode 'query=["extract", [["function","count"],"value"], ["=","path",["os","distro","codename"]], ["group_by", "value"]]' | jq
Example:
root@puppetdb-01:~# curl -s -G http://localhost:8080/pdb/query/v4/fact-contents --data-urlencode 'query=["extract", [["function","count"],"value"], ["=","path",["os","distro","codename"]], ["group_by", "value"]]' | jq
[
{
"count": 51,
"value": "bookworm"
},
{
"count": 45,
"value": "trixie"
}
]
Revoking and generating a new certificate for a host
Revocation procedures problems were discussed in 33587 and 33446.
-
Clean the certificate on the master:
puppet cert clean host.torproject.org
-
Clean the certificate on the client:
find /var/lib/puppet/ssl -name host.torproject.org.pem -delete
-
Then run the bootstrap script on the client from fabric-tasks/installer/puppet-bootstrap-client and get a new checksum
-
Run tpa-puppet-sign-client on the master and pass the checksum
-
Run puppet agent -t to have puppet running on the client again.
Generating a batch of resources from Hiera
Say you have a class (let's call it sbuild::qemu) and you want it to
generate some resources from a class parameter (and, by extension,
Hiera). Let's call those parameters sbuild::qemu::image. How do we
do this?
The simplest way is to just use the .each construct and iterate over
each parameter from the class:
# configure a qemu sbuilder
class sbuild::qemu (
  Hash[String, Hash] $images = { 'unstable' => {}, },
) {
  include sbuild
  package { 'sbuild-qemu':
    ensure => 'installed',
  }
  $images.each |$image, $values| {
    sbuild::qemu::image { $image: * => $values }
  }
}
That will create, by default, an unstable image with the default
parameters defined in sbuild::qemu::image. Some parameters could be
set by default there as well, for example:
$images.each |$image, $values| {
  $_values = $values + {
    override => "foo",
  }
  sbuild::qemu::image { $image: * => $_values }
}
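Assuming Hiera's automatic class parameter lookup, the matching data for the $images parameter would live under the sbuild::qemu::images key; a hypothetical example (the nested values depend on what sbuild::qemu::image actually accepts):
sbuild::qemu::images:
  unstable: {}
  bookworm:
    override: "foo"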
Going beyond that allows for pretty complicated rules including validation and so on, for example if the data comes from an untrusted YAML file. See this immerda snippet for an example.
Quickly restore a file from the filebucket
When Puppet changes or deletes a file, a backup is automatically done locally.
Info: Computing checksum on file /etc/subuid
Info: /Stage[main]/Profile::User_namespaces/File[/etc/subuid]: Filebucketed /etc/subuid to puppet with sum 3e8e6d9a252f21f9f5008ebff266c6ed
Notice: /Stage[main]/Profile::User_namespaces/File[/etc/subuid]/ensure: removed
To revert this file at its original location, note the hash sum and run this on the system:
puppet filebucket --local restore /etc/subuid 3e8e6d9a252f21f9f5008ebff266c6ed
A different path may be specified to restore it to another location.
Deployments
Listing all hosts under puppet
This will list all active hosts known to the Puppet master:
ssh -t puppetdb-01.torproject.org 'sudo -u postgres psql puppetdb -P pager=off -A -t -c "SELECT c.certname FROM certnames c WHERE c.deactivated IS NULL"'
The following will list all hosts under Puppet and their virtual
value:
ssh -t puppetdb-01.torproject.org "sudo -u postgres psql puppetdb -P pager=off -F',' -A -t -c \"SELECT c.certname, value_string FROM factsets fs INNER JOIN facts f ON f.factset_id = fs.id INNER JOIN fact_values fv ON fv.id = f.fact_value_id INNER JOIN fact_paths fp ON fp.id = f.fact_path_id INNER JOIN certnames c ON c.certname = fs.certname WHERE fp.name = 'virtual' AND c.deactivated IS NULL\"" | tee hosts.csv
The resulting file is a Comma-Separated Value (CSV) file which can be used for other purposes later.
Possible values of the virtual field can be obtain with a similar
query:
ssh -t puppetdb-01.torproject.org "sudo -u postgres psql puppetdb -P pager=off -A -t -c \"SELECT DISTINCT value_string FROM factsets fs INNER JOIN facts f ON f.factset_id = fs.id INNER JOIN fact_values fv ON fv.id = f.fact_value_id INNER JOIN fact_paths fp ON fp.id = f.fact_path_id WHERE fp.name = 'virtual';\""
The currently known values are: kvm, physical, and xenu.
Other ways of extracting a host list
- Using the PuppetDB API:
curl -s -G http://localhost:8080/pdb/query/v4/facts | jq -r ".[].certname"
The fact API is quite extensive and allows for very complex
queries. For example, this shows all hosts with the apache2 fact
set to true:
curl -s -G http://localhost:8080/pdb/query/v4/facts --data-urlencode 'query=["and", ["=", "name", "apache2"], ["=", "value", true]]' | jq -r ".[].certname"
This will list all hosts sorted by their report date, older first, followed by the timestamp, space-separated:
curl -s -G http://localhost:8080/pdb/query/v4/nodes | jq -r 'sort_by(.report_timestamp) | .[] | "\(.certname) \(.report_timestamp)"' | column -s\ -t
This will list all hosts with the roles::static_mirror class:
curl -s -G http://localhost:8080/pdb/query/v4 --data-urlencode 'query=inventory[certname] { resources { type = "Class" and title = "Roles::Static_mirror" }} ' | jq -r ".[].certname"
This will show all hosts running Debian bookworm:
curl -s -G http://localhost:8080/pdb/query/v4 --data-urlencode 'query=inventory[certname] { facts.os.distro.codename = "bookworm" }' | jq -r ".[].certname"
See also the Looking for facts values across the fleet documentation.
-
Using howto/cumin
-
Using LDAP:
ldapsearch -H ldap://db.torproject.org -x -ZZ -b "ou=hosts,dc=torproject,dc=org" '*' hostname | sed -n '/hostname/{s/hostname: //;p}' | sort
Same, but only hosts not in a Ganeti cluster:
ldapsearch -H ldap://db.torproject.org -x -ZZ -b "ou=hosts,dc=torproject,dc=org" '(!(physicalHost=gnt-*))' hostname | sed -n '/hostname/{s/hostname: //;p}' | sort
Running Puppet everywhere
There are many ways to run a command on all hosts (see next section), but the TL;DR is to basically use cumin and run this command:
cumin -o txt -b 5 '*' 'puppet agent -t'
But before doing this, consider doing a progressive deployment instead.
Batch jobs on all hosts
Using the hosts.csv file generated above, a job can be run on all hosts with
parallel-ssh, for example, to check the uptime:
cut -d, -f1 hosts.csv | parallel-ssh -i -h /dev/stdin uptime
This would do the same, but only on physical servers:
grep 'physical$' hosts.csv | cut -d, -f1 | parallel-ssh -i -h /dev/stdin uptime
This would fetch the /etc/motd on all machines:
cut -d, -f1 hosts.csv | parallel-slurp -h /dev/stdin -L motd /etc/motd motd
To run batch commands through sudo that require a password, you will need to
fool both sudo and ssh a little more:
cut -d, -f1 hosts.csv | parallel-ssh -P -I -i -x -tt -h /dev/stdin -o pvs sudo pvs
You should then type your password then Control-d. Warning: this will show your password on your terminal and probably in the logs as well.
Batch jobs can also be run on all Puppet hosts with Cumin:
ssh -N -L8080:localhost:8080 puppetdb-01.torproject.org &
cumin '*' uptime
See howto/cumin for more examples.
Another option for batch jobs is tmux-xpanes.
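For example, assuming tmux-xpanes is installed and reusing the hosts.csv file from above, something like this should open one pane per host:
tmux-xpanes -c 'ssh {} uptime' $(cut -d, -f1 hosts.csv)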
Progressive deployment
If you are making a major change to the infrastructure, you may want
to deploy it progressively. A good way to do so is to include the new
class manually in an existing role, say in
modules/roles/manifests/foo.pp:
class roles::foo {
  include my_new_class
}
Then you can check the effect of the class on the host with the
--noop mode. Make sure you disable Puppet so that automatic runs do
not actually execute the code, with:
puppet agent --disable "testing my_new_class deployment"
Then the new manifest can be simulated with this command:
puppet agent --enable ; puppet agent -t --noop ; puppet agent --disable "testing my_new_class deployment"
Examine the output and, once you are satisfied, you can re-enable the agent and actually run the manifest with:
puppet agent --enable ; puppet agent -t
If the change is inside an existing class, that change can be
enclosed in a class parameter and that parameter can be passed as an
argument from Hiera. This is how the transition to a managed
/etc/apt/sources.list file was done:
-
first, a parameter was added to the class that would remove the file, defaulting to false:
class torproject_org(
  Boolean $manage_sources_list = false,
) {
  if $manage_sources_list {
    # the above repositories overlap with most default sources.list
    file { '/etc/apt/sources.list':
      ensure => absent,
    }
  }
}
-
then that parameter was enabled on one host, say in hiera/nodes/brulloi.torproject.org.yaml:
torproject_org::manage_sources_list: true
-
Puppet was run on that host using the simulation mode:
puppet agent --enable ; puppet agent -t --noop ; puppet agent --disable "testing my_new_class deployment"
-
when satisfied, the real operation was done:
puppet agent --enable ; puppet agent -t
-
then this was added to two other hosts, and Puppet was run there
-
finally, all hosts were checked to see if the file was present and had any content, with howto/cumin (see above for alternative ways of running a command on all hosts):
cumin '*' 'du /etc/apt/sources.list'
-
since it was missing everywhere, the parameter was set to
true by default and the custom configuration removed from the three test nodes
-
then Puppet was run by hand everywhere, using Cumin, with a batch of 5 hosts at a time:
cumin -o txt -b 5 '*' 'puppet agent -t'
Because Puppet returns a non-zero value when changes are made, this will abort as soon as any one host in a batch of 5 actually makes a change. You can then examine the output and see if the change is legitimate, or abort the configuration change.
Once the Puppet agent is disabled on all nodes, it's possible to enable
it and run the agent only on nodes that still have the agent disabled.
This way it's possible to "resume" a deployment when a problem or
change causes the cumin run to abort.
cumin -b 5 '*' 'if test -f /var/lib/puppet/state/agent_disabled.lock; then puppet agent --enable ; puppet agent -t ; fi'
Because the output cumin produces groups together nodes that return
identical output, and because puppet agent -t outputs unique
strings like catalog serial number and runtime in fractions of a
second, we have made a wrapper called patc that will silence those
and will allow cumin to group those commands together:
cumin -b 5 '*' 'patc'
Adding/removing a global admin
To add a new sysadmin, you need to add their SSH key to the root
account everywhere. This can be done in the profile::admins::keys
field in hiera/common.yaml.
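Concretely, that means appending a new entry to the existing hash in hiera/common.yaml; a sketch with a placeholder username and key:
profile::admins::keys:
  newadmin:
    type: "ssh-ed25519"
    pubkey: "AAAAC3[...]"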
You also need to add them to the adm group in LDAP, see adding
users to a group in LDAP.
Troubleshooting
Consult the logs of past local Puppet agent runs
The command journalctl can be used to consult puppet agent logs on
the local machine:
journalctl -t puppet-agent
To limit the logs to the last day only:
journalctl -t puppet-agent --since=-1d
Running Puppet by hand and logging
When a Puppet manifest is not behaving as it should, the first step is to run it by hand on the host:
puppet agent -t
If that doesn't yield enough information, you can see pretty much
everything that Puppet does with the --debug flag. This will, for
example, include Exec resources onlyif commands and allow you to
see why they do not work correctly (a common problem):
puppet agent -t --debug
Finally, some errors show up only on the Puppet server: you can look in
/var/log/daemon.log there for errors that will only show up there.
Finding source of exported resources
Debugging exported resources can be hard since errors are reported by the puppet agent that's collecting the resources but it's not telling us what host exported the resource that's in conflict.
To get further information, we can poke around the underlying database or we can ask PuppetDB.
with SQL queries
Connecting to the PuppetDB database itself can sometimes be easier than trying to operate the API. There you can inspect the entire thing as a normal SQL database, use this to connect:
sudo -u postgres psql puppetdb
It's possible exported resources do surprising things sometimes. It is
useful to look at the actual PuppetDB to figure out which tags
exported resources have. For example, this query lists all exported
resources with troodi in the name:
SELECT certname_id,type,title,file,line,tags FROM catalog_resources WHERE exported = 't' AND title LIKE '%troodi%';
Keep in mind that there are automatic tags in exported resources which can complicate things.
with PuppetDB
This query will look for exported resources with the type
Bacula::Director::Client (which can be a class, define, or builtin resource)
and match a title (the unique "name" of the resource as defined in the
manifests), like in the above SQL example, that contains troodi:
curl -s -X POST http://localhost:8080/pdb/query/v4 \
-H 'Content-Type:application/json' \
-d '{"query": "resources { exported = true and type = \"Bacula::Director::Client\" and title ~ \".*troodi.*\" }"}' \
| jq . | less -SR
Finding all instances of a deployed resource
Say you want to deprecate cron. You want to see where the Cron
resource is used to understand how hard of a problem this is.
This will show you the resource titles and how many instances of each there are:
SELECT count(*),title FROM catalog_resources WHERE type = 'Cron' GROUP BY title ORDER by count(*) DESC;
Example output:
puppetdb=# SELECT count(*),title FROM catalog_resources WHERE type = 'Cron' GROUP BY title ORDER by count(*) DESC;
count | title
-------+---------------------------------
87 | puppet-cleanup-clientbucket
81 | prometheus-lvm-prom-collector-
9 | prometheus-postfix-queues
6 | docker-clear-old-images
5 | docker-clear-nightly-images
5 | docker-clear-cache
5 | docker-clear-dangling-images
2 | collector-service
2 | onionoo-bin
2 | onionoo-network
2 | onionoo-service
2 | onionoo-web
2 | podman-clear-cache
2 | podman-clear-dangling-images
2 | podman-clear-nightly-images
2 | podman-clear-old-images
1 | update rt-spam-blocklist hourly
1 | update torexits for apache
1 | metrics-web-service
1 | metrics-web-data
1 | metrics-web-start
1 | metrics-web-start-rserve
1 | metrics-network-data
1 | rt-externalize-attachments
1 | tordnsel-data
1 | tpo-gitlab-backup
1 | tpo-gitlab-registry-gc
1 | update KAM ruleset
(28 rows)
A more exhaustive list of each resource and where it's declared:
SELECT certname_id,type,title,file,line,tags FROM catalog_resources WHERE type = 'Cron';
Which host uses which resource:
SELECT certname,title FROM catalog_resources JOIN certnames ON certname_id=certnames.id WHERE type = 'Cron' ORDER BY certname;
Top 10 hosts using the resource:
puppetdb=# SELECT certname,count(title) FROM catalog_resources JOIN certnames ON certname_id=certnames.id WHERE type = 'Cron' GROUP BY certname ORDER BY count(title) DESC LIMIT 10;
certname | count
-----------------------------------+-------
meronense.torproject.org | 7
forum-01.torproject.org | 7
ci-runner-x86-02.torproject.org | 7
onionoo-backend-01.torproject.org | 6
onionoo-backend-02.torproject.org | 6
dangerzone-01.torproject.org | 6
btcpayserver-02.torproject.org | 6
chi-node-14.torproject.org | 6
rude.torproject.org | 6
minio-01.torproject.org | 6
(10 rows)
Examining a Puppet catalog
It can sometimes be useful to examine a node's catalog in order to determine if certain resources are present, or to view a resource's full set of parameters.
List resources by type
To list all service resources managed by Puppet on a node, the
command below may be executed on the node itself:
puppet catalog select --terminus rest "$(hostname -f)" service
At the end of the command line, service may be replaced by any
built-in resource types such as file or cron. Defined resource
names may also be used here, like ssl::service.
View/filter full catalog
To extract a node's full catalog in JSON format (saved here to catalog.json for the jq examples below):
puppet catalog find --terminus rest "$(hostname -f)" > catalog.json
The output can be manipulated using jq to extract more precise
information. For example, to list all resources of a specific type:
jq '.resources[] | select(.type == "File") | .title' < catalog.json
To list all classes in the catalog:
jq '.resources[] | select(.type=="Class") | .title' < catalog.json
To display a specific resource selected by title:
jq '.resources[] | select((.type == "File") and (.title=="sources.list.d"))' < catalog.json
More examples can be found on this blog post.
Examining agent reports
If you want to look into what agent run errors happened previously, for example if there were errors during the night but that didn't reoccur on subsequent agent runs, you can use PuppetDB's capabilities of storing and querying agent reports, and then use jq to find out the information you're looking for in the report(s).
In this example, we'll first query for reports and save the output to a file. We'll then filter the file's contents with jq. This approach can let you search for more details in the report more efficiently, but don't forget to remove the file once you're done.
Here we're grabbing the reports for the host pauli.torproject.org where there
were changes done, after a set date -- we're expecting to get only one report as
a result, but that might differ when you run the query:
curl -s -X POST http://localhost:8080/pdb/query/v4 \
-H 'Content-Type:application/json' \
-d '{"query": "reports { certname = \"pauli.torproject.org\" and start_time > \"2024-10-28T00:00:00.000Z\" and status = \"changed\" }" }' \
> pauli_catalog_what_changed.json
Note that the date format needs to look exactly like the one above, otherwise you
might get a very non-descriptive error like:
parse error: Invalid numeric literal at line 1, column 12
With the report in the file on disk, we can query for certain details.
To see what puppet did during the run:
jq .[].logs.data pauli_catalog_what_changed.json
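The log entries can be filtered further with jq; for example, something like this should drop the info-level noise (assuming the usual level/source/message fields in the report logs):
jq '.[].logs.data[] | select(.level != "info") | {level, source, message}' pauli_catalog_what_changed.json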
For more information about what information is available in reports, check out the reports endpoint documentation.
Pager playbook
Stale Puppet catalog
A Prometheus PuppetCatalogStale error looks like this:
Stale Puppet catalog on test.torproject.org
One of the following is happening, in decreasing likelihood:
- the node's Puppet manifest has an error of some sort that makes it impossible to run the catalog
- the node is down and has failed to report since the last time specified
- the node was retired but the monitoring or puppet server doesn't know
- the Puppet server is down and all nodes will fail to report in the same way (in which case a lot more warnings will show up, and other warnings about the server will come in)
The first situation will usually happen after someone pushed a commit introducing the error. We try to keep all manifests compiling all the time and such errors should be immediately fixed. Look at the history of the Puppet source tree and try to identify the faulty commit. Reverting such a commit is acceptable to restore the service.
The second situation can happen if a node is in maintenance for an extended duration. Normally, the node will recover when it goes back online. If a node is to be permanently retired, it should be removed from Puppet, using the host retirement procedures.
The third situation should not normally occur: when a host is retired following the retirement procedure, it's also retired from Puppet. That should normally clean up everything, but reports generated by the Puppet reporter do actually stick around for 7 extra days. There's now a silence in the retirement procedure to hide those alerts, but they will still be generated on host retirements.
Finally, if the main Puppet server is down, it should definitely be brought back up. See disaster recovery, below.
In any case, running the Puppet agent on the affected node should give more information:
ssh NODE puppet agent -t
The Puppet metrics are generated by the Puppet reporter, which is
a plugin deployed on the Puppet server (currently pauli) which
accepts reports from nodes and writes metrics in the node exporter's
"textfile collector" directory
(/var/lib/prometheus/node-exporter/). You can, for example, see the
metrics for the host idle-fsn-01 like this:
root@pauli:~# cat /var/lib/prometheus/node-exporter/idle-fsn-01.torproject.org.prom
# HELP puppet_report Unix timestamp of the last puppet run
# TYPE puppet_report gauge
# HELP puppet_transaction_completed transaction completed status of the last puppet run
# TYPE puppet_transaction_completed gauge
# HELP puppet_cache_catalog_status whether a cached catalog was used in the run, and if so, the reason that it was used
# TYPE puppet_cache_catalog_status gauge
# HELP puppet_status the status of the client run
# TYPE puppet_status gauge
# Old metrics
# New metrics
puppet_report{environment="production",host="idle-fsn-01.torproject.org"} 1731076367.657
puppet_transaction_completed{environment="production",host="idle-fsn-01.torproject.org"} 1
puppet_cache_catalog_status{state="not_used",environment="production",host="idle-fsn-01.torproject.org"} 1
puppet_cache_catalog_status{state="explicitly_requested",environment="production",host="idle-fsn-01.torproject.org"} 0
puppet_cache_catalog_status{state="on_failure",environment="production",host="idle-fsn-01.torproject.org"} 0
puppet_status{state="failed",environment="production",host="idle-fsn-01.torproject.org"} 0
puppet_status{state="changed",environment="production",host="idle-fsn-01.torproject.org"} 0
puppet_status{state="unchanged",environment="production",host="idle-fsn-01.torproject.org"} 1
If something is off between reality and what the monitoring system thinks, this file should be inspected for validity, and its timestamp checked. Normally, those files should be updated every time the node runs a catalog, for example.
Expired nodes should disappear from that directory after 7 days,
defined in /etc/puppet/prometheus.yaml. The reporter is hooked in
the Puppet server through the /etc/puppet/puppet.conf file, with the
following line:
[master]
# ...
reports = puppetdb,prometheus
See also issue #41639 for notes on the deployment of that monitoring tool.
Agent running on non-production environment for too long
When we're working on changes that we want to test on a limited number of hosts,
we can change the environment that the puppet agent is using. We usually do this
for short periods of time and it is highly desirable to move the host back to
the production environment once our tests are done.
This alert occurs when a host has been running on a different
environment than production for too long. This has the undesirable
effect that that host might miss out on important changes like access
revocation, policy changes and the like.
If a host has been left away from production for too long, first check out which environment it is running on:
# grep environment /etc/puppet/puppet.conf
environment = alertmanager_template_tests
Check with TPA members to see if someone is currently actively working on that branch and if the host should still be left on that environment. If so, create a silence for the alert, but for a maximum of 2 weeks at a time.
If the host is not supposed to stay away from production, then check out if bringing it back will cause any undesirable changes:
patn --environment production
If all seems well, run the same command as above but with pat instead of
patn.
Once this is done, also consider whether or not the branch for the environment needs to be removed. If it was already merged into production it's usually safe to remove it.
Note that when a branch gets removed from the control repository, the
corresponding environment is automatically removed. There is also a
script that runs daily on the Puppet server
(tpa-purge-old-branches in a tpa-purge-old-branches.timer and
.service) that deletes branches (and environments) that haven't had
a commit in over two weeks.
This will cause puppet agents running that now-absent environment to automatically revert back to production on subsequent runs, unless they are hardcoded in the ENC.
So this alert should only happen if a branch is in development for more than two weeks or if it is forgotten in the ENC.
Problems pushing to the Puppet server
If you get this error when pushing commits to the Puppet server:
error: remote unpack failed: unable to create temporary object directory
... or, longer version:
anarcat@curie:tor-puppet$ LANG=C git push
Enumerating objects: 7, done.
Counting objects: 100% (7/7), done.
Delta compression using up to 4 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (4/4), 772 bytes | 772.00 KiB/s, done.
Total 4 (delta 2), reused 0 (delta 0), pack-reused 0
error: remote unpack failed: unable to create temporary object directory
To puppet.torproject.org:/srv/puppet.torproject.org/git/tor-puppet
! [remote rejected] master -> master (unpacker error)
error: failed to push some refs to 'puppet.torproject.org:/srv/puppet.torproject.org/git/tor-puppet'
anarcat@curie:tor-puppet[1]$
It's because you're not using the git role account. Update your
remote URL configuration to use git@puppet.torproject.org instead,
with:
git remote set-url origin git@puppet.torproject.org:/srv/puppet.torproject.org/git/tor-puppet.git
This is because we have switched to a role user for pushing changes to the Git repository, see issue 29663 for details.
Error: The CRL issued by 'CN=Puppet CA: pauli.torproject.org' has expired
This error causes the Puppet agent to abort its runs.
Check the expiry date for the Puppet CRL file at /var/lib/puppet/crl.pem:
cumin '*' 'openssl crl -in /var/lib/puppet/ssl/crl.pem -text | grep "Next Update"'
If the date is in the past, the node won't be able to get a catalog from the Puppet server.
An up-to-date CRL may be retrieved from the Puppet server and installed as such:
curl --silent --cert /var/lib/puppet/ssl/certs/$(hostname -f).pem \
--key /var/lib/puppet/ssl/private_keys/$(hostname -f).pem \
--cacert /var/lib/puppet/ssl/certs/ca.pem \
--output /var/lib/puppet/ssl/crl.pem \
"https://puppet:8140/puppet-ca/v1/certificate_revocation_list/ca?environment=production"
TODO: shouldn't the Puppet agent be updating the CRL on its own?
Puppet server CA renewal
If clients fail to run with:
certificate verify failed [certificate has expired for CN=Puppet CA: ...]
It's the CA certificate for the Puppet server that expired. It needs to be renewed. Ideally, this is done before the expiry date to avoid outages, of course.
On the Puppet server:
-
move the old certificate out of the way:
mv /var/lib/puppet/ssl/ca/ca_crt.pem{,.old}
-
renew the certificate. This can be done in a plethora of ways. anarcat used those raw OpenSSL instructions to renew only the CSR and CRT files:
cd /var/lib/puppet/ssl/ca
openssl x509 -x509toreq -in ca_crt.pem -signkey ca_key.pem -out ca_csr.pem
cat > extension.cnf << EOF
[CA_extensions]
basicConstraints = critical,CA:TRUE
nsComment = "Puppet Ruby/OpenSSL Internal Certificate"
keyUsage = critical,keyCertSign,cRLSign
subjectKeyIdentifier = hash
EOF
openssl x509 -req -days 3650 -in ca_csr.pem -signkey ca_key.pem -out ca_crt.pem -extfile extension.cnf -extensions CA_extensions
openssl x509 -in ca_crt.pem -noout -text | grep -A 3 Validity
chown -R puppet:puppet .
cp -a ca_crt.pem ../certs/ca.pem
But, presumably, this could also work:
puppetserver ca setup
You might also have to move all of /var/lib/puppet/ssl and /etc/puppet/puppetserver/ca/ out of the way for this to work, in which case you need to reissue all node certs as well.
-
restart the two servers:
systemctl restart puppetserver puppetdb
At this point, you should have a fresh new cert running on the Puppet server and the PuppetDB server. Now you need to deploy that new certs on all client Puppet nodes:
-
deploy the new certificate /var/lib/puppet/ssl/ca/ca_crt.pem into /var/lib/puppet/ssl/certs/ca.pem:
scp ca_crt.pem node.example.com:/var/lib/puppet/ssl/certs/ca.pem
-
re-run Puppet:
puppet agent --test
or simply:
pat
You might get a warning about a stale CRL:
Error: certificate verify failed [CRL has expired for CN=marcos.anarc.at]
In which case you can just move the old CRL out of the way:
mv /var/lib/puppet/ssl/crl.pem /var/lib/puppet/ssl/crl.pem.orig
You might also end up in situations where the client just can't get back on. In that case, you need to make an entirely new cert for that client. On the server:
puppetserver ca revoke --certname node.example.com
On the client:
mv /var/lib/puppet/ssl{,.orig}
puppet agent --test --waitforcert=2
Then on the server:
puppetserver ca sign --certname node.example.com
You might also get the following warning on some nodes:
Warning: Failed to automatically renew certificate: 403 Forbidden
The manifest applies fine though. It's unclear how to fix this. According to the upstream documentation, this means "Invalid certificate presented" (which, you know, they could have used instead of "Forbidden", since the "reason" field is purely cosmetic, see RFC9112 section 4). Making a new client fixes this.
The installer/puppet-bootstrap-client in fabric-tasks.git must also be
updated.
This is not expected to happen before year 2039.
Failed systemd units on hosts
To check out what's happening with failed systemd units on a host:
systemctl --failed
You can, of course, run this check on all servers with Cumin:
cumin '*' 'systemctl --failed'
If you need further information you can dive into the logs of the units reported by the command above:
journalctl -xeu failed-unit.service
Disaster recovery
Ideally, the main Puppet server would be deployable from Puppet bootstrap code and the main installer. But in practice, much of its configuration was done manually over the years and it MUST be restored from backups in case of failure.
This probably includes a restore of the PostgreSQL database backing the PuppetDB server as well. It's possible this step could be skipped in an emergency, because most of the information in PuppetDB is a cache of exported resources, reports and facts. But it could also break hosts and make converging the infrastructure impossible, as there might be dependency loops in exported resources.
In particular, the Puppet server needs access to the LDAP server, and that is configured in Puppet. So if the Puppet server needs to be rebuilt from scratch, it will need to be manually allowed access to the LDAP server to compile its manifest.
So it is strongly encouraged to restore the PuppetDB server database as well in case of disaster.
This also applies in case of an IP address change of the Puppet server, in which case access to the LDAP server needs to be manually granted before the configuration can run and converge. This is a known bootstrapping issue with the Puppet server and is further discussed in the design section.
Reference
This section documents, in general terms, how things are set up.
Installation
Setting up a new Puppet server from scratch is not supported, or, to be more accurate, would be somewhat difficult: the server expects various external services to populate it with data, in particular LDAP and Let's Encrypt (see the design section below).
The auto-ca component is also deployed manually, and so are the git hooks, repositories and permissions.
This needs to be documented, automated and improved. Ideally, it should be possible to install a new Puppet server from scratch using nothing but a Puppet bootstrap manifest, see issue 30770 and issue 29387, along with discussion about those improvements in this page, for details.
Puppetserver gems
Our Puppet Server deployment depends on two important Ruby gems: trocla, for
secrets management, and net-ldap for LDAP data retrieval, for example via our
nodeinfo() custom Puppet function.
Puppet Server 7 and later rely on JRuby and an isolated Rubygems environment,
so we can't simply install them using Debian packages. Instead, we need to
use the puppetserver gem command to manually install the gems:
puppetserver gem install net-ldap trocla --no-doc
Then restart puppetserver.service.
Starting from trixie, the trocla-puppetserver package will be available to
replace this manual deployment of the trocla gem.
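A quick way to confirm the gems are visible to the server's JRuby environment (a sanity check, not an official procedure) is:

```
puppetserver gem list | grep -E 'net-ldap|trocla'
```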
Upgrades
Puppet upgrades can be involved, as backwards compatibility between releases is not always maintained. Worse, newer releases are not always packaged in Debian. TPA, and @lavamind in particular, worked really hard to package the Puppet 7 suite to Debian, which finally shipped in Debian 12 ("bookworm"). Lavamind also packaged Puppet 8 for trixie.
See issue 33588 for the background on this.
SLA
No formal SLA is defined. Puppet agents run on a fairly slow schedule (a systemd timer, every four hours, see the scheduling section below), so the service does not have to be highly available right now. This could change in the future if we rely more on it for deployments.
Design
The Puppet master currently lives on pauli. That server
was set up in 2011 by weasel. It follows the configuration of the
Debian Sysadmin (DSA) Puppet server, which has its source code
available in the dsa-puppet repository.
PuppetDB, which was previously hosted on pauli, now runs on its own dedicated
machine puppetdb-01. Its configuration and PostgreSQL database are managed by
the profile::puppetdb and role::puppetdb class pair.
The service is maintained by TPA and manages all TPA-operated machines. Ideally, all services are managed by Puppet, but historically, only basic services were configured through Puppet, leaving service admins responsible for deploying their services on top of it. That tendency has shifted recently (~2020) with the deployment of the GitLab service through Puppet, for example.
The source code to the Puppet manifests (see below for a Glossary) is managed through git on a repository hosted directly on the Puppet server. Agents are deployed as part of the install process, and talk to the central server using a Puppet-specific certificate authority (CA).
As mentioned in the installation section, the Puppet server assumes a few components (namely LDAP, Let's Encrypt and auto-ca) feed information into it. This is also detailed in the sections below. In particular, Puppet acts as a duplicate "source of truth" for some information about servers. For example, LDAP has a "purpose" field describing what a server is for, but Puppet also has the concept of a role, attributed through Hiera (see issue 30273). A similar problem exists with IP addresses and user access control, in general.
Puppet is generally considered stable, but the code base is somewhat showing its age and has accumulated some technical debt.
For example, much of the Puppet code deployed is specific to Tor (and DSA, to a certain extent) and therefore is only maintained by a handful of people. It would be preferable to migrate to third-party, externally maintained modules (e.g. systemd, but also many others, see issue 29387 for details). A similar problem exists with custom Ruby code implemented for various functions, which is being replaced with Hiera (issue 30020).
Glossary
This is a subset of the Puppet glossary to quickly get you started with the vocabulary used in this document.
- Puppet node: a machine (virtual or physical) running Puppet
- Manifest: Puppet source code
- Catalog: the compiled set of Puppet source which gets applied on a node by a Puppet agent
- Puppet agents: the Puppet program that runs on all nodes to apply manifests
- Puppet server: the server which all agents connect to to fetch their catalog, also known as a Puppet master in older Puppet versions (pre-6)
- Facts: information collected by Puppet agents on nodes, and exported to the Puppet server
- Reports: log of changes done on nodes recorded by the Puppet server
- PuppetDB server: an application server on top of a PostgreSQL database providing an API to query various resources like node names, facts, reports and so on
File layout
The Puppet server runs on pauli.torproject.org.
Two bare-mode git repositories live on this server, below
/srv/puppet.torproject.org/git:
- `tor-puppet-hiera-enc.git`, the external node classifier (ENC) code and data. This repository has a hook that deploys to `/etc/puppet/hiera-enc`. See the "External node classifier" section below.
- `tor-puppet.git`, the puppet environments, also referred to as the "control repository". Contains the puppet modules and data. That repository has a hook that deploys to `/etc/puppet/code/environments`. See the "Environments" section below.
The pre-receive and post-receive hooks are fully managed by
Puppet. Both scripts are basically stubs that use run-parts(8) to
execute a series of hooks in pre-receive.d and
post-receive.d. This was done because both hooks were getting quite
unwieldy and needlessly complicated.
The pre-receive hook will stop processing if one of the called hooks
fails, but not the post-receive hook.
External node classifier
Before catalog compilation occurs, each node is assigned an environment
(production, by default) and a "role" through the ENC, which is configured
using the tor-puppet-hiera-enc.git repository. The node definitions at
nodes/$FQDN.yaml are merged with the defaults defined in
nodes/default.yaml.
To be more accurate, the ENC assigns a top-scope `$role` variable to each node, which is in turn used to include a `role::$rolename` class on that node. This occurs in the default node definition in `manifests/site.pp` in `tor-puppet.git`.
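In other words, the default node definition does something like the following simplified sketch (not the actual `manifests/site.pp`):

```puppet
# manifests/site.pp (simplified sketch)
node default {
  # $role is the top-scope variable assigned by the ENC
  if $role {
    include "role::${role}"
  }
}
```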
Some nodes include a list of classes, inherited from the previous Hiera-based setup, but we're in the process of transitioning all nodes to single role classes, see issue 40030 for progress on this work.
Environments
Environments on the Puppet Server are managed using tor-puppet.git which is
our "control repository". Each branch on this repo is mapped to an environment
on the server which takes the name of the branch, with every non-word character (anything matching `\W`) replaced by an underscore.
This deployment is orchestrated using a git pre-receive hook that's managed
via the profile::puppet::server class and the puppet module.
In order to test a new branch/environment on a Puppet node after being pushed
to the control repository, additional configuration needs to be done in
tor-puppet-hiera-enc.git to specify which node(s) should use the test
environment instead of production. This is done by editing the
nodes/<name>.yaml file and adding an environment: key at the document root.
Once the environment is not needed anymore, the changes to the ENC should be
reverted before the branch is deleted on the control repo using git push
--delete <branch>. A git hook will take care of cleaning up the environment
files under /etc/puppet/code/environments.
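As a concrete, hypothetical example (the branch and node names are made up), testing a branch named `my-feature` on a single node could look like this:

```
# push the branch to the control repository, which creates the "my_feature" environment
git push origin my-feature

# in tor-puppet-hiera-enc.git, pin a test node to that environment
$EDITOR nodes/test-01.torproject.org.yaml   # add: environment: my_feature
git commit -am "test my_feature on test-01" && git push

# when done, revert the ENC change, then delete the branch
git push --delete origin my-feature
```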
It should be noted that contrary to Hiera data and modules, exported resources are not confined to environments. Rather, they are all shared among all nodes regardless of their assigned environment.
The environments themselves are structured as follows. All paths are relative to the root of that git repository.
- `modules` includes modules that are shared publicly and do not contain any TPO-specific configuration. There is a `Puppetfile` there that documents where each module comes from and that can be maintained with r10k or librarian.
- `site` includes roles, profiles, and classes that make the bulk of our configuration.
- The `torproject_org` module (`legacy/torproject_org/manifests/init.pp`) performs basic host initialisation, like configuring Debian mirrors and APT sources, installing a base set of packages, configuring puppet and timezone, setting up a bunch of configuration files and running `ud-replicate`.
- There is also the `hoster.yaml` file (`legacy/torproject_org/misc/hoster.yaml`) which defines hosting providers and specifies things like which network blocks they use, if they have a DNS resolver or a Debian mirror. `hoster.yaml` is read by:
  - the `nodeinfo()` function (`modules/puppetmaster/lib/puppet/parser/functions/nodeinfo.rb`), used for setting up the `$nodeinfo` variable
  - `ferm`'s `def.conf` template (`modules/ferm/templates/defs.conf.erb`)
- The root of definition and execution in Puppet is found in the `manifests/site.pp` file. Its purpose is to include a role class for the node as well as a number of other classes which are common to all nodes.
Note that the above is the current state of the file hierarchy. As part of the Hiera transition (issue 30020), a lot of the above architecture will change in favor of the more standard role/profile/module pattern.
Note that this layout might also change in the future with the introduction of a role account (issue 29663) and when/if the repository is made public (which requires changing the layout).
See ticket #29387 for an in-depth discussion.
Installed packages facts
The modules/torproject_org/lib/facter/software.rb file defines our custom facts, making it possible to get answers to questions like "Is this host running apache2?" by simply looking at a Puppet variable.
Those facts are deprecated: packages should be installed through Puppet rather than manually on hosts.
Style guide
Puppet manifests should generally follow the Puppet style guide. This can be easily done with Flycheck in Emacs, vim-puppet, or a similar plugin in your favorite text editor.
Many files do not currently follow the style guide, as they predate the creation of said guide. Files should not be completely reformatted unless there's a good reason. For example, if a conditional covering a large part of a file is removed and the file needs to be re-indented, it's a good opportunity to fix style in the file. Same if a file is split in two components or for some other reason completely rewritten.
Otherwise the style already in use in the file should be followed.
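The style guide checks can also be run from the command line with the `puppet-lint` tool (packaged in Debian), which can automatically fix some issues; for example:

```
puppet-lint --fix modules/ssl/manifests/init.pp
```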
External Node Classifier (ENC)
We use an External Node Classifier (or ENC for short) to classify nodes in different roles but also assign them environments and other variables. The way the ENC works is that the Puppet server requests information from the ENC about a node before compiling its catalog.
The Puppet server pulls three elements about nodes from the ENC:
- `environment` is the standard way to assign nodes to a Puppet environment. The default is `production`, which is the only environment currently deployed.
- `parameters` is a hash where each key is made available as a top-scope variable in a node's manifests. We use this to assign a unique "role" to each node. The way this works is, for a given role `foo`, a class `role::foo` will be included. That class should only consist of a set of profile classes.
- `classes` is an array of class names which Puppet includes on the target node. We are currently transitioning from this method of including classes on nodes (previously in Hiera) to the `role` parameter and unique role classes.
For a given node named $fqdn, these elements are defined in
tor-puppet-hiera-enc.git/nodes/$fqdn.yaml. Defaults can also be set
in tor-puppet-hiera-enc.git/nodes/default.yaml.
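Putting these three elements together, a node definition could look like this hypothetical example (the host name, role and class names are made up):

```yaml
# tor-puppet-hiera-enc.git/nodes/example-01.torproject.org.yaml (hypothetical)
environment: production
parameters:
  role: example
classes:
  - some_legacy_class
```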
Role classes
Each host defined in the ENC declares which unique role it should be
attributed through the parameter hash. For example, this is what
configures a GitLab runner:
parameters:
  role: gitlab::runner
Roles should be abstract and not implementation specific. Each
role class includes a set of profiles which are implementation
specific. For example, the monitoring role includes
profile::prometheus::server and profile::grafana.
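A role class is therefore typically just a list of profile includes; for the monitoring example above, a sketch would look like:

```puppet
# site/role/manifests/monitoring.pp (sketch based on the example above)
class role::monitoring {
  include profile::prometheus::server
  include profile::grafana
}
```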
As a temporary exception to this rule, old modules can be included as
we transition from the Hiera mechanism, but eventually those should
be ported to shared modules from the Puppet forge, with our glue built
into a profile on top of the third-party module. The role
role::gitlab follows that pattern correctly. See issue 40030 for
progress on that work.
Hiera
Hiera is a "key/value lookup tool for configuration data" which Puppet uses to look up values for class parameters and node configuration in General.
We are in the process of transitioning over to this mechanism from our previous set of custom YAML lookup system. This documents the way we currently use Hiera.
Common configuration
Class parameters which are common across several or all roles can be
defined in hiera/common.yaml to avoid duplication at the role level.
However, unless this parameter can be expected to change or evolve over time, it's sometimes preferable to hardcode some parameters directly in profile classes in order to keep this dataset from growing too much, which can impact performance of the Puppet server and degrade its readability. In other words, it's OK to place site-specific data in profile manifests, as long as it may never or very rarely change.
These parameters can be overridden by role and node configurations.
Role configuration
Class parameters specific to a certain node role are defined in
hiera/roles/${::role}.yaml. This is the principal method by which we
configure the various profiles, thus shaping each of the roles we
maintain.
These parameters can be overridden by node-specific configurations.
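As a purely hypothetical illustration of this precedence (the parameter name is made up), a default set in `hiera/common.yaml` can be overridden for a specific role:

```yaml
# hiera/common.yaml
profile::example::max_clients: 100

# hiera/roles/cache.yaml (wins over common.yaml for nodes with the "cache" role)
profile::example::max_clients: 500
```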
Node configuration
On top of the role configuration, some node-specific configuration can
be performed from Hiera. This should be avoided as much as possible,
but sometimes there is just no other way. A good example was the
build-arm-* nodes which included the following configuration:
bacula::client::ensure: "absent"
This disables backups on those machines, which are normally configured
everywhere. This is done because they are behind a firewall and
therefore not reachable, an unusual condition in the network. Another
example is nutans which sits behind a NAT so it doesn't know its own
IP address. To export proper firewall rules, the allow address has
been overridden as such:
bind::secondary::allow_address: 89.45.235.22
Those types of parameters are normally guessed automatically inside modules' classes, but they can be overridden from Hiera.
Note: eventually all host configuration will be done here, but there
are currently still some configurations hardcoded in individual
modules. For example, the Bacula director is hardcoded in the bacula
base class (in modules/bacula/manifests/init.pp). That should be
moved into a class parameter, probably in common.yaml.
Cron and scheduling
Although Puppet supports running the agent as a daemon, our agent runs are
handled by a systemd timer/service unit pair: puppet-run.timer and
puppet-run.service. These are managed via the profile::puppet class and the
puppet module.
The runs are executed every 4 hours, with a random (but fixed per
host, using FixedRandomDelay) 4 hour delay to spread the runs across
the fleet.
Because the additional delay is fixed, any given host should have any given change applied within the next 4 hours. It follows that a change propagates across the fleet within 4 hours as well.
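For illustration, a timer unit implementing that schedule could look like the sketch below; the values are inferred from this section, not copied from the actual `puppet` module:

```
# puppet-run.timer (illustrative sketch)
[Unit]
Description=Periodic Puppet agent run

[Timer]
# run every 4 hours...
OnCalendar=00/4:00:00
# ...plus a delay of up to 4 hours, picked once and kept stable per host
RandomizedDelaySec=4h
FixedRandomDelay=true
Persistent=true

[Install]
WantedBy=timers.target
```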
A Prometheus alert (PuppetCatalogStale) will raise an alarm for
hosts that have not run for more than 24 hours.
LDAP integration
The Puppet server is configured to talk to the LDAP server through a few custom
functions defined in
modules/puppetmaster/lib/puppet/parser/functions. The main plumbing
function is called ldapinfo() and connects to the LDAP server
through db.torproject.org over TLS on port 636. It takes a hostname
as an argument and will load all hosts matching that pattern under the
ou=hosts,dc=torproject,dc=org subtree. If the specified hostname is
the * wildcard, the result will be a hash of host => hash entries,
otherwise only the hash describing the provided host will be
returned.
The nodeinfo() function uses that function to populate the global
$nodeinfo hash available globally, or, more specifically, the
$nodeinfo['ldap'] component. It also loads the $nodeinfo['hoster']
value from the whohosts() function. That function, in turn, tries to
match the IP address of the host against the "hosters" defined in the
hoster.yaml file.
The allnodeinfo() function does a similar task as nodeinfo(),
except that it loads all nodes from LDAP, into a single hash. It
does not include the "hoster" and is therefore equivalent to calling
nodeinfo() on each host and extracting only the ldap member hash
(although it is not implemented that way).
Puppet does not require any special credentials to access the LDAP server. It accesses the LDAP database anonymously, although there is a firewall rule (defined in Puppet) that grants it access to the LDAP server.
There is a bootstrapping problem here: if one were to rebuild the Puppet server, it would actually fail to compile its catalog because it would not be able to connect to the LDAP server to fetch the data it needs, unless the LDAP server has been manually configured to let the Puppet server through.
NOTE: much (if not all?) of this is being moved into Hiera, in
particular the YAML files. See issue 30020 for details. Moving
the host information into Hiera would resolve the bootstrapping
issues, but would require, in turn some more work to resolve questions
like how users get granted access to individual hosts, which is
currently managed by ud-ldap. We cannot, therefore, simply move host
information from LDAP into Hiera without creating a duplicate source
of truth without rebuilding or tweaking the user distribution
system. See also the LDAP design document for more information
about how LDAP works.
Let's Encrypt TLS certificates
Public TLS certificates, as issued by Let's Encrypt, are distributed by Puppet. Those certificates are generated by the "letsencrypt" Git repository (see the TLS documentation for details on that workflow). The relevant part, as far as Puppet is concerned, is that certificates magically end up in the following directory when a certificate is issued or (automatically) renewed:
/srv/puppet.torproject.org/from-letsencrypt
See also the TLS deployment docs for how that directory gets populated.
Normally, those files would not be available from the Puppet
manifests, but the ssl Puppet module uses a special trick whereby
those files are read by Puppet .erb templates. For example, this is
how .crt files get generated on the Puppet master, in
modules/ssl/templates/crt.erb:
<%=
fn = "/srv/puppet.torproject.org/from-letsencrypt/#{@name}.crt"
out = File.read(fn)
out
%>
Similar templates exist for the other files.
Those certificates should not be confused with the "auto-ca" TLS certificates
in use internally and which are deployed directly using a symlink from the
environment's modules/ssl/files/ to /var/lib/puppetserver/auto-ca, see
below.
Internal auto-ca TLS certificates
The Puppet server also manages an internal CA which we informally call "auto-ca". Those certificates are internal in that they are used to authenticate nodes to each other, not to the public. They are used, for example, to encrypt connections between mail servers (in Postfix) and backup servers (in Bacula).
The auto-ca deploys those certificates into an "auto-ca" directory under the
Puppet "$vardir", /var/lib/puppetserver/auto-ca, which is symlinked from the
environment's modules/ssl/files/. Details of that system are available in the
TLS documentation.
Issues
There is no issue tracker specifically for this project. File or search for issues in the team issue tracker with the ~Puppet label.
Monitoring and testing
Puppet is monitored using Prometheus through the Prometheus
reporter. This is a small Ruby module that ingests reports posted
by Puppet agent to the Puppet server and writes metrics to the
Prometheus node exporter textfile collector, in
/var/lib/prometheus/node-exporter.
There is an alert (PuppetCatalogStale) raised for hosts that have
not run for more than 24 hours, and another (PuppetAgentErrors) if a
given node has errors running its catalog.
We were previously checking Puppet twice when we were running Icinga:
- One job ran on the Puppetmaster and checked PuppetDB for reports. This was done with a patched version of the `check_puppetdb_nodes` Nagios check, shipped inside the `tor-nagios-checks` Debian package.
- That job actually ran twice: once to check all manifests, and a second time to check each host individually and assign the result to the right host.
The twin checks were present so that we could find stray Puppet hosts. For example, if a host was retired from Icinga but not retired from Puppet, or added to Icinga but not Puppet, we would notice. This was necessary because the Icinga setup was not Puppetized: the twin check now seems superfluous and we only check reports on the server.
Note that we could check agents individually with the puppet agent exporter.
There are no validation checks and a priori no peer review of code: code is directly pushed to the Puppet server without validation. Work is being done to implement automated checks but that is only being deployed on the client side for now, and voluntarily. See the Validating Puppet code section above.
Logs and metrics
PuppetDB exposes a performance dashboard which is accessible via web. To reach
it, first establish an ssh forwarding to puppetdb-01 on port 8080 as
described on this page, and point your browser at
http://localhost:8080/pdb/dashboard/index.html
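In practice, that forwarding can be established with something like the following (assuming the host name resolves for you; see the linked page for the canonical procedure):

```
ssh -N -L 8080:localhost:8080 puppetdb-01.torproject.org
```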
PuppetDB itself also holds performance information about the Puppet agent runs, which are called "reports". Those reports contain information about changes operated on each server, how long the agent runs take and so on. Those metrics could be made more visible by using a dashboard, but that has not been implemented yet (see issue 31969).
The Puppet server, Puppet agents and PuppetDB keep logs of their
operations. The latter keeps its logs in /var/log/puppetdb/ for a
maximum of 90 days or 1GB, whichever comes first (configured in
/etc/puppetdb/request-logging.xml and
/etc/puppetdb/logback.xml). The other logs are sent to syslog, and
usually end up in daemon.log.
Puppet should hold minimal personally identifiable information, like user names, user public keys and project names.
Other documentation
- Latest Puppet docs - might be too new, see also the Puppet 5.5 docs
- Function reference
- Type reference
- Mapping between versions of Puppet Enterprise, Facter, Hiera, Agent, etc
Discussion
This section goes more in depth into how Puppet is setup, why it was setup the way it was, and how it could be improved.
Overview
Our Puppet setup dates back from 2011, according to the git history, and was probably based off the Debian System Administrator's Puppet codebase which dates back to 2009.
Goals
The general goal of Puppet is to provide basic automation across the architecture, so that software installation and configuration, file distribution, user and some service management is done from a central location, managed in a git repository. This approach is often called Infrastructure as code.
This section also documents possible improvements to our Puppet configuration that we are considering.
Must have
- secure: only sysadmins should have access to push configuration, whatever happens. this includes deploying only audited and verified Puppet code into production.
- code review: changes on servers should be verifiable by our peers, through a git commit log
- fix permissions issues: deployment system should allow all admins to push code to the puppet server without having to constantly fix permissions (e.g. through a role account)
- secrets handling: there are some secrets in Puppet. those should remain secret.
We mostly have this now, although there are concerns about permissions being wrong sometimes, which a role account could fix.
Nice to have
Those are mostly issues with the current architecture we'd like to fix:
- Continuous Integration: before deployment, code should be vetted by a peer and, ideally, automatically checked for errors and tested
- single source of truth: when we add/remove nodes, we should not have to talk to multiple services (see also the install automation ticket and the new-machine discussion)
- collaboration with other sysadmins outside of TPA, for which we would need to...
- ... publicize our code (see ticket 29387)
- no manual changes: every change on every server should be committed to version control somewhere
- bare-metal recovery: it should be possible to recover a service's configuration from a bare Debian install with Puppet (and with data from the backup service of course...)
- one commit only: we shouldn't have to commit "twice" to get changes propagated (once in a submodule, once in the parent module, for example)
Non-Goals
- ad hoc changes to the infrastructure. one-off jobs should be handled by fabric, Cumin, or straight SSH.
Approvals required
TPA should approve policy changes as per tpa-rfc-1.
Proposed Solution
To improve on the above "Goals", I would suggest the following configuration.
TL;DR:
- publish our repository (tpo/tpa/team#29387)
- Use a control repository
- Get rid of `3rdparty`
- Deploy with `g10k`
- Authenticate with checksums
- Deploy to branch-specific environments (tpo/tpa/team#40861)
- Rename the default branch "production"
- Push directly on the Puppet server
- Use a role account (tpo/tpa/team#29663)
- Use local test environments
- Develop a test suite
- Hook into CI
- OpenPGP verification and web hook
Steps 1-8 could be implemented without too much difficulty and should be a mid term objective. Steps 9 to 12 require significantly more work and could be implemented once the new infrastructure stabilizes.
What follows is an explanation and justification of each step.
Publish our repository
Right now our Puppet repository is private, because there's sensitive information in there. The goal of this step is to make sure we can safely publish our repository without risking disclosing secrets.
Secret data is currently stored in Trocla, and we should keep using it for that purpose. That would avoid having to mess around splitting the repository in multiple components in the short term.
This is the data that needs to be moved into Trocla at the time of writing:
- `modules/postfix/files/virtual` - email addresses
- `modules/postfix/files/access-1-sender-reject` and related - email addresses
- sudoers configurations?
A full audit should be redone before this is completed.
Use a control repository
The base of the infrastructure is a control-repo (example, another more complex example) which chain-loads all the other modules. This implies turning all our "modules" into "profiles" and moving "real" modules (which are fit for public consumption) "outside", into public repositories (see also issue 29387: publish our puppet repository).
Note that the control repository could also be public: we could simply have all the private data inside of Trocla or some other private repository.
The control repository concept originates from the proprietary version of Puppet (Puppet Enterprise or PE) but its logic is applicable to the open source Puppet release as well.
Get rid of 3rdparty
The control repo's core configuration file is the Puppetfile. We
already use a Puppetfile to manage modules inside of the 3rdparty
directory.
Our current modules/ directory would be split into site/, which
is the designated location for roles and profiles, and legacy/, which
would host private custom modules, with the goal of getting rid of legacy/
altogether by either publishing our custom modules and integrating them into
the Puppetfile or transforming them into a new profile class in
site/profile/.
In other words, this is the checklist:
- [x] convert everything to hiera (tpo/tpa/team#30020) - this requires creating `roles` for each machine (more or less) -- effectively done as far as this issue is concerned
- [ ] sanitize repository (tpo/tpa/team#29387)
- [x] rename `hiera/` to `data/`
- [x] add `site/` and `legacy/` to the modulepath in the environment configuration
- [x] move the `modules/profile/` and `modules/role/` modules into `site/`
- [x] move remaining modules in `modules/` into `legacy/`
- [x] move `3rdparty/*` into the environment root
Once this is done, our Puppet environment would look like this:
- `data/` - configuration data for profiles and modules
- `modules/` - equivalent of the current `3rdparty/modules/` directory: fully public, reusable code that's aimed at collaboration, mostly code from the Puppet forge or our own repository if no equivalent exists there
- `site/profile/` - "magic sauce" on top of 3rd party `modules/` to configure them according to our site-specific requirements
- `site/role/` - abstract classes that assemble several profiles to define a logical role for any given machine in our infrastructure
- `legacy/` - remaining custom modules that still need to be either published and moved to their own repository in `modules/`, or replaced with an existing 3rd party module (eg. from voxpupuli)
Although the module paths would be rearranged, no class names would be changed as a result of this, such that no changes would be required of the actual puppet code.
Deploy with g10k
It seems clear that everyone is converging on the use of a Puppetfile to deploy code. There are still monorepos out there, but they make our life harder, especially when we need to operate on non-custom modules.
Instead, we should converge towards not following upstream modules
in our git repository. Modules managed by the Puppetfile would not
be managed in our git monorepo and, instead, would be deployed by
r10k or g10k (most likely the latter because of its support for
checksums).
Note that neither r10k nor g10k resolves dependencies in a Puppetfile. We therefore also need a tool to verify that the file correctly lists all required modules. The following solutions need to be validated but could address that issue:
- generate-puppetfile: take a `Puppetfile` and walk the dependency tree, generating a new `Puppetfile` (see also this introduction to the project)
- Puppetfile-updater: read the `Puppetfile` and fetch new releases
- ra10ke: a bunch of Rake tasks to validate a `Puppetfile`:
  - `r10k:syntax`: syntax check, see also `r10k puppetfile check`
  - `r10k:dependencies`: check for out of date dependencies
  - `r10k:solve_dependencies`: check for missing dependencies
  - `r10k:install`: wrapper around `r10k` to install with some caveats
  - `r10k:validate`: make sure modules are accessible
  - `r10k:duplicates`: look for duplicate declarations
- lp2r10k: convert a "librarian" `Puppetfile` (missing dependencies) into a "r10k" `Puppetfile` (with dependencies)
Note that this list comes from the updating your Puppetfile documentation in the r10k project, which is also relevant here.
Authenticate code with checksums
This part is the main problem with moving away from a monorepo. By
using a monorepo, we can audit the code we push into production. But
if we offload this to r10k, it can download code from wherever the
Puppetfile says, effectively shifting our trust path from OpenSSH
to HTTPS, the Puppet Forge, git and whatever remote gets added to the
Puppetfile.
There is no obvious solution for this right now, surprisingly. Here are two possible alternatives:
- g10k supports using a `:sha256sum` parameter to checksum modules, but that only works for Forge modules. Maybe we could pair this with using an explicit `sha1` reference for git repositories, ensuring those are checksummed as well. The downside of that approach is that it leaves checked out git repositories in a "detached head" state.
- `r10k` has a pending pull request to add a `filter_command` directive which could run after a git checkout has been performed. It could presumably be used to verify OpenPGP signatures on git commits, although this would work only on modules we sign commits on (and therefore not third party).
It seems the best approach would be to use g10k for now, with checksums on both git commits and Forge modules.
A validation hook running before g10k COULD validate that all mod
lines have a checksum of some sort...
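To make the idea concrete, such a Puppetfile could pin both kinds of modules; the module names, versions and checksums below are purely illustrative:

```ruby
# Puppetfile (illustrative sketch)
# Forge module, verified by g10k through its checksum
mod 'puppetlabs/stdlib', '9.4.1',
  :sha256sum => '0000000000000000000000000000000000000000000000000000000000000000'

# git module, pinned to an explicit commit instead of a branch
mod 'example',
  :git => 'https://example.com/puppet-example.git',
  :ref => 'deadbeefdeadbeefdeadbeefdeadbeefdeadbeef'
```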
Note that this approach does NOT solve the "double-commit" problem identified in the Goals. It is believed that only a "monorepo" would fix that problem and that approach comes in direct conflict with the "collaboration" requirement. We chose the latter.
This could be implemented as a patch to ra10ke.
Deploy to branch-specific environments
A key feature of r10k (and, of course, g10k) is that they are capable of deploying code to new environments depending on the branch we're working on. We would enable that feature to allow testing some large changes to critical code paths without affecting all servers.
See tpo/tpa/team#40861.
Rename the default branch "production"
In accordance with Puppet's best practices, the control repository's default branch would be called "production" and not "master".
Also: Black Lives Matter.
Push directly on the Puppet server
Because we are worried about the GitLab attack surface, we could still keep on pushing to the Puppet server for now. The control repository could be mirrored to GitLab using a deploy key. All other repositories would be published on GitLab anyways, and there the attack surface would not matter because of the checksums in the control repository.
Use a role account
To avoid permission issues, use a role account (say git) to accept
pushes and enforce git hooks (tpo/tpa/team#29663).
Use local test environments
It should eventually be possible to test changes locally before pushing to production. This would involve radically simplifying the Puppet server configuration and probably either getting rid of the LDAP integration or at least making it optional so that changes can be tested without it.
This would involve "puppetizing" the Puppet server configuration so that a Puppet server and test agent(s) could be bootstrapped automatically. Operators would run "smoke tests" (running Puppet by hand and looking at the result) to make sure their code works before pushing to production.
Develop a test suite
The next step is to start working on a test suite for services, at
least for new deployments, so that code can be tested without running
things by hand. Plenty of Puppet modules have such test suite,
generally using rspec-puppet and rspec-puppet-facts, and we
already have a few modules in modules that have such tests. The
idea would be to have those tests on a per-role or per-profile basis.
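As a sketch of what such a test could look like (reusing the `role::gitlab` example from above; the rest is standard rspec-puppet and rspec-puppet-facts boilerplate):

```ruby
# spec/classes/role_gitlab_spec.rb (sketch)
require 'spec_helper'

describe 'role::gitlab' do
  on_supported_os.each do |os, os_facts|
    context "on #{os}" do
      let(:facts) { os_facts }

      # the catalog should at least compile with all dependencies
      it { is_expected.to compile.with_all_deps }
    end
  end
end
```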
The Foreman people have published their test infrastructure which could be useful as inspiration for our purposes here.
Hook into continuous integration
Once tests are functional, the last step is to move the control repository into GitLab directly and start running CI against the Puppet code base. This would probably not happen until GitLab CI is deployed, and would require lots of work to get there, but would eventually be worth it.
The GitLab CI would be indicative: an operator would need to push to a topic branch there first to confirm tests pass but would still push directly to the Puppet server for production.
Note that we are working on (client-side) validation hooks for now, see issue 31226.
OpenPGP verification and web hook
To stop pushing directly to the Puppet server, we could implement OpenPGP verification on the control repository. If a hook checks that commits are signed by a trusted party, it does not matter where the code is hosted.
A good reference for OpenPGP verification is this guix article which covers a few scenarios and establishes a pretty solid verification workflow. There's also a larger project-wide discussion in GitLab issue 81.
We could use the webhook system to have GitLab notify the Puppet server to pull code.
Cost
N/A.
Alternatives considered
Ansible was considered for managing GitLab for a while, but this was eventually abandoned in favor of using Puppet and the "Omnibus" package.
For ad hoc jobs, fabric is being used.
For code management, I have done a more extensive review of possible alternatives. This talk is a good introduction to git submodules, librarian and r10k. Based on that talk and these slides, I've made the following observations:
ENCs
- LDAP-enc: OFTC uses LDAP to store classes to load for a given host
repository management
monorepo
This is our current approach, which is that all code is committed in one monolithic repository. This effectively makes it impossible to share code outside of the repository with anyone else because there is private data inside, but also because it doesn't follow the standard role/profile/modules separation that makes collaboration possible at all. To work around that, I designed a workflow where we locally clone subrepos as needed, but this is clunky as it requires to commit every change twice: one for the subrepo, one for the parent.
Our giant monorepo also mixes all changes together, which can be both a pro and a con: on the one hand it's easy to see and audit all changes at once, but on the other hand, it can be overwhelming and confusing.
But it does allow us to integrate with librarian right now and is a good stopgap solution. A better solution would need to solve the "double-commit" problem and still allow us to have smaller repositories that we can collaborate on outside of our main tree.
submodules
The talk partially covers how git submodules work and how hard they are to deal with. I say partially because submodules are even harder to deal with than the examples she gives. She shows how submodules are hard to add and remove, because the metadata is stored in multiple locations (`.gitmodules`, `.git/config`, `.git/modules/` and the submodule repository itself).
She also mentions submodules don't know about dependencies and it's likely you will break your setup if you forget one step. (See this post for more examples.)
In my experience, the biggest annoyance with submodules is the "double-commit" problem: you need to make commits in the submodule, then redo the commits in the parent repository to chase the head of that submodule. This does not improve on our current situation, which is that we need to do those two commits anyways in our giant monorepo.
One advantage with submodules is that they're mostly standard: everyone knows about them, even if they're not familiar and their knowledge is reusable outside of Puppet.
Others have strong opinions about submodules, with one Debian
developer suggesting to Never use git submodules and instead
recommending git subtree, a monorepo, myrepos, or ad-hoc scripts.
librarian
Librarian is written in Ruby. It's built on top of another library called librarian that is used by Ruby's bundler. At the time of the talk, it was "pretty active" but unfortunately librarian now seems to be abandoned, so we might be forced to use r10k in the future, which has a quite different workflow.
One problem with librarian right now is that librarian update clears
any existing git subrepo and re-clones it from scratch. If you have
temporary branches that were not pushed remotely, all of those are
lost forever. That's really bad and annoying! It's by design: it
"takes over your modules directory", as she explains in the talk and
everything comes from the Puppetfile.
Librarian does resolve dependencies recursively and store the decided versions in a lockfile which allow us to "see" what happens when you update from a Puppetfile.
But there's no cryptographic chain of trust between the repository where the Puppetfile is and the modules that are checked out. Unless the module is checked out from git (which isn't the default), only version range specifiers constrain which code is checked out, which gives a huge surface area for arbitrary code injection in the entire puppet infrastructure (e.g. MITM, forge compromise, hostile upstream attacks)
r10k
r10k was written because librarian was too slow for large
deployments. But it covers more than just managing code: it also
manages environments and is designed to run on the Puppet master. It
doesn't have dependency resolution or a Puppetfile.lock,
however. See this ticket, closed in favor of that one.
r10k is more complex and very opinionated: it requires lots of configuration including its own YAML file, hooks into the Puppetmaster and can take a while to deploy. r10k is still in active development and is supported by Puppetlabs, so there's official documentation in the Puppet documentation.
Often used in conjunction with librarian for dependency resolution.
One cool feature is that r10k allows you to create dynamic environments based on branch names. All you need is a single repo with a Puppetfile and r10k handles the rest. The problem, of course, is that you need to trust it's going to do the right thing. There's the security issue, but there's also the problem of resolving dependencies and you do end up double-committing in the end if you use branches in sub-repositories. But maybe that is unavoidable.
(Note that there are ways of resolving dependencies with external tools, like generate-puppetfile (introduction) or this hack that reformats librarian output or those rake tasks. there's also a go rewrite called g10k that is much faster, but with similar limitations.)
git subtree
This article briefly mentions git subtrees from the point of view of Puppet management. It outlines how it's cool that the history of the subtree gets merged as is into the parent repo, which gives us the best of both worlds (an individual, per-module history view along with a global view in the parent repo). It makes, however, rebasing in subtrees impossible, as it breaks the parent merge. You do end up with some of the disadvantages of the monorepo, in that all the code is actually committed in the parent repo and you do have to commit twice as well.
subrepo
The git-subrepo is "an improvement from git-submodule and
git-subtree". It is a mix between a monorepo and a submodule system,
with modules being stored in a .gitrepo file. It is somewhat less
well known than the other alternatives, presumably because it's newer?
It is entirely written in bash, which I find somewhat scary. It is
not packaged in Debian yet but might be soon.
It works around the "double-commit issue" by having a special git
subrepo commit command that "does the right thing". That, in general,
is its major flaw: it reproduces many git commands like init,
push, pull as subcommands, so you need to remember which command
to run. To quote the (rather terse) manual:
All the subrepo commands use names of actual Git commands and try to do operations that are similar to their Git counterparts. They also attempt to give similar output in an attempt to make the subrepo usage intuitive to experienced Git users.
Please note that the commands are not exact equivalents, and do not take all the same arguments
Still, its feature set is impressive and could be the perfect mix between the "submodules" and "subtree" approach of still keeping a monorepo while avoiding the double-commit issue.
myrepos
myrepos is one of many solutions to manage multiple git repositories. It has been used in the past at my old workplace (Koumbit.org) to manage and checkout multiple git repositories.
Like Puppetfile without locks, it doesn't enforce cryptographic integrity between the master repositories and the subrepositories: all it does is define remotes and their locations.
Like r10k it doesn't handle dependencies and will require extra setup, although it's much lighter than r10k.
Its main disadvantage is that it isn't well known and might seem esoteric to people. It also has weird failure modes, but could be used in parallel with a monorepo. For example, it might allow us to setup specific remotes in subdirectories of the monorepo automatically.
Summary table
| Approach | Pros | Cons | Summary |
|---|---|---|---|
| Monorepo | Simple | Double-commit | Status quo |
| Submodules | Well-known | Hard to use, double-commit | Not great |
| Librarian | Dep resolution client-side | Unmaintained, bad integration with git | Not sufficient on its own |
| r10k | Standard | Hard to deploy, opinionated | To evaluate further |
| Subtree | "best of both worlds" | Still get double-commit, rebase problems | Not sure it's worth it |
| Subrepo | subtree + optional | Unusual, new commands to learn | To evaluate further |
| myrepos | Flexible | Esoteric | might be useful with our monorepo |
Best practices survey
I made a survey of the community (mostly the shared puppet modules and Voxpupuli groups) to find out what the best current practices are.
Koumbit uses foreman/puppet but pinned at version 10.1 because it is
the last one supporting "passenger" (the puppetmaster deployment
method currently available in Debian, deprecated and dropped from
puppet 6). They patched it to support puppetlabs/apache < 6.
They push to a bare repo on the puppet master, then they have
validation hooks (the inspiration for our own hook implementation, see
issue 31226), and a hook deploys the code to the right branch.
They were using r10k but stopped because they had issues when r10k would fail to deploy code atomically, leaving the puppetmaster (and all nodes!) in an unusable state. This would happen when their git servers were down without a locally cached copy. They also implemented branch cleanup on deletion (although that could have been done some other way). That issue was apparently reported against r10k but never got a response. They now use puppet-librarian in their custom hook. Note that it's possible r10k does not actually have that issue because they found the issue they filed and it was... against librarian!
Some people in #voxpupuli seem to use the Puppetlabs Debian packages and therefore puppetserver, r10k and puppetboards. Their Monolithic master architecture uses an external git repository, which pings the puppetmaster through a webhook which deploys a control-repo (example) and calls r10k to deploy the code. They also use foreman as a node classifier. that procedure uses the following modules:
- puppet/puppetserver
- puppetlabs/puppet_agent
- puppetlabs/puppetdb
- puppetlabs/puppet_metrics_dashboard
- voxpupuli/puppet_webhook
- r10k or g10k
- Foreman
They also have a master of masters architecture for scaling to larger setups. For scaling, I have found this article to be more interesting, that said.
So, in short, it seems people are converging towards r10k with a web hook. To validate git repositories, they mirror the repositories to a private git host.
After writing this document, anarcat decided to try a setup with a
"control-repo" and g10k, because the latter can cryptographically
verify third-party repositories, either through a git hash or tarball
checksum. There's still only a single environment (I haven't
implemented the "create an environment on a new branch" hook). And it
often means two checkins when we work on shared modules, but that can
be alleviated by skipping the cryptographic check and trusting
transport by having the Puppetfile chase a branch name instead of a
checksum, during development. In production, of course, a checksum can
then be pinned again, but that is the biggest flaw in that workflow.
Other alternatives
- josh: "Combine the advantages of a monorepo with those of multirepo setups by leveraging a blazingly-fast, incremental, and reversible implementation of git history filtering."
- lerna: Node/JS multi-project management
- lite: git repo splitter
- git-subsplit: "Automate and simplify the process of managing one-way read-only subtree splits"