The "static component" or "static mirror" system is a set of servers, scripts and services designed to publish content over the world wide web (HTTP/HTTPS). It is designed to be highly available and distributed, a sort of content distribution network (CDN).

[[TOC]]

Tutorial

This documentation covers administering the static site components, from a sysadmin perspective. User documentation lives in doc/static-sites.

How-to

Adding a new component

add the component to Puppet, in modules/staticsync/data/common.yaml:

onionperf.torproject.org:
  master: staticiforme.torproject.org
  source: staticiforme.torproject.org:/srv/onionperf.torproject.org/htdocs/

create the directory on staticiforme:

ssh staticiforme "mkdir -p /srv/onionperf.torproject.org/htdocs/ \
    && chown torwww:torwww /srv/onionperf.torproject.org/{,htdocs} \
    && chmod 770 /srv/onionperf.torproject.org/{,htdocs}"

add the host to DNS, if not already present, see howto/dns, for example add this line in dns/domains/torproject.org:
```
onionperf   IN  CNAME   static
```
add an Apache virtual host, by adding a line like this to modules/roles/templates/static-mirroring/vhost/static-vhosts.erb (see howto/puppet):
```
vhost(lines, 'onionperf.torproject.org')
```
add an SSL service, by adding a line like this to modules/roles/manifests/static_mirror_web.pp (see howto/puppet):
```
ssl::service { 'onionperf.torproject.org': ensure => 'ifstatic', notify  => Exec['service apache2 reload'], key => true, }
```
This also requires generating an X.509 certificate, for which we use Let's Encrypt. See howto/letsencrypt for details.
add an onion service, by adding another onion::service line to modules/roles/manifests/static_mirror_onion.pp (see howto/puppet):
```
onion::service {
    [...]
    'onionperf.torproject.org',
    [...]
}
```
run Puppet on the master and mirrors:
```
ssh staticiforme puppet agent -t
cumin 'C:roles::static_mirror_web' 'puppet agent -t'
```
The latter is done with howto/cumin, see also howto/puppet for a way to do jobs on all hosts.

consider creating a new role and group for the component if none match its purpose, see howto/create-a-new-user for details:

ssh alberti.torproject.org ldapvi -ZZ --encoding=ASCII --ldap-conf -H ldap://db.torproject.org -D "uid=$USER,ou=users,dc=torproject,dc=org"

if you created a new group, you will probably need to modify the legacy_sudoers file to grant a user access to the role/group, see modules/profile/files/sudo/legacy_sudoers in the tor-puppet repository (and howto/puppet to learn how to make changes to Puppet). onionperf is a good example of how to create a sudoers entry. Edit the file with visudo so that the syntax gets checked:
```
visudo -f modules/profile/files/sudo/legacy_sudoers
```
This, for example, is the line that was added for onionperf:
```
%torwww,%metrics        STATICMASTER=(mirroradm)    NOPASSWD: /usr/local/bin/static-master-update-component onionperf.torproject.org, /usr/local/bin/static-update-component onionperf.torproject.org
```

Removing a component

This procedure can be followed if we remove a static component. We should, however, generally keep a redirection to another place to avoid breaking links, so the instructions also include notes on how to keep a "vanity site" around.

This procedure is common to all cases:

remove the component from Puppet, in modules/staticsync/data/common.yaml
remove the Apache virtual host, by removing a line like this from modules/roles/templates/static-mirroring/vhost/static-vhosts.erb (see howto/puppet):
```
vhost(lines, 'onionperf.torproject.org')
```

remove the SSL service, by removing a line like this from modules/roles/manifests/static_mirror_web.pp (see howto/puppet):

ssl::service { 'onionperf.torproject.org': ensure => 'ifstatic', notify  => Exec['service apache2 reload'], key => true, }

remove the onion service, by removing the matching onion::service line from modules/roles/manifests/static_mirror_onion.pp (see howto/puppet):
```
onion::service {
    [...]
    'onionperf.torproject.org',
    [...]
}
```
remove the sudo rules for the role user
If we do want to keep a vanity site for the redirection, we should also do this:
- add an entry to roles::static_mirror_web_vanity, in the ssl::service block of modules/roles/manifests/static_mirror_web_vanity.pp
- add a redirect in the template (modules/roles/templates/static-mirroring/vhost/vanity-vhosts.erb), for example:
  
  Use vanity-host onionperf.torproject.org ^/(.*)$ https://gitlab.torproject.org/tpo/metrics/team/-/wikis/onionperf

deploy the changes globally, replacing {staticsource} with the component's source server hostname, often staticiforme or static-gitlab-shim

ssh {staticsource} puppet agent -t
ssh static-master-fsn puppet agent -t
cumin 'C:roles::static_mirror_web or C:roles::static_mirror_web_vanity' 'puppet agent -t'

remove the home directory specified on the server:

ssh {staticsource} "mv /srv/onionperf.torproject.org/htdocs/ /srv/onionperf.torproject.org/htdocs-OLD ; echo rm -rf /srv/onionperf.torproject.org/htdocs-OLD | at now + 7 days"
ssh static-master-fsn "rm -rf /srv/static.torproject.org/master/onionperf.torproject.org*"
cumin -o txt 'C:roles::static_mirror_web' 'mv /srv/static.torproject.org/mirrors/onionperf.torproject.org /srv/static.torproject.org/mirrors/onionperf.torproject.org-OLD'
cumin -o txt 'C:roles::static_mirror_web' 'echo rm -rf /srv/static.torproject.org/mirrors/onionperf.torproject.org-OLD | at now + 7 days'

consider removing the role user and group in LDAP, if there are no files left owned by that user

If we do not want to keep a vanity site, we should also do this:

remove the host from DNS, if present, see howto/dns. This can be either in dns/domains.git or dns/auto-dns.git
remove the Let's Encrypt certificate, see howto/letsencrypt for details

Out of date mirror

WARNING: this playbook is out of date, as this alert was retired in the Prometheus migration. There's a long-term plan to restore it, but considering those alerts were mostly noise, it has not been prioritized, see tpo/tpa/team#42007.

If you see an error like this in Nagios:

mirror static sync - deb: CRITICAL: 1 mirror(s) not in sync (from oldest to newest): 95.216.163.36

It means that Nagios has found that the given host (hetzner-hel1-03.torproject.org, in this case) is not in sync for the deb component, which is https://deb.torproject.org.

In this case, it was because of a prolonged outage on that host, which made it unreachable to the master server (tpo/tpa/team#40432).

The solution is to run a manual sync. This can be done by, for example, running a deploy job in GitLab (see static-shim) or running static-update-component by hand, see doc/static-sites.

In this particular case, the solution is simply to run this on the static source (palmeri at the time of writing):

static-update-component deb.torproject.org

Disaster recovery

TODO: add a disaster recovery.

Restoring a site from backups

The first thing you need to decide is where you want to restore from. Typically you want to restore the site from the source server. If you do not know where the source server is, you can find it in tor-puppet.git, in the modules/staticsync/data/common.yaml.

Then head to the Bacula director to perform the restore:

ssh bacula-director-01

And run the restore procedure. Enter the bacula console:

# bconsole

Then the procedure, in this case we're restoring from static-gitlab-shim:

restore
5 # (restores latest backup from a host)
77 # (picks static-gitlab-shim from the list)
mark /srv/static-gitlab-shim/status.torproject.org
done
yes

Then wait for the restore to complete. You can check the progress by typing mess to dump all messages (warning: that floods your console) or status director. When the restore is done, you can type quit.

The restored files will be directly on the host, in /var/tmp/bacula-restores. You can change that path to restore in-place in the last step, by typing mod instead of yes. The rest of the guide assumes the restored files are in /var/tmp/bacula-restores/.

Now go on the source server:

ssh static-gitlab-shim.torproject.org

If you haven't restored in place, you should move the current site aside, if present:

mv /srv/static-gitlab-shim/status.torproject.org /srv/static-gitlab-shim/status.torproject.org.orig

Check the permissions are correct on the restored directory:

ls -l /var/tmp/bacula-restores/srv/static-gitlab-shim/status.torproject.org/ /srv/static-gitlab-shim/status.torproject.org.orig/

Typically, you will want to give the files to the shim:

chown -R static-gitlab-shim:static-gitlab-shim /srv/static-gitlab-shim/status.torproject.org/

Then rsync the site in place:

rsync -a /var/tmp/bacula-restores/srv/static-gitlab-shim/status.torproject.org/ /srv/static-gitlab-shim/status.torproject.org/

We rsync the site, rather than moving it, in case whatever destroyed the site happens again: this keeps a fresh copy of the backup in /var/tmp.

Once that is completed, you need to trigger a static component update:

static-update-component status.torproject.org

The site is now restored.

Reference

Installation

Servers are mostly configured in Puppet, with some exceptions. See the design section below for details on the Puppet classes in use. Typically, a web mirror will use roles::static_mirror_web, for example.

Web mirror setup

To set up a web mirror, create a new server with the following entries in LDAP:

allowedGroups: mirroradm
allowedGroups: weblogsync

Then run these commands on the LDAP server:

puppet agent -t
sudo -u sshdist ud-generate
sudo -H ud-replicate

This will ensure the mirroradm user is created on the host.

Then the host needs the following Puppet configuration in Hiera-ENC:

classes:
  - roles::static_mirror_web

The following should also be added to the node's Hiera data:

staticsync::static_mirror::get_triggered: false

The get_triggered parameter ensures the host will not block static site updates while it's doing its first sync.

Then Puppet can be run on the host, after apache2 is installed to make sure the apache2 Puppet module picks it up:

apt install apache2
puppet agent -t

You might need to reboot to get some firewall rules to load correctly:

reboot

The server should start a sync after reboot. However, it's likely that the SSH keys it uses to sync have not been propagated to the master server. If the sync fails, you might receive an email with lots of lines like:

[MSM] STAGE1-START (2021-03-11 19:38:59+00:00 on web-chi-03.torproject.org)

It might be worth running the sync by hand, with:

screen sudo -u mirroradm static-mirror-run-all

The server may also need to be added to the static component configuration in modules/staticsync/data/common.yaml, if it is to carry a full mirror, or exclude some components. For example, web-fsn-01 and web-chi-03 both carry all components, so they need to be added to all limit-mirrors statements, like this:

components:
  # [...]
  dist.torproject.org:
    master: static-master-fsn.torproject.org
    source: staticiforme.torproject.org:/srv/dist-master.torproject.org/htdocs
    limit-mirrors:
      - archive-01.torproject.org
      - web-cymru-01.torproject.org
      - web-fsn-01.torproject.org
      - web-fsn-02.torproject.org
      - web-chi-03.torproject.org

Once that is changed, make sure to run puppet agent -t on the relevant static master. After running puppet on the static master, the static-mirror-run-all command needs to be rerun on the new mirror (although it will also run on the next reboot).
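When adding a mirror that should carry everything, it can help to list the components whose limit-mirrors statement does not yet mention it. The following is a minimal sketch, assuming the YAML file has already been parsed into a dictionary shaped like the example above; the `components_missing_mirror` helper and the sample data are hypothetical and not part of the staticsync code.

```python
# Hypothetical excerpt of the parsed "components" mapping from common.yaml.
components = {
    "dist.torproject.org": {
        "master": "static-master-fsn.torproject.org",
        "source": "staticiforme.torproject.org:/srv/dist-master.torproject.org/htdocs",
        "limit-mirrors": [
            "archive-01.torproject.org",
            "web-fsn-01.torproject.org",
        ],
    },
    "onionperf.torproject.org": {
        "master": "staticiforme.torproject.org",
        "source": "staticiforme.torproject.org:/srv/onionperf.torproject.org/htdocs/",
    },
}


def components_missing_mirror(components, mirror):
    """Return the components that have a limit-mirrors list omitting `mirror`."""
    return sorted(
        name
        for name, conf in components.items()
        if "limit-mirrors" in conf and mirror not in conf["limit-mirrors"]
    )


# Components that still need the new mirror added to their limit-mirrors list.
print(components_missing_mirror(components, "web-chi-03.torproject.org"))
```

Components without a limit-mirrors statement are skipped here, since there is no list to extend for them.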

When the sync is finished, you can remove this line:

staticsync::static_mirror::get_triggered: false

... and the node can be added to the various files in dns/auto-dns.git.

Then, to be added to Fastly, this was also added to Hiera:

roles::cdn_torproject_org::fastly_backend: true

Once that change is propagated, you need to change the Fastly configuration using the tools in the cdn-config-fastly repository. Note that only one of the nodes is a "backend" for Fastly, and typically not the nodes that are in the main rotation (so that the Fastly frontend survives if the main rotation dies). But the main rotation servers act as a backup for the main backend.

Troubleshooting a new mirror setup

While setting up a new web mirror, you may run into some roadblocks.

Running puppet agent -t fails after adding the mirror to Puppet:

Error: Cannot create /srv/static.torproject.org/mirrors/blog.staging.torproject.net; parent directory /srv/static.torproject.org/mirrors does not exist

This error happens when running puppet before running an initial sync on the mirror. Run screen sudo -u mirroradm static-mirror-run-all and then re-run puppet.

Running an initial sync on the new mirror fails with this error:

mirroradm@static-master-fsn.torproject.org: Permission denied (publickey).
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]

The mirror's SSH keys haven't been added to the static master yet. Run puppet agent -t on the relevant static master (in this case static-master-fsn.torproject.org).

Running an initial sync fails with this error:

Error: Could not find user mirroradm

Puppet hasn't run on the LDAP server, so ud-replicate wasn't able to open a connection to the new mirror. Run these commands on the LDAP server, and then try the sync again:

puppet agent -t
sudo -u sshdist ud-generate
sudo -H ud-replicate

SLA

This service is designed to be highly available. All web sites should keep working (maybe with some performance degradation) even if one of the hosts goes down. It should also absorb and tolerate moderate denial of service attacks.

Design

The static mirror system is built of three kinds of hosts:

- source - builds and hosts the original content (roles::static_source in Puppet)
- master - receives the contents from the source, dispatches it (atomically) to the mirrors (roles::static_master in Puppet)
- mirror - serves the contents to the user (roles::static_mirror_web in Puppet)

Content is split into different "components", which are units of content that get synchronized atomically across the different hosts. Those components are defined in a YAML file in the tor-puppet.git repository (modules/staticsync/data/common.yaml at the time of writing, but it might move to Hiera, see issue 30020 and puppet).

The GitLab service is used to maintain source code that is behind some websites in the static mirror system. GitLab CI deploys built sites to a static-shim which ultimately serves as a static source that deploys to the master and mirrors.

This diagram summarizes how those components talk to each other graphically:

Static mirrors architecture diagram

A narrative of how changes get propagated through the mirror network is detailed below.

A key advantage of that infrastructure is the higher availability it provides: whereas individual virtual machines are power-cycled for scheduled maintenance (e.g. kernel upgrades), static mirroring machines are removed from the DNS during their maintenance.

Change process

When data changes, the source is responsible for running static-update-component, which instructs the master via SSH to run static-master-update-component, transfers a new copy of the source data to the master using rsync(1) and, upon successful copy, swaps it with the current copy.

The current copy on the master is then distributed to all actual mirrors, again placing a new copy alongside their current copy using rsync(1).

Once the data has successfully made it to all mirrors, the mirrors are instructed to swap the new copy with their current copy, at which point the updated data will be served to end users.

Source code inventory

The source code of the static mirror system is spread out in different files and directories in the tor-puppet.git repository:

- modules/staticsync/data/common.yaml lists the "components"
- modules/roles/manifests/ holds the different Puppet roles:
  - roles::static_mirror - a generic mirror, see staticsync::static_mirror below
  - roles::static_mirror_web - a web mirror, including most (but not necessarily all) components defined in the YAML configuration. Configures Apache (which the above doesn't). Includes roles::static_mirror (and therefore staticsync::static_mirror)
  - roles::static_mirror_onion - configures the hidden services for the web mirrors defined above
  - roles::static_source - a generic static source, see staticsync::static_source below
  - roles::static_master - a generic static master, see staticsync::static_master below
- modules/staticsync/ is the core Puppet module holding most of the source code:
  - staticsync::static_source - a source, which:
    - exports the static user SSH key to the master, punching a hole in the firewall
    - collects the SSH keys from the master(s)
  - staticsync::static_mirror - a mirror, which does the above and:
    - deploys the static-mirror-run and static-mirror-run-all scripts (see below)
    - configures a cron job for static-mirror-run-all
    - exports a configuration snippet of /etc/static-clients.conf for the master
  - staticsync::static_master - a master, which:
    - deploys the static-master-run and static-master-update-component scripts (see below)
    - collects the static-clients.conf configuration file, which lists the hostname ($::fqdn) of each of the staticsync::static_mirror exports
    - configures the basedir (currently /srv/static.torproject.org) and user home directory (currently /home/mirroradm)
    - collects the SSH keys from sources, mirrors and other masters
    - exports the SSH key to the mirrors and sources
  - staticsync::base, included by all of the above, deploys:
    - /etc/static-components.conf: a file derived from the modules/staticsync/data/common.yaml configuration file
    - /etc/staticsync.conf: polyglot (bash and Python) configuration file propagating the base (currently /srv/static.torproject.org), masterbase (currently $base/master) and staticuser (currently mirroradm) settings
    - staticsync-ssh-wrap and static-update-component (see below)

TODO: try to figure out why we have /etc/static-components.conf and not directly the YAML file shipped to hosts, in staticsync::base. See the static-components.conf.erb Puppet template.

NOTE: the modules/staticsync/data/common.yaml was previously known as modules/roles/misc/static-components.yaml but was migrated into Hiera as part of tpo/tpa/team#30020.

Scripts walk through

static-update-component is run by the user on the source host.

If not already run under sudo as the staticuser, it uses sudo to re-execute itself as the staticuser. It then connects over SSH to the static master for that component to run static-master-update-component.

LOCKING: none, but see static-master-update-component

static-master-update-component is run on the master host

It rsync's the contents from the source host to the static master, and then triggers static-master-run to push the content to the mirrors.

The sync happens to a new <component>-updating.incoming-XXXXXX directory. On sync success, <component> is replaced with that new tree, and the static-master-run trigger happens.

LOCKING: exclusive locks are held on <component>.lock
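The "sync to a new directory, then replace on success, under an exclusive lock" pattern described above can be sketched in Python as follows. This is only an illustration, not the actual bash script: the `update_component` function and the `.old` suffix are hypothetical, and `shutil.copytree` stands in for the rsync from the source host.

```python
import fcntl
import os
import shutil
import tempfile


def update_component(base, component, source_dir):
    """Copy source_dir into base/component through a staging directory."""
    live = os.path.join(base, component)
    with open(os.path.join(base, component + ".lock"), "w") as lock:
        # Exclusive lock: only one update of this component at a time.
        fcntl.flock(lock, fcntl.LOCK_EX)
        staging = tempfile.mkdtemp(
            prefix=component + "-updating.incoming-", dir=base
        )
        try:
            # Stand-in for the rsync from the source host.
            shutil.copytree(source_dir, staging, dirs_exist_ok=True)
        except Exception:
            # Failed sync: drop the partial tree, keep the live one untouched.
            shutil.rmtree(staging, ignore_errors=True)
            raise
        # The copy succeeded: replace the live tree with the new one.
        old = None
        if os.path.exists(live):
            old = live + ".old"  # hypothetical name for the previous tree
            os.rename(live, old)
        os.rename(staging, live)
        if old is not None:
            shutil.rmtree(old)
    # The lock is released when the lock file is closed.
    return live
```

The point of the staging directory is that a failed or interrupted copy never touches the tree the mirrors sync from.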

static-master-run triggers all the mirrors for a component to initiate syncs.

When all mirrors have an up-to-date tree, they are instructed to update the cur symlink to the new tree.

To begin with, static-master-run copies <component> to <component>-current-push.

This is the tree all the mirrors then sync from. If the push was successful, <component>-current-push is renamed to <component>-current-live.

LOCKING: exclusive locks are held on <component>.lock
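The two phases described above (every mirror syncs first, and only then are all of them told to switch) can be sketched like this. It is a simplified model, assuming that a single failed sync aborts the switch; `push_component`, `sync` and `activate` are hypothetical names standing in for the SSH-triggered steps of the real script.

```python
def push_component(mirrors, sync, activate):
    """Two-phase push: stage on every mirror, then activate everywhere.

    `sync(mirror)` returns True when the mirror has an up-to-date tree;
    `activate(mirror)` tells the mirror to switch to that tree.
    """
    # Phase 1: ask every mirror to sync the new tree.
    results = [sync(mirror) for mirror in mirrors]
    if not all(results):
        # At least one mirror is behind: nobody switches, so all mirrors
        # keep serving the same (old) content.
        return False
    # Phase 2: all mirrors are ready, switch them to the new tree.
    for mirror in mirrors:
        activate(mirror)
    return True
```

Keeping the switch separate from the sync is what lets the mirrors serve a coherent version of a component during an update.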

static-mirror-run runs on a mirror and syncs components.

There is a symlink called cur that points to either tree-a or tree-b for each component. The cur tree is the one that is live; the other one usually does not exist, except when a sync is ongoing (or a previous one failed and we kept a partial tree).

During a sync, we sync to the tree-<X> that is not the live one. When instructed by static-master-run, we update the symlink and remove the old tree.

static-mirror-run rsync's either -current-push or -current-live for a component.

LOCKING: during all of static-mirror-run, we keep an exclusive lock on the <component> directory, i.e., the directory that holds tree-[ab] and cur.
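The tree-a/tree-b layout with a cur symlink can be illustrated with a short Python sketch. The function names and the temporary `cur.new` link are hypothetical, and locking is left out; the real static-mirror-run script may proceed differently.

```python
import os
import shutil


def inactive_tree(component_dir):
    """Return the tree name ('tree-a' or 'tree-b') that cur does not point to."""
    cur = os.path.join(component_dir, "cur")
    live = os.readlink(cur) if os.path.islink(cur) else None
    return "tree-b" if live == "tree-a" else "tree-a"


def switch_to(component_dir, new_tree):
    """Repoint cur at new_tree, then remove the other tree."""
    cur = os.path.join(component_dir, "cur")
    tmp = cur + ".new"  # hypothetical temporary link name
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(new_tree, tmp)
    # Renaming the temporary link over cur replaces it in a single step,
    # so readers see either the old tree or the new one.
    os.replace(tmp, cur)
    old = "tree-a" if new_tree == "tree-b" else "tree-b"
    shutil.rmtree(os.path.join(component_dir, old), ignore_errors=True)
```

A sync would write into `inactive_tree(component_dir)` and call `switch_to` only once the master gives the instruction.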

static-mirror-run-all

Run static-mirror-run for all components on this mirror, fetching the -live- tree.

LOCKING: none, but see static-mirror-run.

staticsync-ssh-wrap

wrapper for ssh job dispatching on source, master, and mirror.

LOCKING: on master, when syncing -live- trees, a shared lock is held on <component>.lock during the rsync process.

The scripts are written in bash except static-master-run, written in Python 2.

Authentication

The authentication between the static site hosts is entirely done through SSH. The source hosts are accessible by normal users, which can sudo to a "role" user which has privileges to run the static sync scripts as sync user. That user then has privileges to contact the master server which, in turn, can login to the mirrors over SSH as well.

The user's sudo configuration is therefore critical and that sudoers configuration could also be considered part of the static mirror system.

The GitLab runners have SSH access to the static-shim service infrastructure, so they can build and push websites, through a private key kept in the project, the public part of which is deployed by Puppet.

Jenkins build jobs

WARNING: Jenkins was retired in late 2021. This documentation is now irrelevant and is kept only for historical purposes. The static-shim with GitLab CI has replaced this.

Jenkins is used to build some websites and push them to the static mirror infrastructure. The Jenkins jobs get triggered from git-rw git hooks, and are (partially) defined in jenkins/tools.git and jenkins/jobs.git. Those are fed into jenkins-job-builder to build the actual job. Those jobs actually build the site with hugo or lektor and package an archive that is then fetched by the static source.

The build scripts are deployed on staticiforme, in the ~torwww home directory. Those get triggered through the ~torwww/bin/ssh-wrap program, hardcoded in /etc/ssh/userkeys/torwww, which picks the right build job based on the argument provided by the Jenkins job, for example:

    - shell: "cat incoming/output.tar.gz | ssh torwww@staticiforme.torproject.org hugo-website-{site}"

Then the wrapper eventually does something like this to update the static component on the static source:

rsync --delete -v -r "${tmpdir}/incoming/output/." "${basedir}"
static-update-component "$component"

Issues

There is no issue tracker specifically for this project. File or search for issues in the team issue tracker with the ~static-component label.

Monitoring and testing

Static site synchronisation is monitored in Nagios, using a block in nagios-master.cfg which looks like:

-
    name: mirror static sync - extra
    check: "dsa_check_staticsync!extra.torproject.org"
    hosts: global
    servicegroups: mirror

That script (actually called dsa-check-mirrorsync) makes an HTTP request to every mirror and checks the timestamp inside a "trace" file (.serial) to make sure everyone has the same copy of the site.
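The comparison at the heart of such a check can be sketched as follows. This assumes the trace file holds an integer timestamp and that the values were already fetched from each mirror; `out_of_sync` and the sample addresses are hypothetical, and the real script may apply other rules (grace periods, for example).

```python
def out_of_sync(serials):
    """Return the mirrors that lag behind the newest serial, oldest first.

    `serials` maps a mirror address to the integer read from its trace file.
    """
    newest = max(serials.values())
    stale = [(serial, mirror) for mirror, serial in serials.items() if serial < newest]
    return [mirror for _, mirror in sorted(stale)]


# Hypothetical serials, using documentation-only IP addresses.
print(out_of_sync({"192.0.2.1": 1615491539, "192.0.2.2": 1615405139}))
```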

There's also a miniature reimplementation of Nagios called mininag which runs on the DNS server. It performs health checks on the mirrors and takes them out of the DNS zonefiles if they become unavailable or have a scheduled reboot. This makes it possible to reboot a server and have the server taken out of rotation automatically.

Logs and metrics

All tor webservers keep a minimal amount of logs. The IP address and time (but not the date) are cleared (the time is logged as 00:00:00). The referrer is disabled on the client side by sending the Referrer-Policy "no-referrer" header.

The IP addresses are replaced with:

0.0.0.0 - HTTP request
0.0.0.1 - HTTPS request
0.0.0.2 - hidden service request
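As an illustration of the resulting log fields, the scrubbing policy can be modeled in a few lines of Python. This is not the Apache configuration actually in use; the `scrub` function and the request kind names are hypothetical.

```python
from datetime import datetime

# Placeholder addresses from the list above, keyed by a hypothetical request kind.
PLACEHOLDER_IPS = {"http": "0.0.0.0", "https": "0.0.0.1", "onion": "0.0.0.2"}


def scrub(kind, when):
    """Return the (address, timestamp) pair that would end up in the log."""
    # Keep the date, reset the time of day to 00:00:00.
    day = when.replace(hour=0, minute=0, second=0, microsecond=0)
    return PLACEHOLDER_IPS[kind], day


print(scrub("https", datetime(2021, 3, 11, 19, 38, 59)))
```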

Logs are kept for two weeks.

Errors may be sent by email.

Metrics are scraped by Prometheus using the "Apache" exporter.

Backups

The source hosts are backed up with Bacula without any special provision.

TODO: check if master / mirror nodes need to be backed up. Probably not?

Other documentation

Discussion

Overview

The goal of this discussion section is to consider improvements to the static site mirror system at torproject.org. It might also apply to debian.org, but the focus is currently on TPO.

The static site mirror system has been designed for hosting Debian.org content. Interestingly, it is not used for the operating system mirrors themselves, which are synchronized using another, separate system (archvsync).

The static mirror system was written for Debian.org by Peter Palfrader. It has also been patched by other DSA members (Stephen Gran and Julien Cristau both have more than 100 commits on the old code base).

This service is critical: it distributes the main torproject.org websites, but also software releases like the tor project source code and other websites.

Limitations

The maintenance status of the mirror code is unclear: while it is still in use at Debian.org, it is made of a few sets of components which are not bundled in a single package. This makes it hard to follow "upstream", although, in theory, it should be possible to follow the dsa-puppet repository. In practice, that's pretty difficult because the dsa-puppet and tor-puppet repositories have disconnected histories. Even if they would have a common ancestor, the code is spread in multiple directories, which makes it hard to track. There has been some refactoring to move most of the code in a staticsync module, but we still have files strewn over other modules.

The static site system has no unit tests, linting, release process, or CI. Code is deployed directly through Puppet, on the live servers.

There hasn't been a security audit of the system, as far as we could tell.

Python 2 porting is probably the most pressing issue in this project: the static-master-run program is written in old Python 2.4 code. Thankfully it is fairly short and should be easy to port.

The YAML configuration duplicates the YAML parsing and data structures present in Hiera (see issue 30020 and puppet).

Jenkins integration

NOTE: this section is now irrelevant, because Jenkins was retired in favor of the static-shim to GitLab CI. A new site now requires only a change in GitLab and Puppet, successfully reducing this list to 2 services and 2 repositories.

For certain sites, the static site system requires Jenkins to build websites, which further complicates deployments. A static site deployment requiring Jenkins needs updates on 5 different repositories, across 4 different services:

a new static component in the (private) tor-puppet.git repository
a build script in the jenkins/tools.git repository
a build job in the jenkins/jobs.git repository
a new entry in the ssh wrapper in the admin/static-builds.git repository
a new entry in the gitolite-admin.git repository

Goals

Must have

high availability: continue serving content even if one (or a few?) servers go down
atomicity: the deployed content must be coherent
high performance: should be able to saturate a gigabit link and withstand simple DDOS attacks

Nice to have

cache-busting: changes to a CSS or JavaScript file must be propagated to the client reasonably quickly
possibly host Debian and RPM package repositories

Non-Goals

implement our own global content distribution network

Approvals required

Should be approved by TPA.

Proposed Solution

The static mirror system certainly has its merits: it's flexible, powerful and provides a reasonably easy to deploy, high availability service, at the cost of some level of obscurity, complexity, and high disk space requirements.

Cost

Staff, mostly. We expect a reduction in cost if we reduce the number of copies of the sites we have to keep around.

Alternatives considered

GitLab pages could be used as a source?
the cache system could be used as a replacement in the front-end

TODO: benchmark gitlab pages vs (say) apache or nginx.

GitLab pages replacement

It should be possible to replace parts or the entirety of the system progressively, however. A few ideas:

the mirror hosts could be replaced by the cache system. this would possibly require shifting the web service from the mirror to the master or at least some significant re-architecture
the source hosts could be replaced by some parts of the GitLab Pages system. unfortunately, that system relies on a custom webserver, but it might be possible to bypass that and directly access the on-disk files provided by the CI.

The architecture would look something like this:

Static system redesign architecture diagram

Details of the GitLab pages design and installation are available in our GitLab documentation.

Concerns about this approach:

GitLab pages is a custom webserver which issues TLS certs for the custom domains and serves the content, it's unclear how reliable or performant that server is
The pages design assumes the existence of a shared filesystem to deploy content, currently NFS, but they are switching to S3 (as explained above), which introduces significant complexity and moves away from the classic "everything is a file" approach
The new design also introduces a dependency on the main GitLab rails API for availability, which could be a concern, especially since that is usually a "non-free" feature (e.g. PostgreSQL replication and failover, Database load-balancing, traffic load balancer, Geo disaster recovery and, generally, all of Geo and most availability components are non-free).
In general, this increases dependency on GitLab for deployments

Next steps (OBSOLETE, see next section):

[ ] check if the GitLab Pages subsystem provides atomic updates
[x] see how GitLab Pages can be distributed to multiple hosts and how scalable it actually is or if we'll need to run the cache frontend in front of it. update: it can, but with significant caveats in terms of complexity, see above
[ ] setup GitLab pages to test with small, non-critical websites (e.g. API documentation, etc)
[ ] test the GitLab pages API-based configuration and see how it handles outages of the main rails API
[ ] test the object storage system and see if it is usable, debuggable, highly available and performant enough for our needs
[ ] keep track of upstream development of the GitLab pages architecture, see this comment from anarcat outlining some of those concerns

GitLab pages and Minio replacement

The above approach doesn't scale easily: the old GitLab pages implementation relied on NFS to share files between the main server and the GitLab pages server, so it was hard to deploy and scale.

The newer implementation relies on "object storage" (ie. S3) for content, and pings the main GitLab rails app for configuration.

In this comment of the related architecture update, it was acknowledged that "the transition from NFS to API seems like something that eventually will reduce the availability of Pages" but:

it is not that simple because how Pages discovers configuration has impact on availability too. In environments operating in a high scale, NFS is actually a bottleneck, something that reduces the overall availability, and this is certainly true at GitLab. Moving to API allows us to simplify Pages <-> GitLab communication and optimize it beyond what would be possible with modeling communication using NFS.

[...] But requests to GitLab API are also cached so GitLab Pages can survive a short outage of GitLab API. Cache expiration policy is currently hard-coded in the codebase, but once we address issue #281 we might be able to make it configurable for users running their GitLab on-premises too. This can help with reducing the dependency on the GitLab API.

Object storage (typically implemented with minio) is itself scalable and highly available, including Active-Active replicas. Object storage could also be used for other artifacts like Docker images, packages, and so on.

That design would take an approach similar to the above, but possibly discarding the cache system in favor of GitLab pages as caching frontends. In that sense:

the mirror hosts could be replaced by the GitLab pages and Minio
the source hosts could be replaced by some parts of the GitLab Pages system. unfortunately, that system relies on a custom webserver, but it might be possible to bypass that and directly access the on-disk files provided by the CI.
there would be no master intermediate service

The architecture would look something like this:

Static system redesign with Minio architecture diagram

This would deprecate the entire static-component architecture, which would eventually be completely retired.

The next step is to figure out a plan for this. We could start by testing custom domains (see tpo/tpa/team#42197 for that request) in a limited way, to see how it behaves and if we're liking it. We would need to see how it interacts with torproject.org domains and there's automation we could do there. We would also need to scale GitLab first (tpo/tpa/team#40479) and possibly wait for the "webserver/website" stages of the Tails merge (TPA-RFC-73) before moving ahead.

This could look something like this:

merge websites/web servers with Tails (tpo/tpa/team#41947)
make an inventory of all static components and evaluate how they could migrate to GitLab pages
limited custom domains tests (tpo/tpa/team#42197)
figure out how to create/manage torproject.org custom domains
scale gitlab (tpo/tpa/team#40479)
scale gitlab pages for HA across multiple points of presence
migrate test sites (e.g. status.tpo)
migrate prod sites progressively
retire static-components system

This implies a migration of all static sites into GitLab CI, by the way. Many sites are currently hand-crafted through shell commands, so that would need collaboration between multiple teams. dist.tpo might be particularly challenging, but has been due for a refactoring for a while anyways.

Replacing Jenkins with GitLab CI as a builder

NOTE: See also the Jenkins documentation and ticket 40364 for more information on the discussion on the different options that were considered on that front.

We have settled for the "SSH shim" design, which is documented in the static-shim page.

This is the original architecture design as it was before the migration:

Static mirrors architecture diagram
