Continuous Integration (CI) is the system that allows tests to be run and packages to be built automatically when new code is pushed to the version control system (currently Git).

Note that the CI system is implemented with GitLab, which has its own documentation. This page, however, documents the aspects of GitLab CI specific to TPA.

This service was set up as a replacement for the previous CI system, Jenkins, whose documentation is kept for historical purposes.

[[TOC]]

Tutorial

GitLab CI has good documentation upstream. This section documents frequent questions we might get about the work.

Getting started

The GitLab CI quickstart should get you started. Note that there are "shared runners" you can already use, which should be available to all projects, so your main task is basically to write a .gitlab-ci.yml file.
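
For illustration, a minimal .gitlab-ci.yml could look like this (the job name, image and commands are placeholders to adapt to your project):

test:
  image: containers.torproject.org/tpo/tpa/base-images/debian:bookworm
  script:
    - apt-get update && apt-get install -y build-essential
    - make check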

How-to

Why is my CI job not running?

There might be too many jobs in the queue. You can monitor the queue in our Grafana dashboard.

Enabling/disabling runners

If a runner is misbehaving, it might be worth "pausing" it while we investigate, so that jobs don't all fail on that runner. For this, head for the runner admin interface and hit the "pause" button on the runner.

Registering your own runner

While we already have shared runners, in some cases it can be useful to set up a personal runner in your own infrastructure. This can be useful to experiment with a runner with a specialized configuration, or to supplement the capacity of TPA's shared runners.

Setting up a personal runner is fairly easy. Gitlab's runners poll the gitlab instance rather than vice versa, so there is generally no need to deal with firewall rules, NAT traversal, etc. The runner will only run jobs for your project. In general, a personal runner set up on your development machine can work well.

For this you need to first install a runner and register it in GitLab.

You will probably want to configure your runner to use a Docker executor, which is what TPA's runners are. For this you will also need to install Docker engine.

Example (after installing gitlab-runner and docker):

# Get your project's registration token. See
# https://docs.gitlab.com/runner/register/
REGISTRATION_TOKEN="mytoken"

# Get the tags that your project uses for their jobs.
# Generally you can get these by inspecting `.gitlab-ci.yml`
# or inspecting past jobs in the gitlab UI.
# See also
# https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/ci#runner-tags
TAG_LIST="amd64"

# Example runner setup with a basic configuration.
# See `gitlab-runner register --help` for more options.
sudo gitlab-runner register \
  --non-interactive \
  --url=https://gitlab.torproject.org/ \
  --registration-token="$REGISTRATION_TOKEN" \
  --executor=docker \
  --tag-list="$TAG_LIST" \
  --docker-image=ubuntu:latest

# Start the runner
sudo service gitlab-runner start

Converting a Jenkins job

See static-shim for how to migrate jobs from Jenkins.

Finding largest volumes users

See Runner disk fills up.

Running a job locally

It used to be possible to run pipelines locally using gitlab-runner exec, but this was deprecated a while ago and the feature has been removed from the latest versions of the runner.

According to the GitLab issue tracker, the feature is being redesigned to be more complete, as the above method had important limitations.

An alternative that's reported to be working reasonably well is the 3rd-party gitlab-ci-local project.

Build Docker images with kaniko

It is possible to build Docker images in our GitLab CI without requiring user namespace support by using kaniko. The GitLab documentation has examples to get started with that task. There are some caveats at the moment, though:

  1. One needs to pass --force to kaniko's executor or use a different workaround due to a bug in kaniko
  2. Pushing images to the Docker hub is not working out of the box. One rather needs to use the v1 endpoint at the moment due to a bug. Right now passing something like

--destination "index.docker.io/gktpo/${CI_REGISTRY_IMAGE}:oldstable"

to kaniko's executor does the trick for me.

Additionally, as we want to build our images reproducibly, passing --reproducible to the executor is recommended as well.

One final note: the Gitlab CI examples show that a debug image is used as a base image in Gitlab CI. That is important as the non-debug flavor does not come with a shell which is a requirement for Gitlab CI.
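
Putting these caveats together, a minimal sketch of a kaniko build job follows. It assumes the project's container registry is enabled and used as the destination (the job name, stage and tag are illustrative; adapt the destination as discussed above when pushing to Docker Hub):

build-image:
  stage: build
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  script:
    - /kaniko/executor
      --context "$CI_PROJECT_DIR"
      --dockerfile "$CI_PROJECT_DIR/Dockerfile"
      --reproducible
      --force
      --destination "$CI_REGISTRY_IMAGE:latest"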

This work came out of issue #90 which may have more background information or alternative implementations. In particular, it documents attempts at building containers with buildah and Docker.

TPA-maintained images

Consider using the TPA-maintained images for your CI jobs, in cases where there is one that suits your needs. e.g. consider setting image to something like containers.torproject.org/tpo/tpa/base-images/debian:bookworm instead of just debian:bookworm.
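
For example, at the top of .gitlab-ci.yml:

image: containers.torproject.org/tpo/tpa/base-images/debian:bookworm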

In contrast, "bare" image names like debian:bookworm implicitly pull from the runner's default container registry, which is currently dockerhub. This can be problematic due to dockerhub applying rate-limiting, causing some image-pull requests to fail. Using the TPA-maintained images instead both avoids image-pull failures for your own job, and reduces the CI runner's request-load on dockerhub, thus reducing the incidence of such failures for other jobs that do still pull from there (e.g. for images for which there aren't TPA-maintained alternatives).

FAQ

  • do runners have network access? yes, but that might eventually change
  • how to build from multiple git repositories? install git and clone the extra repositories. using git submodules might work around possible future network access restrictions
  • how do I trust runners? you can set up your own runner for your own project in the GitLab app, but in any case you need to trust the GitLab app. we are considering options for this, see security
  • how do I control the image used by the runners? the docker image is specified in the .gitlab-ci.yml file, but through Docker image policies, it might be possible for specific runners to be restricted to specific, controlled, Docker images.
  • do we provide, build, or host our own Docker images? partially: TPA maintains some base images (see TPA-maintained images above) and images can be built with kaniko (see above). ideally, we would never use images straight from hub.docker.com and would build our own ecosystem of images, built FROM scratch or from debootstrap

Finding a runner

Runners are registered with the GitLab rails app under a given code name. Say you're running a job on "#356 (bkQZPa1B) TPA-managed runner groups, includes ci-runner-x86-02 and ci-runner-x86-03, maybe more". That code name (bkQZPa1B) should be present in the runner, in /etc/gitlab-runner/config.toml:

root@ci-runner-x86-02:~# grep bkQZPa1B /etc/gitlab-runner/config.toml
token = "glrt-t1_bkQZPa1Bf5GxtcyTQrbL"

Conversely, if you're on a VM and are wondering which runner is associated with that configuration, look at a substring of the token variable, specifically the first 8 characters following the underscore.
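
For example, a quick (and admittedly fragile) way to extract that code name from a runner's configuration, assuming the current glrt-t1_ token format:

grep 'token = "glrt' /etc/gitlab-runner/config.toml | sed -E 's/.*_(.{8}).*/\1/'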

Also note that multiple runners, on different machines, can be registered with the same token.

Pager playbook

A runner fails all jobs

Pause the runner.

Jobs pile up

If too many jobs pile up in the queue, consider inspecting which jobs those are in the job admin interface. Jobs can be canceled there by GitLab admins. For really long jobs, consider talking with the project maintainers to see how those jobs can be optimized.

Runner disk fills up

If you see a warning like:

DISK WARNING - free space: /srv 6483 MB (11% inode=82%):

It's because the runner is taking up all the disk space. This is usually containers, images, or caches from the runner. Those are normally purged regularly but some extra load on the CI system might use up too much space all of a sudden.

To diagnose this issue better, you can see the running containers with (as the gitlab-runner user):

podman ps

... and include stopped or dead containers with:

podman ps -a

Images are visible with:

podman images

And volumes with:

podman volume ls

... although that output is often not very informative because GitLab runner uses volumes to cache data and uses opaque volume names.

If there are any obvious offenders, they can be removed with podman rm (for containers), podman image rm (for images) and podman volume rm (for volumes), or the equivalent docker commands on Docker-based runners. But usually, you should probably just run the cleanup jobs by hand, in order:

podman system prune --filter until=72h

The time frame can be lowered for a more aggressive cleanup. Volumes can be cleaned with:

podman system prune --volumes

And images can be cleaned with:

podman system prune --force --all --filter until=72h

Those commands mostly come from the profile::podman::cleanup class, which might have other commands already. Other cleanup commands are also set in profile::gitlab::runner::docker.

The tpa-du-gl-volumes script can also be used to analyse which project is using the most disk space:

tpa-du-gl-volumes ~gitlab-runner/.local/share/containers/storage/volumes/*

Then those pipelines can be adjusted to cache less.

Disk full on GitLab server

Similar to the above, but typically happens on the GitLab server. Documented in the GitLab documentation, see Disk full on GitLab server.

DNS resolution failures

Under certain circumstances (upgrades?) Docker loses DNS resolution (and possibly all of networking?). A symptom is that it simply fails to clone the repository at the start of the job, for example:

fatal: unable to access 'https://gitlab-ci-token:[MASKED]@gitlab.torproject.org/tpo/network-health/sbws.git/': Could not resolve host: gitlab.torproject.org

A workaround is to reboot the runner's virtual machine. It might be that we need to do some more configuration of Docker, see upstream issue 6644, although it's unclear why this problem is happening right now. Still to be more fully investigated, see tpo/tpa/gitlab#93.

"unadvertised object" error

If a project's pipeline fails to clone submodules with this error:

Updating/initializing submodules recursively with git depth set to 1...
Submodule 'lego' (https://git.torproject.org/project/web/lego.git) registered for path 'lego'
Cloning into '/builds/tpo/web/tpo/lego'...
error: Server does not allow request for unadvertised object 0d9efebbaec064730fba8438dda2d666585247a0
Fetched in submodule path 'lego', but it did not contain 0d9efebbaec064730fba8438dda2d666585247a0. Direct fetching of that commit failed.

that is because the depth configuration is too shallow. In the above, we see:

Updating/initializing submodules recursively with git depth set to 1...

In this case, the submodule is being cloned with only the latest commit attached. If the project refers to a previous version of that submodule, this will fail.

To fix this, change the Git shallow clone value to a higher one. The default is 50, but you can set it to zero or empty to disable shallow clones. See also "Limit the number of changes fetched during clone" in the upstream documentation.
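
Alternatively, the clone depth can be set per-repository in .gitlab-ci.yml, roughly like this:

variables:
  GIT_DEPTH: 0                       # disable shallow clones entirely
  GIT_SUBMODULE_STRATEGY: recursive  # clone submodules recursively, as in the log above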

gitlab-runner package upgrade

See howto/upgrades#gitlab-runner-upgrades.

CI templates checks failing on 403

If the test job in the ci-templates project fails with:

ERROR: failed to call API endpoint: 403 Client Error: Forbidden for url: https://gitlab.torproject.org/api/v4/projects/1156/ci/lint, is the token valid?

It's probably because the access token used by the job expired. To fix this:

  1. go to the project's access tokens page

  2. select Add new token and make a token with the following parameters:

    • name: tpo/tpa/ci-templates#17
    • expiration: "cleared" (will never expire)
    • role: Maintainer
    • scope: api
  3. copy the secret and paste it in the CI/CD "Variables" section, in the GITLAB_PRIVATE_TOKEN variable

See the gitlab-ci.yml templates section for a discussion.

Job failed because the runner picked an i386 image

Some jobs may fail to run due to tpo/tpa/team#41656 even though the CI configuration didn't request i386 and would instead be expected to run with an amd64 image. This issue is tracked in tpo/tpa/team#41621.

The workaround is to configure jobs to pull an architecture-specific version of the image instead of one using a multi-arch manifest. For Docker Official Images, this can be done by prefixing with amd64/; e.g. amd64/debian:stable instead of debian:stable. See GitHub's "Architectures other than amd64".

When trying to check what arch the current container is built for, uname -m doesn't work, since that gives the arch of the host kernel, which can still be amd64 inside of an i386 container. You can instead use dpkg --print-architecture (for debian-based images), or apk --print-arch (for alpine-based images).
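
Putting this together, a minimal sketch of a job using the workaround (the job name and final command are illustrative):

test:
  image: amd64/debian:stable
  script:
    - dpkg --print-architecture   # should print amd64, not i386
    - ./run-tests.sh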

Disaster recovery

Runners should be disposable: if a runner is destroyed, at most the jobs it is currently running will be lost. Otherwise artifacts should be present on the GitLab server, so to recover a runner is as "simple" as creating a new one.

Reference

Installation

Since GitLab CI is basically GitLab with external runners hooked up to it, this section documents how to install and register runners into GitLab.

Docker on Debian

A first runner (ci-runner-01) was set up by Puppet in the gnt-chi cluster, using this command:

gnt-instance add \
      -o debootstrap+buster \
      -t drbd --no-wait-for-sync \
      --net 0:ip=pool,network=gnt-chi-01 \
      --no-ip-check \
      --no-name-check \
      --disk 0:size=10G \
      --disk 1:size=2G,name=swap \
      --disk 2:size=60G \
      --backend-parameters memory=64g,vcpus=8 \
      ci-runner-01.torproject.org

The role::gitlab::runner Puppet class deploys the GitLab runner code and hooks it into GitLab. It uses the gitlab_ci_runner module from Voxpupuli to avoid reinventing the wheel. But before enabling it on the instance, the following operations need to be performed:

  1. setup the large partition in /srv, and bind-mount it to cover for Docker:

    mkfs -t ext4 -j /dev/sdc1
    echo "UUID=$(blkid /dev/sdc1 -s PARTUUID -o value)  /srv    ext4    defaults    1   2" >> /etc/fstab
    echo "/srv/docker   /var/lib/docker none    bind    0   0" >> /etc/fstab
    mount /srv
    mount /var/lib/docker
    
  2. disable module loading:

    touch /etc/no_modules_disabled
    reboot
    

    ... otherwise the Docker package will fail to install because it will try to load extra kernel modules.

  3. the default gitlab::runner role deploys a single docker runner on the host. For group- or project-specific runners which need special parameters (eg. for Docker), a new role may be created to pass those to the profile::gitlab::runner class using Hiera. See hiera/roles/gitlab::runner::shadow.yaml for an example.

  4. ONLY THEN the Puppet agent may run to configure the executor, install gitlab-runner and register it with GitLab.

NOTE: we originally used the Debian packages (docker.io and gitlab-runner) instead of the official upstream packages, because the latter have a somewhat messed up installer and weird key deployment policies. In other words, we would rather avoid having to trust the upstream packages for runners, even though we use them for the GitLab omnibus install. The Debian packages are both somewhat out of date, and gitlab-runner was not available in Debian buster (the stable release at the time), so it had to be installed from bullseye.

UPDATE: the above turned out to fail during the bullseye freeze (2021-04-27), as gitlab-runner was removed from bullseye because of an unpatched security issue. We have switched to the upstream packages, since they are used for GitLab itself anyway; this is unfortunate, but will have to do for now.

We also avoided using the puppetlabs/docker module because we "only" need to setup Docker, and not specifically deal with containers, volumes and so on right now. All that is (currently) handled by GitLab runner.

IMPORTANT: when installing a new runner, it is likely to run into rate limiting if it is put into the main rotation immediately. Either slowly add it to the pool by not allowing it to "run untagged jobs" or pre-fetch them from a list generated on another runner.

Podman on Debian

A Podman runner was configured to see if we could work around limitations in image building (currently requiring Kaniko) and avoid possible issues with Docker itself, specifically those intermittent failures.

The machine was built with less disk space than ci-runner-x86-01 (above), but more or less the same specifications, see this ticket for details on the installation.

After installation, the following steps were taken:

  1. setup the large partition in /srv, and bind-mount it to cover for GitLab Runner's home which includes the Podman images:

    mkfs -t ext4 -j /dev/sda
    echo "/dev/sda  /srv    ext4    defaults    1   2" >> /etc/fstab
    echo "/srv/gitlab-runner    /home/gitlab-runner none    bind    0   0" >> /etc/fstab
    mount /srv
    mount /home/gitlab-runner
    
  2. disable module loading:

    touch /etc/no_modules_disabled
    reboot
    

    ... otherwise Podman will fail to load extra kernel modules. There is a post-startup hook in Puppet that runs a container to load at least part of the module stack, but some jobs failed to start with failed to create bridge "cni-podman0": could not add "cni-podman0": operation not supported (linux_set.go:105:0s).

  3. add the role::gitlab::runner class to the node in Puppet

  4. add the following blob in tor-puppet.git's hiera/nodes/ci-runner-x86-02.torproject.org.yaml:

    profile::user_namespaces::enabled: true
    profile::gitlab::runner::docker::backend: "podman"
    profile::gitlab::runner::defaults:
      executor: 'docker'
      run_untagged: false
      docker_host: "unix:///run/user/999/podman/podman.sock"
      docker_tlsverify: false
      docker_image: "quay.io/podman/stable"
    
  5. run Puppet to deploy gitlab-runner, podman

  6. reboot to get the user session started correctly

  7. run a test job on the host

The last step, specifically, was done by removing all tags from the runner (those were tpa, linux, amd64, kvm, x86_64, x86-64, 16 CPU, 94.30 GiB, debug-terminal, docker), adding a podman tag, and unchecking the "run untagged jobs" checkbox in the UI.

Note that this is currently in testing, see issue 41296 and TPA-RFC-58.

IMPORTANT: when installing a new runner, it is likely to run into rate limiting if it is put into the main rotation immediately. Either slowly add it to the pool by not allowing it to "run untagged jobs" or pre-fetch them from a list generated on another runner.

MacOS/Windows

A special machine (currently chi-node-13) was built to allow builds to run on MacOS and Windows virtual machines. The machine was installed in the Cymru cluster (so following new-machine-cymru). On top of that procedure, the following extra steps were taken on the machine:

  1. a bridge (br0) was setup
  2. a basic libvirt configuration was built in Puppet (within roles::gitlab::ci::foreign)

The gitlab-ci-admin role user and group have access to the machine.

TODO: The remaining procedure still needs to be implemented and documented, here, and eventually converted into a Puppet manifest, see issue 40095. @ahf document how MacOS/Windows images are created and runners are setup. don't hesitate to create separate headings for Windows vs MacOS and for image creation vs runner setup.

Pre-seeding container images

New runners can be pre-seeded by fetching images from a list generated on an existing runner.

Here's how to generate a list of images from an existing runner:

docker images --format "{{.Repository}}:{{.Tag}}" | sort -u | grep -v -e '<none>' -e registry.gitlab.com > images

Note that we skipped untagged images (<none>) and runner-specific images (from registry.gitlab.com). The latter might match more images than needed but it was just a quick hack. The actual image we are ignoring is registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper.

Then that images file can be copied on another host and then read to pull all images at once:

while read image ; do
    if podman images --format "{{.Repository}}:{{.Tag}}" | grep -q -F -x "$image" ; then
        echo "$image already present"
    else
        while ! podman pull "$image"; do
            printf "failed to pull image, sleeping 240 seconds, now is: "; date
            sleep 240
        done
    fi
done < images

This will probably run into rate limiting, but should gently retry once it hits it to match the 100 queries / 6h (one query every 216 seconds, technically) rate limit.

Distributed cache

In order to increase the efficiency of the GitLab CI caching mechanism, job caches configured via the cache: key in .gitlab-ci.yml are uploaded to object storage at the end of jobs, in the gitlab-ci-runner-cache bucket. This means that it doesn't matter on which runner a job is run, it will always get the latest copy of its cache.
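
As a reminder, a typical per-job cache definition looks roughly like this (the job name, key and paths are illustrative examples, here caching pip downloads):

test:
  cache:
    key: "$CI_COMMIT_REF_SLUG"
    paths:
      - .cache/pip
  variables:
    PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"
  script:
    - pip install -r requirements.txt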

This feature is enabled via the runner instance configuration located in /etc/gitlab-runner/config.toml, and is also configured on the OSUOSL-hosted runners.

More details about caching in GitLab CI can be found here: https://docs.gitlab.com/ee/ci/caching/

SLA

The GitLab CI service is offered on a "best effort" basis and might not be fully available.

Design

The CI service was served by Jenkins until the end of the 2021 roadmap. This section documents how the new GitLab CI service is built. See Jenkins section below for more information about the old Jenkins service.

GitLab CI architecture

GitLab CI sits somewhat outside of the main GitLab architecture, in that it is not featured prominently in the GitLab architecture documentation. In practice, it is a core component of GitLab in that the continuous integration and deployment features of GitLab have become a key feature and selling point for the project.

GitLab CI works by scheduling "pipelines" which are made of one or many "jobs", defined in a project's git repository (the .gitlab-ci.yml file). Those jobs then get picked up by one of many "runners". Those runners are separate processes, usually running on a different host than the main GitLab server.

GitLab runner is a program written in Go which clocks in at about 800,000 SLOC including vendored dependencies, or 80,000 SLOC without.

Runners regularly poll the central GitLab for jobs and execute those inside an "executor". We currently support only "Docker" as an executor but are working on different ones, like a custom "podman" (for more trusted runners, see below) or KVM executor (for foreign platforms like MacOS or Windows).

What the runner effectively does is basically this:

  1. it fetches the git repository of the project
  2. it runs a sequence of shell commands on the project inside the executor (e.g. inside a Docker container) with specific environment variables populated from the project's settings
  3. it collects artifacts and logs and uploads those back to the main GitLab server

The jobs are therefore affected by the .gitlab-ci.yml file but also the configuration of each project. It's a simple yet powerful design.

Types of runners

There are three types of runners:

  • shared: "shared" across all projects, they will pick up any job from any project
  • group: those are restricted to run jobs only within a specific group
  • project: those will only run jobs within a specific project

In addition, jobs can be targeted at specific runners by assigning them a "tag".

Runner tags

Whether a runner will pick up a job depends on a few things, notably its type (see above) and its tags.

We currently use the following tags:

  • architecture:
    • amd64: popular 64-bit Intel/AMD architecture (equivalents: x86_64 and x86-64)
    • aarch64: the 64-bit ARM extension (equivalents: arm64 and arm64-v8a)
    • i386: 32-bit Intel/AMD architecture (equivalent: x86)
    • ppc64le: IBM Power architecture
    • s390x: Linux on IBM Z architecture
  • OS: linux is usually implicit, but other tags might eventually be added for other operating systems
  • executor type: docker, KVM, etc. docker are the typical runners; KVM runners are possibly more powerful and can, for example, run Docker-inside-Docker (DinD). note that docker can also mean a podman runner, which is tagged podman on top of docker, as a feature
  • memory size: 64GB, 32GB, 4GB, etc.
  • hosting provider:
    • tpa: runners managed by the sysadmin team
    • fdroid: provided as a courtesy by the F-Droid project
    • osuosl: runners provided by the OSUOSL
  • features:
    • privileged: those containers have actual root access and should explicitly be able to run "Docker in Docker"
    • debug-terminal: supports interactively debugging jobs
    • large: have access to 100% of system memory via /dev/shm, but only one such job may run at a time on a given runner
    • verylarge: same as large, with sysctl tweaks to allow high numbers of processes (runners with >1TB memory)
    • podman: a docker executor which talks to the podman socket instead of Docker; might be better suited to build container images
  • runner name: for debugging purposes only! allows pipelines to target a specific runner; do not use, as runners can come and go without prior warning

Use tags in your configuration only if your job can be fulfilled by only some of those runners. For example, only specify a memory tag if your job requires a lot of memory.

If your job requires the amd64 architecture, specifying this tag by itself is redundant because only runners with this architecture are configured to run untagged jobs. Jobs without any tags will only run on amd64 runners.
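
For example, a job that genuinely needs one of the bigger runners would be tagged like this (the job name and command are illustrative):

simulation:
  tags:
    - large
  script:
    - ./run-simulation.sh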

Upstream release schedules

GitLab CI is an integral part of GitLab itself and gets released along with the core releases. GitLab runner is a separate software project but usually gets released alongside GitLab.

Security

We do not currently trust GitLab runners for security purposes: at most, we trust them to correctly report errors in test suites, but we do not trust them with compiling and publishing artifacts, so they have a low value in our trust chain.

This might change: we may eventually want to build artifacts (e.g. tarballs, binaries, Docker images!) through GitLab CI and even deploy code, at which point GitLab runners could actually become important "trust anchors" with a smaller attack surface than the entire GitLab infrastructure.

The tag-, group-, and project-based allocation of runners is based on a secret token handled on the GitLab server. It is technically possible for an attacker to compromise the GitLab server and access a runner, which makes those restrictions depend on the security of the GitLab server as a whole. Thankfully, the permission model of runners now actually reflects the permissions in GitLab itself, so there are some constraints in place.

Inversely, if a runner's token is leaked, it could be used to impersonate the runner and "steal" jobs from projects. Normally, runners do not leak their own token, but this could happen through, for example, a virtualization or container escape.

Runners currently have full network access: this could be abused by a hostile contributor to use the runner as a starting point for scanning or attacking other entities on the network, and even outside our network. We might eventually want to firewall runners to prevent them from accessing certain network resources, but that is currently not implemented.

The runner documentation has a section on security which this section is based on.

We are considering a tiered approach to container configuration and access to limit the impact of those security issues.

Image, volume and container storage and caching

GitLab runner creates quite a few containers, volumes and images in the course of its regular work. Those tend to pile up, unless they get cleaned. Upstream suggests a fairly naive shell script to do this cleanup, but it has a number of issues:

  1. it is noisy (we tried to patch this locally with this MR, but it was refused upstream)
  2. it might be too aggressive

Also note that documentation on this inside GitLab runner is inconsistent at best, see this other MR and this issue.

So we're not using the upstream cleanup script, and we suspect upstream itself is not using it at all (i.e. on gitlab.com) because it's fundamentally ineffective.

Instead, we have a set of cron jobs (in profile::gitlab::runner::docker) which do the following:

  1. clear all volumes and dead containers, daily (equivalent of the upstream clear-docker-cache for volumes, basically)
  2. clear images older than 30 days, daily (unless used by a running container)
  3. clear all dangling (ie. untagged) images, daily
  4. clear all "nightly" images, daily

Note that this documentation might be out of date and the Puppet code should be considered authoritative on this policy, as we've frequently had to tweak this to deal with out of disk issues.

rootless containers

We are testing podman for running containers more securely: because it can run containers "rootless" (without running as root on the host), it is generally thought to offer better protection against container escapes.

This could also possibly make it easier to build containers inside GitLab CI, which would otherwise require docker-in-docker (DinD), unsupported by upstream. See those GitLab instructions for details.

Current services

GitLab CI, at TPO, currently runs the following services:

  • continuous integration: mostly testing after commit
  • static website building and deployment
  • shadow simulations, large and small

This is currently used by many teams and is a critical service.

Possible services

It could eventually also run those services:

  • web page hosting through GitLab pages or the existing static site system. this is a requirement to replace Jenkins
  • continuous deployment: applications and services could be deployed directly from GitLab CI/CD, for example through a Kubernetes cluster or just with plain Docker
  • artifact publication: tarballs, binaries and Docker images could be built by GitLab runners and published on the GitLab server (or elsewhere). this is a requirement to replace Jenkins

gitlab-ci.yml templates

TPA offers a set of CI template files that can be used for tasks common to multiple projects. They are currently mostly used to build websites and deploy them to the static mirror system, but could be expanded for other things.
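
Projects typically pull those templates in from their own .gitlab-ci.yml with an include block, roughly like this (the template file name here is hypothetical; check the ci-templates repository for the actual names):

include:
  - project: tpo/tpa/ci-templates
    file: /static-shim.yml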

Each template is validated through CI itself when changes are proposed. This is done through a Python script shipped inside the repository which assumes the GITLAB_PRIVATE_TOKEN variable contains a valid access token with privileges (specifically Maintainer role with api scope).

That access token is currently a project-level access token that needs to be renewed yearly, see tpo/tpa/ci-templates#17 for an incident where that expired. Ideally, the ephemeral CI_JOB_TOKEN should be usable for this, see upstream gitlab-org/gitlab#438781 for that proposal.

Docker Hub mirror

To work around issues with Docker Hub's pull rate limit (eg. #40335, #42245), we deployed a container registry that acts as a read-only pull-through proxy cache (#42181), effectively serving as a mirror of Docker Hub. All our Docker GitLab Runners are automatically configured to transparently pull from the mirror when trying to fetch container images from the docker.io namespace.

The service is available at https://dockerhub-mirror.torproject.org (initially deployed at dockerhub-mirror-01.torproject.org) but only Docker GitLab Runners managed by TPA are allowed to connect.

The service is managed via the role::registry_mirror role and profile::registry_mirror profile and deploys:

  • an Nginx frontend with a Let's Encrypt TLS certificate that listens on the public addresses and acts as a reverse-proxy to the backend,
  • a registry mirror backend that is provided by the docker-registry package in Debian, and
  • configuration for storing all registry data (i.e. image metadata and layers) in the MinIO object storage.

The registry mirror expires the cache after 7 days, by default, and periodically removes old content to save disk space.
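
For reference, the pull-through cache part of a Docker registry configuration looks roughly like this (all values are illustrative placeholders; the authoritative configuration lives in the Puppet profile mentioned above):

proxy:
  remoteurl: https://registry-1.docker.io   # upstream Docker Hub
storage:
  s3:
    regionendpoint: https://<minio-endpoint>   # MinIO object storage endpoint (placeholder)
    bucket: dockerhub-mirror                   # bucket name (placeholder)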

Issues

File or search for issues in our GitLab issue tracker with the ~CI label. Upstream has of course an issue tracker for GitLab runner and a project page.

Known upstream issues

  • job log files (job.log) do not get automatically purged, even if their related artifacts get purged (see upstream feature request 17245).

  • the web interface might not correctly count disk usage of objects related to a project (upstream issue 228681) and certainly doesn't count container images or volumes in disk usage

  • kept artifacts cannot be unkept

  • GitLab doesn't track wait times for jobs; we approximate this by tracking queue size and with runner-specific metrics like concurrency limit hits

  • Runners in a virtualised environment such as Ganeti are unable to run i386 container images for an unknown reason, this is being tracked in issue tpo/tpa/team#41656

Monitoring and metrics

CI metrics are aggregated in the GitLab CI Overview Grafana dashboard. It features multiple exporter sources:

  1. the GitLab rails exporter which gives us the queue size
  2. the GitLab runner exporters, which show how many jobs are running in parallel (see the upstream documentation)
  3. a home-made exporter that queries the GitLab database to extract queue wait times
  4. and finally the node exporter to show memory usage, load and disk usage

Note that not all runners registered on GitLab are directly managed by TPA, so they might not show up in our dashboards.

Tests

To test a runner, it can be registered with only a single project and used to run non-critical jobs. See the installation section for details on the setup.

Logs and metrics

GitLab runners send logs to syslog and systemd. They contain minimal private information: the most I could find were Git repository and Docker image URLs, which do contain usernames. Those end up in /var/log/daemon.log, which gets rotated daily, with a one-week retention.

Backups

This service requires no backups: all configuration should be performed by Puppet and/or documented in this wiki page. A lost runner should be rebuilt from scratch, as per disaster recovery.

Other documentation

Discussion

Tor previously used Jenkins to run tests, builds and various automated jobs. This discussion was about if and how to replace it with GitLab CI. This was done, and GitLab CI is now the preferred CI tool.

Overview

Ever since the GitLab migration, we have discussed the possibility of replacing Jenkins with GitLab CI, or at least using GitLab CI in some way.

Tor currently utilizes a mixture of different CI systems to ensure some form of quality assurance as part of the software development process:

  • Jenkins (provided by TPA)
  • Gitlab CI (currently Docker builders kindly provided by the FDroid project via Hans from The Guardian Project)
  • Travis CI (used by some of our projects such as tpo/core/tor.git for Linux and MacOS builds)
  • Appveyor (used by tpo/core/tor.git for Windows builds)

By the end of 2020 however, pricing changes at Travis CI made it difficult for the network team to continue running the Mac OS builds there. Furthermore, it was felt that Appveyor was too slow to be useful for builds, so it was proposed (issue 40095) to create a pair of bare metal machines to run those builds, through a libvirt architecture. This is an exception to TPA-RFC 7: tools which was formally proposed in TPA-RFC-8.

Goals

In general, the idea here is to evaluate GitLab CI as a unified platform to replace Travis and Appveyor in the short term, but also, in the longer term, Jenkins itself.

Must have

  • automated configuration: setting up new builders should be done through Puppet
  • the above requires excellent documentation of the setup procedure in the development stages, so that TPA can transform that into a working Puppet manifest
  • Linux, Windows, Mac OS support
  • x86-64 architecture ("64-bit version of the x86 instruction set", AKA x64, AMD64, Intel 64, what most people use on their computers)
  • Travis replacement
  • autonomy: users should be able to set up new builds without intervention from the service (or system!) administrators
  • clean environments: each build should run in a clean VM

Nice to have

  • fast: the runners should be fast (as in: powerful CPUs, good disks, lots of RAM to cache filesystems, CoW disks) and impose little overhead above running the code natively (as in: no emulation)
  • ARM64 architecture
  • Apple M-1 support
  • Jenkins replacement
  • Appveyor replacement
  • BSD support (FreeBSD, OpenBSD, and NetBSD in that order)

Non-Goals

  • in the short term, we don't aim at doing "Continuous Deployment". this is one of the possible goals of the GitLab CI deployment, but it is considered out of scope for now. see also the LDAP proposed solutions section

Approvals required

TPA's approval is required for the libvirt exception, see TPA-RFC-8.

Proposed Solution

The original proposal from @ahf was as follows:

[...] Reserve two (ideally) "fast" Debian-based machines on TPO infrastructure to build the following:

  • Run Gitlab CI runners via KVM (initially with focus on Windows x86-64 and macOS x86-64). This will replace the need for Travis CI and Appveyor. This should allow both the network team, application team, and anti-censorship team to test software on these platforms (either by building in the VMs or by fetching cross-compiled binaries on the hosts via the Gitlab CI pipeline feature). Since none(?) of our engineering staff are working full-time on MacOS and Windows, we rely quite a bit on this for QA.
  • Run Gitlab CI runners via KVM for the BSD's. Same argument as above, but is much less urgent.
  • Spare capacity (once we have measured it) can be used as a generic Gitlab CI Docker runner in addition to the FDroid builders.
  • The faster the CPU the faster the builds.
  • Lots of RAM allows us to do things such as having CoW filesystems in memory for the ephemeral builders and should speed up builds due to faster I/O.

All this would be implemented through a GitLab custom executor using libvirt (see this example implementation).

This is an excerpt from the proposal sent to TPA:

[TPA would] build two (bare metal) machines (in the Cymru cluster) to manage those runners. The machines would grant the GitLab runner (and also @ahf) access to the libvirt environment (through a role user).

ahf would be responsible for creating the base image and deploying the first machine, documenting every step of the way in the TPA wiki. The second machine would be built with Puppet, using those instructions, so that the first machine can be rebuilt or replaced. Once the second machine is built, the first machine should be destroyed and rebuilt, unless we are absolutely confident the machines are identical.

Cost

The machines used were donated, but that is still a "hardware opportunity cost" that is currently undefined.

Staff costs, naturally, should be counted. It is estimated the initial runner setup should take less than two weeks.

Alternatives considered

Ganeti

Ganeti has been considered as an orchestration/deployment platform for the runners, but there is no known integration between GitLab CI runners and Ganeti.

If we find the time or an existing implementation, this would still be a nice improvement.

SSH/shell executors

This works by using an existing machine as a place to run the jobs. The problem is that it doesn't run jobs in a clean environment, so it's not a good fit.

Parallels/VirtualBox

Note: couldn't figure out what the difference is between Parallels and VirtualBox, nor if it matters.

Obviously, VirtualBox could be used to run Windows (and possibly MacOS?) images (and maybe BSDs?) but unfortunately, Oracle has made a mess of VirtualBox which keeps it out of Debian, so this could be a problematic deployment as well.

Docker

Support in Debian has improved, but is still hit-and-miss. No support for Windows or MacOS, as far as I know, so not a complete solution, but it could be used for Linux runners.

Docker machine

This was abandoned upstream and is considered irrelevant.

Kubernetes

@anarcat has been thinking about setting up a Kubernetes cluster for GitLab. There are high hopes that it will help us not only with GitLab CI, but also the "CD" (Continuous Deployment) side of things. This approach was briefly discussed in the LDAP audit, but basically the idea would be to replace the "SSH + role user" approach we currently use for services with GitLab CI.

As explained in the goals section above, this is currently out of scope, but could be considered instead of Docker for runners.

Jenkins

See the Jenkins replacement discussion for more details about that alternative.