title: TPA-RFC-56: large file storage
costs: included
approval: TPA
affected users:
deadline: One week from 2023-07-03, that is 2023-07-10.
status: standard
discussion: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40478
[[TOC]]
Summary: set up a new, 1TiB SSD-backed object storage service in the gnt-dal cluster using MinIO. Also includes an in-depth discussion of alternatives and of storage expansion costs in gnt-dal, which could give us an extra 20TiB of storage for 1800$USD.
Background
We've had multiple incidents with servers running out of disk space in the past. This RFC aims at collecting a summary of those issues and proposing a solution that should cover most of them.
Those are the issues that were raised in the past with servers running out of disk space:
- GitLab; #40475 (closed), #40615 (closed), #41139: "gitlab-02 running out of disk space". CI artifacts and non-linear growth events.
- GitLab CI; #40431 (closed): "ci-runner-01 invalid ubuntu package signatures"; gitlab#95 (closed): "Occasionally clean-up Gitlab CI storage". Non-linear, possibly explosive and unpredictable growth. Cache sharing issues between runners. Somewhat under control now that we have more runners, but the current aggressive cache purging degrades performance.
- Backups; #40477 (closed): "backup failure: disk full on bungei". Was non-linear, mostly due to archive-01 but also GitLab. A workaround good for ~8 months (from October 2021, so until June 2022) was deployed and usage seems stable since September 2022.
- Metrics; #40442 (closed): "meronense running out of disk space". Linear growth. The current allocation (512GB) seems sufficient for a few more years; conversion to a new storage backend is planned (see below).
- Collector; #40535 (closed): "colchicifolium disk full". Linear growth, about 200GB used per year, 1TB allocated in June 2023, therefore possibly good for 5 years.
- Archives; #40779 (closed): "archive-01 running out of disk space". Added 2TB in May 2022; seems to be using about 500GB per year, good for 2-3 more years.
- Legacy Git; #40778 (closed): "vineale out of disk space", May 2022. Negligible (64GB), scheduled for retirement (see TPA-RFC-36).
There are also design and performance issues that are relevant in this discussion:
- Ganeti virtual machine storage. A full reboot of all nodes in the cluster takes hours, because all machines need to be migrated between the nodes (which is fine) and do not migrate back to their original pattern (which is not). Improvements have been made to the migration algorithm, but it could also be fixed by changing storage away from DRBD to another storage backend like Ceph.
- Large file storage. We were asked where to put large VM images (3x8GB), and we answered "git(lab) LFS" with the intention of moving to object storage if we run out of space on the main VM, see #40767 (closed) for the discussion. We were also asked to host a container registry in tpo/tpa/gitlab#89.
- Metrics database. tpo/network-health/metrics/collector#40012 (closed): "Come up with a plan to make past descriptors etc. easier available and queryable (giant database)" (in onionoo/collector storage). This is currently being rebuilt as a Victoria Metrics server (tpo/tpa/team#41130).
- Collector storage. #40650 (closed): "colchicifolium backups are barely functional". Backups take days to complete; a possible solution is to "Move collector storage from file based to object storage" (tpo/network-health/metrics/collector#40023 (closed), currently on hold).
- GitLab scalability. GitLab needs to be scaled up for performance reasons as well, which primarily involves splitting it in multiple machines, see #40479 for that discussion. It's partly in scope of this discussion in the sense that a solution chosen here should be compatible with GitLab's design.
Much of the above and this RFC come from the brainstorm established in issue tpo/tpa/team#40478.
Storage usage analysis
According to Grafana, TPA manages over 60TiB of storage with a capacity of over 160TiB, which includes 60TiB of un-allocated space on LVM volume groups.
About 40TiB of storage is used by the backup storage server and 7TiB by the archive servers, which puts our normal disk usage at less than 15TiB spread over a little over 60 virtual machines.
Top 10 largest disk consumers are:
- Backups: 41TiB
- archive-01: 6TiB
- Tor Browser builders: 4TiB
- metrics: 3.6TiB
- mirrors: ~948GiB total, ~100-200GiB each mirror/source
- people.torproject.org: 743GiB
- GitLab: 700GiB (350GiB for main instance, 90GiB per runner)
- Gitolite & GitWeb: 175GiB
- Prometheus: 150GiB
- BTCPayserver: 125GiB
The remaining servers all individually use less than 100GiB and are negligible compared to the above mastodons.
The above is important because it shows we do not have that much storage to handle: all of the above could probably fit in a couple of 8TiB hard drives (HDD) that cost less than 300$ a piece. The question is, of course, how to offer good and reliable performance for that data, and for that, HDDs don't quite cut it.
Ganeti clusters capacity
The two Ganeti clusters have vastly different specifications and capacities.
The new, high-performance gnt-dal cluster has limited disk space: 22TiB total with 9TiB in use, including 5TiB of unused NVMe storage.
The older gnt-fsn cluster has more than double that capacity, at 48TiB with 19TiB in use, but ~40TiB of that is made of hard disk drives. The remaining 7TiB of NVMe storage is more than 50% used, at 4TiB.
So we do have good capacity for fast storage on the new cluster, and also good archive capacity on the older cluster.
Proposal
Create a virtual machine to test MinIO as an object storage backend, called minio-01.torproject.org. The VM will deploy MinIO using podman on Debian bookworm and will have about 1TB of disk space allocated on the new gnt-dal cluster.
We'll start by using the SSD (vg_ganeti, the default) volume group, but may provision an extra NVMe volume if MinIO allows it (and if we need lower-latency buckets). We may need to provision extra SSDs to cover the additional storage needs.
The first user of this service will be the GitLab registry, which will be enabled with the object storage server as its storage backend, with the understanding that the registry may become unavailable if the object storage system fails.
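On the GitLab Omnibus side, that would look something like the following sketch; the bucket name, endpoint and credentials are placeholders to be replaced at deployment time:

```
# Hypothetical sketch: point the Omnibus container registry at a MinIO
# bucket ("gitlab-registry" and the credentials are placeholders).
cat >> /etc/gitlab/gitlab.rb <<'EOF'
registry['storage'] = {
  's3' => {
    'accesskey' => 'REGISTRY_ACCESS_KEY',
    'secretkey' => 'REGISTRY_SECRET_KEY',
    'bucket' => 'gitlab-registry',
    'region' => 'us-east-1',
    'regionendpoint' => 'https://minio-01.torproject.org:9000'
  }
}
EOF
gitlab-ctl reconfigure
```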
Backups will be done using our normal backup procedures which might mean inconsistent backups. An alternative would be to periodically export a snapshot of the object storage to the storage server or locally, but this means duplicating the entire object storage pool.
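One possible shape for such an export, sketched with the mc client (alias, bucket and target path are placeholders), would be to mirror each bucket to a plain directory that the regular Bacula run then picks up:

```
# Hypothetical sketch: dump a bucket to a local directory before the nightly
# Bacula job runs; "local" is an mc alias pointing at the MinIO server.
mc mirror --overwrite --remove \
    local/gitlab-registry /srv/backups/minio/gitlab-registry
```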
If this experiment is successful, GitLab runners will start using the object storage server as a cache, using a separate bucket.
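For reference, a runner could be pointed at such a cache when it is (re-)registered; the sketch below assumes a shared "runner-cache" bucket and placeholder credentials:

```
# Hypothetical sketch: register a runner with a shared S3-compatible cache
# backed by MinIO (bucket name and credentials are placeholders).
gitlab-runner register \
    --non-interactive \
    --url https://gitlab.torproject.org/ \
    --registration-token "$REGISTRATION_TOKEN" \
    --executor docker \
    --docker-image debian:bookworm \
    --cache-type s3 \
    --cache-shared \
    --cache-s3-server-address minio-01.torproject.org:9000 \
    --cache-s3-bucket-name runner-cache \
    --cache-s3-access-key RUNNER_ACCESS_KEY \
    --cache-s3-secret-key RUNNER_SECRET_KEY
```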
More and more services will be migrated to object storage as time goes on and the service is seen as reliable. The full list of services is out of scope of this proposal, but we're thinking of migrating first:
- job artifacts and logs
- backups
- LFS objects
- everything else
Each service should be set up with its own bucket for isolation, where possible. Bucket-level encryption will be enabled, if possible.
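For illustration, creating and encrypting the per-service buckets with the mc client could look like the sketch below (alias, bucket names and encryption mode are assumptions to validate during the test; SSE-S3, in particular, requires a KMS to be configured on the server):

```
# Hypothetical sketch: one bucket per service, with server-side encryption
# enabled on each ("local" is an mc alias for the MinIO server).
mc alias set local https://minio-01.torproject.org:9000 ROOT_USER ROOT_PASSWORD
for bucket in gitlab-registry runner-cache gitlab-artifacts gitlab-lfs; do
    mc mb "local/$bucket"
    mc encrypt set sse-s3 "local/$bucket"
done
```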
Eventually, TPA may be able to offer this service outside the team, if other teams express an interest.
We do not consider this a permanent commitment to MinIO. Because the object storage protocol is relatively standard, it's typically "easy" to transfer between two clusters, even if they have different backends. The catch is, of course, the "weight" of the data, which needs to be duplicated to migrate between two solutions. But it should still be possible thanks to bucket replication or even just plain and simple tools like rclone.
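For example, a migration with rclone could be as simple as the following sketch (remote names are placeholders that would be defined in rclone's configuration):

```
# Hypothetical sketch: copy one bucket from the old cluster to the new one.
# "oldminio" and "newminio" would be S3-type remotes in rclone.conf with the
# respective endpoints and credentials.
rclone sync --progress --checksum \
    oldminio:gitlab-registry newminio:gitlab-registry
```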
Alternatives considered
The above is proposed following a lengthy evaluation of different alternatives, detailed below.
It should be noted, however, that TPA previously brainstormed this in a meeting, where we said:
We considered the following technologies for the broader problem:
- S3 object storage for gitlab
- ceph block storage for ganeti
- filesystem snapshots for gitlab / metrics servers backups
We'll look at setting up a VM with MinIO for testing. We could first test the service with the CI runners image/cache storage backends, which can easily be rebuilt/migrated if we want to drop that test.
This would disregard the block storage problem, but we could pretend this would be solved at the service level eventually (e.g. redesign the metrics storage, split up the gitlab server). Anyways, migrating away from DRBD to Ceph is a major undertaking that would require a lot of work. It would also be part of the larger "trusted high performance cluster" work that we recently de-prioritized.
This is partly why MinIO was picked over the other alternatives (mainly Ceph and Garage).
Ceph
Ceph is (according to Wikipedia) a "software-defined storage platform that provides object storage, block storage, and file storage built on a common distributed cluster foundation. Ceph provides completely distributed operation without a single point of failure and scalability to the exabyte level, and is freely available."
It's kind of a beast. It's written in C++ and Python and is packaged in Debian. It provides a lot of features we are looking for here:
- redundancy ("a la" DRBD)
- load-balancing (read/write to multiple servers)
- far-ranging object storage compatibility
- native Ganeti integration with an iSCSI backend
- Puppet module
- Grafana and Prometheus dashboards, both packaged in Debian
More features:
- block device snapshots and mirroring
- erasure coding
- self-healing
- used at CERN, OVH, and Digital Ocean
- yearly release cycle with two-year support lifetime
- cache tiering (e.g. use SSDs as caches)
- also provides a networked filesystem (CephFS) with an optional NFS frontend
Downsides:
- complexity: at least 3-4 daemons to manage a cluster, although this might be easier to live with thanks to the Debian packages
- high hardware requirements (quad-core, 64-128GB RAM, 10gbps), although their minimum requirements are actually quite attainable
Rejected because of its complexity. If we do reconsider our use of DRBD, we might reconsider Ceph again, as we would then be able to run a single storage cluster for all nodes. But it also feels a little dangerous to give the object storage service access to the same cluster that holds our block storage, so that's actually a reason against Ceph.
Scalability promises
CERN started with a 3PB Ceph deployment around 2015. It seems it's still in use:
... although, as you can see, it's not exactly clear to me how much data is managed by ceph. they seem to have a good experience with Ceph in any case, with three active committers, and they say it's a "great community", which is certainly a plus.
On the other hand, managing lots of data is part of their core mission, in a sense, so they can probably afford putting more people on the problem than we can.
Complexity and other concerns
GitLab tried to move from the cloud to bare metal. Issue 727 and issue #1 track their attempt to migrate to Ceph which failed. They moved back to the cloud. A choice quote from this deployment issue:
While it's true that we lean towards PostgreSQL, our usage of CephFS was not for the database server, but for the git repositories. In the end we abandoned our usage of CephFS for shared storage and reverted back to a sharded NFS design.
Jeff Atwood also described his experience, presumably from StackOverflow's attempts:
We had disastrous experiences with Ceph and Gluster on bare metal. I think this says more about the immaturity (and difficulty) of distributed file systems than the cloud per se.
This was a Hacker News comment in response to the first article from GitLab.com above, which ended up being correct as GitLab went back to the cloud.
One key thing to keep in mind is that GitLab were looking for an NFS replacement, but we don't use NFS anywhere right now (thank god) so that is not a requirement for us. So those issues might be less of a problem, as the above "horror stories" might not be the same with other storage mechanisms. Indeed, there's a big difference between using Ceph as a filesystem (ie. CephFS) and an object storage (RadosGW) or block storage (RBD), which might be better targets for us.
In particular, we could use Ceph as a block device -- for Ganeti instance disks, which Ganeti has good support for -- or object storage -- for GitLab's "things", which it is now also designed for. And indeed, "NFS" (ie. real filesystem) is now (14.x?) deprecated in GitLab, so shared data storage is expected to go through S3-like "object storage" APIs from here on.
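As an illustration of that block storage path, Ganeti ships an rbd disk template, so instances could live directly on Ceph block devices (a sketch only, assuming a RADOS pool is already configured for the cluster and the template is enabled):

```
# Hypothetical sketch: create a test instance backed by Ceph RBD instead of
# DRBD; OS variant, sizes and name are placeholders.
gnt-instance add \
    -t rbd \
    -o debootstrap+default \
    --disk 0:size=20G \
    -B memory=2G,vcpus=2 \
    test-rbd-01.torproject.org
```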
Some more Ceph war stories:
- A Ceph war story - a major outage and recovery due to XFS and firmware problems
- File systems unfit as distributed storage backends: lessons from ten years of Ceph evolution - how Ceph migrated from normal filesystem backends to their own native block device store ("BlueStore"), an approach also used by recent MinIO versions
Garage
Garage is another alternative, written in Rust. They provide a Docker image and binaries. It is not packaged in Debian.
It was written from scratch by a French association called deuxfleurs.fr. The first release was funded by an NLnet grant, which was renewed for a year in May 2023.
Features:
- apparently faster than MinIO on higher-latency links (100ms+)
- Prometheus monitoring (see metrics list) and Grafana dashboard
- regular releases with actual release numbers, although not yet 1.0 (current is 0.8.2, released 4 months ago as of June 2023, apparently stable enough for production, "Improvements to the recovery behavior and the layout algorithm are planned before v1.0 can come out")
- read-after-write consistency (stronger than Amazon S3's eventual consistency)
- support for asynchronous replicas (so-called "dangerous" mode that returns to the client as soon as the local write finishes), see the replication mode for details
- static website hosting
Missing and downsides:
- possibly slower (-10%) than MinIO in raw bandwidth and IOPS, according to this self-benchmark
- purposefully no erasure coding, which implies full data duplication across nodes
- designed for smaller, "home lab" distributed setups, might not be our target
- no built-in authentication system, no support for S3 policies or ACLs
- non-goals also include "extreme performance" and features above the S3 API
- uses a CRDT and Dynamo instead of Raft, see this discussion for tradeoffs and the design page
- no live migration; the upgrade procedure currently implies short downtimes
- backups require live filesystem snapshots or shutdown, example backup script
- no bucket versioning
- no object locking
- no server-side encryption, they argue for client-side encryption, full disk encryption, and transport encryption instead in their encryption section
- no HTTPS support out of the box, can be easily fixed with a proxy
See also their comparison with other software including MinIO. A lot of the information in this section was gleaned from this Hacker News discussion and this other one.
Garage was seriously considered for adoption, especially with our multi-site, heterogeneous environment.
That said, it didn't seem quite mature enough: the lack of bucket encryption, in particular, feels like a deal-breaker. We do not accept the theory that server-side encryption is useless, on the contrary: there have been many cases of S3 buckets being leaked because of botched access policies, something that might very well happen to us as well. Adding bucket encryption adds another layer of protection on top of our existing transport (TLS) and at-rest (LUKS) encryption; the latter, in particular, doesn't address the "leaked bucket" attack vector.
The backup story is also not much better than MinIO's; a clearly better story there could have been a deal-breaker giving Garage a win. Unfortunately, Garage also doesn't keep its own filesystem clean, although it might be cleaner than MinIO: the developers indicate filesystem snapshots can provide a consistent copy, something that's not offered by MinIO.
Still, we might reconsider Garage if we do need a more distributed, high-availability setup. This is currently not part of the GitLab SLA so not a strong enough requirement to move forward with a less popular alternative.
MinIO
MinIO seems to be suggested/shipped by GitLab Omnibus these days. It is not packaged in Debian, so container deployment is probably the only reasonable solution, but watch out for network overhead. It has no conventional release numbers and an unclear support policy. It is written in Golang.
Features:
- active-active replication, although it has strict latency (<20ms) and packet loss (<0.01%) requirements, and requires a load balancer for HA
- asynchronous replication, can survive replicas going down (data gets cached and resynced after)
- bucket replication
- erasure coding
- rolling upgrades with "a few seconds" downtime (presumably compensated by client-side retries)
- object versioning, immutability
- Prometheus and InfluxDB monitoring, also includes bucket event notifications
- audit logs
- external identity providers: LDAP, OIDC (Keycloak specifically)
- object server-side encryption through external Key Management Services (e.g. Hashicorp Vault)
- built-in TLS support
- recommended hardware setups although probably very expensive
- self-diagnostics and hardware tests
- lifecycle management
- FTP/SFTP/FTPS support
- has detailed instructions for Linux, MacOS, Windows, Kubernetes and Docker/Podman
Missing and downsides:
- only two-node replication
- possible licensing issues (see below)
- upgrades and pool expansions require all servers to restart at once
- cannot resize existing server pools, in other words, a resize means building a new larger server and retiring the old one (!) (note that this only affects multi-node pools, for single-node "test" setups, storage can be scaled from the underlying filesystem transparently)
- very high hardware requirements (4 nodes with each 32 cores, 128GB RAM, 8 drives, 25-100GbE for 2-4k clients)
- backups need to be done through bucket replication or site replication, difficult to backup using our normal backup systems
- some "open core", features are hidden behind a paywall even in the free version, for example profiling, health diagnostics and performance tests
- docker version is limited to setting up a "Single-Node Single-Drive MinIO server onto Docker or Podman for early development and evaluation of MinIO Object Storage and its S3-compatible API layer"
- that simpler setup, in turn, seems less supported for production and has lots of warnings around risk of data loss
- no cache tiering (can't use SSD as a cache for HDDs...)
- other limitations
Licensing dispute
MinIO are involved in a licensing dispute with commercial storage providers (Weka and Nutanix) because the latter used MinIO in their products without giving attribution. See also this hacker news discussion.
It should also be noted that they switched to the AGPL relatively recently.
This is not seen as a deal-breaker in using MinIO for TPA.
First run
The quickstart guide is easy enough to follow to get us started, for example:
PASSWORD=$(tr -dc '[:alnum:]' < /dev/urandom | head -c 32)
mkdir -p ~/minio/data
podman run \
-p 9000:9000 \
-p 9090:9090 \
-v ~/minio/data:/data \
-e "MINIO_ROOT_USER=root" \
-e "MINIO_ROOT_PASSWORD=$PASSWORD" \
quay.io/minio/minio server /data --console-address ":9090"
... will start with an admin interface on https://localhost:9090 and the API on https://localhost:9000 (even though the console messages will say otherwise).
You can use the web interface to create the buckets, or the mc client which is also available as a Docker container.
We tested this procedure and it seemed simple enough; it didn't even require creating a configuration file.
OpenIO
The OpenIO project was mentioned in one of the GitLab threads. The main website (https://www.openio.io/) seems down (SSL_ERROR_NO_CYPHER_OVERLAP) but some information can be gleaned from the documentation site.
It is not packaged in Debian.
Features:
- Object Storage (S3)
- OpenStack Swift support
- minimal hardware requirements (1 CPU, 512MB RAM, 1 NIC, 4GB storage)
- no need to pre-plan cluster size
- dynamic load-balancing
- multi-tenant
- progressive offloading to avoid rebalancing
- lifecycle management, versioning, snapshots
- no single point of failure
- geo-redundancy
- metadata indexing
Downsides, missing features:
- partial S3 implementation, notably missing:
- encryption? the above S3 compatibility page says it's incompatible, but this page says it is implemented, unclear
- website hosting
- bucket policy
- bucket replication
- bucket notifications
- a lot of "open core" features ("part of our paid plans", which is difficult to figure out because said plans are not visible in latest Firefox because of aforementioned "SSL" issue)
- OpenIO FS
- IAM
- design seems awfully complicated
- requires disabling apparmor (!?)
- supported OS page clearly out of date or not supporting stable Debian releases
- no release in almost a year (as of 2023-06-28, last release is from August 2022)
Not seriously considered because of missing bucket encryption, the weird apparmor limitation, the "open core" business model, the broken website, and the long time without releases.
SeaweedFS
"SeaweedFS is a fast distributed storage system for blobs, objects, files, and data lake, for billions of files!" according to their GitHub page. Not packaged in Debian, written in Golang.
Features:
- Blob store has O(1) disk seek, cloud tiering
- cross-DC active-active replication
- Kubernetes
- POSIX FUSE mount
- S3 API
- S3 Gateway
- Hadoop
- WebDAV
- encryption
- Erasure Coding
- optimized for small files
Not considered because of focus on small files.
Kubernetes
In Kubernetes, storage is typically managed by some sort of operator that provides volumes to the otherwise stateless "pods" (collections of containers). Those operators, in turn, are designed to offer large storage capacity that scales automatically. Here are two possible options:
- https://longhorn.io/ - Kubernetes volumes, native-only, no legacy support?
- https://rook.io/ - Ceph operator
Those were not evaluated any further. Kubernetes itself is quite a beast and seems overkill to fix the immediate problem at hand, although it could be interesting to manage our growing fleet of containers eventually.
Other ideas
Those are other, outside-the-box ideas that were also rejected.
Throw hardware at it
One solution to the aforementioned problem is to "just throw hardware at it", that is, scaling up our hardware resources to match the storage requirements, without any redesign.
We believe this is impractical because some of the storage consumers grow non-linearly, which makes it hard to match that expansion on generic infrastructure.
By picking a separate system for large file storage, we are able to isolate this problem in a separate service which makes it easier to scale.
To give a concrete example, we could throw another terabyte or two at the main GitLab server, but that wouldn't solve the problems the metrics team is suffering from. It would also not help the storage problem the GitLab runners are having, as they wouldn't be able to share a cache, which is something that can be solved with a shared object storage cache.
Storage Area Network (SAN)
We could go with a SAN, home-grown or commercial, but we would rather avoid proprietary solutions, which means we'd have to build our own, and it's not clear how we would do that (ZFS replication, maybe?). That would also only solve the Ganeti storage problems: we'd still need an S3 storage, but we could use something like MinIO for that specifically.
Upstream provider
According to this, one of our upstream providers has terabytes of storage where we could run a VM to have a secondary storage server for Bacula. This requires more trust in them than we'd like to extend for now, but it could be considered later.
Backup-specific solutions
We could fix the backup problems by ditching Bacula and switching to something like borg. We'd need an offsite server to "pull" the backups, however (because borg is push-based, which means a compromised server can trash its own backups). We could build this with ZFS/BTRFS replication, for example.
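To illustrate the pull idea under those assumptions (hostnames, pools and snapshot names below are made up), the offsite copy could be driven like this:

```
# Hypothetical sketch of pull-style replication: the offsite server fetches an
# incremental ZFS snapshot from the backup host, so a compromised client (or
# backup host) cannot rewrite the offsite history.
ssh backup-host.torproject.org \
    zfs send -i tank/backups@2023-07-01 tank/backups@2023-07-02 \
    | zfs receive -F tank/backups-mirror
```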
Another caveat with borg is that restores are kind of slow. Bacula seems to be really fast at restores, at least in our experience restoring websites in issue #40501 (closed).
This is considered out of scope for this proposal and kept for future evaluation.
Costs
Probably less, in the long term, than keeping all storage distributed.
Extra storage requirements could be fulfilled by ordering new SSDs. The current model is the Intel® SSD D3-S4510 Series, which goes for around 210$USD at Newegg or 180$USD at Amazon. Therefore, expanding the fleet with 6 of those drives would gain us 11.5TB (6 × 1.92TB, or 10.4TiB, 5.2TiB after RAID) at a cost of about 1200$USD before tax. With a cold spare, it goes up to around 1400$USD.
Alternatively, we could add higher capacity drives. 3.84TB drives are getting cheaper (per byte) than 1.92TB drives. For example, at the time of writing, there's an Intel D3-S4510 3.84TB drive for sale at 255$USD at Amazon. Expanding with 6 such drives would give us an extra 23TB (3.84TB × 6, or 20.9TiB, 10.5TiB after RAID) of storage at a cost of about 1530$USD, or 1800$USD with a spare.
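A quick sanity check of the arithmetic above (vendors quote decimal TB, we report binary TiB, and the RAID setup is assumed to halve usable capacity, as the figures suggest):

```
# Drive capacity: decimal TB to binary TiB.
echo '6 * 1.92 * 10^12 / 2^40' | bc -l   # ~10.48 TiB raw, ~5.2 TiB after RAID
echo '6 * 3.84 * 10^12 / 2^40' | bc -l   # ~20.96 TiB raw, ~10.5 TiB after RAID
# Cost of the 3.84TB option, using the Amazon price quoted above:
echo '6 * 255' | bc                      # 1530 USD; 7 drives = 1785 USD with a spare
```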