title: TPA-RFC-56: large file storage
costs: included
approval: TPA
affected users:
deadline: One week from 2023-07-03, that is 2023-07-10.
status: standard
discussion: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40478
[[TOC]]
Summary: set up a new, 1TiB SSD-backed object storage service in the gnt-dal cluster using MinIO. Also includes an in-depth discussion of alternatives and of storage expansion costs in gnt-dal, which could give us an extra 20TiB of storage for 1800$USD.
Background
We've had multiple incidents with servers running out of disk space in the past. This RFC aims at collecting a summary of those issues and proposing a solution that should cover most of them.
Those are the issues that were raised in the past with servers running out of disk space:
- GitLab; #40475 (closed), #40615 (closed), #41139: "gitlab-02 running out of disk space". CI artifacts and non-linear growth events.
- GitLab CI; #40431 (closed): "ci-runner-01 invalid ubuntu package signatures"; gitlab#95 (closed): "Occasionally clean-up Gitlab CI storage". Non-linear, possibly explosive and unpredictable growth. Cache sharing issues between runners. Somewhat under control now that we have more runners, but the current aggressive cache purging degrades performance.
- Backups; #40477 (closed): "backup failure: disk full on bungei". Was non-linear, mostly due to archive-01 but also GitLab. A workaround good for ~8 months (from October 2021, so until June 2022) was deployed and usage seems stable since September 2022.
- Metrics; #40442 (closed): "meronense running out of disk space". Linear growth. The current allocation (512GB) seems sufficient for a few more years; conversion to a new storage backend is planned (see below).
- Collector; #40535 (closed): "colchicifolium disk full". Linear growth, about 200GB used per year, 1TB allocated in June 2023, therefore possibly good for 5 years.
- Archives; #40779 (closed): "archive-01 running out of disk space". Added 2TB in May 2022; seems to be using about 500GB per year, good for 2-3 more years.
- Legacy Git; #40778 (closed): "vineale out of disk space", May 2022. Negligible (64GB), scheduled for retirement (see TPA-RFC-36).
There are also design and performance issues that are relevant in this discussion:
- Ganeti virtual machine storage. A full reboot of all nodes in the cluster takes hours, because all machines need to be migrated between the nodes (which is fine) and do not migrate back to their original pattern (which is not). Improvements have been made to the migration algorithm, but it could also be fixed by changing storage away from DRBD to another storage backend like Ceph.
- Large file storage. We were asked where to put large VM images (3x8GB), and we answered "git(lab) LFS" with the intention of moving to object storage if we run out of space on the main VM, see #40767 (closed) for the discussion. We were also asked to host a container registry in tpo/tpa/gitlab#89.
- Metrics database. tpo/network-health/metrics/collector#40012 (closed): "Come up with a plan to make past descriptors etc. easier available and queryable (giant database)" (in onionoo/collector storage). This is currently being rebuilt as a Victoria Metrics server (tpo/tpa/team#41130).
- Collector storage. #40650 (closed): "colchicifolium backups are barely functional". Backups take days to complete; a possible solution is to "Move collector storage from file based to object storage" (tpo/network-health/metrics/collector#40023 (closed), currently on hold).
- GitLab scalability. GitLab needs to be scaled up for performance reasons as well, which primarily involves splitting it in multiple machines, see #40479 for that discussion. It's partly in scope of this discussion in the sense that a solution chosen here should be compatible with GitLab's design.
Much of the above and this RFC come from the brainstorm established in issue tpo/tpa/team#40478.
Storage usage analysis
According to Grafana, TPA manages over 60TiB of storage with a capacity of over 160TiB, which includes 60TiB of un-allocated space on LVM volume groups.
About 40TiB of storage is used by the backup storage server and 7TiB by the archive servers, which puts our normal disk usage at less than 15TiB spread over a little over 60 virtual machines.
Top 10 largest disk consumers are:
- Backups: 41TiB
- archive-01: 6TiB
- Tor Browser builders: 4TiB
- metrics: 3.6TiB
- mirrors: ~948GiB total, ~100-200GiB each mirror/source
- people.torproject.org: 743GiB
- GitLab: 700GiB (350GiB for main instance, 90GiB per runner)
- Gitolite & GitWeb: 175GiB
- Prometheus: 150GiB
- BTCPayserver: 125GiB
The remaining servers all individually use less than 100GiB and are negligible compared to the above mastodons.
The above is important because it shows we do not have that much storage to handle: all of the above could probably fit in a couple of 8TiB hard drives (HDD) that cost less than 300$ a piece. The question is, of course, how to offer good and reliable performance for that data, and for that, HDDs don't quite cut it.
Ganeti clusters capacity
The two Ganeti clusters have vastly different specifications and capacities.
The new, high-performance gnt-dal cluster has limited disk space: 22TiB total with 9TiB in use, including 5TiB of unused NVMe storage.
The older gnt-fsn cluster has more than double that capacity, at 48TiB with 19TiB in use, but ~40TiB of that is made of hard disk drives. The remaining 7TiB of NVMe storage is more than 50% used, at 4TiB.
So we do have good capacity for fast storage on the new cluster, and also good archive capacity on the older cluster.
Proposal
Create a virtual machine to test MinIO as an object storage backend, called minio-01.torproject.org. The VM will deploy MinIO using podman on Debian bookworm and will have about 1TB of disk space allocated on the new gnt-dal cluster.
We'll start by using the SSD (vg_ganeti, the default) volume group, but may provision an extra NVMe volume if MinIO allows it (and if we need lower-latency buckets). We may need to provision extra SSDs to cover the additional storage needs.
The first user of this service will be the GitLab registry, which will be enabled with the object storage server as its storage backend, with the understanding that the registry may become unavailable if the object storage system fails.
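On the GitLab Omnibus side, that would look something like the following sketch; the bucket name, endpoint and credentials are placeholders to be replaced at deployment time:

```
# Hypothetical sketch: point the Omnibus container registry at a MinIO
# bucket ("gitlab-registry" and the credentials are placeholders).
cat >> /etc/gitlab/gitlab.rb <<'EOF'
registry['storage'] = {
  's3' => {
    'accesskey' => 'REGISTRY_ACCESS_KEY',
    'secretkey' => 'REGISTRY_SECRET_KEY',
    'bucket' => 'gitlab-registry',
    'region' => 'us-east-1',
    'regionendpoint' => 'https://minio-01.torproject.org:9000'
  }
}
EOF
gitlab-ctl reconfigure
```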
Backups will be done using our normal backup procedures which might mean inconsistent backups. An alternative would be to periodically export a snapshot of the object storage to the storage server or locally, but this means duplicating the entire object storage pool.
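One possible shape for such an export, sketched with the mc client (alias, bucket and target path are placeholders), would be to mirror each bucket to a plain directory that the regular Bacula run then picks up:

```
# Hypothetical sketch: dump a bucket to a local directory before the nightly
# Bacula job runs; "local" is an mc alias pointing at the MinIO server.
mc mirror --overwrite --remove \
    local/gitlab-registry /srv/backups/minio/gitlab-registry
```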
If this experiment is successful, GitLab runners will start using the object storage server as a cache, using a separate bucket.
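For reference, a runner could be pointed at such a cache when it is (re-)registered; the sketch below assumes a shared "runner-cache" bucket and placeholder credentials:

```
# Hypothetical sketch: register a runner with a shared S3-compatible cache
# backed by MinIO (bucket name and credentials are placeholders).
gitlab-runner register \
    --non-interactive \
    --url https://gitlab.torproject.org/ \
    --registration-token "$REGISTRATION_TOKEN" \
    --executor docker \
    --docker-image debian:bookworm \
    --cache-type s3 \
    --cache-shared \
    --cache-s3-server-address minio-01.torproject.org:9000 \
    --cache-s3-bucket-name runner-cache \
    --cache-s3-access-key RUNNER_ACCESS_KEY \
    --cache-s3-secret-key RUNNER_SECRET_KEY
```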
More and more services will be migrated to object storage as time goes on and the service is seen as reliable. The full list of services is out of scope of this proposal, but we're thinking of migrating first:
- job artifacts and logs
- backups
- LFS objects
- everything else
Each service should be set up with its own bucket for isolation, where possible. Bucket-level encryption will be enabled, if possible.
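For illustration, creating and encrypting the per-service buckets with the mc client could look like the sketch below (alias, bucket names and encryption mode are assumptions to validate during the test; SSE-S3, in particular, requires a KMS to be configured on the server):

```
# Hypothetical sketch: one bucket per service, with server-side encryption
# enabled on each ("local" is an mc alias for the MinIO server).
mc alias set local https://minio-01.torproject.org:9000 ROOT_USER ROOT_PASSWORD
for bucket in gitlab-registry runner-cache gitlab-artifacts gitlab-lfs; do
    mc mb "local/$bucket"
    mc encrypt set sse-s3 "local/$bucket"
done
```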
Eventually, TPA may be able to offer this service outside the team, if other teams express an interest.
We do not consider this a permanent commitment to MinIO. Because the object storage protocol is relatively standard, it's typically "easy" to transfer between two clusters, even if they have different backends. The catch is, of course, the "weight" of the data, which needs to be duplicated to migrate between two solutions. But it should still be possible thanks to bucket replication or even just plain and simple tools like rclone.
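For example, a migration with rclone could be as simple as the following sketch (remote names are placeholders that would be defined in rclone's configuration):

```
# Hypothetical sketch: copy one bucket from the old cluster to the new one.
# "oldminio" and "newminio" would be S3-type remotes in rclone.conf with the
# respective endpoints and credentials.
rclone sync --progress --checksum \
    oldminio:gitlab-registry newminio:gitlab-registry
```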
Alternatives considered
The above is proposed following a lengthy evaluation of different alternatives, detailed below.
It should be noted, however, that TPA previously brainstormed this in a meeting, where we said:
We considered the following technologies for the broader problem:
- S3 object storage for gitlab
- ceph block storage for ganeti
- filesystem snapshots for gitlab / metrics servers backups
We'll look at setting up a VM with MinIO for testing. We could first test the service with the CI runners image/cache storage backends, which can easily be rebuilt/migrated if we want to drop that test.
This would disregard the block storage problem, but we could pretend this would be solved at the service level eventually (e.g. redesign the metrics storage, split up the gitlab server). Anyways, migrating away from DRBD to Ceph is a major undertaking that would require a lot of work. It would also be part of the larger "trusted high performance cluster" work that we recently de-prioritized.
This is partly why MinIO was picked over the other alternatives (mainly Ceph and Garage).
Ceph
Ceph is (according to Wikipedia) a "software-defined storage platform that provides object storage, block storage, and file storage built on a common distributed cluster foundation. Ceph provides completely distributed operation without a single point of failure and scalability to the exabyte level, and is freely available."
It's kind of a beast. It's written in C++ and Python and is packaged in Debian. It provides a lot of features we are looking for here:
- redundancy ("a la" DRBD)
- load-balancing (read/write to multiple servers)
- far-ranging object storage compatibility
- native Ganeti integration with an iSCSI backend
- Puppet module
- Grafana and Prometheus dashboards, both packaged in Debian
More features:
- block device snapshots and mirroring
- erasure coding
- self-healing
- used at CERN, OVH, and Digital Ocean
- yearly release cycle with two-year support lifetime
- cache tiering (e.g. use SSDs as caches)
- also provides a networked filesystem (CephFS) with an optional NFS frontend
Downsides:
- complexity: at least 3-4 daemons to manage a cluster, although this might be easier to live with thanks to the Debian packages
- high hardware requirements (quad-core, 64-128GB RAM, 10gbps), although their minimum requirements are actually quite attainable
Rejected because of its complexity. If we do reconsider our use of DRBD, we might reconsider Ceph again, as we would then be able to run a single storage cluster for all nodes. But it also feels a little dangerous to give the object storage service access to the same cluster that holds our block storage, so that's actually a reason against Ceph.
Scalability promises
CERN started with a 3PB Ceph deployment around 2015. It seems it's still in use:
... although, as you can see, it's not exactly clear to me how much data is managed by ceph. they seem to have a good experience with Ceph in any case, with three active committers, and they say it's a "great community", which is certainly a plus.
On the other hand, managing lots of data is part of their core mission, in a sense, so they can probably afford putting more people on the problem than we can.
Complexity and other concerns
GitLab tried to move from the cloud to bare metal. Issue 727 and issue #1 track their attempt to migrate to Ceph which failed. They moved back to the cloud. A choice quote from this deployment issue:
While it's true that we lean towards PostgreSQL, our usage of CephFS was not for the database server, but for the git repositories. In the end we abandoned our usage of CephFS for shared storage and reverted back to a sharded NFS design.
Jeff Atwood also described his experience, presumably from StackOverflow's attempts:
We had disastrous experiences with Ceph and Gluster on bare metal. I think this says more about the immaturity (and difficulty) of distributed file systems than the cloud per se.
This was a Hacker News comment in response to the first article from GitLab.com above, which ended up being correct as GitLab went back to the cloud.
One key thing to keep in mind is that GitLab were looking for an NFS replacement, but we don't use NFS anywhere right now (thank god) so that is not a requirement for us. So those issues might be less of a problem, as the above "horror stories" might not be the same with other storage mechanisms. Indeed, there's a big difference between using Ceph as a filesystem (ie. CephFS) and an object storage (RadosGW) or block storage (RBD), which might be better targets for us.
In particular, we could use Ceph as a block device -- for Ganeti instance disks, which Ganeti has good support for -- or object storage -- for GitLab's "things", which it is now also designed for. And indeed, "NFS" (ie. real filesystem) is now (14.x?) deprecated in GitLab, so shared data storage is expected to go through S3-like "object storage" APIs from here on.
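As an illustration of that block storage path, Ganeti ships an rbd disk template, so instances could live directly on Ceph block devices (a sketch only, assuming a RADOS pool is already configured for the cluster and the template is enabled):

```
# Hypothetical sketch: create a test instance backed by Ceph RBD instead of
# DRBD; OS variant, sizes and name are placeholders.
gnt-instance add \
    -t rbd \
    -o debootstrap+default \
    --disk 0:size=20G \
    -B memory=2G,vcpus=2 \
    test-rbd-01.torproject.org
```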
Some more Ceph war stories:
- A Ceph war story - a major outage and recovery due to XFS and firmware problems
- File systems unfit as distributed storage backends: lessons from ten years of Ceph evolution - how Ceph migrated from normal filesystem backends to their own native block device store ("BlueStore"), an approach also used by recent MinIO versions
Garage
Garage is another alternative, written in Rust. They provide a Docker image and binaries. It is not packaged in Debian.
It was written from scratch by a French association called deuxfleurs.fr. The first release was funded by an NLnet grant, which was renewed for a year in May 2023.
Features:
- apparently faster than MinIO on higher-latency links (100ms+)
- Prometheus monitoring (see metrics list) and Grafana dashboard
- regular releases with actual release numbers, although not yet 1.0 (current is 0.8.2, released 4 months ago as of June 2023, apparently stable enough for production, "Improvements to the recovery behavior and the layout algorithm are planned before v1.0 can come out")
- read-after-write consistency (stronger than Amazon S3's eventual consistency)
- support for asynchronous replicas (so-called "dangerous" mode that returns to the client as soon as the local write finishes), see the replication mode for details
- static website hosting
Missing and downsides:
- possibly slower (-10%) than MinIO in raw bandwidth and IOPS, according to this self-benchmark
- purposefully no erasure coding, which implies full data duplication across nodes
- designed for smaller, "home lab" distributed setups, might not be our target
- no built-in authentication system, no support for S3 policies or ACLs
- non-goals also include "extreme performance" and features above the S3 API
- uses a CRDT and Dynamo instead of Raft, see this discussion for tradeoffs and the design page
- no live migration; the upgrade procedure currently implies short downtimes
- backups require live filesystem snapshots or shutdown, example backup script
- no bucket versioning
- no object locking
- no server-side encryption, they argue for client-side encryption, full disk encryption, and transport encryption instead in their encryption section
- no HTTPS support out of the box, can be easily fixed with a proxy
See also their comparison with other software including MinIO. A lot of the information in this section was gleaned from this Hacker News discussion and this other one.
Garage was seriously considered for adoption, especially with our multi-site, heterogeneous environment.
That said, it didn't seem quite mature enough: the lack of bucket encryption, in particular, feels like a deal-breaker. We do not accept the theory that server-side encryption is useless, on the contrary: there have been many cases of S3 buckets being leaked because of botched access policies, something that might very well happen to us as well. Adding bucket encryption adds another layer of protection on top of our existing transport (TLS) and at-rest (LUKS) encryption; the latter, in particular, doesn't address the "leaked bucket" attack vector.
The backup story is also not much better than MinIO's; a clearly better story there could have been a deal-breaker giving Garage a win. Unfortunately, Garage also doesn't keep its own filesystem clean, although it might be cleaner than MinIO: the developers indicate filesystem snapshots can provide a consistent copy, something that's not offered by MinIO.
Still, we might reconsider Garage if we do need a more distributed, high-availability setup. This is currently not part of the GitLab SLA so not a strong enough requirement to move forward with a less popular alternative.
MinIO
MinIO seems to be suggested/shipped by GitLab Omnibus these days. It is not packaged in Debian, so container deployment is probably the only reasonable solution, but watch out for network overhead. It has no conventional release numbers and an unclear support policy. It is written in Golang.
Features:
- active-active replication, although it has strict latency (<20ms) and packet loss (<0.01%) requirements, and requires a load balancer for HA
- asynchronous replication, can survive replicas going down (data gets cached and resynced after)
- bucket replication
- erasure coding
- rolling upgrades with "a few seconds" downtime (presumably compensated by client-side retries)
- object versioning, immutability
- Prometheus and InfluxDB monitoring, also includes bucket event notifications
- audit logs
- external identity providers: LDAP, OIDC (Keycloak specifically)
- object server-side encryption through external Key Management Services (e.g. Hashicorp Vault)
- built-in TLS support
- recommended hardware setups although probably very expensive
- self-diagnostics and hardware tests
- lifecycle management
- FTP/SFTP/FTPS support
- has detailed instructions for Linux, MacOS, Windows, Kubernetes and Docker/Podman
Missing and downsides:
- only two-node replication
- possible licensing issues (see below)
- upgrades and pool expansions require all servers to restart at once
- cannot resize existing server pools, in other words, a resize means building a new larger server and retiring the old one (!) (note that this only affects multi-node pools, for single-node "test" setups, storage can be scaled from the underlying filesystem transparently)
- very high hardware requirements (4 nodes with each 32 cores, 128GB RAM, 8 drives, 25-100GbE for 2-4k clients)
- backups need to be done through bucket replication or site replication, difficult to backup using our normal backup systems
- some "open core", features are hidden behind a paywall even in the free version, for example profiling, health diagnostics and performance tests
- docker version is limited to setting up a "Single-Node Single-Drive MinIO server onto Docker or Podman for early development and evaluation of MinIO Object Storage and its S3-compatible API layer"
- that simpler setup, in turn, seems less supported for production and has lots of warnings around risk of data loss
- no cache tiering (can't use SSD as a cache for HDDs...)
- other limitations
Licensing dispute
MinIO are involved in a licensing dispute with commercial storage providers (Weka and Nutanix) because the latter used MinIO in their products without giving attribution. See also this hacker news discussion.
It should also be noted that they switched to the AGPL relatively recently.
This is not seen as a deal-breaker in using MinIO for TPA.
First run
The quickstart guide is easy enough to follow to get us started, for example:
PASSWORD=$(tr -dc '[:alnum:]' < /dev/urandom | head -c 32)
mkdir -p ~/minio/data
podman run \
-p 9000:9000 \
-p 9090:9090 \
-v ~/minio/data:/data \
-e "MINIO_ROOT_USER=root" \
-e "MINIO_ROOT_PASSWORD=$PASSWORD" \
quay.io/minio/minio server /data --console-address ":9090"
... will start with an admin interface on https://localhost:9090 and the API on https://localhost:9000 (even though the console messages will say otherwise).
You can use the web interface to create the buckets, or the mc client which is also available as a Docker container.
We tested this procedure and it seemed simple enough; it didn't even require creating a configuration file.
OpenIO
The OpenIO project was mentioned in one of the GitLab threads. The main website (https://www.openio.io/) seems down (SSL_ERROR_NO_CYPHER_OVERLAP) but some information can be gleaned from the documentation site.
It is not packaged in Debian.
Features:
- Object Storage (S3)
- OpenStack Swift support
- minimal hardware requirements (1 CPU, 512MB RAM, 1 NIC, 4GB storage)
- no need to pre-plan cluster size
- dynamic load-balancing
- multi-tenant
- progressive offloading to avoid rebalancing
- lifecycle management, versioning, snapshots
- no single point of failure
- geo-redundancy
- metadata indexing
Downsides, missing features:
- partial S3 implementation, notably missing:
- encryption? the above S3 compatibility page says it's incompatible, but this page says it is implemented, unclear
- website hosting
- bucket policy
- bucket replication
- bucket notifications
- a lot of "open core" features ("part of our paid plans", which is difficult to figure out because said plans are not visible in latest Firefox because of aforementioned "SSL" issue)
- OpenIO FS
- IAM
- design seems awfully complicated
- requires disabling apparmor (!?)
- supported OS page clearly out of date or not supporting stable Debian releases
- no release in almost a year (as of 2023-06-28, last release is from August 2022)
Not seriously considered because of missing bucket encryption, the weird apparmor limitation, the "open core" business model, the broken website, and the long time without releases.
SeaweedFS
"SeaweedFS is a fast distributed storage system for blobs, objects, files, and data lake, for billions of files!" according to their GitHub page. Not packaged in Debian, written in Golang.
Features:
- Blob store has O(1) disk seek, cloud tiering
- cross-DC active-active replication
- Kubernetes
- POSIX FUSE mount
- S3 API
- S3 Gateway
- Hadoop
- WebDAV
- encryption
- Erasure Coding
- optimized for small files
Not considered because of focus on small files.
Kubernetes
In Kubernetes, storage is typically managed by some sort of operator that provides volumes to the otherwise stateless "pods" (collections of containers). Those operators, in turn, are designed to offer large storage capacity that scales automatically. Here are two possible options:
- https://longhorn.io/ - Kubernetes volumes, native-only, no legacy support?
- https://rook.io/ - Ceph operator
Those were not evaluated any further. Kubernetes itself is quite a beast and seems overkill to fix the immediate problem at hand, although it could be interesting to manage our growing fleet of containers eventually.
Other ideas
Those are other, outside-the-box ideas that were also rejected.
Throw hardware at it
One solution to the aforementioned problem is to "just throw hardware at it", that is, scaling up our hardware resources to match the storage requirements, without any redesign.
We believe this is impractical because some of the storage consumers grow non-linearly, which makes it hard to match that expansion on generic infrastructure.
By picking a separate system for large file storage, we are able to isolate this problem in a separate service which makes it easier to scale.
To give a concrete example, we could throw another terabyte or two at the main GitLab server, but that wouldn't solve the problems the metrics team is suffering from. It would also not help the storage problem the GitLab runners are having, as they wouldn't be able to share a cache, which is something that can be solved with a shared object storage cache.
Storage Area Network (SAN)
We could go with a SAN, home-grown or commercial, but we would rather avoid proprietary solutions, which means we'd have to build our own, and it's not clear how we would do that (ZFS replication, maybe?). That would also only solve the Ganeti storage problems: we'd still need an S3 storage, but we could use something like MinIO for that specifically.
Upstream provider
According to this, one of our upstream providers has terabytes of storage where we could run a VM to have a secondary storage server for Bacula. This requires more trust in them than we'd like to extend for now, but it could be considered later.
Backup-specific solutions
We could fix the backup problems by ditching Bacula and switching to something like borg. We'd need an offsite server to "pull" the backups, however (because borg is push-based, which means a compromised server can trash its own backups). We could build this with ZFS/BTRFS replication, for example.
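To illustrate the pull idea under those assumptions (hostnames, pools and snapshot names below are made up), the offsite copy could be driven like this:

```
# Hypothetical sketch of pull-style replication: the offsite server fetches an
# incremental ZFS snapshot from the backup host, so a compromised client (or
# backup host) cannot rewrite the offsite history.
ssh backup-host.torproject.org \
    zfs send -i tank/backups@2023-07-01 tank/backups@2023-07-02 \
    | zfs receive -F tank/backups-mirror
```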
Another caveat with borg is that restores are kind of slow. Bacula seems to be really fast at restores, at least in our experience restoring websites in issue #40501 (closed).
This is considered out of scope for this proposal and kept for future evaluation.
Costs
Probably less, in the long term, than keeping all storage distributed.
Extra storage requirements could be fulfilled by ordering new SSDs. The current model is the Intel® SSD D3-S4510 Series, which goes for around 210$USD at Newegg or 180$USD at Amazon. Therefore, expanding the fleet with 6 of those drives would gain us 11.5TB (6 × 1.92TB, or 10.4TiB, 5.2TiB after RAID) at a cost of about 1200$USD before tax. With a cold spare, it goes up to around 1400$USD.
Alternatively, we could add higher capacity drives. 3.84TB drives are getting cheaper (per byte) than 1.92TB drives. For example, at the time of writing, there's an Intel D3-S4510 3.84TB drive for sale at 255$USD at Amazon. Expanding with 6 such drives would give us an extra 23TB (3.84TB × 6, or 20.9TiB, 10.5TiB after RAID) of storage at a cost of about 1530$USD, or 1800$USD with a spare.
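A quick sanity check of the arithmetic above (vendors quote decimal TB, we report binary TiB, and the RAID setup is assumed to halve usable capacity, as the figures suggest):

```
# Drive capacity: decimal TB to binary TiB.
echo '6 * 1.92 * 10^12 / 2^40' | bc -l   # ~10.48 TiB raw, ~5.2 TiB after RAID
echo '6 * 3.84 * 10^12 / 2^40' | bc -l   # ~20.96 TiB raw, ~10.5 TiB after RAID
# Cost of the 3.84TB option, using the Amazon price quoted above:
echo '6 * 255' | bc                      # 1530 USD; 7 drives = 1785 USD with a spare
```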