title: TPA-RFC-43: Cymru migration plan
costs: 600$/mth, 41k$ hardware, 5.5-11.5 weeks staff
approval: TPA, accounting, ED
affected users: service admins, TPA
deadline: ASAP
status: standard
discussion: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40929
Summary: creation of a new, high-performance Ganeti cluster in a
trusted colocation facility in the US (600$/mth), with the acquisition of
servers to host at said colo (41,000$); migration of the existing
"shadow simulation" server (chi-node-14) to that new colo; and
retirement of the rest of the gnt-chi cluster.
[[TOC]]
Background
In TPA-RFC-40, we established a rough budget for migrating away from Cymru, but not the exact numbers or a concrete plan for how we would do so. This proposal aims to clarify what we will be doing, where, how, and for how much.
Colocation specifications
These are the specifications we are looking for in a colocation provider:
- 4U rack space
- enough power to feed four machines: the three specified below and chi-node-14 (a Dell PowerEdge R640)
- 1gbit, or ideally 10gbit, unlimited uplink
- IPv4: /24, or at least a /27 in the short term
- IPv6: we currently only have a /64
- out of band access (IPMI or serial)
- rescue systems (e.g. PXE booting)
- remote hands SLA ("how long to replace a broken hard drive?")
- private VLANs
- ideally not in Europe (where we already have lots of resources)
- reverse DNS
This is similar to the specification detailed in TPA-RFC-40, but modified slightly, as we ran into issues when evaluating providers.
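Once a provider grants access, some of these requirements can be spot-checked directly; as a minimal sketch, out of band access over IPMI could be tested like this (the BMC hostname, username and credential variable below are placeholders):

```
# query the chassis status of a hypothetical BMC over the network (requires ipmitool)
ipmitool -I lanplus -H bmc.example.net -U tpa -P "$IPMI_PASSWORD" chassis status
```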
Goals
Must have
- full migration away from team Cymru infrastructure
- compatibility with the colo specifications above
- enough capacity to cover the current services hosted at Team Cymru (see gnt-chi and moly in the Appendix for the current inventory)
Nice to have
- enough capacity to cover the services hosted at the Hetzner Ganeti cluster (gnt-fsn, in the appendix)
Non-Goals
- reviewing the architectural design of the services hosted at Team Cymru and elsewhere
Proposal
The proposal is to migrate all services off of Cymru to a trusted colocation provider.
Migration process
The migration process involves several tasks that will happen in parallel.
New colocation facility access
In this step, we pick the colocation provider and establish contact.
- get credentials for OOB management
- get address to ship servers
- get emergency/support contact information
This step needs to happen before the following steps can be completed (at least before the "servers are shipped" step).
chi-node-14 transfer
This is essentially the work to transfer chi-node-14 to the new colocation facility.
- maintenance window announced to shadow people
- server shutdown in preparation for shipping
- server is shipped
- server is racked and connected
- server is renumbered and brought back online
- end of the maintenance window
This can happen in parallel with the following tasks.
New hardware deployment
- budget approval (TPA-RFC-40 is standard)
- server selection is confirmed
- servers are ordered
- servers are shipped
- servers are racked and connected
- burn-in
At the end of this step, the three servers are built, shipped, connected, and remotely available for install, but not installed just yet.
This step can happen in parallel with the chi-node-14 transfer and the software migration preparation.
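The burn-in procedure itself is not specified here; as a rough sketch, it would involve something like the following (tools and durations are assumptions, not a fixed procedure):

```
# exercise CPU and memory for a few hours (stress-ng is packaged in Debian)
stress-ng --cpu 0 --vm 4 --vm-bytes 80% --timeout 4h --metrics-brief

# kick off long SMART self-tests on each drive (where the drive supports it),
# then review the results before putting the machines in production
for d in /dev/sd? /dev/nvme?n1; do smartctl -t long "$d"; done
smartctl -a /dev/sda
```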
Software migration preparation
This can happen in parallel with the previous tasks.
- confirm a full instance migration between gnt-fsn and gnt-chi
- send notifications for migrated VMs, see table below
- confirm public IP allocation for the new Ganeti cluster
- establish private IP allocation for the backend network
- establish reverse DNS delegation
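As a sketch, the reverse DNS delegation could be verified along these lines once the allocation is known (203.0.113.0/24 and the address below are placeholders, not the actual allocation):

```
# confirm the in-addr.arpa zone for the new /24 is delegated to our nameservers
dig +short NS 113.0.203.in-addr.arpa

# spot-check a PTR record once a host is renumbered
dig +short -x 203.0.113.10
```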
Cluster configuration
This needs all the previous steps (except the chi-node-14 transfer) to be completed before it can go ahead.
1. install first node
2. Ganeti cluster initialization
3. install second node, confirm DRBD networking and live migrations are operational
4. VM migration "wet run" (try to migrate one VM and confirm it works)
5. mass VM migration setup (the move-instance command)
6. mass migration and renumbering
The third node can be installed in parallel with step 4 and later.
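A minimal sketch of the cluster bootstrap, with hypothetical node and cluster names and only a subset of the options we would actually need (hypervisor, volume group and networking parameters would be adapted to the final setup, including a --secondary-ip for the private DRBD network):

```
# on the first node: create the cluster (names and parameters are placeholders)
gnt-cluster init --enabled-hypervisors=kvm --vg-name=vg_ganeti \
  --master-netdev=eth0 gnt-new.torproject.org

# add the second node, then confirm the cluster is healthy
gnt-node add new-node-02.torproject.org
gnt-cluster verify
```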
Single VM migration example
A single VM migration may look something like this:
1. instance stopped on source node
2. instance exported on source node
3. instance imported on target node
4. instance started
5. instance renumbered
6. instance rebooted
7. old instance destroyed after 7 days
If the mass-migration process works, steps 1-4 possibly happen in parallel and operators basically only have to renumber the instances and test.
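If the mass move-instance process doesn't pan out, a manual export/import of a single VM could look roughly like this (instance and node names are placeholders, and the exact gnt-backup options and export paths may differ in practice):

```
# on the source cluster: stop the instance and export it to a node's export directory
gnt-instance stop test-01.torproject.org
gnt-backup export -n chi-node-01.torproject.org test-01.torproject.org

# after copying the export to a node in the new cluster: import and start the instance
gnt-backup import -t drbd -n new-node-01.torproject.org:new-node-02.torproject.org \
  --src-node=new-node-01.torproject.org \
  --src-dir=/var/lib/ganeti/export/test-01.torproject.org \
  test-01.torproject.org
gnt-instance start test-01.torproject.org
```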
Costs
Colocation services
TPA proposes we go with colocation provider A, at 600$ per month for 4U.
Hardware acquisition
This is a quote established on 2022-10-06 by lavamind for TPA-RFC-40. It's from http://interpromicro.com, a supplier used by Riseup, and was last updated on 2022-11-02.
- SuperMicro 1114CS-TNR 1U
- AMD Milan (EPYC) 7713P 64C/128T @ 2.00Ghz 256M cache
- 512G DDR4 RAM (8x64G)
- 2x Micron 7450 PRO, 480GB PCIe 4.0 NVMe*, M.2 SSD
- 6x Intel S4510 1.92T SATA3 SSD
- 2x Intel DC P4610 1.60T NVMe SSD
- Subtotal: 12,950$USD
- Spares:
- Micron 7450 PRO, 480GB PCIe 4.0 NVMe*, M.2 SSD: 135$
- Intel® S4510, 1.92TB, 6Gb/s 2.5" SATA3 SSD(TLC), 1DWPD: 345$
- Intel® P4610, 1.6TB NVMe* 2.5" SSD(TLC), 3DWPD: 455$
- DIMM (64GB): 275$
- labour: 55$/server
- Total: 40,225$USD
- TODO: final quote to be confirmed
- Extras, still missing:
- shipping costs: around 250$ according to this shipping estimate; the provider is charging 350$
- Grand total: 41,000$USD (estimate)
Labor
Initial setup: one week
Ganeti cluster setup costs:
| Task | Estimate | Uncertainty | Total | Notes |
|---|---|---|---|---|
| Node setup | 3 days | low | 3.3d | 1 d / machine |
| VLANs | 1 day | medium | 1.5d | could involve IPsec |
| Cluster setup | 0.5 day | low | 0.6d | |
| Total | 4.5 days | | 5.4d | |
This gets us a basic cluster setup, into which virtual machines can be imported (or created).
Batch migration: 1-2 weeks, worst case full rebuild (4-6w)
We assume each VM will take 30 minutes of work to migrate, which, if all goes well, means we can basically migrate all the machines in one day of work.
| Task | Estimate | Uncertainty | Total | Notes |
|---|---|---|---|---|
| research and testing | 1 day | extreme | 5d | half a day of this already spent |
| total VM migration time | 1 day | extreme | 5d | |
| Total | 2 days | extreme | 10d | |
It might take more time to do the actual transfers, but the assumption is the work can be done in parallel and therefore transfer rates are non-blocking. So that "day" of work would actually be spread over a week of time.
There is a lot of uncertainty in this estimate. It's possible the migration procedure doesn't work at all, and it has in fact proven to be problematic in our first tests. Further testing showed it was possible to migrate a virtual machine, so we believe we will be able to streamline this process.
It's therefore possible that we could batch migrate everything in one fell swoop. We would then just have to do manual changes in LDAP and inside the VM to reset IP addresses.
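As a sketch, the in-VM part of that renumbering on a Debian guest would amount to something like the following (the interface name and OLD_PREFIX are placeholders; the LDAP side is handled through our usual host management):

```
# point the static network configuration at the new addresses, then apply it
editor /etc/network/interfaces      # swap the old prefix for the new colo prefix
ifdown eth0 && ifup eth0            # or simply reboot the instance

# look for other places where the old prefix may be hardcoded
# (OLD_PREFIX is a placeholder for the current Cymru network prefix)
grep -r 'OLD_PREFIX' /etc 2>/dev/null
```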
Worst case: full rebuild, 3.5-4.5 weeks
The worst case here is a fallback to the full rebuild scenario that we computed for the cloud, below.
To this, we need to add a "VM bootstrap" cost. I'd say 1 hour per VM, with medium uncertainty in Ganeti, so 1.5h per VM, or ~22h (~3 days).
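A quick sanity check of that figure, using the 15 VMs from the appendix inventory as the assumed count:

```
qalc '15 * 1.5 hours'           # ≈ 22.5 h
qalc '22.5 hours / (8 hours)'   # ≈ 2.8 eight-hour days, rounded up to ~3
```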
Instance table
This table is an inventory of the current machines, at the time of writing, that need to be migrated away from Cymru. It details what will happen to each machine, concretely. This is a preliminary plan and might change if problems come up during migration.
| machine | location | fate | users |
|---|---|---|---|
| btcpayserver-02 | gnt-chi, drbd | migrate | none |
| ci-runner-x86-01 | gnt-chi, blockdev | rebuild | GitLab CI |
| dangerzone-01 | gnt-chi, drbd | migrate | none |
| gitlab-dev-01 | gnt-chi, blockdev | migrate or rebuild | none |
| metrics-psqlts-01 | gnt-chi, drbd | migrate | metrics |
| onionbalance-02 | gnt-chi, drbd | migrate | none |
| probetelemetry-01 | gnt-chi, drbd | migrate | anti-censorship |
| rdsys-frontend-01 | gnt-chi, drbd | migrate | anti-censorship |
| static-gitlab-shim | gnt-chi, drbd | migrate | none |
| survey-01 | gnt-chi, drbd | migrate | none |
| tb-pkgstage-01 | gnt-chi, drbd | migrate | applications |
| tb-tester-01 | gnt-chi, drbd | migrate | applications |
| telegram-bot-01 | gnt-chi, blockdev | migrate | anti-censorship |
| fallax | moly | rebuild | none |
| build-x86-05 | moly | retire | weasel |
| build-x86-06 | moly | retire | weasel |
| moly | Chicago? | retire | none |
| chi-node-01 | Chicago | retire | none |
| chi-node-02 | Chicago | retire | none |
| chi-node-03 | Chicago | retire | none |
| chi-node-04 | Chicago | retire | none |
| chi-node-05 | Chicago | retire | none |
| chi-node-06 | Chicago | retire | none |
| chi-node-07 | Chicago | retire | none |
| chi-node-08 | Chicago | retire | none |
| chi-node-09 | Chicago | retire | none |
| chi-node-10 | Chicago | retire | none |
| chi-node-11 | Chicago | retire | none |
| chi-node-12 | Chicago | retire | none |
| chi-node-13 | Chicago | retire | ahf |
| chi-node-14 | Chicago | ship | GitLab CI / shadow |
The columns are:
The columns are:

- machine: which machine to manage
- location: where the machine is currently hosted, examples:
  - Chicago: a physical machine in a datacenter somewhere in Chicago, Illinois, United States of America
  - moly: a virtual machine hosted on the physical machine moly
  - gnt-chi: a virtual machine hosted on the Ganeti chi cluster, made of the chi-node-X physical machines
    - drbd: a normal VM backed by two DRBD devices
    - blockdev: a VM backed by a SAN, may not be migratable
- fate: what will happen to the machine, either:
  - retire: the machine will not be rebuilt and instead just retired
  - migrate: the machine will be moved and renumbered with either the mass move-instance command or export/import mechanisms
  - rebuild: the machine will be retired and a new machine will be rebuilt in its place in the new cluster
  - ship: the physical server will be shipped to the new colo
- users: notes which users are affected by the change, mostly because of the IP renumbering or downtime, and which should be notified. Some services are marked as none even though they have users; in that case it is assumed that the migration will not cause a downtime, or at worst a short downtime (DNS TTL propagation) during the migration.
Affected users
Some services at Cymru will have their IP addresses renumbered, which may affect access control lists. A separate communication will be addressed to affected parties before and after the change.
The affected users are detailed in the instance table above.
Alternatives considered
In TPA-RFC-40, other options were considered instead of hosting new servers in a colocation facility. Those options are discussed below.
Dedicated hosting
In this scenario, we rent machines from a provider (probably a commercial provider).
The main problem with this approach is that it's unclear whether we will be able to reproduce the Ganeti setup the way we need to, as we do not always get the private VLAN we need to set up the storage backend. At Hetzner, for example, this setup has proven to be costly and brittle.
Monthly costs are also higher than in the self-hosting solution. The migration costs were not explicitly estimated, but were assumed to be within the higher range of the self-hosting option. In effect, dedicated hosting is the worst of both worlds: we get to configure a lot, like in the self-hosting option, but without its flexibility, and we get to pay the cloud premium as well.
Cloud hosting
In this scenario, each virtual machine is moved to the cloud. It's unclear how that would happen exactly, which is the main reason behind the wide-ranging time estimates.
In general, large simulations seem costly in this environment as well, at least if we run them full time.
The uncertainty around cloud hosting is large: the minimum time estimate is similar to the self-hosting option, but the maximum time is 50% longer than the self-hosting worst case scenario. Monthly costs are also higher.
The main problem with migrating to the cloud is that each server basically needs to be rebuilt from scratch, as we are unsure we can easily migrate server images into a proprietary cloud provider. If we could find a cloud provider offering Ganeti hosting, we might be able to reuse the batch migration procedures.
That, in turn, shows that our choice of Ganeti impairs our capacity to quickly evacuate to another provider, as the software isn't very popular, let alone standard. Using tools like OpenStack or Kubernetes might help alleviate that problem in the future, but that is a major architectural change that is out of scope of this discussion.
Provider evaluation
In this section, we summarize the different providers that were evaluated for colocation services and hardware acquisition.
Colocation
For privacy reasons, the provider evaluation is performed in a confidential GitLab issue, see this comment in issue 40929.
But we can detail that, in TPA-RFC-40, we have established prices from three providers:
- Provider A: 600$/mth (4 x 150$ per 1U, discounted from 350$)
- Provider B: 900$/mth (4 x 225$ per 1U)
- Provider C: 2,300$/mth (20 x a1.xlarge + 1 x r6g.12xlarge at Amazon AWS, public prices extracted from https://calculator.aws, includes hardware)
The actual provider chosen and its associated costs are detailed in costs, in the colocation services section.
Other providers
Other providers were found after this evaluation was completed and are documented in this section for future reference.
- Deft: large commercial colo provider, no public pricing, used by 37signals/Basecamp
- Coloclue: community colo, good prices, interesting project, public weather map, looking glass, peering status, status page, MANRS member, relatively cheap (€0.4168/kWh, or €540.17/mth for a 15A*120V circuit, unmetered gbit included), reasonable OOB management
Hardware
In TPA-RFC-40, we have established prices from three providers:
- Provider D: 35,334$ (48,480$CAD = 3 x 16,160$CAD for SuperMicro 1114CS-THR 1U, AMD Milan (EPYC) 7713P 64C/128T @ 2.00Ghz 256M cache, 512G DDR4 RAM, 6x 1.92T SATA3 SSD, 2x 1.60T NVMe SSD, NIC 2x10GbE SFP+)
- Provider E: 36,450$ (3 x 12,150$ USD for Super 1114CS-TNR, AMD Milan 7713P-2.0Ghz/64C/128T, 512GB DDR4 RAM, 6x 1.92T SATA3 SSD, 2x 1.60T NVMe SSD, NIC 2x 10GB/SFP+)
- Provider F: 35,470$ (48,680$CAD = 3 x 16,226$CAD for Supermicro 1U AS-1114CS-TNR, Milan 7713P UP 64C/128T 2.0G 256M, 8x 64GB DDR4-3200 RAM, 6x Intel D3 S4520 1.92TB SSD, 2x Intel D7-P5520 1.92TB NVMe, NIC 2-port 10G SFP+)
The costs of the hardware picked are detailed in costs, in the hardware acquisition section.
For three such servers, we have:
- 192 cores, 384 threads
- 1536GB RAM (1.5TB)
- 34.56TB SSD storage (17TB after RAID-1)
- 9.6TB NVMe storage (4.8TB after RAID-1)
- Total: 40,936$USD
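These totals can be double-checked from the per-server specifications in the quote above (3 servers), for example with qalc:

```
qalc '3 * 64'                      # 192 cores, 384 threads with SMT
qalc '3 * 8 * 64 Gbyte'            # 1536 GB (1.5 TB) of RAM
qalc '3 * 6 * 1.92 * 1000 Gbyte'   # 34.56 TB of SATA SSD, ~17 TB after RAID-1
qalc '3 * 2 * 1.6 * 1000 Gbyte'    # 9.6 TB of NVMe, 4.8 TB after RAID-1
```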
Other options were proposed in TPA-RFC-40: doubling the RAM (+3k$), doubling the SATA3 SSD capacity (+2k$), doubling the NVMe capacity (+800$), or faster CPUs with less cores (+200$). But the current build seems sufficient, given that it would have enough capacity to host both gnt-chi (800GB) and gnt-fsn (17TB, including 13TB on HDD and 4TB on NVMe).
Note that none of this takes into account DRBD replication, but neither did the original specification, so that is abstracted away.
We also considered using fiber connections: with SFP modules, that would cost 570$ extra (2 per server, so 6 x 95$ for the AOM-TSR-FS, a 10G/1G Ethernet 10GBase-SR/SW 1000Base-SX Dual Rate SFP+ 850nm LC transceiver) on top of the quotes, which already include 2x10GbE SFP+ AOC NICs.
Timeline
Some basic constraints:
- we want to leave as soon as humanly possible
- the quote with provider A is valid until June 2023
- hardware support is available with Cymru until the end of December 2023
Tentative timeline:
- November 2022
- W47: adopt this proposal
- W47: order servers
- W47: confirm colo contract
- W47: New colocation facility access
- W48-W49: chi-node-14 transfer (outage)
- December 2022
- waiting for servers
- W52: end of hardware support from Cymru
- W52: holidays
- January 2023
- W1: holidays
- W2: ideal: servers shipped (5 weeks)
- W2: new hardware deployment
- W3: Software migration preparation
- W3-W4: Cluster configuration and batch migration
- February 2023:
- W1: gnt-chi cluster retirement, ideal date
- W7: worst case: servers shipped (10 weeks, second week of February)
- March 2023:
- W12: worst case: full build
- W13: worst case: gnt-chi cluster retirement (end of March)
This timeline will evolve as the proposal is adopted and contracts are confirmed.
Deadline
This is basically as soon as possible, with the understanding that we have neither the (human) resources to rebuild everything in the cloud nor the (hardware) resources to rebuild everything elsewhere immediately.
The most pressing migrations (the two web mirrors) were already migrated to OVH cloud.
This proposal will be considered adopted by TPA on Monday, November 14th, unless there is opposition before then or during the check-in.
The proposal will then be brought to accounting and the executive director, who will decide on the deadline.
References
- TPA-RFC-40: Cymru migration budget
- discussion ticket
Appendix
Inventory
This is from TPA-RFC-40, copied here for convenience.
gnt-chi
In the Ganeti (gnt-chi) cluster, we have 12 machines hosting about
17 virtual machines, of which 14 must absolutely be migrated.
Those machines count for:
- memory: 262GB used out of 474GB allocated to VMs, including 300GB for a single runner
- CPUs: 78 vcores allocated
- Disk: 800GB disk allocated on SAS disks, about 400GB allocated on the SAN
- SAN: basically 1TB used, mostly for the two mirrors
- a /24 of IP addresses
- unlimited gigabit
- 2 private VLANs for management and data
This does not include:
- shadow simulator: 40 cores + 1.5TB RAM (chi-node-14)
- moly: another server considered negligible in terms of hardware (3 small VMs, one to rebuild)
Those machines are:
root@chi-node-01:~# gnt-instance list --no-headers -o name | sed 's/.torproject.org//'
btcpayserver-02
ci-runner-01
ci-runner-x86-01
ci-runner-x86-05
dangerzone-01
gitlab-dev-01
metrics-psqlts-01
onionbalance-02
probetelemetry-01
rdsys-frontend-01
static-gitlab-shim
survey-01
tb-pkgstage-01
tb-tester-01
telegram-bot-01
root@chi-node-01:~# gnt-instance list --no-headers -o name | sed 's/.torproject.org//' | wc -l
15
gnt-fsn
While we are not looking at replacing the existing gnt-fsn cluster, it's still worthwhile to look at the capacity and usage there, in case we need to replace that cluster as well, or grow the gnt-chi cluster to similar usage.
- gnt-fsn has 4x10TB + 1x5TB HDD and 8x1TB NVMe (after RAID), according to gnt-nodes list-storage, for a total of 45TB HDD and 8TB NVMe after RAID
- out of that, around 17TB is in use (basically: ssh fsn-node-02 gnt-node list-storage --no-header | awk '{print $5}' | sed 's/T/G * 1000/;s/G/Gbyte/;s/$/ + /' | qalc), 13TB of which on HDD
- memory: ~500GB (8*62GB = 496GB), out of this 224GB is allocated
- cores: 48 (8*12 = 96 threads), out of this 107 vCPUs are allocated
moly
| instance | memory | vCPU | disk |
|---|---|---|---|
| fallax | 512MiB | 1 | 4GB |
| build-x86-05 | 14GB | 6 | 90GB |
| build-x86-06 | 14GB | 6 | 90GB |