Saving >60% cost on VMs while maintaining reliability

Sat, Dec 30, 2023

Inspired by Chinmay Naik’s CTO <> SRE series, I would also like to share a real-life scenario.

You’re an SRE responsible for managing a fleet of event ingestion HTTP VM nodes. The CTO wants you to drastically reduce the costs for this infra without compromising the reliability.

You come up with a solution that looks ridiculous at first but makes total sense.

Context

You’re managing a service that ingests events. It requires 50+ resource-intensive VMs and can handle hundreds of thousands of events per second. The service is stateless, processes short-lived jobs, and can sustain sudden failures for a short duration without any loss of data (events are also queued on the client’s end until they are acknowledged).

Everything is hosted in GCP, and you’re bound to the existing region due to latency concerns. The cost of running the infra is in the range of thousands of dollars/month. It’s Q4 of 2023, so you’re running low on budget, and the CTO asks you to reduce the costs drastically.

All this - without any downtime or loss of reliability!

Your approach

You start analyzing the workloads and every detail you could leverage to reduce the costs.

There must be a way that can fit all the criteria…

You think of a few approaches, but none of them meets both the reliability and the cost targets. You’re tempted to write to the CTO that this can’t be done: given the constraints, the costs can’t be reduced.

You decide to sleep on it and send the update tomorrow.

The next morning, while taking a shower, you have a crazy thought. You hurriedly finish up and start figuring things out step by step, still in your bathrobe.

You can’t stop smiling, but you’re not sure whether the CTO will approve your idea.

First thing, you set up a call with the CTO and start explaining your idea of moving the fleet to spot (preemptible) instances… before the CTO can say anything about the preemption of instances, you ask him to hear you out.

You go:

See, I did a lot of thinking and tried many hypotheses, but we’re running a tight ship. The only way to save >60% of the cost in the same region is to migrate to spot instances. I did some napkin math, and I think we can save at least 60% of the GCP infra cost.

Since the numbers look good, you start to explain your thought process:

Migrating to k8s wouldn’t solve the preemption concern :P
Hosting on a cheaper IaaS platform (buyVM, e2e, vultr, hetzner etc) isn’t feasible because of latency (spot instances on GCP cost about the same as VMs on those platforms - roughly 1/6th the price of a standard VM)
Migrating to Arm-based VMs could be a feasible option, but they aren’t available in the current region yet

So the only way is to make spot instances work. You figure out that traffic behind a GCP ALB (Application Load Balancer) can be distributed across multiple MIGs (managed instance groups) in a weighted round-robin fashion, taking the availability of each group into account.

You decide to create 3 MIGs, with a 45-45-10 traffic split:
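
To make the idea of weighted distribution with availability concrete, here is a minimal sketch. It is only an illustration, not how GCP implements load balancing internally: the function name, the group names, and the weights are all hypothetical. It shows how configured weights turn into traffic shares, and how those shares get renormalized when a group has no healthy capacity.

```python
def effective_split(weights, healthy):
    """Return each group's share of traffic.

    weights: configured weight per backend group (hypothetical values).
    healthy: set of group names that currently have healthy capacity.
    Groups that are not healthy get a share of 0, and the remaining
    weights are renormalized so the shares still add up to 1.
    """
    live = {name: w for name, w in weights.items() if name in healthy and w > 0}
    total = sum(live.values())
    if total == 0:
        raise RuntimeError("no healthy backend group available")
    return {name: live.get(name, 0) / total for name in weights}


# Hypothetical group names and weights, for illustration only.
weights = {"group-a": 3, "group-b": 1}

# Both groups healthy: traffic follows the configured weights.
print(effective_split(weights, healthy={"group-a", "group-b"}))

# group-a loses all its instances: group-b absorbs everything.
print(effective_split(weights, healthy={"group-b"}))
```

The point of the sketch is that losing a group does not drop its traffic; the share moves to whatever is still healthy, which is why the remaining groups need spare capacity.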

2 groups of spot instances with different machine types, so if one machine type is short on capacity, the group with the other machine type can scale up. Even in the case of preemption, only a segment of traffic is impacted until the other groups scale; keeping extra headroom in instance utilization covers the gap until then.
10% of traffic still goes to a standard instance group (this covers the worst-case scenario, in which both machine types get preempted)
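
The napkin math behind this split can be sketched roughly as below. It rests on assumptions that are mine, not measured numbers: capacity is proportional to traffic share, spot instances cost about 1/6th of a standard VM (the estimate quoted earlier), and both machine types have a similar standard price. The group names and the `spot_headroom` parameter are hypothetical.

```python
# Assumption: spot price relative to a standard VM, per the estimate above.
SPOT_PRICE_RATIO = 1 / 6


def relative_cost(split, spot_headroom=0.0):
    """Fleet cost as a fraction of the all-standard baseline.

    split: traffic share per group; capacity is assumed proportional to it.
    spot_headroom: extra spot capacity kept as a buffer (0.5 = 50% more).
    """
    spot_share = split["spot-a"] + split["spot-b"]
    standard_share = split["standard"]
    return spot_share * (1 + spot_headroom) * SPOT_PRICE_RATIO + standard_share


def surge_factor(split, lost_group):
    """Load multiplier on the surviving groups if one group disappears
    and its traffic is re-split proportionally among the rest."""
    return 1 / (1 - split[lost_group])


split = {"spot-a": 0.45, "spot-b": 0.45, "standard": 0.10}

for headroom in (0.0, 0.5):
    cost = relative_cost(split, headroom)
    print(f"spot headroom {headroom:.0%}: cost {cost:.1%} of baseline, "
          f"saving {1 - cost:.1%}")

print(f"load multiplier if spot-a is preempted: {surge_factor(split, 'spot-a'):.2f}x")
```

Under these assumptions the split costs about a quarter of the baseline with no buffer, and the saving stays above 60% even with 50% extra spot capacity. The surge factor is a reminder of why the buffer matters: when one spot group goes away, the remaining groups see close to double their usual load until replacements come up.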

The CTO is happy with the details and your approach. The entire migration takes less than a couple of hours, and you end up saving upwards of 60% in cost month on month.

Migrating to spot instances looked ridiculous at first glance, but considering the details of the problem space, it turned out to be a feasible approach - saving as much as moving out of GCP altogether would have.