Saving >60% cost on VMs while maintaining reliability

Inspired by Chinmay Naik’s CTO <> SRE series, I would also like to share a real-life scenario.

You’re an SRE responsible for managing a fleet of HTTP event-ingestion VM nodes. The CTO wants you to drastically reduce the cost of this infra without compromising reliability.

You come up with a solution that looks ridiculous at first but makes total sense.

Context

You’re managing a service that ingests events. It requires 50+ resource-intensive VMs and can handle hundreds of thousands of events per second. The service is stateless, processes short-lived jobs, and can sustain sudden failures for a short duration without any loss of data (events are also queued on the client’s end until they are acked).

Everything is hosted in GCP, and you’re bound to the existing region due to latency concerns. The cost of running the infra is in the range of thousands of dollars/month. It’s Q4 of 2023, so you’re running low on budget, and the CTO asks you to reduce the costs drastically.

All this - without any downtime or loss of reliability!

Your approach

You think it over, analyzing the workloads and every detail you could leverage to reduce the costs.

There must be a way that can fit all the criteria…

You think of a few approaches, but none of them meets both the reliability and the cost you want. You’re about ready to write to the CTO that this can’t be done: you can’t reduce the costs, given the constraints.

You decide to sleep on it and send the update tomorrow.

The next morning, while taking a shower, you have a crazy thought. You hurriedly finish up and start figuring things out step by step, still in your bathrobe.

You can’t stop smiling, but you’re not sure whether the CTO will approve your idea.

First thing, you set up a call with the CTO and start explaining your idea of moving the fleet to spot (preemptible) instances… before the CTO can say anything about the preemption of instances, you ask him to hear you out.

You go:

See, I did a lot of thinking and tried many hypotheses, but we’re running a tight ship. The only way to save >60% of the cost in the same region is to migrate to spot instances. I did some back-of-the-napkin math, and I think we can save at least 60% in GCP infra cost.
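The napkin math could look something like the sketch below. The spot share and spot discount used here are purely hypothetical inputs for illustration, not figures from this story, and the helper names are made up. The idea is only that the blended saving is roughly the share of the fleet on spot multiplied by how much cheaper spot is than on-demand.

```python
def blended_saving(spot_share: float, spot_discount: float) -> float:
    """Fraction of the original bill saved when `spot_share` of the fleet
    moves to spot VMs priced `spot_discount` below on-demand."""
    return spot_share * spot_discount


def required_discount(target_saving: float, spot_share: float) -> float:
    """Spot discount needed to reach `target_saving` at a given spot share."""
    return target_saving / spot_share


# Hypothetical inputs, NOT numbers from the story.
SPOT_SHARE = 0.9      # assume 90% of the fleet runs on spot
SPOT_DISCOUNT = 0.7   # assume spot is 70% cheaper than on-demand

print(f"blended saving: {blended_saving(SPOT_SHARE, SPOT_DISCOUNT):.0%}")
print(f"discount needed for a 60% saving: {required_discount(0.6, SPOT_SHARE):.0%}")
```

Plugging in your own fleet split and the current spot pricing for your machine type and region tells you quickly whether a 60% target is even within reach.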

Since the numbers look good, you start to explain your thought process:

So the only way is to make spot instances work. You figured out that traffic behind a GCP ALB (Application Load Balancer) can be distributed across multiple MIGs (managed instance groups) in a weighted round-robin fashion, taking each group’s availability into account.

You decide to create 3 MIGs, with a 45-45-10 traffic split.
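To see why a weighted split across several groups helps, here is a toy model of it. It is a sketch under assumptions, not the load balancer’s actual algorithm: the group names are hypothetical, only the 45-45-10 weights come from the text, and "availability" is modelled simply as a set of healthy groups whose weights are renormalised.

```python
import random
from collections import Counter

# Hypothetical group names; only the 45-45-10 weights come from the text.
WEIGHTS = {"mig-a": 45, "mig-b": 45, "mig-c": 10}


def effective_split(weights, healthy):
    """Share of traffic each healthy group receives once unhealthy groups
    are dropped and the remaining weights are renormalised."""
    live = {name: w for name, w in weights.items() if name in healthy}
    total = sum(live.values())
    if total == 0:
        raise RuntimeError("no healthy instance group left")
    return {name: w / total for name, w in live.items()}


def pick_group(weights, healthy, rng=random):
    """Pick one group for a request, proportionally to its weight."""
    split = effective_split(weights, healthy)
    names = list(split)
    return rng.choices(names, weights=[split[n] for n in names], k=1)[0]


if __name__ == "__main__":
    # All three groups healthy: the configured split applies.
    print(effective_split(WEIGHTS, {"mig-a", "mig-b", "mig-c"}))
    # One group entirely preempted: the survivors share its traffic.
    print(effective_split(WEIGHTS, {"mig-b", "mig-c"}))
    rng = random.Random(0)
    picks = Counter(pick_group(WEIGHTS, {"mig-b", "mig-c"}, rng) for _ in range(10_000))
    print(picks)
```

In this model, losing one group never drops traffic as long as at least one other group is healthy, but the survivors must have (or quickly scale up to) enough capacity to absorb the extra share.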

The CTO is happy with the details and your approach. The entire migration takes less than a couple of hours, and you end up saving upwards of 60% of the cost, month on month.

Migrating to spot instances was ridiculous at first glance, but considering the details of the problem space, it turned out to be a feasible approach - saving as much as moving out of GCP itself.