Saving >70% on cloud costs while maintaining reliability

This post continues my CTO <> SRE series (thanks to all of you, the previous post was appreciated far more than I had anticipated). Here, we’re going to discuss a real-life scenario that reduces costs significantly further.

Your startup is going through a hyper-growth phase, and you’re forced to reduce cloud costs. Most of the common strategies don’t seem to help.

You come up with an insight and manage to reduce cloud infra costs by more than 70% while maintaining the same reliability and improving overall performance.

Context

Your entire infrastructure runs on AWS, and scale is growing at a really fast pace, which has led to a multi-fold increase in cloud costs month-on-month.

The cost is evenly distributed across bandwidth, compute, storage, etc. The engineering team already runs everything in a well-optimized manner, so there is no obvious optimization left in the existing setup.

Your Observations

After reviewing the infrastructure, you realize that spot (preemptible) instances are already being used wherever possible, so that option is off the table.

Since the usual ways of reducing the ever-increasing costs are exhausted, you need an out-of-the-box solution.

The primary concern seems to be the dependency on cloud infrastructure.

You recall attending an SRE conference where it was suggested that migrating to bare-metal servers (managed by third parties like IONOS, Hetzner, or any other local solution provider) might lead to significant cost savings.

However, moving out of the cloud might hurt reliability and could even risk complete loss of data.

You share your thought process with the CTO, and while finishing, a clever idea strikes.

What if you leverage a hybrid approach?

Migrate out of the cloud to bare-metal service providers.

Run only the read replicas on AWS, to retain its reliability. Replication traffic into AWS is ingress, so the bandwidth cost is negligible.
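To make the hybrid idea concrete, here is a minimal sketch of how an application might choose a database endpoint in such a setup. It assumes a bare-metal primary that serves reads and writes, plus a read-only replica on AWS used as a fallback for reads. The class names, hostnames, and the `healthy` flag are all hypothetical and purely illustrative; a real deployment would rely on its database's own replication and failover tooling.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    host: str             # hypothetical hostname, for illustration only
    healthy: bool = True  # assumed to be updated by some external health check


class HybridRouter:
    """Picks a database endpoint for each query in a hybrid setup."""

    def __init__(self, primary: Endpoint, aws_replica: Endpoint):
        self.primary = primary          # bare-metal primary (reads and writes)
        self.aws_replica = aws_replica  # read-only replica kept on AWS

    def pick(self, is_write: bool) -> Endpoint:
        if self.primary.healthy:
            return self.primary
        if is_write:
            # A read replica cannot accept writes; promoting it is a
            # separate, deliberate step that this sketch does not cover.
            raise RuntimeError("primary is down; promote the replica before writing")
        return self.aws_replica


primary = Endpoint("bare-metal-primary", "db1.example.internal")
replica = Endpoint("aws-read-replica", "replica.example.internal")
router = HybridRouter(primary, replica)
```

In this sketch the AWS replica only ever receives replication traffic and, during an outage, read queries, which is what keeps the AWS bill small while still holding a full copy of the data.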

The idea looks convincing, and the CTO points out that their long-term concern about vendor lock-in will also be resolved.

Also, the team was already leveraging automated playbooks (recall Kubernetes?) for server setup, so the effort for the migration turned out to be almost negligible.

Over a couple of weeks, after tie-ups with multiple vendors, you gradually migrate all the services and traffic.
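A gradual traffic migration like this is often done by sending a growing percentage of requests to the new environment. The snippet below is a rough sketch of one way to do that, assuming each request carries a stable key such as a user ID. The function names and the "bare-metal" / "aws" labels are hypothetical, and the post does not say which mechanism was actually used.

```python
import hashlib


def bucket(request_key: str) -> int:
    """Map a key (e.g. a user ID) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_backend(request_key: str, bare_metal_percent: int) -> str:
    """Route a request to bare metal if its bucket falls under the rollout percentage."""
    if not 0 <= bare_metal_percent <= 100:
        raise ValueError("bare_metal_percent must be between 0 and 100")
    return "bare-metal" if bucket(request_key) < bare_metal_percent else "aws"
```

Because the bucket is derived from a hash of the key, a given user stays on the same side as the percentage increases, and once moved to bare metal they are not moved back unless the percentage is lowered.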

Learnings

There’s no need to be taken in by hyper-scale cloud providers and their reliability claims, and pay ~10-100x the cost for everything. Also, moving out of the cloud doesn’t mean taking on the overhead of managing on-prem hardware yourself.

Bare-metal servers can yield 10-50x better performance, in part because there is no virtualization overhead.

It’s wise to put engineering effort into automating server setup playbooks and to depend as little as possible on cloud-provider-specific workflows.