Saving >70% on cloud costs while maintaining reliability

This is a continuation of my previous post in the CTO <> SRE series (thanks to all of you, it received far more appreciation than I had anticipated). In this post, we're going to discuss a real-life scenario that reduces costs even further.

Your startup is going through a hyper-growth phase, and you're forced to reduce cloud costs. Most of the common strategies don't seem to help.

You come up with a wise insight and manage to reduce cloud infra costs by more than 70% while maintaining the same reliability and improving overall performance.

Context

Your complete infrastructure runs on AWS, and scale is growing at a really fast pace, leading to a multi-fold increase in cloud costs month-on-month.

The cost is evenly distributed across bandwidth, compute, storage, etc. The engineering team already runs everything in a well-optimized manner, so there are no obvious optimizations left in the existing setup.

Your Observations

After reviewing the infrastructure, you realize spot (preemptible) instances are already being used wherever possible, so that option is already out of scope.

Since the usual levers can't reduce the ever-increasing costs, you need to find an out-of-the-box solution.

The primary concern seems to be a dependency on cloud infrastructure.

You recall attending an SRE conference where migrating to bare-metal servers (managed by 3rd parties like IONOS, Hetzner, or any other local solution provider) might lead to significant cost savings.

However, moving out of the cloud might impact reliability and even risk complete loss of data.

You share your thought process with the CTO, and as you finish, a clever idea strikes.

What if you leverage a hybrid approach?

  • Migrate out of the cloud to bare-metal service providers
  • Only run the read replicas on AWS, to leverage its reliability (only ingress to AWS is involved, so bandwidth costs are negligible)
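To keep the hybrid setup honest, the AWS read replicas need to be watched for replication lag against the bare-metal primary. A minimal sketch, assuming PostgreSQL replicas and the psycopg2 driver (the hostname, credentials, and threshold below are placeholders):

import psycopg2

# Placeholder connection string for the AWS read replica.
REPLICA_DSN = "host=replica.aws.internal dbname=app user=monitor password=secret"
MAX_LAG_SECONDS = 30  # hypothetical alert threshold

def replica_lag_seconds() -> float:
    with psycopg2.connect(REPLICA_DSN) as conn, conn.cursor() as cur:
        # On a streaming-replication standby, this is the age of the last
        # transaction replayed from the primary.
        cur.execute(
            "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))"
        )
        (lag,) = cur.fetchone()
        return float(lag or 0.0)

if __name__ == "__main__":
    lag = replica_lag_seconds()
    print(f"replica lag: {lag:.1f}s")
    if lag > MAX_LAG_SECONDS:
        print("replica is falling behind the bare-metal primary, investigate")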

The idea looks convincing, and the CTO highlights that their long-term concern about vendor lock-in will also be resolved.

Also, the team was already leveraging automated playbooks (recall Kubernetes?) for server setup, so the migration effort turned out to be almost negligible.

Over a couple of weeks, after tie-ups with multiple vendors, you manage to migrate all the services and traffic gradually.

Learnings

There's no need to be swayed by hyper-scale cloud providers and their reliability claims, and pay ~10-100x costs for everything. Also, you don't need to manage the overhead of on-prem VMs when moving out of the cloud.

Bare-metal servers can yield 10-50x better performance because there is no virtualization overhead.

It's wise to put engineering effort toward automating server-setup playbooks and having the least dependency on cloud providers' workflows.

Most databases scale much more than we think.

One should understand the internals, not just to scale things, but to take full advantage of the existing resources, and avoid hopping into the dilemma of choosing the best database or scaling strategy.

I'll split this part into pre- and post-development strategies.

Pre-development:

  • Fetch only the required data, and if the data is still large -> paginate (this will lay the foundation for schema design; a small sketch follows this list)
  • Different queries on the same view should avoid searching the same data multiple times (multiple queries on the same table should be avoided)
  • It’s not necessary to fetch all the data for the view in a single query -> break it down into components

Essentially, the UI and DB view can be different
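On the pagination point above, a minimal, self-contained sketch using sqlite3 (the table and columns are made up) showing keyset pagination, i.e. fetching only what the view needs, one page at a time:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders (amount) VALUES (?)",
    [(i * 1.5,) for i in range(1, 101)],
)

PAGE_SIZE = 20

def fetch_page(last_seen_id=0):
    # Keyset pagination (WHERE id > ?) stays cheap even on deep pages,
    # unlike large OFFSETs that the database has to scan past.
    return conn.execute(
        "SELECT id, amount FROM orders WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, PAGE_SIZE),
    ).fetchall()

page = fetch_page()
while page:
    last_id = page[-1][0]   # remember the cursor for the next page
    page = fetch_page(last_id)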

Post-development

  • An optimized query can be scaled at least 1000x compared to an un-optimized one (see the sketch after this list)
  • There will be a lot of gotchas while creating indexes depending upon the type of DB (there will be more and more surprises as you dig in)
  • Going through the codebase to determine the un-optimized parts is not feasible
  • The observability can be broken down into metrics and logs (on the DB level), along with tracing
  • 95% of the time, the profiler/slow query log is enough to find improvements
  • Metrics are required to reduce the search space for inefficient aspects (e.g., is the B-tree index the reason for the occasional spike, rather than the query itself)
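To make the optimized-vs-unoptimized point concrete, here is a minimal sketch using sqlite3 and EXPLAIN QUERY PLAN to check whether a suspect query from the slow log actually uses an index (the table, column, and index names are made up; the same idea applies to EXPLAIN in MySQL or PostgreSQL):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, payload TEXT)")

query = "SELECT payload FROM events WHERE user_id = ?"

def explain(q):
    # Ask the database how it intends to execute the query.
    for row in conn.execute("EXPLAIN QUERY PLAN " + q, (42,)):
        print(row)

explain(query)  # expect a full table scan (e.g. "SCAN events")

conn.execute("CREATE INDEX idx_events_user_id ON events (user_id)")
explain(query)  # expect an index lookup (e.g. "SEARCH events USING INDEX idx_events_user_id")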

Until what point are improvements feasible? As long as there are spikes in resource usage.

One needs to adapt the above factors to the database at hand; again, everything might not hold for every scenario, e.g. OLAP databases.

And yes, you don’t need horizontal scaling; it's needed less often than one might think.

Saving >60% cost on VMs while maintaining reliability

Inspired by Chinmay Naik’s CTO <> SRE series, I would also like to share a real-life scenario.

You’re an SRE responsible for managing a fleet of event ingestion HTTP VM nodes. The CTO wants you to drastically reduce the costs for this infra without compromising the reliability.

You come up with a solution that looks ridiculous at first but makes total sense.

Context

You’re managing a service that ingests events; it requires 50+ resource-intensive VMs and can handle hundreds of thousands of events per second. The service is stateless, processes short-lived jobs, and can sustain sudden failures for a short duration without any loss of data (events are queued on the client’s end as well, until acknowledged).

Everything is hosted on GCP, and you’re bound to the existing region due to latency concerns. The cost of running the infra is in the range of thousands of dollars per month. It’s Q4 of 2023, so you’re running low on budget, and the CTO asks you to reduce the costs drastically.

All this - without any downtime or loss of reliability!

Your approach

You keep thinking, trying to analyze the workloads and all the details you can leverage to reduce the costs.

There must be a way that can fit all the criteria…

You think of a few approaches, but none of them fit with the reliability and cost you want. You’re desperate to write to the CTO that this can’t be done. You can’t reduce the costs, given the constraints.

You decide to sleep on it and send the update tomorrow.

The next morning, while taking a shower, you have a crazy thought. You hurriedly finish the shower and start figuring things out step by step, while still in your bathrobe.

You can’t stop smiling, but you’re not sure whether the CTO will approve your idea.

First thing, you set up a call with the CTO and start explaining your idea of moving the fleet to spot (preemptible) instances… before the CTO can say anything about instances getting preempted, you ask him to hear you out.

You go:

See, I did a lot of thinking and tried many hypotheses, but we’re running a tight ship. The only way to save >60% of the cost in the same region is to migrate to spot instances. I did some paper-napkin math, and I think we can save at least 60% of the GCP infra cost.

Since the numbers look good, you start to explain your thought process:

  • Migrating to k8s wouldn’t solve the preemption concern :P

  • Hosting on a cheaper IaaS platform (BuyVM, E2E, Vultr, Hetzner, etc.) isn’t feasible because of latency (and the cost of spot instances on GCP is similar to VMs on those platforms anyway - roughly 1/6th the price of a standard VM)

  • Migrating to ARM-based VMs could be a feasible option, but they’re not available in the current region yet

So the only way is to make spot instances work. You figure out that traffic behind a GCP ALB (application load balancer) can be distributed across multiple MIGs (managed instance groups) in a weighted round-robin fashion, taking availability into account.

You decide to create 3 MIGs with a 45-45-10 traffic split (a rough savings estimate follows the list):

  • 2 groups of spot instances with different machine types, so if there’s a shortage of one machine type, the other can be scaled up. Even in the case of preemption, only a segment of traffic is impacted until the other groups scale out - keeping an additional buffer in instance utilization compensates until new instances come up.

  • 10% of traffic will still be directed to a standard (on-demand) instance group (this covers the worst case, where both spot machine types get preempted)
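The paper-napkin math behind the >60% claim, as a quick sketch; the prices are normalized placeholders, and the only number carried over from above is that spot instances cost roughly 1/6th of a standard VM:

# Rough savings estimate for the 45-45-10 split.
on_demand_price = 1.0               # normalized hourly price of a standard VM
spot_price = on_demand_price / 6    # spot is roughly 1/6th the price

# 90% of traffic lands on the two spot MIGs, 10% on the on-demand MIG.
blended_price = 0.9 * spot_price + 0.1 * on_demand_price

savings = 1 - blended_price / on_demand_price
print(f"estimated savings: {savings:.0%}")  # ~75% before the extra capacity buffer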

The CTO is happy with the details and your approach. The entire migration took less than a couple of hours, and you end up saving upwards of 60% cost month on month.

Migrating to spot instances was ridiculous at first glance, but considering the details of the problem space, it turned out to be a feasible approach - saving as much as moving out of GCP itself.

Django ORM Queries Tutorial

Django makes it easy to develop well-designed web applications, and by understanding a few methodologies we can make our project development even simpler.

In this tutorial, I’ll explain how to write efficient Django queries and optimise our codebase the Django way!

Examples are based on the following models (I’ve omitted certain fields and parameters to keep the tutorial short and precise):

from django.contrib.auth.models import User
from django.db import models

class Product(models.Model):
    name = models.CharField(max_length=64)
    description = models.CharField(max_length=1024)
    quantity = models.IntegerField()
    seller = models.CharField(max_length=64)

class Item(models.Model):
    product = models.ManyToManyField(Product, related_name='item_product')
    quantity = models.IntegerField()

class Cart(models.Model):
    item = models.ManyToManyField(Item, related_name='cart_item')
    user = models.ForeignKey(User, on_delete=models.CASCADE)

While querying, .values() or .values_list() can be used to fetch values from a related table. For example, to fetch the quantity of Items whose product name is null, while querying from Cart, the following can be used:

Cart.objects.values('item__quantity').filter( item__product__name__isnull=True )

The results returned by filter() can be further optimised using methods like values_list(), values(), defer(), only() & exclude(). To slice a queryset, the following can be used:

Product.objects.all()[5:]

This slicing is handled internally by Django and, unlike Python’s slicing, works at the database level. However, slicing doesn’t work with negative indexing like [-5:]. In that case, we can use the reverse() method in the following way:

Product.objects.all().reverse()[5:]

If we want to check whether entries are present in a queryset, using exists() is more efficient than count() > 0.

Product.objects.filter(name='test product').exists()

To iterate over all the items in the carts, we can use .iterator(), which fetches query results one by one, unlike a plain queryset which caches everything in memory:

for cart in Cart.objects.prefetch_related('item').iterator(chunk_size=100):
    items = cart.item.all()  # served from the prefetch cache, no extra query

In this case, prefetch_related() fetches all the related items in bulk and stores them in a cache, reducing the database lookups that would otherwise happen for every relation in every iteration. (Note that recent Django versions require chunk_size when combining prefetch_related() with iterator().)

A similar method, select_related(), can be used for non-many-to-many relationships (foreign keys and one-to-one fields). More detailed explanations of both methods can be found in the Django documentation.
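For completeness, a small sketch of select_related() on the models above (Cart has a foreign key to User); it performs a SQL join up front instead of issuing a separate query per row:

for cart in Cart.objects.select_related('user'):
    print(cart.user.username)  # no extra query, user was fetched via the join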

Bulk Operations

For updating multiple rows, a bulk update() on the queryset can be used:

Product.objects.filter( name__isnull=True ).update( name='unknown' )

Similarly, bulk_create() can be used in the following way:

Product.objects.bulk_create([
    Product( name='test_product1', description='test description', quantity=5 ),
    Product( name='test_product2', description='test description', quantity=5 ),
])

The F() operator

This operator is used to reference a model field’s value inside a query (similar to getattr()), which is useful when some manipulation of that field itself is required.

Item.objects.filter( quantity__gte=5 ).update( quantity=F('quantity')*2 )

This query doubles the quantity of items whose quantity is greater than or equal to 5.

Aggregate queries

When we want to compute a value across multiple rows at once, aggregate() can be used:

Product.objects.aggregate( products_count=Sum('quantity') )

This query returns the sum of the quantities of all products.

Hammer of Thor: Django Annotate Query

annotate() is one of the most powerful queries in Django and has several versatile use cases. In general, annotate() works like a GROUP BY operator in SQL. To query the number of products sold by every seller, we can write:

Product.objects.values('seller').annotate( products_sold_by_seller=Count('seller') )

For cases in which we require WHEN or IF conditions, we can use annotate() in the following way:

from django.db.models import Case, ExpressionWrapper, F, IntegerField, Sum, When

Product.objects.annotate(
    quantity_sum_square_of_quantity_greater_than_five=Sum(Case(
        When(quantity__gt=5,
            then=ExpressionWrapper(
                F('quantity') * F('quantity'),
                output_field=IntegerField()
            )),
        default=0,
        output_field=IntegerField()
    ))
).values_list('quantity_sum_square_of_quantity_greater_than_five', flat=True)

For each product, this returns the square of its quantity when the quantity is greater than five (and 0 otherwise).

Product.objects.values('name').annotate( sum_of_quantities_of_products_with_same_name=Sum('quantity') )

The simple query above performs a very powerful operation: it groups all the products with the same name and returns the sum of their quantities.

Some caveats while using annotate(): when querying across multiple models (more than 2) joined by many-to-many relationships, and annotating with fields from several of those tables, annotate() returns a row for each relationship, so the number of returned rows multiplies with each join, which may lead to unexpected behaviour. Similarly, combining distinct() with annotate() may not behave the way you expect.