Tackling Database Resource Utilization

A developers’ guide


Database resource utilisation is frequently overlooked.

In the early stages of a project, when there is neither load nor a large amount of data, there isn’t much sense in hyper-optimising everything. The main focus is building the features needed to launch the application.

When live traffic comes in, performance starts to degrade. Usually some extra indexes get added, but at this stage it’s still easier to add hardware than to “count bytes” and spend a tremendous amount of time measuring, debugging, and fixing performance bottlenecks.

Software engineering is an endless journey of finding the right balance at the right point in time. Remember the famous saying:

“Premature optimization is the root of all evil.” — Donald Knuth

This is always true for writing code and designing systems, and it’s also true for database tuning. Good engineering always considers ROI (return on investment), and until your company becomes “big tech” (200+ engineers), it usually makes more sense to invest time and energy in building product features and in activities that bring more clients, revenue, traffic, etc.

Even when the company has a big cloud budget, you often won’t optimise the database either, because there is always the possibility of scaling up the hardware, which is an efficient and fast way to tackle the problem. In some cases, that genuinely makes more sense. (Again, engineering time is the most precious resource, isn’t it?)

But Why This Article if We Can Just Upgrade Hardware?

At some point, scaling the hardware is no longer feasible. The amount of data and load grows so much that adding hardware becomes either too costly or simply stops helping. For example, a single unoptimised query or a database schema flaw can eat up 100% of 96 vCPUs, the maximum on Google Cloud SQL. History knows such cases.

Looking a bit ahead: query optimisation is part of tackling database resource utilisation. A major part, but usually you cannot get away with only optimising your queries if you have a big distributed application. Some places will probably need to be revisited, some parts rewritten and moved around, and even the DB engine itself may need to be changed.

This article does not aim to be an in-depth performance course across several databases (that would take more than a single article); the purpose is rather to give a high-level overview of what commonly happens inside various database systems and how to mitigate common problems.
Let’s start.

Write-Heavy or Read-Heavy System?

In system design articles and throughout your software engineering experience, you have probably encountered concepts like “write-intensive” or “read-intensive.” What do they mean? Well, it’s pretty simple:

Write-intensive — the system mostly handles writes

Read-intensive — the system mostly handles reads

A classic example of a write-intensive system is an IoT telemetry ingestion pipeline. While it handles millions of writes per second (on large systems), reads happen infrequently, typically in the form of pre-aggregated data (dashboards) or real-time analytics over a sliding time window (monitoring). Neither task induces a heavy read workload, especially compared to the number of writes.
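To make the read side concrete, here is a minimal sketch (Python with pymongo) of a sliding-window aggregation like the one described above. The collection and field names (telemetry, device_id, value, ts) are hypothetical, chosen only for illustration:

```python
from datetime import datetime, timedelta, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
telemetry = client["iot"]["telemetry"]

# Sliding window: only the last five minutes of raw data points.
window_start = datetime.now(timezone.utc) - timedelta(minutes=5)

pipeline = [
    {"$match": {"ts": {"$gte": window_start}}},
    # Collapse millions of raw points into one summary row per device.
    {"$group": {
        "_id": "$device_id",
        "avg_value": {"$avg": "$value"},
        "max_value": {"$max": "$value"},
        "samples": {"$sum": 1},
    }},
]

for row in telemetry.aggregate(pipeline):
    print(row)
```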

For a read-intensive system, imagine a news website: millions of visits per day, hundreds of thousands of customers, and millions of read QPS. However, edits and updates happen very infrequently (how many times per day is content edited on a news website? Tens of times, probably).

A mixed write-/read-intensive system is one where writes and reads are roughly equal and produce comparable load. This is the most complex scenario, involving lots of tradeoffs, dancing around the CAP theorem, and aligning with business stakeholders.

While the concept is very simple (common sense plus predicting usage patterns from the business purpose), it is surprising how often this crucial characterisation step is skipped.

That’s unfortunate, because understanding whether our system is read-heavy or write-heavy directly affects one of the most important parts of our design: database selection.

Select the Database

If you have predictable behaviour, you have the luxury of choosing a specialised database for your needs, which will give you a better load/price ratio.

In the above examples, a news website can use something like Elasticsearch. It has many useful features, such as good read/aggregation performance and index coverage, and it’s usually the go-to solution for full-text search. But everything comes with a price, and Elasticsearch is no different: fast read performance and scalability come at the price of a slow ingestion rate.
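As an illustration, here is a hedged sketch of the kind of full-text query a news site might run, written against the Elasticsearch Python client (8.x-style API); the articles index and its title/body fields are assumptions for the example:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Full-text search across article titles and bodies,
# boosting title matches over body matches.
response = es.search(
    index="articles",
    query={
        "multi_match": {
            "query": "database performance",
            "fields": ["title^2", "body"],
        }
    },
    size=10,
)

for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```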

For write-intensive workloads with occasional queries, nothing performs better than column-based databases like Cassandra or ClickHouse. (Actually, there are many more options than these two, but they are probably the most popular.)
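The pattern these databases reward is batching: many rows per insert rather than one write per row. Below is a minimal sketch using the clickhouse-driver Python package, with a hypothetical events table:

```python
from datetime import datetime

from clickhouse_driver import Client

client = Client("localhost")

client.execute("""
    CREATE TABLE IF NOT EXISTS events (
        ts DateTime,
        device_id UInt32,
        value Float64
    ) ENGINE = MergeTree() ORDER BY (device_id, ts)
""")

# Accumulate rows and flush them in one INSERT: column stores perform
# far better with a few large batches than with many tiny writes.
batch = [(datetime.utcnow(), i % 100, float(i)) for i in range(10_000)]
client.execute("INSERT INTO events (ts, device_id, value) VALUES", batch)
```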

But in most cases, you will probably end up with a mixed write/read workload and the need to balance the two somehow. Then the choice will be a general-purpose database, SQL or NoSQL, such as PostgreSQL or MongoDB.

In this article, we will use MongoDB Atlas monitoring metrics for explanatory purposes. Similar tooling (open source and paid) exists for every popular database.
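As a taste of what such tooling surfaces, many of the same counters Atlas charts can be pulled straight from MongoDB with the serverStatus command. A minimal pymongo sketch (the wiredTiger section assumes the default WiredTiger storage engine):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
status = client.admin.command("serverStatus")

# Operation counters since startup, broken down by type.
print("opcounters:", status["opcounters"])
# Current vs. available connections.
print("connections:", status["connections"])
# Cache pressure (present when the WiredTiger storage engine is used).
print("cache bytes:", status["wiredTiger"]["cache"]["bytes currently in the cache"])
```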

Start With Measurements

If you don’t know where you’re going, any road will take you there. — Lewis Carroll

It’s quite hard to achieve something without quantifiable objectives. “We need the system to work fast” is a fuzzy objective; everybody has their own feeling of what “fast” actually means.

For some systems, having 95% of HTTP requests served in under one second is OK; for others, one millisecond is already too much. First, we need to define what “fast” means for us, in numbers that can be measured.

“(only) What gets measured, gets managed.” — Peter Drucker

An important thing to mention here: the initial target is not set in stone. Start with a 95th percentile of 100ms and aspire to 50ms. It’s a living, constantly changing thing, and there is no shame in revisiting the objectives and moving them up or down. The only thing that matters is the business need. If the business is OK with 100ms, and 50ms would cost four times more in hardware, there is no need to squeeze the system down to 50ms.
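As a worked example, here is a small self-contained sketch that checks a 95th-percentile latency objective against collected samples, using the nearest-rank method; the numbers are made up:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

# Made-up latency samples in milliseconds.
latencies_ms = [12.0, 85.0, 47.0, 230.0, 95.0, 64.0, 101.0, 33.0]

p95 = percentile(latencies_ms, 95)
target_ms = 100.0  # the objective agreed with the business

print(f"p95 = {p95}ms, target = {target_ms}ms, within target: {p95 <= target_ms}")
```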

Three Pillars of Resource Usage

CPU, memory, and disk utilisation (and also network).

A database is an “application” in the broad sense, so at a high level it can be understood the way we understand our own applications: they also use CPU, memory, and, occasionally, disk. To properly tune a database, we need to understand what’s happening inside, at least at a high level. So, how are these three pillars of resource usage allocated? Let’s see.

The CPU

CPU is used in every operation. It’s usually the main metric for DB monitoring and the most descriptive one. Computational work happens inside every database, whether we perform index scans, read data from disk, or write data to it. However, CPU usage heavily depends on the other two pillars, and optimising memory and disk usage can significantly reduce CPU usage. Below, we’ll explain why CPU should be addressed last in the chain.
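As a quick illustration of where that computational work goes, here is a hedged pymongo sketch comparing a query plan before and after adding an index; the collection and field names are hypothetical:

```python
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client["app"]["users"]

# Without an index, the winning plan is a COLLSCAN: every document is
# read and checked, burning CPU (and disk) proportionally to data size.
plan = users.find({"email": "a@example.com"}).explain()
print(plan["queryPlanner"]["winningPlan"])

# With an index, the plan becomes an IXSCAN (+ FETCH), and the CPU
# spent per query drops dramatically.
users.create_index([("email", ASCENDING)])
plan = users.find({"email": "a@example.com"}).explain()
print(plan["queryPlanner"]["winningPlan"])
```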

