System Design: Capacity Planning basics
An often overlooked yet fundamental part of designing high-performing systems is capacity planning. I personally didn't pay much attention to this topic in the past. I tended to think of it only at a very early stage of designing systems, and assumed it was only valuable for massive-scale systems.
In fact, capacity planning is an incredibly informative exercise. It can help you understand the feasibility of a particular design and can also hint at better options. Approaching it as part of the design process for any considerable change (with "considerable" defined as a function of how the change impacts relative scale) can be tremendously helpful.
That said, I thought I'd share some of the basics of capacity planning for system design.
Clarify the design scope
If you're building a new system from scratch or introducing a change, the usual process starts with understanding what the software is supposed to do. A good place to start is distinguishing functional requirements, the features the system needs, from non-functional requirements, the constraints the system should abide by. An example of a functional requirement is adding video upload to a social media post/feed that only supports pictures. A corresponding non-functional requirement could be that these videos must work over a 2G connection.
Once you understand what the change entails, explore its characteristics. For instance, observe read vs write ratios. Different systems will need different optimisations because of these characteristics. For example, a news feed will likely have more reads than writes and a wellness app that tracks your activity while exercising will probably be the other way around.
Another characteristic that will impact the overall capacity planning exercise is how non-functional requirements affect storage or bandwidth. For instance, for a messaging app, choosing client-side over server-side persistence for messages will drastically reduce storage investment. It will, however, bring its own set of challenges when syncing data across devices.
At this stage, clarifying what matters will support defining the best trade-offs and potentially will simplify implementation.
Estimate a few options
Once you know the scope, you are ready to move to some estimates. At this stage, you likely have a rough idea of which features to pick for estimation. Your initial measure for quantifying capacity is usually an assumed throughput, inferred from a similar feature or another number such as daily active users.
First, come up with a ballpark estimate of how much a request costs in bytes. Rounding up is a good idea to keep figures conservative. Estimates for text should be straightforward, as you can assume a set of fields and their sizes in bytes. For object storage, such as images or videos, think of versioning: for instance, the average size of a thumbnail vs the full-size image, or video in 480p vs 1080p. Note how this interacts with non-functional requirements and which options you have available to fulfil the specification.
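As a sketch of what such a ballpark estimate might look like, here is a toy record-size calculation for a hypothetical social post. The field names and byte sizes are illustrative assumptions, not measurements from any real system:

```python
# Hypothetical back-of-envelope estimate of a social post record size.
# Field sizes are illustrative assumptions.
POST_FIELDS = {
    "post_id": 8,        # 64-bit integer
    "user_id": 8,
    "timestamp": 8,
    "caption": 280,      # assume max-length text is always used
    "media_refs": 64,    # references to object storage, not the media itself
}

def record_size_bytes(fields: dict) -> int:
    """Sum the per-field estimates and round up to keep the figure conservative."""
    raw = sum(fields.values())
    # Round up to the next multiple of 100 bytes for easier downstream maths.
    return ((raw // 100) + 1) * 100

print(record_size_bytes(POST_FIELDS))  # 400 (raw sum is 368 bytes)
```

Rounding up to a "nice" number keeps the later arithmetic simple and biases every downstream estimate towards over-provisioning rather than under-provisioning.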
Once you know how much each record costs, think of bandwidth per operation, and how you can potentially reduce the amount of data sent both ways: client to server and server to client. Especially for reads, you usually have opportunities for savings. For instance, instead of fetching the entire record and all of its fields every time, you can often send just a fraction of the content from the server, for example by truncating content or serving rendered versions. In short, not every operation will need the whole record, so you can factor that in and discount accordingly.
From there on, it is just byte maths and back-of-the-envelope calculations. With the average record size per operation, you can derive estimates and understand the trade-offs between different design options.
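To make the byte maths concrete, here is a minimal sketch under assumed figures: 10M daily active users, 2 writes and 100 reads per user per day, a 400-byte full record, and a smaller truncated payload served on reads. All of these numbers are illustrative:

```python
# Back-of-envelope capacity maths under assumed, illustrative figures.
DAU = 10_000_000            # daily active users
WRITES_PER_USER = 2         # posts written per user per day
READS_PER_USER = 100        # posts read per user per day
RECORD_BYTES = 400          # assumed full record size
READ_PAYLOAD_BYTES = 150    # truncated/preview version served on reads

SECONDS_PER_DAY = 86_400

daily_writes = DAU * WRITES_PER_USER
daily_reads = DAU * READS_PER_USER

storage_per_day_gb = daily_writes * RECORD_BYTES / 1e9
write_qps = daily_writes / SECONDS_PER_DAY
read_qps = daily_reads / SECONDS_PER_DAY
read_bandwidth_mbps = read_qps * READ_PAYLOAD_BYTES * 8 / 1e6

print(f"storage growth: {storage_per_day_gb:.0f} GB/day")
print(f"write QPS: {write_qps:,.0f}, read QPS: {read_qps:,.0f}")
print(f"read bandwidth: {read_bandwidth_mbps:.0f} Mbps")
```

Running the same arithmetic with a different persistence choice or payload size is exactly how you compare design options side by side.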
Seek opportunities for optimisation
Next up, optimisation. A very typical opportunity for storage optimisation is defining data retention. Not all systems need to keep data forever, and some can't because of compliance. Another typical lever is data granularity, for instance, aggregating data by a dimension such as time or user. Especially at large scale, an appropriate data retention policy can drastically improve system performance and save costs.
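A tiny sketch of the granularity idea: roll raw events up into hourly aggregates, after which the raw rows can be discarded once the retention window passes. The event shape and names are made up for illustration:

```python
# Sketch of aggregating raw per-event rows into hourly rollups.
# The event tuples (timestamp, user, count) are illustrative.
from collections import defaultdict
from datetime import datetime

raw_events = [
    ("2024-05-01T10:05:00", "user_a", 3),
    ("2024-05-01T10:42:00", "user_a", 5),
    ("2024-05-01T11:10:00", "user_a", 2),
]

hourly = defaultdict(int)
for ts, user, count in raw_events:
    hour = datetime.fromisoformat(ts).strftime("%Y-%m-%dT%H:00")
    hourly[(user, hour)] += count

# Three raw rows collapse into two hourly rows; at scale the savings compound.
print(dict(hourly))
```

The trade-off is losing per-event detail, which is why clarifying what matters (from the scoping stage) should precede choosing an aggregation dimension.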
Another opportunity to reduce overhead is caching. In read-heavy systems, a small portion of the content usually gets a disproportionate amount of reads, and caching can reduce bandwidth dramatically in such cases. There's a myriad of ways to explore read optimisation, such as lazily loading content based on what's in the viewport, making use of content delivery networks, or incrementally improving video/audio quality over user engagement time. These are very much dependent on non-functional requirements as well.
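The arithmetic behind the caching claim is simple but worth writing down. Assuming the figures below (which are illustrative), only cache misses fall through to the origin:

```python
# Illustrative effect of a cache/CDN hit ratio on origin load.
def origin_qps(total_read_qps: float, cache_hit_ratio: float) -> float:
    """Reads that miss the cache fall through to the origin servers."""
    return total_read_qps * (1.0 - cache_hit_ratio)

# A 90% hit ratio cuts origin read load by roughly 10x.
print(origin_qps(10_000, 0.90))
```

The same function, read in reverse, shows why monitoring the hit ratio matters: if the assumed 90% degrades to 50%, origin load quintuples without any change in user traffic.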
Similarly, you can leverage caching for write-heavy systems as a way to reduce storage capacity needs. For instance, in a time-series event-based architecture, omitting writes where there is no state change. Temperature-reading sensors which push updates on a fixed time interval to a server are an excellent example of that. Temperatures aren't prone to change drastically within a short timespan, so you can cache a sensor/temperature pair and only write to longer-term storage when a change happens.
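The sensor example can be sketched in a few lines: cache the last value per sensor in memory and only persist on change. The names and the in-memory list standing in for a database are illustrative:

```python
# Sketch of write deduplication for sensor readings: keep the last value
# per sensor in a cache and only persist when the value changes.
last_seen: dict = {}
persisted: list = []

def record(sensor_id: str, temperature: float) -> None:
    """Persist the reading only if it differs from the cached last value."""
    if last_seen.get(sensor_id) != temperature:
        last_seen[sensor_id] = temperature
        persisted.append((sensor_id, temperature))  # stand-in for a DB write

for reading in [("s1", 21.0), ("s1", 21.0), ("s1", 21.0), ("s1", 21.5)]:
    record(*reading)

print(len(persisted))  # 2 writes instead of 4
```

One design caveat worth noting: dropping unchanged readings also drops their "still alive at time t" signal, so a real system might persist a heartbeat at a coarser interval alongside the deduplicated values.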
In short, there are many ways to go about capacity optimisation. I believe two key aspects to keep in mind are how optimisation impacts relative capacity, and what you are optimising for.
Be prepared to operate
Much of the truly challenging work in capacity planning, in my opinion, is in operating capacity. Sometimes, the assumptions we base capacity planning on at the design stage turn out to be untrue in practice. That's why being prepared to operate, and thus to continuously plan, is extremely important.
A good practice for detecting misconceptions is encoding some of the assumptions made at the design stage as alerts. Let's say at the design stage you assume the system will have a 20% cache hit ratio, but in practice you see 5%. Even though you will likely have monitoring in place, the original assumption can slip by unnoticed. Monitoring the assumptions made previously gives you the ability to act proactively instead of only finding out when your next infra bill arrives or when an incident happens.
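A minimal sketch of what "an assumption as an alert" might look like, using the cache-hit example. The threshold, tolerance, and function name are all illustrative; in practice this would live in your monitoring system's rule language rather than application code:

```python
# Sketch: encode a design-time assumption as an alert condition.
ASSUMED_CACHE_HIT_RATIO = 0.20
TOLERANCE = 0.50  # fire if observed drifts more than 50% below the assumption

def check_assumption(observed_hit_ratio: float) -> str:
    floor = ASSUMED_CACHE_HIT_RATIO * (1 - TOLERANCE)
    if observed_hit_ratio < floor:
        return (f"ALERT: cache hit ratio {observed_hit_ratio:.0%} "
                f"below assumed floor {floor:.0%}")
    return "OK"

print(check_assumption(0.05))  # observed 5% vs assumed 20%: fires the alert
```

The point is less the mechanism and more the habit: each load-bearing number from the design document gets a watchdog, so a broken assumption surfaces as a page rather than a bill.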
Even though you can rely on data to forecast how load evolves, you can't predict every single scenario. To reduce this risk, you can, on an ongoing basis, use concepts from threat modelling to understand which safeguards are missing and what happens if a certain characteristic or non-functional requirement changes. These exercises help indicate where you might have blind spots.
With this said, capacity planning is a deep and fascinating subject. It can inform design decisions, support system operations, and even help you challenge your initial requirements. Having it as part of the design process can help increase awareness of the trade-offs you make and puts you in a stronger position to approach change.