Capacity Planning Infrastructure for Web Applications

At DrupalCon Amsterdam 2019 I presented a session about capacity planning. In it, I explain how to solve a couple of recurring problems: site launches and user expectations.

Based on these problems, I share some good practices by answering questions such as:

- How to create a good capacity plan?
- How to forecast resource needs and make it sustainable?
- How to automate that process?

Imagine a customer who provides a set of hardware requirements, sets a date, and launches the site, but forgets to warn you that they have sent thousands of emails to half the world announcing the new website. What do you think will happen?

Of course, launching a Drupal site involves a lot of preparation, but there are plenty of guides and Drupal launch readiness checklists out there, so that part is no longer a problem.
What we are really missing here is a plan for capacity.

Capacity, in Site Reliability Engineering, is the maximum amount of output a product deployment can complete in a given period of time.
Capacity planning, on the other hand, is the process of determining the resources needed to meet changing demand.
For web applications like Drupal, we focus mostly on web-serving capacity.
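To make these definitions concrete, here is a minimal sizing sketch. It assumes that capacity can be approximated as requests per second per instance and that some fraction of each instance is kept in reserve as headroom. The function name, the 25% headroom, and the traffic figures are illustrative assumptions, not measurements from the session.

```python
import math


def instances_needed(peak_rps: float, rps_per_instance: float,
                     headroom: float = 0.25) -> int:
    """Estimate how many instances are needed to serve a peak request rate.

    peak_rps:         expected peak demand, in requests per second (assumed)
    rps_per_instance: measured capacity of one instance (assumed)
    headroom:         fraction of each instance kept in reserve (assumed)
    """
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    # Only part of each instance is planned for; the rest absorbs surprises.
    usable_rps = rps_per_instance * (1 - headroom)
    return math.ceil(peak_rps / usable_rps)


# Hypothetical numbers: an ordinary day versus a launch-day email announcement.
normal_day = instances_needed(peak_rps=50, rps_per_instance=40)
launch_day = instances_needed(peak_rps=400, rps_per_instance=40)
print(normal_day, launch_day)
```

With these made-up figures, the same site needs several times more instances on launch day than on a normal day, which is exactly the gap an unannounced email campaign exposes.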

Suppose you are a supermarket manager, and one of your tasks is to manage the cashiers' schedule. The challenge is finding the right number of cashiers to have working at any moment.
If you assign too few, the checkout lines grow long and customers get upset; if you assign too many, you waste money. The trick is finding the right balance.

Now, think of the cashiers as server instances, and the customers as client browsers. Also take into consideration that the supermarket is getting more and more popular.
A seasoned manager will attempt to strike a good balance between keeping customers happy and not spending too much on cashiers:

- Only spend as much as you actually need
- Be ahead of sharp growth
- Avoid emergencies
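One simple way to stay ahead of growth is to fit a trend to past peak traffic and compare the forecast against current capacity. The sketch below uses an ordinary least-squares line, which is a deliberately simple stand-in and not necessarily the method used in the session. The function names, the weekly figures, and the lead time are hypothetical.

```python
def linear_forecast(history: list[float], periods_ahead: int) -> float:
    """Extrapolate a least-squares straight line fitted to past observations."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The last observation sits at x = n - 1; project forward from there.
    return intercept + slope * (n - 1 + periods_ahead)


def should_scale_now(history: list[float], capacity: float,
                     lead_time: int) -> bool:
    """True if forecast demand exceeds capacity by the time new capacity could arrive."""
    return linear_forecast(history, lead_time) > capacity


# Hypothetical weekly peak requests per second, growing steadily.
weekly_peak_rps = [100, 120, 140, 160]
print(linear_forecast(weekly_peak_rps, periods_ahead=2))
print(should_scale_now(weekly_peak_rps, capacity=180, lead_time=2))
```

A straight line will miss sudden spikes such as a launch announcement, so a check like this complements, rather than replaces, knowing about planned events in advance.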

Some of the ideas explored in this session come from the book I co-authored, “Seeking SRE”, specifically chapter 18, “Machine Learning for SRE”. In that chapter, I share a few code examples and guides on using machine learning to support SRE work on forecasting, auto-scaling, and several other problems.

Here is the video: