It’s 2019, and my team at Shopify has just completed the first scale test ahead of Black Friday to find out how our infra keeps up with twice as much traffic and compute.
We are learning that one of the growing bottlenecks is the size of the Kubernetes cluster that runs our application servers. As we increase the deployment size of the main Rails container to more than 1,000 replicas, the Kubernetes control plane starts to struggle with rolling deploys and with scheduling so many containers.
In one of our conversations with the Google team that works on Kubernetes scalability, we learn about their mental model of that problem.
To quote internal Kubernetes docs:
Scalability dimensions and thresholds are very complex topic. In fact, configurations that Kubernetes supports create Scalability Envelope.
Some the properties of the envelope:
- It's NOT a cube, because dimensions are sometimes not independent.
- It's NOT convex.
- As you move farther along one dimension, your cross-section wrt other dimensions gets smaller.
- It's bounded.
- It's decomposable into smaller envelopes.
In majority of cases, thresholds are NOT hard limits - crossing the limit results in degraded performance and doesn't mean cluster immediately fails over.
It's illustrated by a hypercube:
That mental model of the Scalability Envelope stuck with me because it’s a great way to visualize the many factors that play into the scalability of a system. And it’s especially good for describing that to less technical people, like Program Managers or Solutions Engineers.
Later in my career, I’ve reused that concept to illustrate scalability of other systems I’ve worked on -- so far, it’s mostly been systems at Shopify :)
Working on Commerce Components, I’ve used it internally to illustrate the scalability of the commerce platform as some of the largest customers push it to its limits.
No matter if it’s a Kubernetes cluster or an ecommerce platform, that envelope still maintains its properties:
- It's bounded
- As you move farther along one dimension, your cross-section wrt other dimensions gets smaller
- In majority of cases, thresholds are NOT hard limits - crossing the limit results in degraded performance and doesn't mean the system immediately fails
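To make those three properties concrete, here is a minimal toy sketch in Python. Everything in it is an assumption for illustration: the `Envelope` class, the two dimensions (think replicas and deploys per hour), the numbers, and especially the linear trade-off between dimensions. A real envelope has many more dimensions and need not be linear or even convex.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    """Toy two-dimensional scalability envelope (hypothetical numbers)."""

    max_a: float  # threshold on dimension A when B is near zero
    max_b: float  # threshold on dimension B when A is near zero

    def load(self, a: float, b: float) -> float:
        """Combined load; 1.0 marks the edge of the envelope (bounded)."""
        return a / self.max_a + b / self.max_b

    def headroom_b(self, a: float) -> float:
        """Cross-section: how much of B still fits at a given A."""
        return max(0.0, self.max_b * (1 - a / self.max_a))

    def status(self, a: float, b: float) -> str:
        """Crossing the edge is a soft limit: degraded, not failed."""
        return "ok" if self.load(a, b) <= 1.0 else "degraded"


# Hypothetical dimensions: A = replicas, B = deploys per hour.
envelope = Envelope(max_a=1000, max_b=200)

print(envelope.headroom_b(0))      # 200.0: all of B is available
print(envelope.headroom_b(500))    # 100.0: the cross-section shrinks as A grows
print(envelope.status(500, 100))   # ok: exactly on the edge
print(envelope.status(800, 100))   # degraded: past the edge, but still running
```

The useful part for a conversation with a Program Manager is `headroom_b`: it answers "if we grow this dimension, how much room is left on the other one?" without pretending that either threshold is a hard wall.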
Lately, working on Shopify’s Checkout, I was able to reuse the concept once again because it’s such a great way to illustrate scalability.
More resources:
- Kubecon 2019 talk and slides that introduce the concept
- Kubernetes scalability docs