Stripe uses Amazon Web Services to power our infrastructure. With AWS, we can dynamically scale our fleet of servers in real time. This elasticity enables us to reliably serve a rapidly growing user base and scale along with their businesses. We use AWS Reserved Instances, which allow us to predictably forecast our cloud spend given a dynamic fleet with rapidly changing compute requirements.
One of the biggest problems in cloud computing is capacity planning: the ability to forecast your compute power requirements and manage the budget allocated to AWS servers. At Stripe, we started by solely using reserved instances to manage pricing for individual instances, but today we can dynamically and reliably understand costs as our fleet changes over time. Reserved instances allow us to make cost-effective decisions through careful resource management. We’ve developed an easy-to-use framework for automating our purchase decisions, which we’ll outline in this post.
Reserved instances reduce your AWS pricing (since they’re a commitment to use that server). The most economical way to use reserved instances is to make sure server utilization over the year is higher than 70%; this is the break-even point where it’s more economical to choose reserved instances over on-demand instances. This also fits Stripe’s usage patterns.
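The break-even rule can be sketched with simple arithmetic. The snippet below is a minimal illustration, assuming a no-upfront reservation that is billed for every hour of the term and an on-demand instance that is billed only for the hours it runs. The function names and the hourly prices are hypothetical, chosen so the ratio lands on the 70% figure above; real discounts vary by instance type, term, and payment option.

```python
def break_even_utilization(on_demand_hourly: float, reserved_hourly: float) -> float:
    """Fraction of hours an instance must run before reserving is cheaper."""
    if on_demand_hourly <= 0:
        raise ValueError("on_demand_hourly must be positive")
    return reserved_hourly / on_demand_hourly


def cheaper_option(utilization: float, on_demand_hourly: float, reserved_hourly: float) -> str:
    """Compare the average cost per wall-clock hour of both pricing models."""
    on_demand_cost = utilization * on_demand_hourly  # pay only for hours used
    reserved_cost = reserved_hourly                  # pay for every hour of the term
    return "reserved" if reserved_cost < on_demand_cost else "on-demand"


# Hypothetical prices in dollars per hour (not real AWS prices).
ON_DEMAND = 0.10
RESERVED = 0.07

print(f"break-even utilization: {break_even_utilization(ON_DEMAND, RESERVED):.0%}")
print(cheaper_option(0.90, ON_DEMAND, RESERVED))  # busy server
print(cheaper_option(0.50, ON_DEMAND, RESERVED))  # mostly idle server
```

Under these assumed prices, a server that runs 90% of the time is cheaper to reserve, while one that runs 50% of the time is cheaper on demand.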
Reserved instances are hard to purchase effectively. It’s easy to allocate the wrong number, and hard to predict future compute requirements over time. Deciding which and how many reserved instances to buy is a non-trivial exercise at the nexus of cloud strategy, bin packing, and capacity planning.
Understanding AWS Reserved Instances
There are many dimensions to every reserved instance purchase, some of which are out of scope for this post. Some you may already know, like AWS region, VM tenancy, and OS platform. Other options, like contract length, pricing plan, and the type of reserved instance, are related to your company’s cloud strategy. You need to know what your financial plan looks like over the next few years to make these business decisions; the technical guidance that engineers provide can only offer a limited perspective. At Stripe, we typically use no-upfront convertible reserved instances with a three-year term. This means our pricing is:
- No-upfront: We pay monthly on our normal billing cycle.
- Convertible: We can change our instance types for our reservation.
- Term: We lock in a pricing plan and commit to it for three years.
We think this offers the right trade-off between price efficiency and flexibility.
Of the remaining dimensions, the most impactful decision is scope. Scope is the AWS region or availability zone to which a reserved instance is attached. Your choice of scope affects capacity planning, deployment of your reserved instances, and server upgrades. In Stripe’s case, we reserve our instances with a regional scope.
If you choose to scope your reserved instances to a specific availability zone, they are locked to a specific instance type. This requires you to understand and plan your compute requirements in two dimensions:
- The instance type (e.g. c5.2xlarge) defines how powerful each instance should be. This is known as vertical scale, since over time you can upgrade each server’s compute power without growing the number of instances.
- The availability zones are where you plan to deploy instances. Adding more instances across availability zones increases your horizontal scale. The more servers you run, the more likely your application will keep running in case of failure.
Both dimensions require you to predict how your application load will grow and how dense your cluster will be years into the future. Any miscalculation means you’ll pay for reserved instances that you won’t actually use.
Compute power varies by the size of each instance: for example, nine c5.xlarge instances on AWS provide the equivalent compute power of one c5.9xlarge instance.
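The size equivalence can be checked with the normalization factors AWS publishes for instance sizes. The sketch below lists only a subset of those factors, and the helper name `normalized_units` is a hypothetical one introduced for illustration.

```python
# Subset of the AWS instance size normalization factors.
NORMALIZATION_FACTORS = {
    "large": 4.0,
    "xlarge": 8.0,
    "2xlarge": 16.0,
    "4xlarge": 32.0,
    "9xlarge": 72.0,
    "18xlarge": 144.0,
}


def normalized_units(instance_type: str, count: int = 1) -> float:
    """Total normalized compute units for `count` instances of one type."""
    _family, size = instance_type.split(".", 1)
    return NORMALIZATION_FACTORS[size] * count


# Nine c5.xlarge instances carry the same normalized capacity as one c5.9xlarge.
print(normalized_units("c5.xlarge", 9))  # 72.0
print(normalized_units("c5.9xlarge"))    # 72.0
```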
AWS divides its infrastructure into several regions, which include many availability zones. If you choose to scope your reserved instances more broadly by region, AWS allows you to deploy instances of any size, as long as the compute power matches what you’ve reserved. This allows you to purchase high-powered instances up-front and deploy lower-powered instances later on. Even better, AWS will automatically apply the budget you’ve allocated toward reserved instances to as many instances in that region as possible.
Automate your AWS capacity planning
To adopt reserved instances, you first need to estimate your cluster’s total compute requirements. This is the hardest part of capacity planning. AWS defines a scale for the compute power of all its server sizes, which we can use to calculate an aggregate value. (We’ve provided an example of a SQL query that could generate this report below.)
- Take a snapshot of your fleet using the AWS cost and usage report, which is stored in a Redshift table. You should group the usage by instance family.
- Add up the total compute power for each instance family. Each charge in the report includes a scaled usage amount that you should sum up.
- Pick a standard instance size that you’ll use for your reserved instances.
- Divide the total compute capacity by its scaling factor (e.g. xlarge instances have a scaling factor of 8.0).
- The result is the number of reserved instances you’ll purchase. The budget we’ve calculated here should provide sufficient compute power to drive your fleet.
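The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production process: the fleet snapshot is a made-up list of (instance type, count) pairs standing in for the cost and usage report, the function name `reservations_needed` is hypothetical, and the factor table is a subset of the AWS normalization factors. The result is rounded down, and it represents full coverage of the snapshot before any target mix of reserved and on-demand instances is applied.

```python
import math
from collections import defaultdict

# Subset of the AWS instance size normalization factors.
FACTORS = {"large": 4.0, "xlarge": 8.0, "2xlarge": 16.0, "4xlarge": 32.0}


def reservations_needed(fleet, standard_size="xlarge"):
    """Convert a fleet snapshot into a count of standard-size reservations per family.

    `fleet` is an iterable of (instance_type, count) pairs.
    """
    totals = defaultdict(float)
    for instance_type, count in fleet:
        family, size = instance_type.split(".", 1)
        totals[family] += FACTORS[size] * count  # sum normalized usage per family
    per_instance = FACTORS[standard_size]        # scaling factor of the standard size
    return {family: math.floor(units / per_instance) for family, units in totals.items()}


# Hypothetical fleet snapshot.
fleet = [("c5.2xlarge", 10), ("c5.large", 5), ("m5.4xlarge", 3)]
print(reservations_needed(fleet))  # {'c5': 22, 'm5': 12}
```

Here the c5 family adds up to 180 normalized units (160 + 20), which divides into 22 whole xlarge reservations; the m5 family adds up to 96 units, or 12 reservations.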
By choosing regional scope, we naturally define three properties across all our reserved instances: the scope, instance size, and instance family. Once we decide on an exact configuration, we execute a purchase in the AWS console and the reserved instance pricing is instantly applied to our fleet.
Because our fleet can dynamically grow, shrink, or change in compute requirements, we can’t commit to a single fixed target for the number of reserved instances to purchase. Instead, we choose an acceptable range for the mix of reserved and on-demand instances in our fleet.
To automate this, we built an ETL process in SQL and Python that detects when we fall outside this band and automatically prepares a purchase for us to approve. This is an evergreen process: the ETL process will continue to analyze and suggest purchases over time as the fleet dynamically scales up and down in compute requirements. We purchase reserved instances once a month.
Here’s an example of the SQL query we regularly run to estimate our required compute power. First, we take a snapshot of our fleet with the cost and usage report:
    WITH line_items AS (
      SELECT
        -- 8.0 is the scaling factor of the standard size (xlarge)
        lineitem_normalizedusageamount::float / 8.0 AS usage,
        product_region AS region,
        split_part(product_instancetype, '.', 1) AS instance_family,
        lineitem_lineitemtype AS itemtype
      FROM aws.cost_and_usage_201806 -- use your cost & usage report
      WHERE lineitem_productcode = 'AmazonEC2'
        AND lineitem_lineitemtype IN ('Usage', 'DiscountedUsage')
        AND product_instancetype <> ''
        AND lineitem_normalizedusageamount <> ''
        AND date_trunc('hour', lineitem_usagestartdate::timestamp) = date_trunc('day', CURRENT_DATE) - interval '4 days'
    ),
Next, we select relevant data on usage for our existing reserved instances from our fleet’s total usage:
    usage AS (
      SELECT
        region,
        instance_family,
        SUM(usage) AS total,
        SUM(CASE itemtype WHEN 'DiscountedUsage' THEN usage END) AS res
      FROM line_items
      GROUP BY region, instance_family
    )
Finally, we compute the number of additional reserved instances we’ll need to purchase to remain within our acceptable range:
    SELECT
      region,
      instance_family,
      FLOOR(NVL(res, 0)) AS normalized_reservations,
      FLOOR(NVL(total, 0)) AS normalized_usage,
      FLOOR(CASE
        -- NVL treats a family with no reserved usage as 0% covered,
        -- so it is not silently skipped by a NULL comparison
        WHEN NVL(res, 0) / total NOT BETWEEN 0.70 AND 0.80
          THEN 0.75 * total - NVL(res, 0)
        ELSE 0
      END) AS to_purchase
    FROM usage
    ORDER BY region, instance_family
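The banding rule in the final SELECT can also be read as a small function. The Python sketch below mirrors its thresholds (an acceptable band of 70% to 80% reserved coverage and a 75% target); the function name `to_purchase` and its parameter names are hypothetical, and treating missing reserved usage as zero and returning zero for an empty family are assumptions made here for illustration.

```python
import math


def to_purchase(total: float, reserved: float,
                low: float = 0.70, high: float = 0.80, target: float = 0.75) -> int:
    """Normalized reservations to buy so that coverage returns to the target.

    `total` and `reserved` are normalized usage figures for one region and
    instance family, expressed in units of the standard instance size.
    """
    if total <= 0:
        return 0  # assumption: nothing to cover, so nothing to buy
    coverage = reserved / total
    if low <= coverage <= high:
        return 0  # inside the acceptable band: no action
    return math.floor(target * total - reserved)


print(to_purchase(total=100, reserved=60))  # 15: below the band, buy up to 75%
print(to_purchase(total=100, reserved=75))  # 0: inside the band
print(to_purchase(total=100, reserved=90))  # -15: above the band
```

A negative result means the family is already reserved beyond the band, so no purchase is needed for it.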
A complete example, including a Python notebook to render the output, can be found in the accompanying gist for this article.
With this approach, you can automatically budget reserved instances in a predictable manner and dynamically recalculate your compute requirements on an ongoing basis. This process can improve flexibility, cost predictability, and efficiency of your AWS fleet. Here are a few things to keep in mind:
- Pick one team to own this problem. Since this is a global optimization across the engineering organization, no individual team will have the necessary perspective to understand overall AWS requirements. Dedicating one team to this problem empowers them to gather a complete picture of the organization’s cloud usage and understand how to apply reserved instances effectively.
- Pick one standard instance size when purchasing reserved instances. Even if the size you choose is larger than the capacity you expect to use for a single application, it’s easier to compare the same size across instance families and understand pricing and compute efficiency.
- Choose your reserved instances for today’s compute requirements. Rather than choosing reserved instances in anticipation of how you plan to grow your fleet, take a clear snapshot of how you’re using your fleet today. Purchase the number of reserved instances required to meet your goals. Then continue to make purchases frequently and consistently.