
Effectively using AWS Reserved Instances

Ryan Lopopolo on June 26, 2018 in Engineering

Stripe uses Amazon Web Services to power our infrastructure. With AWS, we can dynamically scale our fleet of servers in real-time. This elasticity enables us to reliably serve a rapidly growing user base and scale along with their businesses. We use AWS Reserved Instances, which allow us to predictably forecast our cloud spend given a dynamic fleet with rapidly changing compute requirements.

One of the biggest problems in cloud computing is capacity planning: the ability to forecast your compute power requirements and manage the budget allocated to AWS servers. At Stripe, we started by solely using reserved instances to manage pricing for individual instances, but today we can dynamically and reliably understand costs as our fleet changes over time. Reserved instances allow us to make cost-effective decisions through careful resource management. We’ve developed an easy-to-use framework for automating our purchase decisions, which we’ll outline in this post.

Reserved instances reduce your AWS pricing (since they’re a commitment to use that server). The most economical way to use reserved instances is to make sure server utilization over the year is higher than 70%; this is the break-even point where it’s more economical to choose reserved instances over on-demand instances. This also fits Stripe’s usage patterns.

Reserved instances are hard to purchase effectively. It’s easy to allocate the wrong number, and hard to predict future compute requirements over time. Deciding which and how many reserved instances to buy is a non-trivial exercise at the nexus of cloud strategy, bin packing, and capacity planning.

Understanding AWS Reserved Instances

There are many dimensions to every reserved instance purchase, some of which are out of scope for this post. Some you may already know, like AWS region, VM tenancy, and OS platform. Other options, like contract length, pricing plan, and the type of reserved instance, are related to your company’s cloud strategy. You need to know what your financial plan looks like over the next few years to make these business decisions; the technical guidance that engineers provide can only offer a limited perspective. At Stripe, we typically use no-upfront convertible reserved instances with a three-year term. This means our pricing is:

  • No-upfront: We pay monthly on our normal billing cycle.
  • Convertible: We can change our instance types for our reservation.
  • Term: We lock in a pricing plan and commit to it for three years.

We think this offers the right trade-off between price efficiency and flexibility.

Of the remaining dimensions, the most impactful decision is scope. Scope is the AWS region or availability zone to which a reserved instance is attached. Your choice of scope affects capacity planning, deployment of your reserved instances, and server upgrades. In Stripe’s case, we reserve our instances with a regional scope.

If you choose to scope your reserved instances to a specific availability zone, they are locked to a specific instance type. This requires you to understand and plan your compute requirements in two dimensions:

  • The instance type (e.g. c5.2xlarge) defines how powerful each instance should be. This is known as vertical scale, since over time you can upgrade each server’s compute power without growing the number of instances.
  • The availability zones are where you plan to deploy instances. Adding more instances across availability zones increases your horizontal scale. The more servers you run, the more likely your application will keep running in case of failure.

These require you to predict both how your application load will grow and how dense your cluster will be years into the future. Any miscalculation means you’ll pay for reserved instances that you won’t actually use.

Compute power varies by the size of each instance: for example, nine c5.xlarge instances on AWS provide the equivalent compute power of one c5.9xlarge instance.
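As a rough illustration of that equivalence, here is a minimal Python sketch built on the normalization factors AWS publishes for instance sizes. The table is deliberately partial, and the `normalized_units` helper is a name made up for this example.

```python
# Partial table of AWS normalization factors by instance size.
# Only a few sizes are listed here; extend it for the sizes you run.
NORMALIZATION_FACTOR = {
    "large": 4.0,
    "xlarge": 8.0,
    "2xlarge": 16.0,
    "4xlarge": 32.0,
    "9xlarge": 72.0,
}


def normalized_units(instance_type: str, count: int = 1) -> float:
    """Return the total normalized compute units for `count` instances."""
    _family, size = instance_type.split(".", 1)
    return NORMALIZATION_FACTOR[size] * count


print(normalized_units("c5.xlarge", 9))   # 72.0
print(normalized_units("c5.9xlarge", 1))  # 72.0
```

Both calls yield the same number of units, which is why a regional reservation sized for one can cover the other.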

AWS divides its infrastructure into several regions, which include many availability zones. If you choose to scope your reserved instances more broadly by region, AWS allows you to deploy instances of any size, as long as the compute power matches what you’ve reserved. This allows you to purchase high-powered instances up-front and deploy lower-powered instances later on. Even better, AWS will automatically apply the budget you’ve allocated toward reserved instances to as many instances in that region as possible.

Automate your AWS capacity planning

To adopt reserved instances, you first need to estimate your cluster’s total compute requirements. This is the hardest part of capacity planning. AWS defines a scale for the compute power of all its server sizes: we can use this to calculate an aggregate value. (We’ve provided an example of a SQL query that could generate this report below.)

  1. Take a snapshot of your fleet using the AWS cost and usage report, which is stored in a Redshift table. You should group the usage by instance family.
  2. Add up the total compute power for each instance family. Each charge in the report includes a scaled usage amount that you should sum up.
  3. Pick a standard instance size that you’ll use for your reserved instances.
  4. Divide the total compute capacity by its scaling factor (e.g. xlarge instances have a scaling factor of 8.0).
  5. The result is the number of reserved instances you’ll purchase. The budget we’ve calculated here should provide sufficient compute power to drive your fleet.
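The steps above can be sketched in a few lines of Python. This is an illustration under assumptions, not our production tooling: the input is a hypothetical list of (instance type, normalized usage) pairs standing in for rows of the cost and usage report, the standard size is xlarge (factor 8.0), and rounding up to whole instances is a choice made for this example.

```python
from collections import defaultdict
from math import ceil


def reservations_needed(line_items, standard_factor=8.0):
    """Return the number of standard-size reservations per instance family.

    line_items: iterable of (instance_type, normalized_usage) pairs, where
    normalized_usage is the scaled usage amount taken from the report.
    """
    totals = defaultdict(float)
    for instance_type, normalized_usage in line_items:
        family = instance_type.split(".", 1)[0]  # step 1: group by family
        totals[family] += normalized_usage       # step 2: sum compute power
    # Steps 3-4: divide by the scaling factor of the chosen standard size.
    return {
        family: ceil(total / standard_factor)
        for family, total in totals.items()
    }


# Hypothetical one-hour snapshot of a small fleet.
snapshot = [
    ("c5.xlarge", 8.0),
    ("c5.9xlarge", 72.0),
    ("m5.large", 4.0),
    ("m5.2xlarge", 16.0),
]
print(reservations_needed(snapshot))  # {'c5': 10, 'm5': 3}
```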

By choosing regional scope, we naturally define three properties across all our reserved instances: the scope, instance size, and instance family. Once we decide on an exact configuration, we execute a purchase in the AWS console and the reserved instance pricing is instantly applied to our fleet.

Because our fleet can dynamically grow, shrink, or change in compute requirements, we need to be flexible about how many reserved instances we purchase. Instead of setting a single target number, we choose an acceptable range for the mix of reserved and on-demand instances in our fleet.

To automate this, we built an ETL process in SQL and Python that detects when we fall outside this band and automatically prepares a purchase for us to approve. This is an evergreen process: the ETL process will continue to analyze and suggest purchases over time as the fleet dynamically scales up and down in compute requirements. We purchase reserved instances once a month.
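The band check at the heart of that process might look like the following Python sketch. The 75% target matches the query in this post; the 70% lower bound and the function name are assumptions chosen for illustration.

```python
from math import floor


def to_purchase(total, reserved, lower=0.70, target=0.75):
    """Suggest how many standard-size reservations to buy.

    total and reserved are expressed in units of the standard instance size.
    lower is an assumed lower edge of the acceptable band; target is the
    coverage we buy back up to.
    """
    if total <= 0:
        return 0
    coverage = reserved / total
    if coverage >= lower:
        return 0  # still inside the acceptable band: nothing to buy
    return floor(target * total - reserved)


print(to_purchase(total=100, reserved=60))  # 15
print(to_purchase(total=100, reserved=72))  # 0
```

Using a band rather than a single number avoids proposing a tiny purchase every time the fleet fluctuates slightly.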

Here’s an example of the SQL query we regularly run to estimate our required compute power. First, we take a snapshot of our fleet with the cost and usage report:

WITH line_items AS (
  SELECT
    -- express usage in xlarge units (normalization factor 8.0)
    lineitem_normalizedusageamount::float / 8.0 AS usage,
    product_region AS region,
    split_part(product_instancetype, '.', 1) AS instance_family,
    lineitem_lineitemtype AS itemtype
  FROM aws.cost_and_usage_201806 -- use your cost & usage report
  WHERE lineitem_productcode = 'AmazonEC2'
  AND lineitem_lineitemtype IN ('Usage', 'DiscountedUsage')
  AND product_instancetype <> ''
  AND lineitem_normalizedusageamount <> ''
  AND date_trunc('hour', lineitem_usagestartdate::timestamp) =
    date_trunc('day', CURRENT_DATE) - interval '4 days'
),

Next, we select relevant data on usage for our existing reserved instances from our fleet’s total usage:

usage AS (
  SELECT region, instance_family, SUM(usage) AS total,
  -- 'DiscountedUsage' rows are hours already covered by a reservation
  SUM(CASE itemtype WHEN 'DiscountedUsage' THEN usage END) AS res
  FROM line_items
  GROUP BY region, instance_family
)

Finally, we compute the number of additional reserved instances we’ll need to purchase to remain within our acceptable range:

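SELECT
  region, instance_family,
  FLOOR(NVL(res, 0)) AS normalized_reservations,
  FLOOR(NVL(total, 0)) AS normalized_usage,
  -- Assumption: purchase only when reservations cover less than 75% of usage.
  FLOOR(CASE WHEN NVL(res, 0) < 0.75 * total THEN
    0.75 * total - NVL(res, 0) ELSE 0 END) AS to_purchase
FROM usage
ORDER BY region, instance_family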

A complete example, including a Python notebook to render the output, can be found in the accompanying gist for this article.

Wrapping up

With this approach, you can automatically budget reserved instances in a predictable manner and dynamically recalculate your compute requirements on an ongoing basis. This process can improve flexibility, cost predictability, and efficiency of your AWS fleet. Here are a few things to keep in mind:

  • Pick one team to own this problem. Since this is a global optimization across the engineering organization, no individual team will have the necessary perspective to understand overall AWS requirements. Dedicating one team to this problem empowers them to gather a complete picture of the organization’s cloud usage and understand how to apply reserved instances effectively.
  • Pick one standard instance size when purchasing reserved instances. Even if the size you choose is larger than the capacity you expect to use for a single application, it’s easier to compare the same size across instance families and understand pricing and compute efficiency.
  • Choose your reserved instances for today’s compute requirements. Rather than choosing reserved instances in anticipation of how you plan to grow your fleet, take a clear snapshot of how you’re using your fleet today. Purchase the number of reserved instances required to meet your goals. Then continue to make purchases frequently and consistently.

Like this post? Join the Stripe engineering team. View openings


Learning to operate Kubernetes reliably

Julia Evans on December 20, 2017 in Engineering

We recently built a distributed cron job scheduling system on top of Kubernetes, an exciting new platform for container orchestration. Kubernetes is very popular right now and makes a lot of exciting promises: one of the most exciting is that engineers don’t need to know or care what machines their applications run on.

Distributed systems are really hard, and managing services on distributed systems is one of the hardest problems operations teams face. Breaking in new software in production and learning how to operate it reliably is something we take very seriously. As an example of why learning to operate Kubernetes is important (and why it’s hard!), here’s a fantastic postmortem of a one-hour outage caused by a bug in Kubernetes.

In this post, we’ll explain why we chose to build on top of Kubernetes. We’ll examine how we integrated Kubernetes into our existing infrastructure, our approach to building confidence in (and improving) our Kubernetes cluster’s reliability, and the abstractions we’ve built on top of Kubernetes.

What’s Kubernetes?

Kubernetes is a distributed system for scheduling programs to run in a cluster. You can tell Kubernetes to run five copies of a program, and it’ll dynamically schedule them on your worker nodes. Containers are automatically scheduled to increase utilization and save money, powerful deployment primitives allow you to gradually roll out new code, and Security Contexts and Network Policies allow you to run multi-tenant workloads in a secure way.

Kubernetes has a lot of different kinds of scheduling capabilities built into it. It can schedule long-running HTTP services, daemonsets that run on every machine in your cluster, cron jobs that run every hour, and more. There’s a lot more to Kubernetes. If you want to know more, Kelsey Hightower has given a lot of excellent talks: Kubernetes for sysadmins and healthz: Stop reverse engineering applications and start monitoring from the inside are two nice starting points. There’s also a great, supportive community on Slack.

Why Kubernetes?

Every infrastructure project (hopefully!) starts with a business need, and our goal was to improve the reliability and security of an existing distributed cron job system we had. Our requirements were:

  • We needed to be able to build and operate it with a relatively small team (only 2 people were working full time on the project.)
  • We needed to schedule about 500 different cron jobs across around 20 machines reliably.

Here are a few reasons we decided to build on top of Kubernetes:

  • We wanted to build on top of an existing open-source project.
  • Kubernetes includes a distributed cron job scheduler, so we wouldn’t have to write one ourselves.
  • Kubernetes is a very active project and regularly accepts contributions.
  • Kubernetes is written in Go, which is easy to learn. Almost all of our Kubernetes bugfixes were made by inexperienced Go programmers on our team.
  • If we could successfully operate Kubernetes, we could build on top of Kubernetes in the future (for example, we’re currently working on a Kubernetes-based system to train machine learning models.)

We’d previously been using Chronos as a cron job scheduling system, but it was no longer meeting our reliability requirements and it’s mostly unmaintained (1 commit in the last 9 months, and the last time a pull request was merged was March 2016). Because Chronos is unmaintained, we decided it wasn’t worth continuing to invest in improving our existing cluster.

If you’re considering Kubernetes, keep in mind: don’t use Kubernetes just because other companies are using it. Setting up a reliable cluster takes a huge amount of time, and the business case for using it isn’t always obvious. Invest your time in a smart way.

What does reliable mean?

When it comes to operating services, the word reliable isn’t meaningful on its own. To talk about reliability, you first need to establish an SLO (service level objective).

We had three primary goals:

  1. 99.99% of cron jobs should get scheduled and start running within 20 minutes of their scheduled run time. 20 minutes is a pretty wide window, but we interviewed our internal customers and none of them asked for higher precision.
  2. Jobs should run to completion 99.99% of the time (without being terminated).
  3. Our migration to Kubernetes shouldn’t cause any customer-facing incidents.

This meant a few things:

  • Short periods of downtime in the Kubernetes API are acceptable (if it’s down for ten minutes, it’s ok as long as we can recover within five minutes.)
  • Scheduling bugs (where a cron job run gets dropped completely and fails to run at all) are not acceptable. We took reports of scheduling bugs extremely seriously.
  • We needed to be careful about pod evictions and terminating instances safely so that jobs didn’t get terminated too frequently.
  • We needed a good migration plan.
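To make the first goal concrete, here is a small Python sketch of how the on-time fraction could be measured. The input format (pairs of scheduled and actual start times, with `None` for a run that never started) is an assumption made for this example.

```python
from datetime import datetime, timedelta


def on_time_fraction(runs, window=timedelta(minutes=20)):
    """Fraction of runs that started within `window` of their scheduled time.

    runs: list of (scheduled_at, started_at) pairs; started_at is None when
    the run was dropped and never started.
    """
    if not runs:
        return 1.0
    on_time = sum(
        1
        for scheduled_at, started_at in runs
        if started_at is not None and started_at - scheduled_at <= window
    )
    return on_time / len(runs)


base = datetime(2017, 12, 20, 11, 0)
runs = [
    (base, base + timedelta(minutes=5)),   # on time
    (base, base + timedelta(minutes=25)),  # late
    (base, None),                          # dropped
    (base, base + timedelta(minutes=20)),  # on time, exactly at the limit
]
print(on_time_fraction(runs))  # 0.5
```

A dropped run counts against the objective in the same way as a late one, which matches how seriously we treated scheduling bugs.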

Building a Kubernetes cluster

Our basic approach to setting up our first Kubernetes cluster was to build the cluster from scratch instead of using a tool like kubeadm or kops (using Kubernetes The Hard Way as a reference). We provisioned our configuration with Puppet, our usual configuration management tool. Building from scratch was great for two reasons: we were able to deeply integrate Kubernetes in our architecture, and we developed a deep understanding of its internals.

Building from scratch let us integrate Kubernetes into our existing infrastructure. We wanted seamless integration with our existing systems for logging, certificate management, secrets, network security, monitoring, AWS instance management, deployment, database proxies, internal DNS servers, configuration management, and more. Integrating all those systems sometimes required a little creativity, but overall was easier than trying to shoehorn kubeadm/kops into doing what we wanted.

We already trust and know how to operate all those existing systems, so we wanted to keep using them in our new Kubernetes cluster. For example, secure certificate management is a very hard problem, and we already have a way to issue and manage certificates. We were able to avoid creating a new CA just for Kubernetes with a proper integration.

We were forced to understand exactly how the parameters we were setting affected our Kubernetes setup. For example, there are over a dozen parameters used when configuring the certificates/CAs used for authentication. Understanding all of those parameters made it way easier to debug our setup when we ran into issues with authentication.

Building confidence in Kubernetes

At the beginning of our Kubernetes work, nobody on the team had ever used Kubernetes before (except in some cases for toy projects). How do you get from “None of us have ever used Kubernetes” to “We’re confident running Kubernetes in production”?

Strategy 0: Talk to other companies

We asked a few folks at other companies about their experiences with Kubernetes. They were all using Kubernetes in different ways or in different environments (to run HTTP services, on bare metal, on Google Kubernetes Engine, etc).

Especially when talking about a large and complicated system like Kubernetes, it’s important to think critically about your own use cases, do your own experiments, build confidence in your own environment, and make your own decisions. For example, you should not read this blog post and conclude “Well, Stripe is using Kubernetes successfully, so it will work for us too!”

Here’s what we learned after conversations with several companies operating Kubernetes clusters:

  • Prioritize working on your etcd cluster’s reliability (etcd is where all of your Kubernetes cluster’s state is stored.)
  • Some Kubernetes features are more stable than others, so be cautious of alpha features. Some companies only use stable features after they’ve been stable for more than one release (e.g. if a feature became stable in 1.8, they’d wait for 1.9 or 1.10 before using it.)
  • Consider using a hosted Kubernetes system like GKE/AKS/EKS. Setting up a high-availability Kubernetes system yourself from scratch is a huge amount of work. AWS didn’t have a managed Kubernetes service during this project so this wasn’t an option for us.
  • Be careful about the additional network latency introduced by overlay networks / software defined networking.

Talking to other companies of course didn’t give us a clear answer on whether Kubernetes would work for us, but it did give us questions to ask and things to be cautious about.

Strategy 1: Read the code

We were planning to depend quite heavily on one component of Kubernetes: the cronjob controller. This component was in alpha at the time, which made us a little worried. We’d tried it out in a test cluster, but how could we tell whether it would work for us in production?

Thankfully, all of the cron job controller’s core functionality is just 400 lines of Go. Reading through the source code quickly showed that:

  1. The cron job controller is a stateless service (like every other Kubernetes component, except etcd).
  2. Every ten seconds, this controller calls the syncAll function: go wait.Until(jm.syncAll, 10*time.Second, stopCh)
  3. The syncAll function fetches all cron jobs from the Kubernetes API, iterates through that list, determines which jobs should next run, then starts those jobs.
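The shape of that loop can be sketched in Python. This is a simplified model, not the controller’s actual code: `FakeAPI` is a hypothetical stand-in for the Kubernetes API, and jobs use a fixed interval in seconds instead of cron syntax.

```python
import time
from dataclasses import dataclass


@dataclass
class CronJob:
    name: str
    interval_seconds: int  # simplification: fixed interval, not cron syntax
    last_run: float = 0.0


class FakeAPI:
    """Hypothetical stand-in for the Kubernetes API; it holds all the state."""

    def __init__(self, jobs):
        self.jobs = jobs
        self.started = []

    def list_cron_jobs(self):
        return self.jobs

    def start_job(self, job, now):
        job.last_run = now
        self.started.append(job.name)


def sync_all(api, now):
    """Fetch every cron job and start the ones that are due."""
    for job in api.list_cron_jobs():
        if now - job.last_run >= job.interval_seconds:
            api.start_job(job, now)


def run(api, period=10.0, iterations=3):
    """Call sync_all every `period` seconds; the loop keeps no state itself."""
    for _ in range(iterations):
        sync_all(api, time.time())
        time.sleep(period)


api = FakeAPI([CronJob("hourly", 3600), CronJob("minutely", 60)])
for now in (3600.0, 3610.0, 3660.0):
    sync_all(api, now)
print(api.started)  # ['hourly', 'minutely', 'minutely']
```

Because the loop itself keeps no state, restarting it loses nothing: the next pass simply re-reads everything from the API.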

The core logic seemed relatively easy to understand. More importantly, we felt like if there was a bug in this controller, it was probably something we could fix ourselves.

Strategy 2: Do load testing

Before we started building the cluster in earnest, we did a little bit of load testing. We weren’t worried about how many nodes the Kubernetes cluster could handle (we were planning to deploy around 20 nodes), but we did want to make certain Kubernetes could handle running as many cron jobs as we wanted to run (about 50 per minute).

We ran a test in a 3-node cluster where we created 1,000 cron jobs that each ran every minute. Each of these jobs simply ran bash -c 'echo hello world'. We chose simple jobs because we wanted to test the scheduling and orchestration abilities of the cluster, not the cluster’s total compute capacity.

Our test cluster could not handle 1,000 cron jobs per minute. We observed that every node would only start at most one pod per second, and the cluster was able to run 200 cron jobs per minute without issue. Since we only wanted to run approximately 50 cron jobs per minute, we decided these limits weren’t a blocker (and that we could figure them out later if required). Onwards!

Strategy 3: Prioritize building and testing a high availability etcd cluster

One of the most important things to get right when setting up Kubernetes is running etcd. Etcd is the heart of your Kubernetes cluster—it’s where all of the data about everything in your cluster is stored. Everything other than etcd is stateless. If etcd isn’t running, you can’t make any changes to your Kubernetes cluster (though existing services will continue running!).

This diagram shows how etcd is the heart of your Kubernetes cluster—the API server is a stateless REST/authentication endpoint in front of etcd, and then every other component works by talking to etcd through the API server.

When running etcd, there are two important points to keep in mind:

  • Set up replication so that your cluster doesn’t die if you lose a node. We have three etcd replicas right now.
  • Make sure you have enough I/O bandwidth available. Our version of etcd had an issue where one node with high fsync latency could trigger continuous leader elections, causing unavailability on our cluster. We remediated this by ensuring that all of our nodes had more I/O bandwidth than the number of writes etcd was performing.

Setting up replication isn’t a set-and-forget operation. We carefully tested that we could actually lose an etcd node, and that the cluster gracefully recovered.
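The arithmetic behind replica counts is worth spelling out. etcd uses the Raft consensus protocol, which needs a majority of members to be available, so the number of failures a cluster can survive follows directly from its size. The helper names below are made up for this sketch.

```python
def quorum(replicas: int) -> int:
    """Smallest majority of a cluster with `replicas` members."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many members can be lost while a majority remains."""
    return replicas - quorum(replicas)


for n in (1, 3, 5):
    print(n, quorum(n), tolerated_failures(n))
# 1 1 0
# 3 2 1
# 5 3 2
```

With three replicas, losing one node leaves a working majority of two, which is the scenario we tested.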

Here’s some of the work we did to set up our etcd cluster:

  • Set up replication
  • Monitor that the etcd service is available (if etcd is down, we want to know right away)
  • Write some simple tooling so we could easily spin up new etcd nodes and join them to the cluster
  • Patch etcd’s Consul integration so that we could run more than 1 etcd cluster in our production environment
  • Test recovering from an etcd backup
  • Test that we could rebuild the whole cluster without downtime

We were happy that we did this testing pretty early on. One Friday morning in our production cluster, one of our etcd nodes stopped responding to ping. We got alerted about it, terminated the node, brought up a new one, joined it to the cluster, and in the meantime Kubernetes continued running without incident. Fantastic.

Strategy 4: Incrementally migrate jobs to Kubernetes

One of our major goals was to migrate our jobs to Kubernetes without causing any outages. The secret to running successful production migrations is not to avoid making any mistakes (that’s impossible), but to design your migration to reduce the impact of mistakes.

We were lucky to have a wide variety of jobs to migrate to our new cluster, so there were some low-impact jobs we could migrate where one or two failures were acceptable.

Before starting the migration, we built easy-to-use tooling that would let us move jobs back and forth between the old and new systems in less than five minutes if necessary. This easy tooling reduced the impact of mistakes—if we moved over a job that had a dependency we hadn’t planned for, no big deal! We could just move it back, fix the issue, and try again later.

Here’s the overall migration strategy we took:

  1. Roughly order the jobs in terms of how critical they were
  2. Repeatedly move some jobs over to Kubernetes. If there’s a new edge case we discover, quickly rollback, fix the issue, and try again.
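As an illustration of that loop, here is a hedged Python sketch. The callbacks `move_to_new`, `move_back`, and `is_healthy` are hypothetical hooks standing in for whatever tooling moves and checks a job; the batch size is arbitrary.

```python
def migrate(jobs, move_to_new, move_back, is_healthy, batch_size=2):
    """Move jobs over in order of criticality, rolling back any that fail.

    jobs: list of (name, criticality) pairs; lower numbers are less critical.
    Returns (migrated, rolled_back) lists of job names.
    """
    ordered = sorted(jobs, key=lambda job: job[1])
    migrated, rolled_back = [], []
    for start in range(0, len(ordered), batch_size):
        for name, _criticality in ordered[start:start + batch_size]:
            move_to_new(name)
            if is_healthy(name):
                migrated.append(name)
            else:
                move_back(name)  # fix the issue, then retry later
                rolled_back.append(name)
    return migrated, rolled_back


location = {}
migrated, rolled_back = migrate(
    [("billing", 3), ("cleanup", 1), ("report", 2)],
    move_to_new=lambda name: location.update({name: "kubernetes"}),
    move_back=lambda name: location.update({name: "chronos"}),
    is_healthy=lambda name: name != "report",  # pretend one job has an issue
)
print(migrated)     # ['cleanup', 'billing']
print(rolled_back)  # ['report']
```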

Strategy 5: Investigate Kubernetes bugs (and fix them)

We set out a rule at the beginning of the project: if Kubernetes does something surprising or unexpected, we have to investigate, figure out why, and come up with a remediation.

Investigating each issue is time consuming, but very important. If we simply dismissed flaky and strange behaviour in Kubernetes as a function of how complex distributed systems can become, we’d feel afraid of being on call for the resulting buggy cluster.

After taking this approach, we discovered (and were able to fix!) several bugs in Kubernetes.

Fixing these bugs made us feel much better about our use of the Kubernetes project—not only did it work relatively well, but they also accept patches and have a good PR review process.

Kubernetes definitely has bugs, like all software. In particular, we use the scheduler very heavily (because our cron jobs are constantly creating new pods), and the scheduler’s use of caching sometimes results in bugs, regressions, and crashes. Caching is hard! But the codebase is approachable and we’ve been able to handle the bugs we encountered.

One other issue worth mentioning is Kubernetes’ pod eviction logic. Kubernetes has a component called the node controller which is responsible for evicting pods and moving them to another node if a node becomes unresponsive. It’s possible for all nodes to temporarily become unresponsive (e.g. due to a networking or configuration issue), and in that case Kubernetes can terminate all pods in the cluster. This happened to us relatively early on in our testing.

If you’re running a large Kubernetes cluster, carefully read through the node controller documentation, think through the settings carefully, and test extensively. Every time we’ve tested a configuration change to these settings (e.g. --pod-eviction-timeout) by creating network partitions, surprising things have happened. It’s always better to discover these surprises in testing rather than at 3am in production.

Strategy 6: Intentionally cause Kubernetes cluster issues

We’ve discussed running game day exercises at Stripe before, and it’s something we still do very frequently. The idea is to come up with situations you expect to eventually happen in production (e.g. losing a Kubernetes API server) and then intentionally cause those situations in production (during the work day, with warning) to ensure that you can handle them.

Running several exercises on our cluster often revealed issues like gaps in monitoring or configuration errors. We were very happy to discover those issues early on in a controlled fashion rather than by surprise six months later.

Here are a few of the game day exercises we ran:

  • Terminate one Kubernetes API server
  • Terminate all the Kubernetes API servers and bring them back up (to our surprise, this worked very well)
  • Terminate an etcd node
  • Cut off worker nodes in our Kubernetes cluster from the API servers (so that they can’t communicate). This resulted in all pods on those nodes being moved to other nodes.

We were really pleased to see how well Kubernetes responded to a lot of the disruptions we threw at it. Kubernetes is designed to be resilient to errors: it has one etcd cluster storing all the state, an API server which is simply a REST interface to that database, and a collection of stateless “controllers” that coordinate all cluster management.

If any of the Kubernetes core components (the API server, controller manager, or scheduler) are interrupted or restarted, once they come up they read the relevant state from etcd and continue operating seamlessly. This was one of the things we hoped would be true, and has actually worked very well in practice.

Here are some kinds of issues that we found during these tests:

  • “Weird, I didn’t get paged for that, that really should have paged. Let’s fix our monitoring there.”
  • “When we destroyed our API server instances and brought them back up, they required human intervention. We’d better fix that.”
  • “Sometimes when we do an etcd failover, the API server starts timing out requests until we restart it.”

After running these tests, we developed remediations for the issues we found: we improved monitoring, fixed configuration issues we’d discovered, and filed bugs with Kubernetes.

Making cron jobs easy to use

Let’s briefly explore how we made our Kubernetes-based system easy to use.

Our original goal was to design a system for running cron jobs that our team was confident operating and maintaining. Once we had established our confidence in Kubernetes, we needed to make it easy for our fellow engineers to configure and add new cron jobs. We developed a simple YAML configuration format so that our users didn’t need to understand anything about Kubernetes’ internals to use the system. Here’s the format we developed:

name: job-name-here
kubernetes:
  schedule: '15 */2 * * *'
command:
- ruby
- "/path/to/script.rb"
resources:
  requests:
    cpu: 0.1
    memory: 128M
  limits:
    memory: 1024M

We didn’t do anything very fancy here—we wrote a simple program to take this format and translate it into Kubernetes cron job configurations that we apply with kubectl.
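A translation step of this kind could look roughly like the Python sketch below. It is not our actual program: the input is a flattened dictionary version of the job format, the container image is a placeholder, and `batch/v1` is the current CronJob API version rather than the alpha version available at the time.

```python
import json


def to_cronjob(job: dict, image: str) -> dict:
    """Build a Kubernetes CronJob manifest from a simplified job definition."""
    container = {
        "name": job["name"],
        "image": image,
        "command": job["command"],
        "resources": job.get("resources", {}),
    }
    pod_spec = {"restartPolicy": "Never", "containers": [container]}
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": job["name"]},
        "spec": {
            "schedule": job["schedule"],
            "jobTemplate": {"spec": {"template": {"spec": pod_spec}}},
        },
    }


# Hypothetical job definition, flattened for the sake of the example.
job = {
    "name": "job-name-here",
    "schedule": "15 */2 * * *",
    "command": ["ruby", "/path/to/script.rb"],
    "resources": {
        "requests": {"cpu": "100m", "memory": "128M"},
        "limits": {"memory": "1024M"},
    },
}
manifest = to_cronjob(job, image="example.com/cron-runner:latest")
print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON as well as YAML, so the printed manifest could be saved to a file and applied with `kubectl apply -f`.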

We also wrote a test suite to ensure that job names aren’t too long (Kubernetes cron job names can’t be more than 52 characters) and that all names are unique. We don’t currently use cgroups to enforce memory limits on most of our jobs, but it’s something we plan to roll out in the future.
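The two checks described above fit in a few lines of Python. This is a sketch of the idea rather than our test suite; the function name and error messages are made up.

```python
from collections import Counter

MAX_CRONJOB_NAME_LENGTH = 52  # limit mentioned above


def validate_job_names(names):
    """Return a list of human-readable problems; empty means all names pass."""
    errors = []
    for name in names:
        if len(name) > MAX_CRONJOB_NAME_LENGTH:
            errors.append(
                f"{name!r} is {len(name)} characters "
                f"(max {MAX_CRONJOB_NAME_LENGTH})"
            )
    for name, count in Counter(names).items():
        if count > 1:
            errors.append(f"{name!r} is defined {count} times")
    return errors


print(validate_job_names(["nightly-report", "cleanup"]))  # []
print(len(validate_job_names(["cleanup", "cleanup", "x" * 53])))  # 2
```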

Our simple format was easy to use, and since we automatically generated both Chronos and Kubernetes cron job definitions from the same format, moving a job between either system was really easy. This was a key part of making our incremental migration work well. Whenever moving a job to Kubernetes caused issues, we could move it back with a simple three-line configuration change and in less than ten minutes.

Monitoring Kubernetes

Monitoring our Kubernetes cluster’s internal state has proven to be very pleasant. We use the kube-state-metrics package for monitoring and a small Go program called veneur-prometheus to scrape the Prometheus metrics kube-state-metrics emits and publish them as statsd metrics to our monitoring system.

For example, here’s a chart of the number of pending pods in our cluster over the last hour. Pending means that they’re waiting to be assigned a worker node to run on. You can see that the number spikes at 11am, because a lot of our cron jobs run at the 0th minute of the hour.

An example chart showing pending pods in a cluster over the last hour

We also have a monitor that checks that no pods are stuck in the Pending state—we check that every pod starts running on a worker node within 5 minutes, or we otherwise receive an alert.
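The logic of such a monitor is simple enough to sketch in Python. The pod records here are plain dictionaries invented for the example; a real monitor would read the equivalent fields from its metrics or from the Kubernetes API.

```python
from datetime import datetime, timedelta

PENDING_LIMIT = timedelta(minutes=5)


def stuck_pods(pods, now):
    """Return names of pods that have been Pending longer than the limit.

    pods: iterable of dicts with "name", "phase", and "created_at" keys.
    """
    return [
        pod["name"]
        for pod in pods
        if pod["phase"] == "Pending" and now - pod["created_at"] > PENDING_LIMIT
    ]


now = datetime(2017, 12, 20, 11, 10)
pods = [
    {"name": "job-a", "phase": "Pending", "created_at": now - timedelta(minutes=7)},
    {"name": "job-b", "phase": "Pending", "created_at": now - timedelta(minutes=1)},
    {"name": "job-c", "phase": "Running", "created_at": now - timedelta(minutes=30)},
]
print(stuck_pods(pods, now))  # ['job-a']
```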

Future plans for Kubernetes

Setting up Kubernetes, getting to a place where we were comfortable running production code, and migrating all our cron jobs to the new cluster took us five months with three engineers working full time. One big reason we invested in learning Kubernetes is that we expect to be able to use Kubernetes more widely at Stripe.

Here are some principles that apply to operating Kubernetes (or any other complex distributed system):

  • Define a clear business reason for your Kubernetes projects (and all infrastructure projects!). Understanding the business case and the needs of our users made our project significantly easier.
  • Aggressively cut scope. We decided to avoid using many of Kubernetes’ basic features to simplify our cluster. This let us ship more quickly—for example, since pod-to-pod networking wasn’t a requirement for our project, we could firewall off all network connections between nodes and defer thinking about network security in Kubernetes to a future project.
  • Invest a lot of time into learning how to properly operate a Kubernetes cluster. Test edge cases carefully. Distributed systems are extremely complicated and there’s a lot of potential for things to go wrong. Take the example we described earlier: the node controller can kill all pods in your cluster if they lose contact with API servers, depending on your configuration. Learning how Kubernetes behaves after each configuration change takes time and careful focus.

By staying focused on these principles, we’ve been able to use Kubernetes in production with confidence. We’ll continue to grow and evolve how we use Kubernetes over time—for example, we’re watching AWS’s release of EKS with interest. We’re finishing work on another system to train machine learning models and are also exploring moving some HTTP services to Kubernetes. As we continue operating Kubernetes in production, we plan to contribute to the open-source project along the way.

Like this post? Join the Stripe engineering team. View Openings

December 20, 2017

APIs as infrastructure: future-proofing Stripe with versioning

Brandur Leach on August 15, 2017 in Engineering

When it comes to APIs, change isn’t popular. While software developers are used to iterating quickly and often, API developers lose that flexibility as soon as even one user starts consuming their interface. Many of us are familiar with how the Unix operating system evolved. In 1994, The Unix-Haters Handbook was published containing a long list of missives about the software—everything from overly-cryptic command names that were optimized for Teletype machines, to irreversible file deletion, to unintuitive programs with far too many options. Over twenty years later, an overwhelming majority of these complaints are still valid even across the dozens of modern derivatives. Unix had become so widely used that changing its behavior would have challenging implications. For better or worse, it established a contract with its users that defined how Unix interfaces behave.

Similarly, an API represents a contract for communication that can’t be changed without considerable cooperation and effort. Because so many businesses rely on Stripe as infrastructure, we’ve been thinking about these contracts since Stripe started. To date, we’ve maintained compatibility with every version of our API since the company’s inception in 2011. In this article, we’d like to share how we manage API versions at Stripe.

Code written to integrate with an API has certain inherent expectations built into it. If an endpoint returns a boolean field called verified to indicate the status of a bank account, a user might write code like this:

if bank_account[:verified]
  # ...
end

If we later replaced the bank account’s verified boolean with a status field that might include the value verified (like we did back in 2014), the code would break because it depends on a field that no longer exists. This type of change is backwards-incompatible, and we avoid making such changes. Fields that were present before should stay present, and fields should always preserve their type and name. Not all changes are backwards-incompatible, though; for example, it’s safe to add a new API endpoint, or to add a new field to an existing API endpoint.
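The rule above can be checked mechanically. Here is a minimal sketch in JavaScript, assuming a response schema is described as a plain object that maps field names to type names. The findBreakingChanges helper and the schema shape are hypothetical and only illustrate the rule:

```javascript
// Compares two hypothetical schemas of the form { fieldName: "typeName" }.
// Returns a list of backwards-incompatible differences (empty if the change is safe).
const findBreakingChanges = (oldSchema, newSchema) => {
  const problems = [];
  for (const [name, type] of Object.entries(oldSchema)) {
    if (!Object.prototype.hasOwnProperty.call(newSchema, name)) {
      // A field that existed before has disappeared.
      problems.push(`removed field: ${name}`);
    } else if (newSchema[name] !== type) {
      // A field kept its name but changed its type.
      problems.push(`changed type: ${name} (${type} -> ${newSchema[name]})`);
    }
  }
  // Fields that only exist in newSchema are additions and are always safe.
  return problems;
};

// Replacing `verified` with `status` is reported as a breaking change...
console.log(findBreakingChanges({ verified: "boolean" }, { status: "string" }));
// ...while adding `status` alongside `verified` is not.
console.log(
  findBreakingChanges(
    { verified: "boolean" },
    { verified: "boolean", status: "string" }
  )
);
```

A check like this only covers field names and types; changes in meaning or behavior still need a human review.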

With enough coordination, we might be able to keep users apprised of changes that we’re about to make and have them update their integrations, but even if that were possible, it wouldn’t be very user-friendly. Like a connected power grid or water supply, after hooking it up, an API should run without interruption for as long as possible.

Our mission at Stripe is to provide the economic infrastructure for the internet. Just like a power company shouldn’t change its voltage every two years, we believe that our users should be able to trust that a web API will be as stable as possible.

API versioning schemes

A common approach to allow forward progress in web APIs is to use versioning. Users specify a version when they make requests and API providers can make the changes they want for their next version while maintaining compatibility in the current one. As new versions are released, users can upgrade when it’s convenient for them.

This is often seen as a major versioning scheme with names like v1, v2, and v3 that are passed as a prefix to a URL (like /v1/widgets) or through an HTTP header like Accept. This can work, but has the major downside of changes between versions being so big and so impactful for users that it’s almost as painful as re-integrating from scratch. It’s also not a clear win because there will be a class of users that are unwilling or unable to upgrade and get trapped on old API versions. Providers then have to make the difficult choice between retiring API versions and by extension cutting those users off, or maintaining the old versions forever at considerable cost. While having providers maintain old versions might seem at first glance to be beneficial to users, they’re also paying indirectly in the form of reduced progress on improvements. Instead of working on new features, engineering time is diverted to maintaining old code.

At Stripe, we implement versioning with rolling versions that are named with the date they’re released (for example, 2017-05-24). Although each version is backwards-incompatible, it contains only a small set of changes, which makes incremental upgrades relatively easy so that integrations can stay current.

The first time a user makes an API request, their account is automatically pinned to the most recent version available, and from then on, every API call they make is assigned that version implicitly. This approach guarantees that users don’t accidentally receive a breaking change and makes initial integration less painful by reducing the amount of necessary configuration. Users can override the version of any single request by manually setting the Stripe-Version header, or upgrade their account’s pinned version from Stripe’s dashboard.

Some readers might have already noticed that the Stripe API also defines major versions using a prefixed path (like /v1/charges). Although we reserve the right to make use of this at some point, it’s not likely to change for some time. As noted above, major version changes tend to make upgrades painful, and it’s hard for us to imagine an API redesign that’s important enough to justify this level of user impact. Our current approach has been sufficient for almost a hundred backwards-incompatible upgrades over the past six years.

Versioning under the hood

Versioning is always a compromise between improving developer experience and the additional burden of maintaining old versions. We strive to achieve the former while minimizing the cost of the latter, and have implemented a versioning system to help us with it. Let’s take a quick look at how it works. Every possible response from the Stripe API is codified by a class that we call an API resource. API resources define their possible fields using a DSL:

class ChargeAPIResource
  required :id, String
  required :amount, Integer
end

API resources are written so that the structure they describe is what we’d expect back from the current version of the API. When we need to make a backwards-incompatible change, we encapsulate it in a version change module which defines documentation about the change, a transformation, and the set of API resource types that are eligible to be modified:

class CollapseEventRequest < AbstractVersionChange
  description \
    "Event objects (and webhooks) will now render " \
    "`request` subobject that contains a request ID " \
    "and idempotency key instead of just a string " \
    "request ID."

  response EventAPIResource do
    change :request, type_old: String, type_new: Hash

    # Convert the current format (a subobject) back to the old one (a string ID)
    run do |data|
      data.merge(:request => data[:request][:id])
    end
  end
end

Elsewhere, version changes are assigned to a corresponding API version in a master list:

class VersionChanges
  VERSIONS = {
    '2017-05-25' => [ # ...
    ],
    '2017-04-06' => [Change::LegacyTransfers],
    '2017-02-14' => [ # ...
    ],
    '2017-01-27' => [Change::SourcedTransfersOnBts],
  }
end

Version changes are written so that they expect to be automatically applied backwards from the current API version and in order. Each version change assumes that although newer changes may exist in front of them, the data they receive will look the same as when they were originally written.

When generating a response, the API initially formats data by describing an API resource at the current version, then determines a target API version from one of:

  • A Stripe-Version header if one was supplied.
  • The version of an authorized OAuth application if the request is made on the user’s behalf.
  • The user’s pinned version, which is set on their very first request to Stripe.

It then walks back through time and applies each version change module that it finds along the way until that target version is reached.

Requests are processed by version change modules before returning a response.

Version change modules keep older API versions abstracted out of core code paths. Developers can largely avoid thinking about them while they’re building new products.
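To make the walk concrete, here is a minimal sketch in JavaScript rather than the Ruby framework described above. Everything in it is an assumption for illustration: the VERSION_CHANGES registry shape, the resolveTargetVersion and renderForVersion names, placing CollapseEventRequest in 2017-05-25, and treating the three version sources as a priority order.

```javascript
// Hypothetical registry, newest version first. Each change converts data from
// the format introduced by that version back to the format that preceded it.
const VERSION_CHANGES = [
  {
    version: "2017-05-25",
    changes: [
      {
        name: "CollapseEventRequest",
        resource: "event",
        // New format: request is a subobject. Old format: a string request ID.
        run: (data) => ({ ...data, request: data.request.id }),
      },
    ],
  },
  { version: "2017-04-06", changes: [] },
];

// Assumed precedence: explicit header, then OAuth app version, then pinned version.
const resolveTargetVersion = ({ header, oauthAppVersion, pinnedVersion }) =>
  header || oauthAppVersion || pinnedVersion;

// Start from data formatted for the current version and walk backwards,
// applying every change that is newer than the target version.
const renderForVersion = (resourceType, data, targetVersion) => {
  let result = data;
  for (const { version, changes } of VERSION_CHANGES) {
    // ISO-style dates sort correctly when compared as strings.
    if (version <= targetVersion) break;
    for (const change of changes) {
      if (change.resource === resourceType) result = change.run(result);
    }
  }
  return result;
};
```

With this shape, a user pinned to 2017-04-06 receives request as a plain string, while a user on 2017-05-25 receives the subobject untouched. Core code only ever produces the current format.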

Changes with side effects

Most of our backwards-incompatible API changes will modify a response, but that’s not always the case. Sometimes a more complicated change is necessary which leaks out of the module that defines it. We assign these modules a has_side_effects annotation and the transformation they define becomes a no-op:

class LegacyTransfers < AbstractVersionChange
  description "..."
  has_side_effects
end

Elsewhere in the code a check will be made to see whether they’re active:

This reduced encapsulation makes changes with side effects more complex to maintain, so we try to avoid them.
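As a rough illustration of what such a check involves, here is a small JavaScript sketch. The isActive helper and the lookup table are hypothetical; the only detail taken from the list above is that LegacyTransfers belongs to 2017-04-06. A change counts as active when the request targets a version older than the one that introduced it, meaning core code still has to produce the legacy behavior.

```javascript
// Hypothetical lookup: the API version in which each change was introduced.
const CHANGE_INTRODUCED_IN = {
  LegacyTransfers: "2017-04-06",
};

// A change is "active" for a request when the request's target version predates
// the change, so the old behavior must still be honored for that request.
const isActive = (changeName, targetVersion) => {
  const introducedIn = CHANGE_INTRODUCED_IN[changeName];
  // Unknown changes are treated as inactive; ISO-style dates compare as strings.
  return introducedIn !== undefined && targetVersion < introducedIn;
};

console.log(isActive("LegacyTransfers", "2017-02-14")); // request predates the change
console.log(isActive("LegacyTransfers", "2017-05-25")); // request is newer than the change
```

Every call site like this is a place where old behavior leaks into core code, which is why these changes cost more to maintain.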

Declarative changes

One advantage of self-contained version change modules is that they can declare documentation describing what fields and resources they affect. We can also reuse this to rapidly provide more helpful information to our users. For example, our API changelog is programmatically generated and receives updates as soon as our services are deployed with a new version.
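As an illustration, a changelog generator over that declared metadata can be very small. The following JavaScript sketch is not Stripe’s implementation: the data shape and the renderChangelog name are hypothetical, and the descriptions are abbreviated.

```javascript
// Hypothetical metadata, shaped like what version change modules declare.
const declaredVersions = [
  {
    version: "2017-05-25",
    changes: [
      {
        name: "CollapseEventRequest",
        description: "Event objects now render a `request` subobject instead of a string request ID.",
      },
    ],
  },
  {
    version: "2017-04-06",
    changes: [{ name: "LegacyTransfers", description: "..." }],
  },
];

// Renders one section per version, with one bullet per declared change.
const renderChangelog = (versions) =>
  versions
    .map(({ version, changes }) =>
      [version, ...changes.map((change) => `  - ${change.description}`)].join("\n")
    )
    .join("\n\n");

console.log(renderChangelog(declaredVersions));
```

Because the text comes from the same modules that perform the transformation, the changelog cannot drift from what the API actually does.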

We also tailor our API reference documentation to specific users. It notices who is logged in and annotates fields based on their account API version. Here, we’re warning the developer that there’s been a backwards-incompatible change in the API since their pinned version. The request field of Event was previously a string, but is now a subobject that also contains an idempotency key (produced by the version change that we showed above):

Screenshot of a tooltip in the Stripe API documentation indicating API changes made since the user’s current version

Our documentation detects the user’s API version and presents relevant warnings.

Minimizing change

Providing extensive backwards compatibility isn’t free; every new version is more code to understand and maintain. We try to keep what we write as clean as possible, but given enough time, dozens of checks on version changes that can’t be encapsulated cleanly will be littered throughout the project, making it slower, less readable, and more brittle. We take a few measures to try and avoid incurring this sort of expensive technical debt.

Even with our versioning system available, we do as much as we can to avoid using it by trying to get the design of our APIs right the first time. Outgoing changes are funneled through a lightweight API review process where they’re written up in a brief supporting document and submitted to a mailing list. This gives each proposed change broader visibility throughout the company, and improves the likelihood that we’ll catch errors and inconsistencies before they’re released.

We try to be mindful of balancing stagnation and leverage. Maintaining compatibility is important, but even so, we expect to eventually start retiring our older API versions. Helping users move to newer versions of the API gives them access to new features, and simplifies the foundation that we use to build new features.

Principles of change

The combination of rolling versions and an internal framework to support them has enabled us to onboard vast numbers of users and make enormous changes to our API, all while having minimal impact on existing integrations. The approach is driven by a few principles that we’ve picked up over the years. We think it’s important that API upgrades are:

  • Lightweight. Make upgrades as cheap as possible (for users and for ourselves).
  • First-class. Make versioning a first-class concept in your API so that it can be used to keep documentation and tooling accurate and up-to-date, and to generate a changelog automatically.
  • Fixed-cost. Ensure that old versions add only minimal maintenance cost by tightly encapsulating them in version change modules. Put another way, the less thought that needs to be applied towards old behavior while writing new code, the better.

While we’re excited by the debate and developments around REST vs. GraphQL vs. gRPC, and—more broadly—what the future of web APIs will look like, we expect to continue supporting versioning schemes for a long time to come.


August 15, 2017

Connect: behind the front-end experience

Benjamin De Cock on June 19, 2017 in Engineering

We recently released a new and improved version of Connect, our suite of tools designed for platforms and marketplaces. Stripe’s design team works hard to create unique landing pages that tell a story for our major products. For this release, we designed Connect’s landing page to reflect its intricate, cutting-edge capabilities while keeping things light and simple on the surface.

In this blog post, we’ll describe how we used several next-generation web technologies to bring Connect to life, and walk through some of the finer technical details (and excitement!) on our front-end journey.

CSS Grid Layout

Earlier this year, three major browsers (Firefox, Chrome, and Safari) almost simultaneously shipped their implementation of the new CSS Grid Layout module. This specification provides authors with a two-dimensional layout system that is easy-to-use and incredibly powerful. Connect’s landing page relies on CSS grids pretty much everywhere, making some seemingly tricky designs almost trivial to achieve. As an example, let’s hide the header’s content and focus on its background:

Historically, we’ve created these background stripes (as we obviously call them) by using absolute positioning to precisely place each stripe on the page. This approach works, but fragile positioning often results in subtle issues: for example, rounding errors can cause a 1px gap between two stripes. CSS stylesheets also quickly become verbose and hard to maintain, since media queries need to be more complex to account for background differences at various viewport sizes.

With CSS Grid, pretty much all our previous issues go away. We simply define a flexible grid and place the stripes in their appropriate cells. Firefox has a handy grid inspector allowing you to visualize the structure of your layout. Let’s see how it looks:

We’ve highlighted three stripes and removed the tilt effect to make things easier to understand. Here’s what the CSS for our grid looks like:

header .stripes {
  display: grid;
  grid: repeat(5, 200px) / repeat(10, 1fr);
}

header .stripes :nth-child(1) {
  grid-column: span 3;
}

header .stripes :nth-child(2) {
  grid-area: 3 / span 3 / auto / -1;
}

header .stripes :nth-child(3) {
  grid-row: 4;
  grid-column: span 5;
}

We can then just transform the entire .stripes container to produce the tilted background:

header .stripes {
  transform: skewY(-12deg);
  transform-origin: 0;
}

And voilà! CSS Grid might look intimidating at first sight as it comes with an unusual syntax and many new properties and values, but the mental model is actually very simple. And if you’re used to Flexbox, you’re already familiar with the Box Alignment module, which means you can reuse the properties you know and love such as justify-content and align-items.

CSS 3D

The landing page’s header displays several cubes as a visual metaphor for the building blocks that compose Connect. These floating cubes rotate in 3D at random speeds (within a certain range) and benefit from the same light source, which dynamically illuminates the appropriate faces:

These cubes are simple DOM elements that are generated and animated in JavaScript. Each of them instantiates the same HTML template:

<!-- HTML -->
<template id="cube-template">
  <div class="cube">
    <div class="shadow"></div>
    <div class="sides">
      <div class="back"></div>
      <div class="top"></div>
      <div class="left"></div>
      <div class="front"></div>
      <div class="right"></div>
      <div class="bottom"></div>
    </div>
  </div>
</template>

// JavaScript
const createCube = () => {
  const template = document.getElementById("cube-template");
  // Deep-clone the template content so every cube gets its own DOM nodes
  const fragment = document.importNode(template.content, true);
  return fragment;
};

Pretty straightforward. We can now easily turn these blank and empty elements into a three-dimensional shape. Thanks to 3D transforms, adding perspective and moving the sides along the z-axis is fairly natural:

.cube, .cube * {
  position: absolute;
  width: 100px;
  height: 100px
}

.sides {
  transform-style: preserve-3d;
  perspective: 600px
}

.front  { transform: rotateY(0deg)    translateZ(50px) }
.back   { transform: rotateY(-180deg) translateZ(50px) }
.left   { transform: rotateY(-90deg)  translateZ(50px) }
.right  { transform: rotateY(90deg)   translateZ(50px) }
.top    { transform: rotateX(90deg)   translateZ(50px) }
.bottom { transform: rotateX(-90deg)  translateZ(50px) }

While CSS makes it trivial to model the cube, it doesn’t provide advanced animation features like dynamic shading. The cube’s animation instead relies on requestAnimationFrame to calculate and update each side at any point in the rotation. There are three things to determine on every frame:

  • Visibility. There are never more than three faces visible at the same time, so we can avoid any computations and expensive repaints for hidden sides.
  • Transformation. Each visible side of the cube needs to be transformed based on its initial rotation, current animation state, and the speed of each axis.
  • Shading. While CSS lets you position elements in a three-dimensional space, there are no traditional concepts from 3D environments (e.g. light sources). In order to mimic a 3D environment, we can render a light source by progressively darkening the sides of the cube as they move away from a particular point.

There are other considerations to take into account (such as improving performance using requestIdleCallback in JavaScript and backface-visibility in CSS), but these are the main pillars behind the logic of the animation.

We can calculate the visibility and transformation of each side by continually tracking their state and updating them with basic math operations. With the help of pure functions and ES2015 features such as template literals, things become even easier. Here are two short excerpts of JavaScript code to compute and define the current transformation:

const getDistance = (state, rotate) =>
  ["x", "y"].reduce((object, axis) => {
    object[axis] = Math.abs(state[axis] + rotate[axis]);
    return object;
  }, {});

const getRotation = (state, size, rotate) => {
  const axis = rotate.x ? "Z" : "Y";
  const direction = rotate.x > 0 ? -1 : 1;

  return `
    rotateX(${state.x + rotate.x}deg)
    rotate${axis}(${direction * (state.y + rotate.y)}deg)
    translateZ(${size / 2}px)
  `;
};

The most challenging piece of the puzzle is how to properly calculate shading for each face of the cube. In order to simulate a virtual light source at the center of the stage, we can gradually increase each face’s lighting effect as it approaches the center point on all axes. Concretely, that means we need to calculate the luminosity and color for each face. We’ll perform this calculation on every frame by interpolating the base color and the current shading factor.

// Linear interpolation between a and b
// Example: (100, 200, .5) = 150
const interpolate = (a, b, i) => a * (1 - i) + b * i;

const getShading = (tint, rotate, distance) => {
  // Map each axis distance (in degrees) to a darkening factor between 0 and 1
  const darken = ["x", "y"].reduce((object, axis) => {
    const delta = distance[axis];
    const ratio = delta / 180;
    object[axis] = delta > 180 ? Math.abs(2 - ratio) : ratio;
    return object;
  }, {});

  if (rotate.x)
    darken.y = 0;
  else {
    const {x} = distance;
    if (x > 90 && x < 270) {
      ["x", "y"].forEach(axis => { darken[axis] = 1 - darken[axis]; });
    }
  }

  const alpha = (darken.x + darken.y) / 2;
  const blend = (value, index) =>
    Math.round(interpolate(value, tint.shading[index], alpha));

  // Assumption: tint.color holds the base [r, g, b] values of the face
  const [r, g, b] = tint.color.map(blend);
  return `rgb(${r}, ${g}, ${b})`;
};
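To make the two building blocks of the shading step easier to follow, here is a small standalone sketch that isolates them. The darkenFactor and blendColor names and the sample colors are made up for illustration; the formulas are the ones used in the excerpt above.

```javascript
// Linear interpolation between a and b
const interpolate = (a, b, i) => a * (1 - i) + b * i;

// Folds a rotation distance (in degrees) into a 0..1 factor that peaks at 180°.
const darkenFactor = (delta) => {
  const ratio = delta / 180;
  return delta > 180 ? Math.abs(2 - ratio) : ratio;
};

// Blends a base color toward a shading color, channel by channel.
// alpha = 0 keeps the base color, alpha = 1 returns the shading color.
const blendColor = (base, shading, alpha) =>
  base.map((value, index) =>
    Math.round(interpolate(value, shading[index], alpha))
  );

const toRgb = ([r, g, b]) => `rgb(${r}, ${g}, ${b})`;

// A face rotated 90° and a face rotated 270° get the same factor (0.5),
// so they end up with the same color.
const base = [100, 150, 200];   // hypothetical face color
const shading = [0, 50, 100];   // hypothetical shaded color
console.log(toRgb(blendColor(base, shading, darkenFactor(90))));
console.log(toRgb(blendColor(base, shading, darkenFactor(270))));
```

The fold at 180° is what keeps the shading symmetric: a face that has rotated past the halfway point starts moving back toward the light rather than getting darker forever.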

Phew! The rest of the code is fortunately far less hairy and mostly composed of boilerplate code, DOM helpers and other elementary abstractions. One last detail that’s worth mentioning is the technique used to make the animations less obtrusive depending on the user’s preferences:

On macOS, when Reduce Motion is enabled in System Preferences, the new prefers-reduced-motion media query will be triggered (only in Safari for now), and all decorative animations on the page will be disabled. The cubes use both CSS animations to fade in and JavaScript animations to rotate. We can cancel these animations with a combination of a @media block and the MediaQueryList Interface:

/* CSS */
@media (prefers-reduced-motion) {
  #header-hero * {
    animation: none
  }
}

// JavaScript
const reduceMotion = matchMedia("(prefers-reduced-motion)").matches;
const tick = () => {
  if (reduceMotion) return;
  // ... otherwise, animate the next frame
};

More CSS 3D!

We use custom 3D-rendered devices across the site to showcase Stripe customers and apps in situ. In our never-ending quest to reduce file sizes and loading time, we considered a few options to achieve a soft three-dimensional look and feel with lightweight and resolution-independent assets. Drawing the devices directly in CSS fulfilled our objectives. Here’s the CSS laptop:

Defining the object in CSS is obviously less convenient than exporting a bitmap, but it’s worth the effort. The laptop above weighs less than 1KB and is easy to tweak. We can add hardware-acceleration, animate any part, make it responsive without losing image quality, and precisely position DOM elements (e.g. other images) within the laptop’s display. This flexibility doesn’t mean giving up on clean code—the markup stays clear, concise and descriptive:

<div class="laptop">
  <span class="shadow"></span>
  <span class="lid"></span>
  <span class="camera"></span>
  <span class="screen"></span>
  <span class="chassis">
    <span class="keyboard"></span>
    <span class="trackpad"></span>
  </span>
</div>

Styling the laptop involves a mix of gradients, shadows and transforms. In many ways, it’s a simple translation of the workflow and concepts you know and use in your graphic tools. For example, here’s the CSS code for the lid:

.laptop .lid {
  position: absolute;
  width: 100%;
  height: 100%;
  border-radius: 20px;
  background: linear-gradient(45deg, #E5EBF2, #F3F8FB);
  box-shadow: inset 1px -4px 6px rgba(145, 161, 181, .3)
}

Choosing the right tool for the job isn’t always obvious—between CSS, SVG, Canvas, WebGL and images the choice isn’t as clear as it used to be. It’s easy to dismiss CSS as something exclusively meant for presenting documents, but it’s just as easy to go overboard and abuse its visual capabilities. No matter the technology you choose, optimize for the user! This means paying close attention to client-side performance, accessibility needs, and fallback options for older browsers.

Web Animations API

The Onboarding & Verification section showcases a demo of Express, Connect’s new user onboarding flow. The whole animation is built in code and relies for the most part on the new Web Animations API.

The Web Animations API provides the performance and simplicity of CSS @keyframes in JavaScript, making it easy to create smooth, chainable animation sequences. As opposed to the requestAnimationFrame low-level API, you get all the niceties of CSS animations for free, such as native support for cubic-bezier easing functions. As an example, let’s take a look at the code for our keyboard sliding animation:

const toggleKeyboard = (element, callback, action) => {
  // Slide the keyboard from fully off-screen (100%) to its resting position (0%)
  const keyframes = {
    transform: [100, 0].map(n => `translateY(${n}%)`)
  };

  const options = {
    duration: 800,
    fill: "forwards",
    easing: "cubic-bezier(.2, 1, .2, 1)",
    // Hiding plays the same keyframes in reverse
    direction: action === "hide" ? "reverse" : "normal"
  };

  const animation = element.animate(keyframes, options);
  animation.addEventListener("finish", callback, {once: true});
};

Nice and simple! The Web Animations API covers the vast majority of typical UI animation needs without requiring a third-party dependency (as a result, the whole Express animation is about 5KB all included: scripts, images, etc.). That being said, it is not an outright replacement for requestAnimationFrame, which still provides finer control over your animation and allows you to create effects that are otherwise impossible, such as spring curves and independent transform functions. If you’re not sure about the right technology to use for your animations, you can probably prioritize your options like this:

  1. CSS transitions. This is the fastest, easiest, and most efficient way to animate. For simple things like hover effects, this is the way to go.
  2. CSS animations. These have the same performance characteristics as CSS transitions: they’re declarative animations that can be highly optimized by the browsers and run on a separate thread. CSS animations are more powerful than transitions and allow for multiple steps and multiple iterations. They’re also more intricate to implement as they require a named @keyframes declaration and often need an explicit animation-fill-mode. (And naming things is always one of the hardest things in computer science!)
  3. Web Animations API. This API offers almost the same performance as CSS animations (these animations are driven by the same engine, but JavaScript code will still run on the main thread) and nearly the same ease of use. This should be your default choice for any animation where you need interactivity, random effects, chainable sequences, and anything richer than a purely declarative animation.
  4. requestAnimationFrame. The sky is the limit, but you have to engineer the rocket ship. The possibilities are endless and the rendering methods unlimited (HTML, SVG, canvas—you name it), but it’s a lot more complicated to use and may not perform as well as the previous options.

No matter the technique you use, there are a few simple tips you can apply everywhere to make your animations look significantly better:

  • Custom curves. You almost never want to use a built-in timing-function like ease-in, ease-out and linear. A nice time-saver is to globally define a number of custom cubic-bezier variables.
  • Performance. Avoid jank in your animations at all costs. In CSS, this means exclusively animating cheap properties (transform and opacity) and offloading animations to the GPU when you can (using will-change).
  • Speed. Animations should never get in the way. The very goal of animations is to make a UI feel responsive, harmonious, enjoyable and polished. There’s no hard limit on the exact animation duration as it depends on the effect and the curve, but in most cases you’ll want to stay under 500 milliseconds.

Intersection Observer

The Express animation starts playing automatically as soon as it’s visible in the viewport (you can try it by scrolling the page). This is usually accomplished by observing scroll movements to trigger some callback, but historically this meant adding expensive event listeners, resulting in verbose and inefficient code.

Connect’s landing page uses the new Intersection Observer API which provides a much more robust and performant way to detect the visibility of an element. Here’s how we start playing the Express animation:

const observeScroll = (element, callback) => {
  const observer = new IntersectionObserver(([entry]) => {
    // Ignore notifications until the element is fully visible
    if (entry.intersectionRatio < 1) return;
    callback();

    // Stop watching the element
    observer.disconnect();
  }, {
    threshold: 1
  });

  // Start watching the element
  observer.observe(element);
};

const element = document.getElementById("express-animation");
observeScroll(element, startAnimation);

The observeScroll helper simplifies our detection behavior (i.e. when an element is fully visible, the callback is triggered once) without executing anything on the main thread. Thanks to the Intersection Observer API, we’re now one step closer to buttery-smooth web pages!

Polyfills and fallbacks

All these new and shiny APIs are exciting, but they’re unfortunately not yet available everywhere. The common workaround is to use polyfills to feature-test for a particular API and execute only if the API is missing. The obvious downside to this approach is that it penalizes everyone, forever, by forcing them to download the polyfill regardless of whether it’s used. We decided on a different solution:

For JavaScript APIs, Connect’s landing page feature-tests whether a polyfill is necessary and can dynamically insert it in the page. Scripts that are dynamically created and added to the document are asynchronous by default, which means the order of execution isn’t guaranteed. That’s obviously a problem, as a given script may execute before an expected polyfill. Thankfully, we can fix that by explicitly marking our scripts as not asynchronous and therefore lazy-load only what’s required:

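const insert = name => {
  const el = document.createElement("script");
  el.src = `${name}.js`;
  el.async = false; // Keep the execution order
  document.head.appendChild(el);
};

const scripts = ["main"];

// The polyfill file names below are placeholders
if (!Element.prototype.animate)
  scripts.unshift("web-animations-polyfill");

if (!("IntersectionObserver" in window))
  scripts.unshift("intersection-observer-polyfill");

scripts.forEach(insert);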

For CSS, the problem and solution are pretty much the same as for JavaScript polyfills. The typical way to use modern CSS features is to write the fallback first and override it when possible:

div { display: flex }

@supports (display: grid) {
  div { display: grid }
}

CSS feature queries are easy, reliable, and they should likely be your default choice. However, they weren’t suited to our audience since close to 90% of our visitors already use a Grid-friendly browser (❤️). In our case, it didn’t make sense to penalize the overwhelming majority of our users with hundreds of fallback rules for a small and decreasing percentage of browsers. Given these statistics, we chose to dynamically create and insert a fallback stylesheet when needed:

// Some browsers not supporting Grid don’t support CSS.supports
// either, so we need to feature-test it the old-fashioned way:

if (!("grid" in document.body.style)) {
  const fallback = "<link rel=stylesheet href=fallback.css>";
  document.head.insertAdjacentHTML("beforeend", fallback);
}

That’s a wrap!

We hope you enjoyed (and maybe even learned from) some of these front-end tips! Modern browsers provide us with powerful tools to create rich, fast and engaging experiences, letting our creativity shine on the web. If you’re as excited as we are about the possibilities, we should probably experiment with them together.


June 19, 2017

Scaling your API with rate limiters

Paul Tarjan on March 30, 2017 in Engineering

Availability and reliability are paramount for all web applications and APIs. If you’re providing an API, chances are you’ve already experienced sudden increases in traffic that affect the quality of your service, potentially even leading to a service outage for all your users.

The first few times this happens, it’s reasonable to just add more capacity to your infrastructure to accommodate user growth. However, when you’re running a production API, not only do you have to make it robust with techniques like idempotency, you also need to build for scale and ensure that one bad actor can’t accidentally or deliberately affect its availability.

Rate limiting can help make your API more reliable in the following scenarios:

  • One of your users is responsible for a spike in traffic, and you need to stay up for everyone else.
  • One of your users has a misbehaving script which is accidentally sending you a lot of requests. Or, even worse, one of your users is intentionally trying to overwhelm your servers.
  • A user is sending you a lot of lower-priority requests, and you want to make sure that it doesn’t affect your high-priority traffic. For example, users sending a high volume of requests for analytics data could affect critical transactions for other users.
  • Something in your system has gone wrong internally, and as a result you can’t serve all of your regular traffic and need to drop low-priority requests.

At Stripe, we’ve found that carefully implementing a few rate limiting strategies helps keep the API available for everyone. In this post, we’ll explain in detail which rate limiting strategies we find the most useful, how we prioritize some API requests over others, and how we started using rate limiters safely without affecting our existing users’ workflows.

Rate limiters and load shedders

A rate limiter is used to control the rate of traffic sent or received on the network. When should you use a rate limiter? If your users can afford to change the pace at which they hit your API endpoints without affecting the outcome of their requests, then a rate limiter is appropriate. If spacing out their requests is not an option (typically for real-time events), then you’ll need another strategy outside the scope of this post (most of the time you just need more infrastructure capacity).

Our users can make a lot of requests: for example, batch processing payments causes sustained traffic on our API. We find that clients can always (barring some extremely rare cases) spread out their requests a bit more and not be affected by our rate limits.

Rate limiters are amazing for day-to-day operations, but during incidents (for example, if a service is operating more slowly than usual), we sometimes need to drop low-priority requests to make sure that more critical requests get through. This is called load shedding. It happens infrequently, but it is an important part of keeping Stripe available.

A load shedder makes its decisions based on the whole state of the system rather than the user who is making the request. Load shedders help you deal with emergencies, since they keep the core part of your business working while the rest is on fire.

Using different kinds of rate limiters in concert

Once you know rate limiters can improve the reliability of your API, you should decide which types are the most relevant.

At Stripe, we operate 4 different types of limiters in production. The first one, the Request Rate Limiter, is by far the most important one. We recommend you start here if you want to improve the robustness of your API.

Request rate limiter

This rate limiter restricts each user to N requests per second. Request rate limiters are the first tool most APIs can use to effectively manage a high volume of traffic.

Our request rate limiter is constantly triggered. It has rejected millions of requests this month alone, especially for test mode requests where a user inadvertently runs a script that’s gotten out of hand.

Our API provides the same rate limiting behavior in both test and live modes. This makes for a good developer experience: scripts won't encounter side effects due to a particular rate limit when moving from development to production.

After analyzing our traffic patterns, we added the ability to briefly burst above the cap for sudden spikes in usage during real-time events (e.g. a flash sale).

Request rate limiters restrict users to a maximum number of requests per second.

Concurrent requests limiter

Instead of “You can use our API 1000 times a second”, this rate limiter says “You can only have 20 API requests in progress at the same time”. Some endpoints are much more resource-intensive than others, and users often get frustrated waiting for the endpoint to return and then retry. These retries add more demand to the already overloaded resource, slowing things down even more. The concurrent rate limiter helps address this nicely.

Our concurrent request limiter is triggered much less often (12,000 requests this month), and helps us keep control of our CPU-intensive API endpoints. Before we started using a concurrent requests limiter, we regularly dealt with resource contention on our most expensive endpoints caused by users making too many requests at one time. The concurrent request limiter totally solved this.

It is completely reasonable to tune this limiter so that it rejects more often than the Request Rate Limiter. It asks your users to use a different programming model of “Fork off X jobs and have them process the queue” compared to “Hammer the API and back off when I get an HTTP 429”. Some APIs fit better into one of those two patterns, so feel free to use whichever one is most suitable for the users of your API.

Concurrent request limiters manage resource contention for CPU-intensive API endpoints.
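
As a rough illustration, a concurrent requests limiter can be sketched as a per-user counter of in-flight requests. The Ruby below is a minimal in-memory sketch rather than production code: the class name, the `with_slot` method, and the `LimitExceeded` error are hypothetical, and a real deployment would keep the counts in a shared store so that every API server sees them.

```ruby
# Tracks in-flight requests per user and rejects new ones over the cap.
class ConcurrentRequestLimiter
  class LimitExceeded < StandardError; end

  def initialize(max_in_flight:)
    @max_in_flight = max_in_flight
    @in_flight = Hash.new(0)
    @lock = Mutex.new
  end

  # Runs the block if the user is under the cap; otherwise raises
  # LimitExceeded without running it.
  def with_slot(user_id)
    acquire(user_id)
    begin
      yield
    ensure
      # Always free the slot, even if the request handler raised.
      release(user_id)
    end
  end

  # Number of requests currently in progress for this user.
  def in_flight(user_id)
    @lock.synchronize { @in_flight[user_id] }
  end

  private

  def acquire(user_id)
    @lock.synchronize do
      raise LimitExceeded, "too many concurrent requests" if @in_flight[user_id] >= @max_in_flight

      @in_flight[user_id] += 1
    end
  end

  def release(user_id)
    @lock.synchronize do
      @in_flight[user_id] -= 1
      @in_flight.delete(user_id) if @in_flight[user_id] <= 0
    end
  end
end
```

A caller would wrap each request handler in `with_slot` and translate `LimitExceeded` into an HTTP 429 response.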

Fleet usage load shedder

Using this type of load shedder ensures that a certain percentage of your fleet will always be available for your most important API requests.

We divide up our traffic into two types: critical API methods (e.g. creating charges) and non-critical methods (e.g. listing charges). We have a Redis cluster that counts how many requests we currently have of each type.

We always reserve a fraction of our infrastructure for critical requests. If our reservation number is 20%, then any non-critical request over the remaining 80% allocation would be rejected with status code 503.

We triggered this load shedder for a very small fraction of requests this month. By itself, this isn’t a big deal—we definitely had the ability to handle those extra requests. But we’ve had other months where this has prevented outages.

Fleet usage load shedders reserve fleet resources for critical requests.
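
The reservation logic can be sketched in a few lines. This is a simplified in-memory version with hypothetical names (`FleetUsageLoadShedder`, `start_request`, `finish_request`) and an assumed fixed `total_capacity`; the counts described above would live in Redis rather than in a local hash.

```ruby
# Reserves a fraction of fleet capacity for critical requests.
class FleetUsageLoadShedder
  # total_capacity:    requests the fleet can serve at once (assumed known)
  # reserved_fraction: share of capacity held back for critical requests
  def initialize(total_capacity:, reserved_fraction: 0.2)
    @non_critical_limit = (total_capacity * (1.0 - reserved_fraction)).round
    @in_flight = { critical: 0, non_critical: 0 }
  end

  # Returns 200 if the request may proceed, or 503 if it is shed.
  # kind must be :critical or :non_critical.
  def start_request(kind)
    count = @in_flight.fetch(kind)
    return 503 if kind == :non_critical && count >= @non_critical_limit

    @in_flight[kind] = count + 1
    200
  end

  # Call when an admitted request completes.
  def finish_request(kind)
    @in_flight[kind] -= 1 if @in_flight.fetch(kind) > 0
  end
end
```

With a capacity of 10 and a 20% reservation, the ninth simultaneous non-critical request is rejected while critical requests are still admitted.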

Worker utilization load shedder

Most API services use a set of workers to independently respond to incoming requests in a parallel fashion. This load shedder is the final line of defense. If your workers start getting backed up with requests, then this will shed lower-priority traffic.

This one gets triggered very rarely, only during major incidents.

We divide our traffic into 4 categories:

  1. Critical methods
  2. POSTs
  3. GETs
  4. Test mode traffic

We track the number of workers with available capacity at all times. If a box is too busy to handle its request volume, it will slowly start shedding less-critical requests, starting with test mode traffic. If shedding test mode traffic gets it back into a good state, great! We can start to slowly bring traffic back. Otherwise, it’ll escalate and start shedding even more traffic.

It’s very important that shedding load and bringing it back happen slowly, or you can end up flapping (“I got rid of test mode traffic! Everything is fine! I brought it back! Everything is awful!”). We used a lot of trial and error to tune the rate at which we shed traffic, and settled on a rate where we shed a substantial amount of traffic within a few minutes.

Only 100 requests were rejected by this load shedder this month, but in the past it’s done a lot to help us recover more quickly when we have had load problems. This load shedder limits the impact of incidents that are already happening and provides damage control, while the first three are more preventative.

Worker utilization load shedders reserve workers for critical requests.
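
One way to picture this escalation is as a small state machine that moves in fixed steps. The sketch below is an illustration under stated assumptions, not the tuned logic described above: the class name, the `tick` method, and the 10% step are invented for the example, and it never sheds critical methods.

```ruby
# Sheds lower-priority traffic gradually and restores it gradually.
class WorkerUtilizationShedder
  # Sheddable categories, least important first.
  SHED_ORDER = %i[test_mode get post].freeze

  # step_percent: how far the shed percentage moves per tick. A small
  #               step is what keeps the shedder from flapping.
  def initialize(step_percent: 10)
    @step = step_percent
    @shed_percent = SHED_ORDER.map { |category| [category, 0] }.to_h
  end

  # Call periodically with whether workers are currently backed up.
  def tick(overloaded:)
    if overloaded
      # Escalate: shed a bit more of the least important category
      # that is not yet fully shed.
      category = SHED_ORDER.find { |c| @shed_percent[c] < 100 }
      @shed_percent[category] = [@shed_percent[category] + @step, 100].min if category
    else
      # Recover: bring back a bit of the most important category
      # that is still being shed.
      category = SHED_ORDER.reverse.find { |c| @shed_percent[c] > 0 }
      @shed_percent[category] = [@shed_percent[category] - @step, 0].max if category
    end
  end

  # roll: a number in 0...100, random by default.
  def shed?(category, roll: rand(100))
    roll < shed_percent(category)
  end

  def shed_percent(category)
    @shed_percent.fetch(category, 0)
  end
end
```

Because recovery walks the categories in the opposite order, test mode traffic is the first to be shed and the last to be fully restored.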

Building rate limiters in practice

Now that we’ve outlined the four basic kinds of rate limiters we use and what they’re for, let’s talk about their implementation. What rate limiting algorithms are there? How do you actually implement them in practice?

We use the token bucket algorithm to do rate limiting. This algorithm has a centralized bucket host where you take tokens on each request, and slowly drip more tokens into the bucket. If the bucket is empty, reject the request. In our case, every Stripe user has a bucket, and every time they make a request we remove a token from that bucket.
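
To make the algorithm concrete, here is a minimal in-memory sketch of a token bucket in Ruby. It is an illustration rather than production code: the class and method names are made up, the state lives in process memory rather than on a centralized host, and the clock is injectable only so the behavior is easy to test. The bucket’s capacity is what allows a brief burst above the steady refill rate.

```ruby
# Minimal in-memory token bucket. A production limiter would keep this
# state in a shared store so that every API server sees it.
class TokenBucket
  # capacity:    maximum tokens the bucket can hold (the burst size)
  # refill_rate: tokens dripped back into the bucket per second
  # clock:       callable returning the current time in seconds
  def initialize(capacity:, refill_rate:,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @capacity = capacity.to_f
    @refill_rate = refill_rate.to_f
    @clock = clock
    @tokens = @capacity
    @last_refill = @clock.call
  end

  # Takes one token if available. Returns true when the request may
  # proceed and false when it should be rejected.
  def allow?
    refill
    return false if @tokens < 1.0

    @tokens -= 1.0
    true
  end

  private

  # Adds the tokens earned since the last call, capped at capacity.
  def refill
    now = @clock.call
    elapsed = now - @last_refill
    @tokens = [@capacity, @tokens + elapsed * @refill_rate].min
    @last_refill = now
  end
end

# One bucket per user, created lazily on that user's first request.
BUCKETS = Hash.new do |buckets, user_id|
  buckets[user_id] = TokenBucket.new(capacity: 5, refill_rate: 1)
end

def request_allowed?(user_id)
  BUCKETS[user_id].allow?
end
```

Refilling lazily on each call, based on elapsed time, avoids needing a background process to drip tokens into every bucket.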

We implement our rate limiters using Redis. You can either operate the Redis instance yourself, or, if you use Amazon Web Services, you can use a managed service like ElastiCache.

Here are important things to consider when implementing rate limiters:

  • Hook the rate limiters into your middleware stack safely. Make sure that if there were bugs in the rate limiting code (or if Redis were to go down), requests wouldn’t be affected. This means catching exceptions at all levels so that any coding or operational errors would fail open and the API would still stay functional.
  • Show clear exceptions to your users. Figure out what kinds of exceptions to show your users. In practice, you should decide if you want HTTP 429 (Too Many Requests) or HTTP 503 (Service Unavailable) and what is the most accurate depending on the situation. The message you return should also be actionable.
  • Build in safeguards so that you can turn off the limiters. Make sure you have kill switches to disable the rate limiters should they kick in erroneously. Having feature flags in place can really help should you need a human escape valve. Set up alerts and metrics to understand how often they are triggering.
  • Dark launch each rate limiter to watch the traffic they would block. Evaluate if it is the correct decision to block that traffic and tune accordingly. You want to find the right thresholds that would keep your API up without affecting any of your users’ existing request patterns. This might involve working with some of them to change their code so that the new rate limit would work for them.
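
The first, third, and fourth points can be combined in one small wrapper. This is a minimal sketch under stated assumptions: `SafeLimiter` is a hypothetical name, the wrapped limiter is assumed to respond to `allow?(request)`, and the kill switch is modeled as a callable that a feature flag system could back.

```ruby
# Wraps a limiter so that bugs or outages in the limiter never take
# the API down with it.
class SafeLimiter
  attr_reader :would_block_count, :error_count

  # limiter:     any object responding to allow?(request)
  # enabled:     callable kill switch; return false to bypass the limiter
  # dark_launch: when true, count what would be blocked but allow it
  def initialize(limiter, enabled: -> { true }, dark_launch: false)
    @limiter = limiter
    @enabled = enabled
    @dark_launch = dark_launch
    @would_block_count = 0
    @error_count = 0
  end

  # Returns true if the request should proceed.
  def allow?(request)
    return true unless @enabled.call
    return true if @limiter.allow?(request)

    @would_block_count += 1
    @dark_launch # allow in dark-launch mode, block otherwise
  rescue StandardError
    # Fail open: a broken limiter must not reject real traffic.
    @error_count += 1
    true
  end
end
```

The two counters are the hook for the alerts and metrics mentioned above: one shows how often the limiter triggers, the other how often it is failing.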


Rate limiting is one of the most powerful ways to prepare your API for scale. The different rate limiting strategies described in this post are not all necessary on day one; you can gradually introduce them once you realize the need for rate limiting.

We recommend the following steps to introduce rate limiting to your infrastructure:

  1. Start by building a Request Rate Limiter. It is the most important one to prevent abuse, and it’s by far the one that we use the most frequently.
  2. Introduce the next three types of rate limiters over time to prevent different classes of problems. They can be built slowly as you scale.
  3. Follow good launch practices as you're adding new rate limiters to your infrastructure. Handle any errors safely, put them behind feature flags to turn them off easily at any time, and rely on very good observability and metrics to see how often they’re triggering.

To help you get started, we’ve created a GitHub gist to share implementation details based on the code we actually use in production at Stripe.

March 30, 2017

Designing robust and predictable APIs with idempotency

Brandur Leach on February 22, 2017 in Engineering

Networks are unreliable. We’ve all experienced trouble connecting to Wi-Fi, or had a phone call drop on us abruptly.

The networks connecting our servers are, on average, more reliable than consumer-level last miles like cellular or home ISPs, but given enough information moving across the wire, they’re still going to fail in exotic ways. Outages, routing problems, and other intermittent failures may be statistically unusual on the whole, but they’re still bound to be happening all the time at some ambient background rate.

To overcome this sort of inherently unreliable environment, it’s important to design APIs and clients that will be robust in the event of failure, and will predictably bring a complex integration to a consistent state despite it. Let’s take a look at a few ways to do that.

Planning for failure

Consider a call between any two nodes. There are a variety of failures that can occur:

  • The initial connection could fail as the client tries to connect to a server.
  • The call could fail midway while the server is fulfilling the operation, leaving the work in limbo.
  • The call could succeed, but the connection break before the server can tell its client about it.

Any one of these leaves the client that made the request in an uncertain situation. In some cases, the failure is definitive enough that the client knows with good certainty that it’s safe to simply retry it: for example, a total failure to even establish a connection to the server. In many others though, the success of the operation is ambiguous from the perspective of the client, and it doesn’t know whether retrying the operation is safe. A connection terminating midway through message exchange is an example of this case.

This problem is a classic staple of distributed systems, and the definition is broad when talking about a “distributed system” in this sense: as few as two computers connecting via a network that are passing each other messages. Even the Stripe API and just one other server that’s making requests to it comprise a distributed system.

Making liberal use of idempotency

The easiest way to address inconsistencies in distributed state caused by failures is to implement server endpoints so that they’re idempotent, which means that they can be called any number of times while guaranteeing that side effects only occur once.

When a client sees any kind of error, it can ensure the convergence of its own state with the server’s by retrying, and can continue to retry until it verifiably succeeds. This fully addresses the problem of an ambiguous failure because the client knows that it can safely handle any failure using one simple technique.

As an example, consider the API call for a hypothetical DNS provider that enables us to add subdomains via an HTTP request:

curl \
   -X PUT \
   -d type=CNAME \
   -d value="" \
   -d ttl=3600

All the information needed to create a record is included in the call, and it’s perfectly safe for a client to invoke it any number of times. If the server receives a call that it realizes is a duplicate because the domain already exists, it simply ignores the request and responds with a successful status code.

According to HTTP semantics, the PUT and DELETE verbs are idempotent, and the PUT verb in particular signifies that a target resource should be created or replaced entirely with the contents of a request’s payload (in modern RESTful parlance, a modification would be represented by a PATCH).
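
To show what an idempotent PUT looks like on the server side, here is a minimal in-memory sketch. `RecordStore` and its methods are hypothetical stand-ins for the DNS provider’s backend, and the hostnames are placeholders.

```ruby
# In-memory stand-in for the hypothetical DNS provider's record store.
class RecordStore
  def initialize
    @records = {}
  end

  # Creates or fully replaces the record at `name`. Calling this any
  # number of times with the same arguments leaves the same state.
  # Returns 201 when the record is first created and 200 otherwise.
  def put(name, type:, value:, ttl:)
    status = @records.key?(name) ? 200 : 201
    @records[name] = { type: type, value: value, ttl: ttl }
    status
  end

  def get(name)
    @records[name]
  end

  def size
    @records.size
  end
end

store = RecordStore.new
2.times do
  store.put("s3.example.com", type: "CNAME", value: "bucket.example.net", ttl: 3600)
end
# The duplicate call succeeded, and there is still exactly one record.
```

Because the request carries the complete desired state, the server never has to know whether it is seeing the first attempt or a retry.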

Guaranteeing “exactly once” semantics

While the inherently idempotent HTTP semantics around PUT and DELETE are a good fit for many API calls, what if we have an operation that needs to be invoked exactly once and no more? An example might be if we were designing an API endpoint to charge a customer money; accidentally calling it twice would lead to the customer being double-charged, which is very bad.

This is where idempotency keys come into play. When performing a request, a client generates a unique ID to identify just that operation and sends it up to the server along with the normal payload. The server receives the ID and correlates it with the state of the request on its end. If the client notices a failure, it retries the request with the same ID, and from there it’s up to the server to figure out what to do with it.

If we consider our sample network failure cases from above:

  • On retrying a connection failure, on the second request the server will see the ID for the first time, and process it normally.
  • On a failure midway through an operation, the server picks up the work and carries it through. The exact behavior is heavily dependent on implementation, but if the previous operation was successfully rolled back by way of an ACID database, it’ll be safe to retry it wholesale. Otherwise, state is recovered and the call is continued.
  • On a response failure (i.e. the operation executed successfully, but the client couldn’t get the result), the server simply replies with a cached result of the successful operation.

The Stripe API implements idempotency keys on mutating endpoints (i.e. anything under POST in our case) by allowing clients to pass a unique value in with the special Idempotency-Key header, which allows a client to guarantee the safety of distributed operations:

curl \
   -u sk_test_BQokikJOvBiI2HlWgH4olfQ2: \
   -H "Idempotency-Key: AGJ6FJMkGQIpHUTX" \
   -d amount=2000 \
   -d currency=usd \
   -d description="Charge for Brandur" \
   -d customer=cus_A8Z5MHwQS7jUmZ

If the above Stripe request fails due to a network connection error, you can safely retry it with the same idempotency key, and the customer is charged only once.
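
On the server, the simplest version of this is a lookup table from idempotency key to stored response. The sketch below is illustrative only and makes no claim about how the Stripe API is implemented: `ChargeService` and the generated `ch_` IDs are made up, state is held in memory, and it ignores concurrent requests and failures midway through an operation.

```ruby
require "securerandom"

# Toy charge endpoint that applies each idempotency key at most once.
class ChargeService
  attr_reader :charges

  def initialize
    @charges = []
    @responses_by_key = {}
  end

  # A retry with a key the server has already seen gets the stored
  # response back instead of creating a second charge.
  def create_charge(idempotency_key:, amount:, currency:)
    cached = @responses_by_key[idempotency_key]
    return cached if cached

    charge = { id: "ch_#{SecureRandom.hex(8)}", amount: amount, currency: currency }
    @charges << charge
    @responses_by_key[idempotency_key] = charge
  end
end

service = ChargeService.new
key = SecureRandom.uuid # generated once by the client, reused on every retry
first = service.create_charge(idempotency_key: key, amount: 2000, currency: "usd")
retried = service.create_charge(idempotency_key: key, amount: 2000, currency: "usd")
# `retried` is the same stored response as `first`; only one charge exists.
```

The key has to be generated by the client before the first attempt, since only the client knows that two requests are meant to be the same operation.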

Being a good distributed citizen

Safely handling failure is hugely important, but beyond that, it’s also recommended that it be handled in a considerate way. When a client sees that a network operation has failed, there’s a good chance that it’s due to an intermittent failure that will be gone by the next retry. However, there’s also a chance that it’s a more serious problem that’s going to be more tenacious; for example, if the server is in the middle of an incident that’s causing hard downtime. Not only will retries of the operation not go through, but they may contribute to further degradation.

It’s usually recommended that clients follow something akin to an exponential backoff algorithm as they see errors. The client blocks for a brief initial wait time on the first failure, but as the operation continues to fail, it waits proportionally to 2^n, where n is the number of failures that have occurred. By backing off exponentially, we can ensure that clients aren’t hammering on a downed server and contributing to the problem.

Exponential backoff has a long and interesting history in computer networking.

Furthermore, it’s also a good idea to mix in an element of randomness. If a problem with a server causes a large number of clients to fail at close to the same time, then even with backoff, their retry schedules could be aligned closely enough that the retries will hammer the troubled server. This is known as the thundering herd problem.

We can address thundering herd by adding some amount of random “jitter” to each client’s wait time. This will space out requests across all clients, and give the server some breathing room to recover.

Thundering herd problem when a server faces simultaneous retries from all clients.
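
A minimal sketch of exponential backoff with jitter might look like the following. The helper names (`backoff_delay`, `with_retries`), the 0.5 second base delay, and the 30 second cap are assumptions chosen for illustration; this is not the Stripe Ruby library’s implementation. It uses the variant often called full jitter, where the wait is drawn uniformly between zero and the exponential delay.

```ruby
# Wait time before retry number `attempt` (0-based): the base delay
# doubles with each failure, is capped, and then gets random jitter.
def backoff_delay(attempt, base: 0.5, cap: 30.0, rng: Random.new)
  exponential = [base * (2**attempt), cap].min
  # Full jitter: pick uniformly between 0 and the exponential delay.
  rng.rand * exponential
end

# Runs the block, retrying on error with a backoff between attempts.
# Re-raises the last error once max_attempts has been reached.
def with_retries(max_attempts: 5, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    yield
  rescue StandardError
    attempt += 1
    raise if attempt >= max_attempts

    sleeper.call(backoff_delay(attempt - 1))
    retry
  end
end
```

A request that is safe to retry, such as one sent with an idempotency key, can then be wrapped in `with_retries { ... }`. The `sleeper` parameter exists only so that tests do not have to wait.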

The Stripe Ruby library retries on failure automatically with an idempotency key using increasing backoff times and jitter. The implementation for that is pretty simple, and you can refer to it on GitHub to see exactly how it works.

Codifying the design of robust APIs

Considering the possibility of failure in a distributed system and how to handle it is of paramount importance in building APIs that are both robust and predictable. Retry logic on clients and idempotency on servers are techniques that are useful in achieving this goal and relatively simple to implement in any technology stack.

Here are a few core principles to follow while designing your clients and APIs:

  1. Make sure that failures are handled consistently. Have clients retry operations against remote services. Not doing so could leave data in an inconsistent state that will lead to problems down the road.
  2. Make sure that failures are handled safely. Use idempotency and idempotency keys to allow clients to pass a unique value and retry requests as needed.
  3. Make sure that failures are handled responsibly. Use techniques like exponential backoff and random jitter. Be considerate of servers that may be stuck in a degraded state.

February 22, 2017

Online migrations at scale

Jacqueline Xu on February 2, 2017 in Engineering

Engineering teams face a common challenge when building software: they eventually need to redesign the data models they use to support clean abstractions and more complex features. In production environments, this might mean migrating millions of active objects and refactoring thousands of lines of code.

Stripe users expect availability and consistency from our API. This means that when we do migrations, we need to be extra careful: objects stored in our systems need to have accurate values, and Stripe’s services need to remain available at all times.

In this post, we’ll explain how we safely did one large migration of our hundreds of millions of Subscriptions objects.

Why are migrations hard?

  • Scale

    Stripe has hundreds of millions of Subscriptions objects. Running a large migration that touches all of those objects is a lot of work for our production database.

    Imagine that it takes one second to migrate each subscription object: in sequential fashion, it would take over three years to migrate one hundred million objects.

  • Uptime

    Businesses are constantly transacting on Stripe. We perform all infrastructure upgrades online, rather than relying on planned maintenance windows. Because we couldn’t simply pause the Subscriptions service during migrations, we had to execute the transition with all of our services operating at 100%.

  • Accuracy

    Our Subscriptions table is used in many different places in our codebase. If we tried to change thousands of lines of code across the Subscriptions service at once, we would almost certainly overlook some edge cases. We needed to be sure that every service could continue to rely on accurate data.

A pattern for online migrations

Moving millions of objects from one database table to another is difficult, but it’s something that many companies need to do.

There’s a common four-step dual writing pattern that people often use to do large online migrations like this. Here’s how it works:

  1. Dual writing to the existing and new tables to keep them in sync.
  2. Changing all read paths in our codebase to read from the new table.
  3. Changing all write paths in our codebase to only write to the new table.
  4. Removing old data that relies on the outdated data model.

Our example migration: Subscriptions

What are Subscriptions and why did we need to do a migration?

Stripe Billing helps users like DigitalOcean and Squarespace build and manage recurring billing for their customers. Over the past few years, we’ve steadily added features to support their more complex billing models, such as multiple subscriptions, trials, coupons, and invoices.

In the early days, each Customer object had, at most, one subscription. Our customers were stored as individual records. Since the mapping of customers to subscriptions was straightforward, subscriptions were stored alongside customers.

class Customer
  Subscription subscription

Eventually, we realized that some users wanted to create customers with multiple subscriptions. We decided to transform the subscription field (for a single subscription) to a subscriptions field—allowing us to store an array of multiple active subscriptions.

class Customer
  array: Subscription subscriptions

As we added new features, this data model became problematic. Any change to a customer’s subscriptions meant updating the entire Customer record, and subscriptions-related queries had to scan through customer objects. So we decided to store active subscriptions separately.

Our redesigned data model moves subscriptions into their own table.

As a reminder, our four migration phases were:

  1. Dual writing to the existing and new tables to keep them in sync.
  2. Changing all read paths in our codebase to read from the new table.
  3. Changing all write paths in our codebase to only write to the new table.
  4. Removing old data that relies on the outdated data model.

Let’s walk through what these four phases looked like for us in practice.

Part 1: Dual writing

We begin the migration by creating a new database table. The first step is to start duplicating new information so that it’s written to both stores. We’ll then backfill missing data to the new store so the two stores hold identical information.

All new writes should update both stores.

In our case, we record all newly-created subscriptions into both the Customers table and the Subscriptions table. Before we begin dual writing to both tables, it’s worth considering the potential performance impact of this additional write on our production database. We can mitigate performance concerns by slowly ramping up the percentage of objects that get duplicated, while keeping a careful eye on operational metrics.
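
A gradual ramp-up like this could be sketched as follows. The names (`SubscriptionWriter`, `dual_write_percent`) are hypothetical, hashes stand in for the two tables, and bucketing IDs with a CRC32 checksum is just one assumed way to pick a stable percentage of objects.

```ruby
require "zlib"

# Writes new subscriptions to the old store, and to the new store for
# a configurable percentage of objects.
class SubscriptionWriter
  attr_reader :customers, :subscriptions
  attr_accessor :dual_write_percent

  def initialize(dual_write_percent: 0)
    @customers = Hash.new { |table, id| table[id] = { subscriptions: [] } }
    @subscriptions = {}
    @dual_write_percent = dual_write_percent
  end

  def create_subscription(customer_id, subscription)
    # Existing write path: embed the subscription in the customer record.
    @customers[customer_id][:subscriptions] << subscription

    # New write path, enabled for a growing share of objects.
    return unless dual_write?(subscription[:id])

    @subscriptions[subscription[:id]] = subscription.merge(customer_id: customer_id)
  end

  private

  # Buckets each ID into 0...100 deterministically, so the same object
  # gets the same decision on every write.
  def dual_write?(id)
    Zlib.crc32(id.to_s) % 100 < @dual_write_percent
  end
end
```

Raising `dual_write_percent` from 0 to 100 in small steps, while watching operational metrics, moves all new writes onto both stores.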

At this point, newly created objects exist in both tables, while older objects are only found in the old table. We’ll start copying over existing subscriptions in a lazy fashion: whenever objects are updated, they will automatically be copied over to the new table. This approach lets us begin to incrementally transfer our existing subscriptions.

Finally, we’ll backfill any remaining Customer subscriptions into the new Subscriptions table.

We need to backfill existing subscriptions to the new Subscriptions table.

The most expensive part of backfilling the new table on the live database is simply finding all the objects that need migration. Finding all the objects by querying the database would require many queries to the production database, which would take a lot of time. Luckily, we were able to offload this to an offline process that had no impact on our production databases. We make snapshots of our databases available to our Hadoop cluster, which lets us use MapReduce to quickly process our data in an offline, distributed fashion.

We use Scalding to manage our MapReduce jobs. Scalding is a useful library written in Scala that makes it easy to write MapReduce jobs (you can write a simple one in 10 lines of code). In this case, we’ll use Scalding to help us identify all subscriptions. We’ll follow these steps:

  • Write a Scalding job that provides a list of all subscription IDs that need to be copied over.
  • Run a large, multi-threaded migration to duplicate these subscriptions with a fleet of processes efficiently operating on our data in parallel.
  • Once the migration is complete, run the Scalding job once again to make sure there are no existing subscriptions missing from the Subscriptions table.
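
The copy and verification steps above can be sketched in plain Ruby. This is a toy version under stated assumptions: hashes stand in for the two tables, `backfill` and `missing_ids` are hypothetical helpers, and `missing_ids` plays the role that the Scalding job plays against offline snapshots.

```ruby
# Copies the given subscription IDs from the old store to the new one
# using a small pool of worker threads.
def backfill(ids, old_store:, new_store:, workers: 4)
  queue = Queue.new
  ids.each { |id| queue << id }
  queue.close
  lock = Mutex.new

  threads = Array.new(workers) do
    Thread.new do
      # pop returns nil once the closed queue has been drained.
      while (id = queue.pop)
        record = old_store.fetch(id)
        # Keep any copy that dual writing or a lazy copy already made.
        lock.synchronize { new_store[id] ||= record }
      end
    end
  end
  threads.each(&:join)
end

# Stand-in for the verification job: IDs still missing from the new store.
def missing_ids(old_store, new_store)
  old_store.keys - new_store.keys
end
```

Running `missing_ids` before the copy produces the work list, and running it again afterwards confirms that nothing was left behind.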

Part 2: Changing all read paths

Now that the old and new data stores are in sync, the next step is to begin using the new data store to read all our data.

For now, all reads use the existing Customers table: we need to move to the Subscriptions table.

We need to be sure that it’s safe to read from the new Subscriptions table: our subscription data needs to be consistent. We’ll use GitHub’s Scientist to help us verify our read paths. Scientist is a Ruby library that allows you to run experiments and compare the results of two different code paths, alerting you if two expressions ever yield different results in production. With Scientist, we can generate alerts and metrics for differing results in real time. When an experimental code path generates an error, the rest of our application won’t be affected.

We’ll run the following experiment:

  • Use Scientist to read from both the Subscriptions table and the Customers table.
  • If the results don’t match, raise an error alerting our engineers to the inconsistency.

GitHub’s Scientist lets us run experiments that read from both tables and compare the results.
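
The idea behind such an experiment fits in a few lines. The class below is a hand-rolled illustration and deliberately does not use Scientist’s actual API; `ReadExperiment` and its `run` method are hypothetical, and it records mismatches in memory where a real setup would emit alerts and metrics.

```ruby
# Runs an old and a new code path side by side and records differences.
class ReadExperiment
  attr_reader :mismatches

  def initialize(name)
    @name = name
    @mismatches = []
  end

  # Always returns the result of the existing (control) path, so callers
  # are unaffected by whatever the new (candidate) path does.
  def run(control:, candidate:)
    expected = control.call
    begin
      observed = candidate.call
      unless observed == expected
        @mismatches << { name: @name, control: expected, candidate: observed }
      end
    rescue StandardError => e
      # Errors in the new path are recorded, never raised to callers.
      @mismatches << { name: @name, control: expected, error: e.class.name }
    end
    expected
  end
end
```

Here `control` would read from the Customers table and `candidate` from the Subscriptions table; an empty `mismatches` list over real traffic is the signal that it is safe to switch reads.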

After we verified that everything matched up, we started reading from the new table.

Our experiments are successful: all reads now use the new Subscriptions table.

Part 3: Changing all write paths

Next, we need to update write paths to use our new Subscriptions store. Our goal is to incrementally roll out these changes, so we’ll need to employ careful tactics.

Up until now, we’ve been writing data to the old store and then copying it to the new store.

We now want to reverse the order: write data to the new store and then archive it in the old store. By keeping these two stores consistent with each other, we can make incremental updates and observe each change carefully.

Refactoring all code paths where we mutate subscriptions is arguably the most challenging part of the migration. Stripe’s logic for handling subscriptions operations (e.g. updates, prorations, renewals) spans thousands of lines of code across multiple services.

The key to a successful refactor will be our incremental process: we’ll isolate as many code paths into the smallest unit possible so we can apply each change carefully. Our two tables need to stay consistent with each other at every step.

For each code path, we’ll need to use a holistic approach to ensure our changes are safe. We can’t just substitute new records with old records: every piece of logic needs to be considered carefully. If we miss any cases, we might end up with data inconsistency. Thankfully, we can run more Scientist experiments to alert us to any potential inconsistencies along the way.

Our new, simplified write path looks like this:

We can make sure that no code blocks continue using the outdated subscriptions array by raising an error if the property is called:

class Customer
  # Fail loudly if any remaining code path still reads the old array.
  def subscriptions
    Opus::Error.hard("Accessing subscriptions array on customer")
  end
end

Part 4: Removing old data

Our final (and most satisfying) step is to remove code that writes to the old store and eventually delete it.

Once we’ve determined that no more code relies on the subscriptions field of the outdated data model, we no longer need to write to the old table:

With this change, our code no longer uses the old store, and the new table now becomes our source of truth.

We can now remove the subscriptions array on all of our Customer objects, and we’ll incrementally process deletions in a lazy fashion. We first automatically empty the array every time a subscription is loaded, and then run a final Scalding job and migration to find any remaining objects for deletion. We end up with the desired data model:


Running migrations while keeping the Stripe API consistent is complicated. Here’s what helped us run this migration safely:

  • We laid out a four phase migration strategy that would allow us to transition data stores while operating our services in production without any downtime.
  • We processed data offline with Hadoop, allowing us to manage high data volumes in a parallelized fashion with MapReduce, rather than relying on expensive queries on production databases.
  • All the changes we made were incremental. We never attempted to change more than a few hundred lines of code at one time.
  • All our changes were highly transparent and observable. Scientist experiments alerted us as soon as a single piece of data was inconsistent in production. At each step of the way, we gained confidence in our safe migration.

We’ve found this approach effective in the many online migrations we’ve executed at Stripe. We hope these practices prove useful for other teams performing migrations at scale.

February 2, 2017

Reproducible research: Stripe’s approach to data science

Dan Frank on November 22, 2016 in Engineering

When people talk about their data infrastructure, they tend to focus on the technologies: Hadoop, Scalding, Impala, and the like. However, we’ve found that just as important as the technologies themselves are the principles that guide their use. We’d like to share our experience with one such principle that we’ve found particularly useful: reproducibility.

We’ll talk about our motivation for focusing on reproducibility, how we’re using Jupyter Notebooks as our core tool, and the workflow we’ve developed around Jupyter to operationalize our approach.

Jupyter notebooks are a fantastic way to create reproducible data science research.


Data tools are most often used to generate some kind of exploratory analysis report. At Stripe, an example is an investigation of the probability that a card gets declined, given the time since its last charge. The investigator writes a query, which is executed by a query engine like Redshift, and then runs some further code to interpret and visualize the results.

The most common way to share results from these sorts of studies is to compose an email and attach some graphs. But this means that viewers of the report don’t know how the query was constructed and analyzed. As a result, they are unable to review the work in depth, or to extend it themselves. It’s very easy to commit methodological errors when asking questions of data; an unintended bias here, or a missed corner case there, can lead to entirely incorrect conclusions.

In academia, the peer review system helps catch these errors. Many in the scientific community have championed the practice of open science, where data and code are released along with experimental results, such that reviewers can independently recreate the original results. Taking inspiration from this movement, we sought to make data reports within Stripe transparent and reproducible, so that anyone at the company can look at a report and understand how it was generated. Just like an always-green test suite forces developers to write better code, we wanted to see if requiring all analyses be reproducible would force us to produce better reports.


Our implementation of reproducible analysis centers on Jupyter Notebook, a web-based frontend to the Jupyter interactive computing environment which provides an interface similar to Mathematica or Matlab.

An example of a bar chart output from a Jupyter Notebook

Jupyter Notebook also comes with built-in functionality to convert a notebook into a publishable HTML document. You can see a sample of one of our published notebooks, studying the relationship between Benford’s Law and the amounts of each charge made on Stripe.

Now, let’s say that Alice wants to share a notebook with Bob. The state of the interactive environment can be persisted as a JSON file containing both the code input to the notebook and data output from it. To share the notebook, Alice would typically send this notebook file directly to Bob. Now, when Bob opens it, he’ll see the same outputs as Alice, but may not be able to do much with them. These outputs include computational results and plots’ image data, but not the values of any of the variables that Alice was working with. To inspect these variables and extend Alice’s work, he’ll have to recompute them from the code inputs. However, there may have been certain cells that only run correctly on Alice’s computer, or some cells might have been rearranged in a way that unintentionally broke the flow of computation. It’s easy to miss mistakes like these when you’re able to share a notebook with the results embedded, so we decided to try something different.

In our workflow, developers and data scientists work on a notebook locally and check this source file into Git. To publish their work, they use our common deployment framework, which executes the notebook code once it hits our servers. The results are translated into HTML, which is served statically. Importantly, we strip results from the notebook files in a pre-commit hook, meaning that only code is checked into our repositories. This ensures that the results are fully reproduced from scratch when the notebook is published. Thus, it’s a requirement that all notebooks be programmatically executable from start to finish, without needing any manual steps to run. If you were on a Stripe computer, you could run the notebook above with one click and obtain the same results. This is a huge deal!
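To make the stripping step concrete, here is a minimal sketch of what such a hook could do. It is not our actual hook: it assumes the standard nbformat 4 JSON layout (a top-level "cells" list whose code cells carry "outputs" and "execution_count"), and the function names are hypothetical.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat-4 notebook with all code-cell results removed."""
    stripped = json.loads(json.dumps(notebook))  # cheap deep copy of JSON-only data
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop rendered results and plot image data
            cell["execution_count"] = None  # drop the execution counter
    return stripped


def strip_file(path: str) -> None:
    """Hypothetical hook step: rewrite one .ipynb file in place without its results."""
    with open(path, encoding="utf-8") as handle:
        notebook = json.load(handle)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(strip_outputs(notebook), handle, indent=1)
        handle.write("\n")
```

A hook built this way would call strip_file on every staged notebook, so the committed file contains code and prose only, and the published results have to come from a fresh run.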

To make this workflow possible, we had to write some additional tooling to enable the same code to run on developers’ laptops and production servers. The bulk of this work involved access to our query engines, which is perhaps the most common obstacle to collaboration on data analysis projects. Even very well-organized workflows often require a data file to be present at a particular path, or some out of band authentication step with the machines running the queries. The key to overcoming these challenges was to create a common entry point in code to access these query engines from developers’ laptops, as well as our servers. This way, a notebook that runs on one developer’s computer will always run correctly on everyone else’s.
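As a rough illustration of the "common entry point" idea, the sketch below routes every query through one function that resolves its connection settings from the environment. Everything in it is an assumption made for the example: the ANALYTICS_DSN variable and the function names are hypothetical, and sqlite3 merely stands in for a real query engine.

```python
import os
import sqlite3

# Hypothetical default; laptops and servers would each export their own ANALYTICS_DSN.
DEFAULT_DSN = ":memory:"


def connect() -> sqlite3.Connection:
    """The single place where notebooks obtain a connection (sqlite3 is a stand-in)."""
    return sqlite3.connect(os.environ.get("ANALYTICS_DSN", DEFAULT_DSN))


def run_query(sql: str, params: tuple = ()) -> list:
    """Run one query and return all rows, closing the connection afterwards."""
    connection = connect()
    try:
        return connection.execute(sql, params).fetchall()
    finally:
        connection.close()
```

Because notebooks only ever call run_query, no cell needs a file path or an authentication step that exists on just one machine.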

Adding this tooling also greatly improved the experience of doing exploratory data analysis within the notebook. Prior to our reproducibility tooling, setting up data access was tedious, time-consuming, and error-prone. Automating and standardizing this process allowed data scientists and developers to focus on their analysis instead.


Reproducibility makes data science at Stripe feel like working on GitHub, where anyone can obtain and extend others’ work. Instead of islands of analysis, we share our research in a central repository of knowledge. This makes it dramatically easier for anyone on our team to work with our data science research, encouraging independent exploration.

We approach our analyses with the same rigor we apply to production code: our reports feel more like finished products, research is fleshed out and easy to understand, and there are clear programmatic steps from start to finish for every analysis.

We’ve switched over to reproducible reports, and we’re not looking back. Delivering them requires more up-front work, but we’ve found it to be a good long-term investment. If you give it a try, we think you’ll feel the same way!

November 22, 2016

Service discovery at Stripe

Julia Evans on October 31, 2016 in Engineering

With so many new technologies coming out every year (like Kubernetes or Habitat), it’s easy to become so entangled in our excitement about the future that we forget to pay homage to the tools that have been quietly supporting our production environments. One such tool we've been using at Stripe for several years now is Consul. Consul helps discover services (that is, it helps us navigate the thousands of servers we run with various services running on them and tells us which ones are up and available for use). This effective and practical architectural choice wasn't flashy or entirely novel, but has served us dutifully in our continued mission to provide reliable service to our users around the world.

We’re going to talk about:

  • What service discovery and Consul are,
  • How we managed the risks of deploying a critical piece of software,
  • The challenges we ran into along the way and what we did about them.

You don’t just set up new software and expect it to magically work and solve all your problems—using new software is a process. This is an example of what the process of using new software in production has looked like for us.

What’s service discovery?

Great question! Suppose you’re a load balancer for Stripe, and a request to create a charge has come in. You want to send it to an API server. Any API server!

We run thousands of servers with various services running on them. Which ones are the API servers? What port is the API running on? One amazing thing about using AWS is that our instances can go down at any time, so we need to be prepared to:

  • Lose API servers at any time,
  • Add extra servers to the rotation if we need additional capacity.

This problem of tracking changes around which boxes are available is called service discovery. We use a tool called Consul from HashiCorp to do service discovery.

The fact that our instances can go down at any time is actually very helpful—our infrastructure gets regular practice losing instances and dealing with it automatically, so when it happens it’s just business as usual. It’s easier to handle failure gracefully when failure happens often.

Introduction to Consul

Consul is a service discovery tool: it lets services register themselves and discover other services. It stores which services are up in a database, has client software that puts information in that database, and other client software that reads from that database. There are a lot of pieces to wrap your head around!

The most important component of Consul is the database. This database contains entries like “api-service is running at IP at port 12345. It is up.”

Individual boxes publish information to Consul saying “Hi! I am running api-service on port 12345! I am up!”.

Then if you want to talk to the API service, you can ask Consul “Which api-services are up?”. It will give you back a list of IP addresses and ports you can talk to.
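Here is a small sketch of that question asked over Consul's HTTP API. It uses the /v1/health/service/<name> endpoint with the "passing" filter and assumes a local agent on Consul's default port 8500; the function names are hypothetical and this is not the client we run.

```python
import json
from urllib.request import urlopen


def healthy_endpoints(entries: list) -> list:
    """Turn a decoded /v1/health/service response into (address, port) pairs."""
    endpoints = []
    for entry in entries:
        service = entry["Service"]
        # Consul leaves Service.Address empty when the service uses its node's address.
        address = service.get("Address") or entry["Node"]["Address"]
        endpoints.append((address, service["Port"]))
    return endpoints


def which_are_up(name: str, agent: str = "http://127.0.0.1:8500") -> list:
    """Ask the (assumed local) Consul agent for instances whose health checks pass."""
    with urlopen(f"{agent}/v1/health/service/{name}?passing", timeout=2) as response:
        return healthy_endpoints(json.load(response))
```

Calling which_are_up("api-service") would return the list of addresses and ports described above, provided the Consul server answers.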

Consul is a distributed system itself (remember: we can lose any box at any time, which means we could lose the Consul server itself!) so it uses a consensus algorithm called Raft to keep its database in sync.

If you’re interested in consensus in Consul, read more here.

The beginning of Consul at Stripe

We started out by only writing to Consul—having machines report whether or not they were up to the Consul server, but not using that information to do service discovery. We wrote some Puppet configuration to set it up, which wasn’t that hard!

This way we could uncover potential issues with running the Consul client and get experience operating it on thousands of machines. At first, no services were being discovered with Consul.

What could go wrong?

Addressing memory leaks

If you add a new piece of software to every box in your infrastructure, that software could definitely go wrong! Early on we ran into memory leaks in Consul’s stats library: we noticed that one box was taking over 100MB of RAM and climbing. This was a bug in Consul, which got fixed.

100MB of memory is not a big leak, but the leak was growing quickly. Memory leaks in general are worrisome because they're one of the easiest ways for one process on a box to Totally Ruin Things for other processes on the box.

Good thing we decided not to use Consul to discover services to start! Letting it sit on a bunch of production machines and monitoring memory usage let us find out about a potentially serious problem quickly with no negative impact.

Starting to discover services with Consul

Once we were more confident that running Consul in our infrastructure would work, we started adding a few clients to talk to Consul! We made this less risky in 2 ways:

  • Only use Consul in a few places to start,
  • Keep a fallback system in place so that we could function during outages.

Here are some of the issues we ran into. We’re not listing these to complain about Consul, but rather to emphasize that when using new technology, it’s important to roll it out slowly and be cautious.

A ton of Raft failovers. Remember that Consul uses a consensus protocol? This copies all the data on one server in the Consul cluster to other servers in that cluster. The primary server was having a ton of problems with disk I/O—the disks weren’t fast enough to do the reads that Consul wanted to do, and the whole primary server would hang. Then Raft would say “oh, the primary is down!” and elect a new primary, and the cycle would repeat. While Consul was busy electing a new primary, it would not let anybody write anything or read anything from its database (because consistent reads are the default).

Version 0.3 broke SSL completely. We were using Consul’s SSL feature (technically, TLS) for our Consul nodes to communicate securely. One Consul release just broke it. We patched it. This is an example of a kind of issue that isn’t that difficult to detect or scary (we tested in QA, realized SSL was broken, and just didn’t roll out the release), but is pretty common when using early-stage software.

Goroutine leaks. We started using Consul’s leader election, and there was a goroutine leak that caused Consul to quickly eat all the memory on the box. The Consul team was really helpful in debugging this and we fixed a bunch of memory leaks (different memory leaks from before).

Once all of these were fixed, we were in a much better place. Getting from “our first Consul client” to “we’ve fixed all these issues in production” took a bit less than a year of background work cycles.

Scaling Consul to discover which services are up

So, we’d learned about a bunch of bugs in Consul, and had them fixed, and everything was operating much better. Remember that step we talked about at the beginning, though? Where you ask Consul “hey, what boxes are up for api-service?” We were having intermittent problems where the Consul server would respond slowly or not at all.

This was mostly during Raft failovers or instability; because Consul uses a strongly consistent store, its availability will always be weaker than that of something that doesn't. It was especially rough in the early days.

We still had fallbacks, but Consul outages became pretty painful for us. We would fall back to a hardcoded set of DNS names (like “apibox1”) when Consul was down. This worked okay when we first rolled out Consul, but as we scaled and used Consul more widely, it became less and less viable.
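The fallback logic amounts to something like the sketch below. The FALLBACK table, its port, and the discover function are hypothetical stand-ins for illustration; the lookup argument is whatever function normally queries Consul.

```python
from typing import Callable, List, Tuple

Endpoint = Tuple[str, int]

# Hypothetical static table, in the spirit of the hardcoded "apibox1" names.
FALLBACK = {"api-service": [("apibox1", 12345)]}


def discover(name: str, lookup: Callable[[str], List[Endpoint]]) -> List[Endpoint]:
    """Ask the live lookup first; use the hardcoded list if it fails or returns nothing."""
    try:
        endpoints = lookup(name)
    except OSError:
        # Timeouts, refused connections and urllib errors all derive from OSError.
        endpoints = []
    return endpoints or FALLBACK.get(name, [])
```

The weakness is visible in the table itself: every new service or box means another hand-maintained entry, which is why this stopped being viable as our use of Consul grew.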

Consul Template to the rescue

Asking Consul which services were up (via its HTTP API) was unreliable. But we were happy with it otherwise!

We wanted to get information out of Consul about which services were up without using its API. How?

Well, Consul would take a name (like monkey-srv) and translate it into one or several IP addresses (“this is where monkey-srv lives”). Know what else takes in names and outputs IP addresses? A DNS server! So we replaced direct Consul queries with a DNS server. Here’s how: Consul Template is a Go program that generates static configuration files from your Consul database.

We started using Consul Template to generate DNS records for Consul services. So if monkey-srv was running on IP, we’d generate a DNS record:

monkey-srv.service.consul IN A

Here’s what that looks like in code. You can also find our real Consul Template configuration, which is a little more complicated.

{{range $service := services}}{{range service $service.Name}}
{{$service.Name}}.service.consul. IN A {{.Address}}
{{end}}{{end}}

If you're thinking "wait, DNS records only have an IP address, you also need to know which port the server is running on," you are right! DNS A records (the kind you normally see) only have an IP address in them. However, DNS SRV records can have ports in them, and we also use Consul Template to generate SRV records.
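To show how the two record types differ, here is a sketch that renders both for a list of instances. The function name and the input tuples are hypothetical. Pointing each SRV target at a <node>.node.consul hostname is an assumption borrowed from Consul's own DNS naming, since SRV targets must be hostnames rather than IP addresses.

```python
def dns_records(service: str, instances: list, domain: str = "service.consul") -> list:
    """Render one A and one SRV zone-file line per (node, address, port) instance."""
    name = f"{service}.{domain}."
    lines = []
    for node, address, port in instances:
        target = f"{node}.node.consul."  # assumed hostname for the box running the service
        lines.append(f"{name} IN A {address}")
        # SRV record fields, in order: priority, weight, port, target.
        lines.append(f"{name} IN SRV 0 0 {port} {target}")
    return lines
```

A client that only needs an address reads the A record; a client that also needs the port reads the SRV record and then resolves its target.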

We run Consul Template in a cron job every 60 seconds. Consul Template also has a “watch” mode (the default) which continuously updates configuration files when its database is updated. When we tried the watch mode, it DOSed our Consul server, so we stopped using it.

So if our Consul server goes down, our internal DNS server still has all the records! They might be a little old, but that’s fine. What’s awesome about our DNS server is that it’s not a fancy distributed system, which means it’s a much simpler piece of software, and much less likely to spontaneously break. This means that I can just look up monkey-srv.service.consul, get an IP, and use it to talk to my service!

Because DNS is a shared-nothing, eventually consistent system, we can replicate and cache it a bunch (we have 5 canonical DNS servers, and every server has a local DNS cache and knows how to talk to any of the 5 canonical servers), so it's fundamentally more resilient than Consul.

Adding a load balancer for faster healthchecks

We just said that we update DNS records from Consul every 60 seconds. So, what happens if an API server explodes? Do we keep sending requests to that IP for up to 60 more seconds until the DNS server gets updated? We do not! There’s one more piece of the story: HAProxy.

HAProxy is a load balancer. If you give it a healthcheck for the service it’s sending requests to, it can make sure that your backends are up! All of our API requests actually go through HAProxy. Here’s how it works:

  • Every 60 seconds, Consul Template writes an HAProxy configuration file.
  • This means that HAProxy always has an approximately correct set of backends.
  • If a machine goes down, HAProxy realizes quickly that something has gone wrong (since it runs healthchecks every 2 seconds).

This means we restart HAProxy every 60 seconds. But does that mean we drop connections when we restart HAProxy? No. To avoid dropping connections between restarts, we use HAProxy’s graceful restarts feature. It’s still possible to drop some traffic with this restart policy, as described here, but we don’t process enough traffic that it’s an issue.

We have a standard healthcheck endpoint for our services: almost every service has a /healthcheck endpoint that returns 200 if it’s up and errors if not. Having a standard is important because it means we can easily configure HAProxy to check service health.

When Consul is down, HAProxy will just have a stale configuration file, which will keep working.
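The generated file could look roughly like the output of the sketch below. It is an illustration, not our real template: the function name and the server naming scheme are hypothetical, while "option httpchk" and "check inter 2s" are standard HAProxy directives for an HTTP healthcheck that runs every 2 seconds.

```python
def haproxy_backend(service: str, instances: list) -> str:
    """Render an HAProxy backend section for a list of (address, port) instances."""
    lines = [
        f"backend {service}",
        # Probe the standard endpoint; HAProxy treats 2xx and 3xx responses as healthy.
        "    option httpchk GET /healthcheck",
    ]
    for index, (address, port) in enumerate(instances, start=1):
        # "check inter 2s" enables healthchecks on this server, two seconds apart.
        lines.append(f"    server {service}-{index} {address}:{port} check inter 2s")
    return "\n".join(lines) + "\n"
```

The list of servers can be up to a minute stale, but the per-server healthcheck takes a dead box out of rotation within seconds, which is why the stale file keeps working.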

Trading consistency for availability

If you’ve been paying close attention, you’ll notice that the system we started with (a strongly consistent database which was guaranteed to have the latest state) was very different from the system we finished with (a DNS server which could be up to a minute behind). Giving up our requirement for consistency let us have a much more available system: Consul outages have basically no effect on our ability to discover services.

An important lesson from this is that consistency doesn’t come for free! You have to be willing to pay the price in availability, and so if you’re going to be using a strongly consistent system it’s important to make sure that’s something you actually need.

What happens when you make a request

We covered a lot in this post, so let’s go through the request flow now that we’ve learned how it all works.

When you make a request for, what happens? How does it end up at the right server? Here’s a simplified explanation:

  1. It comes into one of our public load balancers, running HAProxy,
  2. Consul Template has populated a list of servers serving in the /etc/haproxy.conf configuration file,
  3. HAProxy reloads this configuration file every 60 seconds,
  4. HAProxy sends your request on to a server! It makes sure that the server is up.

It’s a little more complicated than that (there’s an extra layer, and Stripe API requests are even more complicated, because we have systems to deal with PCI compliance), but all the core ideas are there.

This means that when we bring up or take down servers, Consul takes care of removing them from the HAProxy rotation automatically. There’s no manual work to do.

More than a year of peace

There are a lot of areas we’re looking forward to improving in our approach to service discovery. It’s a space with loads of active development and we see some elegant opportunities for integrating our scheduling and request routing infrastructure in the near future.

However, one of the important lessons we’ve taken away is that simple approaches are often the right ones. This system has been working for us reliably for more than a year without any incidents. Stripe doesn’t process anywhere near as many requests as Twitter or Facebook, but we do care a very great deal about extreme reliability. Sometimes the best wins come from deploying a stable, excellent solution instead of a novel one.

October 31, 2016

A primer on machine learning for fraud detection

Michael Manapat on October 27, 2016 in Engineering

Stripe Radar is a collection of tools to help businesses detect and prevent fraud. At Radar’s core is a machine learning engine that scans every card payment across Stripe’s 100,000+ businesses, aggregates information from those payments into behavioral signals that are predictive of fraud, and blocks payments that have a high probability of being fraudulent.

Radar’s power comes from all the data we obtain from the Stripe “network.” Instead of requiring users to label charges manually, Radar obtains the “ground truth” of fraud directly from our banking partners. Just as importantly, the signals we use in our models include aggregates over the entire stream of payments processed by Stripe: when a card is used for the first time on a Stripe business, there’s an 80% chance we’ve seen that card elsewhere on the Stripe network, and those previous interactions provide valuable information about potential fraud.

If you’re curious to learn more, we’ve put together a detailed outline that describes how we use machine learning at Stripe to detect and prevent fraud.

October 27, 2016