Blog Engineering

Follow Stripe on Twitter

Scaling your API with rate limiters

Paul Tarjan on March 30, 2017 in Engineering

Availability and reliability are paramount for all web applications and APIs. If you’re providing an API, chances are you’ve already experienced sudden increases in traffic that affect the quality of your service, potentially even leading to a service outage for all your users.

The first few times this happens, it’s reasonable to just add more capacity to your infrastructure to accommodate user growth. However, when you’re running a production API, not only do you have to make it robust with techniques like idempotency, you also need to build for scale and ensure that one bad actor can’t accidentally or deliberately affect its availability.

Rate limiting can help make your API more reliable in the following scenarios:

  • One of your users is responsible for a spike in traffic, and you need to stay up for everyone else.
  • One of your users has a misbehaving script which is accidentally sending you a lot of requests. Or, even worse, one of your users is intentionally trying to overwhelm your servers.
  • A user is sending you a lot of lower-priority requests, and you want to make sure that it doesn’t affect your high-priority traffic. For example, users sending a high volume of requests for analytics data could affect critical transactions for other users.
  • Something in your system has gone wrong internally, and as a result you can’t serve all of your regular traffic and need to drop low-priority requests.

At Stripe, we’ve found that carefully implementing a few rate limiting strategies helps keep the API available for everyone. In this post, we’ll explain in detail which rate limiting strategies we find the most useful, how we prioritize some API requests over others, and how we started using rate limiters safely without affecting our existing users’ workflows.

Rate limiters and load shedders

A rate limiter is used to control the rate of traffic sent or received on the network. When should you use a rate limiter? If your users can afford to change the pace at which they hit your API endpoints without affecting the outcome of their requests, then a rate limiter is appropriate. If spacing out their requests is not an option (typically for real-time events), then you’ll need another strategy outside the scope of this post (most of the time you just need more infrastructure capacity).

Our users can make a lot of requests: for example, batch processing payments causes sustained traffic on our API. We find that clients can always (barring some extremely rare cases) spread out their requests a bit more and not be affected by our rate limits.

Rate limiters are amazing for day-to-day operations, but during incidents (for example, if a service is operating more slowly than usual), we sometimes need to drop low-priority requests to make sure that more critical requests get through. This is called load shedding. It happens infrequently, but it is an important part of keeping Stripe available.

A load shedder makes its decisions based on the whole state of the system rather than the user who is making the request. Load shedders help you deal with emergencies, since they keep the core part of your business working while the rest is on fire.

Using different kinds of rate limiters in concert

Once you know rate limiters can improve the reliability of your API, you should decide which types are the most relevant.

At Stripe, we operate 4 different types of limiters in production. The first one, the Request Rate Limiter, is by far the most important one. We recommend you start here if you want to improve the robustness of your API.

Request rate limiter

This rate limiter restricts each user to N requests per second. Request rate limiters are the first tool most APIs can use to effectively manage a high volume of traffic.

Our rate limits for requests is constantly triggered. It has rejected millions of requests this month alone, especially for test mode requests where a user inadvertently runs a script that’s gotten out of hand.

Our API provides the same rate limiting behavior in both test and live modes. This makes for a good developer experience: scripts won't encounter side effects due to a particular rate limit when moving from development to production.

After analyzing our traffic patterns, we added the ability to briefly burst above the cap for sudden spikes in usage during real-time events (e.g. a flash sale.)

Request rate limiters restrict users to a maximum number of requests per second.

Concurrent requests limiter

Instead of “You can use our API 1000 times a second”, this rate limiter says “You can only have 20 API requests in progress at the same time”. Some endpoints are much more resource-intensive than others, and users often get frustrated waiting for the endpoint to return and then retry. These retries add more demand to the already overloaded resource, slowing things down even more. The concurrent rate limiter helps address this nicely.

Our concurrent request limiter is triggered much less often (12,000 requests this month), and helps us keep control of our CPU-intensive API endpoints. Before we started using a concurrent requests limiter, we regularly dealt with resource contention on our most expensive endpoints caused by users making too many requests at one time. The concurrent request limiter totally solved this.

It is completely reasonable to tune this limiter up so it rejects more often than the Request Rate Limiter. It asks your users to use a different programming model of “Fork off X jobs and have them process the queue” compared to “Hammer the API and back off when I get a HTTP 429”. Some APIs fit better into one of those two patterns so feel free to use which one is most suitable for the users of your API.

Concurrent request limiters manage resource contention for CPU-intensive API endpoints.

Fleet usage load shedder

Using this type of load shedder ensures that a certain percentage of your fleet will always be available for your most important API requests.

We divide up our traffic into two types: critical API methods (e.g. creating charges) and non-critical methods (e.g. listing charges.) We have a Redis cluster that counts how many requests we currently have of each type.

We always reserve a fraction of our infrastructure for critical requests. If our reservation number is 20%, then any non-critical request over their 80% allocation would be rejected with status code 503.

We triggered this load shedder for a very small fraction of requests this month. By itself, this isn’t a big deal—we definitely had the ability to handle those extra requests. But we’ve had other months where this has prevented outages.

Fleet usage load shedders reserves fleet resources for critical requests.

Worker utilization load shedder

Most API services use a set of workers to independently respond to incoming requests in a parallel fashion. This load shedder is the final line of defense. If your workers start getting backed up with requests, then this will shed lower-priority traffic.

This one gets triggered very rarely, only during major incidents.

We divide our traffic into 4 categories:

  1. Critical methods
  2. POSTs
  3. GETs
  4. Test mode traffic

We track the number of workers with available capacity at all times. If a box is too busy to handle its request volume, it will slowly start shedding less-critical requests, starting with test mode traffic. If shedding test mode traffic gets it back into a good state, great! We can start to slowly bring traffic back. Otherwise, it’ll escalate and start shedding even more traffic.

It’s very important that shedding and bringing load happen slowly, or you can end up flapping (“I got rid of testmode traffic! Everything is fine! I brought it back! Everything is awful!”). We used a lot of trial and error to tune the rate at which we shed traffic, and settled on a rate where we shed a substantial amount of traffic within a few minutes.

Only 100 requests were rejected this month from this rate limiter, but in the past it’s done a lot to help us recover more quickly when we have had load problems. This load shedder limits the impact of incidents that are already happening and provides damage control, while the first three are more preventative.

Worker utilization load shedders reserve workers for critical requests.

Building rate limiters in practice

Now that we’ve outlined the four basic kinds of rate limiters we use and what they’re for, let’s talk about their implementation. What rate limiting algorithms are there? How do you actually implement them in practice?

We use the token bucket algorithm to do rate limiting. This algorithm has a centralized bucket host where you take tokens on each request, and slowly drip more tokens into the bucket. If the bucket is empty, reject the request. In our case, every Stripe user has a bucket, and every time they make a request we remove a token from that bucket.

We implement our rate limiters using Redis. You can either operate the Redis instance yourself, or, if you use Amazon Web Services, you can use a managed service like ElastiCache.

Here are important things to consider when implementing rate limiters:

  • Hook the rate limiters into your middleware stack safely. Make sure that if there were bugs in the rate limiting code (or if Redis were to go down), requests wouldn’t be affected. This means catching exceptions at all levels so that any coding or operational errors would fail open and the API would still stay functional.
  • Show clear exceptions to your users. Figure out what kinds of exceptions to show your users. In practice, you should decide if you want HTTP 429 (Too Many Requests) or HTTP 503 (Service Unavailable) and what is the most accurate depending on the situation. The message you return should also be actionable.
  • Build in safeguards so that you can turn off the limiters. Make sure you have kill switches to disable the rate limiters should they kick in erroneously. Having feature flags in place can really help should you need a human escape valve. Set up alerts and metrics to understand how often they are triggering.
  • Dark launch each rate limiter to watch the traffic they would block. Evaluate if it is the correct decision to block that traffic and tune accordingly. You want to find the right thresholds that would keep your API up without affecting any of your users’ existing request patterns. This might involve working with some of them to change their code so that the new rate limit would work for them.


Rate limiting is one of the most powerful ways to prepare your API for scale. The different rate limiting strategies described in this post are not all necessary on day one, you can gradually introduce them once you realize the need for rate limiting.

Our recommendation is to follow the following steps to introduce rate limiting to your infrastructure:

  1. Start by building a Request Rate Limiter. It is the most important one to prevent abuse, and it’s by far the one that we use the most frequently.
  2. Introduce the next three types of rate limiters over time to prevent different classes of problems. They can be built slowly as you scale.
  3. Follow good launch practices as you're adding new rate limiters to your infrastructure. Handle any errors safely, put them behind feature flags to turn them off easily at any time, and rely on very good observability and metrics to see how often they’re triggering.
To help you get started, we’ve created a GitHub gist to share implementation details based on the code we actually use in production at Stripe.

Like this post? Join the Stripe engineering team. View Openings

March 30, 2017

Designing robust and predictable APIs with idempotency

Brandur Leach on February 22, 2017 in Engineering

Networks are unreliable. We’ve all experienced trouble connecting to Wi-Fi, or had a phone call drop on us abruptly.

The networks connecting our servers are, on average, more reliable than consumer-level last miles like cellular or home ISPs, but given enough information moving across the wire, they’re still going to fail in exotic ways. Outages, routing problems, and other intermittent failures may be statistically unusual on the whole, but still bound to be happening all the time at some ambient background rate.

To overcome this sort of inherently unreliable environment, it’s important to design APIs and clients that will be robust in the event of failure, and will predictably bring a complex integration to a consistent state despite them. Let’s take a look at a few ways to do that.

Planning for failure

Consider a call between any two nodes. There are a variety of failures that can occur:

  • The initial connection could fail as the client tries to connect to a server.
  • The call could fail midway while the server is fulfilling the operation, leaving the work in limbo.
  • The call could succeed, but the connection break before the server can tell its client about it.

Any one of these leaves the client that made the request in an uncertain situation. In some cases, the failure is definitive enough that the client knows with good certainty that it’s safe to simply retry it. For example, a total failure to even establish a connection to the server. In many others though, the success of the operation is ambiguous from the perspective of the client, and it doesn’t know whether retrying the operation is safe. A connection terminating midway through message exchange is an example of this case.

This problem is a classic staple of distributed systems, and the definition is broad when talking about a “distributed system” in this sense: as few as two computers connecting via a network that are passing each other messages. Even the Stripe API and just one other server that’s making requests to it comprise a distributed system.

Making liberal use of idempotency

The easiest way to address inconsistencies in distributed state caused by failures is to implement server endpoints so that they’re idempotent, which means that they can be called any number of times while guaranteeing that side effects only occur once.

When a client sees any kind of error, it can ensure the convergence of its own state with the server’s by retrying, and can continue to retry until it verifiably succeeds. This fully addresses the problem of an ambiguous failure because the client knows that it can safely handle any failure using one simple technique.

As an example, consider the API call for a hypothetical DNS provider that enables us to add subdomains via an HTTP request:

curl \
   -X PUT \
   -d type=CNAME \
   -d value="" \
   -d ttl=3600

All the information needed to create a record is included in the call, and it’s perfectly safe for a client to invoke it any number of times. If the server receives a call that it realizes is a duplicate because the domain already exists, it simply ignores the request and responds with a successful status code.

According to HTTP semantics, the PUT and DELETE verbs are idempotent, and the PUT verb in particular signifies that a target resource should be created or replaced entirely with the contents of a request’s payload (in modern RESTful parlance, a modification would be represented by a PATCH).

Guaranteeing “exactly once” semantics

While the inherently idempotent HTTP semantics around PUT and DELETE are a good fit for many API calls, what if we have an operation that needs to be invoked exactly once and no more? An example might be if we were designing an API endpoint to charge a customer money; accidentally calling it twice would lead to the customer being double-charged, which is very bad.

This is where idempotency keys come into play. When performing a request, a client generates a unique ID to identify just that operation and sends it up to the server along with the normal payload. The server receives the ID and correlates it with the state of the request on its end. If the client notices a failure, it retries the request with the same ID, and from there it’s up to the server to figure out what to do with it.

If we consider our sample network failure cases from above:

  • On retrying a connection failure, on the second request the server will see the ID for the first time, and process it normally.
  • On a failure midway through an operation, the server picks up the work and carries it through. The exact behavior is heavily dependent on implementation, but if the previous operation was successfully rolled back by way of an ACID database, it’ll be safe to retry it wholesale. Otherwise, state is recovered and the call is continued.
  • On a response failure (i.e. the operation executed successfully, but the client couldn’t get the result), the server simply replies with a cached result of the successful operation.

The Stripe API implements idempotency keys on mutating endpoints (i.e. anything under POST in our case) by allowing clients to pass a unique value in with the special Idempotency-Key header, which allows a client to guarantee the safety of distributed operations:

curl \
   -u sk_test_BQokikJOvBiI2HlWgH4olfQ2: \
   -H "Idempotency-Key: AGJ6FJMkGQIpHUTX" \
   -d amount=2000 \
   -d currency=usd \
   -d description="Charge for Brandur" \
   -d customer=cus_A8Z5MHwQS7jUmZ

If the above Stripe request fails due to a network connection error, you can safely retry it with the same idempotency key, and the customer is charged only once.

Being a good distributed citizen

Safely handling failure is hugely important, but beyond that, it’s also recommended that it be handled in a considerate way. When a client sees that a network operation has failed, there’s a good chance that it’s due to an intermittent failure that will be gone by the next retry. However, there’s also a chance that it’s a more serious problem that’s going to be more tenacious; for example, if the server is in the middle of an incident that’s causing hard downtime. Not only will retries of the operation not go through, but they may contribute to further degradation.

It’s usually recommended that clients follow something akin to an exponential backoff algorithm as they see errors. The client blocks for a brief initial wait time on the first failure, but as the operation continues to fail, it waits proportionally to 2n, where n is the number of failures that have occurred. By backing off exponentially, we can ensure that clients aren’t hammering on a downed server and contributing to the problem.

Exponential backoff has a long and interesting history in computer networking.

Furthermore, it’s also a good idea to mix in an element of randomness. If a problem with a server causes a large number of clients to fail at close to the same time, then even with back off, their retry schedules could be aligned closely enough that the retries will hammer the troubled server. This is known as the thundering herd problem.

We can address thundering herd by adding some amount of random “jitter” to each client’s wait time. This will space out requests across all clients, and give the server some breathing room to recover.

Thundering herd problem when a server faces simultaneous retries from all clients.

The Stripe Ruby library retries on failure automatically with an idempotency key using increasing backoff times and jitter. The implementation for that is pretty simple, and you can refer to it on GitHub to see exactly how it works.

Codifying the design of robust APIs

Considering the possibility of failure in a distributed system and how to handle it is of paramount importance in building APIs that are both robust and predictable. Retry logic on clients and idempotency on servers are techniques that are useful in achieving this goal and relatively simple to implement in any technology stack.

Here are a few core principles to follow while designing your clients and APIs:

  1. Make sure that failures are handled consistently. Have clients retry operations against remote services. Not doing so could leave data in an inconsistent state that will lead to problems down the road.
  2. Make sure that failures are handled safely. Use idempotency and idempotency keys to allow clients to pass a unique value and retry requests as needed.
  3. Make sure that failures are handled responsibly. Use techniques like exponential backoff and random jitter. Be considerate of servers that may be stuck in a degraded state.

Like this post? Join the Stripe engineering team. View Openings

February 22, 2017

Online migrations at scale

Jacqueline Xu on February 2, 2017 in Engineering

Engineering teams face a common challenge when building software: they eventually need to redesign the data models they use to support clean abstractions and more complex features. In production environments, this might mean migrating millions of active objects and refactoring thousands of lines of code.

Stripe users expect availability and consistency from our API. This means that when we do migrations, we need to be extra careful: objects stored in our systems need to have accurate values, and Stripe’s services need to remain available at all times.

In this post, we’ll explain how we safely did one large migration of our hundreds of millions of Subscriptions objects.

Why are migrations hard?

  • Scale

    Stripe has hundreds of millions of Subscriptions objects. Running a large migration that touches all of those objects is a lot of work for our production database.

    Imagine that it takes one second to migrate each subscription object: in sequential fashion, it would take over three years to migrate one hundred million objects.

  • Uptime

    Businesses are constantly transacting on Stripe. We perform all infrastructure upgrades online, rather than relying on planned maintenance windows. Because we couldn’t simply pause the Subscriptions service during migrations, we had to execute the transition with all of our services operating at 100%.

  • Accuracy

    Our Subscriptions table is used in many different places in our codebase. If we tried to change thousands of lines of code across the Subscriptions service at once, we would almost certainly overlook some edge cases. We needed to be sure that every service could continue to rely on accurate data.

A pattern for online migrations

Moving millions of objects from one database table to another is difficult, but it’s something that many companies need to do.

There’s a common 4 step dual writing pattern that people often use to do large online migrations like this. Here’s how it works:

  1. Dual writing to the existing and new tables to keep them in sync.
  2. Changing all read paths in our codebase to read from the new table.
  3. Changing all write paths in our codebase to only write to the new table.
  4. Removing old data that relies on the outdated data model.

Our example migration: Subscriptions

What are Subscriptions and why did we need to do a migration?

Stripe Billing helps users like DigitalOcean and Squarespace build and manage recurring billing for their customers. Over the past few years, we’ve steadily added features to support their more complex billing models, such as multiple subscriptions, trials, coupons, and invoices.

In the early days, each Customer object had, at most, one subscription. Our customers were stored as individual records. Since the mapping of customers to subscriptions was straightforward, subscriptions were stored alongside customers.

class Customer
  Subscription subscription

Eventually, we realized that some users wanted to create customers with multiple subscriptions. We decided to transform the subscription field (for a single subscription) to a subscriptions field—allowing us to store an array of multiple active subscriptions.

class Customer
  array: Subscription subscriptions

As we added new features, this data model became problematic. Any changes to a customer’s subscriptions meant updating the entire Customer record, and subscriptions-related queries scanning through customer objects. So we decided to store active subscriptions separately.

Our redesigned data model moves subscriptions into their own table.

As a reminder, our four migration phases were:

  1. Dual writing to the existing and new tables to keep them in sync.
  2. Changing all read paths in our codebase to read from the new table.
  3. Changing all write paths in our codebase to only write to the new table.
  4. Removing old data that relies on the outdated data model.

Let’s walk through these four phases looked like for us in practice.

Part 1: Dual writing

We begin the migration by creating a new database table. The first step is to start duplicating new information so that it’s written to both stores. We’ll then backfill missing data to the new store so the two stores hold identical information.

All new writes should update both stores.

In our case, we record all newly-created subscriptions into both the Customers table and the Subscriptions table. Before we begin dual writing to both tables, it’s worth considering the potential performance impact of this additional write on our production database. We can mitigate performance concerns by slowly ramping up the percentage of objects that get duplicated, while keeping a careful eye on operational metrics.

At this point, newly created objects exist in both tables, while older objects are only found in the old table. We’ll start copying over existing subscriptions in a lazy fashion: whenever objects are updated, they will automatically be copied over to the new table. This approach lets us begin to incrementally transfer our existing subscriptions.

Finally, we’ll backfill any remaining Customer subscriptions into the new Subscriptions table.

We need to backfill existing subscriptions to the new Subscriptions table.

The most expensive part of backfilling the new table on the live database is simply finding all the objects that need migration. Finding all the objects by querying the database would require many queries to the production database, which would take a lot of time. Luckily, we were able to offload this to an offline process that had no impact on our production databases. We make snapshots of our databases available to our Hadoop cluster, which lets us use MapReduce to quickly process our data in a offline, distributed fashion.

We use Scalding to manage our MapReduce jobs. Scalding is a useful library written in Scala that makes it easy to write MapReduce jobs (you can write a simple one in 10 lines of code). In this case, we’ll use Scalding to help us identify all subscriptions. We’ll follow these steps:

  • Write a Scalding job that provides a list of all subscription IDs that need to be copied over.
  • Run a large, multi-threaded migration to duplicate these subscriptions with a fleet of processes efficiently operating on our data in parallel.
  • Once the migration is complete, run the Scalding job once again to make sure there are no existing subscriptions missing from the Subscriptions table.

Part 2: Changing all read paths

Now that the old and new data stores are in sync, the next step is to begin using the new data store to read all our data.

For now, all reads use the existing Customers table: we need to move to the Subscriptions table.

We need to be sure that it’s safe to read from the new Subscriptions table: our subscription data needs to be consistent. We’ll use GitHub’s Scientist to help us verify our read paths. Scientist is a Ruby library that allows you to run experiments and compare the results of two different code paths, alerting you if two expressions ever yield different results in production. With Scientist, we can generate alerts and metrics for differing results in real time. When an experimental code path generates an error, the rest of our application won’t be affected.

We’ll run the following experiment:

  • Use Scientist to read from both the Subscriptions table and the Customers table.
  • If the results don’t match, raise an error alerting our engineers to the inconsistency.

GitHub’s Scientist lets us run experiments that read from both tables and compare the results.

After we verified that everything matched up, we started reading from the new table.

Our experiments are successful: all reads now use the new Subscriptions table.

Part 3: Changing all write paths

Next, we need to update write paths to use our new Subscriptions store. Our goal is to incrementally roll out these changes, so we’ll need to employ careful tactics.

Up until now, we’ve been writing data to the old store and then copying them to the new store:

We now want to reverse the order: write data to the new store and then archive it in the old store. By keeping these two stores consistent with each other, we can make incremental updates and observe each change carefully.

Refactoring all code paths where we mutate subscriptions is arguably the most challenging part of the migration. Stripe’s logic for handling subscriptions operations (e.g. updates, prorations, renewals) spans thousands of lines of code across multiple services.

The key to a successful refactor will be our incremental process: we’ll isolate as many code paths into the smallest unit possible so we can apply each change carefully. Our two tables need to stay consistent with each other at every step.

For each code path, we’ll need to use a holistic approach to ensure our changes are safe. We can’t just substitute new records with old records: every piece of logic needs to be considered carefully. If we miss any cases, we might end up with data inconsistency. Thankfully, we can run more Scientist experiments to alert us to any potential inconsistencies along the way.

Our new, simplified write path looks like this:

We can make sure that no code blocks continue using the outdated subscriptions array by raising an error if the property is called:

class Customer
  def subscriptions
    Opus::Error.hard("Accessing subscriptions array on customer")

Part 4: Removing old data

Our final (and most satisfying) step is to remove code that writes to the old store and eventually delete it.

Once we’ve determined that no more code relies on the subscriptions field of the outdated data model, we no longer need to write to the old table:

With this change, our code no longer uses the old store, and the new table now becomes our source of truth.

We can now remove the subscriptions array on all of our Customer objects, and we’ll incrementally process deletions in a lazy fashion. We first automatically empty the array every time a subscription is loaded, and then run a final Scalding job and migration to find any remaining objects for deletion. We end up with the desired data model:


Running migrations while keeping the Stripe API consistent is complicated. Here’s what helped us run this migration safely:

  • We laid out a four phase migration strategy that would allow us to transition data stores while operating our services in production without any downtime.
  • We processed data offline with Hadoop, allowing us to manage high data volumes in a parallelized fashion with MapReduce, rather than relying on expensive queries on production databases.
  • All the changes we made were incremental. We never attempted to change more than a few hundred lines of code at one time.
  • All our changes were highly transparent and observable. Scientist experiments alerted us as soon as a single piece of data was inconsistent in production. At each step of the way, we gained confidence in our safe migration.

We’ve found this approach effective in the many online migrations we’ve executed at Stripe. We hope these practices prove useful for other teams performing migrations at scale.

Like this post? Join the Stripe engineering team. View Openings

February 2, 2017

Reproducible research: Stripe’s approach to data science

Dan Frank on November 22, 2016 in Engineering

When people talk about their data infrastructure, they tend to focus on the technologies: Hadoop, Scalding, Impala, and the like. However, we’ve found that just as important as the technologies themselves are the principles that guide their use. We’d like to share our experience with one such principle that we’ve found particularly useful: reproducibility.

We’ll talk about our motivation for focusing on reproducibility, how we’re using Jupyter Notebooks as our core tool, and the workflow we’ve developed around Jupyter to operationalize our approach.

Jupyter notebooks are a fantastic way to create reproducible data science research.


Data tools are most often used to generate some kind of exploratory analysis report. At Stripe, an example is an investigation of the probability that a card gets declined, given the time since its last charge. The investigator writes a query, which is executed by a query engine like Redshift, and then runs some further code to interpret and visualize the results.

The most common way to share results from these sorts of studies is to compose an email and attach some graphs. But this means that viewers of the report don’t know how the query was constructed and analyzed. As a result, they are unable to review the work in depth, or to extend it themselves. It’s very easy to commit methodological errors when asking questions of data; an unintended bias here, or a missed corner case there, can lead to entirely incorrect conclusions.

In academia, the peer review system helps catch these errors. Many in the scientific community have championed the practice of open science, where data and code are released along with experimental results, such that reviewers can independently recreate the original results. Taking inspiration from this movement, we sought to make data reports within Stripe transparent and reproducible, so that anyone at the company can look at a report and understand how it was generated. Just like an always-green test suite forces developers to write better code, we wanted to see if requiring all analyses be reproducible would force us to produce better reports.


Our implementation of reproducible analysis centers on Jupyter Notebook, a web-based frontend to the Jupyter interactive computing environment which provides an interface similar to Mathematica or Matlab.

An example of a bar chart output from a Jupyter Notebook

Jupyter Notebook also comes with built-in functionality to convert a notebook into a publishable HTML document. You can see a sample of one of our published notebooks, studying the relationship between Benford’s Law and the amounts of each charge made on Stripe.

Now, let’s say that Alice wants to share a notebook with Bob. The state of the interactive environment can be persisted as a JSON file containing both the code input to the notebook and data output from it. To share the notebook, Alice would typically send this notebook file directly to Bob. Now, when Bob opens it, he’ll see the same outputs as Alice, but may not be able to do much with them. These outputs include computational results and plots’ image data, but not the values of any of the variables that Alice was working with. To inspect these variables and extend Alice’s work, he’ll have to recompute them from the code inputs. However, there may have been certain cells that only run correctly on the Alice’s computer, or some cells might have been rearranged in a way that unintentionally broke the flow of computation. It’s easy to miss mistakes like these when you’re able to share a notebook with the results embedded, so we decided to try something different.

In our workflow, developers and data scientists work on a notebook locally and check this source file into Git. To publish their work, they use our common deployment framework, which executes the notebook code once it hits our servers. The results are translated into HTML, which are served statically. Importantly, we strip results from the notebook files in a pre-commit hook, meaning that only code is checked into our repositories. This ensures that the results are fully reproduced from scratch when the notebook is published. Thus, it’s a requirement that all notebooks be programmatically executable from back to front, without needing any manual steps to run. If you were on a Stripe computer, you could run the notebook above with one click and obtain the same results. This is a huge deal!

To make this workflow possible, we had to write some additional tooling to enable the same code to run on developers’ laptops and production servers. The bulk of this work involved access to our query engines, which is perhaps the most common obstacle to collaboration on data analysis projects. Even very well-organized workflows often require a data file to be present at a particular path, or some out of band authentication step with the machines running the queries. The key to overcoming these challenges was to create a common entry point in code to access these query engines from developers’ laptops, as well as our servers. This way, a notebook that runs on one developer’s computer will always run correctly on everyone else’s.

Adding this tooling also greatly improved the experience of doing exploratory data analysis within the notebook. Prior to our reproducibility tooling, setting up data access was tedious, time-consuming, and error-prone. Automating and standardizing this process allowed data scientists and developers to focus on their analysis instead.


Reproducibility makes data science at Stripe feel like working on GitHub, where anyone can obtain and extend others’ work. Instead of islands of analysis, we share our research in a central repository of knowledge. This makes it dramatically easier for anyone on our team to work with our data science research, encouraging independent exploration.

We approach our analyses with the same rigor we apply to production code: our reports feel more like finished products, research is fleshed out and easy to understand, and there are clear programmatic steps from start to finish for every analysis.

We’ve switched over to reproducible reports, and we’re not looking back. Delivering them requires more up-front work, but we’ve found it to be a good long-term investment. If you give it a try, we think you’ll feel the same way!

Like this post? Join the Stripe engineering team. View Openings

November 22, 2016

Service discovery at Stripe

Julia Evans on October 31, 2016 in Engineering

With so many new technologies coming out every year (like Kubernetes or Habitat), it’s easy to become so entangled in our excitement about the future that we forget to pay homage to the tools that have been quietly supporting our production environments. One such tool we've been using at Stripe for several years now is Consul. Consul helps discover services (that is, it helps us navigate the thousands of servers we run with various services running on them and tells us which ones are up and available for use). This effective and practical architectural choice wasn't flashy or entirely novel, but has served us dutifully in our continued mission to provide reliable service to our users around the world.

We’re going to talk about:

  • What service discovery and Consul are,
  • How we managed the risks of deploying a critical piece of software,
  • The challenges we ran into along the way and what we did about them.

You don’t just set up new software and expect it to magically work and solve all your problems—using new software is a process. This is an example of what the process of using new software in production has looked like for us.

What’s service discovery?

Great question! Suppose you’re a load balancer for Stripe, and a request to create a charge has come in. You want to send it to an API server. Any API server!

We run thousands of servers with various services running on them. Which ones are the API servers? What port is the API running on? One amazing thing about using AWS is that our instances can go down at any time, so we need to be prepared to:

  • Lose API servers at any time,
  • Add extra servers to the rotation if we need additional capacity.

This problem of tracking changes around which boxes are available is called service discovery. We use a tool called Consul from HashiCorp to do service discovery.

The fact that our instances can go down at any time is actually very helpful—our infrastructure gets regular practice losing instances and dealing with it automatically, so when it happens it’s just business as usual. It’s easier to handle failure gracefully when failure happens often.

Introduction to Consul

Consul is a service discovery tool: it lets services register themselves and to discover other services. It stores which services are up in a database, has client software that puts information in that database, and other client software that reads from that database. There are a lot of pieces to wrap your head around!

The most important component of Consul is the database. This database contains entries like “api-service is running at IP at port 12345. It is up.”

Individual boxes publish information to Consul saying “Hi! I am running api-service on port 12345! I am up!”.

Then if you want to talk to the API service, you can ask Consul “Which api-services are up?”. It will give you back a list of IP addresses and ports you can talk to.

Consul is a distributed system itself (remember: we can lose any box at any time, which means we could lose the Consul server itself!) so it uses a consensus algorithm called Raft to keep its database in sync.

If you’re interested in consensus in Consul, read more here.

The beginning of Consul at Stripe

We started out by only writing to Consul—having machines report whether or not they were up to the Consul server, but not using that information to do service discovery. We wrote some Puppet configuration to set it up, which wasn’t that hard!

This way we could uncover potential issues with running the Consul client and get experience operating it on thousands of machines. At first, no services were being discovered with Consul.

What could go wrong?

Addressing memory leaks

If you add a new piece of software to every box in your infrastructure, that software could definitely go wrong! Early on we ran into memory leaks in Consul’s stats library: we noticed that one box was taking over 100MB of RAM and climbing. This was a bug in Consul, which got fixed.

100MB of memory is not a big leak, but the leak was growing quickly. Memory leaks in general are worrisome because they're one of the easiest ways for one process on a box to Totally Ruin Things for other processes on the box.

Good thing we decided not to use Consul to discover services to start! Letting it sit on a bunch of production machines and monitoring memory usage let us find out about a potentially serious problem quickly with no negative impact.

Starting to discover services with Consul

Once we were more confident that running Consul in our infrastructure would work, we started adding a few clients to talk to Consul! We made this less risky in 2 ways:

  • Only use Consul in a few places to start,
  • Keep a fallback system in place so that we could function during outages.

Here are some of the issues we ran into. We’re not listing these to complain about Consul, but rather to emphasize that when using new technology, it’s important to roll it out slowly and be cautious.

A ton of Raft failovers. Remember that Consul uses a consensus protocol? This copies all the data on one server in the Consul cluster to other servers in that cluster. The primary server was having a ton of problems with disk I/O—the disks weren’t fast enough to do the reads that Consul wanted to do, and the whole primary server would hang. Then Raft would say “oh, the primary is down!” and elect a new primary, and the cycle would repeat. While Consul was busy electing a new primary, it would not let anybody write anything or read anything from its database (because consistent reads are the default).

Version 0.3 broke SSL completely. We were using Consul’s SSL feature (technically, TLS) for our Consul nodes to communicate securely. One Consul release just broke it. We patched it. This is an example of a kind of issue that isn’t that difficult to detect or scary (we tested in QA, realized SSL was broken, and just didn’t roll out the release), but is pretty common when using early-stage software.

Goroutine leaks. We started using Consul’s leader election. and there was a goroutine leak that caused Consul to quickly eat all the memory on the box. The Consul team was really helpful in debugging this and we fixed a bunch of memory leaks (different memory leaks from before).

Once all of these were fixed, we were in a much better place. Getting from “our first Consul client” to “we’ve fixed all these issues in production” took a bit less than a year of background work cycles.

Scaling Consul to discover which services are up

So, we’d learned about a bunch of bugs in Consul, and had them fixed, and everything was operating much better. Remember that step we talked about at the beginning, though? Where you ask Consul “hey, what boxes are up for api-service?” We were having intermittent problems where the Consul server would respond slowly or not at all.

This was mostly during raft failovers or instability; because Consul uses a strongly-consistent store its availability will always be weaker than something that doesn't. It was especially rough in the early days.

We still had fallbacks, but Consul outages became pretty painful for us. We would fall back to a hardcoded set of DNS names (like “apibox1”) when Consul was down. This worked okay when we first rolled out Consul, but as we scaled and used Consul more widely, it became less and less viable.

Consul Template to the rescue

Asking Consul which services were up (via its HTTP API) was unreliable. But we were happy with it otherwise!

We wanted to get information out of Consul about which services were up without using its API. How?

Well, Consul would take a name (like monkey-srv) and translate it into one or several IP addresses (“this is where monkey-srv lives”). Know what else takes in names and outputs IP address? A DNS server! So we replaced Consul with a DNS server. Here’s how: Consul Template is a Go program that generates static configuration files from your Consul database.

We started using Consul Template to generate DNS records for Consul services. So if monkey-srv was running on IP, we’d generate a DNS record:

monkey-srv.service.consul IN A

Here’s what that looks like in code. You can also find our real Consul Template configuration which is a little more complicated.

{{range service $service.Name}}
{{$service.Name}}.service.consul. IN A {{.Address}}

If you're thinking "wait, DNS records only have an IP address, you also need to know which port the server is running on," you are right! DNS A records (the kind you normally see) only have an IP address in them. However, DNS SRV records can have ports in them, and we also use Consul Template to generate SRV records.

We run Consul Template in a cron job every 60 seconds. Consul Template also has a “watch” mode (the default) which continuously updates configuration files when its database is updated. When we tried the watch mode, it DOSed our Consul server, so we stopped using it.

So if our Consul server goes down, our internal DNS server still has all the records! They might be a little old, but that’s fine. What’s awesome about our DNS server is that it’s not a fancy distributed system, which means it’s a much simpler piece of software, and much less likely to spontaneously break. This means that I can just look up monkey-srv.service.consul get an IP, and use it to talk to my service!

Because DNS is a shared-nothing eventually consistent system we can replicate and cache it a bunch (we have 5 canonical DNS servers and every server has a local DNS cache and knows how to talk to any of the 5 canonical servers) so it's fundamentally more resilient than Consul.

Adding a load balancer for faster healthchecks

We just said that we update DNS records from Consul every 60 seconds. So, what happens if an API server explodes? Do we keep sending requests to that IP for 45 more seconds until the DNS server gets updated? We do not! There’s one more piece of the story: HAProxy.

HAProxy is a load balancer. If you give a healthcheck for the service it’s sending requests to, it can make sure that your backends are up! All of our API requests actually go through HAProxy. Here’s how it works:

  • Every 60 seconds, Consul Template writes an HAProxy configuration file.
  • This means that HAProxy always has an approximately correct set of backends.
  • If a machine goes down, HAProxy realizes quickly that something has gone wrong (since it runs healthchecks every 2 seconds).

This means we restart HAProxy every 60 seconds. But does that mean we drop connections when we restart HAProxy? No. To avoid dropping connections between restarts, we use HAProxy’s graceful restarts feature. It’s still possible to drop some traffic with this restart policy, as described here, but we don’t process enough traffic that it’s an issue.

We have a standard healthcheck endpoint for our services—almost every service has a /healthcheck endpoint that returns 200 if it’s up and and errors if not. Having a standard is important because it means we can easily configure HAProxy to check service health.

When Consul is down, HAProxy will just have a stale configuration file, which will keep working.

Trading consistency for availability

If you’ve been paying close attention, you’ll notice that the system we started with (a strongly consistent database which was guaranteed to have the latest state) was very different from the the system we finished with (a DNS server which could be up to a minute behind). Giving up our requirement for consistency let us have a much more available system—Consul outages have basically no effect on our ability to discover services.

An important lesson from this is that consistency doesn’t come for free! You have to be willing to pay the price in availability, and so if you’re going to be using a strongly consistent system it’s important to make sure that’s something you actually need.

What happens when you make a request

We covered a lot in this post, so let’s go through the request flow now that we’ve learned how it all works.

When you make a request for, what happens? How does it end up at the right server? Here’s a simplified explanation:

  1. It comes into one of our public load balancers, running HAProxy,
  2. Consul Template has populated a list of servers serving in the /etc/haproxy.conf configuration file,
  3. HAProxy reloads this configuration file every 60 seconds,
  4. HAProxy sends your request on to a server! It makes sure that the server is up.

It’s actually a little more complicated than that (there’s actually an extra layer, and Stripe API requests are even more complicated, because we have systems to deal with PCI compliance), but all the core ideas are there.

This means that when we bring up or take down servers, Consul takes care of removing them from the HAProxy rotation automatically. There’s no manual work to do.

More than a year of peace

There are a lot of areas we’re looking forward to improving in our approach to service discovery. It’s a space with loads of active development and we see some elegant opportunities for integrating our scheduling and request routing infrastructure in the near future.

However, one of the important lessons we’ve taken away is that simple approaches are often the right ones. This system has been working for us reliably for more than a year without any incidents. Stripe doesn’t process anywhere near as many requests as Twitter or Facebook, but we do care a very great deal about extreme reliability. Sometimes the best wins come from deploying a stable, excellent solution instead of a novel one.

Like this post? Join the Stripe engineering team. View Openings

October 31, 2016

A primer on machine learning for fraud detection

Michael Manapat on October 27, 2016 in Engineering

Stripe Radar is a collection of tools to help businesses detect and prevent fraud. At Radar’s core is a machine learning engine that scans every card payment across Stripe’s 100,000+ businesses, aggregates information from those payments into behavioral signals that are predictive of fraud, and blocks payments that have a high probability of being fraudulent.

Radar’s power comes from all the data we obtain from the Stripe “network.” Instead of requiring users to label charges manually, Radar obtains the “ground truth” of fraud directly from our banking partners. Just as importantly, the signals we use in our models include aggregates over the entire stream of payments processed by Stripe: when a card is used for the first time on a Stripe business, there’s an 80% chance we’ve seen that card elsewhere on the Stripe network, and those previous interactions provide valuable information about potential fraud.

If you’re curious to learn more, we’ve put together a detailed outline that describes how we use machine learning at Stripe to detect and prevent fraud.

Read more

October 27, 2016

Introducing Veneur: high performance and global aggregation for Datadog

Cory Watson on October 18, 2016 in Engineering

When a company writes about their observability stack, they often focus on sweet visualizations, advanced anomaly detection or innovative data stores. Those are well and good, but today we’d like to talk about the tip of the spear when it comes to observing your systems: metrics pipelines! Metrics pipelines are how we get metrics from where they happen—our hosts and services—to storage quickly and efficiently so they can be queried, all without interrupting the host service.

First, let’s establish some technical context. About a year ago, Stripe started the process of migrating to Datadog. Datadog is a hosted product that offers metric storage, visualization and alerting. With them we can get some marvelous dashboards to monitor our Observability systems:

A screenshot of a DataDog dashboard showing several graphs

Observability Overview Dashboard

Previously, we’d been using some nice open-source software but it was sadly unowned and unmaintained internally. Facing the high cost—in money and people—we decided that outsourcing to Datadog was a great idea. Nearly a year later, we’re quite happy with the improved visibility and reliability we’ve gained through significant effort in this area. One of the most interesting aspects of this work was how to even metric!

Using StatsD for metrics

There are many ways to instrument your systems. Our preferred method is the StatsD style: a simple text-based protocol with minimal performance impact. Code is instrumented to emit UDP to a central server at runtime whenever measured stuff happens.

Like all of life, this choice has tradeoffs. For the sake of brevity, we’ll quickly mention the two downsides of StatsD that are most relevant to us: its use of inherently unreliable UDP, and its role as a Single Point of Failure for timer aggregation.

As you may know, UDP is a “fire and forget” protocol that does not require any acknowledgement by the receiver. This makes UDP pretty fast for the client, but also means that client has no way to ensure that the metric was received by anyone! When combined with the network and the host’s natural protections that cause traffic to be dropped, you’ve got a problem.

Another problem is the Single Point of Failure. The poor StatsD server has to process a lot of UDP packets if you’ve got a non trivial number of sources. Add to that the nightmare of the machine going down and the need to shard or use other tricks to scale out, and you’ve got your work cut out for you.

DogStatsD and the lack of “global”

Aware that a central StatsD can be a problem for some, Datadog takes a different approach: Each host runs an instance of DogStatsD as part of the Datadog agent. This neatly sidesteps most performance problems but created a large feature regression for Stripe: no more global percentiles. Datadog only supports per-host aggregations for histograms, timers and sets.

Remember that, with StatsD, you emit a metric to the downstream server each time the event occurs. If you’re measuring API requests and emitting that metric on each host, you are now sending your timer metric to the local Datadog agent which aggregates them and flushes them to Datadog’s servers in batches. For counters, this is great because you can just add them together! But for percentiles we’ve got problems. Imagine you’ve got hundreds of servers each doing an unequal number of API requests with unequal workloads. Our percentiles are not representative of how our whole API is behaving. Even worse, once we’ve generated the percentiles for our histograms there is no meaningful way, mathematically, to combine them. (More precisely, the percentiles of arbitrary subsamples of a distribution are not sufficient for the percentiles of the full distribution).

Stripe needs to know the overall percentiles because each host’s histogram only has a small subset of random requests. We needed something better!

Enter Veneur

To provide these features to Stripe we created Veneur, a DogStatsD server with global aggregation capability. We’re happily running it in production and you can too! It’s open-source and we’d love for you to take a look.

Veneur runs in place of Datadog’s bundled DogStatsD server, listening on the same port. It flushes metrics to Datadog just like you’d expect. That’s where the similarities end, however, and the magic begins.

Instead of aggregating the histogram and emitting percentiles at flush time, Veneur forwards the histogram on to a global Veneur instance which merges all the histograms and flushes them to Datadog at the next window. It adds a bit of delay—one flush period—but the result is a best-of-both mix of local and global metrics!

We monitor the performance of many of our API calls, such as this chart of various percentiles for creating a charge. Red bars are deploys!

Approximate, mergeable histograms

As mentioned earlier, the essential problem with percentiles is that, once reported, they can’t be combined together. If host A received 20 requests and host B received 15, the two numbers can be added to determine that, in total, we had 35 requests. But if host A has a 99th percentile response time of 8ms and host B has a 99th percentile response time of 10ms, what’s the 99th percentile across both hosts?

The answer is, “we don’t know”. Taking the mean of those two percentiles results in a number that is statistically meaningless. If we have more than two hosts, we can’t simply take the percentile of percentiles either. We can’t even use the percentiles of each host to infer a range for the global percentile—the global 99th percentile could, in rare cases, be larger than any of the individual hosts’ 99th percentiles. We need to take the original set of response times reported from host A, and the original set from host B, and combine those together. Then, from the combined set, we can report the real 99th percentile across both hosts. That’s what forwarding is for.

Of course, there are a few caveats. If each histogram stores all the samples it received, the final histogram on the global instance could potentially be huge. To sidestep this issue, Veneur uses an approximating histogram implementation called a t-digest, which uses constant space regardless of the number of samples. (Specifically, we wrote our own Go port of it.) As the name would suggest, approximating histograms return approximate percentiles with some error, but this tradeoff ensures that Veneur’s memory consumption stays under control under any load.


The global Veneur instance is also a single point of failure for the metrics that pass through it. If it went down we would lose percentiles (and sets, since those are forwarded too). But we wouldn’t lose everything. Besides the percentiles, StatsD histograms report a counter of how many samples they’ve received, and the minimum/maximum samples. These metrics can be combined without forwarding (if we know the maximum response time on each host, the maximum across all hosts is just the maximum of the individual values, and so on), so they get reported immediately without any forwarding. Clients can opt out of forwarding altogether, if they really do want their percentiles to be constrained to each host.

Veneur’s Overview Dashboard, well instrumented and healthy!

Other cool features and errors

Veneur—named for the Grand Huntsman of France, master of dogs!—also has a few other tricks:

  • Drop-in replacement for Datadog’s included DogStatsD. It even processes events and service checks!
  • Written in Go to minimize deployment troubles to a single binary
  • Use of HyperLogLogs for counting the unique members of a set efficiently with fixed memory consumption
  • Extensive metrics (natch) so you can watch the watchers
  • Efficient compressed, chunked POST requests sent concurrently to Datadog’s API
  • Extremely fast

Over the course of Veneur’s development we also iterated a lot. Our initial implementation was purely a global DogStatsD implementation without the forwarding or merging. It was really fast, but we quickly decided that processing more packets faster wasn’t really going to get us very far.

Next we took some twists and turns through “smart clients” that tried to route metrics to the appropriate places. This was initially promising, but we found that supporting this for each of our language runtimes and use cases was prohibitively expensive and undermined (Dog)StatsD’s simplicity. Some of our instrumentation is as simple as an nc command and that simplicity is very helpful to quickly instrument things.

While overall our work was transparent, we did cause some trouble when we initially turned the global features back on. Some teams had come to rely on per-host for very specific metrics. When we had to fail back to host local for some refactoring, we caused problems to teams who had just adapted to using global features. Argh! Each of these wound up being positive learning experiences, and we found Stripe’s engineers to be very accommodating. Thanks!

Thanks and future work

The Observability team would like to thank Datadog for their support and advice in the creation of Veneur. We’d also like to thank our friends and teammates at Stripe for their patience as we iterated to where we are today. Specifically for the occasional broken charts, metrics outages and other hilarious-in-hindsight problems we caused along the way.

We’ve been running Veneur in production for months and have been enjoying the fruits of our labor. We’re now operating at a stable, more mature pace for improvements around efficiency learned from monitoring its production behavior. We hope to leverage Veneur in the future for continued improvements to the features and reliability of our metrics pipeline. We’ve discussed additional protocol features, unified formats, per-team accounting and even incorporating other sensor data like tracing spans. Veneur’s speed, instrumentation and flexibility give us lots of room to grow and improve. Someone’s gotta feed those wicked cool visualizations and anomaly detectors!

Like this post? Join the Stripe engineering team. View Openings

October 18, 2016

Running three hours of Ruby tests in under three minutes

Nelson Elhage on August 13, 2015 in Engineering

At Stripe, we make extensive use of automated testing to help ensure the stability and reliability of our services. We have expansive test coverage for our API and other core services, we run tests on a continuous integration server over every git branch, and we never deploy without green tests.

The size and complexity of our codebase has grown over the past few years—and so has the size of the test suite. As of August 2015, we have over 1400 test files that define nearly 15,000 test cases and make over 130,000 assertions. According to our CI server, the tests would take over three hours if run sequentially.

With a large (and growing) group of engineers waiting for those tests with every change they make, the speed of running tests is critical. We’ve used a number of hosted CI solutions in the past, but as test runtimes crept past 10 minutes, we brought testing in-house to give us more control and room for experimentation.

Recently, we’ve implemented our own distributed test runner that brought the runtime of our tests to just under three minutes. While some of these tactics are specific to our codebase and systems, we hope sharing what we did to improve our test runtimes will help other engineering organizations.

Forking executor

We write tests using minitest, but we've implemented our own plugin to execute tests in parallel across multiple CPUs on multiple different servers.

In order to get maximum parallel performance out of our build servers, we run tests in separate processes, allowing each process to make maximum use of the machine's CPU and I/O capability. (We run builds on Amazon's c4.8xlarge instances, which give us 36 cores each.)

Initially, we experimented with using Ruby's threads instead of multiple processes, but discovered that using a large number of threads was significantly slower than using multiple processes. This slowdown was present even if the ruby threads were doing nothing but monitoring subprocess children. Our current runner doesn’t use Ruby threads at all.

When tests start up, we start by loading all of our application code into a single Ruby process so we don’t have to parse and load all our Ruby code and gem dependencies multiple times. This process then calls fork a number of times to produce N different processes that’ll each have all of the code pre-loaded and ready to go.

Each of those workers then starts executing tests. As they execute tests, our custom executor forks further: Each process forks and executes a single test file’s worth of tests inside the child process. The child process writes the results to the parent over a pipe, and then exits.

This second round of forking provides a layer of isolation between tests: If a test makes changes to global state, running the test inside a throwaway process will clean everything up once that process exits. Isolating state at a per-file level also means that running individual tests on developer machines will behave similarly to the way they behave in CI, which is an important debugging affordance.


The custom forking executor spawns a lot of processes, and creates a number of scratch files on disk. We run all builds at Stripe inside of Docker, which means we don't need to worry about cleaning up all of these processes or this on-disk state. At the end of a build, all of the state—be that in-memory processes or on disk—will be cleaned up by a docker stop, every time.

Managing trees of UNIX processes is notoriously difficult to do reliably, and it would be easy for a system that forks this often to leak zombie processes or stray workers (especially during development of the test framework itself). Using a containerization solution like Docker eliminates that nuisance, and eliminates the need to write a bunch of fiddly cleanup code.

Managing build workers

In order to run each build across multiple machines at once, we need a system to keep track of which servers are currently in-use and which ones are free, and to assign incoming work to available servers.

We run all our tests inside of Jenkins; Rather than writing custom code to manage worker pools, we (ab)use a Jenkins plugin called the matrix build plugin.

The matrix build plugin is designed for projects where you want a "build matrix" that tests a project in multiple environments. For example, you might want to build every release of a library against several versions of Ruby and make sure it works on each of them.

We misuse it slightly by configuring a custom build axis, called BUILD_ROLE, and telling Jenkins to build with BUILD_ROLE=leader, BUILD_ROLE=worker1, BUILD_ROLE=worker2, and so on. This causes Jenkins to run N simultaneous jobs for each build.

Combined with some other Jenkins configuration, we can ensure that each of these builds runs on its own machine. Using this, we can take advantage of Jenkins worker management, scheduling, and resource allocation to accomplish our goal of maintaining a large pool of identical workers and allocating a small number of them for each build.


Once we have a pool of workers running, we decide which tests to run on each node.

One tactic for splitting work—used by several of our previous test runners—is to split tests up statically. You decide ahead of time which workers will run which tests, and then each worker just runs those tests start-to-finish. A simple version of this strategy just hashes each test and take the result modulo the number of workers; Sophisticated versions can record how long each test took, and try to divide tests into group of equal total runtime.

The problem with static allocations is that they’re extremely prone to stragglers. If you guess wrong about how long tests will take, or if one server is briefly slow for whatever reason, it’s very easy for one job to finish far after all the others, which means slower, less efficient, tests.

We opted for an alternate, dynamic approach, which allocates work in real-time using a work queue. We manage all coordination between workers using an nsqd instance. nsq is a super-simple queue that was developed at; we already use it in a few other places, so it was natural to adopt here.

Using the build number provided by Jenkins, we separate distinct test runs. Each run makes use of three queues to coordinate work:

  • The node with BUILD_ROLE=leader writes each test file that needs to be run into the test.<BUILD_NUMBER>.jobs queue.
  • As workers execute tests, they write the results back to the test.<BUILD_NUMBER>.results queue, where they are collected by the leader node.
  • Once the leader has results for each test, it writes "kill" signals to the test.<BUILD_NUMBER>.shutdown queue, one for each worker machine. A thread on each worker pulls off a single event and terminates all work on that node.

Each worker machine forks off a pool of processes after loading code. Each of those processes independently reads from the jobs queue and executes tests. By relying on nsq for coordination even within a single machine, we have no need for a second, machine-local, communication mechanism, which might risk limiting our concurrency across multiple CPUs.

Other than the leader node, all nodes are homogenous; they blindly pull work off the queue and execute it, and otherwise behave identically.

Dynamic allocation has proven to be hugely effective. All of our worker processes across all of our different machines reliably finish within a few seconds of each other, which means we're making excellent use of our available resources.

Because workers only accept jobs as they go, work remains well-balanced even if things go slightly awry: Even if one of the servers starts up slightly slowly, or if there isn't enough capacity to start all four servers right at once, or if the servers happen to be on different-sized hardware, we still tend to see every worker finishing essentially at once.


Reasoning about and understanding performance of a distributed system is always a challenging task. If tests aren't finishing quickly, it's important that we can understand why so we can debug and resolve the issue.

The right visualization can often capture performance characteristics and problems in a very powerful (and visible) way, letting operators spot the problems immediately, without having to pore through reams of log files and timing data.

To this end, we've built a waterfall visualizer for our test runner. The test processes record timing data as they run, and save the result in a central file on the build leader. Some Javascript d3 code can then assemble that data into a waterfall diagram showing when each individual job started and stopped.

Waterfall diagrams of a slow test run and a fast test run.

Each group of blue bars shows tests run by a single process on a single machine. The black lines that drop down near the right show the finish times for each process. In the first visualization, you can see that the first process (and to a lesser extent, the second) took much longer to finish than all the others, meaning a single test was holding up the entire build.

By default, our test runner uses test files as the unit of parallelism, with each process running an entire file at a time. Because of stragglers like the above case, we implemented an option to split individual test files further, distributing the individual test classes in the file instead of the entire file.

If we apply that option to the slow files and re-run, all the "finished" lines collapse into one, indicating that every process on every worker finished at essentially the same time—an optimal usage of resources.

Notice also that the waterfall graphs show processes generally going from slower tests to faster ones. The test runner keeps a persistent cache recording how long each test took on previous runs, and enqueues tests starting with the slowest. This ensures that slow tests start as soon as possible and is important for ensuring an optimal work distribution.

The decision to invest effort in our own testing infrastructure wasn't necessarily obvious: we could have continued to use a third-party solution. However, spending a comparatively small amount of effort allowed the rest of our engineering organization to move significantly faster—and with more confidence. I'm also optimistic this test runner will continue to scale with us and support our growth for several years to come.

If you end up implementing something like this (or have already), send me a note! I'd love to hear what you've done, and what's worked or hasn't for others with similar problems.

August 13, 2015


Greg Brockman on December 16, 2014 in Engineering

When we announced the Open Source retreat, we'd pictured it primarily as giving people the opportunity to work on projects they'd already been meaning to do. However, the environment we provided also became a place for people to come up with new ideas and give them a try. One of these ideas, Libscore, is launching publicly today.

Top libraries used across the web.

Libscore, built by Julian Shapiro with support from both us and Digital Ocean, makes it possible for frontend developers to see where their work is being used. The service periodically crawls the top million websites, determines the JavaScript libraries in use on each, and makes that data publicly queriable.

For example, wondering about MVC framework popularity? Backbone is used on about 8,000 of the top million sites while Ember appears on only 185. You can also query which libraries are used on your favorite site, or view some precompiled aggregates.

We were attracted to Libscore because it sounded like internet infrastructure that should exist. Sometimes—as with our support for Alipay—we get to build such components directly; sometimes, it seems better to support something external—as with Stellar. If you have other ideas, please let us know (or work on them here!).

December 16, 2014

Scaling email transparency

Greg Brockman on December 8, 2014 in Engineering

In February 2013, we blogged about email transparency at Stripe. Since then a number of other companies have implemented their own versions of it (which a few have talked about publicly). We often get asked whether email transparency is still around, and if so, how we've scaled it.

Email transparency continues to be one important tool for state transfer at Stripe. The vast majority of Stripe email (excluding particularly sensitive classes of email or threads where a participant has a strong expectation of privacy) remains publicly available through the company.

Today we're publishing two key components that have allowed us to scale it this far: our list manager tool and updated internal documentation reflecting what we've learned over the past year and a half. Hopefully these will make it easier for others to run email transparency at their own organizations.


In the time since our first post, we've grown our mailing list count almost linearly with headcount: from 40 employees and 119 mailing lists in February 2013 to now 164 people and 428 lists. A plurality are project lists (sys@, sys-archive@, sys-bots@, sys-ask@), but there's also a long tail on topics ranging from country operations (australia@) to ideas for things Stripe should try (crazyideas@).

We use Google Groups for our email list infrastructure. Today we're releasing the web interface we've built on Google's APIs to make managing many list subscriptions (and associated filters) easy. This interface, called Gaps, lets you do things like:

  • Quickly subscribe to or unsubscribe from a list.
  • View your organization's lists (categorized by topic), and which you're subscribed to (including indirect subscriptions through other lists).
  • Get notifications when new lists are created.
  • Generate and upload GMail filters.

Here's a quick sample of what Gaps looks like:

Check it out and let us know what you think!

Updated internal documentation

Scaling email transparency has required active cultural effort and adaptation. As our team grew, we'd notice that formerly good patterns could turn sour. For example, at first email transparency would improve many conversations by letting people drop in with helpful tidbits. But with a larger team, having many people jumping into a conversation would instead grind the thread to a halt.

As we've identified cases where email transparency didn't scale well, we've made changes to our culture. Below is our updated internal documentation on how we approach email transparency. It embodies what we've learned about how to make email transparency work at an organization of our size:

Email transparency (from our internal wiki)

One of Stripe's core strategies is hiring great people and then making sure they have enough information to make good local decisions. Email transparency is one system that has helped make this possible. As with any rule at Stripe, you should consider the recommendations in this document to be strong defaults, which you should just override if they don't make sense in a particular circumstance.

How it works

Email transparency is fairly simple: make your emails transparent by CCing a list, and make it easy for others to be transparent by observing the responsibilities below.

The main mechanisms of email transparency are the specially-designated archive lists, to which you should CC all mail that would normally be off-list, but only because of its apparent irrelevance rather than out of any particular desire for secrecy. The goal isn't to share things that would otherwise be secret: it's to unlock the wealth of information that would otherwise be accidentally locked up in a few people's inboxes.

In general, if you are debating including an archive list, you should include it. This includes internal P2P email which you would normally leave off a list, emails to vendors, and scheduling email. Don't be afraid to send "boring" email to an archive list — people have specifically chosen to subscribe to that list. You should expect most people will autoarchive this list traffic (hence the name!), and then dip into it as they prefer.

If you're new to it, email transparency always feels a bit weird at first, but it doesn't take long to get used to it.

What's the point?

Email transparency is something few other organizations try to do. It's correspondingly on us to make sure we have really good indicators for how it's valuable. Here's a sample of things people have found useful about email transparency:

  • Provides the full history on interactions that are relevant to you. If you're pulled into something, you can always pull up the relevant state. This is especially useful for external communications with users or vendors.
  • Provides a way for serendipitous interactions to happen — someone who has more state on something may notice what's happening and jump in to help (subject to the limitations about jumping in).
  • Lets you keep up with things going on at various other parts of Stripe, at whatever granularity you want. This reduces siloing, makes it easier to function as a remote (and even just know what we're working on), and generally increases the feeling of connectedness.
  • Requires ~no additional effort from the sender.
  • Makes conversations persistent and linkable, which is particularly useful for new hires.
  • Forces us to think about how we're segmenting information — if you're tempted to send something off-list, you should think through why.
  • Makes spin-up easier by immersing yourself in examples of Stripe tone and culture, and enabling you to answer your own questions via the archives.
  • Helps you learn how different parts of the business work.

Reader responsibilities

Email transparency cuts two ways. Being able to see the raw feed of happenings at Stripe as they unfold is awesome, but it also implies an obligation to consume responsibly. Overall, threads on an archive list merit a level of civil inattention — you should feel free to read it, but be careful about adding your own contributions.

  • Talk to people rather than silently judging. If you see something on an email list that rubs you the wrong way or that you think doesn't make sense (e.g. "why are we working on that?", "that email seems overly harsh/un-Stripelike"), you should talk to that person directly (or their manager, if there's a reason you can't talk to them about it). Remember that we hire smart people, and if something seems off you're likely missing context or a view of the larger picture. No one wants their choice to send email on-list to result in a bunch of people making judgements without telling them, or chattering behind their back — if that can happen, then people will be less likely to CC a list in the future.
  • Avoid jumping in. A conversation on an archive list should be considered a private conversation between the participants. When people jump into the thread, it often grinds to a halt and nothing gets done. There will be some very rare occasions (e.g. if you have some factual knowledge the participants probably don't) where it's ok to lurk in to the thread, but in practice these should be very rare. By convention, the people on the thread may ignore your email; don't take it personally — it's just a way of making sure that email transparency doesn't accidentally make email communication harder. Knowing when to jump in is an art, and when in doubt, don't.
  • Don't penalize people for choosing to CC a list. Ideally, people are writing their emails exactly as they would if they were off-list. So be cognizant about creating additional overhead for people because they chose to CC the list. There may be typos or things that you're wondering about or don't make sense. If you're *concerned* about something being actively bad, then you should talk to the person, but if it's something small (e.g. "there's a typo", "this tone isn't Stripelike", "this conversation seems like a waste of time"), you should trust that there's either a reason, or the person's manager will be on the lookout to help them (especially if they're new).
  • Help others live by the above responsibilities. The only way we can preserve email transparency is by collectively nudging each other onto the right course. Whether it's poking someone to CC a list, or telling someone to stop venting about an email but just go talk to the author, the person responsible for fixing the shortfalls you see is the same as the one responsible for your availability.

Common scenarios/FAQs

  • I don't mind people being able to read this boring scheduling email, but I don't think it's worth anyone's time to read. You should still send it to an archive list! Archive lists are intended to be the feed of everything going on within a particular team — let the people who are subscribing decide if it's worth their time or not.
  • I have a small joke on this thread. Should I CC it to the list, or just send it to one person (or a small set of people)? Small jokes are good! The main cost is potentially derailing the relevant thread. So generally, if it's a productive, focused thread, just send your joke off-list, but if it's already fairly broad, then you should feel free to send the joke publicly.
  • I feel like I need to write my email for the broad audience that might be reading it, rather than the one person it's actually meant for. The only change between how you write emails for email transparency and how you would write them privately to other Stripes should be that one has a CC. That is, if you feel a need to rewrite your emails for the audience, then that likely indicates a bug in the organization we should fix. If you notice yourself having this tendency, talk to gdb — we should be able to shift the norms of the organization so this isn't a problem.
  • How do we make sure this respects outside people's expectations? In many ways, email transparency is just a more extreme version of what happens at other organizations — since it's opt-in, all of the emails are human-vetted to be shareable. Email transparency is mostly about changing the default thresholds. As a corollary, if someone requests that their email not be shared, then certainly respect their request.

Common exceptions

Like any tool, email transparency has its limitations. Since it's in many ways a one-way communication system, email transparency is bad for sensitive situations where people may react strongly. It's also important to preserve people's privacy. The following is a description of the classes of things which you may not see on an archive list.

  • Anything personnel related (e.g. performance).
  • Some recruiting conversations, especially during closing or when people are confidentially looking around. People's decision-making process at that stage is usually quite personal, and even if people have a hard time picking Stripe, we want to make sure that they start with a blank slate.
  • Communications of mixed personal and professional nature (e.g. recruiting a friend).
  • Early stage discussions about topics that will affect Stripes personally (e.g. changing our approach to compensation).
  • Some particularly sensitive partnerships.

As we said in the original email transparency post, it's hard to know how far it will scale. That doesn't bother us much: we continue to do unscalable things until they break down. The general sentiment at Stripe is that email transparency adds a lot of value, and it seems we'll keep being able to find tweaks to keep it going.

Hopefully these components will help you with email transparency in your own organization. If you end up implementing something similar, I'd love to hear about it!

December 8, 2014