

To design and develop an interactive globe

Nick Jones on September 1, 2020 in Engineering

As humans, we’re driven to build models of our world.

A traditional globemaker molds a sphere, mounts it on an axle, balances it with hidden weights, and precisely applies gores—triangular strips of printed earth—to avoid overlap and align latitudes. Cartographers face unenviable trade-offs when making maps. They can either retain the shape of countries, but warp their size—or maintain the size of countries, but contort their shape. In preserving one aspect of our world, they distort another.

These are terrestrial globe gores reissued by Giuseppe di Rossi in 1615.

As visual designers and software engineers, we’re modeling a piece of the world every time we build software. In some cases, it’s the entire world—and that digital world is animated and interactive. There are tools that render 3D objects on the web, but they’re considered sorcery by many. And conjuring that magic doesn’t come without sweat. In WebGL, displaying a single triangle—like a globemaker’s gore—with no lights, textures, interactivity, or motion requires 50+ lines of code.

For our new landing page, we built a 1:40 million-scale, interactive 3D model of the earth. We wanted to convey the interconnected nature of the internet economy and the global scale of our service, while acknowledging how much ground is yet to be covered. Despite expanding to 40 countries and processing payments from 195 countries, we grapple with the complexity of cross-border operations and expansion every day.

We set out to build a globe that inspires a sense of awe, invites people to explore, and conceals details for discovery. Along the way, we evaluated existing tools, designed our own solution, solved four interesting technical challenges, and improved the way we collaborate. Here’s what we learned.

Ways to build the world

It wasn’t a given that we’d build an interactive 3D globe on our landing page. We designed our first version of the globe to communicate nuanced data about the amount of online, cross-border commerce happening between each country. For this reason, it includes extra visual details like country borders. For our landing page, the goal of the globe was to capture our global scale and bring a visual metaphor to life. A week before launch, we had a nice animated map where the globe now sits, but we didn’t love it. Despite the impending release, an executive (it was Patrick) asked us: what would you build if you had the time to do it the way you wish you could?

We decided on a globe—and felt it was a better option for three reasons. First, using a sphere to display the earth takes up less than 20% of the screen area required to display the world in two dimensions. Second, a globe more accurately portrays the relative size, shape, and orientation of countries and bodies of water, even though visibility of the entire world at a glance is easier with a map. (More than ¾ of the globe is either hidden on the reverse hemisphere or obscured by its curvature.) Lastly, as an interactive experience, spinning a globe is much more satisfying than scanning a map.

The sphere occupies approximately 17% of the total area of the map.
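The area claim is easy to sanity-check. Assuming an equirectangular (2πr × πr) map drawn at the same scale as the globe, the ratio of the two screen areas works out to 1/(2π), just under 16%:

```javascript
// Back-of-envelope check: screen area of a globe vs. a flat map
// at the same scale (assuming an equirectangular projection,
// which is 2πr wide by πr tall for a sphere of radius r).
const r = 600; // globe radius in pixels

const globeArea = Math.PI * r * r;                 // circle of radius r
const mapArea = (2 * Math.PI * r) * (Math.PI * r); // 2πr × πr

const ratio = globeArea / mapArea; // = 1 / (2π)

console.log(ratio.toFixed(3)); // "0.159"
```

The measured ~17% figure above is slightly higher because the real page doesn’t letterbox the map at an exact 2:1 aspect ratio.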

Once we settled on a globe, we had to work out how to bring it to life.

If we had known precisely the globe we wanted to build, we’d have been foolish not to hire GlobeKit. Instead, not knowing what we did not know, we decided to figure it out ourselves. The primary tools used to render 3D objects on the web, WebGL and GLSL shaders, can be daunting. Developers writing shaders can get by without deep knowledge of trigonometry and linear algebra, but a good understanding of these disciplines makes 3D graphics development substantially easier.

None of us on the team considered ourselves 3D artists, so we leaned on each other, the internet, and friends to help solve technical problems. To start, the project’s design lead created the best approximation of her vision of the globe in Photoshop. We naturally kept the globe’s design fluid, making it easy to adopt better ideas as they emerged without feeling precious about what was discarded.

When it became clear that writing our own 3D engine was out of scope, we decided to use Three.js. Three is an approachable layer for WebGL, which abstracts away much of its complexity behind a well-documented API. Originally ported from ActionScript (Flash) in 2010, Three helped us create rich 3D graphics that render in real-time in the browser without needing to define how light reflects on every pixel of every shape.

To render a Three.js scene, you need a renderer, a DOM element to render in, a scene, a camera, and a mesh with both material (fragment shader) and geometry (vertex shader).

Ten years after its release, Three.js matches—or surpasses—much of what was possible in Flash a decade ago. The most engaging interactive experiences on the web are now built with Three. The community’s enthusiasm bears resemblance to the early days of ActionScript with the added benefit of running on mobile browsers and requiring no plugins. As Three.js and WebGL gain popularity, approachability, and support, the web is poised to embrace 3D en masse. Since WebGL is GPU-accelerated, it’s capable of processing a surprising amount of continuous visual change without bottlenecking on the CPU even on lower-end consumer hardware. Finding the boundary between making our globe feel alive and crashing the browser would emerge as our greatest technical hurdle. But it wouldn’t be the only one.

Global issues

Like the challenges we faced, our globe goes a few levels deep. It’s composed of three distinct layers, despite appearing as a single surface. The base layer represents the oceans, and is a semi-transparent sphere with ~50 segments both horizontally and vertically. The second layer is another sphere textured with tens of thousands of twinkling dots. The outermost layer is made of animated arcs of color which travel from one pulsing disc to another, wrapping themselves around the two spheres. The arcs travel from any country where Stripe accepts payments to countries where businesses accept payments using Stripe.

We encountered several significant technical challenges along the way, each of which could have prevented us from realizing our vision. For the benefit of those generating their own interactive globes—or similar complex 3D objects—let’s break a few of these challenges down.

We stacked each layer of the globe to produce a single visual surface.

Challenge 1: Fill the surface with dots

The primary purpose of the outermost sphere—a layer of tens of thousands of dots—is to define continents. But as we removed the visual complexity of borders and animated each dot, they did more than communicate land masses; they made the globe feel alive. To make them work, we had two main requirements. The first was to maintain consistent spacing between every row and column of dots, from pole to pole. The second was to animate each dot individually.

Before landing on our final design, we tested and considered three different approaches to filling open space with a cluster of dots (as shown in the image below). Each attempt had its benefits and drawbacks.

This is the North Pole of the globe, mapped with dots using three different approaches.

  1. Image of evenly spaced dots. This approach is the simplest to create, but quickly becomes problematic. The dots fuse together as each row’s circumference shrinks toward the poles of the globe. And as a static bitmap rather than geometry, the image didn’t let us animate each dot individually without an overly complex shader.

  2. Image of unevenly spaced dots. We increased the width and horizontal spacing in the rows of dots. The image of nearly 80,000 dots places fewer, wider dots at the top and bottom. This tweak helps prevent the pinching and clumping of dots at the globe’s poles which invalidated the previous approach. When mapped as a texture onto a sphere, this image creates a nearly uniform spacing of dots. At first, we created this image texture by hand, then generated an SVG with JavaScript. This option better met our visual goals but still didn’t let us animate each dot. We assumed Three.js would work well with SVGs, but since every shape must be converted to triangles, the complexity of the conversion deterred us from this approach.

  3. Programmatically generated layer (vs. an image). The most straightforward way to animate individual dots is to generate them in a three-dimensional space. To do this, we reused the code from our SVG to generate rows of dots as geometry in Three.js. Each row includes a different number of dots, from zero at the poles to five hundred at the equator. We used a sine function to choose the number of dots for each row, plotted each dot, and applied the lookAt method to rotate each dot to face the center of the sphere. However, the number of dots jumped inconsistently along a few latitudes, creating harsh lines and an unnatural effect in the longitudinal columns.

The final attempt—and the right design—used a sunflower pattern. Like a sunflower’s pattern of seeds, the dots are a sequence of hexagons tightly coiled around latitudes from the top to the bottom of a sphere. Using the built-in setFromSphericalCoords method, we settled on this solution:

The spiral sunflower pattern was the winning design.

// Create 60000 tiny dots and spiral them around the sphere.
const DOT_COUNT = 60000;

// A hexagon with a radius of 2 pixels looks like a circle
const dotGeometry = new THREE.CircleGeometry(2, 5);

// The XYZ coordinate of each dot
const positions = [];

// A random identifier for each dot
const rndId = [];

// The country border each dot falls within
const countryIds = [];

const vector = new THREE.Vector3();

for (let i = DOT_COUNT; i >= 0; i--) {
  const phi = Math.acos(-1 + (2 * i) / DOT_COUNT);
  const theta = Math.sqrt(DOT_COUNT * Math.PI) * phi;

  // Pass the angle between this dot and the Y-axis (phi)
  // Pass this dot’s angle around the Y-axis (theta)
  // Scale each position by 600 (the radius of the globe)
  vector.setFromSphericalCoords(600, phi, theta);

  // Move the dot to the newly calculated position
  dotGeometry.translate(vector.x, vector.y, vector.z);
}
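For readers who want the underlying math, the spiral can be reproduced without Three.js. This is a sketch using Three’s y-up spherical convention (the sphericalToCartesian helper name is ours, not part of the library):

```javascript
// The sunflower spiral without Three.js, showing the math behind
// setFromSphericalCoords (Three's y-up convention: phi is the polar
// angle from +Y, theta the azimuth around the Y-axis).
function sphericalToCartesian(radius, phi, theta) {
  return {
    x: radius * Math.sin(phi) * Math.sin(theta),
    y: radius * Math.cos(phi),
    z: radius * Math.sin(phi) * Math.cos(theta),
  };
}

const DOT_COUNT = 60000;
const RADIUS = 600;
const points = [];

for (let i = DOT_COUNT; i >= 0; i--) {
  const phi = Math.acos(-1 + (2 * i) / DOT_COUNT); // 0 at one pole, π at the other
  const theta = Math.sqrt(DOT_COUNT * Math.PI) * phi;
  points.push(sphericalToCartesian(RADIUS, phi, theta));
}

// Every dot sits on the sphere's surface
const onSphere = points.every(
  (p) => Math.abs(Math.hypot(p.x, p.y, p.z) - RADIUS) < 1e-6,
);
console.log(onSphere); // true
```

Because consecutive dots advance by a fixed arc length along the spiral, spacing stays nearly uniform from pole to equator—exactly the property the earlier row-based attempts lacked.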

Challenge 2: Group the dots by country

On the landing page globe, dots form continents and light up where Stripe is live. In a previous iteration of the globe, we grouped dots by country to indicate where Stripe operates. We decided to turn this feature off for the interactive globe on our landing page, but thought it might be worthwhile to share how we approached grouping dots by country.

Once we filled the globe with dots, the next step was to transform our layers of spheres into a globe by defining countries. Our first goal was to make dots appear only within the borders of countries where Stripe is live. Once that was done, we needed dots within those live countries to be targets for animation as a group.

Each country where Stripe is live is given a unique color for identification.

A teammate who had recently experimented with shaders for a gaming project brought inspiration to this challenge. He thought to encode a PNG image with a unique color for each country where Stripe is live (see above). We used the built-in canvas getImageData to give us the color of each pixel in the image. Then, we matched each color to an array of country colors, tagging every dot with a unique countryId before passing its coordinates to the shader for rendering. Now we could isolate the group of dots in any country and animate its color, opacity, and position in z-space.
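Sampling the image requires projecting each dot’s 3D position into the flat texture. The post doesn’t show the production pointToUV helper, so this is a minimal sketch assuming an equirectangular mapping:

```javascript
// Map a point on the sphere to equirectangular UV coordinates
// (u and v in [0, 1]) so we can sample the color-coded map image.
// Assumes a y-up sphere centered on `center`.
function pointToUV(point, center) {
  const x = point.x - center.x;
  const y = point.y - center.y;
  const z = point.z - center.z;
  const radius = Math.hypot(x, y, z);

  // Longitude (around the Y-axis) drives u; latitude drives v
  const u = 0.5 + Math.atan2(x, z) / (2 * Math.PI);
  const v = 0.5 - Math.asin(y / radius) / Math.PI;
  return { u, v };
}

// A point on the equator facing the camera maps to the image center
console.log(pointToUV({ x: 0, y: 0, z: 600 }, { x: 0, y: 0, z: 0 }));
// → { u: 0.5, v: 0.5 }
```

Multiplying u and v by the image’s width and height gives the pixel whose color getImageData returns.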

The assumed drawback to generating all of the dots as individual geometry was the astronomical number of calculations required to animate the properties of 60,000 dots 60 times per second. Lucky for us, the earth’s surface is mostly water. By only rendering geometry for countries where Stripe is live, we reduced the geometry from 60,000 dots to ~20,000, passing a fraction of the data to the vertex shader. By pushing less data to the shader, we freed up rendering budget for use by other animations.

// We assign a color to each ISO country code
const COUNTRY_MAPPING = [
  [0, '#99cc99', 'at'],
  [1, '#993333', 'au'],
  [2, '#cccc00', 'be'],
  // …
];

// Load the color-coded image, then get each pixel’s color
new ImageLoader().load('map.png', (mapImage) => {
  const imageData = getImageData(mapImage);

  for (let i = 0; i < dotGeometry.faces.length; i++) {
    const face = dotGeometry.faces[i];

    // Map this dot’s position on the sphere to a pixel in the image
    const uv = pointToUV(dotGeometry.vertices[face.a], this.position);
    const sample = sampleImage(uv, imageData);

    // If there is no color data, move on to the next dot
    if (!sample[3]) continue;

    // Create the vertices which make up the face of each dot
    … // face.b, face.c

    // Tag each vertex of the dot with its countryId
    const [countryId] = getCountryId(sample);
    countryIds.push(countryId, countryId, countryId);
  }
});

// Convert RGB to hex and look up the countryId by color
function getCountryId([r, g, b, _]) {
  const hex =
    '#' +
    [r, g, b]
      .map((channel) => channel.toString(16).padStart(2, '0'))
      .join('');
  return COUNTRY_MAPPING.find(([, color]) => color === hex);
}

Challenge 3: Animate it all

After filling in the surface with dots and grouping them into countries, we needed to connect the dots to show how and where business is done globally. Our goal was to bring the globe to life, which meant adding animation. Early on, we knew we’d need to get the globe spinning, each dot twinkling, and bend arcs between countries to indicate transaction patterns. We wanted visitors to be able to control and spin the globe.

Around the time we started animating, we got a new teammate. In a past life, he had engineered the scrolling of the iconic Pencil by 53 site. In short order, he added animations for undulating, aurora borealis-like lights, made the globe rotate on page load, and spun the earth when the user scrolled the page. We handled the subtly twinkling dots and arcs with a custom fragment shader (and a lot of help), but the rest of the animation is vanilla JavaScript. requestAnimationFrame drives the motion of the arcs, the spinning of the globe, and the changing of colors.

// Draw an arc between two coordinates
constructor(start, end, radius) {
  // Convert latitude/longitude to XYZ on the globe
  const startXYZ = toXYZ(start[0], start[1], radius);
  const endXYZ = toXYZ(end[0], end[1], radius);

  // D3 interpolates along the great arc that passes
  // through both the start and end point
  const d3Interpolate = geoInterpolate(
    [start[1], start[0]],
    [end[1], end[0]],
  );
  const control1 = d3Interpolate(0.25);
  const control2 = d3Interpolate(0.75);

  // Set the arc height to half the distance between points
  const arcHeight = startXYZ.distanceTo(endXYZ) * 0.5 + radius;
  const controlXYZ1 = toXYZ(control1[1], control1[0], arcHeight);
  const controlXYZ2 = toXYZ(control2[1], control2[0], arcHeight);

  // CubicBezier allows for curves which travel half way
  // around the globe without penetrating the sphere
  const curve = new CubicBezierCurve3(startXYZ, controlXYZ1, controlXYZ2, endXYZ);

  // Arcs are curved tubes with 0.5px radius and 8 sides
  // Each curve is broken into 44 segments
  this.geometry = new THREE.TubeBufferGeometry(curve, 44, 0.5, 8);
  this.material = new THREE.ShaderMaterial({
    // A custom fragment shader animates arc colors
  });
  this.mesh = new THREE.Mesh(this.geometry, this.material);

  // Set the draw range to show only the first vertex
  this.geometry.setDrawRange(0, 1);
}

drawAnimatedLine = () => {
  let drawRangeCount = this.geometry.drawRange.count;
  const timeElapsed = Date.now() - this.startTime;

  // Animate the curve for 2.5 seconds
  const progress = timeElapsed / 2500;

  // Arcs are made up of roughly 3000 vertices
  drawRangeCount = progress * 3000;

  if (progress < 0.999) {
    // Update the draw range to reveal the curve
    this.geometry.setDrawRange(0, drawRangeCount);
  }
};

Challenge 4: Make it performant

Early on, we discussed our expectations for the globe’s performance on different browsers to frame our requirements. We boiled those expectations down to one requirement: all animation and scrolling effects had to perform at 60fps (to match the common display refresh rate of 60 Hz). If this condition couldn’t be met, we were prepared to fall back to a static image. Thanks to GPU-acceleration of WebGL and some of the findings mentioned here, we never had to abandon our interactive globe.

Initially, we ruled out mobile support. We assumed that scrolling and 3D animation would be too much for any machine, and that we’d have to either accept some lag or reduced motion on both smaller and more underpowered machines, or settle for the fallback. But as we learned about the capabilities of GPUs, we kept raising our expectations. Most of what’s possible in WebGL works on mobile without modification. We did make minor adjustments: during scroll, we pause all animations and throttle scroll events with Lodash so the globe spins without visual hiccups.
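Lodash handles the throttling for us; the underlying idea is small enough to sketch, though (a simplified leading-edge throttle, not Lodash’s full implementation, which also supports trailing calls and cancelation):

```javascript
// Simplified leading-edge throttle: invoke `fn` at most once
// every `wait` milliseconds, dropping the calls in between.
function throttle(fn, wait) {
  let last = 0;
  return function (...args) {
    const now = Date.now();
    if (now - last >= wait) {
      last = now;
      return fn.apply(this, args);
    }
  };
}

let frames = 0;
const onScroll = throttle(() => frames++, 16);

// A burst of scroll events within one 16ms window collapses to one call
for (let i = 0; i < 100; i++) onScroll();
console.log(frames); // 1
```

A 16ms window matches the 60fps frame budget (1000ms / 60 ≈ 16.7ms), so the handler runs at most once per frame.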

A few days prior to launch, we tested the page on laptops without dedicated GPUs and found they struggled to power a 5K display with the globe running fullscreen. We weren’t willing to accept defeat by falling back to an image for this rare case. Instead, we cycled through all possible bottlenecks one by one. No matter how much we simplified the geometry, stopped animations, or killed lights and shaders, we couldn’t smooth it out.

On a whim, we turned off the antialias parameter of the WebGL renderer. This one change not only fixed the issue on high-res monitors, but also improved performance of the animation and smoothness of scrolling on all devices, even those already running at 60fps. One might assume that removing antialiasing would make everything look pixelated. Since it only applies to the edges of the geometry, our image textures were still smooth, and gradients and lighting were unaffected. Though pixelation occurs minimally on the arcs around the globe, the performance gain was significant enough to accept the tradeoff.

// Turn off antialiasing for WebGL to improve performance
this.renderer = new WebGLRenderer({ antialias: false, alpha: true });

// Rotate the globe on scroll
import throttle from 'lodash/throttle';

const SCROLL_EPSILON = 0.0016;
const GLOBE_TRIGGER_TOP = window.innerHeight;

// Event handler: rotate the globe based on the current scroll position
this.universalScrollHandler = throttle(this.scrollHandler.bind(this), 16);
document.addEventListener('scroll', this.universalScrollHandler);

scrollHandler() {
  // Turns off all other animation
  this.isScrolling = true;
  this.oldScrollTop = this.scrollTop;
  this.scrollTop =
    document.scrollingElement.scrollTop || document.body.scrollTop;
  this.scrollDelta = this.oldScrollTop - this.scrollTop;

  const rotationDelta = this.scrollDelta * SCROLL_EPSILON;
  this.globeContainer.rotation.y += rotationDelta;

  // Once the browser scrolls past the globe on the page,
  // stop all animations and move the globe off-screen
  if (GLOBE_TRIGGER_TOP < this.scrollTop) {
    this.globeOff = true;
    this.canvas.style.transform = 'translateX(100vw)';
  } else {
    this.globeOff = false;
    this.canvas.style.transform = 'translateX(0)';
  }
}

The globe fully assembled as it appears on our landing page.

Designing a better world

Tectonic plates arrange continents, but countries—how we organize the globe—are defined by people. It’s the same with organizations: how we define teams determines how we operate. We’ve found that establishing how designers and engineers relate, collaborate, and organize has an outsized influence on how we build. There’s a long line of designers and developers with a mutual respect for both pixels and code. This rapport sidesteps many pitfalls of building products: from designers pressuring developers to deliver stunning visuals, to engineers diluting the vision at the eleventh hour. Blending design and engineering complicates the process, but enriches the result.

We could only properly evaluate our globe once we built a functional prototype with a sphere on the screen to examine. Modern software development is often built modularly, snapping components together until it’s ready to ship. We pledged to build the real, whole product, even in its earliest—and ugliest—stages. This enabled us to separate its functionality from its finality, focusing less on whether it worked and more on when it worked well enough for us to ship it. This released us from the temptation to make sacrifices in quality just to make the globe fully operative.

Building a fully-functional prototype early in our development process focused our highly cross-functional team; over time and through iteration, improvements unfolded gradually. Since its first incarnation in 2019, we’ve used the globe for mockups, keynotes, websites, and a small, but momentous appearance in Stripe’s Dashboard.

Measures of time are actually measures of the earth’s rotation: sixty seconds of rotation per minute, and sixty minutes of rotation per hour. As our product expands to cover the surface of the globe, we’ll keep smoothing the rough edges, connecting dots in distant countries, and working to keep the world spinning at 60 frames per second.



Similarity clustering to catch fraud rings

Andrew Tausz on February 20, 2020 in Engineering

Stripe enables businesses in many countries worldwide to onboard easily so they can accept payments as quickly as possible. Stripe’s scale makes our platform a common target for payments fraud and cybercrime, so we’ve built a deep understanding of the patterns bad actors use. We take these threats seriously because they harm both our users and our ecosystem; every fraudulent transaction we prevent keeps someone from having a bad day.

We provide our risk analysts with automated tools to make informed decisions while sifting legitimate users from potentially fraudulent accounts. One of the most useful tools we’ve developed uses machine learning to identify similar clusters of accounts created by fraudsters trying to scale their operations. Many of these attempts are easy to detect, and we can reverse engineer the fingerprints they leave behind to shut them down in real time. In turn, this allows our analysts to spend more time on sophisticated cases that have the potential to do more harm to our users.

Fraud in the payments ecosystem

Fraud can generally be separated into two large categories: transaction fraud and merchant fraud. Transaction fraud applies to individual charges (such as those protected by Radar), where a fraudster may purchase items with a stolen credit card to resell later.

Merchant fraud occurs when someone signs up for a Stripe account to later defraud cardholders. For example, a fraudster may attempt to use stolen card numbers through their account, so they’ll try to provide a valid website, account activity, and charge activity to appear legitimate. The fraudster hopes to be paid out to their bank account before Stripe finds out. Eventually, the actual cardholders will request a chargeback from their bank for the unauthorized transaction. Stripe will reimburse chargebacks to issuing banks (and by proxy, the cardholder) and attempt to debit the fraudster’s account. However, if they have already been paid out then it may be too late to recover those funds and Stripe ultimately covers those costs as fraud losses.

Fraudsters also may attempt to defraud Stripe at a larger scale by setting up a predatory or scam business. For example, the fraudster will create a Stripe account, claiming to sell expensive apparel or electronics for low prices. Unsuspecting customers think they are getting a great deal, but they never receive the product they ordered. Once again, the fraudster hopes to be paid out before they are shut down or overwhelmed with chargebacks.

Using similarity information to reduce fraud

Fraudsters tend to create Stripe accounts with reused information and attributes. Typically, low-effort fraudsters will not try to hide links to previous accounts, and this activity can be detected immediately at signup. More sophisticated fraudsters will put more work into hiding their tracks in order to prevent any association with prior fraud attempts. Some attributes like name or date of birth are trivial to fabricate, whereas others are more difficult—for example, it requires significant effort to obtain a new bank account.

Linking accounts together via shared attributes is reasonably effective at catching obvious fraud attempts, but we wanted to move from a system based on heuristics to one powered by machine learning models. While heuristics may be effective in certain cases, machine learning models are significantly more effective at learning predictive rules.

Suppose a pair of accounts is assigned a similarity score based on the number of attributes they share. This similarity score could then help predict future behavior: if an account looks similar to a known fraudulent account, it is significantly more likely to be fraudulent itself. The challenge is to accurately quantify similarity. For example, two accounts that share a date of birth should have a lower similarity score than two accounts that share a bank account.
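To make the idea concrete, here is a toy version of such a heuristic score. The attributes and weights are invented for illustration, not Stripe’s actual scoring; the point is only that harder-to-fabricate attributes should count for more:

```javascript
// Illustrative heuristic: score a pair of accounts by summing a
// weight for each shared attribute. Weights reflect how costly the
// attribute is to fabricate (these numbers are made up).
const WEIGHTS = { bankAccount: 10, emailDomain: 3, dateOfBirth: 1 };

function similarityScore(a, b) {
  let score = 0;
  for (const [attribute, weight] of Object.entries(WEIGHTS)) {
    if (a[attribute] && a[attribute] === b[attribute]) score += weight;
  }
  return score;
}

const acct1 = { bankAccount: 'b-123', dateOfBirth: '1990-01-01' };
const acct2 = { bankAccount: 'b-123', dateOfBirth: '1985-06-15' };
const acct3 = { dateOfBirth: '1990-01-01' };

// Sharing a bank account outweighs sharing a date of birth
console.log(similarityScore(acct1, acct2)); // 10
console.log(similarityScore(acct1, acct3)); // 1
```

A trained model replaces these hand-picked weights with ones learned from labeled data, which is exactly the move described next.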

By training a machine learning model, we remove the need for guesswork and hand-constructed heuristics. Now, we can automatically retrain the model over time as we obtain more data. Automatic retraining enables our models to continually improve in accuracy, adapt to new fraud trends, and learn the signatures of particular adversarial groups.

Choosing a clustering approach

Machine learning tasks are generally classified as either supervised or unsupervised. The goal of supervised learning is to make predictions given an existing dataset of labeled examples (for example, a label that indicates whether an account is fraudulent), whereas in unsupervised learning the usual goal is to learn a generative model for the raw data (in other words, to understand the underlying structure of the data). Traditionally, clustering tasks fall into the class of unsupervised learning: unlabeled data needs to be grouped into clusters that capture some understanding of similarity or likeness.

Fortunately, we’re able to use supervised models, which are generally easier to train and may be more accurate. We already have a large body of data demonstrating whether a given account has been created by a fraudster based on the downstream impact (e.g. we observe a significant number of chargebacks and fraud losses). This allows us to confidently label millions of legitimate and illegitimate businesses from our dataset.

In particular, our approach is an example of similarity learning where the objective is to learn a symmetric function based on training data. Over the years, our risk underwriting teams have manually compiled many examples of existing clusters of fraudulent accounts through our investigations of fraud rings, and we can use these reference clusters as training data to learn our similarity function. By sampling edges from these groups, we obtain a dataset consisting of pairs of accounts along with a label for each pair indicating whether or not the two accounts belong to the same cluster. We use intra-cluster edges as positive training examples and inter-cluster edges as negative training examples, where an edge denotes a pair of accounts.
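The edge-sampling step above can be sketched as follows (the cluster contents are invented for illustration):

```javascript
// Build a labeled training set from known clusters of accounts:
// intra-cluster pairs are positive examples (label 1); pairs drawn
// across different clusters are negative examples (label 0).
function buildEdgeDataset(clusters) {
  const edges = [];

  // Positive edges: all pairs inside each cluster
  for (const cluster of clusters) {
    for (let i = 0; i < cluster.length; i++) {
      for (let j = i + 1; j < cluster.length; j++) {
        edges.push({ pair: [cluster[i], cluster[j]], label: 1 });
      }
    }
  }

  // Negative edges: pairs sampled across different clusters
  // (one representative pair per cluster pairing, for brevity)
  for (let a = 0; a < clusters.length; a++) {
    for (let b = a + 1; b < clusters.length; b++) {
      edges.push({ pair: [clusters[a][0], clusters[b][0]], label: 0 });
    }
  }
  return edges;
}

const dataset = buildEdgeDataset([
  ['acct_1', 'acct_2', 'acct_3'], // one known fraud ring
  ['acct_4', 'acct_5'],           // another
]);
console.log(dataset.length); // 4 intra-cluster + 1 inter-cluster = 5 edges
```

In practice the negative edges would be sampled more carefully (random pairs, hard negatives), but the label assignment is the same.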

Clusters of accounts used to train predictive models

We use known clusters of accounts to train our predictive models.

Now that we have the labels specified, we must decide what features to use for our model. We want to convert pairs of Stripe accounts into useful model inputs that have predictive power. The feature generation process takes two Stripe accounts and produces a number of features that are defined on the pair. Due to the rich nature of Stripe accounts and their associated data, we can construct an extensive set of features for any given pair. Some examples of the features we’d include are categorical features that store the values of common attributes such as the account’s email domain, any overlap in card numbers used on both accounts, and measures of text similarity.
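For instance, a text-similarity feature could be as simple as Jaccard similarity over name tokens (an illustrative choice; the post doesn’t specify the exact measure used):

```javascript
// Jaccard similarity between two business names, tokenized by word:
// |intersection| / |union| of the two token sets.
function jaccardSimilarity(a, b) {
  const tokensA = new Set(a.toLowerCase().split(/\s+/));
  const tokensB = new Set(b.toLowerCase().split(/\s+/));
  const intersection = [...tokensA].filter((t) => tokensB.has(t));
  const union = new Set([...tokensA, ...tokensB]);
  return intersection.length / union.size;
}

console.log(jaccardSimilarity('Acme Retail LLC', 'Acme Retail Inc')); // 0.5
console.log(jaccardSimilarity('Acme Retail', 'Globex Widgets')); // 0
```

Each such feature becomes one column in the pairwise training examples fed to the model.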

Using gradient-boosted decision trees

Because of the wide variety of features we can construct from given pairs of accounts, we decided to use gradient-boosted decision trees (GBDTs) to represent our similarity model. In practice, we’ve found GBDTs strike the right balance between being easy to train, having strong predictive power and being robust despite variations in the data. When we started this project we wanted to get something out the door quickly that was effective, had well-understood properties, and was straightforward to fine-tune. The variant that we use, XGBoost, is one of the best performing off-the-shelf models for cases with structured (also known as tabular) data, and we have well-developed infrastructure to train and serve them. You can read more about the infrastructure we use to train machine learning models at Stripe in a previous post.

Now that we have a trained model, we can use it to predict fraudulent activity. Since this model operates on pairs of Stripe accounts, it’s not feasible to feed it all possible pairs of accounts and compute scores across all pairs. Instead, we first generate a candidate set of edges to be scored. We do this by taking recently created Stripe accounts and creating edges between accounts that share certain attributes. Although this isn’t an exhaustive approach, this heuristic works well in practice to prune the set of candidate edges to a reasonable number.

Once the candidate edges are scored, we then filter edges by selecting those with a similarity score above some threshold. We then compute the connected components on the resulting graph. The final output is a set of high-fidelity account clusters which we can analyze, process, or manually inspect together as a unit. In particular, a fraud analyst may want to examine clusters which contain known fraudulent accounts and investigate the remaining accounts in that cluster.
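The thresholding and connected-components pass can be sketched with a small union-find (account IDs and scores invented for illustration):

```javascript
// Threshold scored edges, then find connected components with
// union-find: each component is a candidate cluster of accounts.
function findClusters(edges, threshold) {
  const parent = new Map();
  const find = (x) => {
    if (!parent.has(x)) parent.set(x, x);
    while (parent.get(x) !== x) x = parent.get(x);
    return x;
  };
  const union = (a, b) => parent.set(find(a), find(b));

  for (const { pair: [a, b], score } of edges) {
    if (score >= threshold) union(a, b);
    else { find(a); find(b); } // register nodes without linking them
  }

  // Group accounts by their component root
  const clusters = new Map();
  for (const node of parent.keys()) {
    const root = find(node);
    if (!clusters.has(root)) clusters.set(root, []);
    clusters.get(root).push(node);
  }
  return [...clusters.values()];
}

const clusters = findClusters(
  [
    { pair: ['acct_1', 'acct_2'], score: 0.92 },
    { pair: ['acct_2', 'acct_3'], score: 0.88 },
    { pair: ['acct_3', 'acct_4'], score: 0.12 }, // below threshold
  ],
  0.8,
);
console.log(clusters); // [['acct_1', 'acct_2', 'acct_3'], ['acct_4']]
```

The two high-scoring edges merge three accounts into one component, while the low-scoring edge leaves the fourth account isolated.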

This is an iterative process: as each individual cluster grows, we can quickly identify increasing similarity as fake accounts in a fraudster’s operation are created. And the more fraud rings we detect and shut down at Stripe, the more accurate our clustering model becomes at identifying new clusters in the future.

Connected edges weighted with similarity scores

Each edge is weighted by a similarity score; we identify clusters by finding connected components in the resulting graph.

Benefits of the clustering system

So far, we’ve discussed the overall structure of the account clustering system. Although we have other models and systems in place to catch fraudulent accounts, using clustering information has the following advantages:

  • We’re even better at catching obvious fraud. It’s difficult for fraudsters to completely separate new accounts from accounts they’ve created in the past, or from accounts created by other fraudsters. Whether the link is basic attribute reuse or a more complex measure of similarity, the account clustering system catches and blocks hundreds of fraudulent accounts weekly with very few false positives.
  • Fraudsters can only use their resources once. Whenever someone decides to defraud Stripe, they need to invest in resources such as stolen IDs and bank accounts, each of which incur monetary cost or inconvenience. In effect, by requiring fraudsters to use a new set of resources every time they create a Stripe account, we slow them down and increase the cost of defrauding Stripe. Clustering is a key tool since it invalidates resources such as bank accounts that have been previously used on fraudulent accounts.
  • Our risk analysts conduct more efficient reviews. When accounts require manual inspection by an analyst, they spend time trying to understand the intentions and motivations of the person behind the account. Analysts focus on the details of the business to sift legitimate users from a set of identified potentially fraudulent accounts. With the help of our clustering technique, analysts can easily identify common patterns and outliers and apply the same judgments to multiple accounts at once with a smaller likelihood of error.
  • Account clusters are a building block for other systems. Understanding whether two accounts are duplicates or measuring their degree of similarity is a useful primitive that extends beyond the use cases described here. For example, we use the similarity model to expand our training sets for models which have sparse training data.

Catching fraud in action

Stripe faces a multitude of threats from fraudsters who attempt to steal money in creative and complex ways. Identifying similarities between accounts and clustering them together improves our ability to block fraudulent accounts and makes life harder for fraudsters attempting to create duplicates. One goal of our models is to change the economics of fraud by raising the cost of the bank accounts, IP addresses, devices, and other tools fraudsters use. This leads to a negative expected value for fraudsters, weakens the underlying supply chain for stolen credentials and user data, and disincentivizes committing fraud at scale.

We often think about fraud as an adversarial game; uncovering fraudulent clusters allows us to tip the game in our favor. Using common tools like XGBoost enabled us to quickly deploy a solution that naturally fit into our machine learning platform and allows us to easily adapt our approach over time. We’re continuing to explore new techniques to catch fraud to ensure Stripe can reliably operate a low-friction global payment network for millions of businesses.

Like this post? Join the Stripe engineering team. View openings

February 20, 2020

Designing accessible color systems

Daryl Koopersmith and Wilson Miner on October 15, 2019 in Engineering

Color contrast is an important aspect of accessibility. Good contrast makes it easier for people with visual impairments to use products, and helps in imperfect conditions like low-light environments or older screens. With this in mind, we recently updated the colors in our user interfaces to be more accessible. Text and icon colors now reliably have legible contrast throughout the Stripe Dashboard and all other products built with our internal interface library.

Achieving the right contrast with color is challenging, especially because color is incredibly subjective and has a big effect on the aesthetics of a product. We wanted to create a color system with hand-picked, vibrant colors that also met standards for accessibility and contrast.

When we evaluated external tools to improve color contrast and legibility in our products, we noticed two common approaches to tackling the problem:

  1. Hand-pick colors and check their contrast against a standard. Our experience told us that this approach made choosing colors too dependent on trial and error.
  2. Generate lighter and darker tints from a set of base colors. Unfortunately, simply darkening or lightening can result in dull or muted colors, which can be difficult to distinguish from each other and often just don’t look good.

With the existing tools we found, it was hard to create a color system that allowed us to pick great colors while ensuring accessibility. We decided to create a new tool that uses perceptual color models to give real-time feedback about accessibility. This enabled us to quickly create a color scheme that met our needs, and gave us something we could iterate on in the future.


The colors we use in our product interfaces are based on our brand color palette. Using these colors in our products allows us to bring some of the character of Stripe’s brand into our interfaces.

Grid of colors organized by hue, grouped in vertical columns, and stacked by relative lightness within each hue stack from top to bottom.

Unfortunately, it was difficult to meet (and maintain) contrast guidelines with these colors. The web accessibility guidelines suggest a minimum contrast ratio of 4.5 for small text, and 3.0 for large text. When we audited color usage in our products, we discovered that none of the default text colors we were using for small text (except for black) met the contrast threshold.
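The contrast ratio the guidelines reference is computable directly from two sRGB colors. A sketch of the WCAG 2.0 formula (the color values here are just examples):

```ruby
# WCAG 2.0 contrast ratio: (L1 + 0.05) / (L2 + 0.05), where L1 and L2
# are the relative luminances of the lighter and darker color.
# Colors are [r, g, b] arrays with 0-255 channels.
def relative_luminance(rgb)
  r, g, b = rgb.map do |c|
    c /= 255.0
    c <= 0.03928 ? c / 12.92 : ((c + 0.055) / 1.055)**2.4
  end
  0.2126 * r + 0.7152 * g + 0.0722 * b
end

def contrast_ratio(color_a, color_b)
  lighter, darker = [relative_luminance(color_a), relative_luminance(color_b)].sort.reverse
  (lighter + 0.05) / (darker + 0.05)
end

contrast_ratio([0, 0, 0], [255, 255, 255])     # black on white: 21.0
contrast_ratio([255, 255, 0], [255, 255, 255]) # yellow on white: ~1.07, far below 4.5
```

Note how weak full-brightness yellow is against white; this is exactly the kind of color that failed our audit.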

List of nine colors annotated with contrast values against white, which range from 2.3 for yellow to 4.3 for gray. Only six of the nine colors pass the recommended contrast value for icons (3.0), and none of them pass the recommended threshold for text (4.5).

Choosing accessible color combinations required each individual designer or engineer to understand the guidelines and select color pairs with enough contrast in each situation. With certain combinations of colors, options were limited and the accessible color combinations just didn’t look good.

When we first looked at ways to improve text contrast in our products, we initially explored shifting the default colors for text one step darker on our scale, illustrated by the left column below.

The same list of nine colors repeated in two columns, each darker than the colors in the previous image. The contrast ratios for the first column range from 3.2 to 6.4, and the ratios for the second column range from 4.8 to 8.9.

Unfortunately, some of our colors still didn’t have sufficient contrast at the next darkest shade. Once we got to a shade with sufficient contrast on our existing scales (the right column), we lost a lot of the brightness and vibrancy of our colors. The colors pass guidelines on a white background, but they’re dark and muddy and it’s difficult to tell the hues apart.

Without digging deeper, it would be easy to accept the tradeoff that you must choose between accessible colors and colors that look good. To get both, we needed to rework our color system from the ground up.

We wanted to design a new color system that would provide three key benefits out of the box:

  1. Predictable accessibility: Colors have enough contrast to pass accessibility guidelines.
  2. Clear, vibrant hues: Users can easily distinguish colors from one another.
  3. Consistent visual weight: At each level, no single color appears to take priority over another.

A brief interlude on color spaces

To explain how we got there, we need to get a little nerdy about color.

We’re used to working with color on screens in terms of the RGB color space. Colors are specified in terms of how much red, green, and blue light is mixed on screen to make the color.

Two rows of nine squares. The first row shows three stripes of red, green, and blue at different levels of brightness. The second row shows the color produced by mixing the RGB values above.

Unfortunately, while describing colors this way comes naturally to computers, it doesn’t come naturally to humans. Given an RGB color value, what needs to change to make it lighter? More colorful? Add more yellow?

It’s more intuitive for us to think of colors as organized by three attributes:

  • Hue: What color is it?
  • Chroma: How colorful is it?
  • Lightness: How bright is it?

Three horizontal bars showing the range of hue, saturation, and lightness values. The first bar shows a range of different hues in a rainbow gradient. The second bar shows a gradient from a muted gray gradually transitioning to a bright blue. The third bar shows a gradient gradually transitioning from black to bright blue to white.

A popular color space that supports specifying colors in this way is HSL. It’s well supported in design tools and popular code libraries for color manipulation. There’s just one problem: the way HSL calculates lightness is flawed. What most color spaces don’t take into account is that different hues are inherently perceived as different levels of lightness by the human eye—at the same level of mathematical lightness, yellow appears lighter than blue.

The image below is a set of colors with the same lightness and saturation in a display color space. While the color space claims the saturation and lightness are all the same, our eyes disagree. Notice that some of these colors appear lighter or more saturated than others. For example, the blues appear especially dark and the yellows and greens appear especially light.

A bar labeled 'display color space' shows 36 stripes of color arranged horizontally by hue. Each individual stripe is clearly distinguishable from the next because they appear lighter or darker at different hues.

There are color spaces which attempt to model human perception of color. Perceptually uniform color spaces model colors based on factors that relate more to human vision, and perform sophisticated color transformations to ensure that these dimensions reflect how human vision works.

A bar labeled 'perceptually uniform color space' showing 36 stripes of color arranged horizontally by hue. Each individual stripe appears to blend seamlessly into the next because all hues appear to have the same lightness.

When we take a sample of colors with the same lightness and saturation in a perceptually uniform color space, we can observe a significant difference. These colors appear to blend together, and each color appears to be just as light and as saturated as the rest. This is perceptual uniformity at work.

There are surprisingly few tools that support perceptually uniform color models, and none that came close to helping us design a color palette. So we built our own.

Visualizing color

We built a web interface to allow us to visualize and manipulate our color system using perceptually uniform color models. The tool gave us an immediate feedback loop while we were iterating on our colors—we could see the effect of every change.

The color space illustrated above is known as CIELAB or, affectionately, Lab. The L in Lab stands for lightness, but unlike the lightness in HSL, it’s designed to be perceptually uniform. By translating our color scales into the Lab color space, we can adjust our colors based on their perceptual contrast and visually compare the results.
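For concreteness, the perceptual lightness (L*) of an sRGB color can be computed as follows. This reproduces the standard CIELAB formulas against a D65 white point; it is a sketch of the math, not Stripe's internal tool:

```ruby
# CIELAB L* (perceptual lightness) of an sRGB color: linearize the
# channels, compute CIE Y relative to D65 white, then apply the L* curve.
def cielab_lightness(rgb)
  r, g, b = rgb.map do |c|
    c /= 255.0
    c <= 0.04045 ? c / 12.92 : ((c + 0.055) / 1.055)**2.4
  end
  y = 0.2126 * r + 0.7152 * g + 0.0722 * b
  f = y > (6.0 / 29)**3 ? y**(1.0 / 3) : y / (3 * (6.0 / 29)**2) + 4.0 / 29
  116 * f - 16
end

# Full-brightness yellow and blue have the same HSL lightness (50%),
# but very different perceptual lightness:
cielab_lightness([255, 255, 0]) # ~97: nearly as light as white
cielab_lightness([0, 0, 255])   # ~32: darker than middle gray
```

The yellow/blue gap is the phenomenon described earlier: equal mathematical lightness, wildly unequal perceived lightness.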

The diagram below shows the lightness and contrast values of our previous color palette visualized in the color tool. You can see that the perceptual lightness of each of our colors follows a different curve, with the yellow and green colors much lighter than the blues and purples at the same point.

Line chart with nine overlapping curves showing lightness values for different colors. Each curve has a different shape and they are all distinguishable from each other, with yellow charting a higher lightness value than any other color.

By manipulating our colors in perceptually uniform color space, we were able to produce a set of colors which have uniform contrast across all the hues, and preserve as much of the intended hue and saturation of our current colors. In the proposed colors, yellow has the same contrast range as blue, but they still look like our colors.

In the diagram below, you can see the perceptual lightness for each color follows the same curve, meaning each color (the labels on the left) has the same contrast value at a given level (the number labels on the top).

Line chart showing a single blue curve following a downward slope of lightness values from left to right.

Line chart showing a single yellow curve following an identical slope of lightness values as the blue line above.

Our new tool also showed us what was possible. Visualizing a perceptually uniform color model allowed us to see the constraints of visual perception. The shaded areas in the charts represent so-called imaginary colors which aren’t actually reproducible or perceivable. It turns out “really dark yellow” isn’t actually a thing.

Most tools for mixing colors let you set values across the full range of each parameter, then simply clip out-of-range results or return the nearest fit: colors that don’t actually represent the parameters you set. Visualizing the available color space in real time as we made changes allowed us to iterate much faster, because we could tell which changes were possible and which moved us closer to our goal: bright, differentiated colors that met the appropriate contrast guidelines.

At some points, finding a set of colors that worked together was like threading a needle. Here, the shaded areas show how limited the space is to actually find a combination of values that allows for roughly equal lightness for all hues.

Line chart showing a straight horizontal line with equivalent lightness values for each of nine colors. The background of the chart shows a shaded area with curved edges representing color values that are unavailable. The available space is very narrow for some colors.


After a lot of iterations and tests with real components and interfaces, we arrived at a palette of colors that achieved our goals: our colors predictably passed accessibility guidelines, kept their clear, vibrant hues, and maintained a consistent visual weight across hues.

Our new default colors for text and icons now pass the accessibility contrast threshold defined in the WCAG 2.0 guidelines.

List of nine colors annotated with contrast values against white, ranging from 4.5 to 4.6. All of the values pass the recommended contrast ratio for text (4.5).

List of nine colors annotated with contrast values against white, ranging from 3.0 to 3.1. All of the values pass the recommended contrast ratio for icons (3.0).

In addition to passing contrast guidelines over white backgrounds, each color also passes when displayed atop the lightest color value in any hue. Since we commonly use these lightly tinted backgrounds to offset or highlight sections, this makes it simple and predictable to ensure text has sufficient contrast throughout our products.

Because the new colors are uniformly organized based on contrast, we also have straightforward guidelines built-in for choosing appropriate contrast pairs in less common cases. Any two colors are guaranteed to have sufficient contrast for small text if they are at least five levels apart, and at least four levels apart for icons and large text.

With contrast guidelines built in to the system, it’s simple to make adjustments for color contrast in different components with predictable results.

Two rows of pill-shaped badges in different colors. The top row shows light text colors over a transparent background with a transparent color outline. The second row shows darker text colors over solid light color backgrounds. The background colors are bright and saturated, and the text colors are dark enough to have high contrast with the background.

For example, we redesigned our Badge component to use a color background to clearly differentiate each color. At the lightest possible value, the colors were too difficult to distinguish from each other. By shifting both the background and the text color up one level, we were able to maintain text contrast across all badge colors without fine tuning each color combination individually.


We learned that designing accessible color systems doesn’t have to mean fumbling around in the dark. We just needed to change how we thought about color:

Use a perceptually uniform color model
When designing an accessible color system, using a perceptually uniform color model (like CIELAB) helped us understand how each color appears to our eyes as opposed to how it appears to a computer. This allowed us to validate our intuitions and use numbers to compare the lightness and colorfulness of all of our colors.

Accessible doesn’t mean vibrant
The WCAG accessibility standard intentionally focuses only on the contrast between a foreground and a background color, not on how vibrant they appear. Understanding how vibrant each color appears helps us distinguish hues from one another.

Color is hard to reason about, tools can help
One of the pitfalls of perceptually uniform color models is that there are impossible colors—there’s no such thing as “very colorful dark yellow” or “vibrant light royal blue”. Building our own tool helped us see exactly which colors were possible and allowed us to rapidly iterate on our color palette until we produced a palette that was accessible, vibrant, and still felt like Stripe.

Additional resources
To learn more about color, we recommend the following resources:


Fast and flexible observability with canonical log lines

Brandur Leach on July 30, 2019 in Engineering

Logging is one of the oldest and most ubiquitous patterns in computing. Key to gaining insight into problems ranging from basic failures in test environments to the most tangled problems in production, it’s common practice across all software stacks and all types of infrastructure, and has been for decades.

Although logs are powerful and flexible, their sheer volume often makes it impractical to extract insight from them in an expedient way. Relevant information is spread across many individual log lines, and even with the most powerful log processing systems, searching for the right details can be slow and requires intricate query syntax.

We’ve found using a slight augmentation to traditional logging immensely useful at Stripe—an idea that we call canonical log lines. It’s quite a simple technique: in addition to their normal log traces, requests also emit one long log line at the end that includes many of their key characteristics. Having that data colocated in single information-dense lines makes queries and aggregations over it faster to write, and faster to run.

Out of all the tools and techniques we deploy to help get insight into production, canonical log lines in particular have proven to be so useful for added operational visibility and incident response that we’ve put them in almost every service we run—not only are they used in our main API, but there’s one emitted every time a webhook is sent, a credit card is tokenized by our PCI vault, or a page is loaded in the Stripe Dashboard.

Structured logging

Just like in many other places in computing, logging is used extensively in APIs and web services. In a payments API, a single request might generate a log trace that looks like this:

[2019-03-18 22:48:32.990] Request started
[2019-03-18 22:48:32.991] User authenticated
[2019-03-18 22:48:32.992] Rate limiting ran
[2019-03-18 22:48:32.998] Charge created
[2019-03-18 22:48:32.999] Request finished

Structured logging augments the practice by giving developers an easy way to annotate lines with additional data. The use of the word structured is ambiguous—it can refer to a natively structured data format like JSON, but it often means that log lines are enhanced by appending key=value pairs (sometimes called logfmt, even if not universally). The added structure makes it easy for developers to tag lines with extra information without having to awkwardly inject it into the log message itself.

An enriched form of the trace above might look like:

[2019-03-18 22:48:32.990] Request started http_method=POST http_path=/v1/charges request_id=req_123
[2019-03-18 22:48:32.991] User authenticated auth_type=api_key key_id=mk_123 user_id=usr_123
[2019-03-18 22:48:32.992] Rate limiting ran rate_allowed=true rate_quota=100 rate_remaining=99
[2019-03-18 22:48:32.998] Charge created charge_id=ch_123 permissions_used=account_write team=acquiring
[2019-03-18 22:48:32.999] Request finished alloc_count=9123 database_queries=34 duration=0.009 http_status=200

The added structure also makes the emitted logs machine readable (the key=value convention is designed to be a compromise between machine and human readability), which makes them ingestible for a number of different log processing tools, many of which provide the ability to query production logs in near real-time.
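On the emitting side, structured logging can be as simple as appending sorted key=value pairs to the message. A minimal sketch; the helper name is illustrative and the field names mirror the trace above:

```ruby
# A minimal logfmt-style emitter in the spirit of the traces above;
# log_with_fields is an illustrative sketch, not Stripe's logger.
def log_with_fields(message, fields = {})
  pairs = fields.sort.map { |key, value| "#{key}=#{value}" }.join(" ")
  line = "[#{Time.now.utc.strftime('%Y-%m-%d %H:%M:%S.%L')}] #{message} #{pairs}"
  puts line
  line
end

log_with_fields("Request started",
                http_method: "POST", http_path: "/v1/charges", request_id: "req_123")
```

Sorting the keys keeps lines stable across runs, which makes grepping and diffing them easier.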

For example, we might want to know what the last requested API endpoints were. We could figure that out using a log processing system like Splunk and its built-in query language:

“Request started” | head

Or whether any API requests have recently been rate limited:

“Rate limiting ran” rate_allowed=false

Or gather statistics on API execution duration over the last hour:

“Request finished” earliest=-1h | stats count p50(duration) p95(duration) p99(duration)

In practice, it would be much more common to gather these sorts of simplistic vitals from dashboards generated from metrics systems like Graphite and statsd, but they have limitations. The emitted metrics and dashboards that interpret them are designed in advance, and in a pinch they’re often difficult to query in creative or unexpected ways. Where logging really shines in comparison to these systems is flexibility.

Logs are usually overproducing data to the extent that it’s possible to procure just about anything from them, even information that no one thought they’d need. For example, we could check to see which API path is the most popular:

“Request started” | stats count by http_path

Or let’s say we see that the API is producing 500s (internal server errors). We could check the request duration on the errors to get a good feel as to whether they’re likely caused by database timeouts:

“Request finished” http_status=500 | stats count p50(duration) p95(duration) p99(duration)

Sophisticated log processing systems tend to also support visualizing information in much the same way as a metrics dashboard, so instead of reading through raw log traces we can have our system graph the results of our ad-hoc queries. Visualizations are more intuitive to interpret, and can make it much easier for us to understand what’s going on.

Canonical log lines: one line per request per service

Although logs offer additional flexibility in the examples above, we’re still left in a difficult situation if we want to query information across the lines in a trace. For example, if we notice there’s a lot of rate limiting occurring in the API, we might ask ourselves the question, “Which users are being rate limited the most?” Knowing the answer helps differentiate between legitimate rate limiting because users are making too many requests, and accidental rate limiting that might occur because of a bug in our system.

The information on whether a request was rate limited and which user performed it is spread across multiple log lines, which makes it harder to query. Most log processing systems can still do so by collating a trace’s data on something like a request ID and querying the result, but that involves scanning a lot of data, and it’s slower to run. It also requires more complex syntax that’s harder for a human to remember, and is more time consuming for them to write.

We use canonical log lines to help address this. They’re a simple idea: in addition to their normal log traces, requests (or some other unit of work that’s executing) also emit one long log line at the end that pulls all its key telemetry into one place. They look something like this:

[2019-03-18 22:48:32.999] canonical-log-line alloc_count=9123 auth_type=api_key database_queries=34 duration=0.009 http_method=POST http_path=/v1/charges http_status=200 key_id=mk_123 permissions_used=account_write rate_allowed=true rate_quota=100 rate_remaining=99 request_id=req_123 team=acquiring user_id=usr_123

This sample shows the kind of information a canonical line might contain, including:

  • The HTTP request verb, path, and response status.
  • The authenticated user and related information like how they authenticated (API key, password) and the ID of the API key they used.
  • Whether rate limiters allowed the request, and statistics like their allotted quota and what portion remains.
  • Timing information like the total request duration, and time spent in database queries.
  • The number of database queries issued and the number of objects allocated by the VM.

We call the log line canonical because it’s the authoritative line for a particular request, in the same vein that the IETF’s canonical link relation specifies an authoritative URL.

Canonical lines are an ergonomic feature. By colocating everything that’s important to us, we make it accessible through queries that are easy for people to write, even under the duress of a production incident. Because the underlying logging system doesn’t need to piece together multiple log lines at query time they’re also cheap for computers to retrieve and aggregate, which makes them fast to use. The wide variety of information being logged provides almost limitless flexibility in what can be queried. This is especially valuable during the discovery phase of an incident where it’s understood that something’s wrong, but it’s still a mystery as to what.

Getting insight into our rate limiting problem above becomes as simple as:

canonical-log-line rate_allowed=false | stats count by user_id

If only one or a few users are being rate limited, it’s probably legitimate rate limiting because they’re making too many requests. If it’s many distinct users, there’s a good chance that we have a bug.
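The same aggregation can be reproduced outside a log processor by parsing logfmt pairs directly; the sample lines below are fabricated for illustration:

```ruby
# Reproducing the "count rate-limited requests by user" query over raw
# logfmt lines; the sample lines are fabricated for illustration.
SAMPLE_LINES = [
  "canonical-log-line rate_allowed=false user_id=usr_1",
  "canonical-log-line rate_allowed=false user_id=usr_1",
  "canonical-log-line rate_allowed=true user_id=usr_2",
  "canonical-log-line rate_allowed=false user_id=usr_3",
]

# Parse key=value pairs out of a line into a hash.
def parse_logfmt(line)
  line.scan(/(\w+)=(\S+)/).to_h
end

counts = SAMPLE_LINES.map { |line| parse_logfmt(line) }
                     .select { |fields| fields["rate_allowed"] == "false" }
                     .group_by { |fields| fields["user_id"] }
                     .transform_values(&:count)
# counts => {"usr_1"=>2, "usr_3"=>1}
```

Because every line carries the full set of fields, this stays a flat map-filter-group pipeline with no cross-line joins.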

As a slightly more complex example, we could visualize the performance of the charges endpoint for a particular user over time while also making sure to filter out 4xx errors caused by the user. 4xx errors tend to short circuit quickly, and therefore don’t tell us anything meaningful about the endpoint’s normal performance characteristics. The query to do so might look something like this:

canonical-log-line user=usr_123 http_method=POST http_path=/v1/charges http_status!=4* | timechart p50(duration) p95(duration) p99(duration)
API request durations

API request durations at the 50th, 95th, and 99th percentiles: generated on-the-fly from log data.

Implementation in middleware and beyond

Logging is such a pervasive technique and canonical log lines are a simple enough idea that implementing them tends to be straightforward regardless of the tech stack in use.

The implementation in Stripe’s main API takes the form of a middleware with a post-request step that generates the log line. Modules that execute during the lifecycle of the request decorate the request’s environment with information intended for the canonical log line, which the middleware will extract when it finishes.

Here’s a greatly simplified version of what that looks like:

class CanonicalLineLogger
  def initialize(app)
    @app = app
  end

  def call(env)
    # Call into the core application and inner middleware
    status, headers, body = @app.call(env)

    # Emit the canonical line using response status and other
    # information embedded in the request environment
    log_canonical_line(status, env)

    # Return results upstream
    [status, headers, body]
  end
end

Over the years our implementation has been hardened to maximize the chance that canonical log lines are emitted for every request, even if an internal failure or other unexpected condition occurs. The line is logged in a Ruby ensure block in case the middleware stack is being unwound because an exception was thrown from somewhere below. The logging statement itself is wrapped in its own begin/rescue block so that any problem constructing a canonical line never fails a request, and so that someone is notified immediately if one occurs. Canonical lines are such an important tool for us during incident response that it’s crucial any problems with them are fixed promptly—not having them would be like flying blind.
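The log_canonical_line helper itself isn't shown here; a minimal sketch, assuming request modules accumulate their fields in the Rack env as the request executes (the env key and this shape are illustrative, not Stripe's actual implementation):

```ruby
# Hypothetical sketch of the log_canonical_line helper the middleware
# calls. It assumes modules stashed their telemetry under
# env["canonical_line_fields"]; that key is an illustrative assumption.
def log_canonical_line(status, env, io = $stdout)
  fields = (env["canonical_line_fields"] || {}).merge("http_status" => status)
  pairs = fields.sort.map { |key, value| "#{key}=#{value}" }.join(" ")
  io.puts "[#{Time.now.utc.strftime('%Y-%m-%d %H:%M:%S.%L')}] canonical-log-line #{pairs}"
rescue => e
  # Never fail the request because logging failed; report and move on.
  warn "canonical line emission failed: #{e.class}: #{e.message}"
end
```

The method-level rescue mirrors the hardening described above: emission problems are reported, never propagated to the caller.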

Warehousing history

A problem with log data is that it tends to be verbose. This means long-term retention in anything but cold storage is expensive, especially considering that the chances it’ll be used again are low. Along with being useful in an operational sense, the succinctness of canonical log lines also makes them a convenient medium for archiving historical requests.

At Stripe, canonical log lines are used by engineers so often for introspecting production that we’ve developed muscle memory around the naming of particular fields. So for a long time we’ve made an effort to keep that naming stable—changes are inconvenient for the whole team as everyone has to relearn it. Eventually, we took it a step further and formalized the contract by codifying it with a protocol buffer.

Along with emitting canonical lines to the logging system, the API also serializes data according to that contract and sends it out asynchronously to a Kafka topic. A consumer reads the topic and accumulates the lines into batches that are stored to S3. Periodic processes ingest those into Presto archives and Redshift, which lets us easily perform long-term analytics that can look at months’ worth of data.

In practice, this lets us measure almost everything we’d ever want to. For example, here’s a graph that tracks the adoption of major Go versions over time from API requests that are issued with our official API libraries:

Go language versions over time

Go version usage measured over time. Data is aggregated from an archive of canonical log lines ingested into a data warehouse.

Better yet, because these warehousing tools are driven by SQL, engineers and non-engineers alike can aggregate and analyze the data. Here’s the source code for the query above:

SELECT
    DATE_TRUNC('week', created) AS week,
    REGEXP_SUBSTR(language_version, '\\d*\\.\\d*') AS major_minor,
    COUNT(*) AS count
FROM events.canonical_log_lines
WHERE created > CURRENT_DATE - interval '2 months'
    AND language = 'go'
GROUP BY 1, 2
ORDER BY 1, 2

Product leverage

We had already formalized the schema of our canonical log lines with a protocol buffer for analytics, so we took it a step further and started using this data to drive parts of the Stripe product itself. A year ago we introduced our Developer Dashboard, which gives users access to high-level metrics on their API integrations.

Developer Dashboard sample chart

The Developer Dashboard shows the number of successful API requests for this Stripe account. Data is generated from canonical log lines archived to S3.

The charts produced for this dashboard are generated from canonical log lines. A MapReduce backend crunches archives stored in S3 to create visualizations tailored to the specific users navigating their dashboards. As with our analytics tools, the schema codified in the protocol buffer definition ensures a stable contract, so the charts don’t break as the data evolves.

Canonical lines are still useful even if they’re never used to power products, but because they contain such a rich trove of historical data, they make an excellent primary data source for this sort of use.

Sketching a canonical logging pipeline

Canonical log lines are well-suited for practically any production environment, but let’s take a brief look at a few specific technologies that might be used to implement a full pipeline for them.

In most setups, servers log to their local disk and those logs are sent by local collector agents to a central processing system for search and analysis. The Kubernetes documentation on logging suggests the use of Elasticsearch, or when on GCP, Google’s own Stackdriver Logging. For an AWS-based stack, a conventional solution is CloudWatch. All three require an agent like fluentd to handle log transmission to them from server nodes. These solutions are common, but far from exclusive—log processing is a thriving ecosystem with dozens of options to choose from, and it’s worth setting aside some time to evaluate and choose the one that works best for you.

Emitting to a data warehouse requires a custom solution, but not one that’s unusual or particularly complex. Servers should emit canonical log data into a stream structure, and asynchronously to keep user operations fast. Kafka is far and away the preferred stream of choice these days, but it’s not particularly cheap or easy to run, so in a smaller-scale setup something like Redis streams is a fine substitute. A group of consumers cooperatively reads the stream and bulk inserts its contents into a warehouse like Redshift or BigQuery. Just like with log processors, there are many data warehousing solutions to choose from.

Flexible, lightweight observability

To recap the key elements of canonical log lines and why we find them so helpful:

  • A canonical line is one line per request per service that collates each request’s key telemetry.
  • Canonical lines are not as quick to reference as metrics, but are extremely flexible and easy to use.
  • We emit them asynchronously into Kafka topics for ingestion into our data warehouse, which is very useful for analytics.
  • The stable contract provided by canonical lines even makes them a great fit to power user-facing products! We use ours to produce the charts on Stripe’s Developer Dashboard.

They’ve proven to be a lightweight, flexible, and technology-agnostic technique for observability that’s easy to implement and very powerful. Small and large organizations alike will find them useful for getting visibility into production services, garnering insight through analytics, and even shaping their products.

Like this post? Join the Stripe engineering team. View openings

July 30, 2019

The secret life of DNS packets: investigating complex networks

Jeff Jo on May 21, 2019 in Engineering

DNS is a critical piece of infrastructure used to facilitate communication across networks. It’s often described as a phonebook: in its most basic form, DNS provides a way to look up a host’s address by an easy-to-remember name. For example, looking up a domain name directs clients to the IP address where one of Stripe’s servers is located. Before any communication can take place, one of the first things a host must do is query a DNS server for the address of the destination host. Since these lookups are a prerequisite for communication, maintaining a reliable DNS service is extremely important. DNS issues can quickly lead to crippling, widespread outages, and you could find yourself in a real bind.

It’s important to establish good observability practices for these systems so when things go wrong, you can clearly understand how they’re failing and act quickly to minimize any impact. Well-instrumented systems provide visibility into how they operate; establishing a monitoring system and gathering robust metrics are both essential to effectively respond to incidents. This is critical for post-incident analysis when you’re trying to understand the root cause and prevent recurrences in the future.

In this post, I’ll describe how we monitor our DNS systems and how we used an array of tools to investigate and fix an unexpected spike in DNS errors that we encountered recently.

DNS infrastructure at Stripe

At Stripe, we operate a cluster of DNS servers running Unbound, a popular open-source DNS resolver that can recursively resolve DNS queries and cache the results. These resolvers are configured to forward DNS queries to different upstream destinations based on the domain in the request. Queries that are used for service discovery are forwarded to our Consul cluster. Queries for domains we configure in Route 53 and any other domains on the public Internet are forwarded to our cluster’s VPC resolver, which is a DNS resolver that AWS provides as part of their VPC offering. We also run resolvers locally on every host, which provides an additional layer of caching.

Unbound runs locally on every host as well as on the DNS servers.

Unbound exposes an extensive set of statistics that we collect and feed into our metrics pipeline. This provides us with visibility into metrics like how many queries are being served, the types of queries, and cache hit ratios.

We recently observed that for several minutes every hour, the cluster’s DNS servers were returning SERVFAIL responses for a small percentage of internal requests. SERVFAIL is a generic response that DNS servers return when an error occurs, but it doesn’t tell us much about what caused the error.

Without much to go on initially, we found another clue in the request list depth metric. (You can think of this as Unbound’s internal todo list, where it keeps track of all the DNS requests it needs to resolve.)

An increase in this metric indicates that Unbound is unable to process messages in a timely fashion, which may be caused by an increase in load. However, the metrics didn’t show a significant increase in the number of DNS queries, and resource consumption didn’t appear to be hitting any limits. Since Unbound resolves queries by contacting external nameservers, another explanation could be that these upstream servers were taking longer to respond.

Tracking down the source

We followed this lead by logging into one of the DNS servers and inspecting Unbound’s request list.

$ unbound-control dump_requestlist
thread #0
#   type cl name    seconds    module status
  0    A IN - iterator wait for
  1  PTR IN - iterator wait for
  2  PTR IN - iterator wait for
  3  PTR IN - iterator wait for
  4  PTR IN - iterator wait for
  5  PTR IN - iterator wait for
  6  PTR IN - iterator wait for
  7  PTR IN - iterator wait for
  8  PTR IN - iterator wait for
  9  PTR IN - iterator wait for
 10  PTR IN - iterator wait for

This confirmed that requests were accumulating in the request list. We noticed some interesting details: most of the entries in the list corresponded to reverse DNS lookups (PTR records), and they were all waiting for a response from the IP address of the VPC resolver.

We then used tcpdump to capture the DNS traffic on one of the servers to get a better sense of what was happening and try to identify any patterns. We wanted to make sure we captured the traffic during one of these spikes, so we configured tcpdump to write data to files over a period of time. We split the files across 60 second collection intervals to keep file sizes small, which made it easier to work with them.

# Capture all traffic on port 53 (DNS traffic)
# Write data to files in 60 second intervals for 30 minutes
# and format the filenames with the current time
$ tcpdump -n -tt -i any -W 30 -G 60 -w '%FT%T.pcap' port 53

The packet captures revealed that during the hourly spike, 90% of requests made to the VPC resolver were reverse DNS queries for IPs in a single CIDR range. The vast majority of these queries failed with a SERVFAIL response. We used dig to query the VPC resolver with a few of these addresses and confirmed that responses took longer to arrive.

By looking at the source IPs of clients making the reverse DNS queries, we noticed they were all coming from hosts in our Hadoop cluster. We maintain a database of when Hadoop jobs start and finish, so we were able to correlate these times to the hourly spikes. We finally narrowed down the source of the traffic to one job that analyzes network activity logs and performs a reverse DNS lookup on the IP addresses found in those logs.

One more surprising detail we discovered in the tcpdump data was that the VPC resolver was not sending back responses to many of the queries. During one 60-second collection period the DNS server sent 257,430 packets to the VPC resolver, which replied with only 61,385 packets, an average of 1,023 packets per second. We realized we might be hitting the AWS limit on how much traffic can be sent to a VPC resolver, which is 1,024 packets per second per network interface. Our next step was to establish better visibility in our cluster to validate our hypothesis.
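A quick back-of-the-envelope check makes the hypothesis concrete: the reply rate sits right at the documented per-interface limit while the attempted rate far exceeds it. The packet counts are the ones reported above; the arithmetic is just packets divided by the 60-second window.

```python
# Compare observed tcpdump packet rates against the AWS VPC resolver limit.
sent = 257_430      # packets sent to the VPC resolver in 60 seconds
answered = 61_385   # packets the VPC resolver sent back in the same window
limit_pps = 1024    # AWS limit, packets per second per network interface

sent_pps = sent / 60
answered_pps = answered / 60

print(round(answered_pps))  # 1023 -- replies are pinned just under the limit
print(sent_pps > limit_pps)  # True -- we attempt far more than the limit allows
```

Seeing replies capped at roughly 1,023 pps while attempts run several times higher is exactly the signature of a rate limit dropping the excess.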

Counting packets

AWS exposes its VPC resolver through a static IP address: the base IP of the VPC plus two (for example, if the VPC’s base IP is 10.0.0.0, the VPC resolver will be at 10.0.0.2). We need to track the number of packets sent per second to this IP address. One tool that can help us here is iptables, since it keeps track of the number of packets matched by a rule.
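The “base address plus two” rule is easy to compute with the standard library, which is handy when deriving the resolver address from a VPC’s CIDR in tooling. The CIDRs below are generic examples, not Stripe’s network.

```python
import ipaddress

def vpc_resolver_ip(vpc_cidr: str) -> str:
    """Return the VPC resolver address (network base + 2) for a VPC CIDR."""
    network = ipaddress.ip_network(vpc_cidr)
    return str(network.network_address + 2)

print(vpc_resolver_ip("10.0.0.0/16"))    # 10.0.0.2
print(vpc_resolver_ip("172.31.0.0/16"))  # 172.31.0.2
```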

We created a rule that matches traffic headed to the VPC resolver IP address and added it to the OUTPUT chain, which is a set of iptables rules that are applied to all packets sent from the host.

# Create a new chain called VPC_RESOLVER
$ iptables -N VPC_RESOLVER

# Match packets destined to VPC resolver and jump to the new chain
$ iptables -A OUTPUT -d <vpc-resolver-ip> -j VPC_RESOLVER

# Add an empty rule to the new chain to help parse the output
$ iptables -A VPC_RESOLVER

We configured the rule to jump to a new chain called VPC_RESOLVER and added an empty rule to that chain. Since our hosts could contain other rules in the OUTPUT chain, we added this rule to isolate matches and make it a little easier to parse the output.

Listing the rules, we see the number of packets sent to the VPC resolver in the output:

$ iptables -L -v -n -x

Chain OUTPUT (policy ACCEPT 41023 packets, 2569001 bytes)
  pkts   bytes target     prot opt in     out     source               destination
 41023 2569001 VPC_RESOLVER  all  --  *      *  

Chain VPC_RESOLVER (1 references)
  pkts   bytes target     prot opt in     out     source               destination
 41023 2569001            all  --  *      *  

With this, we wrote a simple service that reads the statistics from the VPC_RESOLVER chain and reports this value through our metrics pipeline.

while :; do
  PACKET_COUNT=$(iptables -L VPC_RESOLVER 1 -x -n -v | awk '{ print $1 }')
  report-metric $PACKET_COUNT "vpc_resolver.packet_count"
  sleep 1
done

Once we started collecting this metric, we could see that the hourly spikes in SERVFAIL responses lined up with periods where the servers were sending too much traffic to the VPC resolver.

Traffic amplification

The data we saw from iptables (the number of packets per second sent to the VPC resolver) indicated a significant increase in traffic to the VPC resolvers during these periods, and we wanted to better understand what was happening. Taking a closer look at the shape of the traffic coming into the DNS servers from the Hadoop job, we noticed the clients were sending the request five times for every failed reverse lookup. Since the reverse lookups were taking so long or being dropped at the server, the local caching resolver on each host was timing out and continually retrying the requests. On top of this, the DNS servers were also retrying requests, leading to request volume amplifying by an average of 7x.

Spreading the load

One thing to remember is that the VPC resolver limit is imposed per network interface. Instead of performing the reverse lookups solely on our DNS servers, we could instead distribute the load and have each host contact the VPC resolver independently. With Unbound running on each host we can easily control this behavior. Unbound allows you to specify different forwarding rules per DNS zone. Reverse queries use the special in-addr.arpa domain, so configuring this behavior was a matter of adding a rule that forwards requests for this zone to the VPC resolver.
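As an illustrative sketch (the resolver address here is a placeholder, not Stripe’s actual configuration), such a forwarding rule in an Unbound configuration looks like:

```
forward-zone:
    name: "in-addr.arpa."
    # Example VPC resolver address (VPC base IP plus two)
    forward-addr: 10.0.0.2
```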

We knew that reverse lookups for private addresses stored in Route 53 would likely return faster than reverse lookups for public IPs, which require communication with an external nameserver. So we decided to create two forwarding configurations: one for resolving private addresses and one for all other reverse queries. Both rules were configured to send requests to the VPC resolver. Unbound calculates retry timeouts based on a smoothed average of historical round-trip times to upstream servers, and it maintains separate calculations per forwarding rule. Even if two rules share the same upstream destination, the retry timeouts are computed independently, which helps isolate the impact of inconsistent query performance on timeout calculations.

After applying the forwarding configuration change to the local Unbound resolvers on the Hadoop nodes we saw that the hourly load spike to the VPC resolvers had gone away, eliminating the surge of SERVFAILS we were seeing:

Adding the VPC resolver packet rate metric gives us a more complete picture of what’s going on in our DNS infrastructure. It alerts us if we approach any resource limits and points us in the right direction when systems are unhealthy. Some other improvements we’re considering include collecting a rolling tcpdump of DNS traffic and periodically logging the output of some of Unbound’s debugging commands, such as the contents of the request list.

Visibility into complex systems

When operating a critical piece of infrastructure like DNS, it’s crucial to understand the health of the various components of the system. The metrics and command-line tools that Unbound provides give us great visibility into one of the core components of our DNS systems. As we saw in this scenario, these types of investigations often uncover areas where monitoring can be improved, and it’s important to address these gaps to better prepare for incident response. Gathering data from multiple sources allows you to see what’s going on in the system from different angles, which can help you narrow in on the root cause during an investigation. This information will also show whether the remediations you put in place have the intended effect. As these systems grow to handle more scale and increase in complexity, how you monitor them must also evolve, both to understand how different components interact with each other and to build confidence that your systems are operating effectively.


Railyard: how we rapidly train machine learning models with Kubernetes

Rob Story on May 7, 2019 in Engineering

Stripe uses machine learning to respond to our users’ complex, real-world problems. Machine learning powers Radar to block fraud, and Billing to retry failed charges on the network. Stripe serves millions of businesses around the world, and our machine learning infrastructure scores hundreds of millions of predictions across many machine learning models. These models are powered by billions of data points, with hundreds of new models being trained each day. Over time, the volume, quality of data, and number of signals have grown enormously as our models continuously improve in performance.

Running infrastructure at this scale poses a very practical data science and ML problem: how do we give every team the tools they need to train their models without requiring them to operate their own infrastructure? Our teams also need a stable and fast ML pipeline to continuously update and train new models as they respond to a rapidly changing world. To solve this, we built Railyard, an API and job manager for training these models in a scalable and maintainable way. It’s powered by Kubernetes, a platform we’ve been working with since late 2017. Railyard enables our teams to independently train their models on a daily basis with a centrally managed ML service.

In many ways, we’ve built Railyard to mirror our approach to products for Stripe’s users: we want teams to focus on their core work training and developing machine learning models rather than operating infrastructure. In this post, we’ll discuss Railyard and best practices for operating machine learning infrastructure we’ve discovered while building this system.

Effective machine learning infrastructure for organizations

We’ve been running Railyard in production for a year and a half, and our ML teams have converged on it as their common training environment. After training tens of thousands of models on this architecture over that period, here are our biggest takeaways:

  • Build a generic API, not tied to any single machine learning framework. Teams have extended Railyard in ways we did not anticipate. We first focused on classifiers, but teams have since adopted the system for applications such as time series forecasting and word2vec style embeddings.
  • A fully managed Kubernetes cluster reduces operational burden across an organization. Railyard interacts directly with the Kubernetes API (as opposed to a higher level abstraction), but the cluster is operated entirely by another team. We’re able to learn from their domain knowledge to keep the cluster running reliably so we can focus on ML infrastructure.
  • Our Kubernetes cluster gives us great flexibility to scale up and out. We can easily scale our cluster volume when we need to train more models, or quickly add new instance types when we need additional compute resources.
  • Centrally tracking model state and ownership allows us to easily observe and debug training jobs. We’ve moved from asking, “Did you save the output of your job anywhere so we can look at it?” to “What’s your job ID? We’ll figure out the rest.” We observe aggregate metrics and track the overall performance of training jobs across the cluster.
  • Building an API for model training enables us to use it everywhere. Teams can call our API from any service, scheduler, or task runner. We now use Railyard to train models using an Airflow task definition as part of a larger graph of data jobs.

The Railyard architecture

In the early days of model training at Stripe, an engineer or data scientist would SSH into an EC2 instance and manually launch a Python process to train a model. This served Stripe’s needs at the time, but had a number of challenges and open questions for our Machine Learning Infrastructure team to address as the company grew:

  • How do we scale model training from ad-hoc Python processes on shared EC2 instances to automatically training hundreds of models a day?
  • How do we build an interface that is generic enough to support multiple training libraries, frameworks, and paradigms while remaining expressive and concise?
  • What metrics and metadata do we want to track for each model run?
  • Where should training jobs be executed?
  • How do we scale different compute resource needs (CPU, GPU, memory) for different model types?

Our goal when designing this system was to enable our data scientists to think less about how their machine learning jobs are run on our infrastructure, and instead focus on their core inquiry. Machine learning workflows typically involve multiple steps that include loading data, training models, serializing models, and persisting evaluation data. Because Stripe runs its infrastructure in the cloud, we can manage these processes behind an API: this reduces cognitive burden for our data science and engineering teams and moves local processes to a collaborative, shared environment. After a year and a half of iteration and collaboration with teams across Stripe, we’ve converged on the following system architecture for Railyard. Here’s a high-level overview:

Railyard runs on a Kubernetes cluster and pairs jobs with the right instance type.

Railyard provides a JSON API and is a Scala service that manages job history, state, and provenance in a Postgres database. Jobs are executed and coordinated using the Kubernetes API, and our Kubernetes cluster provides multiple instance types with different compute resources. The cluster can pair jobs with the right instance type: for example, most jobs default to our high-CPU instances, data-intensive jobs run on high-memory instances, and specialized training jobs like deep learning run on GPU instances.

We package the Python code for model training using Subpar, a Google library that creates a standalone executable that includes all dependencies in one package. This is included in a Docker container, deployed to the AWS Elastic Container Registry, and executed as a Kubernetes job. When Railyard receives an API request, it runs the matching training job and logs are streamed to S3 for inspection. A given job will run through multiple steps, including fetching training and holdout data, training the model, and serializing the trained model and evaluation data to S3. These training results are persisted in Postgres and exposed in the Railyard API.

Railyard’s API design

The Railyard API allows you to specify everything you need to train a machine learning model, including data sources and model parameters. In designing this API we needed to answer the following question: how do we provide a generic interface for multiple training frameworks while remaining expressive and concise for users?

We iterated on a few designs with multiple internal customers to understand each use case. Some teams only needed ad-hoc model training and could simply use SQL to fetch features, while others needed to call an API programmatically hundreds of times a day using features stored in S3. We explored a number of different API concepts, arriving at two extremes on either end of the design spectrum.

On one end, we explored designing a custom DSL to specify the entire training job by encoding scikit-learn components directly in the API itself. Users could include scikit-learn pipeline components in the API specification and would not need to write any Python code themselves.

On the other end of the spectrum we reviewed designs to allow users to write their own Python classes for their training code with clearly defined input and output interfaces. Our library would be responsible for both the necessary inputs to train models (fetching, filtering, and splitting training and test data) and the outputs of the training pipeline (serializing the model, and writing evaluation and label data). The user would otherwise be responsible for writing all training logic.

In the end, any DSL-based approach ended up being too inflexible: it either tied us to a given machine learning framework or required that we continuously update the API to keep pace with changing frameworks or libraries. We converged on the following split: our API exposes fields for changing data sources, data filters, feature names, labels, and training parameters, but the core logic for a given training job lives entirely in Python.

Here’s an example of an API request to the Railyard service:

{
  // What does this model do?
  "model_description": "A model to predict fraud",
  // What is this model called?
  "model_name": "fraud_prediction_model",
  // What team owns this model?
  "owner": "machine-learning-infrastructure",
  // What project is this model for?
  "project": "railyard-api-blog-post",
  // Which team member is training this model?
  "trainer": "robstory",
  "data": {
    "features": [
      {
        // Columns we’re fetching from Hadoop Parquet files
        "names": ["created_at", "charge_type", "charge_amount",
                  "charge_country", "has_fraud_dispute"],
        // Our data source is S3
        "source": "s3",
        // The path to our Parquet data
        "path": "s3://path/to/parquet/fraud_data.parq"
      }
    ],
    // The canonical date column in our dataset
    "date_column": "created_at",
    // Data can be filtered multiple times
    "filters": [
      // Filter out data before 2018-01-01
      {
        "feature_name": "created_at",
        "predicate": "GtEq",
        "feature_value": {
          "string_val": "2018-01-01"
        }
      },
      // Filter out data after 2019-01-01
      {
        "feature_name": "created_at",
        "predicate": "LtEq",
        "feature_value": {
          "string_val": "2019-01-01"
        }
      },
      // Filter for charges greater than $10.00
      {
        "feature_name": "charge_amount",
        "predicate": "Gt",
        "feature_value": {
          "float_val": 10.00
        }
      },
      // Filter for charges in the US or Canada
      {
        "feature_name": "charge_country",
        "predicate": "IsIn",
        "feature_value": {
          "string_vals": ["US", "CA"]
        }
      }
    ],
    // We can specify how to treat holdout data
    "holdout_sampling": {
      "sampling_function": "DATE_RANGE",
      // Split holdout data from 2018-10-01 to 2019-01-01
      // into a new dataset
      "date_range_sampling": {
        "date_column": "created_at",
        "start_date": "2018-10-01",
        "end_date": "2019-01-01"
      }
    }
  },
  "train": {
    // The name of the Python workflow we're training
    "workflow_name": "StripeFraudModel",
    // The list of features we're using in our classifier
    "classifier_features": [
      "charge_type", "charge_amount", "charge_country"
    ],
    "label": "is_fraudulent",
    // We can include hyperparameters in our model
    "custom_params": {
      "objective": "reg:linear",
      "max_depth": 6,
      "n_estimators": 500,
      "min_child_weight": 50,
      "learning_rate": 0.02
    }
  }
}

We learned a few lessons while designing this API:

  • Be flexible with model parameters. Providing a free-form custom_params field that accepts any valid JSON was very important for our users. We validate most of the API request, but you can’t anticipate every parameter a machine learning engineer or data scientist needs for all of the model types they want to use. This field is most frequently used to include a model’s hyperparameters.
  • Not providing a DSL was the right choice (for us). Finding the sweet spot for expressiveness in an API for machine learning is difficult, but so far the approach outlined above has worked out well for our users. Many users only need to change dates, data sources, or hyperparameters when retraining. We haven’t gotten any requests to add more DSL-like features to the API itself.

The Python workflow

Stripe uses Python for all ML model training because of its support for many best-in-class ML libraries and frameworks. When the Railyard project started we only had support for scikit-learn, but have since added XGBoost, PyTorch, and FastText. The ML landscape changes very quickly and we needed a design that didn’t pick winners or constrain users to specific libraries. To enable this extensibility, we defined a framework-agnostic workflow that presents an API contract with users: we pass data in, you pass a trained model back out, and we’ll score and serialize the model for you. Here’s what a minimal Python workflow looks like:

class StripeFraudModel(StripeMLWorkflow):
  # A basic model training workflow: all workflows inherit
  # Railyard’s StripeMLWorkflow class
  def train(self, training_dataframe, holdout_dataframe):
    # Construct an estimator using specified hyperparameters
    estimator = xgboost.XGBRegressor(**self.custom_params)

    # Serialize the trained model once training is finished;
    # we're using an in-house serialization library.
    serializable_estimator = stripe_ml.make_serializable(estimator)

    # Train our model on the features and label specified
    # in the API request
    fitted_model = serializable_estimator.fit(
      training_dataframe[self.classifier_features],
      training_dataframe[self.label]
    )

    # Hand our fitted model back to Railyard to serialize
    return fitted_model

Teams start adopting Railyard with an API specification and a workflow that defines a train method to train a classifier with the data fetched from the API request. The StripeMLWorkflow interface supports extensive customization to adapt to different training approaches and model types. You can preprocess your data before it gets passed to the train function, define your own data fetching implementation, specify how you want training and holdout data to be scored, and run any other Python code you need. For example, some of our deep learning models have custom data fetching code to stream batches of training data for model training. When your training job finishes you’ll end up with two outputs: a model identifier for your serialized model that can be put into production, and your evaluation data in S3.

If you build a machine learning API specification, here are a few things to keep in mind:

  • Interfaces are important. Users will want to load and transform data in ways you didn’t anticipate, train models using unsupported patterns, and write out unfamiliar types of evaluation data. It’s important to provide standard API interfaces like fetch_data, preprocess, train, and write_evaluation_data that specify some standard data containers (e.g., Pandas DataFrame and Torch Dataset) but are flexible in how they are generated and used.
  • Users should not need to think about model serialization or persistence. Reducing their cognitive burden makes their lives easier and gives them more time to be creative and focus on modeling and feature engineering. Data scientists and ML engineers already have enough to think about between feature engineering, modeling, evaluation, and more. They should be able to train and hand over their model to your scoring infrastructure without ever needing to think about how it gets serialized or persisted.
  • Define metrics for each step of the training workflow. Make sure you’re gathering fine-grained metrics for each training step: data loading, model training, model serialization, evaluation data persistence, etc. We store high-level success and failure metrics that can be examined by team, project, or the individual machine performing the training. On a functional level, our team uses these metrics to debug and profile long-running or failed jobs, and provide feedback to the appropriate team when there’s a problem with a given training job. And on a collaborative level, these metrics have changed how our team operates. Moving from a reactive stance (“My model didn’t train, can you help?”) to a proactive one (“Hey, I notice your model didn’t train, here’s what happened”) has helped us be better partners to the many teams we work with.
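The interface contract described in these points can be sketched as an abstract base class. This is a hypothetical illustration, not Railyard’s actual StripeMLWorkflow code: the method names mirror the interfaces mentioned above (fetch_data, preprocess, train, write_evaluation_data), and everything else is assumed for the example.

```python
from abc import ABC, abstractmethod

class MLWorkflow(ABC):
    """Hypothetical framework-agnostic workflow contract: data in, model out."""

    def __init__(self, custom_params=None):
        # Free-form hyperparameters passed through from the API request
        self.custom_params = custom_params or {}

    def fetch_data(self, spec):
        """Workflows may override this, e.g. to stream batches of training data."""
        raise NotImplementedError

    def preprocess(self, dataframe):
        """Hook to transform data before training; identity by default."""
        return dataframe

    @abstractmethod
    def train(self, training_dataframe, holdout_dataframe):
        """Return a fitted model; the framework serializes and persists it."""

    def write_evaluation_data(self, model, holdout_dataframe):
        """Evaluation persistence handled by the framework by default."""

# A workflow author only has to implement train():
class ToyWorkflow(MLWorkflow):
    def train(self, training_dataframe, holdout_dataframe):
        return {"trained_on": len(training_dataframe)}

model = ToyWorkflow().train([1, 2, 3], [])
print(model)  # {'trained_on': 3}
```

The key design property is that serialization, persistence, and evaluation live in the base class, so model authors never touch them.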

Scaling Kubernetes

Railyard coordinates hundreds of machine learning jobs across our cluster, so effective resource management across our instances is crucial. The first version of Railyard simply ran individual subprocesses from the Scala service that manages all jobs across our cluster. We would get a request, start Java’s ProcessBuilder, and kick off a subprocess to build a Python virtualenv and train the model. This basic implementation allowed us to quickly iterate on our API in our early days, but managing subprocesses wasn’t going to scale very well. We needed a proper job management system that met a few requirements:

  • Scaling the cluster quickly for different resource/instance types
  • Routing models to specific instances based on their resource needs
  • Job queueing to prioritize resources for pending work

Luckily, our Orchestration team had been working hard to build a reliable Kubernetes cluster and suggested this new cluster would be a good platform for Railyard’s needs. It was a great fit; a fully managed Kubernetes cluster provides all of the pieces we needed to meet our system’s requirements.

Containerizing Railyard

To run Railyard jobs on Kubernetes, we needed a way to reliably package our Python code into a fully executable binary. We use Google’s Subpar library, which allows us to package all of our Python requirements and source code into a single .par file for execution. The library also includes support for the Bazel build system out of the box. Over the past few years, Stripe has been moving many of its builds to Bazel; we appreciate its speed, correctness, and flexibility in a multi-language environment.

With Subpar you can define an entrypoint to your Python executable and Bazel will build your .par executable to bundle into a Dockerfile:

par_binary(
    name = "railyard_train",
    srcs = ["@.../ml:railyard_srcs"],
    data = ["@.../ml:railyard_data"],
    main = "@.../ml:railyard/",
    deps = all_requirements,
)

With the Subpar package built, the Kubernetes command only needs to execute it with Python:

command: ["sh"]
args: ["-c", "python /railyard_train.par"]

Within the Dockerfile we package up any other third-party dependencies that we need for model training, such as the CUDA runtime to provide GPU support for our PyTorch models. After our Docker image is built, we deploy it to AWS’s Elastic Container Registry so our Kubernetes cluster can fetch and run the image.

Running diverse workloads

Some machine learning tasks can benefit from a specific instance type with resources optimized for a given workload. For example, a deep learning task may be best suited for a GPU instance while fraud models that employ huge datasets should be paired with high-memory instances. To support these mixed workloads we added a new top-level field to the Railyard API request to specify the compute resource for jobs running on Kubernetes:

    "compute_resource": "GPU"

Railyard supports training models on CPU, GPU, or memory-optimized instances. Models for our largest datasets can require hundreds of gigabytes of memory to train, while our smaller models can train quickly on smaller (and less expensive) instance types.

Scheduling and distributing jobs

Railyard exerts a fine-grained level of control on how Kubernetes distributes jobs across the cluster. For each request, we look at the requested compute resource and set both a Kubernetes Toleration and an Affinity to specify the type of node that we would like to run on. These parameters effectively tell the Kubernetes cluster:

  • the affinity, or which nodes the job should run on
  • the toleration, or which nodes should be reserved for specific tasks

Kubernetes uses the affinity and toleration properties of a given pod to determine which nodes the job can be scheduled on and how jobs should be distributed across them.

Kubernetes supports per-job CPU and memory requirements to ensure that workloads don’t experience resource starvation due to neighboring jobs on the same host. In Railyard, we determine limits for all jobs based on their historic and future expected usage of resources. In the case of high-memory or GPU training jobs, these limits are set so that each job gets an entire node to itself; if all nodes are occupied, then the scheduler will place the job in a queue. Jobs with less intensive resource requirements are scheduled on nodes to run in parallel.
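Put together, a training job’s pod spec might look roughly like the following sketch. The taint key, node label, image name, and resource numbers here are illustrative assumptions, not Railyard’s actual configuration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: railyard-train-gpu
spec:
  # Toleration: this pod may run on nodes tainted for GPU-only workloads.
  tolerations:
    - key: "workload-class"          # illustrative taint key
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
  # Affinity: require nodes labeled as GPU instances.
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "instance-class" # illustrative node label
                operator: In
                values: ["gpu"]
  containers:
    - name: train
      image: railyard-train:latest    # illustrative image name
      command: ["sh"]
      args: ["-c", "python /railyard_train.par"]
      # Requests sized so a GPU job effectively gets a node to itself.
      resources:
        requests:
          memory: "200Gi"
        limits:
          memory: "200Gi"
          nvidia.com/gpu: 1
```

The toleration keeps ordinary CPU jobs off the tainted GPU nodes, while the affinity keeps this job off everything else.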

With these parameters in place, we can lean on the Kubernetes resource scheduler to balance our jobs across available nodes. Given a set of job and resource requests, the scheduler will intelligently distribute those jobs to nodes across the cluster.

One year later: running at scale

Moving our training jobs to a Kubernetes cluster has enabled us to rapidly spin up new resources for different models and expand the cluster to support more training jobs. We can use a single command to expand the cluster and new instance types only require a small configuration change. When the memory requirements of running jobs outgrew our CPU-optimized instance types, we started training on memory-optimized instances the very next day; when we observe a backlog of jobs, we can immediately expand the cluster to process the queue. Model training on Kubernetes is available to any data scientist or engineer at Stripe: with a Python workflow and an API request, they can start training models on any resource type in the cluster.

To date, we’ve trained almost 100,000 models on Kubernetes, with new models trained each day. Our fraud models automatically retrain on a regular basis using Railyard and Kubernetes, and we’re steadily moving more of Stripe’s models onto an automated retraining cycle. Radar’s fraud model is built on hundreds of distinct ML models and has a dedicated service that trains and deploys all of those models on a daily cadence. Other models retrain regularly using an Airflow task that uses the Railyard API.

We’ve learned a few key considerations for scaling Kubernetes and effectively managing instances:

  • Instance flexibility is really important. Teams can have very different machine learning workloads. On any given day we might train thousands of time series forecasts, a long-running word embedding model, or a fraud model with hundreds of gigabytes of data. The ability to quickly add new instance types and to expand the cluster are equally important for scalability.
  • Managing memory-intensive workflows is hard. Even using various instance sizes and a managed cluster, we still sometimes have jobs that run out of memory and are killed. This is a downside to providing so much flexibility in the Python workflow: modelers are free to write memory-intensive workflows. Kubernetes allows us to proactively kill jobs that are consuming too many resources, but it still results in a failed training job for the modeler. We’re thinking about ways to better manage this, including smart retry behavior to automatically reschedule failed jobs on higher-capacity instances and moving to distributed libraries like dask-ml.
  • Subpar is an excellent solution for packaging Python code. Managing Python dependencies can be tricky, particularly when you’d like to bundle them as an executable that can be shipped to different instances. If we were to build this from scratch again we would probably take a look at Facebook’s XARs, but Subpar is very compatible with Bazel and it’s been running well in production for over a year.
  • Having a good Kubernetes team is a force multiplier. Railyard could not have been a success without the support of our Orchestration team, which manages our Kubernetes cluster and pushes the platform forward for the whole organization. If we had to manage and operate the cluster in addition to building our services, we would have needed more engineers and taken significantly longer to ship.

Building ML infrastructure

We’ve learned that building common machine learning infrastructure enables teams across Stripe to operate independently and focus on their local ML modeling goals. Over the last year we’ve used Railyard to train thousands of models spanning use cases from forecasting to deep learning. This system has enabled us to build rich functionality for model evaluation and design services to optimize hyperparameters for our models at scale.

While there is a wealth of information available on data science and machine learning from the modeling perspective, there isn’t nearly as much published about how companies build and operate their production machine learning infrastructure. Uber, Airbnb, and Lyft have all discussed how their infrastructure operates, and we’re following their lead in introducing the design patterns that have worked for us. We plan to share more lessons from our ML architecture in the months ahead. In the meantime, we’d love to hear from you: please let us know which lessons are most useful and if there are any specific topics about which you’d like to hear more.

Like this post? Join the Stripe engineering team. View openings

May 7, 2019

Effectively using AWS Reserved Instances

Ryan Lopopolo on June 26, 2018 in Engineering

Stripe uses Amazon Web Services to power our infrastructure. With AWS, we can dynamically scale our fleet of servers in real-time. This elasticity enables us to reliably serve a rapidly growing user base and scale along with their businesses. We use AWS Reserved Instances, which allow us to predictably forecast our cloud spend given a dynamic fleet with rapidly changing compute requirements.

One of the biggest problems in cloud computing is capacity planning: the ability to forecast your compute power requirements and manage the budget allocated to AWS servers. At Stripe, we started by solely using reserved instances to manage pricing for individual instances, but today we can dynamically and reliably understand costs as our fleet changes over time. Reserved instances allow us to make cost-effective decisions through careful resource management. We’ve developed an easy-to-use framework for automating our purchase decisions, which we’ll outline in this post.

Reserved instances reduce your AWS pricing (since they’re a commitment to use that server). The most economical way to use reserved instances is to make sure server utilization over the year is higher than 70%; this is the break-even point where it’s more economical to choose reserved instances over on-demand instances. This also fits Stripe’s usage patterns.
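The break-even calculation itself is just a ratio: a reservation wins once expected utilization exceeds the reserved effective hourly rate divided by the on-demand rate. A minimal sketch, using made-up prices for illustration:

```python
def break_even_utilization(on_demand_hourly: float, reserved_hourly: float) -> float:
    """Fraction of the year an instance must run for a reserved
    instance to cost less than paying on-demand for the same usage."""
    return reserved_hourly / on_demand_hourly

# Hypothetical prices: $0.34/hr on-demand vs. an effective $0.238/hr reserved.
# Running the instance more than ~70% of the time favors the reservation.
utilization = break_even_utilization(0.34, 0.238)
```

Real break-even points vary by instance family, term, and payment option, so this is a planning heuristic rather than a pricing rule.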

Reserved instances are hard to purchase effectively. It’s easy to allocate the wrong number, and hard to predict future compute requirements over time. Deciding which and how many reserved instances to buy is a non-trivial exercise at the nexus of cloud strategy, bin packing, and capacity planning.

Understanding AWS Reserved Instances

There are many dimensions to every reserved instance purchase, some of which are out of scope for this post. Some you may already know, like AWS region, VM tenancy, and OS platform. Other options, like contract length, pricing plan, and the type of reserved instance, are related to your company’s cloud strategy. You need to know what your financial plan looks like over the next few years to make these business decisions; the technical guidance that engineers provide can only offer a limited perspective. At Stripe, we typically use no-upfront convertible reserved instances with a three-year term. This means our pricing is:

  • No-upfront: We pay monthly on our normal billing cycle.
  • Convertible: We can change our instance types for our reservation.
  • Term: We lock in a pricing plan and commit to it for three years.

We think this offers the right trade-off between price efficiency and flexibility.

Of the remaining dimensions, the most impactful decision is scope. Scope is the AWS region or availability zone to which a reserved instance is attached. Your choice of scope affects capacity planning, deployment of your reserved instances, and server upgrades. In Stripe’s case, we reserve our instances with a regional scope.

If you choose to scope your reserved instances to a specific availability zone, they are locked to a specific instance type. This requires you to understand and plan your compute requirements in two dimensions:

  • The instance type (e.g. c5.2xlarge) defines how powerful each instance should be. This is known as vertical scale, since over time you can upgrade each server’s compute power without growing the number of instances.
  • The availability zones are where you plan to deploy instances. Adding more instances across availability zones increases your horizontal scale. The more servers you run, the more likely your application will keep running in case of failure.

These require you to predict both how your application load will grow and how dense your cluster will be years into the future. Any miscalculation means you’ll pay for reserved instances that you won’t actually use.

Compute power varies by the size of each instance: for example, nine c5.xlarge instances on AWS provide the equivalent compute power of one c5.9xlarge instance.

AWS divides its infrastructure into several regions, which include many availability zones. If you choose to scope your reserved instances more broadly by region, AWS allows you to deploy instances of any size, as long as the compute power matches what you’ve reserved. This allows you to purchase high-powered instances up-front and deploy lower-powered instances later on. Even better, AWS will automatically apply the budget you’ve allocated toward reserved instances to as many instances in that region as possible.
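This size-flexible accounting works in AWS’s normalized units: each instance size has a published normalization factor (xlarge is 8, 9xlarge is 72, and so on), and a regional reservation covers any mix of sizes in a family whose factors add up. A minimal sketch:

```python
# AWS normalization factors for common instance sizes (per AWS billing docs).
NORMALIZATION_FACTORS = {
    "large": 4.0, "xlarge": 8.0, "2xlarge": 16.0,
    "4xlarge": 32.0, "9xlarge": 72.0, "12xlarge": 96.0,
}

def normalized_units(instance_type: str, count: int) -> float:
    """Convert a count of instances (e.g. 'c5.xlarge') into normalized units."""
    size = instance_type.split(".", 1)[1]
    return NORMALIZATION_FACTORS[size] * count

# Nine c5.xlarge instances carry the same compute power as one c5.9xlarge.
assert normalized_units("c5.xlarge", 9) == normalized_units("c5.9xlarge", 1)
```

In other words, a reservation for 72 normalized units of c5 could be consumed by one c5.9xlarge today and nine c5.xlarge next quarter.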

Automate your AWS capacity planning

To adopt reserved instances, you first need to estimate your cluster’s total compute requirements. This is the hardest part of capacity planning. AWS defines a scale for the compute power of all its server sizes: we can use this to calculate an aggregate value. (We’ve provided an example of a SQL query that could generate this report below.)

  1. Take a snapshot of your fleet using the AWS cost and usage report, which is stored in a Redshift table. You should group the usage by instance family.
  2. Add up the total compute power for each instance family. Each charge in the report includes a scaled usage amount that you should sum up.
  3. Pick a standard instance size that you’ll use for your reserved instances.
  4. Divide the total compute capacity by its scaling factor (e.g. xlarge instances have a scaling factor of 8.0).
  5. The result is the number of reserved instances you’ll purchase. The budget we’ve calculated here should provide sufficient compute power to drive your fleet.
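Steps 2 through 5 reduce to a few lines once usage has been grouped by instance family (step 1). The sketch below assumes xlarge as the standard size; the input numbers are illustrative:

```python
import math

XLARGE_FACTOR = 8.0  # AWS normalization factor for xlarge instances

def reservations_to_purchase(normalized_usage_by_family: dict) -> dict:
    """Given total normalized units per instance family, return how many
    xlarge reserved instances are needed to cover that usage."""
    return {
        family: math.floor(total / XLARGE_FACTOR)
        for family, total in normalized_usage_by_family.items()
    }

# e.g. 800 normalized units of c5 usage -> 100 c5.xlarge reservations
purchases = reservations_to_purchase({"c5": 800.0, "m5": 100.0})
```

Flooring rather than rounding up keeps the purchase conservative: any remainder is covered at on-demand rates instead of risking an idle reservation.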

By choosing regional scope, we naturally define three properties across all our reserved instances: the scope, instance size, and instance family. Once we decide on an exact configuration, we execute a purchase in the AWS console and the reserved instance pricing is instantly applied to our fleet.

Because our fleet can dynamically grow, shrink, or change in compute requirements, we need to be more flexible with how we set the target number of reserved instances to purchase. Instead, we choose an acceptable range for a mix of reserved and on-demand instances in our fleet.

To automate this, we built an ETL process in SQL and Python that detects when we fall outside this band and automatically prepares a purchase for us to approve. This is an evergreen process: the ETL process will continue to analyze and suggest purchases over time as the fleet dynamically scales up and down in compute requirements. We purchase reserved instances once a month.
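The decision at the heart of that ETL mirrors the SQL query shown below: when reserved coverage falls under a target fraction of the fleet, propose a top-up purchase. A minimal sketch (the 75% target is illustrative):

```python
def units_to_purchase(total_units: float, reserved_units: float,
                      target_coverage: float = 0.75) -> float:
    """Normalized units of new reservations needed to bring reserved
    coverage back up to the target fraction of the fleet."""
    shortfall = target_coverage * total_units - reserved_units
    return max(0.0, shortfall)

# A fleet of 100 normalized units with 60 reserved needs 15 more units.
needed = units_to_purchase(100.0, 60.0)
```

Because the function returns zero when coverage is already above target, the monthly run is a no-op whenever the fleet shrinks.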

Here’s an example of the SQL query we regularly run to estimate our required compute power. First, we take a snapshot of our fleet with the cost and usage report:

WITH line_items AS (
  SELECT
    lineitem_normalizedusageamount::float / 8.0 AS usage,
    product_region AS region,
    split_part(product_instancetype, '.', 1) AS instance_family,
    lineitem_lineitemtype AS itemtype
  FROM aws.cost_and_usage_201806 -- use your cost & usage report
  WHERE lineitem_productcode = 'AmazonEC2'
  AND lineitem_lineitemtype IN ('Usage', 'DiscountedUsage')
  AND product_instancetype <> ''
  AND lineitem_normalizedusageamount <> ''
  AND date_trunc('hour', lineitem_usagestartdate::timestamp) =
    date_trunc('day', CURRENT_DATE) - interval '4 days'
),

Next, we select relevant data on usage for our existing reserved instances from our fleet’s total usage:

usage AS (
  SELECT region, instance_family, SUM(usage) AS total,
    SUM(CASE itemtype WHEN 'DiscountedUsage' THEN usage END) AS res
  FROM line_items
  GROUP BY region, instance_family
)
Finally, we compute the number of additional reserved instances we’ll need to purchase to remain within our acceptable range:

SELECT region, instance_family,
  FLOOR(NVL(res, 0)) AS normalized_reservations,
  FLOOR(NVL(total, 0)) AS normalized_usage,
  FLOOR(CASE WHEN 0.75 * total > res THEN
    0.75 * total - res ELSE 0 END) AS to_purchase
FROM usage
ORDER BY region, instance_family

A complete example, including a Python notebook to render the output, can be found in the accompanying gist for this article.

Wrapping up

With this approach, you can automatically budget reserved instances in a predictable manner and dynamically recalculate your compute requirements on an ongoing basis. This process can improve flexibility, cost predictability, and efficiency of your AWS fleet. Here are a few things to keep in mind:

  • Pick one team to own this problem. Since this is a global optimization across the engineering organization, no individual team will have the necessary perspective to understand overall AWS requirements. Dedicating one team to this problem empowers them to gather a complete picture of the organization’s cloud usage and understand how to apply reserved instances effectively.
  • Pick one standard instance size when purchasing reserved instances. Even if the size you choose is larger than the capacity you expect to use for a single application, it’s easier to compare the same size across instance families and understand pricing and compute efficiency.
  • Choose your reserved instances for today’s compute requirements. Rather than choosing reserved instances in anticipation of how you plan to grow your fleet, take a clear snapshot of how you’re using your fleet today. Purchase the number of reserved instances required to meet your goals. Then continue to make purchases frequently and consistently.


Learning to operate Kubernetes reliably

Julia Evans on December 20, 2017 in Engineering

We recently built a distributed cron job scheduling system on top of Kubernetes, an exciting new platform for container orchestration. Kubernetes is very popular right now and makes a lot of exciting promises: one of the most exciting is that engineers don’t need to know or care what machines their applications run on.

Distributed systems are really hard, and managing services on distributed systems is one of the hardest problems operations teams face. Breaking in new software in production and learning how to operate it reliably is something we take very seriously. As an example of why learning to operate Kubernetes is important (and why it’s hard!), here’s a fantastic postmortem of a one-hour outage caused by a bug in Kubernetes.

In this post, we’ll explain why we chose to build on top of Kubernetes. We’ll examine how we integrated Kubernetes into our existing infrastructure, our approach to building confidence in (and improving) our Kubernetes cluster’s reliability, and the abstractions we’ve built on top of Kubernetes.

What’s Kubernetes?

Kubernetes is a distributed system for scheduling programs to run in a cluster. You can tell Kubernetes to run five copies of a program, and it’ll dynamically schedule them on your worker nodes. Containers are automatically scheduled to increase utilization and save money, powerful deployment primitives allow you to gradually roll out new code, and Security Contexts and Network Policies allow you to run multi-tenant workloads in a secure way.

Kubernetes has a lot of different kinds of scheduling capabilities built into it. It can schedule long-running HTTP services, daemonsets that run on every machine in your cluster, cron jobs that run every hour, and more. There’s a lot more to Kubernetes. If you want to know more, Kelsey Hightower has given a lot of excellent talks: Kubernetes for sysadmins and healthz: Stop reverse engineering applications and start monitoring from the inside are two nice starting points. There’s also a great, supportive community on Slack.

Why Kubernetes?

Every infrastructure project (hopefully!) starts with a business need, and our goal was to improve the reliability and security of an existing distributed cron job system we had. Our requirements were:

  • We needed to be able to build and operate it with a relatively small team (only 2 people were working full time on the project.)
  • We needed to schedule about 500 different cron jobs across around 20 machines reliably.

Here are a few reasons we decided to build on top of Kubernetes:

  • We wanted to build on top of an existing open-source project.
  • Kubernetes includes a distributed cron job scheduler, so we wouldn’t have to write one ourselves.
  • Kubernetes is a very active project and regularly accepts contributions.
  • Kubernetes is written in Go, which is easy to learn. Almost all of our Kubernetes bugfixes were made by inexperienced Go programmers on our team.
  • If we could successfully operate Kubernetes, we could build on top of Kubernetes in the future (for example, we’re currently working on a Kubernetes-based system to train machine learning models.)

We’d previously been using Chronos as a cron job scheduling system, but it was no longer meeting our reliability requirements and it’s mostly unmaintained (1 commit in the last 9 months, and the last time a pull request was merged was March 2016). Because Chronos is unmaintained, we decided it wasn’t worth continuing to invest in improving our existing cluster.

If you’re considering Kubernetes, keep in mind: don’t use Kubernetes just because other companies are using it. Setting up a reliable cluster takes a huge amount of time, and the business case for using it isn’t always obvious. Invest your time in a smart way.

What does reliable mean?

When it comes to operating services, the word reliable isn’t meaningful on its own. To talk about reliability, you first need to establish a SLO (service level objective).

We had three primary goals:

  1. 99.99% of cron jobs should get scheduled and start running within 20 minutes of their scheduled run time. 20 minutes is a pretty wide window, but we interviewed our internal customers and none of them asked for higher precision.
  2. Jobs should run to completion 99.99% of the time (without being terminated).
  3. Our migration to Kubernetes shouldn’t cause any customer-facing incidents.

This meant a few things:

  • Short periods of downtime in the Kubernetes API are acceptable (if it’s down for ten minutes, it’s ok as long as we can recover within five minutes.)
  • Scheduling bugs (where a cron job run gets dropped completely and fails to run at all) are not acceptable. We took reports of scheduling bugs extremely seriously.
  • We needed to be careful about pod evictions and terminating instances safely so that jobs didn’t get terminated too frequently.
  • We needed a good migration plan.

Building a Kubernetes cluster

Our basic approach to setting up our first Kubernetes cluster was to build the cluster from scratch instead of using a tool like kubeadm or kops (using Kubernetes The Hard Way as a reference). We provisioned our configuration with Puppet, our usual configuration management tool. Building from scratch was great for two reasons: we were able to deeply integrate Kubernetes in our architecture, and we developed a deep understanding of its internals.

Building from scratch let us integrate Kubernetes into our existing infrastructure. We wanted seamless integration with our existing systems for logging, certificate management, secrets, network security, monitoring, AWS instance management, deployment, database proxies, internal DNS servers, configuration management, and more. Integrating all those systems sometimes required a little creativity, but overall was easier than trying to shoehorn kubeadm/kops into doing what we wanted.

We already trust and know how to operate all those existing systems, so we wanted to keep using them in our new Kubernetes cluster. For example, secure certificate management is a very hard problem, and we already have a way to issue and manage certificates. We were able to avoid creating a new CA just for Kubernetes with a proper integration.

We were forced to understand exactly how the parameters we were setting affected our Kubernetes setup. For example, there are over a dozen parameters used when configuring the certificates/CAs used for authentication. Understanding all of those parameters made it way easier to debug our setup when we ran into issues with authentication.

Building confidence in Kubernetes

At the beginning of our Kubernetes work, nobody on the team had ever used Kubernetes before (except in some cases for toy projects). How do you get from “None of us have ever used Kubernetes” to “We’re confident running Kubernetes in production”?

Strategy 0: Talk to other companies

We asked a few folks at other companies about their experiences with Kubernetes. They were all using Kubernetes in different ways or on different environments (to run HTTP services, on bare metal, on Google Kubernetes Engine, etc).

Especially when talking about a large and complicated system like Kubernetes, it’s important to think critically about your own use cases, do your own experiments, build confidence in your own environment, and make your own decisions. For example, you should not read this blog post and conclude “Well, Stripe is using Kubernetes successfully, so it will work for us too!”

Here’s what we learned after conversations with several companies operating Kubernetes clusters:

  • Prioritize working on your etcd cluster’s reliability (etcd is where all of your Kubernetes cluster’s state is stored.)
  • Some Kubernetes features are more stable than others, so be cautious of alpha features. Some companies only use stable features after they’ve been stable for more than one release (e.g. if a feature became stable in 1.8, they’d wait for 1.9 or 1.10 before using it.)
  • Consider using a hosted Kubernetes system like GKE/AKS/EKS. Setting up a high-availability Kubernetes system yourself from scratch is a huge amount of work. AWS didn’t have a managed Kubernetes service during this project so this wasn’t an option for us.
  • Be careful about the additional network latency introduced by overlay networks / software defined networking.

Talking to other companies of course didn’t give us a clear answer on whether Kubernetes would work for us, but it did give us questions to ask and things to be cautious about.

Strategy 1: Read the code

We were planning to depend quite heavily on one component of Kubernetes, the cron job controller. This component was in alpha at the time, which made us a little worried. We’d tried it out in a test cluster, but how could we tell whether it would work for us in production?

Thankfully, all of the cron job controller’s core functionality is just 400 lines of Go. Reading through the source code quickly showed that:

  1. The cron job controller is a stateless service (like every other Kubernetes component, except etcd).
  2. Every ten seconds, this controller calls the syncAll function: go wait.Until(jm.syncAll, 10*time.Second, stopCh)
  3. The syncAll function fetches all cron jobs from the Kubernetes API, iterates through that list, determines which jobs should next run, then starts those jobs.
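The loop described above is level-triggered: every pass re-reads all jobs and starts whatever is due. A simplified Python sketch of the idea (not the actual Go source, which lives in the cron job controller):

```python
def sync_all(list_cron_jobs, now, start_job):
    """One reconciliation pass: fetch every cron job and start those
    whose next scheduled time has arrived (simplified model)."""
    for job in list_cron_jobs():
        if now >= job["next_run_at"]:
            start_job(job)
            job["next_run_at"] = now + job["period_seconds"]

# The real controller wraps this pass in
# wait.Until(jm.syncAll, 10*time.Second, stopCh), i.e. it re-runs
# the full reconciliation every ten seconds until asked to stop.
```

Because each pass recomputes everything from the API's current state, a crashed or restarted controller picks up where it left off on its next pass.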

The core logic seemed relatively easy to understand. More importantly, we felt like if there was a bug in this controller, it was probably something we could fix ourselves.

Strategy 2: Do load testing

Before we started building the cluster in earnest, we did a little bit of load testing. We weren’t worried about how many nodes the Kubernetes cluster could handle (we were planning to deploy around 20 nodes), but we did want to make certain Kubernetes could handle running as many cron jobs as we wanted to run (about 50 per minute).

We ran a test in a 3-node cluster where we created 1,000 cron jobs that each ran every minute. Each of these jobs simply ran bash -c 'echo hello world'. We chose simple jobs because we wanted to test the scheduling and orchestration abilities of the cluster, not the cluster’s total compute capacity.

Our test cluster could not handle 1,000 cron jobs per minute. We observed that every node would only start at most one pod per second, and the cluster was able to run 200 cron jobs per minute without issue. Since we only wanted to run approximately 50 cron jobs per minute, we decided these limits weren’t a blocker (and that we could figure them out later if required). Onwards!
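That observed ceiling is consistent with a simple back-of-envelope model, assuming the roughly one-pod-start-per-second-per-node limit we measured:

```python
def max_cron_jobs_per_minute(nodes: int, pod_starts_per_node_per_second: int = 1) -> int:
    """Rough throughput ceiling: pod starts are the bottleneck, so the
    cluster can launch at most nodes * rate * 60 jobs each minute."""
    return nodes * pod_starts_per_node_per_second * 60

# A 3-node cluster tops out around 180 jobs/minute, close to the
# ~200/minute we observed; 20 nodes leaves ample headroom for ~50/minute.
ceiling = max_cron_jobs_per_minute(3)
```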

Strategy 3: Prioritize building and testing a high availability etcd cluster

One of the most important things to get right when setting up Kubernetes is running etcd. Etcd is the heart of your Kubernetes cluster—it’s where all of the data about everything in your cluster is stored. Everything other than etcd is stateless. If etcd isn’t running, you can’t make any changes to your Kubernetes cluster (though existing services will continue running!).

This diagram shows how etcd is the heart of your Kubernetes cluster—the API server is a stateless REST/authentication endpoint in front of etcd, and then every other component works by talking to etcd through the API server.

When running etcd, there are two important points to keep in mind:

  • Set up replication so that your cluster doesn’t die if you lose a node. We have three etcd replicas right now.
  • Make sure you have enough I/O bandwidth available. Our version of etcd had an issue where one node with high fsync latency could trigger continuous leader elections, causing unavailability on our cluster. We remediated this by ensuring that all of our nodes had more I/O bandwidth than the number of writes etcd was performing.

Setting up replication isn’t a set-and-forget operation. We carefully tested that we could actually lose an etcd node, and that the cluster gracefully recovered.

Here’s some of the work we did to set up our etcd cluster:

  • Set up replication
  • Monitor that the etcd service is available (if etcd is down, we want to know right away)
  • Write some simple tooling so we could easily spin up new etcd nodes and join them to the cluster
  • Patch etcd’s Consul integration so that we could run more than 1 etcd cluster in our production environment
  • Test recovering from an etcd backup
  • Test that we could rebuild the whole cluster without downtime

We were happy that we did this testing pretty early on. One Friday morning in our production cluster, one of our etcd nodes stopped responding to ping. We got alerted about it, terminated the node, brought up a new one, joined it to the cluster, and in the meantime Kubernetes continued running without incident. Fantastic.

Strategy 4: Incrementally migrate jobs to Kubernetes

One of our major goals was to migrate our jobs to Kubernetes without causing any outages. The secret to running a successful production migration is not to avoid making any mistakes (that’s impossible), but to design your migration to reduce the impact of mistakes.

We were lucky to have a wide variety of jobs to migrate to our new cluster, so there were some low-impact jobs we could migrate where one or two failures were acceptable.

Before starting the migration, we built easy-to-use tooling that would let us move jobs back and forth between the old and new systems in less than five minutes if necessary. This easy tooling reduced the impact of mistakes—if we moved over a job that had a dependency we hadn’t planned for, no big deal! We could just move it back, fix the issue, and try again later.

Here’s the overall migration strategy we took:

  1. Roughly order the jobs in terms of how critical they were
  2. Repeatedly move some jobs over to Kubernetes. If we discover a new edge case, quickly roll back, fix the issue, and try again.

Strategy 5: Investigate Kubernetes bugs (and fix them)

We set out a rule at the beginning of the project: if Kubernetes does something surprising or unexpected, we have to investigate, figure out why, and come up with a remediation.

Investigating each issue is time consuming, but very important. If we simply dismissed flaky and strange behaviour in Kubernetes as a function of how complex distributed systems can become, we’d feel afraid of being on call for the resulting buggy cluster.

After taking this approach, we discovered (and were able to fix!) several bugs in Kubernetes.

Fixing these bugs made us feel much better about our use of the Kubernetes project—not only did it work relatively well, but they also accept patches and have a good PR review process.

Kubernetes definitely has bugs, like all software. In particular, we use the scheduler very heavily (because our cron jobs are constantly creating new pods), and the scheduler’s use of caching sometimes results in bugs, regressions, and crashes. Caching is hard! But the codebase is approachable and we’ve been able to handle the bugs we encountered.

One other issue worth mentioning is Kubernetes’ pod eviction logic. Kubernetes has a component called the node controller which is responsible for evicting pods and moving them to another node if a node becomes unresponsive. It’s possible for all nodes to temporarily become unresponsive (e.g. due to a networking or configuration issue), and in that case Kubernetes can terminate all pods in the cluster. This happened to us relatively early on in our testing.

If you’re running a large Kubernetes cluster, carefully read through the node controller documentation, think through the settings carefully, and test extensively. Every time we’ve tested a configuration change to these settings (e.g. --pod-eviction-timeout) by creating network partitions, surprising things have happened. It’s always better to discover these surprises in testing rather than at 3am in production.
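The relevant knobs live on the kube-controller-manager. The values below are the upstream defaults at the time, shown only to illustrate which settings interact; tune and test them for your own cluster:

```shell
kube-controller-manager \
  --node-monitor-grace-period=40s \  # how long a node may miss heartbeats before being marked NotReady
  --pod-eviction-timeout=5m0s        # how long to wait after NotReady before evicting that node's pods
```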

Strategy 6: Intentionally cause Kubernetes cluster issues

We’ve discussed running game day exercises at Stripe before, and it’s something we still do very frequently. The idea is to come up with situations you expect to eventually happen in production (e.g. losing a Kubernetes API server) and then intentionally cause those situations in production (during the work day, with warning) to ensure that you can handle them.

Running these exercises on our cluster often revealed issues like gaps in monitoring or configuration errors. We were very happy to discover those issues early on in a controlled fashion rather than by surprise six months later.

Here are a few of the game day exercises we ran:

  • Terminate one Kubernetes API server
  • Terminate all the Kubernetes API servers and bring them back up (to our surprise, this worked very well)
  • Terminate an etcd node
  • Cut off worker nodes in our Kubernetes cluster from the API servers (so that they can’t communicate). This resulted in all pods on those nodes being moved to other nodes.

We were really pleased to see how well Kubernetes responded to a lot of the disruptions we threw at it. Kubernetes is designed to be resilient to errors—it has one etcd cluster storing all the state, an API server which is simply a REST interface to that database, and a collection of stateless controllers that coordinate all cluster management.

If any of the Kubernetes core components (the API server, controller manager, or scheduler) are interrupted or restarted, once they come up they read the relevant state from etcd and continue operating seamlessly. This was one of the things we hoped would be true, and has actually worked very well in practice.

Here are some kinds of issues that we found during these tests:

  • “Weird, I didn’t get paged for that, that really should have paged. Let’s fix our monitoring there.”
  • “When we destroyed our API server instances and brought them back up, they required human intervention. We’d better fix that.”
  • “Sometimes when we do an etcd failover, the API server starts timing out requests until we restart it.”

After running these tests, we developed remediations for the issues we found: we improved monitoring, fixed configuration issues we’d discovered, and filed bugs with Kubernetes.

Making cron jobs easy to use

Let’s briefly explore how we made our Kubernetes-based system easy to use.

Our original goal was to design a system for running cron jobs that our team was confident operating and maintaining. Once we had established our confidence in Kubernetes, we needed to make it easy for our fellow engineers to configure and add new cron jobs. We developed a simple YAML configuration format so that our users didn’t need to understand anything about Kubernetes’ internals to use the system. Here’s the format we developed:

name: job-name-here
schedule: '15 */2 * * *'
command:
- ruby
- "/path/to/script.rb"
resources:
  requests:
    cpu: 0.1
    memory: 128M
  limits:
    memory: 1024M

We didn’t do anything very fancy here—we wrote a simple program to take this format and translate it into Kubernetes cron job configurations that we apply with kubectl.

We also wrote a test suite to ensure that job names aren’t too long (Kubernetes cron job names can’t be more than 52 characters) and that all names are unique. We don’t currently use cgroups to enforce memory limits on most of our jobs, but it’s something we plan to roll out in the future.
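The core of those checks is simple enough to sketch (a hypothetical helper, not our actual test suite; the 52-character limit and the uniqueness requirement are as described above):

```ruby
# Sketch of the config validations described above: Kubernetes cron job
# names can't exceed 52 characters, and every job name must be unique.
MAX_NAME_LENGTH = 52

def validate_job_names(names)
  errors = []
  names.each do |name|
    errors << "#{name}: longer than #{MAX_NAME_LENGTH} chars" if name.length > MAX_NAME_LENGTH
  end
  # Flag any name that appears more than once.
  duplicates = names.tally.select { |_, count| count > 1 }.keys
  duplicates.each { |name| errors << "#{name}: duplicate job name" }
  errors
end
```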

Our simple format was easy to use, and since we automatically generated both Chronos and Kubernetes cron job definitions from the same format, moving a job between the two systems was really easy. This was a key part of making our incremental migration work well. Whenever moving a job to Kubernetes caused issues, we could move it back with a simple three-line configuration change in less than ten minutes.

Monitoring Kubernetes

Monitoring our Kubernetes cluster’s internal state has proven to be very pleasant. We use the kube-state-metrics package for monitoring and a small Go program called veneur-prometheus to scrape the Prometheus metrics kube-state-metrics emits and publish them as statsd metrics to our monitoring system.

For example, here’s a chart of the number of pending pods in our cluster over the last hour. Pending means that they’re waiting to be assigned a worker node to run on. You can see that the number spikes at 11am, because a lot of our cron jobs run at the 0th minute of the hour.

An example chart showing pending pods in a cluster over the last hour

We also have a monitor that checks that no pods are stuck in the Pending state—we check that every pod starts running on a worker node within 5 minutes, or we otherwise receive an alert.

Future plans for Kubernetes

Setting up Kubernetes, getting to a place where we were comfortable running production code, and migrating all our cron jobs to the new cluster took us five months with three engineers working full time. One big reason we invested in learning Kubernetes is that we expect to be able to use it more widely at Stripe.

Here are some principles that apply to operating Kubernetes (or any other complex distributed system):

  • Define a clear business reason for your Kubernetes projects (and all infrastructure projects!). Understanding the business case and the needs of our users made our project significantly easier.
  • Aggressively cut scope. We decided to avoid using many of Kubernetes’ basic features to simplify our cluster. This let us ship more quickly—for example, since pod-to-pod networking wasn’t a requirement for our project, we could firewall off all network connections between nodes and defer thinking about network security in Kubernetes to a future project.
  • Invest a lot of time into learning how to properly operate a Kubernetes cluster. Test edge cases carefully. Distributed systems are extremely complicated and there’s a lot of potential for things to go wrong. Take the example we described earlier: the node controller can kill all pods in your cluster if they lose contact with API servers, depending on your configuration. Learning how Kubernetes behaves after each configuration change takes time and careful focus.

By staying focused on these principles, we’ve been able to use Kubernetes in production with confidence. We’ll continue to grow and evolve how we use Kubernetes over time—for example, we’re watching AWS’s release of EKS with interest. We’re finishing work on another system to train machine learning models and are also exploring moving some HTTP services to Kubernetes. As we continue operating Kubernetes in production, we plan to contribute to the open-source project along the way.

Like this post? Join the Stripe engineering team. View Openings

December 20, 2017

APIs as infrastructure: future-proofing Stripe with versioning

Brandur Leach on August 15, 2017 in Engineering

When it comes to APIs, change isn’t popular. While software developers are used to iterating quickly and often, API developers lose that flexibility as soon as even one user starts consuming their interface. Many of us are familiar with how the Unix operating system evolved. In 1994, The Unix-Haters Handbook was published containing a long list of missives about the software—everything from overly cryptic command names that were optimized for Teletype machines, to irreversible file deletion, to unintuitive programs with far too many options. Over twenty years later, an overwhelming majority of these complaints are still valid even across the dozens of modern derivatives. Unix had become so widely used that changing its behavior would have challenging implications. For better or worse, it established a contract with its users that defined how Unix interfaces behave.

Similarly, an API represents a contract for communication that can’t be changed without considerable cooperation and effort. Because so many businesses rely on Stripe as infrastructure, we’ve been thinking about these contracts since Stripe started. To date, we’ve maintained compatibility with every version of our API since the company’s inception in 2011. In this article, we’d like to share how we manage API versions at Stripe.

Code written to integrate with an API has certain inherent expectations built into it. If an endpoint returns a boolean field called verified to indicate the status of a bank account, a user might write code like this:

if bank_account[:verified]
  # ...
end

If we later replaced the bank account’s verified boolean with a status field that might include the value verified (like we did back in 2014), the code will break because it depends on a field that no longer exists. This type of change is backwards-incompatible, and we avoid making them. Fields that were present before should stay present, and fields should always preserve their same type and name. Not all changes are backwards-incompatible though; for example, it’s safe to add a new API endpoint, or a new field to an existing API endpoint that was never present before.
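To illustrate what’s at stake, here’s how an integration might have been written defensively against that 2014 change, accepting both shapes (a hypothetical sketch, not code from our documentation):

```ruby
# Hypothetical sketch: tolerate both the pre-2014 boolean `verified`
# field and the `status` string field that replaced it.
def bank_account_verified?(bank_account)
  if bank_account.key?(:status)
    bank_account[:status] == "verified"
  else
    !!bank_account[:verified]
  end
end
```

In practice, of course, the whole point of versioning is that users shouldn’t have to write code like this.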

With enough coordination, we might be able to keep users apprised of changes that we’re about to make and have them update their integrations, but even if that were possible, it wouldn’t be very user-friendly. Like a connected power grid or water supply, after hooking it up, an API should run without interruption for as long as possible.

Our mission at Stripe is to provide the economic infrastructure for the internet. Just like a power company shouldn’t change its voltage every two years, we believe that our users should be able to trust that a web API will be as stable as possible.

API versioning schemes

A common approach to allow forward progress in web APIs is to use versioning. Users specify a version when they make requests and API providers can make the changes they want for their next version while maintaining compatibility in the current one. As new versions are released, users can upgrade when it’s convenient for them.

This is often seen as a major versioning scheme with names like v1, v2, and v3 that are passed as a prefix to a URL (like /v1/widgets) or through an HTTP header like Accept. This can work, but it has the major downside that changes between versions are so big and so impactful for users that upgrading is almost as painful as re-integrating from scratch. It’s also not a clear win because a class of users will be unwilling or unable to upgrade and get trapped on old API versions. Providers then have to make the difficult choice between retiring API versions and, by extension, cutting those users off, or maintaining the old versions forever at considerable cost. While having providers maintain old versions might seem at first glance to be beneficial to users, they’re also paying indirectly in the form of reduced progress on improvements. Instead of working on new features, engineering time is diverted to maintaining old code.

At Stripe, we implement versioning with rolling versions that are named with the date they’re released (for example, 2017-05-24). Although backwards-incompatible, each one contains a small set of changes that make incremental upgrades relatively easy so that integrations can stay current.

The first time a user makes an API request, their account is automatically pinned to the most recent version available, and from then on, every API call they make is assigned that version implicitly. This approach guarantees that users don’t accidentally receive a breaking change and makes initial integration less painful by reducing the amount of necessary configuration. Users can override the version of any single request by manually setting the Stripe-Version header, or upgrade their account’s pinned version from Stripe’s dashboard.

Some readers might have already noticed that the Stripe API also defines major versions using a prefixed path (like /v1/charges). Although we reserve the right to make use of this at some point, it’s not likely to change for some time. As noted above, major version changes tend to make upgrades painful, and it’s hard for us to imagine an API redesign that’s important enough to justify this level of user impact. Our current approach has been sufficient for almost a hundred backwards-incompatible upgrades over the past six years.

Versioning under the hood

Versioning is always a compromise between improving developer experience and the additional burden of maintaining old versions. We strive to achieve the former while minimizing the cost of the latter, and have implemented a versioning system to help us with it. Let’s take a quick look at how it works. Every possible response from the Stripe API is codified by a class that we call an API resource. API resources define their possible fields using a DSL:

class ChargeAPIResource
  required :id, String
  required :amount, Integer
end

API resources are written so that the structure they describe is what we’d expect back from the current version of the API. When we need to make a backwards-incompatible change, we encapsulate it in a version change module which defines documentation about the change, a transformation, and the set of API resource types that are eligible to be modified:

class CollapseEventRequest < AbstractVersionChange
  description \
    "Event objects (and webhooks) will now render a " \
    "`request` subobject that contains a request ID " \
    "and idempotency key instead of just a string " \
    "request ID."

  response EventAPIResource do
    change :request, type_old: String, type_new: Hash

    run do |data|
      data.merge(:request => data[:request][:id])
    end
  end
end

Elsewhere, version changes are assigned to a corresponding API version in a master list:

class VersionChanges
  VERSIONS = {
    '2017-05-25' => [...],
    '2017-04-06' => [Change::LegacyTransfers],
    '2017-02-14' => [...],
    '2017-01-27' => [Change::SourcedTransfersOnBts],
  }
end

Version changes are written so that they expect to be automatically applied backwards from the current API version and in order. Each version change assumes that although newer changes may exist in front of them, the data they receive will look the same as when they were originally written.

When generating a response, the API initially formats data by describing an API resource at the current version, then determines a target API version from one of:

  • A Stripe-Version header if one was supplied.
  • The version of an authorized OAuth application if the request is made on the user’s behalf.
  • The user’s pinned version, which is set on their very first request to Stripe.

It then walks back through time and applies each version change module that it finds along the way until the target version is reached.
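Mechanically, that backwards walk can be sketched like this (a simplified model with hypothetical names; the real changes are classes like the ones above, not lambdas, and the transformation here mirrors CollapseEventRequest):

```ruby
# Hypothetical sketch: changes are keyed by the date-named version that
# introduced them. Because versions are dates, string sort order is
# chronological order.
VERSIONS = {
  "2017-05-25" => [->(data) { data.merge(request: data[:request][:id]) }],
  "2017-04-06" => [],
}

# Starting from the newest version, apply every change released *after*
# the target version, transforming the response backwards in time.
def render_at(data, target_version)
  VERSIONS.keys.sort.reverse.each do |version|
    break if version <= target_version # target reached; stop unwinding
    VERSIONS[version].each { |change| data = change.call(data) }
  end
  data
end
```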

Requests are processed by version change modules before returning a response.

Version change modules keep older API versions abstracted out of core code paths. Developers can largely avoid thinking about them while they’re building new products.

Changes with side effects

Most of our backwards-incompatible API changes will modify a response, but that’s not always the case. Sometimes a more complicated change is necessary, one that leaks out of the module that defines it. We assign these modules a has_side_effects annotation, and the transformation they define becomes a no-op:

class LegacyTransfers < AbstractVersionChange
  description "..."
  has_side_effects
end

Elsewhere in the code a check will be made to see whether they’re active:
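As a sketch of what such a check can look like (hypothetical code; the date comes from the master list above, and date-named versions compare chronologically as plain strings):

```ruby
# Hypothetical sketch: a side-effect change is "active" for any request
# pinned to an API version older than the version that introduced it.
class LegacyTransfers
  RELEASED = "2017-04-06" # per the master list of version changes

  def self.active?(api_version)
    # Date-named versions sort chronologically as strings.
    api_version < RELEASED
  end
end
```

Core code paths can then branch on a predicate like LegacyTransfers.active?(user.api_version) wherever the legacy behavior matters.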

This reduced encapsulation makes changes with side effects more complex to maintain, so we try to avoid them.

Declarative changes

One advantage of self-contained version change modules is that they can declare documentation describing what fields and resources they affect. We can also reuse this to rapidly provide more helpful information to our users. For example, our API changelog is programmatically generated and receives updates as soon as our services are deployed with a new version.

We also tailor our API reference documentation to specific users. It notices who is logged in and annotates fields based on their account API version. Here, we’re warning the developer that there’s been a backwards-incompatible change in the API since their pinned version. The request field of Event was previously a string, but is now a subobject that also contains an idempotency key (produced by the version change that we showed above):

Screenshot of a tooltip in the Stripe API documentation indicating API changes made since the user’s current version

Our documentation detects the user’s API version and presents relevant warnings.

Minimizing change

Providing extensive backwards compatibility isn’t free; every new version is more code to understand and maintain. We try to keep what we write as clean as possible but, given enough time, dozens of checks on version changes that can’t be encapsulated cleanly will be littered throughout the project, making it slower, less readable, and more brittle. We take a few measures to try and avoid incurring this sort of expensive technical debt.

Even with our versioning system available, we do as much as we can to avoid using it by trying to get the design of our APIs right the first time. Outgoing changes are funneled through a lightweight API review process where they’re written up in a brief supporting document and submitted to a mailing list. This gives each proposed change broader visibility throughout the company, and improves the likelihood that we’ll catch errors and inconsistencies before they’re released.

We try to be mindful of balancing stagnation and leverage. Maintaining compatibility is important, but even so, we expect to eventually start retiring our older API versions. Helping users move to newer versions of the API gives them access to new features, and simplifies the foundation that we use to build new features.

Principles of change

The combination of rolling versions and an internal framework to support them has enabled us to onboard vast numbers of users and make enormous changes to our API—all while having minimal impact on existing integrations. The approach is driven by a few principles that we’ve picked up over the years. We think it’s important that API upgrades are:

  • Lightweight. Make upgrades as cheap as possible (for users and for ourselves).
  • First-class. Make versioning a first-class concept in your API so that it can be used to keep documentation and tooling accurate and up-to-date, and to generate a changelog automatically.
  • Fixed-cost. Ensure that old versions add only minimal maintenance cost by tightly encapsulating them in version change modules. Put another way, the less thought that needs to be applied towards old behavior while writing new code, the better.

While we’re excited by the debate and developments around REST vs. GraphQL vs. gRPC, and—more broadly—what the future of web APIs will look like, we expect to continue supporting versioning schemes for a long time to come.


August 15, 2017

Connect: behind the front-end experience

Benjamin De Cock on June 19, 2017 in Engineering

We recently released a new and improved version of Connect, our suite of tools designed for platforms and marketplaces. Stripe’s design team works hard to create unique landing pages that tell a story for our major products. For this release, we designed Connect’s landing page to reflect its intricate, cutting-edge capabilities while keeping things light and simple on the surface.

In this blog post, we’ll describe how we used several next-generation web technologies to bring Connect to life, and walk through some of the finer technical details (and excitement!) on our front-end journey.

CSS Grid Layout

Earlier this year, three major browsers (Firefox, Chrome, and Safari) almost simultaneously shipped their implementation of the new CSS Grid Layout module. This specification provides authors with a two-dimensional layout system that is easy to use and incredibly powerful. Connect’s landing page relies on CSS grids pretty much everywhere, making some seemingly tricky designs almost trivial to achieve. As an example, let’s hide the header’s content and focus on its background:

Historically, we’ve created these background stripes (as we obviously call them) by using absolute positioning to precisely place each stripe on the page. This approach works, but fragile positioning often results in subtle issues: for example, rounding errors can cause a 1px gap between two stripes. CSS stylesheets also quickly become verbose and hard to maintain, since media queries need to be more complex to account for background differences at various viewport sizes.

With CSS Grid, pretty much all our previous issues go away. We simply define a flexible grid and place the stripes in their appropriate cells. Firefox has a handy grid inspector allowing you to visualize the structure of your layout. Let’s see how it looks:

We’ve highlighted three stripes and removed the tilt effect to make things easier to understand. Here’s what the CSS for our grid looks like:

header .stripes {
  display: grid;
  grid: repeat(5, 200px) / repeat(10, 1fr);
}

header .stripes :nth-child(1) {
  grid-column: span 3;
}

header .stripes :nth-child(2) {
  grid-area: 3 / span 3 / auto / -1;
}

header .stripes :nth-child(3) {
  grid-row: 4;
  grid-column: span 5;
}

We can then just transform the entire .stripes container to produce the tilted background:

header .stripes {
  transform: skewY(-12deg);
  transform-origin: 0;
}

And voilà! CSS Grid might look intimidating at first sight as it comes with an unusual syntax and many new properties and values, but the mental model is actually very simple. And if you’re used to Flexbox, you’re already familiar with the Box Alignment module, which means you can reuse the properties you know and love, such as justify-content and align-items.

CSS 3D
The landing page’s header displays several cubes as a visual metaphor for the building blocks that compose Connect. These floating cubes rotate in 3D at random speeds (within a certain range) and benefit from the same light source, which dynamically illuminates the appropriate faces:

These cubes are simple DOM elements that are generated and animated in JavaScript. Each of them instantiates the same HTML template:

<!-- HTML -->
<template id="cube-template">
  <div class="cube">
    <div class="shadow"></div>
    <div class="sides">
      <div class="back"></div>
      <div class="top"></div>
      <div class="left"></div>
      <div class="front"></div>
      <div class="right"></div>
      <div class="bottom"></div>
    </div>
  </div>
</template>

// JavaScript
const createCube = () => {
  const template = document.getElementById("cube-template");
  const fragment = document.importNode(template.content, true);
  return fragment;
};

Pretty straightforward. We can now easily turn these blank and empty elements into a three-dimensional shape. Thanks to 3D transforms, adding perspective and moving the sides along the z-axis is fairly natural:

.cube, .cube * {
  position: absolute;
  width: 100px;
  height: 100px;
}

.sides {
  transform-style: preserve-3d;
  perspective: 600px;
}

.front  { transform: rotateY(0deg)    translateZ(50px) }
.back   { transform: rotateY(-180deg) translateZ(50px) }
.left   { transform: rotateY(-90deg)  translateZ(50px) }
.right  { transform: rotateY(90deg)   translateZ(50px) }
.top    { transform: rotateX(90deg)   translateZ(50px) }
.bottom { transform: rotateX(-90deg)  translateZ(50px) }

While CSS makes it trivial to model the cube, it doesn’t provide advanced animation features like dynamic shading. The cube’s animation instead relies on requestAnimationFrame to calculate and update each side at any point in the rotation. There are three things to determine on every frame:

  • Visibility. There are never more than three faces visible at the same time, so we can avoid any computations and expensive repaints for hidden sides.
  • Transformation. Each visible side of the cube needs to be transformed based on its initial rotation, current animation state, and the speed of each axis.
  • Shading. While CSS lets you position elements in a three-dimensional space, there are no traditional concepts from 3D environments (e.g. light sources). In order to mimic a 3D environment, we can render a light source by progressively darkening the sides of the cube as they move away from a particular point.

There are other considerations to take into account (such as improving performance using requestIdleCallback in JavaScript and backface-visibility in CSS), but these are the main pillars behind the logic of the animation.

We can calculate the visibility and transformation of each side by continually tracking their state and updating them with basic math operations. With the help of pure functions and ES2015 features such as template literals, things become even easier. Here are two short excerpts of JavaScript code to compute and define the current transformation:

const getDistance = (state, rotate) =>
  ["x", "y"].reduce((object, axis) => {
    object[axis] = Math.abs(state[axis] + rotate[axis]);
    return object;
  }, {});

const getRotation = (state, size, rotate) => {
  const axis = rotate.x ? "Z" : "Y";
  const direction = rotate.x > 0 ? -1 : 1;

  return `
    rotateX(${state.x + rotate.x}deg)
    rotate${axis}(${direction * (state.y + rotate.y)}deg)
    translateZ(${size / 2}px)
  `;
};

The most challenging piece of the puzzle is how to properly calculate shading for each face of the cube. In order to simulate a virtual light source at the center of the stage, we can gradually increase each face’s lighting effect as they approach the center point—on all axes. Concretely, that means we need to calculate the luminosity and color for each face. We’ll perform this calculation on every frame by interpolating the base color and the current shading factor.

// Linear interpolation between a and b
// Example: (100, 200, .5) = 150
const interpolate = (a, b, i) => a * (1 - i) + b * i;

const getShading = (tint, rotate, distance) => {
  const darken = ["x", "y"].reduce((object, axis) => {
    const delta = distance[axis];
    const ratio = delta / 180;
    object[axis] = delta > 180 ? Math.abs(2 - ratio) : ratio;
    return object;
  }, {});

  if (rotate.x)
    darken.y = 0;
  else {
    const {x} = distance;
    if (x > 90 && x < 270)
      ["x", "y"].forEach(axis => darken[axis] = 1 - darken[axis]);
  }

  const alpha = (darken.x + darken.y) / 2;
  const blend = (value, index) =>
    Math.round(interpolate(value, tint.shading[index], alpha));

  const [r, g, b] = tint.color.map(blend);
  return `rgb(${r}, ${g}, ${b})`;
};

Phew! The rest of the code is fortunately far less hairy and mostly composed of boilerplate code, DOM helpers and other elementary abstractions. One last detail that’s worth mentioning is the technique used to make the animations less obtrusive depending on the user’s preferences:

On macOS, when Reduce Motion is enabled in System Preferences, the new prefers-reduced-motion media query will be triggered (only in Safari for now), and all decorative animations on the page will be disabled. The cubes use both CSS animations to fade in and JavaScript animations to rotate. We can cancel these animations with a combination of a @media block and the MediaQueryList Interface:

/* CSS */
@media (prefers-reduced-motion) {
  #header-hero * {
    animation: none;
  }
}

// JavaScript
const reduceMotion = matchMedia("(prefers-reduced-motion)").matches;
const tick = () => {
  if (reduceMotion) return;
  // ...otherwise update the rotation and schedule the next frame
};

More CSS 3D!

We use custom 3D-rendered devices across the site to showcase Stripe customers and apps in situ. In our never-ending quest to reduce file sizes and loading time, we considered a few options to achieve a soft three-dimensional look and feel with lightweight and resolution-independent assets. Drawing the devices directly in CSS fulfilled our objectives. Here’s the CSS laptop:

Defining the object in CSS is obviously less convenient than exporting a bitmap, but it’s worth the effort. The laptop above weighs less than 1KB and is easy to tweak. We can add hardware acceleration, animate any part, make it responsive without losing image quality, and precisely position DOM elements (e.g. other images) within the laptop’s display. This flexibility doesn’t mean giving up on clean code—the markup stays clear, concise and descriptive:

<div class="laptop">
  <span class="shadow"></span>
  <span class="lid"></span>
  <span class="camera"></span>
  <span class="screen"></span>
  <span class="chassis">
    <span class="keyboard"></span>
    <span class="trackpad"></span>
  </span>
</div>

Styling the laptop involves a mix of gradients, shadows and transforms. In many ways, it’s a simple translation of the workflow and concepts you know and use in your graphic tools. For example, here’s the CSS code for the lid:

.laptop .lid {
  position: absolute;
  width: 100%;
  height: 100%;
  border-radius: 20px;
  background: linear-gradient(45deg, #E5EBF2, #F3F8FB);
  box-shadow: inset 1px -4px 6px rgba(145, 161, 181, .3);
}

Choosing the right tool for the job isn’t always obvious—between CSS, SVG, Canvas, WebGL, and images, the choice isn’t as clear as it used to be. It’s easy to dismiss CSS as something exclusively meant for presenting documents, but it’s just as easy to go overboard and abuse its visual capabilities. No matter the technology you choose, optimize for the user! This means paying close attention to client-side performance, accessibility needs, and fallback options for older browsers.

Web Animations API

The Onboarding & Verification section showcases a demo of Express, Connect’s new user onboarding flow. The whole animation is built in code and relies for the most part on the new Web Animations API.

The Web Animations API provides the performance and simplicity of CSS @keyframes in JavaScript, making it easy to create smooth, chainable animation sequences. As opposed to the requestAnimationFrame low-level API, you get all the niceties of CSS animations for free, such as native support for cubic-bezier easing functions. As an example, let’s take a look at the code for our keyboard sliding animation:

const toggleKeyboard = (element, callback, action) => {
  const keyframes = {
    transform: [100, 0].map(n => `translateY(${n}%)`)
  };

  const options = {
    duration: 800,
    fill: "forwards",
    easing: "cubic-bezier(.2, 1, .2, 1)",
    direction: action == "hide" ? "reverse" : "normal"
  };

  const animation = element.animate(keyframes, options);
  animation.addEventListener("finish", callback, {once: true});
};

Nice and simple! The Web Animations API covers the vast majority of typical UI animation needs without requiring a third-party dependency (as a result, the whole Express animation is about 5KB all included: scripts, images, etc.). That being said, it is not an outright replacement for requestAnimationFrame, which still provides finer control over your animations and allows you to create effects that are otherwise impossible, such as spring curves and independent transform functions. If you’re not sure about the right technology to use for your animations, you can prioritize your options like this:

  1. CSS transitions. This is the fastest, easiest, and most efficient way to animate. For simple things like hover effects, this is the way to go.
  2. CSS animations. These have the same performance characteristics as CSS transitions: they’re declarative animations that can be highly optimized by the browsers and run on a separate thread. CSS animations are more powerful than transitions and allow for multiple steps and multiple iterations. They’re also more intricate to implement as they require named @keyframes declaration and often need an explicit animation-fill-mode. (And naming things is always one of the hardest things in computer science!)
  3. Web Animations API. This API offers almost the same performance as CSS animations (these animations are driven by the same engine, but JavaScript code will still run on the main thread) and nearly the same ease of use. This should be your default choice for any animation where you need interactivity, random effects, chainable sequences, and anything richer than a purely declarative animation.
  4. requestAnimationFrame. The sky is the limit, but you have to engineer the rocket ship. The possibilities are endless and the rendering methods unlimited (HTML, SVG, canvas—you name it), but it’s a lot more complicated to use and may not perform as well as the previous options.

No matter the technique you use, there are a few simple tips you can apply everywhere to make your animations look significantly better:

  • Custom curves. You almost never want to use a built-in timing-function like ease-in, ease-out and linear. A nice time-saver is to globally define a number of custom cubic-bezier variables.
  • Performance. Avoid jank in your animations at all costs. In CSS, this means exclusively animating cheap properties (transform and opacity) and offloading animations to the GPU when you can (using will-change).
  • Speed. Animations should never get in the way. The very goal of animations is to make a UI feel responsive, harmonious, enjoyable and polished. There’s no hard limit on the exact animation duration as it depends on the effect and the curve, but in most cases you’ll want to stay under 500 milliseconds.
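Putting the first and second tips together, globally defined custom curves might look like this (the variable names and control points are illustrative, not the ones used on Connect’s landing page):

```css
:root {
  /* Illustrative custom curves, reusable across the whole page */
  --ease-out-expo: cubic-bezier(0.16, 1, 0.3, 1);
  --ease-in-out-smooth: cubic-bezier(0.45, 0, 0.55, 1);
}

.card {
  /* Animate only cheap properties, and hint the GPU ahead of time */
  transition: transform 300ms var(--ease-out-expo);
  will-change: transform;
}
```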

Intersection Observer

The Express animation starts playing automatically as soon as it’s visible in the viewport (you can try it by scrolling the page). This is usually accomplished by observing scroll movements to trigger some callback, but historically this meant adding expensive event listeners, resulting in verbose and inefficient code.

Connect’s landing page uses the new Intersection Observer API which provides a much more robust and performant way to detect the visibility of an element. Here’s how we start playing the Express animation:

const observeScroll = (element, callback) => {
  const observer = new IntersectionObserver(([entry]) => {
    if (entry.intersectionRatio < 1) return;
    callback();

    // Stop watching the element
    observer.disconnect();
  }, {
    threshold: 1
  });

  // Start watching the element
  observer.observe(element);
};

const element = document.getElementById("express-animation");
observeScroll(element, startAnimation);

The observeScroll helper simplifies our detection behavior (i.e. when an element is fully visible, the callback is triggered once) while letting the browser perform the visibility checks off the main thread. Thanks to the Intersection Observer API, we’re now one step closer to buttery-smooth web pages!

Polyfills and fallbacks

All these new and shiny APIs are exciting, but they’re unfortunately not yet available everywhere. The common workaround is to use polyfills that feature-test for a particular API and execute only if the API is missing. The obvious downside to this approach is that it penalizes everyone, forever, by forcing them to download the polyfill regardless of whether it’s used. We decided on a different solution:

For JavaScript APIs, Connect’s landing page feature-tests whether a polyfill is necessary and can dynamically insert it in the page. Scripts that are dynamically created and added to the document are asynchronous by default, which means the order of execution isn’t guaranteed. That’s obviously a problem, as a given script may execute before an expected polyfill. Thankfully, we can fix that by explicitly marking our scripts as not asynchronous and therefore lazy-load only what’s required:

const insert = name => {
  const el = document.createElement("script");
  el.src = `${name}.js`;
  el.async = false; // Keep the execution order
  document.head.appendChild(el);
};

// Polyfill filenames below are illustrative
const scripts = ["main"];

if (!Element.prototype.animate) scripts.unshift("web-animations-polyfill");

if (!("IntersectionObserver" in window)) scripts.unshift("intersection-observer-polyfill");

scripts.forEach(insert);


For CSS, the problem and solution are pretty much the same as for JavaScript polyfills. The typical way to use modern CSS features is to write the fallback first and override it when possible:

div { display: flex }

@supports (display: grid) {
  div { display: grid }
}
CSS feature queries are easy, reliable, and they should likely be your default choice. However, they weren’t suited to our audience since close to 90% of our visitors already use a Grid-friendly browser (❤️). In our case, it didn’t make sense to penalize the overwhelming majority of our users with hundreds of fallback rules for a small and decreasing percentage of browsers. Given these statistics, we chose to dynamically create and insert a fallback stylesheet when needed:

// Some browsers not supporting Grid don’t support CSS.supports
// either, so we need to feature-test it the old-fashioned way:
if (!("grid" in document.body.style)) {
  const fallback = "<link rel=stylesheet href=fallback.css>";
  document.head.insertAdjacentHTML("beforeend", fallback);
}
That’s a wrap!

We hope you enjoyed (and maybe even learned) some of these front-end tips! Modern browsers provide us with powerful tools to create rich, fast and engaging experiences, letting our creativity shine on the web. If you’re as excited as we are about the possibilities, we should probably experiment with them together.

Like this post? Join the Stripe design team.

June 19, 2017