
Stripe’s payments APIs: the first ten years

Michelle Bu on December 15, 2020 in Engineering

A few years ago, Bloomberg Businessweek published a feature story on Stripe. Four words spanned the center of the cover: “seven lines of code,” suggesting that’s all it took for a business to power payments on Stripe. The assertion was bold—and became a theme and meme for us.

To this day, it’s not entirely clear which seven lines the article referenced. The prevailing theory is that it’s the roughly seven lines of curl it took to create a Charge. However, a search for the seven lines of code ultimately misses the point: the ability to open up a terminal, run this curl snippet, then immediately see a successful credit card payment felt like seven lines of code. It’s unlikely that a developer believed a production-ready payments integration involved literally only seven lines of code. But taking something as complex as credit card processing and reducing the integration to only a few lines of code that, when run, immediately returns a successful Charge object is really quite magical.

Abstracting away the complexity of payments has driven the evolution of our APIs over the last decade. This post provides the context, inflection points, and conceptual frameworks behind our API design. It’s the extreme exception that our approach to APIs makes the cover of a business magazine. This post shares a bit more of how we’ve grown around and beyond those seven lines.

A condensed history of Stripe’s payments APIs

Successful products tend to organically expand over time, resulting in product debt. Similar to tech debt, product debt accumulates gradually, making the product harder to understand for users and change for product teams. For API products, it’s particularly tempting to accrue product debt because it’s hard to get your users to fundamentally restructure their integration; it’s much easier to get them to add a parameter or two to their existing API requests.

In retrospect, we see clearly how our APIs have evolved—and which decisions were pivotal in shaping them. Here are the milestones that defined our payments APIs and led to the PaymentIntents API.


To design and develop an interactive globe

Nick Jones on September 1, 2020 in Engineering

As humans, we’re driven to build models of our world.

A traditional globemaker molds a sphere, mounts it on an axle, balances it with hidden weights, and precisely applies gores—triangular strips of printed earth—to avoid overlap and align latitudes. Cartographers face unenviable trade-offs when making maps. They can either retain the shape of countries, but warp their size—or maintain the size of countries, but contort their shape. In preserving one aspect of our world, they distort another.


Stripe's remote engineering hub, one year in

Jay Shirley on May 28, 2020 in Engineering

Last May, Stripe launched our remote engineering hub, a virtual office coequal with our physical engineering offices in San Francisco, Seattle, Dublin, and Singapore. We set out to hire 100 new remote engineers over the year—and did. They now work across every engineering group at Stripe. Over the last year, we’ve tripled the number of permanently remote engineers, up to 22% of our engineering population. We also hired more remote employees across all other teams, and tripled the number of remote Stripes across the company.

Like many organizations, Stripe has temporarily become fully remote to support our employees and customers during the COVID-19 pandemic. Distributed work isn’t new to Stripe. We’ve had remote employees since inception—and formally began hiring remote engineers in 2013. But as we grew, we developed a heavily office-centric organizational structure. Last year, we set out to rebalance our mix of remote and centralized working by establishing our virtual hub. It’s now the backbone of a new working model for the whole company.

We think our experience might be interesting, particularly for businesses that haven’t been fully distributed from the start or are considering flipping the switch to being fully remote, even after the pandemic. We’ve seen promising gains in how we communicate, build more resilient and relevant products, and reach and retain talented engineers. Here is what we learned.


Stripe CLI

Tomer Elmalem on November 5, 2019 in Engineering

Building and testing a Stripe integration can require frequent switching between the terminal, your code editor, and the Dashboard. Today, we’re excited to launch the Stripe command-line interface (CLI). It lets you interact with Stripe right from the terminal and makes it easier to build, test, and manage your integration.


Designing accessible color systems

Daryl Koopersmith and Wilson Miner on October 15, 2019 in Engineering

Color contrast is an important aspect of accessibility. Good contrast makes it easier for people with visual impairments to use products, and helps in imperfect conditions like low-light environments or older screens. With this in mind, we recently updated the colors in our user interfaces to be more accessible. Text and icon colors now reliably have legible contrast throughout the Stripe Dashboard and all other products built with our internal interface library.


Fast and flexible observability with canonical log lines

Brandur Leach on July 30, 2019 in Engineering

We’ve found using a slight augmentation to traditional logging immensely useful at Stripe—an idea that we call canonical log lines. It’s quite a simple technique: in addition to their normal log traces, requests also emit one long log line at the end that includes many of their key characteristics. Having that data colocated in single information-dense lines makes queries and aggregations over it faster to write, and faster to run.
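As a rough illustration of the idea, the sketch below collects a request's key attributes while it is handled and writes them out as one line at the end. The class name, field names, and the `canonical-log-line` prefix are assumptions made for this example, not Stripe's actual implementation.

```python
import sys
import time


class CanonicalLogLine:
    """Collects key request attributes and emits them as one key=value line."""

    def __init__(self, stream):
        self.stream = stream
        self.fields = {}

    def set(self, key, value):
        self.fields[key] = value

    def emit(self):
        # One information-dense line per request, written once at the end.
        pairs = " ".join(f"{k}={v}" for k, v in sorted(self.fields.items()))
        self.stream.write(f"canonical-log-line {pairs}\n")


def handle_request(method, path, stream=sys.stdout):
    log = CanonicalLogLine(stream)
    started = time.monotonic()
    log.set("http_method", method)
    log.set("http_path", path)
    try:
        # Normal request handling (and its normal log traces) would go here.
        log.set("http_status", 200)
    finally:
        # The canonical line is emitted even if handling raised an error.
        log.set("duration", f"{time.monotonic() - started:.3f}")
        log.emit()


if __name__ == "__main__":
    handle_request("GET", "/v1/charges")
```

Because every attribute lands on the same line, a query such as "all requests with a given status and path" only has to scan one line per request.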


The secret life of DNS packets: investigating complex networks

Jeff Jo on May 21, 2019 in Engineering

DNS is a critical piece of infrastructure used to facilitate communication across networks. It’s often described as a phonebook: in its most basic form, DNS provides a way to look up a host’s address by an easy-to-remember name. For example, looking up one of Stripe’s domain names directs clients to the IP address where one of Stripe’s servers is located. Before any communication can take place, one of the first things a host must do is query a DNS server for the address of the destination host. Since these lookups are a prerequisite for communication, maintaining a reliable DNS service is extremely important. DNS issues can quickly lead to crippling, widespread outages, and you could find yourself in a real bind.

It’s important to establish good observability practices for these systems so when things go wrong, you can clearly understand how they’re failing and act quickly to minimize any impact. Well-instrumented systems provide visibility into how they operate; establishing a monitoring system and gathering robust metrics are both essential to effectively respond to incidents. This is critical for post-incident analysis when you’re trying to understand the root cause and prevent recurrences in the future.

In this post, I’ll describe how we monitor our DNS systems and how we used an array of tools to investigate and fix an unexpected spike in DNS errors that we encountered recently.

DNS infrastructure at Stripe

At Stripe, we operate a cluster of DNS servers running Unbound, a popular open-source DNS resolver that can recursively resolve DNS queries and cache the results. These resolvers are configured to forward DNS queries to different upstream destinations based on the domain in the request. Queries that are used for service discovery are forwarded to our Consul cluster. Queries for domains we configure in Route 53 and any other domains on the public Internet are forwarded to our cluster’s VPC resolver, which is a DNS resolver that AWS provides as part of their VPC offering. We also run resolvers locally on every host, which provides an additional layer of caching.

Unbound runs locally on every host as well as on the DNS servers.

Unbound exposes an extensive set of statistics that we collect and feed into our metrics pipeline. This provides us with visibility into metrics like how many queries are being served, the types of queries, and cache hit ratios.

We recently observed that for several minutes every hour, the cluster’s DNS servers were returning SERVFAIL responses for a small percentage of internal requests. SERVFAIL is a generic response that DNS servers return when an error occurs, but it doesn’t tell us much about what caused the error.

Without much to go on initially, we found another clue in the request list depth metric. (You can think of this as Unbound’s internal todo list, where it keeps track of all the DNS requests it needs to resolve.)

An increase in this metric indicates that Unbound is unable to process messages in a timely fashion, which may be caused by an increase in load. However, the metrics didn’t show a significant increase in the number of DNS queries, and resource consumption didn’t appear to be hitting any limits. Since Unbound resolves queries by contacting external nameservers, another explanation could be that these upstream servers were taking longer to respond.

Tracking down the source

We followed this lead by logging into one of the DNS servers and inspecting Unbound’s request list.

$ unbound-control dump_requestlist
thread #0
#   type cl name    seconds    module status
  0    A IN - iterator wait for
  1  PTR IN - iterator wait for
  2  PTR IN - iterator wait for
  3  PTR IN - iterator wait for
  4  PTR IN - iterator wait for
  5  PTR IN - iterator wait for
  6  PTR IN - iterator wait for
  7  PTR IN - iterator wait for
  8  PTR IN - iterator wait for
  9  PTR IN - iterator wait for
 10  PTR IN - iterator wait for

This confirmed that requests were accumulating in the request list. We noticed some interesting details: most of the entries in the list corresponded to reverse DNS lookups (PTR records), and they were all waiting for a response from the IP address of the VPC resolver.

We then used tcpdump to capture the DNS traffic on one of the servers to get a better sense of what was happening and try to identify any patterns. We wanted to make sure we captured the traffic during one of these spikes, so we configured tcpdump to write data to files over a period of time. We split the files across 60 second collection intervals to keep file sizes small, which made it easier to work with them.

# Capture all traffic on port 53 (DNS traffic)
# Write data to files in 60 second intervals for 30 minutes
# and format the filenames with the current time
$ tcpdump -n -tt -i any -W 30 -G 60 -w '%FT%T.pcap' port 53

The packet captures revealed that during the hourly spike, 90% of requests made to the VPC resolver were reverse DNS queries for IPs in a single CIDR range. The vast majority of these queries failed with a SERVFAIL response. We used dig to query the VPC resolver with a few of these addresses and confirmed that it took longer to receive responses.

By looking at the source IPs of clients making the reverse DNS queries, we noticed they were all coming from hosts in our Hadoop cluster. We maintain a database of when Hadoop jobs start and finish, so we were able to correlate these times to the hourly spikes. We finally narrowed down the source of the traffic to one job that analyzes network activity logs and performs a reverse DNS lookup on the IP addresses found in those logs.

One more surprising detail we discovered in the tcpdump data was that the VPC resolver was not sending back responses to many of the queries. During one of the 60-second collection periods, the DNS server sent 257,430 packets to the VPC resolver. The VPC resolver replied with only 61,385 packets, which averages to 1,023 packets per second. We realized we might be hitting the AWS limit for how much traffic can be sent to a VPC resolver, which is 1,024 packets per second per interface. Our next step was to establish better visibility in our cluster to validate our hypothesis.
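The arithmetic behind that hypothesis can be checked directly from the figures quoted above. The snippet below is only a worked calculation; the variable names are ours.

```python
CAPTURE_SECONDS = 60
PACKETS_SENT = 257_430      # DNS server -> VPC resolver
PACKETS_RECEIVED = 61_385   # VPC resolver -> DNS server
AWS_LIMIT_PPS = 1_024       # packets per second, per network interface

sent_pps = PACKETS_SENT / CAPTURE_SECONDS
received_pps = PACKETS_RECEIVED / CAPTURE_SECONDS
answered_fraction = PACKETS_RECEIVED / PACKETS_SENT

print(f"sent:     {sent_pps:.1f} packets/s")
print(f"received: {received_pps:.1f} packets/s (limit {AWS_LIMIT_PPS})")
print(f"answered: {answered_fraction:.1%} of packets sent")
```

The server was sending roughly four times the per-interface limit, while the reply rate sat just under 1,024 packets per second, which is what a hard cap on the resolver side would look like.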

Counting packets

AWS exposes its VPC resolver through a static IP address: the base IP of the VPC, plus two. We need to track the number of packets sent per second to this IP address. One tool that can help us here is iptables, since it keeps track of the number of packets matched by a rule.
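To make the "base plus two" rule concrete, here is a small sketch using Python's standard `ipaddress` module. The CIDR blocks are hypothetical examples, not Stripe's actual VPC ranges.

```python
import ipaddress


def vpc_resolver_address(vpc_cidr: str):
    """Return the VPC resolver address: the network base address plus two."""
    network = ipaddress.ip_network(vpc_cidr)
    return network.network_address + 2


# Hypothetical VPC ranges, used only for illustration.
for cidr in ("10.0.0.0/16", "172.31.0.0/16"):
    print(cidr, "->", vpc_resolver_address(cidr))
```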

We created a rule that matches traffic headed to the VPC resolver IP address and added it to the OUTPUT chain, which is a set of iptables rules that are applied to all packets sent from the host.

# Create a new chain called VPC_RESOLVER
$ iptables -N VPC_RESOLVER

# Match packets destined to VPC resolver and jump to the new chain
# (set VPC_RESOLVER_IP to the resolver's address first)
$ iptables -A OUTPUT -d "$VPC_RESOLVER_IP" -j VPC_RESOLVER

# Add an empty rule to the new chain to help parse the output
$ iptables -A VPC_RESOLVER

We configured the rule to jump to a new chain called VPC_RESOLVER and added an empty rule to that chain. Since our hosts could contain other rules in the OUTPUT chain, we added this rule to isolate matches and make it a little easier to parse the output.

Listing the rules, we see the number of packets sent to the VPC resolver in the output:

$ iptables -L -v -n -x

Chain OUTPUT (policy ACCEPT 41023 packets, 2569001 bytes)
  pkts   bytes target     prot opt in     out     source               destination
 41023 2569001 VPC_RESOLVER  all  --  *      *  

Chain VPC_RESOLVER (1 references)
  pkts   bytes target     prot opt in     out     source               destination
 41023 2569001            all  --  *      *  

With this, we wrote a simple service that reads the statistics from the VPC_RESOLVER chain and reports this value through our metrics pipeline.

while :
do
  # The first column of the rule's verbose listing is its packet counter
  PACKET_COUNT=$(iptables -L VPC_RESOLVER 1 -x -n -v | awk '{ print $1 }')
  report-metric "$PACKET_COUNT" "vpc_resolver.packet_count"
  sleep 1
done

Once we started collecting this metric, we could see that the hourly spikes in SERVFAIL responses lined up with periods where the servers were sending too much traffic to the VPC resolver.
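The iptables counter is cumulative, so a packets-per-second rate comes from the difference between consecutive readings. The sketch below shows one way that conversion and a limit check might look. The function names and the 90% alert threshold are assumptions for illustration, and our metrics pipeline may compute the rate differently.

```python
VPC_RESOLVER_LIMIT_PPS = 1024  # AWS limit per network interface


def parse_packet_count(rule_line: str) -> int:
    """Extract the packet counter (first column) from a verbose iptables rule line."""
    return int(rule_line.split()[0])


def packets_per_second(previous: int, current: int, interval_seconds: float) -> float:
    """Turn two cumulative counter readings into a rate."""
    return (current - previous) / interval_seconds


def near_limit(rate: float, limit: int = VPC_RESOLVER_LIMIT_PPS,
               threshold: float = 0.9) -> bool:
    """Flag rates at or above a fraction of the limit (threshold is an assumption)."""
    return rate >= threshold * limit


if __name__ == "__main__":
    first = parse_packet_count(" 41023 2569001            all  --  *      *")
    second = first + 1000  # hypothetical reading one second later
    rate = packets_per_second(first, second, 1.0)
    print(rate, near_limit(rate))
```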

Traffic amplification

The data we saw from iptables (the number of packets per second sent to the VPC resolver) indicated a significant increase in traffic to the VPC resolvers during these periods, and we wanted to better understand what was happening. Taking a closer look at the shape of the traffic coming into the DNS servers from the Hadoop job, we noticed the clients were sending the request five times for every failed reverse lookup. Since the reverse lookups were taking so long or being dropped at the server, the local caching resolver on each host was timing out and continually retrying the requests. On top of this, the DNS servers were also retrying requests, leading to request volume amplifying by an average of 7x.

Spreading the load

One thing to remember is that the VPC resolver limit is imposed per network interface. Instead of performing the reverse lookups solely on our DNS servers, we could distribute the load and have each host contact the VPC resolver independently. With Unbound running on each host, we can easily control this behavior. Unbound allows you to specify different forwarding rules per DNS zone. Reverse queries use the special in-addr.arpa domain, so configuring this behavior was a matter of adding a rule that forwards requests for this zone to the VPC resolver.

We knew that reverse lookups for private addresses stored in Route 53 would likely return faster than reverse lookups for public IPs that required communication with an external nameserver. So we decided to create two forwarding configurations: one for the zone covering private addresses and one for the zone covering all other reverse queries. Both rules were configured to send requests to the VPC resolver. Unbound calculates retry timeouts based on a smoothed average of historical round trip times to upstream servers and maintains separate calculations per forwarding rule. Even if two rules share the same upstream destination, the retry timeouts are computed independently, which helps isolate the impact of inconsistent query performance on timeout calculations.
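To show why separate calculations per forwarding rule matter, here is a minimal sketch of a smoothed round-trip-time estimator kept per rule. It uses the TCP-style formula from RFC 6298 as a stand-in; this is an assumption for illustration and not a copy of Unbound's internals. The rule names are hypothetical.

```python
from collections import defaultdict


class RttEstimator:
    """TCP-style smoothed RTT estimator (RFC 6298), used here as a stand-in."""

    ALPHA = 1 / 8  # weight of a new sample in the smoothed RTT
    BETA = 1 / 4   # weight of a new sample in the RTT variance

    def __init__(self, initial_timeout_ms: float = 1000.0):
        self.srtt = None
        self.rttvar = None
        self.initial_timeout_ms = initial_timeout_ms

    def observe(self, rtt_ms: float) -> None:
        if self.srtt is None:
            self.srtt = rtt_ms
            self.rttvar = rtt_ms / 2
        else:
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt_ms)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt_ms

    def timeout_ms(self) -> float:
        if self.srtt is None:
            return self.initial_timeout_ms
        return self.srtt + 4 * self.rttvar


# One estimator per forwarding rule, even though both rules share an upstream.
estimators = defaultdict(RttEstimator)
estimators["private-reverse"].observe(2.0)     # fast answers from Route 53
estimators["public-reverse"].observe(400.0)    # slow answers from external nameservers

for rule, estimator in estimators.items():
    print(rule, estimator.timeout_ms())
```

With a single shared estimator, the slow public lookups would inflate the timeout applied to the fast private lookups, and the fast ones would shrink the timeout applied to the slow ones.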

After applying the forwarding configuration change to the local Unbound resolvers on the Hadoop nodes, we saw that the hourly load spike to the VPC resolvers had gone away, eliminating the surge of SERVFAIL responses we were seeing.

Adding the VPC resolver packet rate metric gives us a more complete picture of what’s going on in our DNS infrastructure. It alerts us if we approach any resource limits and points us in the right direction when systems are unhealthy. Some other improvements we’re considering include collecting a rolling tcpdump of DNS traffic and periodically logging the output of some of Unbound’s debugging commands, such as the contents of the request list.

Visibility into complex systems

When operating a critical piece of infrastructure like DNS, it’s crucial to understand the health of the various components of the system. The metrics and command line tools that Unbound provides give us great visibility into one of the core components of our DNS systems. As we saw in this scenario, these types of investigations often uncover areas where monitoring can be improved, and it’s important to address these gaps to better prepare for incident response. Gathering data from multiple sources allows you to see what’s going on in the system from different angles, which can help you narrow in on the root cause during an investigation. This information will also show whether the remediations you put in place have the intended effect. As these systems grow to handle more scale and increase in complexity, how you monitor them must also evolve to understand how different components interact with each other and build confidence that your systems are operating effectively.


Railyard: how we rapidly train machine learning models with Kubernetes

Rob Story on May 7, 2019 in Engineering

Stripe uses machine learning to respond to our users’ complex, real-world problems. Machine learning powers Radar to block fraud, and Billing to retry failed charges on the network. Stripe serves millions of businesses around the world, and our machine learning infrastructure scores hundreds of millions of predictions across many machine learning models. These models are powered by billions of data points, with hundreds of new models being trained each day. Over time, the volume, quality of data, and number of signals have grown enormously as our models continuously improve in performance.

Running infrastructure at this scale poses a very practical data science and ML problem: how do we give every team the tools they need to train their models without requiring them to operate their own infrastructure? Our teams also need a stable and fast ML pipeline to continuously update and train new models as they respond to a rapidly changing world. To solve this, we built Railyard, an API and job manager for training these models in a scalable and maintainable way. It’s powered by Kubernetes, a platform we’ve been working with since late 2017. Railyard enables our teams to independently train their models on a daily basis with a centrally managed ML service.

In many ways, we’ve built Railyard to mirror our approach to products for Stripe’s users: we want teams to focus on their core work training and developing machine learning models rather than operating infrastructure. In this post, we’ll discuss Railyard and best practices for operating machine learning infrastructure we’ve discovered while building this system.

Effective machine learning infrastructure for organizations

We’ve been running Railyard in production for a year and a half, and our ML teams have converged on it as their common training environment. After training tens of thousands of models on this architecture over that period, here are our biggest takeaways:

  • Build a generic API, not tied to any single machine learning framework. Teams have extended Railyard in ways we did not anticipate. We first focused on classifiers, but teams have since adopted the system for applications such as time series forecasting and word2vec style embeddings.
  • A fully managed Kubernetes cluster reduces operational burden across an organization. Railyard interacts directly with the Kubernetes API (as opposed to a higher level abstraction), but the cluster is operated entirely by another team. We’re able to learn from their domain knowledge to keep the cluster running reliably so we can focus on ML infrastructure.
  • Our Kubernetes cluster gives us great flexibility to scale up and out. We can easily scale our cluster volume when we need to train more models, or quickly add new instance types when we need additional compute resources.
  • Centrally tracking model state and ownership allows us to easily observe and debug training jobs. We’ve moved from asking, “Did you save the output of your job anywhere so we can look at it?” to “What’s your job ID? We’ll figure the rest out.” We observe aggregate metrics and track the overall performance of training jobs across the cluster.
  • Building an API for model training enables us to use it everywhere. Teams can call our API from any service, scheduler, or task runner. We now use Railyard to train models using an Airflow task definition as part of a larger graph of data jobs.

The Railyard architecture

In the early days of model training at Stripe, an engineer or data scientist would SSH into an EC2 instance and manually launch a Python process to train a model. This served Stripe’s needs at the time, but had a number of challenges and open questions for our Machine Learning Infrastructure team to address as the company grew:

  • How do we scale model training from ad-hoc Python processes on shared EC2 instances to automatically training hundreds of models a day?
  • How do we build an interface that is generic enough to support multiple training libraries, frameworks, and paradigms while remaining expressive and concise?
  • What metrics and metadata do we want to track for each model run?
  • Where should training jobs be executed?
  • How do we scale different compute resource needs (CPU, GPU, memory) for different model types?

Our goal when designing this system was to enable our data scientists to think less about how their machine learning jobs are run on our infrastructure, and instead focus on their core inquiry. Machine learning workflows typically involve multiple steps that include loading data, training models, serializing models, and persisting evaluation data. Because Stripe runs its infrastructure in the cloud, we can manage these processes behind an API: this reduces cognitive burden for our data science and engineering teams and moves local processes to a collaborative, shared environment. After a year and a half of iteration and collaboration with teams across Stripe, we’ve converged on the following system architecture for Railyard. Here’s a high-level overview:

Railyard runs on a Kubernetes cluster and pairs jobs with the right instance type.

Railyard provides a JSON API and is a Scala service that manages job history, state, and provenance in a Postgres database. Jobs are executed and coordinated using the Kubernetes API, and our Kubernetes cluster provides multiple instance types with different compute resources. The cluster can pair jobs with the right instance type: for example, most jobs default to our high-CPU instances, data-intensive jobs run on high-memory instances, and specialized training jobs like deep learning run on GPU instances.
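The pairing rule described above can be pictured as a small selection function. This is a hypothetical sketch: the `TrainingJob` fields and the instance class labels are invented for illustration, and in Railyard the pairing is handled through the Kubernetes API rather than by code like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingJob:
    name: str
    needs_gpu: bool = False       # e.g. deep learning jobs
    data_intensive: bool = False  # e.g. jobs that load large datasets in memory


def pick_instance_class(job: TrainingJob) -> str:
    """Choose an instance class for a job; high-CPU is the default."""
    if job.needs_gpu:
        return "gpu"
    if job.data_intensive:
        return "high-memory"
    return "high-cpu"


if __name__ == "__main__":
    jobs = [
        TrainingJob("fraud_prediction_model"),
        TrainingJob("large_feature_join", data_intensive=True),
        TrainingJob("embedding_model", needs_gpu=True),
    ]
    for job in jobs:
        print(job.name, "->", pick_instance_class(job))
```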

We package the Python code for model training using Subpar, a Google library that creates a standalone executable that includes all dependencies in one package. This is included in a Docker container, deployed to the AWS Elastic Container Registry, and executed as a Kubernetes job. When Railyard receives an API request, it runs the matching training job and logs are streamed to S3 for inspection. A given job will run through multiple steps, including fetching training and holdout data, training the model, and serializing the trained model and evaluation data to S3. These training results are persisted in Postgres and exposed in the Railyard API.

Railyard’s API design

The Railyard API allows you to specify everything you need to train a machine learning model, including data sources and model parameters. In designing this API we needed to answer the following question: how do we provide a generic interface for multiple training frameworks while remaining expressive and concise for users?

We iterated on a few designs with multiple internal customers to understand each use case. Some teams only needed ad-hoc model training and could simply use SQL to fetch features, while others needed to call an API programmatically hundreds of times a day using features stored in S3. We explored a number of different API concepts, arriving at two extremes on either end of the design spectrum.

On one end, we explored designing a custom DSL to specify the entire training job by encoding scikit-learn components directly in the API itself. Users could include scikit-learn pipeline components in the API specification and would not need to write any Python code themselves.

On the other end of the spectrum we reviewed designs to allow users to write their own Python classes for their training code with clearly defined input and output interfaces. Our library would be responsible for both the necessary inputs to train models (fetching, filtering, and splitting training and test data) and the outputs of the training pipeline (serializing the model, and writing evaluation and label data). The user would otherwise be responsible for writing all training logic.

In the end, any DSL-based approach ended up being too inflexible: it either tied us to a given machine learning framework or required that we continuously update the API to keep pace with changing frameworks or libraries. We converged on the following split: our API exposes fields for changing data sources, data filters, feature names, labels, and training parameters, but the core logic for a given training job lives entirely in Python.

Here’s an example of an API request to the Railyard service:

{
  // What does this model do?
  "model_description": "A model to predict fraud",
  // What is this model called?
  "model_name": "fraud_prediction_model",
  // What team owns this model?
  "owner": "machine-learning-infrastructure",
  // What project is this model for?
  "project": "railyard-api-blog-post",
  // Which team member is training this model?
  "trainer": "robstory",
  "data": {
    "features": [
      {
        // Columns we’re fetching from Hadoop Parquet files
        "names": ["created_at", "charge_type", "charge_amount",
                  "charge_country", "has_fraud_dispute"],
        // Our data source is S3
        "source": "s3",
        // The path to our Parquet data
        "path": "s3://path/to/parquet/fraud_data.parq"
      }
    ],
    // The canonical date column in our dataset
    "date_column": "created_at",
    // Data can be filtered multiple times
    "filters": [
      // Filter out data before 2018-01-01
      {
        "feature_name": "created_at",
        "predicate": "GtEq",
        "feature_value": {
          "string_val": "2018-01-01"
        }
      },
      // Filter out data after 2019-01-01
      {
        "feature_name": "created_at",
        "predicate": "LtEq",
        "feature_value": {
          "string_val": "2019-01-01"
        }
      },
      // Filter for charges greater than $10.00
      {
        "feature_name": "charge_amount",
        "predicate": "Gt",
        "feature_value": {
          "float_val": 10.00
        }
      },
      // Filter for charges in the US or Canada
      {
        "feature_name": "charge_country",
        "predicate": "IsIn",
        "feature_value": {
          "string_vals": ["US", "CA"]
        }
      }
    ],
    // We can specify how to treat holdout data
    "holdout_sampling": {
      "sampling_function": "DATE_RANGE",
      // Split holdout data from 2018-10-01 to 2019-01-01
      // into a new dataset
      "date_range_sampling": {
        "date_column": "created_at",
        "start_date": "2018-10-01",
        "end_date": "2019-01-01"
      }
    }
  },
  "train": {
    // The name of the Python workflow we're training
    "workflow_name": "StripeFraudModel",
    // The list of features we're using in our classifier
    "classifier_features": [
      "charge_type", "charge_amount", "charge_country"
    ],
    "label": "is_fraudulent",
    // We can include hyperparameters in our model
    "custom_params": {
      "objective": "reg:linear",
      "max_depth": 6,
      "n_estimators": 500,
      "min_child_weight": 50,
      "learning_rate": 0.02
    }
  }
}
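To give a feel for how the filter and holdout fields in a request like this could be interpreted, here is a minimal sketch in plain Python. The predicate names come from the request above; everything else (the function names, the inclusive date bounds, and treating rows as dictionaries) is an assumption for illustration and not how Railyard is implemented.

```python
import operator

# Predicate names as they appear in the request; the semantics are assumed.
PREDICATES = {
    "GtEq": operator.ge,
    "LtEq": operator.le,
    "Gt": operator.gt,
    "IsIn": lambda value, allowed: value in allowed,
}

FILTERS = [
    {"feature_name": "created_at", "predicate": "GtEq",
     "feature_value": {"string_val": "2018-01-01"}},
    {"feature_name": "created_at", "predicate": "LtEq",
     "feature_value": {"string_val": "2019-01-01"}},
    {"feature_name": "charge_amount", "predicate": "Gt",
     "feature_value": {"float_val": 10.00}},
    {"feature_name": "charge_country", "predicate": "IsIn",
     "feature_value": {"string_vals": ["US", "CA"]}},
]


def filter_value(feature_value: dict):
    """Each feature_value holds exactly one typed entry, e.g. {"string_val": ...}."""
    (value,) = feature_value.values()
    return value


def apply_filters(rows, filters):
    """Keep only the rows that satisfy every filter."""
    def keep(row):
        return all(
            PREDICATES[f["predicate"]](
                row[f["feature_name"]], filter_value(f["feature_value"])
            )
            for f in filters
        )
    return [row for row in rows if keep(row)]


def split_holdout(rows, date_column, start_date, end_date):
    """Split rows into (training, holdout); bounds are assumed inclusive.

    ISO-8601 date strings compare correctly as plain strings.
    """
    holdout = [r for r in rows if start_date <= r[date_column] <= end_date]
    training = [r for r in rows if not start_date <= r[date_column] <= end_date]
    return training, holdout
```

A caller would run `apply_filters(rows, FILTERS)` first and then pass the result to `split_holdout` with the `date_range_sampling` values from the request.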

We learned a few lessons while designing this API:

  • Be flexible with model parameters. Providing a free-form custom_params field that accepts any valid JSON was very important for our users. We validate most of the API request, but you can’t anticipate every parameter a machine learning engineer or data scientist needs for all of the model types they want to use. This field is most frequently used to include a model’s hyperparameters.
  • Not providing a DSL was the right choice (for us). Finding the sweet spot for expressiveness in an API for machine learning is difficult, but so far the approach outlined above has worked out well for our users. Many users only need to change dates, data sources, or hyperparameters when retraining. We haven’t gotten any requests to add more DSL-like features to the API itself.

The Python workflow

Stripe uses Python for all ML model training because of its support for many best-in-class ML libraries and frameworks. When the Railyard project started we only had support for scikit-learn, but have since added XGBoost, PyTorch, and FastText. The ML landscape changes very quickly and we needed a design that didn’t pick winners or constrain users to specific libraries. To enable this extensibility, we defined a framework-agnostic workflow that presents an API contract with users: we pass data in, you pass a trained model back out, and we’ll score and serialize the model for you. Here’s what a minimal Python workflow looks like:

class StripeFraudModel(StripeMLWorkflow):
  # A basic model training workflow: all workflows inherit
  # Railyard’s StripeMLWorkflow class
  def train(self, training_dataframe, holdout_dataframe):
    # Construct an estimator using specified hyperparameters
    estimator = xgboost.XGBRegressor(**self.custom_params)

    # Serialize the trained model once training is finished;
    # we're using an in-house serialization library.
    serializable_estimator = stripe_ml.make_serializable(estimator)

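    # Train our model. The fit call is missing from this listing, so the
    # call and its arguments below are an assumption based on the
    # "classifier_features" and "label" fields of the API request.
    fitted_model = serializable_estimator.fit(
      training_dataframe[self.classifier_features],
      training_dataframe[self.label])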

    # Hand our fitted model back to Railyard to serialize
    return fitted_model

Teams start adopting Railyard with an API specification and a workflow that defines a train method to train a classifier with the data fetched from the API request. The StripeMLWorkflow interface supports extensive customization to adapt to different training approaches and model types. You can preprocess your data before it gets passed in to the train function, define your own data fetching implementation, specify how you want training/holdout data to be scored, and run any other Python code you need. For example, some of our deep learning models have custom data fetching code to stream batches of training data for model training. When your training job finishes, you’ll end up with two outputs: a model identifier for your serialized model that can be put into production, and your evaluation data in S3.

If you build a machine learning API specification, here are a few things to keep in mind:

  • Interfaces are important. Users will want to load and transform data in ways you didn’t anticipate, train models using unsupported patterns, and write out unfamiliar types of evaluation data. It’s important to provide standard API interfaces like fetch_data, preprocess, train, and write_evaluation_data that specify some standard data containers (e.g., Pandas DataFrame and Torch Dataset) but are flexible in how they are generated and used.
  • Users should not need to think about model serialization or persistence. Reducing their cognitive burden makes their lives easier and gives them more time to be creative and focus on modeling and feature engineering. Data scientists and ML engineers already have enough to think about between feature engineering, modeling, evaluation, and more. They should be able to train and hand over their model to your scoring infrastructure without ever needing to think about how it gets serialized or persisted.
  • Define metrics for each step of the training workflow. Make sure you’re gathering fine-grained metrics for each training step: data loading, model training, model serialization, evaluation data persistence, etc. We store high-level success and failure metrics that can be examined by team, project, or the individual machine performing the training. On a functional level, our team uses these metrics to debug and profile long-running or failed jobs, and provide feedback to the appropriate team when there’s a problem with a given training job. And on a collaborative level, these metrics have changed how our team operates. Moving from a reactive stance (“My model didn’t train, can you help?”) to a proactive one (“Hey, I notice your model didn’t train, here’s what happened”) has helped us be better partners to the many teams we work with.
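One way to gather per-step metrics is to wrap each stage of a job in a small timing helper. The sketch below is an assumption about how this could look, not a description of Railyard’s internals: StepMetrics and run_job are hypothetical names, and a real system would ship the records to a metrics store instead of keeping them in a list.

```python
import time
from contextlib import contextmanager
from typing import Dict, List


class StepMetrics:
    """Collects one record (name, status, duration) per workflow step."""

    def __init__(self) -> None:
        self.records: List[Dict[str, object]] = []

    @contextmanager
    def step(self, name: str):
        start = time.perf_counter()
        status = "success"
        try:
            yield
        except Exception:
            # Record the failure, then let the error propagate to the caller.
            status = "failure"
            raise
        finally:
            self.records.append({
                "step": name,
                "status": status,
                "seconds": time.perf_counter() - start,
            })


def run_job(metrics: StepMetrics, fail_in_training: bool = False) -> str:
    """Toy job with the same step names the text lists."""
    with metrics.step("data_loading"):
        rows = [1.0, 2.0, 3.0]
    with metrics.step("model_training"):
        if fail_in_training:
            raise MemoryError("simulated out-of-memory failure")
        model = sum(rows) / len(rows)
    with metrics.step("model_serialization"):
        payload = repr(model)
    return payload
```

Because a failed step still produces a record, the platform team can see which step broke without waiting for the job owner to report it.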

Scaling Kubernetes

Railyard coordinates hundreds of machine learning jobs across our cluster, so effective resource management across our instances is crucial. The first version of Railyard simply ran individual subprocesses from the Scala service that manages all jobs across our cluster. We would get a request, start Java’s ProcessBuilder, and kick off a subprocess to build a Python virtualenv and train the model. This basic implementation allowed us to quickly iterate on our API in our early days, but managing subprocesses wasn’t going to scale very well. We needed a proper job management system that met a few requirements:

  • Scaling the cluster quickly for different resource/instance types
  • Routing models to specific instances based on their resource needs
  • Job queueing to prioritize resources for pending work

Luckily, our Orchestration team had been working hard to build a reliable Kubernetes cluster and suggested this new cluster would be a good platform for Railyard’s needs. It was a great fit; a fully managed Kubernetes cluster provides all of the pieces we needed to meet our system’s requirements.

Containerizing Railyard

To run Railyard jobs on Kubernetes, we needed a way to reliably package our Python code into a fully executable binary. We use Google’s Subpar library which allows us to package all of our Python requirements and source code into a single .par file for execution. The library also includes support for the Bazel build system out of the box. Over the past few years, Stripe has been moving many of its builds to Bazel; we appreciate its speed, correctness, and flexibility in a multi-language environment.

With Subpar you can define an entrypoint to your Python executable and Bazel will build your .par executable to bundle into a Dockerfile:

par_binary(
    name = "railyard_train",
    srcs = ["@.../ml:railyard_srcs"],
    data = ["@.../ml:railyard_data"],
    main = "@.../ml:railyard/",
    deps = all_requirements,
)

With the Subpar package built, the Kubernetes command only needs to execute it with Python:

command: ["sh"]
args: ["-c", "python /railyard_train.par"]

Within the Dockerfile we package up any other third-party dependencies that we need for model training, such as the CUDA runtime to provide GPU support for our PyTorch models. After our Docker image is built, we push it to Amazon Elastic Container Registry (ECR) so our Kubernetes cluster can fetch and run the image.

Running diverse workloads

Some machine learning tasks can benefit from a specific instance type with resources optimized for a given workload. For example, a deep learning task may be best suited for a GPU instance while fraud models that employ huge datasets should be paired with high-memory instances. To support these mixed workloads we added a new top-level field to the Railyard API request to specify the compute resource for jobs running on Kubernetes:

    "compute_resource": "GPU"

Railyard supports training models on CPU, GPU, or memory-optimized instances. Models for our largest datasets can require hundreds of gigabytes of memory to train, while our smaller models can train quickly on smaller (and less expensive) instance types.
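A service receiving this field would likely want to validate it before scheduling anything. The following is a minimal sketch under stated assumptions: only the "GPU" value appears above, so the "CPU" and "MEMORY" values, the CPU default, and the parse_compute_resource helper are all hypothetical.

```python
import json
from enum import Enum


class ComputeResource(Enum):
    # Only "GPU" is confirmed by the request example; the others are assumed.
    CPU = "CPU"
    GPU = "GPU"
    MEMORY = "MEMORY"


def parse_compute_resource(request_body: str,
                           default: ComputeResource = ComputeResource.CPU) -> ComputeResource:
    """Read the top-level compute_resource field from a JSON request body."""
    request = json.loads(request_body)
    raw = request.get("compute_resource")
    if raw is None:
        # Assumption: jobs that do not ask for a resource run on CPU nodes.
        return default
    try:
        return ComputeResource(raw)
    except ValueError:
        allowed = ", ".join(member.value for member in ComputeResource)
        raise ValueError(
            f"unknown compute_resource {raw!r}; expected one of: {allowed}"
        ) from None


resource = parse_compute_resource('{"compute_resource": "GPU"}')
```

Rejecting unknown values at the API boundary keeps a typo from silently landing a large job on the wrong instance type.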

Scheduling and distributing jobs

Railyard exerts a fine-grained level of control on how Kubernetes distributes jobs across the cluster. For each request, we look at the requested compute resource and set both a Kubernetes Toleration and an Affinity to specify the type of node that we would like to run on. These parameters effectively tell the Kubernetes cluster:

  • the affinity, or which nodes the job should run on
  • the toleration, or which reserved (tainted) nodes the job is allowed to run on

Kubernetes will use the affinity and toleration properties of a given pod to decide which node each job should be placed on.
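To make the affinity and toleration settings concrete, here is a sketch that maps a compute resource to the corresponding fields of a Kubernetes pod spec, written as plain Python dictionaries. The field names follow the Kubernetes pod spec; the node label key, pool names, image name, and helper functions are hypothetical, and the command and args reuse the ones shown earlier.

```python
from typing import Dict

# Hypothetical label/taint key and pool names; a real cluster defines its own.
NODE_POOL_KEY = "example.com/node-pool"
POOLS = {"CPU": "cpu", "GPU": "gpu", "MEMORY": "high-memory"}


def scheduling_constraints(compute_resource: str) -> Dict[str, object]:
    """Build the affinity and tolerations for one compute resource."""
    pool = POOLS[compute_resource]
    return {
        # Affinity: only nodes labeled with this pool are eligible.
        "affinity": {
            "nodeAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [{
                        "matchExpressions": [{
                            "key": NODE_POOL_KEY,
                            "operator": "In",
                            "values": [pool],
                        }]
                    }]
                }
            }
        },
        # Toleration: lets the pod onto nodes tainted for this pool.
        "tolerations": [{
            "key": NODE_POOL_KEY,
            "operator": "Equal",
            "value": pool,
            "effect": "NoSchedule",
        }],
    }


def pod_spec(compute_resource: str) -> Dict[str, object]:
    spec: Dict[str, object] = {
        "restartPolicy": "Never",
        "containers": [{
            "name": "train",
            "image": "example.com/railyard-train:latest",  # hypothetical image
            "command": ["sh"],
            "args": ["-c", "python /railyard_train.par"],
        }],
    }
    spec.update(scheduling_constraints(compute_resource))
    return spec


gpu_spec = pod_spec("GPU")
```

The two settings work as a pair: the taint on the nodes keeps other workloads off them, the toleration lets training pods through, and the affinity makes sure those pods go nowhere else.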

Kubernetes supports per-job CPU and memory requirements to ensure that workloads don’t experience resource starvation due to neighboring jobs on the same host. In Railyard, we determine limits for all jobs based on their historic and future expected usage of resources. In the case of high-memory or GPU training jobs, these limits are set so that each job gets an entire node to itself; if all nodes are occupied, then the scheduler will place the job in a queue. Jobs with less intensive resource requirements are scheduled on nodes to run in parallel.
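As an illustration of deriving limits from past usage, the sketch below sizes a memory request from historic peaks plus headroom and emits a Kubernetes-style resources block. The 25% headroom, the 1 GiB floor, and both helper functions are assumptions made for this example; the actual policy Railyard uses is not described here.

```python
import math
from typing import Dict, List


def memory_request_gib(historic_peaks_gib: List[float],
                       headroom: float = 1.25,
                       floor_gib: float = 1.0) -> int:
    """Return a whole-GiB request: the largest historic peak plus headroom."""
    if not historic_peaks_gib:
        # No history yet: fall back to the floor.
        return math.ceil(floor_gib)
    return math.ceil(max(floor_gib, max(historic_peaks_gib) * headroom))


def container_resources(historic_peaks_gib: List[float],
                        cpus: int) -> Dict[str, Dict[str, str]]:
    """Build the resources block for a container spec."""
    quantities = {
        "cpu": str(cpus),
        "memory": f"{memory_request_gib(historic_peaks_gib)}Gi",
    }
    # Equal requests and limits give the pod a fixed, non-burstable allocation.
    return {"requests": dict(quantities), "limits": dict(quantities)}


resources = container_resources([10.0, 12.0], cpus=4)
```

Under this scheme, a job whose request approaches a node’s allocatable memory effectively gets that node to itself, while small requests let the scheduler pack several jobs onto one node.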

With these parameters in place, we can lean on the Kubernetes resource scheduler to balance our jobs across available nodes. Given a set of job and resource requests, the scheduler will intelligently distribute those jobs to nodes across the cluster.

One year later: running at scale

Moving our training jobs to a Kubernetes cluster has enabled us to rapidly spin up new resources for different models and expand the cluster to support more training jobs. We can use a single command to expand the cluster and new instance types only require a small configuration change. When the memory requirements of running jobs outgrew our CPU-optimized instance types, we started training on memory-optimized instances the very next day; when we observe a backlog of jobs, we can immediately expand the cluster to process the queue. Model training on Kubernetes is available to any data scientist or engineer at Stripe: all that’s needed is a Python workflow and an API request and they can start training models on any resource type in the cluster.

To date, we’ve trained almost 100,000 models on Kubernetes, with new models trained each day. Our fraud models automatically retrain on a regular basis using Railyard and Kubernetes, and we’re steadily moving more of Stripe’s models onto an automated retraining cycle. Radar’s fraud model is built on hundreds of distinct ML models and has a dedicated service that trains and deploys all of those models on a daily cadence. Other models retrain regularly using an Airflow task that uses the Railyard API.

We’ve learned a few key considerations for scaling Kubernetes and effectively managing instances:

  • Instance flexibility is really important. Teams can have very different machine learning workloads. In any given day we might train thousands of time series forecasts, a long-running word embedding model, or a fraud model with hundreds of gigabytes of data. The ability to quickly add new instance types and the ability to expand the cluster are equally important for scalability.
  • Managing memory-intensive workflows is hard. Even using various instance sizes and a managed cluster, we still sometimes have jobs that run out of memory and are killed. This is a downside to providing so much flexibility in the Python workflow: modelers are free to write memory-intensive workflows. Kubernetes allows us to proactively kill jobs that are consuming too many resources, but it still results in a failed training job for the modeler. We’re thinking about ways to better manage this, including smart retry behavior to automatically reschedule failed jobs on higher-capacity instances and moving to distributed libraries like dask-ml.
  • Subpar is an excellent solution for packaging Python code. Managing Python dependencies can be tricky, particularly when you’d like to bundle them as an executable that can be shipped to different instances. If we were to build this from scratch again we would probably take a look at Facebook’s XARs, but Subpar is very compatible with Bazel and it’s been running well in production for over a year.
  • Having a good Kubernetes team is a force multiplier. Railyard could not have been a success without the support of our Orchestration team, which manages our Kubernetes cluster and pushes the platform forward for the whole organization. If we had to manage and operate the cluster in addition to building our services, we would have needed more engineers and taken significantly longer to ship.
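The retry idea mentioned in the memory bullet above could be sketched as follows. This is a hypothetical illustration of rescheduling on progressively larger instances, not something the post says is implemented: the OutOfMemory exception, the tier names, and train_with_escalation are all invented for the example.

```python
from typing import Callable, List, Optional


class OutOfMemory(Exception):
    """Raised by the (hypothetical) job runner when a job is killed for memory."""


def train_with_escalation(run_job: Callable[[str], str], ladder: List[str]) -> str:
    """Try each instance tier in order, moving up only after a memory failure."""
    last_error: Optional[OutOfMemory] = None
    for tier in ladder:
        try:
            return run_job(tier)
        except OutOfMemory as error:
            # Remember the failure and retry on the next, larger tier.
            last_error = error
    raise RuntimeError(f"job ran out of memory on every tier: {ladder}") from last_error


attempts: List[str] = []


def fake_job(tier: str) -> str:
    # Stand-in for submitting a training job; only the large tier has enough memory.
    attempts.append(tier)
    if tier != "memory-large":
        raise OutOfMemory(f"killed on {tier}")
    return f"model trained on {tier}"


result = train_with_escalation(fake_job, ["cpu", "memory-medium", "memory-large"])
```

Only memory failures trigger escalation here; any other exception propagates immediately, so a bug in the workflow does not burn through the expensive tiers.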

Building ML infrastructure

We’ve learned that building common machine learning infrastructure enables teams across Stripe to operate independently and focus on their local ML modeling goals. Over the last year we’ve used Railyard to train thousands of models spanning use cases from forecasting to deep learning. This system has enabled us to build rich functionality for model evaluation and design services to optimize hyperparameters for our models at scale.

While there is a wealth of information available on data science and machine learning from the modeling perspective, there isn’t nearly as much published about how companies build and operate their production machine learning infrastructure. Uber, Airbnb, and Lyft have all discussed how their infrastructure operates, and we’re following their lead in introducing the design patterns that have worked for us. We plan to share more lessons from our ML architecture in the months ahead. In the meantime, we’d love to hear from you: please let us know which lessons are most useful and if there are any specific topics about which you’d like to hear more.

May 7, 2019

Stripe’s fifth engineering hub is Remote

David Singleton on May 2, 2019 in Engineering

Stripe has engineering hubs in San Francisco, Seattle, Dublin, and Singapore. We are establishing a fifth hub that is less traditional but no less important: Remote. We are doing this to situate product development closer to our customers, improve our ability to tap the 99.74% of talented engineers living outside the metro areas of our first four hubs, and further our mission of increasing the GDP of the internet.

Stripe will hire over a hundred remote engineers this year. They will be deployed across every major engineering workstream at Stripe.

May 2, 2019

Effectively using AWS Reserved Instances

Ryan Lopopolo on June 26, 2018 in Engineering

Stripe uses Amazon Web Services to power our infrastructure. With AWS, we can dynamically scale our fleet of servers in real-time. This elasticity enables us to reliably serve a rapidly growing user base and scale along with their businesses. We use AWS Reserved Instances, which allow us to predictably forecast our cloud spend given a dynamic fleet with rapidly changing compute requirements.

June 26, 2018