Reproducible research: Stripe’s approach to data science

Dan Frank on November 22, 2016 in Engineering

When people talk about their data infrastructure, they tend to focus on the technologies: Hadoop, Scalding, Impala, and the like. However, we’ve found that just as important as the technologies themselves are the principles that guide their use. We’d like to share our experience with one such principle that we’ve found particularly useful: reproducibility.

We’ll talk about our motivation for focusing on reproducibility, how we’re using Jupyter Notebooks as our core tool, and the workflow we’ve developed around Jupyter to operationalize our approach.

Jupyter notebooks are a fantastic way to create reproducible data science research.


Motivation

Data tools are most often used to generate some kind of exploratory analysis report. At Stripe, an example is an investigation of the probability that a card gets declined, given the time since its last charge. The investigator writes a query, which is executed by a query engine like Redshift, and then runs some further code to interpret and visualize the results.

The most common way to share results from these sorts of studies is to compose an email and attach some graphs. But this means that viewers of the report don’t know how the query was constructed and analyzed. As a result, they are unable to review the work in depth, or to extend it themselves. It’s very easy to commit methodological errors when asking questions of data; an unintended bias here, or a missed corner case there, can lead to entirely incorrect conclusions.

In academia, the peer review system helps catch these errors. Many in the scientific community have championed the practice of open science, where data and code are released along with experimental results so that reviewers can independently recreate the original results. Taking inspiration from this movement, we sought to make data reports within Stripe transparent and reproducible, so that anyone at the company can look at a report and understand how it was generated. Just as an always-green test suite forces developers to write better code, we wanted to see if requiring that all analyses be reproducible would force us to produce better reports.


Implementation

Our implementation of reproducible analysis centers on Jupyter Notebook, a web-based frontend to the Jupyter interactive computing environment which provides an interface similar to Mathematica or Matlab.

Jupyter Notebook also comes with built-in functionality to convert a notebook into a publishable HTML document. You can see a sample of one of our published notebooks, studying the relationship between Benford’s Law and the amounts of each charge made on Stripe.

Now, let’s say that Alice wants to share a notebook with Bob. The state of the interactive environment can be persisted as a JSON file containing both the code input to the notebook and the data output from it. To share the notebook, Alice would typically send this notebook file directly to Bob. When Bob opens it, he’ll see the same outputs as Alice, but may not be able to do much with them. These outputs include computational results and the image data for plots, but not the values of any of the variables that Alice was working with. To inspect these variables and extend Alice’s work, he’ll have to recompute them from the code inputs. However, there may have been certain cells that only run correctly on Alice’s computer, or some cells might have been rearranged in a way that unintentionally broke the flow of computation. It’s easy to miss mistakes like these when you’re able to share a notebook with the results embedded, so we decided to try something different.
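To make the "code plus outputs, but no variables" point concrete, here is a minimal sketch of what a saved notebook file looks like. The notebook below is hand-written for illustration in the nbformat 4 JSON layout (it is not one of our notebooks), and `summarize` is a hypothetical helper.

```python
import json

# A hand-written, minimal notebook in the nbformat 4 JSON layout.
# It holds one code cell together with the output that was rendered for it.
NOTEBOOK_JSON = """
{
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {},
  "cells": [
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {},
      "source": ["total = 2 + 3\\n", "total"],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 1,
          "metadata": {},
          "data": {"text/plain": ["5"]}
        }
      ]
    }
  ]
}
"""


def summarize(notebook):
    """Return (source, rendered text outputs) for each code cell."""
    summary = []
    for cell in notebook["cells"]:
        if cell["cell_type"] != "code":
            continue
        source = "".join(cell["source"])
        rendered = [
            "".join(output.get("data", {}).get("text/plain", []))
            for output in cell["outputs"]
        ]
        summary.append((source, rendered))
    return summary


notebook = json.loads(NOTEBOOK_JSON)
cells = summarize(notebook)
```

The file records the text `5` that was displayed, but the variable `total` itself exists only in the memory of the kernel that ran the cell. Anyone who wants to work with `total` has to re-run the source.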

In our workflow, developers and data scientists work on a notebook locally and check this source file into Git. To publish their work, they use our common deployment framework, which executes the notebook code once it hits our servers. The results are translated into HTML, which is served statically. Importantly, we strip results from the notebook files in a pre-commit hook, meaning that only code is checked into our repositories. This ensures that the results are fully reproduced from scratch when the notebook is published. Thus, it’s a requirement that all notebooks be programmatically executable from start to finish, without needing any manual steps to run. If you were on a Stripe computer, you could run the sample notebook mentioned above with one click and obtain the same results. This is a huge deal!
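The stripping step can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not our actual hook: the function names `strip_outputs` and `strip_file` are hypothetical, and the code assumes the nbformat 4 layout, where each code cell carries an `outputs` list and an `execution_count`.

```python
import copy
import json


def strip_outputs(notebook):
    """Return a copy of an nbformat 4 notebook dict with all results removed."""
    stripped = copy.deepcopy(notebook)
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop rendered results and plots
            cell["execution_count"] = None  # drop the "In [n]" counters
    return stripped


def strip_file(path):
    """Rewrite the notebook at `path` in place; return True if it changed."""
    with open(path, encoding="utf-8") as handle:
        original = json.load(handle)
    stripped = strip_outputs(original)
    if stripped == original:
        return False
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(stripped, handle, indent=1, ensure_ascii=False)
        handle.write("\n")
    return True


# Illustrative notebook: one markdown cell and one executed code cell.
example = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Declines"]},
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "source": ["rate = 0.25\n", "print(rate)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["0.25\n"]}
            ],
        },
    ],
}
cleaned = strip_outputs(example)
```

A pre-commit hook could call `strip_file` on each staged `.ipynb` file and re-stage the files that changed. On the publishing side, one way to re-execute a stripped notebook and render it is `jupyter nbconvert --to html --execute`, though that is only one option and not necessarily what a given deployment framework uses.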

To make this workflow possible, we had to write some additional tooling to enable the same code to run on developers’ laptops and production servers. The bulk of this work involved access to our query engines, which is perhaps the most common obstacle to collaboration on data analysis projects. Even very well-organized workflows often require a data file to be present at a particular path, or some out-of-band authentication step with the machines running the queries. The key to overcoming these challenges was to create a common entry point in code to access these query engines from both developers’ laptops and our servers. This way, a notebook that runs on one developer’s computer will always run correctly on everyone else’s.
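The idea of a common entry point can be sketched as follows. Everything here is hypothetical and only illustrates the pattern: the `ANALYSIS_ENV` variable, the `EngineConfig` fields, and the host names are invented for the example and do not describe our real configuration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineConfig:
    """Connection settings for a query engine (hypothetical fields)."""
    host: str
    port: int
    auth: str


# Hypothetical per-environment settings. Notebooks never reference these
# directly; they only call engine_config().
_CONFIGS = {
    "laptop": EngineConfig(host="localhost", port=5439, auth="ssh-tunnel"),
    "server": EngineConfig(
        host="warehouse.internal.example", port=5439, auth="service-role"
    ),
}


def engine_config(environ=None):
    """Pick the settings for wherever this code is running.

    The environment is read from the (assumed) ANALYSIS_ENV variable and
    defaults to "laptop" when the variable is unset.
    """
    environ = os.environ if environ is None else environ
    name = environ.get("ANALYSIS_ENV", "laptop")
    try:
        return _CONFIGS[name]
    except KeyError:
        raise ValueError(
            f"unknown ANALYSIS_ENV {name!r}; expected one of {sorted(_CONFIGS)}"
        ) from None


# The same call works in both places; only the surrounding environment differs.
on_laptop = engine_config({})
on_server = engine_config({"ANALYSIS_ENV": "server"})
```

Because the notebook only ever calls `engine_config()`, it contains no hard-coded paths or credentials. Differences between machines live in one place instead of in every analysis.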

Adding this tooling also greatly improved the experience of doing exploratory data analysis within the notebook. Prior to our reproducibility tooling, setting up data access was tedious, time-consuming, and error-prone. Automating and standardizing this process allowed data scientists and developers to focus on their analysis instead.


Conclusion

Reproducibility makes data science at Stripe feel like working on GitHub, where anyone can obtain and extend others’ work. Instead of islands of analysis, we share our research in a central repository of knowledge. This makes it dramatically easier for anyone on our team to work with our data science research, encouraging independent exploration.

We approach our analyses with the same rigor we apply to production code: our reports feel more like finished products, research is fleshed out and easy to understand, and there are clear programmatic steps from start to finish for every analysis.

We’ve switched over to reproducible reports, and we’re not looking back. Delivering them requires more up-front work, but we’ve found it to be a good long-term investment. If you give it a try, we think you’ll feel the same way!
