When people talk about their data infrastructure, they tend to focus on the technologies: Hadoop, Scalding, Impala, and the like. However, we’ve found that just as important as the technologies themselves are the principles that guide their use. We’d like to share our experience with one such principle that we’ve found particularly useful: reproducibility.
With so many new technologies coming out every year (like Kubernetes or Habitat), it’s easy to become so entangled in our excitement about the future that we forget to pay homage to the tools that have been quietly supporting our production environments. One such tool we've been using at Stripe for several years now is Consul. Consul helps discover services (that is, it helps us navigate the thousands of servers we run with various services running on them and tells us which ones are up and available for use). This effective and practical architectural choice wasn't flashy or entirely novel, but has served us dutifully in our continued mission to provide reliable service to our users around the world.
Stripe Radar is a collection of tools to help businesses detect and prevent fraud. At Radar’s core is a machine learning engine that scans every card payment across Stripe’s 100,000+ businesses, aggregates information from those payments into behavioral signals that are predictive of fraud, and blocks payments that have a high probability of being fraudulent.
Radar’s power comes from all the data we obtain from the Stripe “network.” Instead of requiring users to label charges manually, Radar obtains the “ground truth” of fraud directly from our banking partners. Just as importantly, the signals we use in our models include aggregates over the entire stream of payments processed by Stripe: when a card is used for the first time on a Stripe business, there’s an 80% chance we’ve seen that card elsewhere on the Stripe network, and those previous interactions provide valuable information about potential fraud.
If you’re curious to learn more, we’ve put together a detailed outline that describes how we use machine learning at Stripe to detect and prevent fraud.
When a company writes about their observability stack, they often focus on sweet visualizations, advanced anomaly detection or innovative data stores. Those are well and good, but today we’d like to talk about the tip of the spear when it comes to observing your systems: metrics pipelines! Metrics pipelines are how we get metrics from where they happen—our hosts and services—to storage quickly and efficiently so they can be queried, all without interrupting the host service.
At Stripe, we make extensive use of automated testing to help ensure the stability and reliability of our services. We have expansive test coverage for our API and other core services, we run tests on a continuous integration server over every git branch, and we never deploy without green tests.
When we announced the Open Source retreat, we'd pictured it primarily as giving people the opportunity to work on projects they'd already been meaning to do. However, the environment we provided also became a place for people to come up with new ideas and give them a try. One of these ideas, Libscore, is launching publicly today.
We’ve started running game day exercises at Stripe. During a recent game day, we tested failing over a Redis cluster by running
kill -9 on its primary node , and ended up losing all data in the cluster. We were very surprised by this, but grateful to have found the problem in testing. This result and others from this exercise convinced us that game days like these are quite valuable, and we would highly recommend them for others.