Engineering challenges at Stripe
It's kind of crazy when you think about it, but in contrast to every other part of the web, payments on the internet today look largely like they did fifteen years ago. There are a few companies that have shown what’s possible when there’s a good payment ecosystem in place, but they are by and large walled gardens. No one has yet brought anything similar to the internet at large.
That’s where Stripe comes in. By building a better payments infrastructure, we want to enable more businesses and transactions. Our aim is to expand the commerce on the internet—simply replacing the legacy payment providers would probably be a great business success, but it’s not all that interesting as a goal.
All of our engineering challenges derive from this. We’re roughly segmented into six engineering teams, built around the core challenges we face.
Product
On the product front, our primary challenge is redesigning online payments (and the associated tooling) from the ground up. Every other team at Stripe is, in a way, supporting the products that we present to the world.
Many of these challenges aren’t unique to payments. Our API is a major part of our product, and most web APIs can be pretty confusing and hard to use. In an effort to do better, we’ve had to create a number of new standards around how to build an API (such as better ways to do webhooks, versioning, logging, and documentation) along the way.
Payments are complex, and choosing abstractions that balance power and flexibility with simplicity and clarity is hard. In other engineering groups, you tend to largely take advantage of existing software; in the product group, you need to build a deeper stack of abstractions and tooling.
More than other groups, the engineering decisions made in the product group need to balance non-engineering factors. Implementing the products might itself be tough, but even harder is choosing what to implement in the first place, because the problems and prioritization are open-ended. Among myriad issues, you have to thoughtfully prioritize everything from user experience and aesthetics to legal and financial considerations. The properties you want in your datastore might be clear, but the bounds of a new product are likely to be much murkier.
Financial operations
As any software engineer can attest, writing code that mostly does the right thing is hard. Writing bug-free software is next to impossible. But when you’re writing code that moves millions of dollars a day, as our ops team does, you somehow need to write code in a way that anticipates its own bugs and fails safely.
This is a very different constraint from traditional web development, where you can just ignore individual errors and hope the user will have better luck on the next try. On the other hand, it’s not quite like writing code for the space shuttle, where a mistake could mean loss of life. We need to figure out how to move quickly while still retaining important safety properties, and while we can tolerate some bugs, we need to make sure each of those issues is discovered and handled before it can affect users.
A lot of our time in the ops group is spent building robust frameworks. When you design the right abstraction, only one person has to think about the Hard Problems, and everyone else can use it without having to think too hard. For example, we’ve designed a framework that allows an implementor to model complex system actions as a series of individually simple state transitions. This allows us to centralize how we handle scheduling, isolate failures, and mitigate bugs (since any bug’s impact is scoped to a single state transition).
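To make the state-transition idea concrete, here is a minimal sketch in Python. It is not the framework described above; the `Transfer` record, the state names, and the `step`/`run` helpers are all hypothetical, chosen only to show how a failure can be confined to a single transition.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Transfer:
    """Hypothetical unit of work that moves through named states."""
    amount_cents: int
    state: str = "created"
    history: List[str] = field(default_factory=list)


# Each transition is a small function that inspects the record and returns
# the name of the next state. All names here are illustrative assumptions.
Transition = Callable[[Transfer], str]


def validate(t: Transfer) -> str:
    if t.amount_cents <= 0:
        raise ValueError("amount must be positive")
    return "validated"


def submit(t: Transfer) -> str:
    return "submitted"


def settle(t: Transfer) -> str:
    return "settled"


TRANSITIONS: Dict[str, Transition] = {
    "created": validate,
    "validated": submit,
    "submitted": settle,
}


def step(t: Transfer) -> bool:
    """Run at most one transition and report whether the state advanced.

    If the transition raises, the record stays in its current state, so the
    effect of a buggy transition is limited to that one step.
    """
    handler = TRANSITIONS.get(t.state)
    if handler is None:
        return False  # terminal state: nothing left to do
    try:
        next_state = handler(t)
    except Exception as exc:
        t.history.append(f"{t.state}: failed ({exc})")
        return False
    t.history.append(f"{t.state} -> {next_state}")
    t.state = next_state
    return True


def run(t: Transfer, max_steps: int = 10) -> Transfer:
    """Drive a record forward until it stops advancing."""
    for _ in range(max_steps):
        if not step(t):
            break
    return t
```

Because the driver (`step`) is the only place that handles errors and records history, individual transitions can stay trivially simple. A real system would presumably persist state between steps and schedule retries, which this sketch leaves out.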
Systems
One of the consequences of processing payments is that the load on our systems will always be much lower than that of other companies of equivalent scale (i.e. the dollar value per bit flowing through our systems is incredibly high). As a result, our primary problems are availability and consistency, and we get to push off the scaling challenges most other companies face for a lot longer. This has a very positive effect, allowing us to spend far more of our time writing business logic rather than making low-level optimizations.
The counterpoint is we care about availability in a way that other companies don’t. We generally hover between four and five nines of uptime. We’ve had to build our own highly-available load balancing layer on EC2 since EC2’s own load balancer doesn’t have the availability properties we want.
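For a sense of what "four and five nines" means in practice, the arithmetic can be sketched as below. The function name and the 365-day year are assumptions made for illustration.

```python
def allowed_downtime_minutes(nines: int, days: float = 365.0) -> float:
    """Downtime budget, in minutes, for an availability of N nines.

    Four nines means 99.99% availability, i.e. an unavailable fraction
    of 10**-4; five nines means 10**-5.
    """
    unavailable_fraction = 10.0 ** (-nines)
    minutes_in_period = days * 24 * 60
    return unavailable_fraction * minutes_in_period


for n in (4, 5):
    print(f"{n} nines: about {allowed_downtime_minutes(n):.2f} minutes per year")
```

Over a 365-day year (525,600 minutes), four nines allows roughly 52.6 minutes of downtime and five nines roughly 5.3 minutes.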
We’ve also had to build our own event-processing system, affectionately dubbed Monster, in order to get a hard guarantee that we never lose events and that failovers always happen without human intervention. We never accept downtime for maintenance, which has meant we need to build our own zero-downtime migration infrastructure. We’re currently pushing about 50 million events per day, and have designed and implemented a sharding framework that allows us to scale our databases horizontally.
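As a rough illustration of what horizontal sharding involves, the sketch below routes each record to a shard by hashing its key. This is a generic hash-modulo scheme, not a description of the framework mentioned above; `shard_for`, the key format, and the shard count are all hypothetical.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a record key to a shard index in [0, num_shards).

    A stable hash (SHA-256 here) is used instead of Python's built-in
    hash(), which is randomized per process for strings.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Hypothetical usage: the same key always lands on the same shard.
shard = shard_for("customer_123", num_shards=8)
```

One known limitation of plain modulo routing is that changing `num_shards` remaps most keys, which is why production systems often layer a lookup table or consistent hashing on top of an idea like this.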
Risk
In typical security work, you’ll spend most of your time defending against a theoretical adversary—in reality, your attack surface is so broad that even at scale any given system won’t see that much in the way of sophisticated attacks. In contrast, we see targeted attacks by fraudsters against Stripe and our users every single day. Many of these attackers are quite clever and strongly motivated (successfully pulling off a scheme directly translates to money in their bank account). Consequently, we’re continually building and adapting our systems to keep fraudsters away without degrading the experience for good users.
Risk engineering involves everything from creating a machine learning infrastructure to instantly onboarding new users. However, in corner cases, human interaction will always be needed. We have a team of risk analysts, and we’ve spent a lot of time creating interfaces to allow them to easily monitor accounts and transaction patterns.
Tools
In order to build everything we need to build, we need to be able to move fast (but not break things). Great tooling is the only way to accomplish this. We work hard to maximize developer productivity and minimize the time between code being written and pushed to production.
Data
In many ways, we can see the evolving shape of the internet in our systems—the fastest-growing and most innovative companies are using Stripe, and we can directly see how much our quest to make the web a better place is succeeding. We have some of the most interesting online commerce data one could ask for.
Ingesting and digesting our data is a pretty difficult challenge, and as a result, a lot of our work on data thus far has been building out our data infrastructure. Consequently, we’ve built systems for tailing our production data into HDFS (where it can be queried via Impala or our homegrown Scalding). We also maintain Tiller, a tool that makes it easy to build dynamic dashboards. We’re just now starting to think about building out an analyst team, which will help us better understand our mountains of data.
We’re working to make Stripe available around the world—the short answer is that we have far more challenges than our small (but growing!) team of engineers could hope to solve on their own.