How we built it: Real-time analytics for Stripe Billing

Reed Trevelyan, Revenue Intelligence

In a recent Stripe survey, 84% of global business leaders agreed that adapting pricing quickly will be a key competitive advantage over the next 1–2 years. This echoed what we’ve been hearing directly from our customers: that to stay nimble, they need to be able to spot new patterns of customer behavior just as they emerge—something that’s only possible when high-quality billing data is available in real time. 

That’s why we’ve developed a new, real-time streaming analytics system for Stripe Billing. Now when customers use the Stripe Dashboard to explore and visualize subscription metrics such as monthly recurring revenue (MRR) growth, churn rates, trial conversion rates, and more, they’re getting data that reflects any new subscription activity with latency as low as 15 minutes. This upgrade allows customers to get the real-time visibility they need to stay ahead of fast-moving trends, and it ensures accurate historical data even as their business changes. 

Creating this system meant replacing traditional batch processing methods, which had a 24-hour average lag for subscription updates, with new architecture and processes. We broke the problem into three main components: 

  1. Rebuilding our data architecture to support real-time subscription updates
  2. Upgrading our data aggregation system to reflect these real-time updates in the Dashboard within the same time frame 
  3. Letting customers freely adjust definitions of metrics without impacting real-time or historical analytics 

We’ll explore how we built each of these functionalities, the engineering challenges involved in them, and how they work together to create a fast, flexible, and reliable real-time analytics platform.

Low-latency analytics required an event-driven pipeline from beginning to end

A subscription is a relatively simple way of paying for a service, but it’s a complicated idea to handle within a data structure. The most up-to-date picture of any given subscription relies on past information as much as present. In isolation, it doesn’t mean much to a business that a customer paid $20 in June. The business also needs to know that the customer has paid on time, every month, since signing up in January.

The most straightforward technical approach to subscriptions, and what our previous analytics system relied on, is to calculate the current state of a subscription by re-analyzing all of the data related to it from the beginning of time. But this approach means that analysis has to be done in batches, on a set cadence. Running batches frequently enough to support anything close to real-time analytics was impossible given architectural limits. 

To build our new system, we needed to create a new pipeline that transforms updates to subscription and invoice objects into analytics events. We accomplished this using Apache Flink, which stores a highly compressed version of subscription history as a “state,” and incrementally updates that state as new analytical events are received. But generating the initial Flink state for long-standing customers was still a challenge, because it would require replaying billions of historical events in order. To address this, we also built a custom tool that lets us run our streaming transformation logic as a data job in Apache Spark, which can process large amounts of historical events in parallel and output its results as verifiable flat files. This job efficiently generates the initial Flink state, and it also feeds a redundant offline data pipeline for validation and data export uses.

With our new architecture in place, we were able to achieve latency as low as 15 minutes on subscription updates. The $20 June subscription payment now no longer needs to be re-assessed alongside every other payment made since January; it’s simply added to the ongoing ledger contained within the Flink state.
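To make the idea concrete, here is a minimal sketch in plain Python of the pattern described above: folding each new analytics event into a compact per-subscription state, rather than replaying full history. This is illustrative only; the names and event shapes are assumptions, and the real system maintains this state as keyed state inside Apache Flink.

```python
# Illustrative sketch (not Stripe's actual code): incrementally folding
# subscription events into a compact state, the way the Flink job
# updates its stored "state" instead of reprocessing all history.
from dataclasses import dataclass


@dataclass
class SubscriptionState:
    mrr: float = 0.0         # current monthly recurring revenue
    total_paid: float = 0.0  # lifetime payments, kept as a running sum
    payments: int = 0        # number of payments observed


def apply_event(state: SubscriptionState, event: dict) -> SubscriptionState:
    """Fold one analytics event into the compressed state."""
    kind = event["type"]
    if kind == "payment":
        state.total_paid += event["amount"]
        state.payments += 1
    elif kind == "plan_change":
        state.mrr = event["new_mrr"]
    elif kind == "cancel":
        state.mrr = 0.0
    return state


# The $20 June payment just extends the running totals; nothing since
# January needs to be re-read.
state = SubscriptionState(mrr=20.0)
for ev in [{"type": "payment", "amount": 20.0}] * 6:  # Jan..Jun payments
    state = apply_event(state, ev)
```

Bootstrapping this state for a long-standing customer is exactly what the Spark batch job handles: it runs the same fold over billions of historical events in parallel and emits the resulting states as the initial Flink state.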

Complex, low-latency aggregation became possible with the launch of a brand-new query engine 

We designed the Dashboard to respond to queries flexibly and responsively: Stripe users can filter, group, and drill down into their data without waiting as their requests are processed. We want our users to feel as though we’re simply opening a window to their data—but under the hood, aggregating subscription data in a way that supports these queries is a significant computing task. To let users visualize how MRR changes over time, for example, we needed to analyze the historical state of every subscription at every point of time within whatever period the user specifies. When we first built our billing analytics system, using Apache Pinot as our online analytical processing (OLAP) database, the best available solution was to preaggregate subscription data offline in a scheduled batch job.

To achieve real-time analytics, we needed to remove that preaggregation step. At query time, we had to be able to analyze the historical and current states of all subscriptions so that we could catch those that had just added new data to their ledgers. But we also needed to maintain the ultraresponsive queries that our users now expected, and which had led us to select Pinot as our OLAP in the first place. 

We found a solution when the maintainers of the open-source Pinot software released a brand-new v2 engine that could perform “windowed” aggregation queries. These queries segment data within multiple “windows” or date ranges, and they perform aggregation operations (summing, averaging, etc.) across the data in those ranges—enabling the simultaneous calculation of MRR over time without offline preaggregation. Pinot’s new engine also allowed us to perform more complex data joins, which opened up other real-time functions for the Dashboard: data gap filling, currency conversions, and custom query dimensions.

We worked closely with the Pinot maintainers to test the new engine and bring it into production; it had never before been deployed in a user-facing context at Stripe’s scale. With the updated Dashboard, the $20 June subscription payment is not only updated in real time but can also be queried almost instantly: most updates are processed in well under 1 minute, and nearly all are available to the user within 15 minutes. In production, we now see query latency of less than 300 milliseconds, maintaining the Dashboard’s fast, responsive feel.

Allowing customizable metric definitions while maintaining real-time updates required a delicate balance of flexibility and consistency

The definition of a seemingly straightforward metric such as MRR can vary significantly from one business to another. To accommodate this variation, we let Billing users adjust the formula definitions used for MRR and other metrics. Switching to streaming analytics introduced a new challenge here: we needed to preserve this flexibility for users while maintaining data consistency amid real-time updates. 

For example, if a customer decided to begin excluding one-time coupons from their MRR calculations, we’d need to ensure this change was reflected consistently across all historical and incoming data. For a customer who has been with Stripe since 2017, that would mean taking hours to reprocess years of data to get historical MRR values consistent with the updated definition—all while continuing to handle new, incoming events.

Our solution is a workflow that balances historical recalculation with real-time updates:

  1. When a customer changes a metric definition, we initiate a batch process to align historical data with the new definition.
  2. Concurrently, we continue streaming and processing new events using the customer’s old metric definition in real time.
  3. While processing, these incoming events are also temporarily buffered in memory in our Flink application.
  4. Once the historical reprocessing is complete, we patch the Flink app’s state with the recalculated historical data and allow Flink to reprocess the buffered events on top of the updated history.
  5. We transition the Dashboard to display the fully updated data, at which point we stop all processing that uses the old metric definition.
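The workflow above can be simulated in a few lines. This is a deliberately simplified sketch with assumed names (`old_def`, `new_def`, the event shapes): history is recalculated under the new definition while live events are buffered, then the buffer is replayed on top of the patched state.

```python
# Simplified simulation of the definition-change workflow (names and
# event shapes are assumptions, not Stripe internals).

def old_def(event):
    # Old definition: MRR includes one-time coupons.
    return event["amount"]


def new_def(event):
    # New definition: MRR excludes one-time coupons.
    return 0.0 if event.get("one_time_coupon") else event["amount"]


history = [{"amount": 20.0}, {"amount": 20.0, "one_time_coupon": True}]
live_buffer = []

# Steps 1–3: batch-reprocess history under new_def while new events
# keep streaming in and are buffered in memory.
recalculated = sum(new_def(e) for e in history)
live_buffer.append({"amount": 20.0})  # a new event arrives mid-reprocess

# Step 4: patch state with the recalculated history, then replay the
# buffered events on top of it.
patched_total = recalculated + sum(new_def(e) for e in live_buffer)

# Step 5: the Dashboard cuts over to patched_total, and processing
# under old_def stops.
```

The key property is that the buffered events are never lost and never double-counted: they are applied exactly once, after the patched history is in place.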

Throughout this process, the Dashboard remains responsive and useful, not grayed out or showing inconsistent data. Customers always see a consistent view of their data from the beginning of their history to the present moment—even while making definition changes and receiving real-time updates.

Looking ahead

When building our new streaming analytics system, we had two main goals: to give customers real-time access to data updates, and to help them query and sort that real-time data in the ways that were most useful to them. As we continue to evolve Billing analytics, we’re working on additional upgrades to address both of these goals:

  • We’re continuing to push data latency even lower, while still maintaining reliability and accuracy. 
  • We’re augmenting the Dashboard with more data and more query dimensions, including usage-based metrics and filters for customer geography and cohort. 

To learn more, read our docs or get in touch.
