Root cause analysis: significantly elevated error rates on 2019‑07‑10

July 12, 2019

Summary

Millions of businesses rely on Stripe. We see reliability as one of our most serious obligations and highest priorities. We invest in it heavily. However, on 2019-07-10 from 16:35 to 17:02 UTC, and again from 21:14 to 22:47 UTC, the Stripe API was severely degraded. A substantial majority of API requests during these windows failed. We’ve sent you an email if you had five or more failed POST requests.

We describe the failure in more detail below, but the very short summary is that two different database bugs and a configuration change interacted in an unforeseen way, causing a cascading failure across several critical services.

We’ve already taken a number of steps to ensure that this class of failures does not reoccur.

Here is what we have learned so far.

Timeline for first period of degradation

[Three months prior to the incident] We upgraded our databases to a new minor version that introduced a subtle, undetected fault in the database’s failover system.
[Four days prior to the incident] Two nodes became stalled for yet-to-be-determined reasons. These nodes stopped emitting metrics reporting their replication lag but continued to respond as healthy to active checks.
[2019-07-10 16:35 UTC] The first period of degradation started when the primary node for the database cluster failed.
[2019-07-10 16:36 UTC] Our team was alerted and we began incident response.
[2019-07-10 16:50 UTC] We determined the cluster was unable to elect a primary.
[2019-07-10 17:00 UTC] We restarted all nodes in the database cluster, resulting in a successful election.
[2019-07-10 17:02 UTC] The Stripe API fully recovered.

Timeline for second period of degradation

[2019-07-10 20:13 UTC] During our investigation into the root cause of the first event, we identified a code path likely causing the bug in a new minor version of the database’s election protocol.
[2019-07-10 20:42 UTC] We rolled back to a previous minor version of the election protocol and monitored the rollout.
[2019-07-10 21:14 UTC] We observed high CPU usage in the database cluster. The Stripe API started returning errors for users, marking the start of a second period of severe degradation.
[2019-07-10 21:26 UTC] We identified that our rolled-back election protocol interacted poorly with a recently-introduced configuration setting to trigger the second period of degradation. Applying the required change was slowed by several factors including CPU resource contention.
[2019-07-10 22:34 UTC] We successfully rolled out the new configuration and restarted the database cluster’s nodes.
[2019-07-10 22:47 UTC] The Stripe API fully recovered.

Root cause analysis

On 2019-07-10, the Stripe API experienced two periods of significant degradation, first from 16:35 UTC to 17:02 UTC and again from 21:14 UTC to 22:47 UTC. The first period was caused by the combination of two previously unobserved failure modes in a database cluster, and the latter was the result of our remediation efforts.

Stripe splits data by kind into different database clusters and by quantity into different shards. Each cluster has many shards, and each shard has multiple redundant nodes. We routinely exercise node failover logic during upgrades, maintenance, and failures.

Three months ago, we upgraded our databases to a new minor version. As part of the upgrade, we performed thorough testing in our quality assurance environment, and executed a phased production rollout, starting with less critical clusters and moving on to increasingly critical ones. The new version operated properly in production for the past three months, including many successful failovers. However, the new version also introduced a subtle fault in the database’s failover system that only manifested in the presence of multiple stalled nodes. On the day of the events, one shard was in the specific state that triggered this fault, and the shard was unable to elect a new primary.

Without a primary, the shard was unable to accept writes. Applications that write to the shard began to time out. Because of widespread use of this shard across applications, including the API, the unavailability of this shard starved compute resources for the API and cascaded into a severe API degradation. Automated monitoring detected the failed election within a minute. We began incident response within two minutes. Because this was a complex failure mode that we had not previously experienced, we needed to diagnose the underlying cause and determine the steps to remediate. Our team identified forcing the election of a new primary as the fastest remediation available, but this required restarting the database cluster. Once we restarted these nodes, 27 minutes after the event began, the Stripe API fully recovered.

After mitigating user impact, we investigated the root cause and identified a likely code path in a new version of the database’s election protocol. We decided to revert to the previous known stable version for all shards of the impacted cluster. We deployed this change within four minutes, and until 21:14 UTC the cluster was healthy.

At 21:14 UTC, automated alerts fired indicating that some shards in the cluster were unavailable, including the shard implicated in the first degradation. This began a second period of severely degraded availability that lasted until 22:47 UTC. We initially assumed that the same issue had reoccurred on multiple shards, as the symptoms appeared the same as the earlier event. We therefore followed the same mitigation playbook that succeeded earlier.

However, the second period of degradation had a different cause: our revert to a known stable version interacted poorly with a recently-introduced configuration change to the production shards. This interaction resulted in CPU starvation on all affected shards. Once we observed the CPU starvation, we were able to investigate and identify the root cause. We then updated the production configuration and restored the affected shards, which mitigated the incident at 22:34 UTC. After we verified that the cluster was healthy, we began ramping traffic back up, prioritizing services required for user-initiated API requests. We fully recovered at 22:47 UTC.

Remediations

In response to these events, we implemented additional monitoring to alert us when nodes stop reporting replication lag, and if a shard enters a state that could trigger this election fault in the database failover system. We are working with the database maintainers to develop a fix for this underlying fault.

We are also introducing several changes to prevent failures of individual shards from cascading across large fractions of API traffic. This includes additional circuit-breaking on failed operations to particular clusters, including the one implicated in these events. We will also pursue additional fault isolation techniques to contain the impact of a single failed shard and limit resource consumption by clients attempting repeated retries of failed requests.

Finally, we will introduce further procedures and tooling to increase the safety with which operators can make rapid configuration changes during incident response.

We know that you have the highest standards for the financial and technical infrastructure your business relies on. We share your standards; our business can only succeed if we are consistently reliable enough to power a material fraction of internet commerce. We deeply regret letting you down on July 10th.

We are redoubling our efforts to increase the resiliency of our systems. We have already conducted the first of a series of thorough reviews to identify improvements to our systems and practices. The improvements we have rolled out and will roll out soon will significantly reduce the likelihood of similar events in the future.

Payments

Revenue

Money Management

Platforms and marketplaces