PagerDuty analytics with Postgres

Mark McGranaghan on December 2, 2014

We’re open-sourcing the tool we use to collect and analyze on-call data from PagerDuty. We use pd2pg to improve the on-call experience for engineers at Stripe, and we think it’ll be useful for your teams too.

PagerDuty data in Postgres

PagerDuty is an important source of data about how services behave in production and the on-call load experienced by engineers. This data has been instrumental for managing and evolving our on-call rotations: over five months, we’ve reduced on-call load for our systems team by about 75%.

We import data from the PagerDuty API into a Postgres database using pd2pg, where we can use the full power of Postgres’ SQL queries.

Here’s how you import your data:

$ export PAGERDUTY_SUBDOMAIN="your-company"
$ export PAGERDUTY_API_KEY="..."
$ export DATABASE_URL="postgres://..."
$ bundle exec ruby pd2pg.rb

The script incrementally updates existing data, so it’s trivial to refresh your database periodically. (It also fetches historical data from your account, so you can get started with long-term analysis right away.)
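
For example, you could refresh the data hourly with a cron entry along these lines (the install path and schedule are purely illustrative, and the PAGERDUTY_* and DATABASE_URL variables still need to be available in the cron environment):

# Hypothetical crontab entry: re-run the pd2pg import at the top of every hour
0 * * * * cd /opt/pd2pg && bundle exec ruby pd2pg.rb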

Querying PagerDuty data with SQL

Once it’s in the database, you can start analyzing and exploring your PagerDuty data with psql:

$ psql $DATABASE_URL
> \d incidents
           Column            |           Type           | Modifiers 
-----------------------------+--------------------------+-----------
 id                          | character varying        | not null
 incident_number             | integer                  | not null
 created_at                  | timestamp with time zone | not null
 html_url                    | character varying        | not null
 incident_key                | character varying        | 
 service_id                  | character varying        | 
 escalation_policy_id        | character varying        | 
 trigger_summary_subject     | character varying        | 
 trigger_summary_description | character varying        | 
 trigger_type                | character varying        | not null
 
> select count(*) from incidents;
 count 
-------
 3466
(1 row)

As an example of a real query, here’s how you’d count the number of incidents per service over the past 28 days:

select
  services.name,
  count(incidents.id)
from
  incidents,
  services
where
  incidents.created_at > now() - '28 days'::interval and
  incidents.service_id = services.id
group by
  services.name
order by
  count(incidents.id) desc

How we use pd2pg at Stripe

  • Weekly team report: Our systems team reviews a detailed on-call report each week. It covers all alerts either sent by a team-owned service or fielded by a team engineer (which can include escalations from other teams’ services). This detailed report helps us understand the types of incidents we’re seeing so we can prevent or respond to them better.
  • Per-service incident counts: Aggregates like per-service incident counts help give us a high-level overview. (They’re not actionable results in themselves, but do show us high-load services we should review further.)
  • Interrupted hours metric: A common way to measure on-call load is to count the number of incidents over a period of time. Sometimes this over-represents issues that cause several related alerts to fire at the same time (which aren’t actually more costly than a single alert firing). To get a more accurate view of on-call load, we calculate an "interrupted hours" metric that counts the intervals in which an engineer receives one or more pages. This metric provides pretty good insight into real on-call load by suppressing noise from issues that result in multiple pages and more heavily weighting incidents with escalations. (There’s a sketch of this kind of query after this list.)
  • On-hours vs. off-hours alerts: Pages during the work day are less costly than ones that wake an engineer up at 3am on a Sunday. So, we look at the metrics discussed above broken down by on-hours vs. off-hours incidents.
  • Escalation rate analysis: Frequent or repeated escalations may indicate either that responders aren’t able to get to a computer or that they aren’t prepared to deal with the issue. Some escalations are expected, but watching escalation rates across services helps us spot organizational bugs.
  • Individual on-call load: Being primary on-call is a major responsibility, and high on-call load can cause burnout in engineers. To help understand on-call load at the individual level, we can perform user-specific variants of the above queries.
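
As a rough sketch of the interrupted-hours idea, the following query counts, per service, the distinct hour-long buckets that contained at least one incident over the past 28 days. (Bucketing by service rather than by the engineer who was paged keeps the example to the tables shown above; an off-hours variant could add a filter on extract(hour from incidents.created_at) and extract(dow from incidents.created_at).)

select
  services.name,
  count(distinct date_trunc('hour', incidents.created_at)) as interrupted_hours
from
  incidents,
  services
where
  incidents.created_at > now() - '28 days'::interval and
  incidents.service_id = services.id
group by
  services.name
order by
  interrupted_hours desc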

We’d love to hear how you use pd2pg. If you’ve got any feedback, please get in touch or send us a PR.


Open-sourcing tools for Hadoop

Colin Marc on November 21, 2014

Stripe’s batch data infrastructure is built largely on top of Apache Hadoop. We use these systems for everything from fraud modeling to business analytics, and we’re open-sourcing a few pieces today:

Timberlake

Timberlake is a dashboard that gives you insight into the Hadoop jobs running on your cluster. Jeff built it as a replacement for the web interfaces currently provided by YARN’s ResourceManager and MRv2’s JobHistory server, and it has some features we’ve found useful:

  • Map and reduce task waterfalls and timing plots
  • Scalding and Cascading awareness
  • Error tracebacks for failed jobs

Brushfire

Avi wrote Brushfire, a Scala framework for distributed learning of ensemble decision tree models. It’s inspired by Google’s PLANET, but built on Hadoop and Scalding. Designed to be highly generic, Brushfire can build and validate random forests and similar models from very large amounts of training data.

Sequins

Sequins is a static database for serving data in Hadoop’s SequenceFile format. I wrote it to provide low-latency access to key/value aggregates generated by Hadoop. For example, we use it to give our API access to historical fraud modeling features, without adding an online dependency on HDFS.

Herringbone

At Stripe, we use Parquet extensively, especially in tandem with Cloudera Impala. Danielle, Jeff, and Avi wrote Herringbone (a collection of small command-line utilities) to make working with Parquet and Impala easier.


If you’re interested in trying out these projects, there’s more info on how to use them (and how they were designed) in the READMEs. If you’ve got feedback, please get in touch or send us a PR.

Happy Hadooping!


Mayday.us

Avi Bryant on November 4, 2014

Earlier this year, after raising $1M in May, Lawrence Lessig’s Mayday PAC announced an ambitious goal to raise $5M by the 4th of July—a goal which they met mere hours before the deadline.

One of the remarkable things about this campaign is how transparent they’ve been through the whole process. In August, they released anonymized records of all contributions from the prior three months to “enable researchers to study the pattern and nature of the contributions” they received. (You can see some visualizations of this data, as well as download the full data set, here.)

Stripe helps Mayday to accept credit card payments, and with Mayday’s blessing, we did some digging of our own into the data relating to their $5M campaign. While we couldn’t look at every contribution (only those made using credit cards), we were able to discover certain patterns that wouldn’t necessarily show up in the published data set. We’d like to share here some of the interesting things we discovered.

Meeting a deadline

It shouldn’t be too surprising that the volume of donations went up as the July 4th deadline approached. By our count, over half of the donations were made in the last 48 hours of the campaign. We also saw some subtler changes in the final days:

  • Overall, a healthy 17% of donations came in via mobile devices. But on the last day of the campaign, mobile use doubled: 32% of donors donated from their phones or tablets instead of waiting to get to their laptops.
  • Repeat donations were three times as common in the final week. Between June 25th and July 4th, 14% of donations were from email addresses that had contributed at least once already. Although it’s true that repeat donations are more likely the longer a campaign goes on, it’s notable that in the previous week, repeat donations only made up 4% of the total—and only 1% the week before that.
Deadlines can be incredibly effective in fundraising: Mayday’s supporters were motivated to donate both immediately and repeatedly.

Checking out

Mayday’s donation page uses Stripe Checkout to collect payment information. Checkout optionally allows customers to store their payment info with Stripe, making future purchases easier. Since this works across all sites that use Checkout, Stripe already remembered the payment info for a portion of the people visiting Mayday for the first time. We were very curious to see how this would perform. Here’s what we found:

  • The overall conversion rate, once a visitor got to Checkout on Mayday for the first time, was 78%.
  • For users already logged in to a Stripe account, the rate shot up to 90%.

To put it another way, the chance that a visitor would abandon their donation at the Checkout step halved from 22% to 10%.

It’s worth repeating that these users weren’t on the Mayday site when they stored their details, and there’s no reason to expect they were any more likely to donate than anyone else—they just happened to have already used Stripe to buy something online in the past.

Even for repeat visitors to Mayday, who are more likely to donate than anyone else, having a Stripe account made a substantial difference. In general, visitors who had donated before had a healthy 87% conversion rate, but for those who were already logged in to Stripe, it was 94%.

Coming back for more

Looking at repeat donations prompted us to ask: do people donate more or less the second time? On average, the answer is roughly 50% more. While first donations had a mean of $88 and a median of $30, repeat donations had a mean of $114 and a median of $50.

Average doesn’t mean typical, however. If you look at each repeat donor one by one, it turns out they’re split almost exactly into thirds: 33% donate less the second time (most commonly half), 35% donate more (most commonly double), and 32% donate exactly the same. The averages get pushed up because doubling (and the occasional tripling or even quadrupling) makes a bigger difference overall than halving does.

Supporting your supporters

Supporting repeat donors was critical to the campaign’s success. When donors return to your site (probably at the last minute), make it easy for them: don’t make them find their laptop, and don’t make them enter their credit card again. Encourage them to increase their donation, but don’t expect it. When you make it easy enough, they’ll almost certainly help you out—94% of the time, anyway.


Stripe Dublin Meetup

Christina Mairs on October 29, 2014

Come join us and our friends from Intercom for a meetup in Dublin on Thursday night. A handful of Stripes will be around, and we’d love to see you all at Intercom’s new offices for a chat and a pint.

When:
Thursday, November 6th, starting at 6:30 PM
Where:
Intercom (2nd Floor, Stephen Court)

RSVP via our event page.


Game Day Exercises at Stripe:
Learning from `kill -9`

Marc Hedlund on October 28, 2014

We’ve started running game day exercises at Stripe. During a recent game day, we tested failing over a Redis cluster by running kill -9 on its primary node [0], and ended up losing all data in the cluster. We were very surprised by this, but grateful to have found the problem in testing. This result and others from this exercise convinced us that game days like these are quite valuable, and we would highly recommend them for others.

If you’re not familiar with game days, the best introductory article is this one from John Allspaw [1]. Below, we’ll lay out a playbook for how to run a game day, and describe the results from our latest exercise to show why we believe they are valuable.

How to run a game day exercise

The system we recently tested, scoring-srv, is one part of our fraud detection system. The scoring-srv processes run on a cluster of boxes and connect to a three-node Redis cluster to store fraud scoring data. Our internal charge-processing code connects to scoring-srv for each charge made on Stripe’s network, so it needs to be very low-latency; likewise, accurate scoring requires historical data, so it needs durable storage.

The scoring-srv developers and a member of our systems team, who could help run the tests, got together around a whiteboard. We drew a basic block diagram of the machines and processes, the data stores, and the network connections between the components. With that diagram, we were able to come up with a list of possible failures.

We came up with a list of six tests we could run easily:

  • destroying and restoring a scoring-srv box,
  • destroying progressively more scoring-srv boxes until calls to it began timing out,
  • partitioning the network between our charge processing code and scoring-srv,
  • increasing the load on the primary Redis node,
  • killing the primary Redis node, and
  • killing one of the Redis replicas.

Since the team was new to game days, we did not try to be comprehensive or clever. We instead chose the simplest, easiest to simulate failures we could think of. We’d take a blunt instrument, like kill -9 or aws ec2 terminate-instances, give the system a good hard knock, and see how it reacted [2].
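
Concretely, the failure injections were one-liners along these lines (the host name and instance ID here are made up for illustration):

# kill the Redis primary without a clean shutdown
$ ssh scoring-redis-1 'sudo kill -9 $(pgrep -f redis-server)'

# destroy a scoring-srv box outright
$ aws ec2 terminate-instances --instance-ids i-0123abcd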

For each test, we came up with one or more hypotheses for what would happen when we ran it. For instance, we guessed that partitioning the network between charge processing and scoring-srv would cause these calls to time out and fail open (that is, allow the charge to go through immediately). Then, we decided on an order to perform the tests, saved a backup of a recent Redis snapshot as a precaution, and dove in.

Here, then, is a quick-start checklist for running a game day:

  1. Get the development team together with someone who can modify the network and destroy or provision servers, and block off an afternoon to run the exercise.
  2. Make a simple block diagram of the machines, processes, and network connections in the system you’re testing.
  3. Come up with 5-7 of the simplest failures you can easily induce in the system.
  4. Write down one or more hypotheses for what will happen after each failure.
  5. Back up any data you can’t lose.
  6. Induce each failure and observe the results, filing bugs for each surprise you encounter.

Observations and results

We were able to terminate a scoring-srv machine and restore it with a single command in roughly the estimated time. This gave us confidence that replacing or adding cluster machines would be fast and easy. We also saw that killing progressively more scoring-srv machines never caused timeouts, showing we currently have more capacity than necessary. Partitioning the network between the charge-processing code and scoring-srv caused a spike in latency, whereas we’d expected calls to scoring-srv to time out and fail open quickly. This test also should have immediately alerted the teams responsible for this system, but did not.

The first Redis test went pretty well. When we stopped one of the replicas with kill -9, it flapped several times on restart, which was surprising and confusing to observe. As expected, though, the replica successfully restored data from its snapshot and caught up with replication from the primary.

Then we moved to the Redis primary node test, and had a bigger surprise. While developing the system, we had become concerned about latency spikes during snapshotting of the primary node. Because scoring-srv is latency-sensitive, we had configured the primary node not to snapshot its data to disk. Instead, the two replicas each made frequent snapshots. In the case of failure of the primary, we expected one of the two replicas to be promoted to primary; when the failed process came back up, we expected it to restore its data via replication from the new primary. That didn’t happen. Instead, when we ran kill -9 on the primary node (and it was restarted by daemontools), it came back up – after, again, flapping for a short time – with no data, but was still acting as primary. From there, it restarted replication and sent its empty dataset to the two replica nodes, which lost their datasets as a result. In a few seconds, we’d gone from a three-node replicated data store to an empty data set. Fortunately, we had saved a backup and were able to get the cluster re-populated quickly.

The full set of tests took about 3.5 hours to run. For each failure or surprise, we filed a bug describing the expected and actual results. We wound up with 15 total issues from the five tests we performed (we wound up skipping the Redis primary load test) – a good payoff for the afternoon’s work. Closing these, and re-running the game day to verify that we now know what to expect in these cases, will greatly increase our confidence in the system and its behavior.

Learning from the game day

The invalidation of our Redis hypothesis left us questioning our approach to data storage for scoring-srv. Our original Redis setup had all three nodes performing snapshots (that is, periodically saving data to disk). We had tested failover from the primary node due to a clean shutdown and it had succeeded. While analyzing the cluster once we had live data running through it, though, we observed that the low latency we’d wanted from it would hit significant spikes, above 1 second, during snapshotting.

Obviously these spikes were concerning for a latency-sensitive application. We decided to disable snapshotting on the primary node, leaving it enabled on the replica nodes, and the results were satisfying: the latency spikes disappeared as soon as snapshotting was disabled on the primary, and returned when we briefly re-enabled it.
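
In redis.conf terms, the change amounts to something like this (the replica save intervals shown are just Redis defaults, not our exact values):

# primary (latency-sensitive): disable RDB snapshotting entirely
save ""

# replicas: keep periodic snapshots for durability and recovery
save 900 1
save 300 10
save 60 10000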

Since we believed that failover would not be compromised in this configuration, this seemed like a good trade-off: relying on the primary node for performance and replication, and the replica nodes for snapshotting, failover, and recovery. As it turned out, this change was made the day before the game day, as part of the final lead-up to production readiness. (One could imagine making a similar change in the run-up to a launch!)

The game day wound up being the first full test of the configuration including all optimizations and changes made during development. We had tested the system with a primary node shutdown, then with snapshotting turned off on the primary, but this was the first time we’d seen these conditions operating together. The value of testing on production systems, where you can observe failures under the conditions you intend to ship, should be clear from this result.

After discussing the results we observed with some friends, a long and heated discussion about the failure took place on Twitter, in which Redis’ author said he had not expected the configuration we were using. Since there is no guarantee the software you’re using supports or expects the way you’re using it, the only way to see for certain how it will react to a failure is to try it.

While Redis is functional for scoring-srv with snapshotting turned on, the needs of our application are likely better served by other solutions. The trade-off between high-latency spikes, with primary node snapshotting enabled, versus total cluster data loss, with it disabled, leaves us feeling neither option is workable. For other configurations at Stripe – especially single-node topologies for which data loss is less costly, such as rate-limiting counters – Redis remains a good fit for our needs.

Conclusions

In the wake of the game day, we’ve run a simple experiment with PostgreSQL RDS as a possible replacement for the Redis cluster in scoring-srv. The results suggest that we could expect comparable latency without suffering snapshotting spikes. Our testing, using a similar dataset, had a 99th percentile read latency of 3.2 milliseconds, and a 99th percentile write latency of 11.3 milliseconds. We’re encouraged by these results and will be continuing our experiments with PostgreSQL for this application (and obviously, we will run similar game day tests for all systems we consider).

Any software will fail in unexpected ways unless you first watch it fail for yourself. Kelly Sommers made the same point well in the Twitter thread about this, and we completely agree.

We’d highly recommend game day exercises to any team deploying a complex web application. Whether your hypotheses are proven out or invalidated, either way you’ll leave the exercise with greater confidence in your ability to respond to failures, and less need for on-the-fly diagnosis. Having that happen for the first time while you’re rested, ready, and watching is the best failure you can hope for.

Notes

[0] We’ve chosen to use the terms “primary” and “replica” in discussing Redis, rather than the terms “master” and “slave” used in the Redis documentation, to support inclusivity. For some interesting and heated discussion of this substitution, we’d recommend this Django pull request and this Drupal change.

[1] Some other good background articles for further reading: “Weathering the Unexpected”; “Resilience Engineering: Learning to Embrace Failure”; “Training Organizational Resilience in Escalating Situations”; “When the Nerds Go Marching In.”

[2] If you’d like to run more involved tests and you’re on AWS, this Netflix Tech Blog post from last week describes the tools they use for similar testing approaches.

Thanks

Thanks much to John Allspaw, Jeff Hodges, Kyle Kingsbury, and Raffi Krikorian for reading drafts of this post, and to Kelly Sommers for permission to quote her tweet. Any errors are ours alone.


Apple Pay

Ray Morgan on October 20, 2014

Starting today, any Stripe user can begin accepting Apple Pay in their iOS apps. Apple Pay lets your customers frictionlessly pay with one touch using a stored credit card. We think Apple Pay will make starting a mobile business easier than ever.

Apple Pay doesn’t replace In-App Purchases. You should use Apple Pay when charging for physical goods (such as groceries, clothing, and appliances) or for services (such as club memberships, hotel reservations, and tickets for events). You should continue to use In-App Purchases to charge for virtual goods such as premium content in your app.

When your customer is ready to pay, they’ll authorize a payment using Touch ID. Then, Stripe generates a card token, which you can use to create charges as you normally would through the Stripe API. It just takes a few lines of code to set up and display the Apple Pay UI:

- (void)paymentAuthorizationViewController:(PKPaymentAuthorizationViewController *)controller
                       didAuthorizePayment:(PKPayment *)payment
                                completion:(void (^)(PKPaymentAuthorizationStatus))completion {

    [Stripe createTokenWithPayment:payment
                        completion:^(STPToken *token, NSError *error) {
        // charge your Stripe token as normal
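        // (e.g. send the token to your server to create the charge), then
        // report the result so the Apple Pay sheet can finish, for example:
        // completion(error == nil ? PKPaymentAuthorizationStatusSuccess
        //                         : PKPaymentAuthorizationStatusFailure);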
    }];
}

The following Stripe-powered apps already have Apple Pay enabled. You can try it out as soon as their updates hit the App Store. We owe them special thanks for all their feedback and bugsquashing over the past few weeks.

If you’ve got any questions, or need help getting started, please get in touch.



Open-Source Retreat meetup

Greg Brockman on October 16, 2014

A few months ago, we announced our Open-Source Retreat. Though we’d originally expected to sponsor two grantees, we ended up giving out three full grants (and then an additional shorter grant).

Here’s what happened with those grants:


If you’d like more details, we’ll be hosting a meetup at Stripe on Tuesday, October 21st. The grantees will talk about their projects and where they plan to go next. RSVP on our event page if you’d like to attend in person, or view our livestream.


If you have any questions about the retreat, the projects, or anything else, please get in touch!


Poodle

Steve Woodrow on October 15, 2014

As you’ve likely seen, a design flaw in SSL 3.0, nicknamed POODLE, was announced to the internet yesterday. Unfortunately, it’s not just an implementation flaw—the only way to prevent the attack is to turn off the affected ciphers altogether. Fortunately, the only common browser that still relies on SSL 3.0 is Internet Explorer 6 on Windows XP, which accounts for a small fraction of internet traffic.

We’ve deployed changes to ensure Stripe traffic remains secure.

Our response

We’ve taken an approach similar to Google’s: We’ve disabled the now easily-exploited CBC-mode SSL 3.0 ciphers. We’ve also deployed OpenSSL with support for TLS_FALLBACK_SCSV, which prevents newer browsers from being tricked into using SSL 3.0 at all. This means that IE6 customers will (for now) continue to be able to purchase from Stripe users, and there will be no immediate user-facing impact.

Ending support for SSL 3.0

While some mitigations exist, there is no configuration under which SSL 3.0 is totally secure. Moreover, with so many websites responding to POODLE by dropping SSL 3.0 support entirely, we expect that IE6 on XP will soon stop working on most of the web.

Our plan going forward:

  • Starting today, new Stripe users will not be able to send API requests or receive webhooks using SSL 3.0.
  • On November 15, 2014, we will drop SSL 3.0 support entirely (including for Stripe.js and Checkout).
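
If you want to check whether one of your own servers still accepts SSL 3.0 connections, an openssl one-liner will tell you (the hostname is illustrative; a handshake failure is the result you want to see):

$ openssl s_client -connect example.com:443 -ssl3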

In the meantime, we’ll notify any of our users who we expect to be affected by this change. If you have any questions, please don’t hesitate to get in touch.


Pagerbot

Karl-Aksel Puulmann on September 26, 2014

We’re open-sourcing Pagerbot, a tool we developed to make it easy to interact with PagerDuty through your internal chat system. (At the very least, we hope it'll help other companies respond to incidents like Shellshock or the ongoing AWS reboot cycle.)

Background

Like many tech companies, Stripe uses PagerDuty to help coordinate on-call schedules and incident response. The service is super reliable, does a great job of handling our normal rotations, and we appreciate being able to individually set preferences for how we want to get notified.

Fairly frequently, though, people will trade on-call shifts, whether because of travel, vacation, or even just making sure someone is keeping an eye on things while they’re out watching a movie. The communication about the trades mainly happens in one of our Slack channels.

Inspired by GitHub’s idea of chat-driven ops, we wanted PagerDuty schedule changes to happen in the same place as the rest of our communication.

We’ve tried to make Pagerbot easily handle our previous scheduling woes. For instance, with Stripes scattered all around the world, juggling timezones is very confusing, but if you don’t specify a timezone in your queries, Pagerbot automatically uses the timezone you configured in your PagerDuty profile.

Over time, we’ve added more commands to Pagerbot. For instance, based on Heroku’s incident response blog post, we added support for explicitly paging an individual.

Deploying to Heroku

Pagerbot supports both Slack and IRC. Although you can always run Pagerbot on your own infrastructure, we’ve also made it compatible with the new Heroku Button.


(Note: Heroku requires you to provide a credit card to enable the Heroku MongoDB add-on, though you won’t actually be charged anything.)

Once you’ve deployed Pagerbot to Heroku, there’s a built-in admin panel you can use to get things set up. You’ll need to tell Pagerbot about your PagerDuty subdomain, your chat credentials, and any aliases you want for either people or schedules.

We’ve also tried to make it easy to add new commands to Pagerbot by building a simple plugin architecture. Feel free to fork Pagerbot and add your own plugins.


We’ve been using Pagerbot as our main interface to PagerDuty for over two years now. If you use PagerDuty and either Slack or IRC, we hope you’ll find it useful — check it out, and let us know what you think!


Official Go support

Cosmin Nicolaescu on September 23, 2014

We’re fans of Go (both the language and the game) here at Stripe, and it seems we’re not the only ones. In recent months, we’ve seen Go’s popularity rise amongst our users and more generally in the open-source community, so we decided to add an official Stripe library for Go.

Requests made to Stripe using Go in 2014

We’ve also started using more Go at Stripe internally. For example, parts of the system that power Checkout are built in Go (and use this library). When porting some existing services to Go, we’ve noticed 2-4x increases in throughput (and our engineers were pretty happy with the development process).

To get started with our Go library, go get github.com/stripe/stripe-go and then import it in your code. Here’s how you’d create a charge:

import (
  "github.com/stripe/stripe-go"
  "github.com/stripe/stripe-go/currency"
)

params := &stripe.ChargeParams{
    Amount:   1000,
    Currency: currency.USD,
    Card:     &stripe.CardParams{Token: "tok_14dlcYGBoqcjK6A1Th7tPXfJ"},
    Desc:     "Gopher t-shirt",
}

There are two ways to make calls with the library, based on your needs. The simplest way is to use the global implicit client and invoke the APIs:

import (
  "github.com/stripe/stripe-go"
  "github.com/stripe/stripe-go/charge"
)

stripe.Key = "tGN0bIwXnHdwOa85VABjPdSn8nWY7G7I"
ch, err := charge.New(params)

If your scenario involves concurrent calls or you’re dealing with multiple API keys, you can use an explicit client:

import (
  "github.com/stripe/stripe-go/client"
)

sc := &client.Api{}
sc.Init("tGN0bIwXnHdwOa85VABjPdSn8nWY7G7I", nil)

ch, err := sc.Charges.New(params)

Given Go’s lack of built-in versioning, we highly recommend you use a package management tool to avoid any unforeseen upgrades.

The library features iterator-based listing, which handles pagination for you automatically. We’ve also added support for injecting mocks to make testing easier. And if you need more control, the library allows you to inject your own httpClient for transport-level customizations.
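
For example, listing recent charges looks roughly like this (a sketch only: the exact list-params and iterator accessor names may differ slightly between versions of the library):

import (
  "fmt"

  "github.com/stripe/stripe-go"
  "github.com/stripe/stripe-go/charge"
)

// Iterate over charges; the iterator fetches additional pages as needed.
params := &stripe.ChargeListParams{}
i := charge.List(params)
for i.Next() {
  c := i.Charge()
  fmt.Println(c.ID, c.Amount)
}
if err := i.Err(); err != nil {
  // handle the listing error
}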

Check out the docs (or the GoDoc) for more details and examples. Let me know if you have any feedback, or send a pull request my way!
