
Introducing Veneur: high performance and global aggregation for Datadog

Cory Watson on October 18, 2016 in Engineering

When a company writes about their observability stack, they often focus on sweet visualizations, advanced anomaly detection or innovative data stores. Those are well and good, but today we’d like to talk about the tip of the spear when it comes to observing your systems: metrics pipelines! Metrics pipelines are how we get metrics from where they happen—our hosts and services—to storage quickly and efficiently so they can be queried, all without interrupting the host service.

First, let’s establish some technical context. About a year ago, Stripe started the process of migrating to Datadog. Datadog is a hosted product that offers metric storage, visualization and alerting. With them we can get some marvelous dashboards to monitor our Observability systems:

Observability Overview Dashboard (a screenshot of a Datadog dashboard with several graphs)

Previously, we’d been using some nice open-source software, but it was sadly unowned and unmaintained internally. Facing the high cost—in money and people—of maintaining it ourselves, we decided that outsourcing to Datadog was the right call. Nearly a year later, we’re quite happy with the improved visibility and reliability we’ve gained through significant effort in this area. One of the most interesting aspects of this work was figuring out how to even metric!


Using StatsD for metrics

There are many ways to instrument your systems. Our preferred method is the StatsD style: a simple text-based protocol with minimal performance impact. Code is instrumented to emit UDP packets to a central server at runtime whenever something worth measuring happens.
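To make this concrete, here is a minimal Ruby sketch of what StatsD-style instrumentation looks like on the wire (the metric name is made up for illustration). The format is just "name:value|type" in a UDP datagram, where "ms" marks a timer:

require "socket"

statsd = UDPSocket.new

start = Time.now
# ... the work being measured ...
elapsed_ms = ((Time.now - start) * 1000).round

# Fire and forget: send returns immediately and nothing acknowledges receipt.
statsd.send("api.charge.create.duration:#{elapsed_ms}|ms", 0, "127.0.0.1", 8125)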

Like all of life, this choice has tradeoffs. For the sake of brevity, we’ll quickly mention the two downsides of StatsD that are most relevant to us: its use of inherently unreliable UDP, and its role as a Single Point of Failure for timer aggregation.

As you may know, UDP is a “fire and forget” protocol that does not require any acknowledgement from the receiver. This makes UDP pretty fast for the client, but it also means the client has no way to know whether the metric was received by anyone! Combine that with network conditions and host-level protections that can silently drop traffic, and you’ve got a problem.

Another problem is the Single Point of Failure. The poor StatsD server has to process a lot of UDP packets if you’ve got a non-trivial number of sources. Add to that the nightmare of the machine going down and the need to shard or use other tricks to scale out, and you’ve got your work cut out for you.


DogStatsD and the lack of “global”

Aware that a central StatsD server can be a problem for some, Datadog takes a different approach: each host runs an instance of DogStatsD as part of the Datadog agent. This neatly sidesteps most performance problems, but it created a large feature regression for Stripe: no more global percentiles. Datadog only supports per-host aggregations for histograms, timers and sets.

Remember that, with StatsD, you emit a metric to the downstream server each time the event occurs. If you’re measuring API requests and emitting that metric on each host, each timer now goes to the local Datadog agent, which aggregates the values and flushes them to Datadog’s servers in batches. For counters, this is great because you can just add them together! But for percentiles we’ve got problems. Imagine you’ve got hundreds of servers, each handling an unequal number of API requests with unequal workloads. The resulting per-host percentiles are not representative of how the whole API is behaving. Even worse, once we’ve generated the percentiles for our histograms, there is no meaningful way, mathematically, to combine them. (More precisely, the percentiles of arbitrary subsamples of a distribution are not sufficient to compute the percentiles of the full distribution.)

Stripe needs to know the overall percentiles because each host’s histogram only has a small subset of random requests. We needed something better!


Enter Veneur

To provide these features to Stripe we created Veneur, a DogStatsD server with global aggregation capability. We’re happily running it in production and you can too! It’s open-source and we’d love for you to take a look.

Veneur runs in place of Datadog’s bundled DogStatsD server, listening on the same port. It flushes metrics to Datadog just like you’d expect. That’s where the similarities end, however, and the magic begins.

Instead of aggregating the histogram and emitting percentiles at flush time, Veneur forwards the histogram on to a global Veneur instance which merges all the histograms and flushes them to Datadog at the next window. It adds a bit of delay—one flush period—but the result is a best-of-both mix of local and global metrics!

We monitor the performance of many of our API calls, such as this chart of various percentiles for creating a charge. Red bars are deploys!


Approximate, mergeable histograms

As mentioned earlier, the essential problem with percentiles is that, once reported, they can’t be combined together. If host A received 20 requests and host B received 15, the two numbers can be added to determine that, in total, we had 35 requests. But if host A has a 99th percentile response time of 8ms and host B has a 99th percentile response time of 10ms, what’s the 99th percentile across both hosts?

The answer is, “we don’t know”. Taking the mean of those two percentiles results in a number that is statistically meaningless. If we have more than two hosts, we can’t simply take the percentile of percentiles either. We can’t even use the percentiles of each host to infer a range for the global percentile—the global 99th percentile could, in rare cases, be larger than any of the individual hosts’ 99th percentiles. We need to take the original set of response times reported from host A, and the original set from host B, and combine those together. Then, from the combined set, we can report the real 99th percentile across both hosts. That’s what forwarding is for.
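Here’s a toy Ruby sketch of the problem, using synthetic data and a naive percentile calculation: averaging two hosts’ 99th percentiles gives a number with no fixed relationship to the 99th percentile of the merged samples, which is the number we actually want.

# Naive percentile by sorting; fine for demonstration purposes.
def percentile(samples, p)
  sorted = samples.sort
  sorted[((p / 100.0) * (sorted.size - 1)).round]
end

host_a = Array.new(1_000) { rand(1..20) }  # busy host, fast responses (ms)
host_b = Array.new(50)    { rand(30..80) } # quiet host, slow responses (ms)

average_of_p99s = (percentile(host_a, 99) + percentile(host_b, 99)) / 2.0 # meaningless
global_p99      = percentile(host_a + host_b, 99)                         # what we want

puts "average of per-host p99s: #{average_of_p99s}, true global p99: #{global_p99}"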

Of course, there are a few caveats. If each histogram stores all the samples it received, the final histogram on the global instance could potentially be huge. To sidestep this issue, Veneur uses an approximating histogram implementation called a t-digest, which uses constant space regardless of the number of samples. (Specifically, we wrote our own Go port of it.) As the name would suggest, approximating histograms return approximate percentiles with some error, but this tradeoff ensures that Veneur’s memory consumption stays under control under any load.


Degradation

The global Veneur instance is also a single point of failure for the metrics that pass through it. If it went down we would lose percentiles (and sets, since those are forwarded too). But we wouldn’t lose everything. Besides the percentiles, StatsD histograms report a counter of how many samples they’ve received, and the minimum/maximum samples. These metrics can be combined without forwarding (if we know the maximum response time on each host, the maximum across all hosts is just the maximum of the individual values, and so on), so they get reported immediately without any forwarding. Clients can opt out of forwarding altogether, if they really do want their percentiles to be constrained to each host.
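For reference, here is a quick Ruby sketch (with invented per-host numbers) of why counts, minimums and maximums merge trivially without any forwarding:

per_host = [
  { count: 120, min: 2.1, max: 310.0 }, # host A's local flush
  { count: 45,  min: 4.8, max: 880.0 }, # host B's local flush
]

global = {
  count: per_host.sum { |h| h[:count] },   # counts simply add
  min:   per_host.map { |h| h[:min] }.min, # min of the per-host minimums
  max:   per_host.map { |h| h[:max] }.max, # max of the per-host maximums
}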

Veneur’s Overview Dashboard, well instrumented and healthy!


Other cool features (and a few missteps)

Veneur—named for the Grand Huntsman of France, master of dogs!—also has a few other tricks:

  • Drop-in replacement for Datadog’s included DogStatsD. It even processes events and service checks!
  • Written in Go, so deployment is a single binary with minimal fuss
  • Use of HyperLogLogs for counting the unique members of a set efficiently with fixed memory consumption
  • Extensive metrics (natch) so you can watch the watchers
  • Efficient compressed, chunked POST requests sent concurrently to Datadog’s API
  • Extremely fast

Over the course of Veneur’s development we also iterated a lot. Our initial implementation was purely a global DogStatsD implementation without the forwarding or merging. It was really fast, but we quickly decided that processing more packets faster wasn’t really going to get us very far.

Next we took some twists and turns through “smart clients” that tried to route metrics to the appropriate places. This was initially promising, but we found that supporting this for each of our language runtimes and use cases was prohibitively expensive and undermined (Dog)StatsD’s simplicity. Some of our instrumentation is as simple as an nc command and that simplicity is very helpful to quickly instrument things.

While our work was mostly transparent, we did cause some trouble when we initially turned the global features back on. Some teams had come to rely on per-host aggregation for very specific metrics. Then, when we had to fall back to host-local aggregation during some refactoring, we caused problems for teams who had just adapted to the global features. Argh! Each of these wound up being a positive learning experience, and we found Stripe’s engineers to be very accommodating. Thanks!


Thanks and future work

The Observability team would like to thank Datadog for their support and advice in the creation of Veneur. We’d also like to thank our friends and teammates at Stripe for their patience as we iterated to where we are today. Specifically for the occasional broken charts, metrics outages and other hilarious-in-hindsight problems we caused along the way.

We’ve been running Veneur in production for months and have been enjoying the fruits of our labor. We’re now iterating at a stable, more mature pace, making efficiency improvements informed by what we’ve learned from monitoring its behavior in production. We hope to leverage Veneur in the future for continued improvements to the features and reliability of our metrics pipeline. We’ve discussed additional protocol features, unified formats, per-team accounting and even incorporating other sensor data like tracing spans. Veneur’s speed, instrumentation and flexibility give us lots of room to grow and improve. Someone’s gotta feed those wicked cool visualizations and anomaly detectors!


Running three hours of Ruby tests in under three minutes

Nelson Elhage on August 13, 2015 in Engineering

At Stripe, we make extensive use of automated testing to help ensure the stability and reliability of our services. We have expansive test coverage for our API and other core services, we run tests on a continuous integration server over every git branch, and we never deploy without green tests.

The size and complexity of our codebase has grown over the past few years—and so has the size of the test suite. As of August 2015, we have over 1400 test files that define nearly 15,000 test cases and make over 130,000 assertions. According to our CI server, the tests would take over three hours if run sequentially.

With a large (and growing) group of engineers waiting for those tests with every change they make, the speed of running tests is critical. We’ve used a number of hosted CI solutions in the past, but as test runtimes crept past 10 minutes, we brought testing in-house to give us more control and room for experimentation.

Recently, we’ve implemented our own distributed test runner that brought the runtime of our tests to just under three minutes. While some of these tactics are specific to our codebase and systems, we hope sharing what we did to improve our test runtimes will help other engineering organizations.


Forking executor

We write tests using minitest, but we've implemented our own plugin to execute tests in parallel across multiple CPUs on multiple servers.

In order to get maximum parallel performance out of our build servers, we run tests in separate processes, allowing each process to make maximum use of the machine's CPU and I/O capability. (We run builds on Amazon's c4.8xlarge instances, which give us 36 cores each.)

Initially, we experimented with using Ruby’s threads instead of multiple processes, but discovered that using a large number of threads was significantly slower than using multiple processes. This slowdown was present even if the Ruby threads were doing nothing but monitoring subprocess children. Our current runner doesn’t use Ruby threads at all.

When a test run starts, we first load all of our application code into a single Ruby process so we don’t have to parse and load all our Ruby code and gem dependencies multiple times. This process then calls fork a number of times to produce N different worker processes that each have all of the code pre-loaded and ready to go.

Each of those workers then starts executing tests. As they execute tests, our custom executor forks further: Each process forks and executes a single test file’s worth of tests inside the child process. The child process writes the results to the parent over a pipe, and then exits.

This second round of forking provides a layer of isolation between tests: If a test makes changes to global state, running the test inside a throwaway process will clean everything up once that process exits. Isolating state at a per-file level also means that running individual tests on developer machines will behave similarly to the way they behave in CI, which is an important debugging affordance.
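Here is a stripped-down Ruby sketch of that two-level forking pattern (not our actual runner; run_minitest_file is a hypothetical helper standing in for the real execution logic). The parent hands out one file at a time, and each file runs in a throwaway child that ships its results back over a pipe:

# Hypothetical stand-in for "run every test in one file and return a summary".
def run_minitest_file(path)
  { file: path, failures: 0 }
end

test_files  = Dir.glob("test/**/*_test.rb")
all_results = []

test_files.each do |file|
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    results = run_minitest_file(file)    # any global state the tests mutate...
    writer.write(Marshal.dump(results))  # ...dies with this child process
    writer.close
    exit!(0)                             # skip at_exit hooks in the child
  end
  writer.close
  all_results << Marshal.load(reader.read) # blocks until the child closes its end
  reader.close
  Process.wait(pid)
end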


Docker

The custom forking executor spawns a lot of processes, and creates a number of scratch files on disk. We run all builds at Stripe inside of Docker, which means we don't need to worry about cleaning up all of these processes or this on-disk state. At the end of a build, all of the state—be that in-memory processes or on disk—will be cleaned up by a docker stop, every time.

Managing trees of UNIX processes is notoriously difficult to do reliably, and it would be easy for a system that forks this often to leak zombie processes or stray workers (especially during development of the test framework itself). Using a containerization solution like Docker eliminates that nuisance, and eliminates the need to write a bunch of fiddly cleanup code.


Managing build workers

In order to run each build across multiple machines at once, we need a system to keep track of which servers are currently in-use and which ones are free, and to assign incoming work to available servers.

We run all our tests inside of Jenkins; rather than writing custom code to manage worker pools, we (ab)use a Jenkins plugin called the matrix build plugin.

The matrix build plugin is designed for projects where you want a "build matrix" that tests a project in multiple environments. For example, you might want to build every release of a library against several versions of Ruby and make sure it works on each of them.

We misuse it slightly by configuring a custom build axis, called BUILD_ROLE, and telling Jenkins to build with BUILD_ROLE=leader, BUILD_ROLE=worker1, BUILD_ROLE=worker2, and so on. This causes Jenkins to run N simultaneous jobs for each build.

Combined with some other Jenkins configuration, we can ensure that each of these builds runs on its own machine. Using this, we can take advantage of Jenkins worker management, scheduling, and resource allocation to accomplish our goal of maintaining a large pool of identical workers and allocating a small number of them for each build.


NSQ

Once we have a pool of workers running, we decide which tests to run on each node.

One tactic for splitting work—used by several of our previous test runners—is to split tests up statically. You decide ahead of time which workers will run which tests, and then each worker just runs those tests start-to-finish. A simple version of this strategy just hashes each test and takes the result modulo the number of workers; sophisticated versions can record how long each test took, and try to divide tests into groups of equal total runtime.
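The hashing variant might look something like this Ruby sketch (the environment variable names are made up):

require "digest"

# Static sharding: every worker deterministically selects the same subset of
# test files, with no coordination at runtime.
worker_index = Integer(ENV.fetch("WORKER_INDEX", "0"))
num_workers  = Integer(ENV.fetch("NUM_WORKERS", "4"))

my_files = Dir.glob("test/**/*_test.rb").select do |path|
  Digest::MD5.hexdigest(path).to_i(16) % num_workers == worker_index
end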

The problem with static allocations is that they’re extremely prone to stragglers. If you guess wrong about how long tests will take, or if one server is briefly slow for whatever reason, it’s very easy for one job to finish far after all the others, which means slower, less efficient, tests.

We opted for an alternate, dynamic approach, which allocates work in real-time using a work queue. We manage all coordination between workers using an nsqd instance. nsq is a super-simple queue that was developed at Bit.ly; we already use it in a few other places, so it was natural to adopt here.

Using the build number provided by Jenkins, we separate distinct test runs. Each run makes use of three queues to coordinate work:

  • The node with BUILD_ROLE=leader writes each test file that needs to be run into the test.<BUILD_NUMBER>.jobs queue.
  • As workers execute tests, they write the results back to the test.<BUILD_NUMBER>.results queue, where they are collected by the leader node.
  • Once the leader has results for each test, it writes "kill" signals to the test.<BUILD_NUMBER>.shutdown queue, one for each worker machine. A thread on each worker pulls off a single event and terminates all work on that node.

Each worker machine forks off a pool of processes after loading code. Each of those processes independently reads from the jobs queue and executes tests. By relying on nsq for coordination even within a single machine, we have no need for a second, machine-local, communication mechanism, which might risk limiting our concurrency across multiple CPUs.
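As a rough illustration of the leader’s side of that coordination: nsqd also exposes an HTTP publish endpoint (POST /pub?topic=...), so enqueueing one job per test file can be as simple as the following Ruby sketch. The host name and environment variable here are assumptions for illustration, not our actual configuration.

require "net/http"
require "uri"

build_number = ENV.fetch("BUILD_NUMBER", "0")
jobs_topic   = "test.#{build_number}.jobs"

# nsqd's HTTP interface (default port 4151) accepts single-message publishes;
# the message body is just the path of the test file to run.
nsqd_pub = URI("http://nsqd.example.internal:4151/pub?topic=#{jobs_topic}")

Dir.glob("test/**/*_test.rb").each do |file|
  Net::HTTP.post(nsqd_pub, file)
end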

Other than the leader node, all nodes are homogeneous; they blindly pull work off the queue and execute it, and otherwise behave identically.

Dynamic allocation has proven to be hugely effective. All of our worker processes across all of our different machines reliably finish within a few seconds of each other, which means we're making excellent use of our available resources.

Because workers only accept jobs as they go, work remains well-balanced even if things go slightly awry: Even if one of the servers starts up slightly slowly, or if there isn't enough capacity to start all four servers right at once, or if the servers happen to be on different-sized hardware, we still tend to see every worker finishing essentially at once.


Visualization

Reasoning about and understanding performance of a distributed system is always a challenging task. If tests aren't finishing quickly, it's important that we can understand why so we can debug and resolve the issue.

The right visualization can often capture performance characteristics and problems in a very powerful (and visible) way, letting operators spot the problems immediately, without having to pore through reams of log files and timing data.

To this end, we've built a waterfall visualizer for our test runner. The test processes record timing data as they run, and save the results in a central file on the build leader. Some JavaScript d3 code can then assemble that data into a waterfall diagram showing when each individual job started and stopped.

Waterfall diagrams of a slow test run and a fast test run.

Each group of blue bars shows tests run by a single process on a single machine. The black lines that drop down near the right show the finish times for each process. In the first visualization, you can see that the first process (and to a lesser extent, the second) took much longer to finish than all the others, meaning a single test was holding up the entire build.

By default, our test runner uses test files as the unit of parallelism, with each process running an entire file at a time. Because of stragglers like the above case, we implemented an option to split individual test files further, distributing the individual test classes in the file instead of the entire file.

If we apply that option to the slow files and re-run, all the "finished" lines collapse into one, indicating that every process on every worker finished at essentially the same time—an optimal usage of resources.

Notice also that the waterfall graphs show processes generally going from slower tests to faster ones. The test runner keeps a persistent cache recording how long each test took on previous runs, and enqueues tests starting with the slowest. This ensures that slow tests start as soon as possible and is important for ensuring an optimal work distribution.
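A small Ruby sketch of that slowest-first ordering, assuming a hypothetical JSON cache of per-file durations from earlier runs; files with no recorded time are treated as slow so they are never accidentally scheduled last:

require "json"

# Hypothetical cache: { "test/foo_test.rb" => 42.7, ... } seconds per file.
cache_path = "test_durations.json"
durations  = File.exist?(cache_path) ? JSON.parse(File.read(cache_path)) : {}

enqueue_order = Dir.glob("test/**/*_test.rb").sort_by do |file|
  # Negate so the longest-running files sort (and get enqueued) first.
  -durations.fetch(file, Float::INFINITY)
end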


The decision to invest effort in our own testing infrastructure wasn't necessarily obvious: we could have continued to use a third-party solution. However, spending a comparatively small amount of effort allowed the rest of our engineering organization to move significantly faster—and with more confidence. I'm also optimistic this test runner will continue to scale with us and support our growth for several years to come.

If you end up implementing something like this (or have already), send me a note! I'd love to hear what you've done, and what's worked or hasn't for others with similar problems.


Libscore

Greg Brockman on December 16, 2014 in Engineering

When we announced the Open Source retreat, we'd pictured it primarily as giving people the opportunity to work on projects they'd already been meaning to do. However, the environment we provided also became a place for people to come up with new ideas and give them a try. One of these ideas, Libscore, is launching publicly today.

Top libraries used across the web.

Libscore, built by Julian Shapiro with support from both us and DigitalOcean, makes it possible for frontend developers to see where their work is being used. The service periodically crawls the top million websites, determines the JavaScript libraries in use on each, and makes that data publicly queryable.

For example, wondering about MVC framework popularity? Backbone is used on about 8,000 of the top million sites while Ember appears on only 185. You can also query which libraries are used on your favorite site, or view some precompiled aggregates.


We were attracted to Libscore because it sounded like internet infrastructure that should exist. Sometimes—as with our support for Alipay—we get to build such components directly; sometimes, it seems better to support something external—as with Stellar. If you have other ideas, please let us know (or work on them here!).


Scaling email transparency

Greg Brockman on December 8, 2014 in Engineering

In February 2013, we blogged about email transparency at Stripe. Since then a number of other companies have implemented their own versions of it (which a few have talked about publicly). We often get asked whether email transparency is still around, and if so, how we've scaled it.

Email transparency continues to be one important tool for state transfer at Stripe. The vast majority of Stripe email (excluding particularly sensitive classes of email, or threads where a participant has a strong expectation of privacy) remains available to everyone throughout the company.

Today we're publishing two key components that have allowed us to scale it this far: our list manager tool and updated internal documentation reflecting what we've learned over the past year and a half. Hopefully these will make it easier for others to run email transparency at their own organizations.

Gaps

In the time since our first post, we've grown our mailing list count almost linearly with headcount: from 40 employees and 119 mailing lists in February 2013 to now 164 people and 428 lists. A plurality are project lists (sys@, sys-archive@, sys-bots@, sys-ask@), but there's also a long tail on topics ranging from country operations (australia@) to ideas for things Stripe should try (crazyideas@).

We use Google Groups for our email list infrastructure. Today we're releasing the web interface we've built on Google's APIs to make managing many list subscriptions (and associated filters) easy. This interface, called Gaps, lets you do things like:

  • Quickly subscribe to or unsubscribe from a list.
  • View your organization's lists (categorized by topic), and which you're subscribed to (including indirect subscriptions through other lists).
  • Get notifications when new lists are created.
  • Generate and upload Gmail filters.

Here's a quick sample of what Gaps looks like:

Check it out and let us know what you think!

Updated internal documentation

Scaling email transparency has required active cultural effort and adaptation. As our team grew, we'd notice that formerly good patterns could turn sour. For example, at first email transparency would improve many conversations by letting people drop in with helpful tidbits. But with a larger team, having many people jumping into a conversation would instead grind the thread to a halt.

As we've identified cases where email transparency didn't scale well, we've made changes to our culture. Below is our updated internal documentation on how we approach email transparency. It embodies what we've learned about how to make email transparency work at an organization of our size:

Email transparency (from our internal wiki)

One of Stripe's core strategies is hiring great people and then making sure they have enough information to make good local decisions. Email transparency is one system that has helped make this possible. As with any rule at Stripe, you should consider the recommendations in this document to be strong defaults, which you should just override if they don't make sense in a particular circumstance.

How it works

Email transparency is fairly simple: make your emails transparent by CCing a list, and make it easy for others to be transparent by observing the responsibilities below.

The main mechanisms of email transparency are the specially-designated archive lists, to which you should CC all mail that would normally be off-list, but only because of its apparent irrelevance rather than out of any particular desire for secrecy. The goal isn't to share things that would otherwise be secret: it's to unlock the wealth of information that would otherwise be accidentally locked up in a few people's inboxes.

In general, if you are debating whether to include an archive list, you should include it. This includes internal person-to-person email that you would normally leave off a list, emails to vendors, and scheduling email. Don't be afraid to send "boring" email to an archive list — people have specifically chosen to subscribe to that list. You should expect most people to autoarchive this list traffic (hence the name!), and then dip into it as they prefer.

If you're new to it, email transparency always feels a bit weird at first, but it doesn't take long to get used to it.

What's the point?

Email transparency is something few other organizations try to do. It's correspondingly on us to make sure we have really good indicators for how it's valuable. Here's a sample of things people have found useful about email transparency:

  • Provides the full history on interactions that are relevant to you. If you're pulled into something, you can always pull up the relevant state. This is especially useful for external communications with users or vendors.
  • Provides a way for serendipitous interactions to happen — someone who has more state on something may notice what's happening and jump in to help (subject to the limitations about jumping in).
  • Lets you keep up with things going on at various other parts of Stripe, at whatever granularity you want. This reduces siloing, makes it easier to function as a remote employee (and even just to know what we're working on), and generally increases the feeling of connectedness.
  • Requires ~no additional effort from the sender.
  • Makes conversations persistent and linkable, which is particularly useful for new hires.
  • Forces us to think about how we're segmenting information — if you're tempted to send something off-list, you should think through why.
  • Makes spin-up easier by immersing yourself in examples of Stripe tone and culture, and enabling you to answer your own questions via the archives.
  • Helps you learn how different parts of the business work.

Reader responsibilities

Email transparency cuts two ways. Being able to see the raw feed of happenings at Stripe as they unfold is awesome, but it also implies an obligation to consume responsibly. Overall, threads on an archive list merit a level of civil inattention — you should feel free to read them, but be careful about adding your own contributions.

  • Talk to people rather than silently judging. If you see something on an email list that rubs you the wrong way or that you think doesn't make sense (e.g. "why are we working on that?", "that email seems overly harsh/un-Stripelike"), you should talk to that person directly (or their manager, if there's a reason you can't talk to them about it). Remember that we hire smart people, and if something seems off you're likely missing context or a view of the larger picture. No one wants their choice to send email on-list to result in a bunch of people making judgements without telling them, or chattering behind their back — if that can happen, then people will be less likely to CC a list in the future.
  • Avoid jumping in. A conversation on an archive list should be considered a private conversation between the participants. When people jump into the thread, it often grinds to a halt and nothing gets done. There will be some very rare occasions (e.g. if you have some factual knowledge the participants probably don't) where it's OK to join the thread, but in practice these should be very rare. By convention, the people on the thread may ignore your email; don't take it personally — it's just a way of making sure that email transparency doesn't accidentally make email communication harder. Knowing when to jump in is an art, and when in doubt, don't.
  • Don't penalize people for choosing to CC a list. Ideally, people are writing their emails exactly as they would if they were off-list. So be cognizant about creating additional overhead for people because they chose to CC the list. There may be typos or things that you're wondering about or don't make sense. If you're *concerned* about something being actively bad, then you should talk to the person, but if it's something small (e.g. "there's a typo", "this tone isn't Stripelike", "this conversation seems like a waste of time"), you should trust that there's either a reason, or the person's manager will be on the lookout to help them (especially if they're new).
  • Help others live by the above responsibilities. The only way we can preserve email transparency is by collectively nudging each other onto the right course. Whether it's poking someone to CC a list, or telling someone to stop venting about an email and just go talk to the author, the person responsible for fixing the shortfalls you see is you.

Common scenarios/FAQs

  • I don't mind people being able to read this boring scheduling email, but I don't think it's worth anyone's time to read. You should still send it to an archive list! Archive lists are intended to be the feed of everything going on within a particular team — let the people who are subscribing decide if it's worth their time or not.
  • I have a small joke on this thread. Should I CC it to the list, or just send it to one person (or a small set of people)? Small jokes are good! The main cost is potentially derailing the relevant thread. So generally, if it's a productive, focused thread, just send your joke off-list, but if it's already fairly broad, then you should feel free to send the joke publicly.
  • I feel like I need to write my email for the broad audience that might be reading it, rather than the one person it's actually meant for. The only change between how you write emails for email transparency and how you would write them privately to other Stripes should be that one has a CC. That is, if you feel a need to rewrite your emails for the audience, then that likely indicates a bug in the organization we should fix. If you notice yourself having this tendency, talk to gdb — we should be able to shift the norms of the organization so this isn't a problem.
  • How do we make sure this respects outside people's expectations? In many ways, email transparency is just a more extreme version of what happens at other organizations — since it's opt-in, all of the emails are human-vetted to be shareable. Email transparency is mostly about changing the default thresholds. As a corollary, if someone requests that their email not be shared, then certainly respect their request.

Common exceptions

Like any tool, email transparency has its limitations. Since it's in many ways a one-way communication system, email transparency is bad for sensitive situations where people may react strongly. It's also important to preserve people's privacy. The following is a description of the classes of things which you may not see on an archive list.

  • Anything personnel related (e.g. performance).
  • Some recruiting conversations, especially during closing or when people are confidentially looking around. People's decision-making process at that stage is usually quite personal, and even if people have a hard time picking Stripe, we want to make sure that they start with a blank slate.
  • Communications of mixed personal and professional nature (e.g. recruiting a friend).
  • Early stage discussions about topics that will affect Stripes personally (e.g. changing our approach to compensation).
  • Some particularly sensitive partnerships.

As we said in the original email transparency post, it's hard to know how far it will scale. That doesn't bother us much: we continue to do unscalable things until they break down. The general sentiment at Stripe is that email transparency adds a lot of value, and it seems we'll keep being able to find tweaks to keep it going.

Hopefully these components will help you with email transparency in your own organization. If you end up implementing something similar, I'd love to hear about it!


PagerDuty analytics with Postgres

Mark McGranaghan on December 2, 2014 in Engineering

We’re open-sourcing the tool we use to collect and analyze on-call data from PagerDuty. We use pd2pg to improve the on-call experience for engineers at Stripe, and we think it’ll be useful for your teams too.

PagerDuty data in Postgres

PagerDuty is an important source of data about how services behave in production and the on-call load experienced by engineers. This data has been instrumental for managing and evolving our on-call rotations: over five months, we’ve reduced on-call load for our systems team by about 75%.

We import data from the PagerDuty API into a Postgres database using pd2pg, where we can use the full power of Postgres’ SQL queries.

Here’s how you import your data:

$ export PAGERDUTY_SUBDOMAIN="your-company"
$ export PAGERDUTY_API_KEY="..."
$ export DATABASE_URL="postgres://..."
$ bundle exec ruby pd2pg.rb

The script incrementally updates existing data, so it’s trivial to refresh your database periodically. (It also fetches historical data from your account, so you can get started with long-term analysis right away.)

Querying PagerDuty data with SQL

You can start analyzing and exploring your PagerDuty data once it’s in the database with psql:

$ psql $DATABASE_URL
> \d incidents
           Column            |           Type           | Modifiers
-----------------------------+--------------------------+-----------
 id                          | character varying        | not null
 incident_number             | integer                  | not null
 created_at                  | timestamp with time zone | not null
 html_url                    | character varying        | not null
 incident_key                | character varying        |
 service_id                  | character varying        |
 escalation_policy_id        | character varying        |
 trigger_summary_subject     | character varying        |
 trigger_summary_description | character varying        |
 trigger_type                | character varying        | not null

> select count(*) from incidents;
 count
-------
 3466
(1 row)

As an example of a real query, here’s how you’d count the number of incidents per service over the past 28 days:

select
  services.name,
  count(incidents.id)
from
  incidents,
  services
where
  incidents.created_at > now() - '28 days'::interval and
  incidents.service_id = services.id
group by
  services.name
order by
  count(incidents.id) desc

How we use pd2pg at Stripe

  • Weekly team report: Our sys team reviews a detailed on-call report each week. It covers all alerts that were either sent by a team-owned service or fielded by one of the team’s engineers (which can include escalations from other teams’ services). This detailed report helps us understand the types of incidents we’re seeing so we can prevent or respond to them better.
  • Per-service incident counts: Aggregates like per-service incident counts help give us a high-level overview. (They’re not actionable results in themselves, but do show us high-load services we should review further.)
  • Interrupted hours metric: A common way to measure on-call load is counting the number of incidents over a period of time. Sometimes, this over-represents issues that cause several related alerts to fire at the same time (which aren’t actually more costly than a single alert firing). To get a more accurate view of on-call load, we calculate an "interrupted hours" metric that counts the intervals in which an engineer receives one or more pages; a rough sketch of the calculation follows this list. This metric provides pretty good insight into real on-call load by suppressing noise from issues that result in multiple pages and more heavily weighting incidents with escalations.
  • On-hours vs. off-hours alerts: Pages during the work day are less costly than ones that wake an engineer up at 3am on a Sunday. So, we look at the metrics discussed above broken down by on-hours vs. off-hours incidents.
  • Escalation rate analysis: Frequent or repeated escalations may indicate either that responders aren’t able to get to a computer, or that they aren’t prepared to deal with the issue. Some escalations are expected, but watching escalation rates across services helps us spot organizational bugs.
  • Individual on-call load: Being primary on-call is a major responsibility, and high on-call load can cause burnout in engineers. To help understand on-call load at the individual level, we can perform user-specific variants of the above queries.
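As a rough sketch of the interrupted-hours idea mentioned above (in Ruby rather than SQL, and assuming incidents is an array of rows with a :created_at timestamp pulled from the incidents table):

require "set"
require "time"

# Count the distinct clock hours in which at least one page fired, instead of
# counting every individual incident.
def interrupted_hours(incidents)
  hours = Set.new
  incidents.each do |incident|
    t = Time.parse(incident[:created_at].to_s)
    hours << [t.year, t.month, t.day, t.hour]
  end
  hours.size
end

# Ten incidents inside the same hour count as a single interrupted hour:
example = Array.new(10) { { created_at: "2014-11-30 03:10:00 UTC" } }
puts interrupted_hours(example) # => 1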

We’d love to hear how you use pd2pg. If you’ve got any feedback, please get in touch or send us a PR.


Open-sourcing tools for Hadoop

Colin Marc on November 21, 2014 in Engineering

Stripe’s batch data infrastructure is built largely on top of Apache Hadoop. We use these systems for everything from fraud modeling to business analytics, and we’re open-sourcing a few pieces today:

Timberlake

Timberlake is a dashboard that gives you insight into the Hadoop jobs running on your cluster. Jeff built it as a replacement for the web interfaces currently provided by YARN’s ResourceManager and MRv2’s JobHistory server, and it has some features we’ve found useful:

  • Map and reduce task waterfalls and timing plots
  • Scalding and Cascading awareness
  • Error tracebacks for failed jobs

Brushfire

Avi wrote a Scala framework for distributed learning of ensemble decision tree models called Brushfire. It’s inspired by Google’s PLANET, but built on Hadoop and Scalding. Designed to be highly generic, Brushfire can build and validate random forests and similar models from very large amounts of training data.

Sequins

Sequins is a static database for serving data in Hadoop’s SequenceFile format. I wrote it to provide low-latency access to key/value aggregates generated by Hadoop. For example, we use it to give our API access to historical fraud modeling features, without adding an online dependency on HDFS.

Herringbone

At Stripe, we use Parquet extensively, especially in tandem with Cloudera Impala. Danielle, Jeff, and Avi wrote Herringbone (a collection of small command-line utilities) to make working with Parquet and Impala easier.


If you’re interested in trying out these projects, there’s more info on how to use them (and how they were designed) in the READMEs. If you’ve got feedback, please get in touch or send us a PR.

Happy Hadooping!


Game Day Exercises at Stripe: Learning from kill -9

Marc Hedlund on October 28, 2014 in Engineering

We’ve started running game day exercises at Stripe. During a recent game day, we tested failing over a Redis cluster by running kill -9 on its primary node [0], and ended up losing all data in the cluster. We were very surprised by this, but grateful to have found the problem in testing. This result and others from this exercise convinced us that game days like these are quite valuable, and we would highly recommend them for others.

If you’re not familiar with game days, the best introductory article is this one from John Allspaw [1]. Below, we’ll lay out a playbook for how to run a game day, and describe the results from our latest exercise to show why we believe they are valuable.

How to run a game day exercise

The system we recently tested, scoring-srv, is one part of our fraud detection system. The scoring-srv processes run on a cluster of boxes and connect to a three-node Redis cluster to store fraud scoring data. Our internal charge-processing code connects to scoring-srv for each charge made on Stripe’s network, so it needs to be very low-latency; likewise, accurate scoring requires historical data, so it needs durable storage.

The scoring-srv developers and a member of our systems team, who could help run the tests, got together around a whiteboard. We drew a basic block diagram of the machines and processes, the data stores, and the network connections between the components. With that diagram, we were able to come up with a list of possible failures.

We came up with a list of six tests we could run easily:

  • destroying and restoring a scoring-srv box,
  • destroying progressively more scoring-srv boxes until calls to it began timing out,
  • partitioning the network between our charge processing code and scoring-srv,
  • increasing the load on the primary Redis node,
  • killing the primary Redis node, and
  • killing one of the Redis replicas.

Since the team was new to game days, we did not try to be comprehensive or clever. We instead chose the simplest, easiest to simulate failures we could think of. We’d take a blunt instrument, like kill -9 or aws ec2 terminate-instances, give the system a good hard knock, and see how it reacted [2].

For each test, we came up with one or more hypotheses for what would happen when we ran it. For instance, we guessed that partitioning the network between charge processing and scoring-srv would cause these calls to time out and fail open (that is, allow the charge to go through immediately). Then, we decided on an order to perform the tests, saved a backup of a recent Redis snapshot as a precaution, and dove in.

Here, then, is a quick-start checklist for running a game day:

  1. Get the development team together with someone who can modify the network and destroy or provision servers, and block off an afternoon to run the exercise.
  2. Make a simple block diagram of the machines, processes, and network connections in the system you’re testing.
  3. Come up with 5-7 of the simplest failures you can easily induce in the system.
  4. Write down one or more hypotheses for what will happen after each failure.
  5. Back up any data you can’t lose.
  6. Induce each failure and observe the results, filing bugs for each surprise you encounter.

Observations and results

We were able to terminate a scoring-srv machine and restore it with a single command in roughly the estimated time. This gave us confidence that replacing or adding cluster machines would be fast and easy. We also saw that killing progressively more scoring-srv machines never caused timeouts, showing we currently have more capacity than necessary. Partitioning the network between the charge-processing code and scoring-srv caused a spike in latency, where we’d expected calls to scoring-srv to time out and fail open quickly. This test also should have immediately alerted the teams responsible for this system, but did not.

The first Redis test went pretty well. When we stopped one of the replicas with kill -9, it flapped several times on restart, which was surprising and confusing to observe. As expected, though, the replica successfully restored data from its snapshot and caught up with replication from the primary.

Then we moved to the Redis primary node test, and had a bigger surprise. While developing the system, we had become concerned about latency spikes during snapshotting of the primary node. Because scoring-srv is latency-sensitive, we had configured the primary node not to snapshot its data to disk. Instead, the two replicas each made frequent snapshots. In the case of failure of the primary, we expected one of the two replicas to be promoted to primary; when the failed process came back up, we expected it to restore its data via replication from the new primary. That didn’t happen. Instead, when we ran kill -9 on the primary node (and it was restarted by daemontools), it came back up – after, again, flapping for a short time – with no data, but was still acting as primary. From there, it restarted replication and sent its empty dataset to the two replica nodes, which lost their datasets as a result. In a few seconds, we’d gone from a three-node replicated data store to an empty data set. Fortunately, we had saved a backup and were able to get the cluster re-populated quickly.

The full set of tests took about 3.5 hours to run. For each failure or surprise, we filed a bug describing the expected and actual results. We wound up with 15 total issues from the five tests we performed (we wound up skipping the Redis primary load test) – a good payoff for the afternoon’s work. Closing these, and re-running the game day to verify that we now know what to expect in these cases, will greatly increase our confidence in the system and its behavior.

Learning from the game day

The invalidation of our Redis hypothesis left us questioning our approach to data storage for scoring-srv. Our original Redis setup had all three nodes performing snapshots (that is, periodically saving data to disk). We had tested failover from the primary node due to a clean shutdown and it had succeeded. While analyzing the cluster once we had live data running through it, though, we observed that the low latency we’d wanted from it would hit significant spikes, above 1 second, during snapshotting:

Obviously these spikes were concerning for a latency-sensitive application. We decided to disable snapshotting on the primary node, leaving it enabled on the replica nodes, and you can see the satisfying results below, with snapshotting enabled, then disabled, then enabled again:

Since we believed that failover would not be compromised in this configuration, this seemed like a good trade-off: relying on the primary node for performance and replication, and the replica nodes for snapshotting, failover, and recovery. As it turned out, this change was made the day before the game day, as part of the final lead-up to production readiness. (One could imagine making a similar change in the run-up to a launch!)

The game day wound up being the first full test of the configuration including all optimizations and changes made during development. We had tested the system with a primary node shutdown, then with snapshotting turned off on the primary, but this was the first time we’d seen these conditions operating together. The value of testing on production systems, where you can observe failures under the conditions you intend to ship, should be clear from this result.

After discussing the results we observed with some friends, a long and heated discussion about the failure took place on Twitter, in which Redis’ author said he had not expected the configuration we were using. Since there is no guarantee the software you’re using supports or expects the way you’re using it, the only way to see for certain how it will react to a failure is to try it.

While Redis is functional for scoring-srv with snapshotting turned on, the needs of our application are likely better served by other solutions. The trade-off between high-latency spikes, with primary node snapshotting enabled, versus total cluster data loss, with it disabled, leaves us feeling neither option is workable. For other configurations at Stripe – especially single-node topologies for which data loss is less costly, such as rate-limiting counters – Redis remains a good fit for our needs.

Conclusions

In the wake of the game day, we’ve run a simple experiment with PostgreSQL RDS as a possible replacement for the Redis cluster in scoring-srv. The results suggest that we could expect comparable latency without suffering snapshotting spikes. Our testing, using a similar dataset, had a 99th percentile read latency of 3.2 milliseconds, and a 99th percentile write latency of 11.3 milliseconds. We’re encouraged by these results and will be continuing our experiments with PostgreSQL for this application (and obviously, we will run similar game day tests for all systems we consider).

Any software will fail in unexpected ways unless you first watch it fail for yourself. We completely agree with Kelly Sommers’ point in the Twitter thread about this:

We’d highly recommend game day exercises to any team deploying a complex web application. Whether your hypotheses are proven out or invalidated, either way you’ll leave the exercise with greater confidence in your ability to respond to failures, and less need for on-the-fly diagnosis. Having that happen for the first time while you’re rested, ready, and watching is the best failure you can hope for.

Notes

[0] We’ve chosen to use the terms “primary” and “replica” in discussing Redis, rather than the terms “master” and “slave” used in the Redis documentation, to support inclusivity. For some interesting and heated discussion of this substitution, we’d recommend this Django pull request and this Drupal change.

[1] Some other good background articles for further reading: “Weathering the Unexpected”; “Resilience Engineering: Learning to Embrace Failure”; “Training Organizational Resilience in Escalating Situations”; “When the Nerds Go Marching In.”

[2] If you’d like to run more involved tests and you’re on AWS, this Netflix Tech Blog post from last week describes the tools they use for similar testing approaches.

Thanks

Thanks much to John Allspaw, Jeff Hodges, Kyle Kingsbury, and Raffi Krikorian for reading drafts of this post, and to Kelly Sommers for permission to quote her tweet. Any errors are ours alone.


jQuery.payment

Alex MacCaw on February 7, 2013 in Engineering

A rising tide lifts all boats, and we’d like to help improve payment experiences for consumers everywhere, whether or not they use Stripe. Today, we’re releasing jQuery.payment, a general purpose library for building credit card forms, validating input, and formatting numbers. This library is behind a lot of the functionality in Checkout.

Some sites require a bit more flexibility than our Checkout provides. This is where jQuery.payment shines. You can have some of the same formatting and validation as in the Checkout along with as much flexibility as you need.

Features

For example, you can ensure that a text input is formatted as a credit card number, with digits in groups of four and limited to 16 digits.

$('input.cc-num').payment('formatCardNumber');

Or you can ensure input is formatted as a MM/YYYY card expiry:

$('input.cc-exp').payment('formatCardExpiry');

The library includes a bunch of utility and validation methods, for example:

$.payment.validateCardNumber('4242 4242 4242 4242'); //=> true
$.payment.validateCardCVC('123', 'amex'); //=> false
$.payment.validateCardExpiry('05', '20'); //=> true

$.payment.cardType('4242 4242 4242 4242'); //=> 'visa'

Robust and tested

It turns out that rolling your own code that restricts and formats input is particularly tricky in JavaScript. You have to cater for lots of edge cases such as users pasting text, selecting and replacing numbers, as well as the different ways credit card numbers are formatted.

We’ve spent a lot of time tuning our formatting and validation logic as well as testing and ensuring cross browser compatibility, so you don't have to reinvent the wheel. We look forward to seeing what you build! You can find a live demo of the library, as well as the source on GitHub.


Announcing MoSQL

Nelson Elhage on February 5, 2013 in Engineering

Today, we are releasing MoSQL, a tool Stripe developed for live-replicating data from a MongoDB database into a PostgreSQL database. With MoSQL, you can run applications against a MongoDB database, but also maintain a live-updated mirror of your data in PostgreSQL, ready for querying with the full power of SQL.

Motivation

Here at Stripe, we use a number of different database technologies for both internal- and external-facing services. Over time, we've found ourselves with growing amounts of data in MongoDB that we would like to be able to analyze using SQL. MongoDB is great for a lot of reasons, but it's hard to beat SQL for easy ad-hoc data aggregation and analysis, especially since virtually every developer or analyst already knows it.

An obvious solution is to periodically dump your MongoDB database and re-import into PostgreSQL, perhaps using mongoexport. We experimented with this approach, but found ourselves frustrated with the ever-growing time it took to do a full refresh. Even if most of your analyses can tolerate a day or two of delay, occasionally you want to ask ad-hoc questions about "what happened last night?", and it's frustrating to have to wait on a huge dump/load refresh to do that. In response, we built MoSQL, enabling us to keep a real-time SQL mirror of our Mongo data.

MoSQL does an initial import of your MongoDB collections into a PostgreSQL database, and then continues running, applying any changes to the MongoDB server in near-real-time to the PostgreSQL mirror. The replication works by tailing the MongoDB oplog, in essentially the same way Mongo's own replication works.

Usage

MoSQL can be installed like any other gem:

$ gem install mosql

To use MoSQL, you'll need to create a collection map which maps your MongoDB objects to a SQL schema. We'll use the collection from the MongoDB tutorial as an example. A possible collection map for that collection would look like:

mydb:
  things:
    :columns:
      - _id: TEXT
      - x: INTEGER
      - j: INTEGER
    :meta:
      :table: things
      :extra_props: true

Save that file as collections.yaml, start a local mongod and postgres, and run:

$ mosql --collections collections.yaml

Now, run through the MongoDB tutorial, and then open a psql shell. You'll find all your Mongo data now available in SQL form:

postgres=# select * from things limit 5;
           _id            | x | j |   _extra_props
--------------------------+---+---+------------------
 50f445b65c46a32ca8c84a5d |   |   | {"name":"mongo"}
 50f445df5c46a32ca8c84a5e | 3 |   | {}
 50f445e75c46a32ca8c84a5f | 4 | 1 | {}
 50f445e75c46a32ca8c84a60 | 4 | 2 | {}
 50f445e75c46a32ca8c84a61 | 4 | 3 | {}
(5 rows)

mosql will continue running, syncing any further changes you make into Postgres.

For more documentation and usage information, see the README.

mongoriver

MoSQL comes from a general philosophy of preferring real-time, continuously-updating solutions to periodic batch jobs.

MoSQL is built on top of mongoriver, a general library for MongoDB oplog tailing that we developed. Along with the MoSQL release, we have also released mongoriver as open source today. If you find yourself wanting to write your own MongoDB tailer, to monitor updates to your data in near-realtime, check it out.


Exploring Python Using GDB

Evan Broder on June 13, 2012 in Engineering

People tend to have a narrow view of the problems they can solve using GDB. Many think that GDB is just for debugging segfaults or that it's only useful with C or C++ programs. In reality, GDB is an impressively general and powerful tool. When you know how to use it, you can debug just about anything, including Python, Ruby, and other dynamic languages. It's not just for inspection either—GDB can also be used to modify a program's behavior while it's running.

When we ran our Capture The Flag contest, a lot of people asked us about introductions to that kind of low-level work. GDB can be a great way to get started. In order to demonstrate some of GDB's flexibility, and show some of the steps involved in practical GDB work, we've put together a brief example of debugging Python with GDB.

Imagine you're building a web app in Django. The standard cycle for building one of these apps is to edit some code, hit an error, fix it, restart the server, and refresh in the browser. It's a little tedious. Wouldn't it be cool if you could hit the error, fix the code while the request is still pending, and then have the request complete successfully?

As it happens, the Seaside framework supports exactly this. Using one of Stripe's example projects, let's take a look at how we could pull it off in Python using GDB:

GDB Demo Screencast

Pretty cool, right? Though a little contrived, this example demonstrates many helpful techniques for making effective real-world use of GDB. I'll walk through what we did in a little more detail, and explain some of the GDB tricks as we go.

For the sake of brevity, I'll show the commands I type, but elide some of the output they generate. I'm working on Ubuntu 12.04 with GDB 7.4. The manipulation should still work on other platforms, but you probably won't get automatic pretty-printing of Python types. You can get the same output by hand by running p PyString_AsString(PyObject_Repr(obj)) in GDB.

Getting Set Up

First, let's start the monospace-django server with --noreload so that Django's autoreloading doesn't get in the way of our GDB-based reloading. We'll also use the python2.7-dbg interpreter, which will ensure that less of the program's state is optimized away.

$ git clone http://github.com/stripe/monospace-django
$ cd monospace-django/
$ virtualenv --no-site-packages env
$ cp /usr/bin/python2.7-dbg env/bin/python
$ source env/bin/activate
(env)$ pip install -r requirements.txt
(env)$ python monospace/manage.py syncdb
(env)$ python monospace/manage.py runserver --noreload

$ sudo gdb -p $(pgrep -f monospace/manage.py)
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
[...]
Attaching to process 946
Reading symbols from /home/evan/monospace-django/env/bin/python...done.
(gdb) symbol-file /usr/bin/python2.7-dbg
Load new symbol table from "/usr/bin/python2.7-dbg"? (y or n) y
Reading symbols from /usr/bin/python2.7-dbg...done.

As of version 7.0, GDB's behavior can be scripted in Python, and you can even register your own code to pretty-print C types. Python itself ships with hooks that pretty-print Python types (such as PyObject *) and understand the Python stack. These hooks are loaded automatically if you have the python2.7-dbg package installed on Ubuntu.

Whatever you're debugging, you should look to see if there are relevant GDB scripts available—useful helpers have been created for many dynamic languages.
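
To give a flavor of what these helpers look like, here's a heavily simplified pretty-printer written against GDB's Python scripting API. The class and lookup function are invented for illustration; the real printers that ship with python2.7-dbg are far more thorough.

import gdb

class PyObjectPrinter(object):
    """Toy printer for PyObject * values: reads ob_type->tp_name out of the
    debugged process's memory and shows the object's type and address."""
    def __init__(self, val):
        self.val = val

    def to_string(self):
        # Follow the pointer, then read ob_type->tp_name as a C string.
        tp_name = self.val.dereference()["ob_type"].dereference()["tp_name"].string()
        return "<%s object at %s>" % (tp_name, self.val)

def lookup_pyobject(val):
    # Only claim values whose declared type is exactly PyObject *.
    if str(val.type) == "PyObject *":
        return PyObjectPrinter(val)
    return None

gdb.pretty_printers.append(lookup_pyobject)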

Catching the Error

The Python interpreter creates a PyFrameObject every time it starts executing a Python stack frame. From that frame object, we can get the name of the function being executed: it's stored on the frame's code object as f->f_code->co_name. That name is a Python string object, so we can convert it to a C string using PyString_AsString, and then stop the interpreter only when it begins executing a function called handle_uncaught_exception.

The obvious way to catch this would be by creating a GDB breakpoint. A lot of frames are allocated in the process of executing Python code, though. Rather than tediously continue through hundreds of false positives, we can set a conditional breakpoint that'll break on only the frame we care about:

(gdb) b PyEval_EvalFrameEx if strcmp(PyString_AsString(f->f_code->co_name), "handle_uncaught_exception") == 0
Breakpoint 1 at 0x519d64: file ../Python/ceval.c, line 688.
(gdb) c
Continuing.

Breakpoint conditions can be pretty complex, but it's worth noting that conditional breakpoints that fire often (like PyEval_EvalFrameEx) can slow the program down significantly.

Generating the Initial Return Value

Okay, let's see if we can actually fix things during the next request. We resubmit the form. Once again, GDB halts when the app starts generating the internal server error response. While we investigate more, let's disable the breakpoint in order to keep things fast.

What we really want to do here is to let the app finish generating its original return value (the error response) and then to replace that with our own (the correct response). We find the stack frame where get_response is being evaluated. Once we've jumped to that frame with the up or frame command, we can use the finish command to wait until the currently selected stack frame finishes executing and returns.

Breakpoint 1, PyEval_EvalFrameEx (f=
    Frame 0x3534110, for file [...]/django/core/handlers/base.py, line 186, in handle_uncaught_exception [...], throwflag=0) at ../Python/ceval.c:688
688 ../Python/ceval.c: No such file or directory.
(gdb) disable 1
(gdb) frame 3
#3  0x0000000000521276 in PyEval_EvalFrameEx (f=
    Frame 0x31ac000, for file [...]/django/core/handlers/base.py, line 169, in get_response [...], throwflag=0) at ../Python/ceval.c:2666
2666      in ../Python/ceval.c
(gdb) finish
Run till exit from #3  0x0000000000521276 in PyEval_EvalFrameEx (f=
    Frame 0x31ac000, for file [...]/django/core/handlers/base.py, line 169, in get_response [...], throwflag=0) at ../Python/ceval.c:2666
0x0000000000526871 in fast_function (func=<function at remote 0x26e96f0>,
    pp_stack=0x7fffb296e4b0, n=2, na=2, nk=0) at ../Python/ceval.c:4107
4107                         in ../Python/ceval.c
Value returned is $1 =
    <HttpResponseServerError[...] at remote 0x3474680>

Patching the Code

Now that we've gotten the interpreter into the state we want, we can use Python's internals to modify the running state of the application. GDB allows you to make fairly complicated dynamic function invocations, and we'll use lots of that here.

We use the C equivalent of Python's built-in reload function to reimport the code. We also have to reload the monospace.urls module so that it picks up the new code in monospace.views.

One handy trick, which we use to invoke git in the video and curl here, is that you can run shell commands from within GDB.

(gdb) shell curl -s -L https://gist.github.com/raw/2897961/ | patch -p1
patching file monospace/views.py
(gdb) p PyImport_ReloadModule(PyImport_AddModule("monospace.views"))
$2 = <module at remote 0x31d4b58>

(gdb) p PyImport_ReloadModule(PyImport_AddModule("monospace.urls"))
$3 = <module at remote 0x31d45a8>
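
For reference, PyImport_AddModule looks a module up in sys.modules (creating an empty one if needed), and PyImport_ReloadModule is the C counterpart of Python 2's built-in reload, so the two invocations above amount to roughly this ordinary Python:

import sys

# Roughly what the two GDB invocations above do, in plain Python 2:
reload(sys.modules["monospace.views"])
reload(sys.modules["monospace.urls"])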

We've now patched and reloaded the code. Next, let's generate a new response by pulling self and request out of this stack frame's local variables, then fetching and calling self's get_response method.

(gdb) p $self = PyDict_GetItemString(f->f_locals, "self")
$4 =
    <WSGIHandler([...]) at remote 0x311c610>
(gdb) set $request = PyDict_GetItemString(f->f_locals, "request")
(gdb) set $get_response = PyObject_GetAttrString($self, "get_response")
(gdb) set $args = Py_BuildValue("(O)", $request)
(gdb) p PyObject_Call($get_response, $args, 0)
$5 =
    <HttpResponse([...]) at remote 0x31b9fb0>

In the above snippet, we use GDB's set command to assign values to convenience variables (the first line uses p instead, which makes the assignment and also prints the result).
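
Stripped of the C API plumbing, those commands are doing nothing more exotic than the following sketch, where frame_locals stands in for the paused frame's f->f_locals dictionary:

def replay_request(frame_locals):
    # frame_locals stands in for the paused frame's f->f_locals dictionary.
    handler = frame_locals["self"]        # Django's WSGIHandler instance
    request = frame_locals["request"]     # the in-flight HttpRequest
    return handler.get_response(request)  # the same call PyObject_Call makes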

Alright, we now have a new response. Remember that we stopped the program right where the original get_response call returned. The C-level return value of that call is a pointer to the Python return value. So, to replace the return value, we just have to store the new one in the return register ($rax on 64-bit x86) and then allow execution to continue.

GDB allows you to refer, by number, to the value returned by each command you evaluate. In this case, we want $5:

(gdb) set $rax = $5
(gdb) c
Continuing.

And, like magic, our web request finishes successfully.

GDB is a powerful precision tool. Even if you spend most of your time writing code in a much higher-level language, it can be extremely useful to have it available when you need to investigate subtle bugs or complex issues in running applications.

June 13, 2012