Blog Engineering

Share this post on Twitter

PagerDuty analytics with Postgres

Mark McGranaghan on December 2, 2014 in Engineering

We’re open-sourcing the tool we use to collect and analyze on-call data from PagerDuty. We use pd2pg to improve the on-call experience for engineers at Stripe, and we think it’ll be useful for your teams too.

PagerDuty data in Postgres

PagerDuty is an important source of data about how services behave in production and the on-call load experienced by engineers. This data has been instrumental for managing and evolving our on-call rotations: over five months, we’ve reduced on-call load for our systems team by about 75%.

We import data from the PagerDuty API into a Postgres database using pd2pg, where we can use the full power of Postgres’ SQL queries.

Here’s how you import your data:

$ export PAGERDUTY_SUBDOMAIN="your-company"
$ export PAGERDUTY_API_KEY="..."
$ export DATABASE_URL="postgres://..."
$ bundle exec ruby pd2pg.rb

The script incrementally updates existing data, so it’s trivial to refresh your database periodically. (It also fetches historical data from your account, so you can get started with long-term analysis right away.)

Querying PagerDuty data with SQL

You can start analyzing and exploring your PagerDuty data once it’s in the database with psql:

$ psql $DATABASE_URL
> \d incidents
           Column            |           Type           | Modifiers
-----------------------------+--------------------------+-----------
 id                          | character varying        | not null
 incident_number             | integer                  | not null
 created_at                  | timestamp with time zone | not null
 html_url                    | character varying        | not null
 incident_key                | character varying        |
 service_id                  | character varying        |
 escalation_policy_id        | character varying        |
 trigger_summary_subject     | character varying        |
 trigger_summary_description | character varying        |
 trigger_type                | character varying        | not null

> select count(*) from incidents;
 count
-------
 3466
(1 row)

As an example of a real query, here’s how you’d count the number of incidents per service over the past 28 days:

select
  services.name,
  count(incidents.id)
from
  incidents,
  services
where
  incidents.created_at > now() - '28 days'::interval and
  incidents.service_id = services.id
group by
  services.name
order by
  count(incidents.id) desc

How we use pd2pg at Stripe

  • Weekly team report: Our sys team reviews a detailed on-call report each week. It covers all alerts sent by either a team-owned service or fielded by an engineer (which can include escalations from other team’s services). This detailed report helps us understand the types of incidents we’re seeing so we can prevent or respond to them better.
  • Per-service incident counts: Aggregates like per-service incident counts help give us a high-level overview. (They’re not actionable results in themselves, but do show us high-load services we should review further.)
  • Interrupted hours metric: A common way to measure on-call load is counting the number of incidents over a period a time. Sometimes, this over-represents issues that cause several related alerts to fire at the same time (which aren’t actually more costly than a single alert firing). To get a more accurate view of on-call load, we calculate an "interrupted hours" metric that counts the intervals in which an engineer receives one or more pages. This metric provides pretty good insight into real on-call load by suppressing noise from issues that result in multiple pages and more heavily weighting incidents with escalations.
  • On-hours vs. off-hours alerts: Pages during the work day are less costly than ones that wake an engineer up at 3am on a Sunday. So, we look at the metrics discussed above broken down by on-hours vs. off-hours incidents.
  • Escalation rate analysis: Frequent or repeated escalations may indicate that either that responders aren’t able to get to a computer, or they aren’t prepared to deal with the issue. Some escalations are expected, but keeping an eye on escalation rates across services helps us keep an eye out for organizational bugs.
  • Individual on-call load: Being primary on-call is a major responsibility, and high on-call load can cause burnout in engineers. To help understand on-call load at the individual level, we can perform user-specific variants of the above queries.

We’d love to hear how you use pd2pg. If you’ve got any feedback, please get in touch or send us a PR.