We’re open-sourcing the tool we use to collect and analyze on-call data from PagerDuty. We use pd2pg to improve the on-call experience for engineers at Stripe, and we think it’ll be useful for your teams too.
PagerDuty data in Postgres
PagerDuty is an important source of data about how services behave in production and the on-call load experienced by engineers. This data has been instrumental for managing and evolving our on-call rotations: over five months, we’ve reduced on-call load for our systems team by about 75%.
We import data from the PagerDuty API into a Postgres database using pd2pg, where we can use the full power of Postgres’ SQL queries.
Here’s how you import your data:
export PAGERDUTY_SUBDOMAIN="your-company"
export PAGERDUTY_API_KEY="..."
export DATABASE_URL="postgres://..."
bundle exec ruby pd2pg.rb
The script incrementally updates existing data, so it’s trivial to refresh your database periodically. (It also fetches historical data from your account, so you can get started with long-term analysis right away.)
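Because updates are incremental, one way to keep the database fresh is a periodic cron entry. A minimal sketch, assuming the repository is checked out at /opt/pd2pg and the environment variables above live in an env file there (both paths are hypothetical):

```
# Refresh the PagerDuty database every 10 minutes
# (checkout path and env file location are illustrative)
*/10 * * * * cd /opt/pd2pg && . ./.env && bundle exec ruby pd2pg.rb
```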
Querying PagerDuty data with SQL
Once your data is in the database, you can start analyzing and exploring it with psql:

psql $DATABASE_URL

\d incidents
            Column            |           Type           | Modifiers
------------------------------+--------------------------+-----------
 id                           | character varying        | not null
 incident_number              | integer                  | not null
 created_at                   | timestamp with time zone | not null
 html_url                     | character varying        | not null
 incident_key                 | character varying        |
 service_id                   | character varying        |
 escalation_policy_id         | character varying        |
 trigger_summary_subject      | character varying        |
 trigger_summary_description  | character varying        |
 trigger_type                 | character varying        | not null

select count(*) from incidents;
 count
-------
  3466
As an example of a real query, here’s how you’d count the number of incidents per service over the past 28 days:
select services.name, count(incidents.id)
from incidents, services
where incidents.created_at > now() - '28 days'::interval and
      incidents.service_id = services.id
group by services.name
order by count(incidents.id) desc;
How we use pd2pg at Stripe
- Weekly team report: Our systems team reviews a detailed on-call report each week. It covers all alerts that were either sent by a team-owned service or fielded by an engineer on the team (which can include escalations from other teams’ services). This detailed report helps us understand the kinds of incidents we’re seeing so we can better prevent and respond to them.
- Per-service incident counts: Aggregates like per-service incident counts help give us a high-level overview. (They’re not actionable results in themselves, but do show us high-load services we should review further.)
- Interrupted hours metric: A common way to measure on-call load is to count the number of incidents over a period of time. But this can over-represent issues that cause several related alerts to fire at the same time (which aren’t actually much more costly than a single alert firing). To get a more accurate view of on-call load, we calculate an "interrupted hours" metric that counts the distinct hours in which an engineer receives one or more pages. This metric provides good insight into real on-call load by suppressing noise from issues that page repeatedly and by more heavily weighting incidents with escalations.
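A sketch of how such a metric could be computed against the incidents table shown above (bucketing incidents by the clock hour of their creation is our illustrative approach, not necessarily the exact query used at Stripe):

```sql
-- Interrupted hours over the past 28 days: the number of distinct
-- clock hours in which at least one incident was created.
select count(distinct date_trunc('hour', created_at)) as interrupted_hours
from incidents
where created_at > now() - '28 days'::interval;
```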
- On-hours vs. off-hours alerts: Pages during the work day are less costly than ones that wake an engineer up at 3am on a Sunday. So, we look at the metrics discussed above broken down by on-hours vs. off-hours incidents.
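One way to sketch this breakdown in SQL, assuming a 9am–6pm Monday–Friday working window and that created_at timestamps are in (or have been converted to) the team’s local time zone:

```sql
-- Incident counts split into on-hours vs. off-hours for the past
-- 28 days. The working-hours definition here is an assumption.
select case
         when extract(isodow from created_at) < 6
          and extract(hour from created_at) between 9 and 17
         then 'on-hours'
         else 'off-hours'
       end as period,
       count(*)
from incidents
where created_at > now() - '28 days'::interval
group by period;
```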
- Escalation rate analysis: Frequent or repeated escalations may indicate either that responders aren’t able to get to a computer or that they aren’t prepared to deal with the issue. Some escalations are expected, but monitoring escalation rates across services helps us spot organizational bugs.
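As a hedged sketch, escalation rates per service might be queried like this, assuming pd2pg also imports PagerDuty log entries into a log_entries table with type and incident_id columns (these column names are assumptions about the schema):

```sql
-- Escalation counts per service over the past 28 days.
-- Assumes log_entries(id, type, incident_id) with type = 'escalate'
-- recorded for each escalation.
select services.name,
       count(log_entries.id) as escalations
from log_entries, incidents, services
where log_entries.type = 'escalate'
  and log_entries.incident_id = incidents.id
  and incidents.service_id = services.id
  and incidents.created_at > now() - '28 days'::interval
group by services.name
order by escalations desc;
```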
- Individual on-call load: Being primary on-call is a major responsibility, and high on-call load can cause burnout in engineers. To help understand on-call load at the individual level, we can perform user-specific variants of the above queries.
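For example, a user-specific variant of the interrupted-hours idea might look like the following, assuming log_entries records a 'notify' entry with a user_id for each page delivered (again, the column names and entry types here are assumptions):

```sql
-- Distinct interrupted hours for one engineer over the past 28 days.
-- Assumes log_entries(user_id, type, created_at) and a users table;
-- the user name is a placeholder.
select count(distinct date_trunc('hour', log_entries.created_at))
         as interrupted_hours
from log_entries, users
where log_entries.user_id = users.id
  and users.name = 'An Engineer'
  and log_entries.type = 'notify'
  and log_entries.created_at > now() - '28 days'::interval;
```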