
Coordinating Multi-Time-Zone Incident Response
Incidents rarely respect time zones. When ransomware hit a client while I was in Lisbon, the on-call engineer in Seattle was asleep and the CTO was boarding in Singapore. Chaos. I rebuilt the process so follow-the-sun coverage is smooth, measurable, and fatigue-free. Here's how we coordinate multi-time-zone incident response today.
What Happened in Lisbon
It was 11:30 p.m. in Lisbon when PagerDuty woke me. The client's production database was encrypted. Ransom note in ASCII art. By the time I got my laptop open and the VPN running, I was fighting two wars: the technical one and the coordination disaster. Seattle's primary responder had their phone on airplane mode. Singapore's CTO replied forty minutes later with "Just landed, no laptop." Our Slack war room filled with messages that nobody was reading in sync. We lost an hour just trying to agree on who owned what.
The ransomware turned out to be a wiper variant, not real extortion, so we restored from backups and rebuilt the network perimeter. But the real damage was internal. The Seattle engineer woke up to sixty unread messages and quit three weeks later. The CTO filed a post-mortem that mentioned "structural handoff failure" seventeen times. I wrote the first version of this playbook on the flight back to Bucharest, fueled by espresso and spite.
Team Structure
- Regions: Americas, EMEA, APAC. Each has a primary and a secondary responder (L1) plus a subject-matter expert (L2).
- Shift cadence: 8-hour coverage blocks with a 30-minute overlap. The on-call rotation changes weekly. (A rough sketch of the block logic follows this list.)
- Roles: Lead (current region), Deputy (next region), Scribe (any time zone). The Scribe maintains the timeline and status updates.
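To make that concrete, here's a minimal sketch of the block logic in Python. The UTC boundaries (APAC 00:00–08:00, EMEA 08:00–16:00, Americas 16:00–24:00) and the overlap handling are illustrative assumptions, not our actual schedule, which lives in PagerDuty.

```python
from datetime import datetime, timezone

# Hypothetical 8-hour blocks in UTC; adjust to your real schedule.
BLOCKS = [
    ("APAC", 0, 8),
    ("EMEA", 8, 16),
    ("Americas", 16, 24),
]
OVERLAP_MINUTES = 30  # handoff overlap at the end of each block


def region_on_lead(now: datetime) -> dict:
    """Return the current lead region and, during the overlap window,
    the incoming region that should already be on the handoff call."""
    now = now.astimezone(timezone.utc)
    minutes = now.hour * 60 + now.minute
    for i, (region, start_h, end_h) in enumerate(BLOCKS):
        if start_h * 60 <= minutes < end_h * 60:
            result = {"lead": region, "incoming": None}
            # Last 30 minutes of the block: the next region joins for handoff.
            if minutes >= end_h * 60 - OVERLAP_MINUTES:
                result["incoming"] = BLOCKS[(i + 1) % len(BLOCKS)][0]
            return result
    raise ValueError("unreachable: blocks cover the full day")


print(region_on_lead(datetime.now(timezone.utc)))
```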
Tooling Stack We Actually Use
The first iteration after Lisbon used eight different apps. Too many tabs, too much context switching. After three months of refinement, here's what survived:
- PagerDuty schedules aligned with time zones. Each region has a dedicated schedule. I configure escalation policies so that if the primary doesn't ack within five minutes, the page routes to the secondary, then the L2 (a rough API sketch follows this list). We publish the on-call calendar as an .ics feed so people can overlay it on their personal calendars.
- Incident Slack channel with the prefix `#inc-{ticket}`. Slack integrations auto-post alerts from PagerDuty, Datadog, and Sentry. I mute @channel during handoffs to avoid notification storms, then re-enable it once the new lead confirms ownership.
- Notion incident workspace. The template includes a timeline (UTC timestamps only), action items with owners, a comms log for stakeholder updates, and a decision log. Every entry links to evidence: screenshots, logs, Datadog snapshots. I keep a markdown backup in Git (`incidents/YYYY-MM-DD-{slug}.md`) so if Notion dies mid-incident, we're not blind.
- Zoom war room with the waiting room disabled; Jitsi fallback. Zoom's good for screen sharing and recording. Jitsi saved us twice when bandwidth in Manila tanked and Zoom wouldn't connect. I keep both links pinned in the Slack channel header.
- 1Password shared vault for credentials. Every responder has access. We rotate secrets after each major incident, but mid-crisis you can't be hunting for the root CA password.
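For reference, here's a rough sketch of what creating that five-minute escalation chain looks like against the PagerDuty REST API. The token, user IDs, and policy name are placeholders, and the payload shape is worth double-checking against PagerDuty's current API docs rather than trusting this snippet.

```python
import requests

PD_API = "https://api.pagerduty.com"
HEADERS = {
    "Authorization": "Token token=YOUR_API_TOKEN",   # placeholder
    "Content-Type": "application/json",
    "Accept": "application/vnd.pagerduty+json;version=2",
    "From": "oncall-admin@example.com",              # some write endpoints require a valid user email
}

# Placeholder PagerDuty user IDs for one region's chain.
PRIMARY, SECONDARY, L2_EXPERT = "PABC123", "PDEF456", "PGHI789"

payload = {
    "escalation_policy": {
        "type": "escalation_policy",
        "name": "EMEA incident escalation",
        "escalation_rules": [
            # Each rule gets 5 minutes to ack before the page moves to the next one.
            {"escalation_delay_in_minutes": 5,
             "targets": [{"id": PRIMARY, "type": "user_reference"}]},
            {"escalation_delay_in_minutes": 5,
             "targets": [{"id": SECONDARY, "type": "user_reference"}]},
            {"escalation_delay_in_minutes": 5,
             "targets": [{"id": L2_EXPERT, "type": "user_reference"}]},
        ],
    }
}

resp = requests.post(f"{PD_API}/escalation_policies", json=payload, headers=HEADERS)
resp.raise_for_status()
print(resp.json()["escalation_policy"]["id"])
```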
One rule: every tool must work offline or with degraded connectivity. I learned that in Kraków when the coworking space lost internet for ninety minutes during a DDoS post-mortem call.
Handoff Playbook
Pre-Handoff (T-30 minutes)
- Lead updates timeline with:
  - Current status, mitigations applied, outstanding tasks.
  - Key contacts (stakeholders, customers).
  - Risk level and business impact.
- Scribe prepares executive summary (one slide) for leadership.
- Deputy reviews monitoring dashboards to confirm context.
Live Handoff Call (15 minutes)
- Agenda: Overview (Lead), Questions (Deputy), Next Steps, Confirm ownership.
- Document in the Notion `Handoff Log` with start/stop times and participants.
Post-Handoff
- New lead posts an update in the Slack channel: “EMEA taking lead. Next update at 10:00 UTC.” (A scripted version of this post is sketched below.)
- Previous lead logs off duty in the ticket and rests (mandatory downtime enforcement).
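The post-handoff announcement is trivial to script. A minimal sketch with the Slack `slack_sdk` client, where the channel name and token handling are placeholders rather than our production setup:

```python
import os
from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])  # bot token with chat:write scope


def announce_handoff(channel: str, new_lead_region: str, next_update_utc: str) -> None:
    """Post the ownership confirmation that closes out a handoff call."""
    client.chat_postMessage(
        channel=channel,
        text=f"{new_lead_region} taking lead. Next update at {next_update_utc} UTC.",
    )


# Example: EMEA takes over in a (hypothetical) incident channel.
announce_handoff("#inc-4217", "EMEA", "10:00")
```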
Communication Rhythm That Doesn't Drive People Insane
Early on, I made the mistake of posting updates every time something changed. Stakeholders got fifty messages in two hours and started ignoring them. Now we follow a cadence tied to incident age:
| Time since start | Update frequency |
| --- | --- |
| 0–1 hour | Every 15 minutes |
| 1–4 hours | Every 30 minutes |
| >4 hours | Hourly + whenever major change |
Executives receive an email summary every two hours, automated via PagerDuty Status Pages. I template the email so it always includes: current status, customer impact, ETA to resolution (or "still investigating"), and who to contact. During the Lisbon ransomware mess, the CEO got a wall of text at 2 a.m. his time with no clear action. Never again.
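If you want the cadence encoded somewhere a reminder bot can read it, a helper like this is enough. The thresholds are copied straight from the table above:

```python
from datetime import timedelta


def update_interval(incident_age: timedelta) -> timedelta:
    """Map time since incident start to the stakeholder-update frequency."""
    if incident_age <= timedelta(hours=1):
        return timedelta(minutes=15)
    if incident_age <= timedelta(hours=4):
        return timedelta(minutes=30)
    return timedelta(hours=1)  # plus ad-hoc updates whenever something major changes


print(update_interval(timedelta(minutes=90)))  # 0:30:00
```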
One other trick: I use Slack's scheduled-send feature to batch non-urgent updates so the next region doesn't wake up to a chaotic scroll. If you post seven messages between 23:00 and 23:45 UTC, schedule them all to deliver at 00:00 UTC in a single block.
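The same batching can be scripted with Slack's `chat.scheduleMessage` API. A minimal sketch, assuming a bot token in the environment and an invented list of queued updates:

```python
import os
from datetime import datetime, timedelta, timezone

from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

# Hypothetical non-urgent updates queued during the quiet window.
queued_updates = [
    "Backfill of failed jobs is 60% done.",
    "Vendor confirmed the certificate rotation for tomorrow.",
]

# Deliver everything as one block at the next 00:00 UTC.
now = datetime.now(timezone.utc)
midnight = now.replace(hour=0, minute=0, second=0, microsecond=0) + timedelta(days=1)

client.chat_scheduleMessage(
    channel="C0123456789",                       # placeholder ID for the incident channel
    text="\n".join(f"- {u}" for u in queued_updates),
    post_at=int(midnight.timestamp()),           # unix timestamp required by the API
)
```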
Documentation Templates
- Timeline entries: `UTC | Who | What happened | Evidence link`
- Decision log: `Decision, Reason, Approver, Timestamp`
- Action items: `Task, Owner, ETA, Status`
All editable offline (Notion export + markdown backups stored in Git).
What Broke and How We Fixed It
The playbook sounds clean on paper. Real incidents expose the gaps. Here's what we learned the hard way:
Singapore handoff failed because the deputy was in a no-phone zone. The flight from Tokyo to Manila requires phones off for ninety minutes during approach. Now we require deputies to confirm they're available thirty minutes before handoff, or we escalate to the tertiary.
Notion went down during a data-center outage in Frankfurt. We lost access to the timeline for forty minutes. Now every incident gets a parallel markdown file in Git, updated via a Slack bot command: `/incident-log [entry]`. The bot appends to the markdown file and pushes to Git. Low-tech, bombproof.
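The bot itself is nothing clever. Here's a minimal sketch of the idea using Slack's Bolt framework for Python; the repo path, file name, and tokens are placeholders, and error handling is stripped for brevity.

```python
import os
import subprocess
from datetime import datetime, timezone

from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

app = App(token=os.environ["SLACK_BOT_TOKEN"])

REPO_DIR = "/srv/incident-logs"                    # local clone of the incidents repo
LOG_FILE = "incidents/2024-05-02-api-gateway.md"   # placeholder; set when the incident opens


@app.command("/incident-log")
def incident_log(ack, respond, command):
    """Append a timestamped entry to the incident's markdown file and push it."""
    ack()
    entry = command["text"].strip()
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")
    line = f"{stamp} | {command['user_name']} | {entry} |\n"

    with open(os.path.join(REPO_DIR, LOG_FILE), "a") as f:
        f.write(line)

    # Commit and push so the log survives a Notion (or laptop) outage.
    subprocess.run(["git", "-C", REPO_DIR, "add", LOG_FILE], check=True)
    subprocess.run(["git", "-C", REPO_DIR, "commit", "-m", f"incident log: {stamp}"], check=True)
    subprocess.run(["git", "-C", REPO_DIR, "push"], check=True)

    respond(f"Logged: {line.strip()}")


if __name__ == "__main__":
    SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()
```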
APAC contact list was six months out of date. The on-call engineer for APAC left the company in March; we didn't update the escalation policy. PagerDuty paged a dead Slack account for twelve minutes before someone manually escalated. Now we run a quarterly audit of every contact in every region and force an ack test—if you don't respond within two minutes, you're removed from the rotation.
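The audit half of that is scriptable. A rough sketch that pulls the current on-call for each region's PagerDuty schedule so a human can spot stale names; the schedule IDs are placeholders, and the two-minute ack test itself still happens by paging real people:

```python
import requests

PD_API = "https://api.pagerduty.com"
HEADERS = {
    "Authorization": "Token token=YOUR_API_TOKEN",  # placeholder read-only token
    "Accept": "application/vnd.pagerduty+json;version=2",
}

# Placeholder schedule IDs, one per region.
SCHEDULES = {"Americas": "PSCHED1", "EMEA": "PSCHED2", "APAC": "PSCHED3"}

for region, schedule_id in SCHEDULES.items():
    resp = requests.get(
        f"{PD_API}/oncalls",
        headers=HEADERS,
        params={"schedule_ids[]": schedule_id},
    )
    resp.raise_for_status()
    oncalls = resp.json()["oncalls"]
    if not oncalls:
        print(f"{region}: NOBODY ON CALL, fix the schedule")
        continue
    for oc in oncalls:
        print(f"{region}: {oc['user']['summary']} (level {oc['escalation_level']})")
```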
Retros kept getting rescheduled. People are exhausted after an incident. If you don't book the retro within seventy-two hours, it never happens. I now schedule the retro call automatically when an incident closes, using a Zapier hook tied to PagerDuty status changes. Attendance is mandatory unless you're on PTO.
Continual Improvement
After each incident, we track three metrics per region: mean time to acknowledge (MTTA), mean time to resolve (MTTR), and handoff smoothness (measured by whether the deputy asked clarifying questions or just ran with it). If APAC's MTTR is consistently double EMEA's, that's a staffing or tooling problem, not bad luck.
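The rollup math is deliberately boring. A sketch of the per-region MTTA/MTTR calculation, with invented sample records standing in for the real export:

```python
from datetime import datetime
from statistics import mean

# Invented sample records; in practice these come from the PagerDuty/Notion export.
incidents = [
    {"region": "EMEA", "created": "2024-05-02T14:00", "acked": "2024-05-02T14:02", "resolved": "2024-05-02T16:10"},
    {"region": "APAC", "created": "2024-05-11T01:30", "acked": "2024-05-11T01:41", "resolved": "2024-05-11T06:05"},
]


def minutes_between(start: str, end: str) -> float:
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds() / 60


by_region: dict[str, dict[str, list[float]]] = {}
for inc in incidents:
    stats = by_region.setdefault(inc["region"], {"mtta": [], "mttr": []})
    stats["mtta"].append(minutes_between(inc["created"], inc["acked"]))
    stats["mttr"].append(minutes_between(inc["created"], inc["resolved"]))

for region, vals in by_region.items():
    print(f"{region}: MTTA {mean(vals['mtta']):.0f} min, MTTR {mean(vals['mttr']):.0f} min")
```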
We also rotate the retro facilitator between time zones so no single region owns the narrative. The facilitator publishes the retro doc in Notion with action items, owners, and due dates. If an action item isn't closed within two weeks, it gets escalated to leadership. This system survived because it's light on process and heavy on accountability.
Fatigue & Well-Being (The Part Everyone Skips)
After the Lisbon ransomware incident, the Seattle engineer didn't just quit because of the technical chaos. They quit because they worked a sixteen-hour shift, slept four hours, then got paged again for a separate outage. We didn't enforce rest. That was on me.
Now we have hard rules:
- No double shifts. If coverage is thin, we escalate to leadership to pull in contractors or pause non-critical work. I'd rather delay a feature launch than burn someone out.
- Mandatory forty-eight-hour break after major incidents. "Major" is defined as anything lasting more than four hours or requiring an executive briefing. The responder is off the on-call rotation and gets comp time.
- Mental health resources post-incident. We provide access to Talkspace and an employee assistance program (EAP). After the Frankfurt data-center incident, three people used it. Nobody mocked them.
- Transparent on-call calendar published three months ahead. People plan vacations, doctor appointments, and time with family. If you spring an on-call week on someone with two days' notice, they resent it. Publish the calendar, let people swap shifts via Slack, and honor their boundaries.
One more thing: during multi-day incidents, we deliver meals to people's homes if they're working remotely, or provide stipends if they're traveling. It's a small gesture, but when you're debugging at 3 a.m. in a Bangkok hostel and a food delivery shows up, you feel seen.
The Checklist I Keep in My Backpack
I printed this and laminated it after the third time I forgot to book a handoff call:
[ ] On-call schedule updated for all regions
[ ] Incident templates prepared and stored offline (Notion + Git)
[ ] Overlap meeting booked 30 min before handoff
[ ] Communication cadence agreed with stakeholders
[ ] Deputy confirmed availability 30 min before handoff
[ ] Retro auto-scheduled at incident close (Zapier)
[ ] Credentials vault accessible to all responders
[ ] Backup markdown log initialized in Git
[ ] Food/rest breaks enforced every 4 hours
What This Looks Like in Practice
A month after we deployed this playbook, a client's API gateway went down at 14:00 UTC. I was in a café in Porto. The EMEA lead (based in Berlin) acknowledged within two minutes, spun up the Slack channel, and started the Notion timeline. At 21:30 UTC, we handed off to the Americas lead in Austin. The handoff call took eleven minutes. By 22:00 UTC, Austin had restored service and was writing the post-mortem. The next morning, APAC reviewed the retro notes and pushed a patch to prevent recurrence.
Nobody worked more than nine hours. Nobody quit. The client sent a thank-you email.
Follow-the-sun incident response isn't just a buzzword. It's a system you build, test, break, and rebuild until it's boring. And boring is exactly what you want when the pager goes off at midnight.