
Debugging Production Issues from a Moving Train Through the Alps

The PagerDuty alert hit at 14:37 CET, somewhere between Lugano and Como. Database connection pool exhausted, API throwing 503s, customers locked out. I had 22% battery, one bar of Swiss 4G that would vanish in about eight minutes when we hit the Gotthard tunnel, and a laptop balanced on a fold-down tray table vibrating at 160 km/h. This is incident response on a moving train.

Photo: laptop screen showing code and development work (Unsplash / Karsten Winegeart)

Why I Was Even On This Train

I was three days into a work-from-Europe trip, bouncing between Zurich and Milan for client meetings. The Gotthard Panorama route looked too good to skip, so I booked a ticket and figured I'd get some low-key coding done. The universe had other plans.

What I learned: If you work in ops and travel frequently, eventually you will have to debug production from somewhere absurd. Might as well have a protocol.

The Connectivity Problem

Trains are terrible for network reliability:

  • Constant handoffs between cell towers at high speed
  • Tunnels that black out connectivity for minutes at a time
  • Cross-border roaming that introduces latency spikes and sometimes drops your session entirely
  • Train Wi-Fi that's borderline unusable (20+ Mbps advertised, 0.3 Mbps actual on this route)

For casual browsing, it's annoying. For SSH sessions and database queries, it's a disaster.

My Setup: Redundant Connectivity

I never travel with a single internet connection. Here's what I had running on this train:

Primary: Dual-SIM Phone Hotspot

  • Swiss SIM (Swisscom prepaid, 20GB data)
  • EU roaming SIM (Vodafone, works across borders)
  • iPhone 14 Pro with automatic carrier switching
  • Tethered to laptop via USB-C (more stable than Wi-Fi, faster failover)

Secondary: Portable Router

  • GL.iNet Mudi (GL-E750) with a third SIM (Three UK, roaming across EU)
  • Battery-powered, fits in my jacket pocket
  • Configured with automatic failback to the phone hotspot if the SIM loses signal (sketched below)
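
The failback itself is handled by the Mudi's stock OpenWrt tooling; conceptually, though, it's just a connectivity watchdog. A minimal sketch of the idea in plain shell, with the interface names and hotspot gateway as placeholders:

#!/bin/sh
# Conceptual connectivity watchdog: prefer the router's own LTE link, fall back
# to the phone's hotspot when probes fail. On the Mudi this is done by the stock
# OpenWrt failover tooling; interface names and IPs here are placeholders.
PRIMARY_IF="wwan0"          # router's LTE modem
BACKUP_IF="wlan-sta"        # router joined to the phone's hotspot as a Wi-Fi client
BACKUP_GW="172.20.10.1"     # the phone hotspot's gateway; yours will differ
PROBE_HOST="1.1.1.1"
while true; do
    if ping -I "$PRIMARY_IF" -c 2 -W 2 "$PROBE_HOST" >/dev/null 2>&1; then
        # LTE is healthy: make sure it's the default route again
        ip route replace default dev "$PRIMARY_IF"
    else
        # LTE probe failed: shift the default route to the phone hotspot
        ip route replace default via "$BACKUP_GW" dev "$BACKUP_IF"
    fi
    sleep 10
done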

Tertiary: Train Wi-Fi (LOL)

  • Technically available
  • Realistically unusable for anything requiring low latency
  • I keep it connected as a last-resort fallback for Slack messages

Emergency: Starlink Mini

Okay, I don't actually carry this yet, but after this trip I'm seriously considering it. If you're doing regular train travel and need rock-solid uptime, the Starlink Mini (portable dish) is looking increasingly viable at $599 + $30/month.

The Debugging Session

Here's how the incident played out in real-time:

14:37 - Alert Arrives

PagerDuty hits phone and laptop simultaneously. Database connection pool exhausted on our primary Postgres instance (RDS db.r6g.xlarge in eu-central-1).

First move: Triage. I need to know:

  • Is this impacting all customers or just a subset?
  • Is it database-level or application-level?
  • Can I mitigate without a full rollback?

I open Datadog on my phone (yes, phone—laptop's still waking up from sleep). Error rate spiked 6 minutes ago. Affecting 100% of API requests. Database CPU at 98%, connection count maxed at 150.
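
The phone app was enough here, but when I need raw numbers over a bad link, a single curl against the Datadog metrics API is lighter than loading a dashboard. Something like this, assuming an EU Datadog site and with the metric and tag names as placeholders:

# Lightweight triage over a bad link: pull one timeseries instead of a dashboard.
# Metric/tag names are placeholders; adjust for your own instrumentation.
NOW=$(date +%s)
curl -s -G "https://api.datadoghq.eu/api/v1/query" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  --data-urlencode "from=$((NOW - 900))" \
  --data-urlencode "to=${NOW}" \
  --data-urlencode "query=sum:trace.http.request.errors{service:api}.as_count()" \
  | jq '.series[0].pointlist[-5:]'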

14:39 - Establish Stable Connection

Train's about to enter a tunnel. I have maybe 5 minutes of stable connectivity.

I switch the laptop from the phone hotspot to the GL.iNet router (it's running the Three UK SIM, which roams onto Italian networks and has better coverage on this route). Latency to AWS eu-central-1: 68ms. Not great, not terrible.
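
That latency number is nothing fancier than a quick timing check from the shell. Roughly like this, using a public AWS endpoint in the region purely as a convenient target:

# Measure a real TCP+TLS handshake; plain ICMP is often filtered on mobile paths.
# The AWS hostname is just a convenient public endpoint in eu-central-1.
curl -so /dev/null "https://ec2.eu-central-1.amazonaws.com" \
  -w "dns %{time_namelookup}s  tcp %{time_connect}s  tls %{time_appconnect}s  total %{time_total}s\n"
# mtr over TCP 443 shows where a flaky mobile path is actually losing packets
mtr --report --report-cycles 10 --tcp --port 443 ec2.eu-central-1.amazonaws.com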

14:41 - SSH into Bastion


ssh -o ServerAliveInterval=30 -o ServerAliveCountMax=3 ops-bastion.example.com

The ServerAliveInterval is critical here—without it, SSH sessions drop silently when you hit a tunnel and don't reconnect cleanly.

I use tmux religiously for exactly this reason:


tmux attach -t incident || tmux new -s incident

If my connection drops, tmux keeps the session alive on the bastion. When I reconnect, I reattach and pick up exactly where I left off.
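
These days I wrap the two commands into one helper so the reconnect-and-reattach happens automatically instead of me retyping it. A sketch, assuming autossh is installed on the laptop:

# Local shell helper (~/.bashrc): reconnect after a drop and land back in the
# same tmux session on the bastion.
incident_shell() {
    autossh -M 0 \
        -o "ServerAliveInterval 30" \
        -o "ServerAliveCountMax 3" \
        -t ops-bastion.example.com \
        "tmux attach -t incident || tmux new -s incident"
}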

14:43 - Identify the Query

I dump the active queries on Postgres:


SELECT pid, usename, application_name, state, query, query_start
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

There it is: a JOIN query from our analytics service running for 14 minutes, holding locks, and blocking every other transaction. The query was supposed to be read-only but got routed to the primary due to a misconfigured connection pool.
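
When the culprit is less obvious than a 14-minute JOIN, Postgres will also tell you directly who is blocking whom. A follow-up query I keep handy, assuming Postgres 9.6+ and a placeholder connection string:

# Who is blocked, and by which backend? (pg_blocking_pids needs Postgres 9.6+;
# $DATABASE_URL stands in for the usual connection string)
psql "$DATABASE_URL" -c "
  SELECT pid,
         pg_blocking_pids(pid) AS blocked_by,
         now() - query_start   AS waiting_for,
         left(query, 60)       AS query
  FROM pg_stat_activity
  WHERE cardinality(pg_blocking_pids(pid)) > 0
  ORDER BY query_start;"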

14:45 - We Hit the Tunnel

Connection drops. Laptop shows "No Internet." I watch my SSH session freeze.

This is where tmux saves me. The session is still alive on the bastion. My query results are sitting in the terminal buffer. I don't lose state.

I spend the next 90 seconds staring at a black screen, hoping we exit the tunnel before the database completely melts down.

14:47 - Back Online

Train exits tunnel. GL.iNet reconnects to Italian towers. Latency spikes to 340ms but the connection holds.

I reattach to tmux:


tmux attach -t incident

Everything's right where I left it. I kill the offending query:


SELECT pg_terminate_backend(12849);

Connection pool drains. API starts recovering. Error rate drops from 100% to 12% within 30 seconds.

14:50 - Deploy Hotfix

The root cause was a config change that routed read queries to the primary instead of read replicas. Someone (not me, but I've done worse) pushed a connection string tweak without realizing the impact.

I need to roll back the config change, but I'm on a train with intermittent connectivity. Here's my process:

  1. Cherry-pick the revert commit (already prepared by on-call teammate in Slack)
  2. Push to staging, verify it doesn't break anything (automated tests run in 90 seconds)
  3. Deploy to prod via the CI/CD pipeline (GitHub Actions + ArgoCD)

The deploy takes 4 minutes. I'm switching between phone hotspot and router every time we hit a connectivity dead zone.
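
Condensed, the command side of that process is deliberately small, which is the whole point over a connection like this. A rough sketch; the commit hash and branch conventions are placeholders, and your team's PR/branch-protection flow will differ:

# 1. Apply the revert the on-call teammate prepared (hash/branch names are placeholders)
git fetch origin
git switch -c hotfix/revert-db-routing origin/main
git cherry-pick abc1234
# 2. Staging first: the push triggers the automated tests (~90s)
git push origin HEAD:staging
# 3. Then prod: GitHub Actions builds, ArgoCD rolls it out; nothing to babysit over a flaky link
git push origin HEAD:main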

14:55 - Confirm Resolution

Datadog shows error rate back to baseline. Database CPU drops to 32%. Customer support confirms logins are working.

I write a quick incident summary in our Slack #incidents channel:


14:37 - DB connection pool exhausted due to long-running analytics query on primary
14:43 - Identified misconfigured read replica routing
14:47 - Killed blocking query, pool recovered
14:50 - Deployed config rollback
14:55 - Confirmed resolution
Root cause: Config change routed reads to primary. Fix: Reverted config, added test to prevent regression.
Postmortem: Tomorrow.

Total incident duration: 18 minutes. From a moving train. In a tunnel.

What Worked

Here's what saved me:

1. Redundant Connectivity

Having three independent internet sources meant I could failover when one dropped. The GL.iNet router with auto-failback was clutch.

2. tmux on Bastion

Cannot overstate this. tmux meant I didn't lose my session state when connectivity dropped. Every command, every query result, every log tail was preserved across disconnects.

3. Pre-configured SSH Keepalives


# In ~/.ssh/config
Host *.example.com
    ServerAliveInterval 30
    ServerAliveCountMax 3
    TCPKeepAlive yes

This kept SSH connections alive through brief signal drops and tunnels.

4. Mobile-Friendly Monitoring

Datadog and PagerDuty both have excellent mobile apps. I could triage the incident from my phone while my laptop booted.

5. Automated Deploys

I didn't have to manually SSH into servers and restart services. I pushed a config change and let CI/CD handle the rollout. This is doable on flaky train internet; manual operations are not.
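
Confirming the rollout is the same story: one cheap command I can re-run after a dropout beats watching a dashboard. With ArgoCD that's roughly the following, with app, namespace, and deployment names as placeholders:

# One re-runnable check that the app is healthy and in sync after the rollout
argocd app wait api-config --health --sync --timeout 300
# Or poll Kubernetes directly:
kubectl -n api rollout status deployment/api --timeout=300s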

What Didn't Work

Train Wi-Fi

Advertised as "high-speed," delivered 0.3 Mbps with 800ms latency and constant dropouts. Completely useless for SSH.

Single Hotspot

If I'd only had my phone, I would've been dead in the water during tunnel transits. Redundancy saved me.

Working from the Dining Car

I initially tried to set up in the dining car (better table, coffee). The Wi-Fi was somehow even worse there, and every time someone walked by, my table shook. Moved back to my seat.

My Travel Incident Response Kit

Here's what I carry now for exactly this scenario:

| Item | Purpose | Cost |
|------|---------|------|
| GL.iNet Mudi (GL-E750) | Backup LTE router with failover | $150 |
| Dual-SIM phone | Primary hotspot with carrier diversity | (Already own) |
| Three UK SIM | EU roaming data, 12GB prepaid | $25 |
| USB-C tethering cable | Faster/more stable than Wi-Fi hotspot | $15 |
| Anker PowerCore 20K | Keep devices alive during long incidents | $50 |
| Laptop w/ 12+ hour battery | Framework Laptop 13, real-world 14 hrs | $1,400 |

Total for mobile ops setup: ~$240 (excluding laptop).

Lessons Learned

1. Assume Connectivity Will Fail

Don't rely on a single internet source. Have at least two independent paths (different carriers, different devices).

2. Use tmux or screen for All Remote Sessions

If you're SSHing from a mobile connection, always use a terminal multiplexer. Your future self will thank you.

3. Optimize for Latency, Not Bandwidth

SSH, database queries, and API calls need low latency and stable connections, not high bandwidth. A 5 Mbps connection with 50ms latency beats a 50 Mbps connection with 500ms latency for ops work.

4. Mobile Triage is Real

Be able to assess and mitigate incidents from your phone. Sometimes your laptop won't boot fast enough, or you'll be standing in a train aisle with no table.

5. Document Your Runbooks for Flaky Connections

My runbooks now include a "mobile ops" section with shorter commands, pre-configured aliases, and steps that assume I might lose connectivity mid-process.
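
For example, the "mobile ops" section assumes a few helpers already exist on the bastion so the commands I type over a laggy link stay short. Illustrative versions (names and the $DATABASE_URL connection string are placeholders):

# ~/.bashrc on the bastion: short, typo-resistant commands for laggy links.
pg_active() {
    psql "$DATABASE_URL" -c "SELECT pid, state, now() - query_start AS age, left(query, 60) AS query
                             FROM pg_stat_activity
                             WHERE state != 'idle'
                             ORDER BY query_start;"
}
pg_kill() {   # usage: pg_kill <pid>
    psql "$DATABASE_URL" -c "SELECT pg_terminate_backend($1);"
}
alias inc='tmux attach -t incident || tmux new -s incident'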

The Absurd Reality of Modern Ops

I debugged a production database incident, killed a runaway query, deployed a config rollback, and confirmed resolution—all while traveling at 160 km/h through a mountain tunnel with intermittent 4G.

Ten years ago, this would've meant getting off at the next stop, finding a café with Wi-Fi, or waiting until I reached my hotel.

Now? It's just another Tuesday.

The tools exist to do serious ops work from absurd locations. You just need the right setup and a healthy paranoia about connectivity.

My Current Train Travel Protocol

If I'm doing any train travel longer than 2 hours, here's my pre-flight checklist:

  • [ ] Charge laptop to 100%, bring power bank
  • [ ] Activate backup SIM, verify roaming works
  • [ ] Test GL.iNet router, confirm failover config
  • [ ] Download offline docs for critical systems
  • [ ] Sync password manager, ensure offline access works
  • [ ] Set Slack status to "On train, connectivity may be spotty"
  • [ ] Notify on-call teammate I'm mobile for next X hours
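
Most of the connectivity items are now a small script I run on the platform before boarding. A rough version, with interface names as placeholders:

#!/usr/bin/env bash
# Pre-departure check: is the bastion reachable over each independent path?
# Interface names are placeholders (phone USB tether, Wi-Fi to the GL.iNet);
# assumes Linux iputils ping, where -I binds the probe to an interface.
set -u
BASTION="ops-bastion.example.com"
for iface in usb0 wlan0; do
    if ping -I "$iface" -c 3 -W 2 "$BASTION" >/dev/null 2>&1; then
        echo "ok:   ${BASTION} reachable via ${iface}"
    else
        echo "FAIL: ${BASTION} not reachable via ${iface}"
    fi
done
# And confirm SSH itself works before I'm relying on it
ssh -o ConnectTimeout=5 "$BASTION" true && echo "ok:   SSH to bastion"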

And the most important rule: Never assume the train Wi-Fi will work.

Because it won't.