Managing Multiple Cloud Providers on the Road

I learned the hard way that putting all your infrastructure eggs in one cloud basket is a bad idea when you're halfway around the world. Last November, my entire AWS us-east-1 stack went dark during an outage while I was in a Bangkok coworking space at 3 a.m., racing a contract deadline. My client-facing APIs flatlined, monitoring went silent, and I spent ninety frantic minutes migrating DNS to a DigitalOcean standby before anyone noticed. That night convinced me: multi-cloud isn't paranoia—it's operational hygiene for solo operators who can't afford downtime.

Earth from space at night showing city lights and global network connections

Photo: Unsplash / NASA

Why Single-Provider Dependency Is Risky

When you're stationary with a team and a pager rotation, a six-hour outage is painful but survivable. When you're solo, mobile, and possibly offline yourself, a single point of failure can kill client trust. I've seen three scenarios where multi-cloud saved projects:

  1. Regional outages: AWS us-east-1 issues, GCP europe-west1 networking problems—rare but catastrophic.
  2. Account lockouts: fraud detection false positives, billing disputes, support ticket limbo.
  3. Geo-restrictions: some clients require data residency; one cloud might not have the right region.

The goal isn't to mirror everything everywhere—that's expensive and complex. Instead, I tier workloads by criticality and spread them strategically.

My Three-Provider Split

AWS (Primary Production)

  • What: customer-facing APIs, Lambda functions, RDS databases
  • Why: mature ecosystem, best-in-class managed services
  • Cost: ~$340/month (reserved instances, t3.medium + Aurora Serverless)

I run production here because downtime directly impacts revenue. AWS Support Business tier ($100/month) means I can escalate issues even from a sketchy hotel connection in Phnom Penh.

Hetzner (Workload Compute)

  • What: CI/CD runners, data processing jobs, staging environments
  • Why: stupid cheap (€4.50/month for a CPX11 with 2 vCPUs, 2 GB RAM)
  • Cost: ~€35/month (~$38 USD)

Hetzner's Nuremberg and Helsinki datacenters give me EU presence for GDPR-sensitive stuff. The trade-off: fewer managed services, so I maintain more config myself. I use Ansible playbooks and keep golden images updated monthly.
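
Keeping those boxes configured is one command once the playbooks exist. A minimal sketch of the monthly pass; the inventory, playbook, and group names here are illustrative, not my real layout:

# Re-apply the base configuration to the Hetzner fleet (names are placeholders).
ansible-playbook -i inventories/hetzner.ini playbooks/base.yml --limit ci_runners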

DigitalOcean (Failover + Static Assets)

  • What: hot standby for critical APIs, CDN origin for images/docs
  • Why: simple API, fast provisioning, reliable Spaces (S3-compatible storage)
  • Cost: ~$60/month (one Droplet on standby, Spaces CDN)

When AWS melted down, I flipped a CNAME to point at the DO Droplet running a read-only version of the main API. It handled auth checks and served cached responses while I rebuilt the primary stack.

The Glue: Terraform + Tailscale + Scripted Failover

Managing three providers from a 13-inch laptop requires automation. Here's my stack:

Infrastructure as Code (Terraform)

Every resource lives in Git. My repo structure:


infra/
├── aws/
│   ├── production/
│   └── staging/
├── hetzner/
│   ├── ci-runners/
│   └── processing/
├── digitalocean/
│   ├── failover/
│   └── cdn/
└── modules/ (shared VPC, firewall, monitoring)

I can spin up or tear down entire environments with terraform apply. When I landed in Nairobi and needed a temporary processing box for a data migration, I provisioned a Hetzner instance, ran the job, and destroyed it—all from the airport lounge in forty minutes.
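
The Nairobi one-off looked roughly like this. A sketch only: the host name and job script are made up, but the spin-up/tear-down flow really is just apply and destroy:

cd infra/hetzner/processing
terraform init
terraform apply -auto-approve                        # provision the temporary processing box
ssh deploy@processing-tmp 'bash run-migration.sh'    # run the job over Tailscale SSH (host/script are placeholders)
terraform destroy -auto-approve                      # tear everything down once the job finishes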

Private Networking (Tailscale)

All my servers join a Tailscale mesh network. Benefits:

  • SSH access without exposing port 22 to the internet
  • Encrypted inter-cloud communication (AWS Lambda → Hetzner DB for batch jobs)
  • My laptop is a node, so I can reach internal services from any Wi-Fi

Setup is dead simple: curl -fsSL https://tailscale.com/install.sh | sh && tailscale up. Each provider's machines authenticate with pre-generated Tailscale auth keys stored in 1Password.
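
Joining a new box to the tailnet is two commands. A sketch, assuming the auth key lives in 1Password; the vault/item path and hostnames are made up:

# Install Tailscale and join the mesh with an auth key pulled from 1Password.
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up --authkey "$(op read 'op://infra/tailscale-auth-key/credential')" --hostname hetzner-runner-1

# From my laptop (also a node), SSH over the mesh -- no public port 22 needed.
# Hostname resolution assumes MagicDNS is enabled on the tailnet.
ssh deploy@hetzner-runner-1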

Monitoring & Alerting (Prometheus + Grafana Cloud)

I run Prometheus exporters on every instance and ship the metrics to Grafana Cloud (the free tier handles my volume). Alerts fire to a dedicated Signal group and PagerDuty (which SMS-pages me if I don't ack within five minutes).

Critical thresholds:

  • API response time >500ms for 2 minutes
  • Disk usage >85%
  • Any 5xx error rate >1% over 10 minutes (the PromQL behind this one is sketched below)
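
A sketch of that last threshold as PromQL, checked ad hoc against a Prometheus-compatible query API. The endpoint, credentials, and label names are placeholders; they depend on your Grafana Cloud stack and exporters:

curl -sG "$PROM_URL/api/v1/query" -u "$PROM_USER:$PROM_API_KEY" \
  --data-urlencode 'query=sum(rate(http_requests_total{job="client-api",code=~"5.."}[10m])) / sum(rate(http_requests_total{job="client-api"}[10m])) > 0.01'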

When AWS went down in Bangkok, PagerDuty woke me at 03:11. I knew which APIs were dead before I opened my laptop.

Failover Procedure

I keep a laminated checklist in my backpack. The short version:

  1. Confirm primary is down (not just my connection).
  2. Update DNS A record for api.client-domain.com to point at DigitalOcean IP (TTL is 60 seconds).
  3. Start the standby Droplet if it's in a stopped state (doctl compute droplet-action power-on <droplet-id>).
  4. Tail logs via Tailscale SSH to verify traffic is flowing.
  5. Notify client via Slack/email with ETA for full restore.

Practiced this in Chiang Mai during a fire drill. From decision to traffic shift: eight minutes.
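
The scripted version of steps 2–4 lives next to the checklist. A condensed sketch, assuming the DNS zone is hosted on DigitalOcean, with the domain, record ID, Droplet ID, and service name as placeholders:

#!/usr/bin/env bash
# failover.sh -- flip api.client-domain.com to the standby Droplet (all IDs are placeholders).
set -euo pipefail

DROPLET_ID="123456789"      # standby Droplet
DOMAIN="client-domain.com"
RECORD_ID="987654"          # existing A record for api.client-domain.com

# Step 3: make sure the standby is powered on, then grab its public IP.
doctl compute droplet-action power-on "$DROPLET_ID" --wait
STANDBY_IP="$(doctl compute droplet get "$DROPLET_ID" --format PublicIPv4 --no-header)"

# Step 2: repoint the A record (TTL is already 60 seconds).
doctl compute domain records update "$DOMAIN" \
  --record-id "$RECORD_ID" --record-type A --record-data "$STANDBY_IP" --record-ttl 60

# Step 4: watch traffic arrive over Tailscale SSH.
ssh deploy@do-standby 'journalctl -fu api --no-pager'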

Deployment Workflow from Anywhere

I don't trust hotel Wi-Fi for pushing production changes. My ritual:

  1. Tether to phone LTE (more stable than hotel Wi-Fi, with predictable latency).
  2. Connect to Tailscale for encrypted tunnel.
  3. Pull latest from Git (main branch is protected; I work in feature branches).
  4. Run Terraform plan locally to preview changes.
  5. Review diff carefully—a typo at 2 a.m. in Istanbul once deleted a security group rule and opened SSH to 0.0.0.0/0. Took me an hour to notice.
  6. Apply in stages: non-critical resources first, then gradual rollout to production.
  7. Monitor dashboards for anomalies (latency spikes, error rates).

For risky changes, I schedule them during low-traffic windows (usually 02:00–05:00 UTC) and set a timer to revert if metrics degrade.
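
Scripted loosely, steps 2–6 of the ritual look like this. A sketch only: the module and directory names are illustrative, and I still read the plan output myself rather than trusting any wrapper:

tailscale status >/dev/null || sudo tailscale up      # step 2: encrypted tunnel first
git pull --ff-only                                    # step 3: latest commits on the current branch

cd infra/aws/production
terraform plan -out=tfplan                            # step 4: preview the change set
terraform show tfplan | less                          # step 5: read the diff carefully

terraform apply -target=module.monitoring             # step 6: non-critical resources first...
terraform plan -out=tfplan && terraform apply tfplan  # ...then re-plan and apply the rest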

Cost Discipline: How I Keep It Under $450/Month

Multi-cloud can spiral into budget hell. My guardrails:

  • Reserved instances on AWS for predictable workloads (40% savings vs. on-demand).
  • Auto-scaling with tight limits: my Lambda concurrency caps at 50; I'd rather throttle than pay surprise bills.
  • Scheduled shutdown of dev/staging: Hetzner instances stop nights and weekends (saves ~€15/month; see the crontab sketch after this list).
  • Monthly cost review ritual: I export billing CSVs, tag every resource, and kill orphaned volumes/snapshots.
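
The scheduled shutdown is just cron plus the hcloud CLI. A sketch, with server names as placeholders and assuming HCLOUD_TOKEN (or an hcloud CLI context) is available to cron:

# Power off every evening at 20:00 UTC, power on weekday mornings at 06:00 UTC,
# so the boxes stay down overnight and all weekend.
0 20 * * *   hcloud server poweroff staging-1
0 20 * * *   hcloud server poweroff ci-runner-1
0 6  * * 1-5 hcloud server poweron  staging-1
0 6  * * 1-5 hcloud server poweron  ci-runner-1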

I use Infracost to estimate Terraform changes before applying. It's saved me from launching a mis-configured RDS cluster that would have cost $600/month.
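
The Infracost check is a single command per environment. A sketch, with paths taken from the repo layout above:

# Estimate what the current Terraform config would cost per month...
infracost breakdown --path infra/aws/production
# ...or compare a change against a saved baseline before applying it.
infracost breakdown --path infra/aws/production --format json --out-file infracost-base.json
infracost diff --path infra/aws/production --compare-to infracost-base.json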

Real-World Resilience Tests

Test 1: Forced Failover Drill (Lisbon, March 2025)

I deliberately blocked AWS egress from my laptop's firewall and treated it as an outage. Flipped DNS, verified standby served traffic, documented gaps in runbook. Found two issues: forgot to update SSL cert on DO Droplet (Let's Encrypt), and monitoring didn't page me because I silenced non-prod alerts. Fixed both.

Test 2: Account Lockout Simulation (Remote Colombia, April 2025)

Assumed my AWS account was frozen (fraud flag, billing issue). Could I restore service using only Hetzner + DO? Answer: partially. Static assets and the read-only API worked. Database backups lived only in S3, so I couldn't restore data without AWS access. Solution: now I replicate critical DB dumps to Hetzner's object storage nightly via cron + rclone.
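
That nightly replication job is a few lines of rclone driven by cron. A sketch, with the rclone remote names and bucket paths made up:

#!/usr/bin/env bash
# replicate-backups.sh -- copy the latest DB dumps from S3 to Hetzner object storage.
# Assumes rclone remotes "s3-backups" and "hetzner-os" are already configured.
set -euo pipefail
rclone copy s3-backups:client-db-backups/latest/ hetzner-os:client-db-backups-replica/latest/ \
  --transfers 4 --checksum --log-level INFO

# Crontab entry: run it nightly at 03:15 UTC.
# 15 3 * * * /usr/local/bin/replicate-backups.sh >> /var/log/replicate-backups.log 2>&1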

Test 3: Operating Offline (Iceland Road Trip, June 2025)

No cell for 36 hours. Could I deploy pre-staged changes when signal returned? Yes, because all Terraform state and Git repos were synced locally. I queued commits, applied them in batch when I hit Reykjavik. Lesson: offline-first tooling (Git, Terraform, Ansible) beats SaaS-only stacks when connectivity is flaky.

Gotchas I Learned the Hard Way

Billing Surprises

AWS Free Tier expired silently; my t2.micro turned into a $15/month charge. Now I set billing alerts at $50, $100, $200 thresholds across all providers.
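
On AWS those thresholds are plain CloudWatch billing alarms. A sketch for the $50 one, with the SNS topic ARN as a placeholder; billing metrics live in us-east-1 and require "Receive Billing Alerts" to be enabled on the account:

aws cloudwatch put-metric-alarm --region us-east-1 \
  --alarm-name "billing-over-50" \
  --namespace "AWS/Billing" --metric-name EstimatedCharges \
  --dimensions Name=Currency,Value=USD \
  --statistic Maximum --period 21600 --evaluation-periods 1 \
  --threshold 50 --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:billing-alerts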

Credential Leaks

I accidentally pushed a Hetzner API token to a public repo. GitHub's secret scanning caught it within two minutes and notified me. Revoked the key, rotated everything, enabled 2FA on all cloud accounts. Store secrets in 1Password, reference them via ENV vars or Terraform variables, never hardcode.
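
In practice that means tokens never touch the repo; they get injected at run time. A sketch, with the 1Password vault/item path made up (HCLOUD_TOKEN is the environment variable the Terraform Hetzner provider reads):

# Pull the Hetzner API token from 1Password and expose it only to this shell session.
export HCLOUD_TOKEN="$(op read 'op://infra/hetzner-api/token')"
terraform -chdir=infra/hetzner/ci-runners plan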

Time Zone Chaos

Scheduled tasks (cron, Lambda schedules) default to UTC. I once launched a database maintenance window at "2 a.m." my local time (Vietnam), which was peak traffic in Europe. Now I always specify UTC explicitly and convert mentally.
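
The fix is boring: write the schedule in UTC and say so. A sketch using an EventBridge rule for the maintenance window; the rule name is illustrative, and EventBridge cron expressions are always evaluated in UTC:

# 02:30 UTC daily -- off-peak for the EU users, regardless of where I happen to be.
aws events put-rule --name db-maintenance-window --schedule-expression "cron(30 2 * * ? *)"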

Support Ticket Delays

Hetzner support is email-only and can take 12–24 hours. For urgent issues, I need a fallback (AWS/DO with faster SLAs). I keep a Hetzner community forum bookmark for common problems.

Minimal Viable Multi-Cloud Setup

If you're just starting, here's the simplest split:

  1. Primary provider (AWS/GCP/Azure): production workloads, managed databases.
  2. Budget provider (Hetzner, Vultr, Linode): CI/CD, staging, dev environments.
  3. Failover provider (DigitalOcean, Cloudflare Workers): static site or read-only API standby.

Total cost: ~$100–$200/month depending on scale. The key is automation—manual multi-cloud management is hell.

Tools That Make It Bearable

  • Terraform Cloud (free tier): remote state, collaboration, run history.
  • Tailscale: zero-config VPN mesh.
  • 1Password: credential storage with CLI access (op command).
  • Infracost: cost estimation in CI.
  • k9s / lazydocker: terminal UIs for Kubernetes/Docker (if you run containers).
  • runbook.md in every repo: step-by-step recovery procedures, tested quarterly.

When Multi-Cloud Isn't Worth It

If your app is a side project, low-traffic blog, or weekend experiment, single-cloud is fine. Multi-cloud makes sense when:

  • Downtime costs you money (SLA penalties, lost sales).
  • You operate solo without a team to cover gaps.
  • Geographic diversity matters (data residency, latency).

For my client work, the extra complexity pays for itself in sleep quality. I've had zero unplanned outages longer than six minutes in eighteen months.

The Payoff

Today I can lose an entire cloud region and keep services running. I've deployed infrastructure changes from moving trains, coffee shops in Medellin, and a ferry between Estonia and Finland. The laptop, the automation, and the mental checklist are enough. That's freedom: not being immune to failure, but being ready to route around it before anyone notices.