
Automating Security Monitoring While Traveling

Last March I was hiking in Armenia's Dilijan National Park when PagerDuty started blowing up my phone. Twelve alerts in six minutes: SSH brute-force attempts on my Hetzner box, a spike in 5xx errors from the API, and, worst of all, a new user account created under my AWS root account. My peaceful forest walk turned into a sprint back to the guesthouse, laptop out, terminal open, heart rate spiking. It turned out to be a false alarm (a contractor I'd forgotten I'd onboarded), but the incident proved the monitoring stack worked even when I was nowhere near a keyboard. Here's how I built it.

Close-up of laptop screen showing colorful code with syntax highlighting

Photo: Unsplash / Luca Bravo

The Solo Operator's Monitoring Problem

When you're a team of one running client infrastructure, you face constraints that enterprise SOC teams don't:

  1. No handoffs: you're on-call 24/7, across time zones.
  2. Unreliable connectivity: airport Wi-Fi, train tunnels, rural dead zones.
  3. Alert fatigue: too many pings and you mute everything; too few and you miss real incidents.
  4. Manual triage is expensive: every false positive costs you an hour of billable focus.

The solution: automate detection, correlation, and initial response so your laptop (or phone) handles the grunt work while you sleep, travel, or actually enjoy a hike.

My Monitoring Stack: Five Layers

Layer 1: Infrastructure Health (Prometheus + Grafana)

What it watches:

  • CPU, RAM, disk usage on all VMs (AWS, Hetzner, DO)
  • Network throughput, packet loss
  • Service uptime (API, databases, background workers)

How it's built:

  • Prometheus node exporters on every server (apt install prometheus-node-exporter)
  • Central Prometheus instance scrapes metrics every 15 seconds
  • Grafana Cloud (free tier) for dashboards and alerting
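
The central config is only a few lines. A minimal sketch of the scrape section of prometheus.yml, assuming node exporters on the default port 9100 (the hostnames here are placeholders):

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
          - 'hetzner-1.example.com:9100'   # placeholder hostnames
          - 'aws-api-1.example.com:9100'
          - 'do-worker-1.example.com:9100'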

Alerting rules:


groups:
  - name: infrastructure
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 80% on {{ $labels.instance }}"
      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.15
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Less than 15% disk space left on {{ $labels.instance }}"

Where it failed me: In Thailand, a misconfigured scrape interval caused Prometheus to balloon in memory until the kernel OOM-killed it. I didn't notice for eight hours because I'd silenced non-critical alerts during a beach day. Fix: I added a meta-alert that fires if Prometheus itself stops reporting.
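
A sketch of one common way to do that: an always-firing "watchdog" rule routed to an external dead man's switch (Healthchecks.io, a PagerDuty heartbeat integration, and the like). If the heartbeat stops arriving, the outside service pages you:

groups:
  - name: meta
    rules:
      # This alert is supposed to fire 24/7. Route it to a dead man's
      # switch service; silence from Prometheus becomes the page.
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Heartbeat: the Prometheus alerting pipeline is alive"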

Layer 2: Application Performance (Custom Metrics + Uptime Monitoring)

What it watches:

  • API response times (p50, p95, p99)
  • Error rates (4xx, 5xx)
  • Database query latency
  • Background job queue depth

How it's built:

  • Flask/FastAPI apps instrumented with prometheus_client
  • Custom /metrics endpoint exposes app-level stats
  • UptimeRobot (free plan) pings public endpoints every 5 minutes

Example instrumentation (Python):


from flask import Flask, jsonify
from prometheus_client import Counter, Histogram
import time

app = Flask(__name__)

# Request counter labeled by method, endpoint, and HTTP status
http_requests_total = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
# Latency histogram in seconds (default buckets)
http_request_duration = Histogram('http_request_duration_seconds', 'HTTP request latency')

@app.route('/api/data')
def get_data():
    start = time.time()
    # ... business logic ...
    data = {}  # placeholder for whatever the real handler returns
    http_requests_total.labels(method='GET', endpoint='/api/data', status=200).inc()
    http_request_duration.observe(time.time() - start)
    return jsonify(data)

Alert examples:

  • If p95 latency > 500ms for 3 minutes → page me.
  • If 5xx rate > 1% for 2 minutes → page me.
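
Expressed as Prometheus rules, roughly (a sketch that assumes the metric names from the instrumentation above):

- alert: APIHighLatencyP95
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  for: 3m
  labels:
    severity: critical
- alert: APIHighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
  for: 2m
  labels:
    severity: critical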

Where it saved me: In Porto, a slow database query started timing out under load. I got alerted before clients complained, added an index, and shipped the fix in 20 minutes.

Layer 3: Security Events (Fail2Ban + OSSEC + CloudTrail)

What it watches:

  • SSH login attempts (failed and successful)
  • Unauthorized sudo usage
  • File integrity changes in /etc, /bin, /usr/bin
  • AWS API calls (IAM changes, S3 bucket policy edits)

How it's built:

Fail2Ban (brute-force protection)

  • Monitors /var/log/auth.log for failed SSH attempts
  • Bans IPs after 5 failed logins within 10 minutes
  • Logs to syslog, which Prometheus scrapes via mtail exporter
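
The jail behind that is a handful of lines. A sketch of /etc/fail2ban/jail.local (the ban time is a matter of taste):

[sshd]
enabled  = true
port     = ssh
logpath  = /var/log/auth.log
# 5 failed logins within 10 minutes (600 s) earns a 1-hour ban
maxretry = 5
findtime = 600
bantime  = 3600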

OSSEC (host-based intrusion detection)

  • Agent runs on each server, forwards logs to central manager
  • Rules detect privilege escalation, rootkits, config changes
  • Alerts via syslog → Prometheus → PagerDuty

AWS CloudTrail + CloudWatch

  • All API calls logged to S3
  • CloudWatch Logs Insights queries flag high-risk events, e.g. fields @timestamp, userIdentity.principalId, eventName | filter eventName = "CreateUser" or eventName = "PutUserPolicy"
  • Lambda function parses logs, sends critical events to SNS → PagerDuty
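
A stripped-down sketch of that Lambda, assuming it's fed by a CloudWatch Logs subscription filter on the CloudTrail log group (the topic ARN is a placeholder, and the real event list is longer):

import base64
import gzip
import json

import boto3

sns = boto3.client('sns')
TOPIC_ARN = 'arn:aws:sns:eu-central-1:123456789012:security-critical'  # placeholder
HIGH_RISK = {'CreateUser', 'PutUserPolicy'}  # extend with whatever keeps you up at night

def lambda_handler(event, context):
    # CloudWatch Logs subscriptions deliver the payload base64-encoded and gzipped
    payload = gzip.decompress(base64.b64decode(event['awslogs']['data']))
    for log_event in json.loads(payload)['logEvents']:
        record = json.loads(log_event['message'])
        if record.get('eventName') in HIGH_RISK:
            sns.publish(
                TopicArn=TOPIC_ARN,
                Subject=f"High-risk AWS API call: {record['eventName']}",
                Message=json.dumps(record, indent=2),
            )
    return {'statusCode': 200}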

Example OSSEC rule (detect unauthorized user creation):


<rule id="100010" level="12">
  <if_sid>5902</if_sid>
  <match>useradd</match>
  <description>New user account created</description>
  <group>adduser,</group>
</rule>

Where it caught a real threat: In Lisbon, OSSEC alerted me to a new SSH key added to ~/.ssh/authorized_keys on a Hetzner box. I hadn't added it. Turns out I'd left an old deployment script with embedded credentials in a public GitHub repo. Attacker found it, added their key. I nuked the instance, rotated all keys, and scrubbed the repo history.

Layer 4: Log Aggregation (Loki + LogCLI)

What it does:

  • Centralized log storage (application logs, system logs, web server access logs)
  • Fast search across all sources
  • Retention: 30 days hot, 6 months cold (S3)

How it's built:

  • Promtail agents tail logs on each server
  • Loki stores logs in Hetzner's object storage (S3-compatible)
  • Query from CLI: logcli query '{job="api"} |= "error"' --since=1h
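
A minimal Promtail config sketch for the API logs (the Loki endpoint and log path are placeholders):

server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki.internal:3100/loki/api/v1/push   # placeholder endpoint

scrape_configs:
  - job_name: api
    static_configs:
      - targets: [localhost]
        labels:
          job: api
          __path__: /var/log/api/*.log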

Example query (find failed auth attempts):


logcli query '{job="auth"} |~ "authentication failed"' --since=24h --limit=50

Where it saved my ass: In Sofia, a client reported intermittent 503 errors. I searched Loki for status=503 and found a memory leak in a background worker. Logs showed heap usage climbing until the process was OOM-killed. Deployed a fix (added pagination to a batch job) in two hours.

Layer 5: Automated Remediation (Self-Healing Scripts)

What it does:

  • Restarts crashed services
  • Clears disk space when usage hits 90%
  • Rotates logs
  • Blocks abusive IPs

How it's built:

Systemd service watchdogs:


[Service]
Restart=on-failure
RestartSec=10s

Cron job for disk cleanup:


#!/bin/bash
# /etc/cron.hourly/disk-cleanup
USAGE=$(df / | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$USAGE" -gt 90 ]; then
  journalctl --vacuum-time=7d
  apt-get autoremove -y
  docker system prune -af
  echo "Disk cleanup triggered at $(date)" | mail -s "Disk Alert" ops@secureroamer.com
fi

PagerDuty webhook + self-healing Lambda: When PagerDuty fires an alert, it triggers an SNS topic. A Lambda function reads the alert, checks if it matches a known pattern (e.g., "service X is down"), and runs remediation (e.g., systemctl restart X via SSM Run Command).

Example Lambda handler (Python):


import boto3
import json

ssm = boto3.client('ssm')

def lambda_handler(event, context):
    # PagerDuty -> SNS delivers the alert payload as a JSON string
    alert = json.loads(event['Records'][0]['Sns']['Message'])
    if 'api' in alert['incident']['title'].lower():
        # Known pattern: API service down -> restart it via SSM Run Command
        response = ssm.send_command(
            InstanceIds=['i-0abc123'],
            DocumentName='AWS-RunShellScript',
            Parameters={'commands': ['systemctl restart api.service']}
        )
        return {'statusCode': 200, 'body': 'Restart triggered'}
    return {'statusCode': 200, 'body': 'No action'}

Where it worked: In Chiang Mai, the API crashed at 4 a.m. (my time). PagerDuty woke me, but before I could even unlock my phone, the Lambda had already restarted the service. Downtime: 38 seconds. I went back to sleep.

Alert Routing: Smart Escalation

Not all alerts are equal. I use PagerDuty's tiered escalation:

P1 (Critical): Page immediately via SMS + phone call

  • Service down >2 minutes
  • Database unreachable
  • Security event (unauthorized IAM change, root login)

P2 (High): Push notification + email

  • High error rate (5xx >1%)
  • Disk usage >85%
  • Slow queries >1 second

P3 (Medium): Email only, batch every 15 minutes

  • Elevated CPU (>70%)
  • Memory usage climbing
  • Non-critical service restart

P4 (Low): Daily digest email

  • SSL cert expiring in 30 days
  • Backup completed successfully
  • Software updates available

I also have "quiet hours" (23:00–07:00 local time) where P3/P4 alerts are muted unless I override them (like before a risky deployment).
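
If you route Prometheus alerts through Alertmanager (Grafana Cloud's notification policies can express the same idea), the tiers and quiet hours map onto routes and mute time intervals. A sketch with placeholder PagerDuty routing keys:

route:
  receiver: pagerduty-p3                   # catch-all: low-urgency service
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-p1               # SMS + phone call
    - matchers:
        - severity = "warning"
      receiver: pagerduty-p2               # push + email
    - matchers:
        - severity =~ "info|none"
      receiver: pagerduty-p3
      mute_time_intervals: [quiet-hours]   # P3/P4 sleep when I do

time_intervals:
  - name: quiet-hours
    time_intervals:
      # times are UTC unless you also set location:
      - times:
          - start_time: "23:00"
            end_time: "07:00"

receivers:
  - name: pagerduty-p1
    pagerduty_configs:
      - routing_key: <p1-key>              # placeholder
  - name: pagerduty-p2
    pagerduty_configs:
      - routing_key: <p2-key>              # placeholder
  - name: pagerduty-p3
    pagerduty_configs:
      - routing_key: <p3-key>              # placeholder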

Managing Alerts with Flaky Connectivity

Problem: If I'm offline (plane, tunnel, remote area), I might miss critical pages.

Solution: Multi-channel redundancy

  1. PagerDuty tries SMS first.
  2. If I don't acknowledge in 5 minutes, calls my phone.
  3. If still no ack, escalates to backup contact (trusted colleague).
  4. Simultaneously posts to a private Slack channel I monitor on Wi-Fi.

Offline fallback:

  • Prometheus and Loki are self-hosted, so they keep collecting data even if I can't reach them.
  • When I reconnect, I query Loki for anything that happened while I was dark.
  • Example: logcli query '{severity="critical"}' --since=6h after landing from a transatlantic flight.

Dashboards I Actually Use

1. Overview (Grafana)

  • Map showing all server locations (color-coded: green = healthy, red = down)
  • Key metrics: uptime %, request rate, error rate, latency p95
  • Refresh: every 30 seconds

2. Security (custom React app + Loki API)

  • Recent failed SSH attempts (grouped by source IP)
  • Geo-map of login locations (flags anomalies, like a login from China when I'm in Portugal)
  • Fail2Ban banned IPs (with an unban button in case I lock myself out)

3. Cost (CloudWatch + custom scraper)

  • AWS/Hetzner/DO spend by service
  • Projected monthly bill
  • Alerts if spend increases >20% week-over-week

Real Incidents Where This Saved Me

Vienna, April 2025

  • Alert: "Database replica lag >60 seconds"
  • Investigation: primary DB was stuck on a long-running migration
  • Fix: killed the query, re-ran with batching
  • Downtime: zero (read traffic served from replica)

Medellín, June 2025

  • Alert: "Unusual IAM activity"
  • Investigation: AWS access key leaked in contractor's Trello board
  • Fix: revoked key, rotated all credentials, enabled MFA enforcement
  • Damage: none (caught within 9 minutes)

Reykjavik, July 2025

  • Alert: "API response time >2 seconds"
  • Investigation: external API dependency (weather service) timing out
  • Fix: added caching layer, set 5-second timeout on external calls
  • Client impact: minimal (cached data was <10 min stale)

Cost Breakdown

  • Grafana Cloud (free tier): $0
  • PagerDuty (starter plan): $21/month
  • UptimeRobot (free plan): $0
  • Prometheus + Loki self-hosted: ~$8/month (Hetzner VM + object storage)
  • AWS CloudWatch Logs: ~$12/month (log ingestion + queries)
  • Total: ~$41/month

For comparison, enterprise monitoring (Datadog, New Relic) would cost $200–$500/month at my scale. Self-hosting the heavy pieces cuts the bill by roughly 80–90%.

Lessons Learned

Over-alerting is worse than under-alerting. I started with 40+ alert rules. Got paged for trivial stuff (CPU spikes during nightly backups). Refined down to 12 high-signal rules. Sleep improved dramatically.

Test your alerts. Once a quarter, I simulate failures: kill a service, fill a disk, trigger a bad deploy. Verify alerts fire and remediation works. Found three broken rules this way.
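
The drill script is nothing fancy. A sketch of the kind of thing I run against the staging stack (service name and sizes are placeholders):

#!/bin/bash
# Quarterly alert drill -- run against staging, never production.
set -euo pipefail

# 1. Service-down alert: stop the API long enough for the 'for:' window to elapse
sudo systemctl stop api.service
sleep 300
sudo systemctl start api.service

# 2. Disk-space alert: temporarily eat space on /
sudo fallocate -l 20G /tmp/alert-drill
sleep 600
sudo rm /tmp/alert-drill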

Document runbooks. Every alert links to a runbook (Markdown in Git). Steps to investigate, common causes, fix procedures. Saves me from Googling at 3 a.m.

Automate acks for known issues. If an alert fires repeatedly and I've decided to ignore it (e.g., planned maintenance), I auto-ack it via PagerDuty API so it doesn't clutter my inbox.
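
The auto-ack itself is one API call. A sketch against PagerDuty's REST API (the token, email, and the logic that decides what counts as a known issue are placeholders):

import requests

PD_API_KEY = "REPLACE_ME"           # PagerDuty REST API token (placeholder)
PD_FROM = "ops@secureroamer.com"    # must match a PagerDuty user

def auto_ack(incident_id: str) -> None:
    """Acknowledge a known-noisy incident so it stops re-notifying."""
    resp = requests.put(
        f"https://api.pagerduty.com/incidents/{incident_id}",
        headers={
            "Authorization": f"Token token={PD_API_KEY}",
            "Content-Type": "application/json",
            "From": PD_FROM,
        },
        json={"incident": {"type": "incident_reference", "status": "acknowledged"}},
        timeout=10,
    )
    resp.raise_for_status()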

What I'd Do Differently

Start simpler. My first iteration tried to monitor everything. Metrics overload, alert fatigue, burnout. Now I monitor the 20% of signals that catch 80% of real issues.

Invest in log parsing earlier. I wasted hours grepping raw logs before building Loki queries. Structured logging + centralized search pays off fast.

Set up test environments for alert tuning. I used to test alert thresholds in production (bad idea). Now I have a staging stack where I can simulate load and tweak rules.

Final Take

You can't be online 24/7, but your monitoring can. The goal isn't to eliminate all failures—it's to detect them fast, automate the boring fixes, and only wake yourself for things that actually need human judgment. I've gone from stressed-out firefighting to confident operational control, even when I'm offline for hours or hiking in the woods. The automation doesn't replace me—it buys me time to think before I act.