
Dashboard Unavailable in EU Region

Mar 20, 2026 at 05:29pm UTC
Affected services
Dashboard & API

Resolved
Mar 20, 2026 at 05:29pm UTC

Postmortem: EU Cloud Region Service Disruption

Date: March 20, 2026
Incident Duration: ~2:00 AM PST – 9:00 AM PST
Impact: Total service outage for the EU Cloud Region. Exit nodes remained functional, but downstream resources were inaccessible.

Executive Summary

On March 20, 2026, at approximately 2:00 AM PST, the EU Cloud Region experienced a total service outage. The failure originated from database write timeouts to the US-East region, which triggered a cascading failure within the Kubernetes clusters. Lack of region-specific monitoring delayed initial detection. Service was restored by forcing a full redeployment of the affected services.

Incident Timeline

  • 02:00 PST: Database write timeouts begin between EU region and US-East database.
  • 02:15 PST: High concentration of write failures causes Kubernetes health checks to fail.
  • 02:20 PST: Pods enter rapid restart cycles.
  • 03:00 – 07:00 PST: Pods sit in a CrashLoopBackOff state; Kubernetes back-off delays grow toward their cap, so containers spend most of each cycle not running.
  • 08:00 PST: Engineering team identifies the regional failure despite status page silence.
  • 09:00 PST: Resolution: A forced redeployment of all EU pods is executed.
  • 09:10 PST: Traffic flow resumes; service is confirmed healthy.
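The CrashLoopBackOff window in the timeline reflects Kubernetes' documented restart back-off: the delay before each restart starts at 10 seconds, doubles on every crash, and is capped at five minutes (it resets after a period of successful running). A rough sketch of that delay schedule, using the documented default values:

```python
# Sketch of the kubelet's crash-loop back-off schedule: the delay before
# each restart starts at 10s, doubles per crash, and is capped at 300s.
# (These are the documented defaults; real timing also includes jitter
# and resets after the container runs successfully for a while.)

def crash_loop_delays(restarts: int, base: float = 10.0, cap: float = 300.0) -> list[float]:
    """Return the back-off delay (seconds) before each of `restarts` restarts."""
    delays = []
    delay = base
    for _ in range(restarts):
        delays.append(min(delay, cap))
        delay *= 2
    return delays

# After a handful of crashes the pod is waiting at the 5-minute cap,
# so it is down for most of every restart cycle:
print(crash_loop_delays(7))  # [10.0, 20.0, 40.0, 80.0, 160.0, 300.0, 300.0]
```

This is why the pods appeared to stop recovering for hours: each failed probe restarted the container into an ever-longer wait.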

Root Cause Analysis

The outage was caused by a cascading failure initiated by cross-region latency or connectivity issues:
1. Dependency Failure: The EU region relies on the US-East database for write operations. Timeouts here caused application threads to hang.
2. Health Check Sensitivity: Kubernetes liveness/readiness probes were tied to database connectivity. When the writes failed, the probes failed, killing the pods.
3. The Death Spiral: Rapid restarts and subsequent CrashLoopBackOff states prevented the system from recovering automatically once the database became reachable again.
4. Monitoring Blind Spot: The monitoring stack was configured primarily for US regions. Consequently, the EU-specific failure did not trigger the automated alerting system or update the public status page.
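One mitigation implied by points 2 and 3 is to stop wiring liveness to an external dependency: liveness should answer only "is this process healthy enough to keep running", while readiness alone gates traffic on the database. A minimal sketch of that separation (the function names and the `check_db` hook are illustrative, not our actual service code):

```python
# Sketch: separate liveness from readiness so a remote-database timeout
# takes the pod out of rotation (readiness fails) without getting it
# killed and thrown into a restart loop (liveness still passes).
# `check_db` is a stand-in for a real connectivity probe.

from typing import Callable, Tuple

def liveness() -> Tuple[int, str]:
    """Liveness: only answers 'is the process itself alive?'.
    Deliberately ignores external dependencies like a remote database."""
    return 200, "alive"

def readiness(check_db: Callable[[], bool]) -> Tuple[int, str]:
    """Readiness: gates traffic on dependencies. Failing here removes the
    pod from the Service endpoints but does NOT restart the container."""
    if check_db():
        return 200, "ready"
    return 503, "database unreachable"

# During the incident: DB writes time out, so readiness goes 503
# (no traffic routed) while liveness stays 200 (no restart spiral).
print(liveness())                # (200, 'alive')
print(readiness(lambda: False))  # (503, 'database unreachable')
```

With this split, the cross-region timeouts would have degraded traffic routing rather than killing pods outright.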

Impact Assessment

  • Services: All EU-hosted applications were offline.
  • Exit Nodes: Not directly impacted. However, because the backend resources they serve were down, sites reported as "offline" to end-users.
  • Internal Visibility: Status pages incorrectly reported "All Systems Operational" during the event due to the lack of regional monitoring granularity.

Corrective Actions & Preventative Measures

Completed

  • Alerting Update: Monitoring has been expanded to cover every geographic region individually, including the EU.
  • Notification Logic: Media alerts are now configured to trigger for regional-specific failures, ensuring engineers are notified regardless of US-region health.

Planned

  • Resiliency Tuning: We will adjust the Kubernetes restart and back-off policies so that containers keep attempting restarts during prolonged outages rather than backing off indefinitely.
  • Database Deep Dive: A secondary deep-root-cause analysis is underway to determine why the cross-region database writes timed out initially.
  • Circuit Breaking: We will implement circuit breakers so that database timeouts no longer fail pod health checks outright.
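The circuit-breaker item above follows the standard pattern: after N consecutive timeouts the breaker opens and callers fail fast (returning a degraded response instead of hanging an application thread), then a trial call is allowed after a cool-off. A generic sketch of the pattern, not our production implementation; the threshold, cooldown, and `do_write` hook are illustrative:

```python
# Minimal circuit-breaker sketch: fail fast once the remote database has
# timed out `threshold` times in a row, and retry after `cooldown` seconds.
# All names and values here are illustrative, not production settings.

import time
from typing import Callable

class CircuitBreaker:
    def __init__(self, threshold: int = 3, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, do_write: Callable[[], str]) -> str:
        # Open breaker: fail fast instead of hanging an application thread.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                return "fast-fail: circuit open"
            self.opened_at = None   # cool-off elapsed: allow a trial call
        try:
            result = do_write()
        except TimeoutError:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            return "write failed"
        self.failures = 0           # success closes the breaker
        return result

def flaky_write() -> str:
    raise TimeoutError  # stand-in for a cross-region write timing out

breaker = CircuitBreaker(threshold=2, cooldown=60.0)
print(breaker.call(flaky_write))  # write failed
print(breaker.call(flaky_write))  # write failed (breaker opens)
print(breaker.call(flaky_write))  # fast-fail: circuit open
```

Because the fast-fail path returns immediately, health probes that share a thread pool with write traffic would no longer stall and time out during a database incident.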