Dashboard Unavailable in EU Region
Resolved
Mar 20, 2026 at 05:29pm UTC
Postmortem: EU Cloud Region Service Disruption
Date: March 20, 2026
Incident Duration: ~2:00 AM PST – 9:00 AM PST
Impact: Total service outage for the EU Cloud Region. Exit nodes remained functional, but downstream resources were inaccessible.
Executive Summary
On March 20, 2026, at approximately 2:00 AM PST, the EU Cloud Region experienced a total service outage. The failure originated from database write timeouts to the US-East region, which triggered a cascading failure within the Kubernetes clusters. Lack of region-specific monitoring delayed initial detection. Service was restored by forcing a full redeployment of the affected services.
Incident Timeline
- 02:00 PST: Database write timeouts begin between EU region and US-East database.
- 02:15 PST: High concentration of write failures causes Kubernetes health checks to fail.
- 02:20 PST: Pods enter rapid restart cycles.
- 03:00 – 07:00 PST: Pods enter CrashLoopBackOff state; Kubernetes back-off timers increase until containers cease running.
- 08:00 PST: Engineering team identifies the regional failure despite status page silence.
- 09:00 PST: Resolution: A forced redeployment of all EU pods is executed.
- 09:10 PST: Traffic flow resumes; service is confirmed healthy.
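The CrashLoopBackOff phase in the timeline can be sketched numerically. Assuming the kubelet's default restart back-off behavior (an initial delay of roughly 10 seconds that doubles after each crash, capped at five minutes), a hypothetical helper shows how quickly restart attempts spread out, and why pods were mostly idle in long back-off windows by the time the database recovered:

```python
def crashloop_delay(restart_count: int, base: float = 10.0, cap: float = 300.0) -> float:
    """Back-off delay in seconds before restart number `restart_count` (1-based).

    Illustrative model of the kubelet's exponential restart back-off:
    starts at `base`, doubles per crash, capped at `cap`.
    """
    return min(base * 2 ** (restart_count - 1), cap)

# Delays before the first seven restarts of a crashing container:
delays = [crashloop_delay(n) for n in range(1, 8)]
# → [10.0, 20.0, 40.0, 80.0, 160.0, 300.0, 300.0]
```

After only half a dozen crashes the pod is waiting the full five-minute cap between attempts, which matches the observed failure to recover automatically once connectivity returned.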
Root Cause Analysis
The outage was a cascading failure triggered by cross-region database latency or connectivity issues:
1. Dependency Failure: The EU region relies on the US-East database for write operations. Timeouts here caused application threads to hang.
2. Health Check Sensitivity: Kubernetes liveness/readiness probes were tied to database connectivity. When the writes failed, the probes failed, killing the pods.
3. The Death Spiral: Rapid restarts and subsequent CrashLoopBackOff states prevented the system from recovering automatically once the database became reachable again.
4. Monitoring Blind Spot: The monitoring stack was configured primarily for US regions. Consequently, the EU-specific failure did not trigger the automated alerting system or update the public status page.
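The health check sensitivity described in point 2 can be avoided by decoupling the liveness probe from dependency health: liveness should only fail when the process itself is broken, while readiness reflects dependencies so a database outage drains traffic instead of killing pods. A minimal sketch with hypothetical handler names (not the actual service code):

```python
def check_database() -> bool:
    """Placeholder for a real cross-region database connectivity check."""
    return True  # assumed healthy for this sketch


def liveness() -> tuple[int, str]:
    # Dependency-free: fails only if the process itself is wedged,
    # so a remote DB outage never triggers a container restart.
    return 200, "alive"


def readiness() -> tuple[int, str]:
    # Dependency-aware: failing readiness removes the pod from the
    # Service endpoints but does NOT restart the container, letting it
    # recover automatically once the database becomes reachable again.
    if check_database():
        return 200, "ready"
    return 503, "database unreachable"
```

Had the probes been split this way, the write timeouts would have taken pods out of rotation rather than driving them into restart cycles.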
Impact Assessment
- Services: All EU-hosted applications were offline.
- Exit Nodes: Not directly impacted. However, because the backend resources they serve were down, sites appeared "offline" to end users.
- Internal Visibility: Status pages incorrectly reported "All Systems Operational" during the event due to the lack of regional monitoring granularity.
Corrective Actions & Preventative Measures
Completed
- Alerting Update: Monitoring has been expanded to cover each geographic region individually (including EU).
- Notification Logic: Alert notifications are now configured to trigger on region-specific failures, ensuring engineers are notified regardless of US-region health.
Planned
- Resiliency Tuning: We will adjust the CrashLoopBackOff and restart policies in Kubernetes to ensure containers continue to attempt restarts more gracefully during prolonged outages.
- Database Deep Dive: A deeper root-cause analysis is underway to determine why the cross-region database writes timed out in the first place.
- Circuit Breaking: Implementing circuit breakers to prevent database timeouts from failing pod health checks entirely.
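The planned circuit breaker could look roughly like the following sketch (illustrative only, not the actual implementation): after a threshold of consecutive failures, database calls fail fast instead of hanging application threads until a reset timeout elapses.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker sketch.

    After `max_failures` consecutive errors the circuit "opens" and calls
    fail fast for `reset_timeout` seconds. Fast failures keep application
    threads (and, by extension, health check handlers) responsive during
    a dependency outage.
    """

    def __init__(self, max_failures: int = 5, reset_timeout: float = 30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open: fail immediately instead of waiting on a timeout.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success resets the failure count
        return result
```

Wrapping cross-region writes this way would let health checks report "degraded" quickly rather than hanging until Kubernetes kills the pod.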