Incidents | Pangolin

Incidents reported on status page for Pangolin: https://status.pangolin.net/

Dashboard & API recovered (Sun, 05 Apr 2026 02:00:56 +0000): https://status.pangolin.net/#b8222d76ec656ed2f3a448ea2f10d28c5146d1ca66e5335e41aaabfcc77d9d6e

Dashboard & API went down (Sun, 05 Apr 2026 01:55:47 +0000): https://status.pangolin.net/#b8222d76ec656ed2f3a448ea2f10d28c5146d1ca66e5335e41aaabfcc77d9d6e

Dashboard & API recovered (Sun, 05 Apr 2026 00:44:50 +0000): https://status.pangolin.net/#bde619b42be0ebcd5959bdb2120bb92eceb2b081d9401fac8f13878313b42fba

Dashboard & API went down (Sun, 05 Apr 2026 00:38:48 +0000): https://status.pangolin.net/#bde619b42be0ebcd5959bdb2120bb92eceb2b081d9401fac8f13878313b42fba

Dashboard & API recovered (Sat, 04 Apr 2026 23:10:49 +0000): https://status.pangolin.net/#61848e6a4c508dc5262e03cbf6655e278152e63be324db257404ec397539a48c

Dashboard & API went down (Sat, 04 Apr 2026 23:06:53 +0000): https://status.pangolin.net/#61848e6a4c508dc5262e03cbf6655e278152e63be324db257404ec397539a48c

Dashboard Maintenance (Wed, 25 Mar 2026 04:35:00 -0000): https://status.pangolin.net/incident/856941#557d978cd3821d149a422bab9e6d5f5be58026f4a2d9465f0d4fb5e8a04569ac
Maintenance completed.

Dashboard Maintenance (Wed, 25 Mar 2026 04:00:00 -0000): https://status.pangolin.net/incident/856941#978646c989f9f5a9b78e8828bf30d53f502941b3a2bbf62038f1989c6fd6a733
The Pangolin dashboard will be unavailable for around 30 minutes while the team performs
system updates. Resources should continue to operate as expected. Remote nodes may experience downtime.

Dashboard & API recovered (Tue, 24 Mar 2026 08:47:10 +0000): https://status.pangolin.net/#ffe155dffb167e7cbd4a3a389bd45f8155ed78158e03325323772c5bdaf781ab

Dashboard & API went down (Tue, 24 Mar 2026 08:38:53 +0000): https://status.pangolin.net/#ffe155dffb167e7cbd4a3a389bd45f8155ed78158e03325323772c5bdaf781ab

Dashboard & API recovered (Mon, 23 Mar 2026 09:15:56 +0000): https://status.pangolin.net/#ba45f10fadf52fa5c7d884cef9e0f24b1b3e0238d57b66f7434f21ed7117e51e

Dashboard & API went down (Mon, 23 Mar 2026 09:13:27 +0000): https://status.pangolin.net/#ba45f10fadf52fa5c7d884cef9e0f24b1b3e0238d57b66f7434f21ed7117e51e

Dashboard Unavailable in EU Region (Fri, 20 Mar 2026 17:29:00 -0000): https://status.pangolin.net/incident/853774#a75b9ec63cbb865f085b0fc76977f1cc73ab79a66da38443972163fc9a3caac8

## Postmortem: EU Cloud Region Service Disruption

**Date:** March 20, 2026

**Incident Duration:** ~2:00 AM PST – 9:00 AM PST

**Impact:** Total service outage for the EU Cloud Region. Exit nodes remained functional, but downstream resources were inaccessible.

### Executive Summary

On March 20, 2026, at approximately 2:00 AM PST, the **EU Cloud Region** experienced a total service outage. The failure originated from database write timeouts to the **US-East region**, which triggered a cascading failure within the Kubernetes clusters. Lack of region-specific monitoring delayed initial detection. Service was restored by forcing a full redeployment of the affected services.

### Incident Timeline

* **02:00 PST:** Database write timeouts begin between EU region and US-East database.
* **02:15 PST:** A high concentration of write failures causes Kubernetes health checks to fail.
* **02:20 PST:** Pods enter rapid restart cycles.
* **03:00 – 07:00 PST:** Pods enter `CrashLoopBackOff` state; Kubernetes back-off timers increase until containers cease running.
* **08:00 PST:** Engineering team identifies the regional failure despite status page silence.
* **09:00 PST:** **Resolution:** A forced redeployment of all EU pods is executed.
* **09:10 PST:** Traffic flow resumes; service is confirmed healthy.

### Root Cause Analysis

The outage was caused by a **cascading failure** initiated by cross-region latency or connectivity issues:

1. **Dependency Failure:** The EU region relies on the US-East database for write operations. Timeouts on these writes caused application threads to hang.
2. **Health Check Sensitivity:** Kubernetes liveness/readiness probes were tied to database connectivity. When the writes failed, the probes failed, and Kubernetes killed the pods.
3. **The Death Spiral:** Rapid restarts and subsequent `CrashLoopBackOff` states prevented the system from recovering automatically once the database became reachable again.
4. **Monitoring Blind Spot:** The monitoring stack was configured primarily for US regions. Consequently, the EU-specific failure did not trigger the automated alerting system or update the public status page.

### Impact Assessment

* **Services:** All EU-hosted applications were offline.
* **Exit Nodes:** Not directly impacted. However, because the backend resources they serve were down, sites appeared "offline" to end users.
* **Internal Visibility:** Status pages incorrectly reported "All Systems Operational" during the event due to the lack of regional monitoring granularity.

### Corrective Actions & Preventative Measures

#### Completed

* **Alerting Update:** Monitoring has been expanded to cover every geographic region individually, including the EU.
* **Notification Logic:** Media alerts are now configured to trigger for region-specific failures, ensuring engineers are notified regardless of US-region health.

#### Planned

* **Resiliency Tuning:** We will adjust the `CrashLoopBackOff` and restart policies in Kubernetes so that containers keep attempting restarts more gracefully during prolonged outages.
* **Database Deep Dive:** A secondary deep-root-cause analysis is underway to determine why the cross-region database writes timed out initially.
* **Circuit Breaking:** We will implement circuit breakers so that database timeouts do not fail pod health checks entirely.

Dashboard & API recovered (Fri, 20 Mar 2026 06:12:53 +0000): https://status.pangolin.net/#e868863399463aa1ee2ceb4d083b13658791cb14be9991c86169a0eed477a777

Dashboard & API went down (Fri, 20 Mar 2026 06:10:00 +0000): https://status.pangolin.net/#e868863399463aa1ee2ceb4d083b13658791cb14be9991c86169a0eed477a777

Dashboard & API recovered (Tue, 03 Mar 2026 17:21:57 +0000): https://status.pangolin.net/#b4f45d8217b6407462989d4b3da5a6096f5a775ce603b866011f5767e6ed5eda

Dashboard & API went down (Tue, 03 Mar 2026 17:19:01 +0000): https://status.pangolin.net/#b4f45d8217b6407462989d4b3da5a6096f5a775ce603b866011f5767e6ed5eda
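
The March 20, 2026 EU region postmortem lists circuit breaking as a planned measure, so that database timeouts no longer fail pod health checks. The Python below is a minimal sketch of that idea under stated assumptions; it is not Pangolin's actual implementation. Every name (`CircuitBreaker`, `write_with_breaker`, `liveness_status`, `readiness_status`) and every threshold is hypothetical. The design choice illustrated is that the liveness check never consults the database, while the readiness check only reflects the breaker state, so a database outage takes a pod out of rotation without getting it restarted.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CircuitBreaker:
    """Counts consecutive database failures and opens after a threshold.

    The threshold and reset window are illustrative values, not taken
    from the postmortem.
    """
    failure_threshold: int = 3
    reset_after_s: float = 30.0
    clock: Callable[[], float] = time.monotonic
    failures: int = 0
    opened_at: Optional[float] = None

    def record_success(self) -> None:
        # A successful call closes the breaker and clears the counter.
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Open (or re-open after a failed trial call) from "now".
            self.opened_at = self.clock()

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # Once the reset window has elapsed, let one trial call through.
        return self.clock() - self.opened_at < self.reset_after_s


def write_with_breaker(breaker: CircuitBreaker, write: Callable[[], None]) -> bool:
    """Attempt a database write; fail fast while the breaker is open."""
    if breaker.is_open():
        return False
    try:
        write()
    except TimeoutError:
        breaker.record_failure()
        return False
    breaker.record_success()
    return True


def liveness_status() -> int:
    """Liveness only says the process is responsive; no database call."""
    return 200


def readiness_status(breaker: CircuitBreaker) -> int:
    """Readiness reports 503 while the breaker is open, 200 otherwise."""
    return 503 if breaker.is_open() else 200
```

In this sketch, a run of write timeouts makes `readiness_status` return 503 while `liveness_status` keeps returning 200, so the process is never killed for a dependency failure and can resume serving as soon as a trial write succeeds.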