Incidents | Pangolin
Incidents reported on status page for Pangolin
https://status.pangolin.net/

Dashboard Maintenance https://status.pangolin.net/incident/920356 Thu, 11 Jun 2026 00:00:00 -0000 https://status.pangolin.net/incident/920356#dcbbabc42ad141eae3903eeca9ddb08356af10b8f60f6900d7a1600c70ea8318
The Pangolin dashboard may experience slowdowns or momentary changes in operation for around 30-60 minutes while the team performs system updates. Resources should continue to operate as expected.

Dashboard & API recovered https://status.pangolin.net/ Sat, 06 Jun 2026 10:21:21 +0000 https://status.pangolin.net/#6a5abb32bc10b9ffd591ce94683fff1967798fa9e7e5aecb6dc37c96113e09c2
Dashboard & API recovered

Dashboard & API went down https://status.pangolin.net/ Sat, 06 Jun 2026 10:17:05 +0000 https://status.pangolin.net/#6a5abb32bc10b9ffd591ce94683fff1967798fa9e7e5aecb6dc37c96113e09c2
Dashboard & API went down

Client Registration Disruption https://status.pangolin.net/incident/896245 Thu, 14 May 2026 22:12:00 -0000 https://status.pangolin.net/incident/896245#437661d0482da22fad48aed292c97bb51e9e343b2fc70bbcdef38972ef12e525

## Postmortem: Client Registration Disruption

**Date:** May 14, 2026
**Incident Duration:** ~3 hours
**Impact:** Intermittent failure of client registration and connection establishment across US-East regional Points of Presence (PoPs).

---

### Executive Summary

On May 14, 2026, the **US-East region** experienced a degradation in connection processing. A surge in holepunch requests, stemming from suboptimal code execution paths, led to resource exhaustion within the supporting infrastructure. This created a bottleneck in source address verification, preventing clients from successfully completing the registration handshake.
Service was restored following targeted code optimizations and an expedited infrastructure scaling event.

### Incident Timeline

* **T-00:00:** Observed sharp increase in holepunch request volume at US-East regional PoPs.
* **T+00:15:** Cache and database infrastructure reports elevated latency and resource utilization.
* **T+00:30:** Inability to verify client source addresses leads to widespread registration failures.
* **T+01:45:** Engineering identifies specific code inefficiencies contributing to the request volume.
* **T+02:30:** **Resolution:** Optimized logic is deployed; supplementary infrastructure capacity is provisioned.
* **T+03:00:** Connectivity metrics return to baseline; all regional systems confirmed operational.

### Root Cause Analysis

The disruption was the result of a **resource exhaustion cascade** triggered by internal logic inefficiencies:

1. **Request Volume Anomalies:** Suboptimal code paths generated a disproportionate volume of holepunch requests relative to standard site traffic.
2. **Downstream Pressure:** The high frequency of these requests overwhelmed the regional database and caching layers, leading to delayed response times.
3. **Verification Bottleneck:** The system requires successful holepunch completion to validate client source addresses for security and functionality. Due to infrastructure timeouts, this validation could not be completed.
4. **Registration Inhibition:** Failure to verify source addresses prevented the system from authorizing new client connections.

### Impact Assessment

* **Client Connectivity:** Users in the US-East region were unable to register or establish stable connections.
* **Infrastructure:** Database and cache utilization reached critical thresholds.
* **Verification Logic:** The source address validation mechanism was temporarily unable to process inbound requests, leading to a "fail-closed" state for new connections.
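
The verification bottleneck and the resulting "fail-closed" behavior described above can be illustrated with a minimal Python sketch. The names `verify_source_address`, `register_client`, and the `lookup` callback are hypothetical and do not come from Pangolin's code; the sketch only assumes that a timed-out lookup leaves the client's source address unverified.

```python
def verify_source_address(client_id, lookup):
    """Return the verified source address, or None if the lookup times out.

    `lookup` is a hypothetical stand-in for the cache/database query that
    resolves a client's holepunch result.
    """
    try:
        return lookup(client_id)
    except TimeoutError:
        # The backing store did not answer in time, so nothing is verified.
        return None


def register_client(client_id, lookup):
    """Authorize a client only when its source address has been verified."""
    address = verify_source_address(client_id, lookup)
    if address is None:
        # Fail closed: an unverified client is refused rather than admitted.
        return {"client": client_id, "registered": False,
                "reason": "source address unverified"}
    return {"client": client_id, "registered": True, "address": address}
```

Under this assumption, slow infrastructure does not weaken verification, but every timeout turns directly into a refused registration, which matches the impact reported for new connections.
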
### Corrective Actions & Preventative Measures

#### Completed

* **Logic Optimization:** Refined the code responsible for holepunch request handling to reduce unnecessary overhead.
* **Capacity Expansion:** Scaled the regional database and cache clusters to better accommodate peak request volumes.

Dashboard & API recovered https://status.pangolin.net/ Sun, 05 Apr 2026 02:00:56 +0000 https://status.pangolin.net/#b8222d76ec656ed2f3a448ea2f10d28c5146d1ca66e5335e41aaabfcc77d9d6e
Dashboard & API recovered

Dashboard & API went down https://status.pangolin.net/ Sun, 05 Apr 2026 01:55:47 +0000 https://status.pangolin.net/#b8222d76ec656ed2f3a448ea2f10d28c5146d1ca66e5335e41aaabfcc77d9d6e
Dashboard & API went down

Dashboard & API recovered https://status.pangolin.net/ Sun, 05 Apr 2026 00:44:50 +0000 https://status.pangolin.net/#bde619b42be0ebcd5959bdb2120bb92eceb2b081d9401fac8f13878313b42fba
Dashboard & API recovered

Dashboard & API went down https://status.pangolin.net/ Sun, 05 Apr 2026 00:38:48 +0000 https://status.pangolin.net/#bde619b42be0ebcd5959bdb2120bb92eceb2b081d9401fac8f13878313b42fba
Dashboard & API went down

Dashboard & API recovered https://status.pangolin.net/ Sat, 04 Apr 2026 23:10:49 +0000 https://status.pangolin.net/#61848e6a4c508dc5262e03cbf6655e278152e63be324db257404ec397539a48c
Dashboard & API recovered

Dashboard & API went down https://status.pangolin.net/ Sat, 04 Apr 2026 23:06:53 +0000 https://status.pangolin.net/#61848e6a4c508dc5262e03cbf6655e278152e63be324db257404ec397539a48c
Dashboard & API went down

Dashboard Maintenance https://status.pangolin.net/incident/856941 Wed, 25 Mar 2026 04:35:00 -0000 https://status.pangolin.net/incident/856941#557d978cd3821d149a422bab9e6d5f5be58026f4a2d9465f0d4fb5e8a04569ac
Maintenance completed

Dashboard Maintenance https://status.pangolin.net/incident/856941 Wed, 25 Mar 2026 04:00:00 -0000 https://status.pangolin.net/incident/856941#978646c989f9f5a9b78e8828bf30d53f502941b3a2bbf62038f1989c6fd6a733
The Pangolin dashboard
will be unavailable for around 30 minutes while the team performs system updates. Resources should continue to operate as expected. Remote nodes may experience downtime.

Dashboard & API recovered https://status.pangolin.net/ Tue, 24 Mar 2026 08:47:10 +0000 https://status.pangolin.net/#ffe155dffb167e7cbd4a3a389bd45f8155ed78158e03325323772c5bdaf781ab
Dashboard & API recovered

Dashboard & API went down https://status.pangolin.net/ Tue, 24 Mar 2026 08:38:53 +0000 https://status.pangolin.net/#ffe155dffb167e7cbd4a3a389bd45f8155ed78158e03325323772c5bdaf781ab
Dashboard & API went down

Dashboard & API recovered https://status.pangolin.net/ Mon, 23 Mar 2026 09:15:56 +0000 https://status.pangolin.net/#ba45f10fadf52fa5c7d884cef9e0f24b1b3e0238d57b66f7434f21ed7117e51e
Dashboard & API recovered

Dashboard & API went down https://status.pangolin.net/ Mon, 23 Mar 2026 09:13:27 +0000 https://status.pangolin.net/#ba45f10fadf52fa5c7d884cef9e0f24b1b3e0238d57b66f7434f21ed7117e51e
Dashboard & API went down

Dashboard Unavailable in EU Region https://status.pangolin.net/incident/853774 Fri, 20 Mar 2026 17:29:00 -0000 https://status.pangolin.net/incident/853774#a75b9ec63cbb865f085b0fc76977f1cc73ab79a66da38443972163fc9a3caac8

## Postmortem: EU Cloud Region Service Disruption

**Date:** March 20, 2026
**Incident Duration:** ~2:00 AM PST – 9:00 AM PST
**Impact:** Total service outage for the EU Cloud Region. Exit nodes remained functional, but downstream resources were inaccessible.

### Executive Summary

On March 20, 2026, at approximately 2:00 AM PST, the **EU Cloud Region** experienced a total service outage. The failure originated from database write timeouts to the **US-East region**, which triggered a cascading failure within the Kubernetes clusters. Lack of region-specific monitoring delayed initial detection. Service was restored by forcing a full redeployment of the affected services.
### Incident Timeline

* **02:00 PST:** Database write timeouts begin between EU region and US-East database.
* **02:15 PST:** High concentration of write failures causes Kubernetes health checks to fail.
* **02:20 PST:** Pods enter rapid restart cycles.
* **03:00 – 07:00 PST:** Pods enter `CrashLoopBackOff` state; Kubernetes back-off timers increase until containers cease running.
* **08:00 PST:** Engineering team identifies the regional failure despite status page silence.
* **09:00 PST:** **Resolution:** A forced redeployment of all EU pods is executed.
* **09:10 PST:** Traffic flow resumes; service is confirmed healthy.

### Root Cause Analysis

The outage was caused by a **cascading failure** initiated by cross-region latency or connectivity issues:

1. **Dependency Failure:** The EU region relies on the US-East database for write operations. Timeouts here caused application threads to hang.
2. **Health Check Sensitivity:** Kubernetes liveness/readiness probes were tied to database connectivity. When the writes failed, the probes failed, killing the pods.
3. **The Death Spiral:** Rapid restarts and subsequent `CrashLoopBackOff` states prevented the system from recovering automatically once the database became reachable again.
4. **Monitoring Blind Spot:** The monitoring stack was configured primarily for US regions. Consequently, the EU-specific failure did not trigger the automated alerting system or update the public status page.

### Impact Assessment

* **Services:** All EU-hosted applications were offline.
* **Exit Nodes:** Not directly impacted. However, because the backend resources they serve were down, sites reported as "offline" to end-users.
* **Internal Visibility:** Status pages incorrectly reported "All Systems Operational" during the event due to the lack of regional monitoring granularity.
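
The health check sensitivity in the root cause analysis above can be made concrete with a minimal Python sketch of a circuit breaker that keeps database failures out of the liveness signal. The names `CircuitBreaker`, `liveness`, and `readiness`, and the threshold values, are hypothetical; this is an illustration under those assumptions, not the configuration Pangolin runs.

```python
import time


class CircuitBreaker:
    """Open after `max_failures` consecutive failures; retry after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        # Once the cool-down has elapsed, a trial call is allowed (half-open).
        return self.clock() - self.opened_at < self.reset_after

    def call(self, fn):
        if self.is_open():
            raise RuntimeError("circuit open: skipping database write")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def liveness(process_ok=True):
    # Liveness reflects only the process itself, so a database outage
    # does not cause the pod to be killed and restarted.
    return process_ok


def readiness(breaker):
    # Readiness may reflect the dependency: an open circuit takes the pod
    # out of rotation without restarting it.
    return not breaker.is_open()
```

In Kubernetes, a failed liveness probe restarts the container, while a failed readiness probe only stops traffic from being routed to it. Keeping the database dependency out of the liveness path is what prevents a write timeout from turning into a restart loop in this sketch.
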
### Corrective Actions & Preventative Measures

#### Completed

* **Alerting Update:** Monitoring has been expanded to include all specific geographic regions (including EU).
* **Notification Logic:** Media alerts are now configured to trigger for region-specific failures, ensuring engineers are notified regardless of US-region health.

#### Planned

* **Resiliency Tuning:** We will adjust the `CrashLoopBackOff` and restart policies in Kubernetes to ensure containers continue to attempt restarts more gracefully during prolonged outages.
* **Database Deep Dive:** A secondary, deeper root-cause analysis is underway to determine why the cross-region database writes timed out initially.
* **Circuit Breaking:** We are implementing circuit breakers to prevent database timeouts from failing pod health checks entirely.

Dashboard & API recovered https://status.pangolin.net/ Fri, 20 Mar 2026 06:12:53 +0000 https://status.pangolin.net/#e868863399463aa1ee2ceb4d083b13658791cb14be9991c86169a0eed477a777
Dashboard & API recovered

Dashboard & API went down https://status.pangolin.net/ Fri, 20 Mar 2026 06:10:00 +0000 https://status.pangolin.net/#e868863399463aa1ee2ceb4d083b13658791cb14be9991c86169a0eed477a777
Dashboard & API went down