Postmortem: 2025-02 Checkout Outage

March 16, 2026 · 2 min read

Date: 2025-02-12 · Duration: 47 minutes · Severity: SEV-1 · Author: Jordan Park (SRE) · Status: Action items 1, 2, 3, and 5 complete; item 4 in progress

Summary

On February 12, 2025 at 14:23 UTC, the checkout service began returning HTTP 503 errors to approximately 35% of customers attempting to complete purchases. The incident lasted 47 minutes: a configuration fix was deployed at 15:02 UTC, and all pods were healthy again by 15:10 UTC. Estimated revenue impact: ~$280K in lost or delayed orders.

Root Cause

Connection pool exhaustion in the payments service, caused by a connection timeout that had been explicitly disabled.

The payments service (payments-svc) connects to the Stripe API through an internal connection pool (HikariCP, max pool size = 20). A Stripe API degradation at 14:20 UTC caused response times to increase from ~200ms to ~8 seconds. Because the pool's `connectionTimeout` was set to 0 (wait indefinitely) instead of the 30-second HikariCP default, threads waiting for a pool connection blocked indefinitely.

Within 3 minutes, all 20 connections were occupied by slow Stripe calls, and new checkout requests queued behind them. The queue grew until the service hit its thread limit (200 threads), at which point Kubernetes health checks started failing and pods entered CrashLoopBackOff.

The checkout service calls payments-svc synchronously, so when payments-svc became unavailable, checkout returned HTTP 503 to customers.
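The saturation described above follows from simple capacity arithmetic. The sketch below is a rough model, not a measurement from the incident: it assumes every request holds one pooled connection for the full Stripe round trip, and the helper name `pool_throughput` is made up for illustration. Under that assumption, the most a pool can sustain is pool size divided by latency (Little's Law).

```python
def pool_throughput(pool_size: int, latency_ms: int) -> float:
    """Upper bound on requests/second a pool can sustain.

    Assumes each request holds exactly one connection for latency_ms
    (Little's Law: concurrency = throughput * latency).
    """
    return pool_size * 1000 / latency_ms


# Figures taken from this postmortem: pool of 20, latency 200 ms -> 8 s.
healthy = pool_throughput(20, 200)       # capacity before the Stripe degradation
degraded = pool_throughput(20, 8000)     # capacity during the degradation
bigger_pool = pool_throughput(50, 8000)  # capacity after the 14:42 attempted fix

print(f"healthy:     {healthy} req/s")
print(f"degraded:    {degraded} req/s")
print(f"pool of 50:  {bigger_pool} req/s")
```

In this model the pool's ceiling falls from 100 req/s to 2.5 req/s, and raising the pool to 50 only lifts it to 6.25 req/s. That is consistent with the pool-size increase not helping while Stripe stayed slow: any arrival rate above the ceiling just grows the queue of blocked threads.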

Timeline

Time (UTC)	Event
14:20	Stripe API latency increases (p99 200ms → 8s)
14:23	Payments-svc connection pool saturates; checkout errors begin
14:25	PagerDuty fires `checkout-error-rate` alert
14:28	On-call SRE acknowledges; begins investigation
14:35	Pool exhaustion identified: payments-svc thread dump shows all threads blocked on HikariCP `getConnection()`
14:42	Attempted fix: increase pool size to 50 — did not help (Stripe still slow)
14:55	Correct fix identified: add `connectionTimeout=5000` to HikariCP config
15:02	Config change deployed via ConfigMap update + rolling restart
15:10	All pods healthy; error rate returns to baseline
15:15	Incident resolved; monitoring confirmed stable
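The response intervals implied by the timeline can be worked out directly from its timestamps. The snippet below is only a convenience calculation; the helper `minutes_between` and the variable names are illustrative, and all times are the UTC values from the table.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_detect = minutes_between("14:23", "14:25")       # errors begin -> alert fires
time_to_acknowledge = minutes_between("14:25", "14:28")  # alert fires -> on-call ack
time_on_wrong_fix = minutes_between("14:42", "14:55")    # pool resize -> correct fix found
time_to_recover = minutes_between("14:23", "15:10")      # errors begin -> pods healthy

print(time_to_detect, time_to_acknowledge, time_on_wrong_fix, time_to_recover)
```

The last value matches the 47-minute duration reported in the header.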

Contributing Factors

No connection timeout — HikariCP defaults to 30 seconds, but our config explicitly set it to 0 (infinite) based on a years-old tuning guide that prioritized throughput over resilience.
No circuit breaker — The payments service had no circuit breaker on the Stripe integration, so it kept sending requests to a degraded upstream.
Synchronous dependency — Checkout blocks on payments; there is no async fallback or queue-based decoupling.
Monitoring gap — We had alerts on checkout error rate but not on payments-svc connection pool utilization.
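The first factor is the difference between a bounded and an unbounded wait for a pool slot. The sketch below is a toy stand-in built on a standard-library semaphore, not HikariCP itself; `BoundedPool` and its methods are hypothetical names. It shows the behavior a finite acquire timeout buys: when the pool is exhausted, the caller gets a fast failure it can turn into an error response instead of blocking a thread forever.

```python
import threading
from typing import Optional


class BoundedPool:
    """Toy connection pool: a fixed number of permits, nothing more."""

    def __init__(self, size: int) -> None:
        self._permits = threading.BoundedSemaphore(size)

    def acquire(self, timeout_s: Optional[float]) -> bool:
        """Return True if a slot was obtained, False if the wait timed out.

        timeout_s=None waits indefinitely, which mirrors the misconfiguration
        described in this postmortem.
        """
        if timeout_s is None:
            return self._permits.acquire()
        return self._permits.acquire(timeout=timeout_s)

    def release(self) -> None:
        self._permits.release()


pool = BoundedPool(size=2)
first = pool.acquire(timeout_s=0.05)   # slot available
second = pool.acquire(timeout_s=0.05)  # slot available
third = pool.acquire(timeout_s=0.05)   # pool exhausted: gives up after ~50 ms

pool.release()                         # one slow call finishes
retry = pool.acquire(timeout_s=0.05)   # a slot is free again
print(first, second, third, retry)
```

With `timeout_s=None`, the third call would never return while the first two slots stayed held, which is the failure mode the thread dump showed.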

Action Items

#	Action	Owner	Status
1	Set `connectionTimeout=5000` and `maximumPoolSize=30` on all HikariCP pools	Platform Team	✅ Done
2	Add circuit breaker (Resilience4j) to Stripe integration in payments-svc	Payments Team	✅ Done
3	Add Grafana alert on HikariCP active connections > 80% of pool size	SRE	✅ Done
4	Evaluate async checkout flow (publish to SQS, process payment async)	Checkout Team	🔄 In progress (Q2 target)
5	Audit all services for missing timeout configurations	Platform Team	✅ Done
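To illustrate what action item 2 is meant to achieve, here is a minimal circuit breaker sketch. It is not the Resilience4j API and not the code the Payments Team shipped; the class, its parameters, and the thresholds are assumptions chosen for illustration, and it is not thread-safe. The idea is that after a run of consecutive failures the breaker rejects calls immediately for a cool-down period, then lets a single trial call through.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the upstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock          # injectable so tests need not sleep
        self._failures = 0           # consecutive failures seen so far
        self._opened_at = None       # time the breaker last opened, or None if closed

    def call(self, fn):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("upstream marked unhealthy; failing fast")
            # Cool-down elapsed: half-open, so let this one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the breaker.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap the upstream request, for example `breaker.call(lambda: charge_card(order))`, where `charge_card` is a placeholder for the real Stripe call, and translate `CircuitOpenError` into a fast, explicit error rather than holding a connection for the duration of a slow request.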

Lessons Learned

Timeouts are not optional. Every connection pool, HTTP client, and RPC call must have an explicit timeout. "Infinite" is never the right default for production.
Pool exhaustion cascades fast. A 20-connection pool with no timeout can go from healthy to fully blocked in under 3 minutes during an upstream degradation.
Monitor pool internals, not just request outcomes. We caught the error rate spike quickly, but could have caught the pool saturation 2 minutes earlier if we'd been monitoring HikariCP metrics.
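As a concrete reading of the third lesson, the check below expresses the "active connections > 80% of pool size" condition from action item 3 as a plain function. It is only a sketch of the threshold logic; the function name and signature are illustrative and do not reflect how the Grafana alert is actually defined.

```python
def pool_saturation_alert(active: int, max_pool_size: int, threshold_pct: int = 80) -> bool:
    """True when active connections exceed threshold_pct percent of the pool.

    Integer arithmetic avoids floating-point edge cases at the boundary.
    """
    return active * 100 > max_pool_size * threshold_pct


# With the pool of 20 that was in place during the incident:
print(pool_saturation_alert(16, 20))  # exactly 80%: does not fire
print(pool_saturation_alert(17, 20))  # 85%: fires
```

For a pool of 20, this condition trips at 17 active connections, before the pool is fully occupied.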
