
Post-Mortem: API Outage - Rate Limiter Misconfiguration

Security Specialist | Operations & Strategy | DevOps

This is an example post-mortem. Delete this file or use it as reference when creating real post-mortems.

Metadata

| Field | Value |
| --- | --- |
| Incident Date | 2024-01-15 |
| Severity | P2 |
| Authors | Alice, Bob |
| Status | Final |
| Incident Log | 2024-01-15-example-api-outage |

Summary

On January 15, 2024, at 14:32 UTC, our API began rejecting most requests with 429 (rate limited) errors. The incident lasted approximately 2 hours and 13 minutes.

The root cause was a typo in a rate limiter configuration change deployed at 14:15 UTC. The threshold was set to 10 requests per minute instead of the intended 1000. This caused legitimate user requests to be rejected.

The incident was detected via automated monitoring and resolved by rolling back the configuration change. No funds were at risk, but approximately 3,000 users experienced failed transactions during the outage window.
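To make the failure mode concrete, here is a minimal sketch of how a per-minute threshold like ours is typically enforced. This is an illustrative fixed-window counter, not our actual limiter implementation; the class and parameter names are assumptions.

```python
import time

class FixedWindowRateLimiter:
    """Illustrative fixed-window limiter: rejects requests once a
    per-minute threshold is exceeded within the current window."""

    def __init__(self, limit_per_minute):
        self.limit = limit_per_minute
        self.window_start = 0.0
        self.count = 0

    def allow(self, now=None):
        now = time.time() if now is None else now
        # Start a fresh window once 60 seconds have elapsed.
        if now - self.window_start >= 60:
            self.window_start = now
            self.count = 0
        self.count += 1
        return self.count <= self.limit

# With the typo (10 instead of 1000), the 11th request inside a
# single minute is already rejected with a 429.
limiter = FixedWindowRateLimiter(limit_per_minute=10)
results = [limiter.allow(now=0) for _ in range(11)]
print(results[-1])  # False: request 11 of 11 is rate limited
```

With the intended value of 1000, the same traffic pattern would pass untouched, which is why the typo alone was enough to reject most legitimate requests.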


Impact

Users

  • Users affected: ~3,000
  • Duration: 2h 13m
  • Services unavailable: API (full), Frontend (degraded)

Financial

  • Funds at risk: None
  • Actual losses: None
  • Estimated costs: ~2 engineering hours for response

Reputation

  • Public visibility: Low (brief Discord posts)
  • Media coverage: None
  • Community response: Minor frustration, resolved quickly

Timeline

| Time (UTC) | Event |
| --- | --- |
| 14:15 | Config change deployed |
| 14:32 | Incident began (errors started) |
| 14:32 | Detected via DataDog alert |
| 14:38 | Response started |
| 14:42 | Incident Leader assigned |
| 15:05 | Root cause identified |
| 15:25 | Rollback complete |
| 16:45 | Incident resolved |

See Incident Log for detailed timeline.


Root Cause

Primary Cause

Human error during configuration change. The rate limit threshold was typed as 10 instead of 1000.

Contributing Factors

  1. No validation in config deployment process for obviously wrong values
  2. Config change was not tested in staging environment
  3. No gradual rollout for config changes
  4. Alert threshold (5% error rate) took 17 minutes to trigger
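Contributing factor #1 is the kind of gap a pre-deploy sanity check closes. A hedged sketch of what that check could look like; the bounds and function names here are illustrative assumptions, not our real config schema:

```python
# Hypothetical sanity bounds for a rate-limit threshold; the range is
# illustrative, not our production values.
SANE_RATE_LIMIT_RANGE = (100, 100_000)  # requests per minute

def validate_rate_limit(threshold):
    """Refuse to deploy obviously wrong rate-limit values."""
    lo, hi = SANE_RATE_LIMIT_RANGE
    if not lo <= threshold <= hi:
        raise ValueError(
            f"rate limit {threshold}/min outside sane range "
            f"[{lo}, {hi}] - refusing to deploy"
        )

validate_rate_limit(1000)  # intended value: passes
try:
    validate_rate_limit(10)  # the typo: rejected before deploy
except ValueError as err:
    print(err)
```

A check like this would have stopped the bad config at deploy time instead of 17 minutes into production traffic.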

5 Whys

| Question | Answer |
| --- | --- |
| Why did the API reject requests? | Rate limiter threshold was too low |
| Why was the threshold wrong? | Typo in configuration file |
| Why wasn't the typo caught? | No review process for config changes |
| Why is there no review process? | Config changes considered "low risk" |
| Why are they considered low risk? | Never had an incident before (survivorship bias) |

What Went Well

  1. Monitoring detected the issue within 17 minutes
  2. Team mobilized quickly once alert fired
  3. Root cause identified within 30 minutes
  4. Rollback was straightforward
  5. Communication to users was timely

What Went Wrong

  1. Config change deployed without testing
  2. No peer review for config changes
  3. Alert took 17 minutes to fire (threshold too high)
  4. No validation for obviously wrong values (10 vs 1000)

Where We Got Lucky

  1. This happened during business hours when the team was available
  2. The fix (rollback) was simple. If the previous config had also been broken, recovery would have been much harder
  3. No financial impact because this was the API layer, not smart contracts

Action Items

| Action | Owner | Deadline | Status |
| --- | --- | --- | --- |
| Add peer review requirement for config changes | Dave | 2024-01-22 | Done |
| Add config validation for rate limit thresholds | Dave | 2024-01-29 | Done |
| Lower alert threshold to 1% error rate | Bob | 2024-01-19 | Done |
| Add staging environment testing for config changes | Dave | 2024-02-15 | In Progress |
| Document config change process | Alice | 2024-01-31 | Not Started |

Lessons for Runbooks

  • Existing runbook sufficient: Third-Party-Outage (config rollback section applicable)
  • No new runbook needed

Detection

| Aspect | Details |
| --- | --- |
| How detected | Monitoring alert (DataDog) |
| Time to detection | 17 minutes |
| Could we detect faster? | Yes - lower alert threshold to 1% |
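The faster-detection fix amounts to lowering the error-rate threshold the alert evaluates. A small sketch of the arithmetic (the function names and window are assumptions; the real check lives in DataDog):

```python
def error_rate(errors, total):
    """Fraction of requests in a window that returned an error."""
    return errors / total if total else 0.0

def should_alert(errors, total, threshold=0.01):
    # Old threshold was 0.05 (5%); the action item lowered it to 0.01 (1%).
    return error_rate(errors, total) >= threshold

# A window running at 2% errors fires under the new 1% threshold,
# but would have sat below the old 5% one.
print(should_alert(20, 1000))                  # True  (new 1% threshold)
print(should_alert(20, 1000, threshold=0.05))  # False (old 5% threshold)
```

In this incident the error rate climbed gradually as the bad config propagated, so a 1% threshold would have fired well before the 5% one did.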

Links


Meeting Notes

Attendees: Alice, Bob, Carol, Dave

Discussion points:
  • Agreed config changes need same rigor as code changes
  • Discussed whether to require staging for all changes (decided yes)
  • Dave to implement validation this week

See Incident-Response-Policy for post-mortem process.