Incident: API Outage - Rate Limiter Misconfiguration

Security Specialist, Operations & Strategy, DevOps

This is an example incident log. Delete this file or use it as a reference when creating real incident logs.

Summary

| Field | Value |
| --- | --- |
| Status | Resolved |
| Severity | P2 |
| Start Time | 2024-01-15 14:32 UTC |
| Resolution Time | 2024-01-15 16:45 UTC |
| Affected Services | API, Frontend (degraded) |

Roles

| Role | Person |
| --- | --- |
| Detector | Monitoring (DataDog alert) |
| Incident Leader | Alice |
| Scribe | Bob |
| Communication Manager | Carol |
| Responders | Alice, Bob, Dave (infra SME) |

Communication Channels


Timeline

14:32 UTC - DataDog alert fired: API error rate >5%
14:35 UTC - Bob acknowledged alert, started investigating
14:38 UTC - Bob escalated to #incident channel, started call
14:42 UTC - Alice joined, assigned as Incident Leader
14:45 UTC - Bob assigned as Scribe
14:48 UTC - Dave (infra SME) joined
14:52 UTC - Identified: rate limiter rejecting legitimate requests
14:55 UTC - Carol joined as Communication Manager
15:05 UTC - Root cause identified: bad config deployed at 14:15 UTC
15:12 UTC - Decision: rollback config
15:18 UTC - Rollback initiated
15:25 UTC - Rollback complete, monitoring
15:40 UTC - Error rates back to normal
15:45 UTC - Carol posted update to Discord
16:00 UTC - Continued monitoring, no issues
16:45 UTC - Incident declared resolved
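
The intervals implied by the timeline can be worked out directly from its timestamps. The snippet below is a small illustrative calculation, not part of any incident tooling; the variable names and the choice of which intervals to report are assumptions, while the timestamps themselves are taken from this log.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps copied from this log (all UTC).
config_deployed = datetime.strptime("2024-01-15 14:15", FMT)
alert_fired = datetime.strptime("2024-01-15 14:32", FMT)
root_cause_found = datetime.strptime("2024-01-15 15:05", FMT)
rollback_complete = datetime.strptime("2024-01-15 15:25", FMT)
errors_normal = datetime.strptime("2024-01-15 15:40", FMT)
declared_resolved = datetime.strptime("2024-01-15 16:45", FMT)


def minutes_between(start: datetime, end: datetime) -> int:
    """Return the whole number of minutes from start to end."""
    return int((end - start).total_seconds() // 60)


# Interval names are illustrative, not defined by the incident policy.
durations = {
    "deploy_to_alert": minutes_between(config_deployed, alert_fired),
    "alert_to_root_cause": minutes_between(alert_fired, root_cause_found),
    "alert_to_rollback_complete": minutes_between(alert_fired, rollback_complete),
    "alert_to_errors_normal": minutes_between(alert_fired, errors_normal),
    "alert_to_resolved": minutes_between(alert_fired, declared_resolved),
}

for name, minutes in durations.items():
    print(f"{name}: {minutes} min")
# deploy_to_alert: 17 min
# alert_to_root_cause: 33 min
# alert_to_rollback_complete: 53 min
# alert_to_errors_normal: 68 min
# alert_to_resolved: 133 min
```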

Investigation

What We Know

  • Elevated 429 errors from the API were first detected at 14:32 UTC (DataDog alert)
  • Rate limiter was updated at 14:15 UTC as part of a routine config change
  • The new config set the threshold to 10 req/min instead of the intended 1000 req/min (typo)
  • Affected all API endpoints
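
To make the effect of the threshold concrete, here is a minimal sketch of a fixed-window rate limiter. This log does not say which algorithm the real limiter uses or whether the limit is global or per client, so `FixedWindowLimiter`, `rejected_fraction`, and the traffic figure of 200 requests per minute are all hypothetical and chosen only for illustration.

```python
class FixedWindowLimiter:
    """Hypothetical fixed-window limiter; the real limiter's algorithm
    is not stated in this log."""

    def __init__(self, limit_per_minute: int) -> None:
        self.limit = limit_per_minute
        self.window = None  # index of the current one-minute window
        self.count = 0      # requests seen in the current window

    def check(self, now_seconds: float) -> int:
        """Return the HTTP status for a request arriving at now_seconds."""
        window = int(now_seconds // 60)
        if window != self.window:
            # A new minute has started: reset the counter.
            self.window = window
            self.count = 0
        self.count += 1
        return 200 if self.count <= self.limit else 429


def rejected_fraction(limit_per_minute: int, requests_in_minute: int) -> float:
    """Send evenly spaced requests within one minute and return the
    share that would be answered with 429."""
    limiter = FixedWindowLimiter(limit_per_minute)
    statuses = [
        limiter.check(i * 60 / requests_in_minute)
        for i in range(requests_in_minute)
    ]
    return statuses.count(429) / len(statuses)


# Assumed traffic of 200 req/min (illustrative, not measured).
print(rejected_fraction(10, 200))    # limit from the bad config
print(rejected_fraction(1000, 200))  # intended limit
```

Under these assumptions, the limit of 10 rejects 190 of the 200 requests, while the intended limit of 1000 rejects none, which matches the "429 errors for most requests" impact recorded for the API.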

Affected Services

| Service | Impact | Status |
| --- | --- | --- |
| API | 429 errors for most requests | Resolved |
| Frontend | Degraded (API calls failing) | Resolved |
| Smart Contracts | No impact | N/A |

Root Cause (initial assessment)

Typo in the rate limiter configuration: the requests-per-minute threshold was set to 10 instead of 1000.
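
A typo of this kind changes the threshold by a factor of 100, which a simple pre-deploy sanity check could flag. The function below is only a sketch of that idea: `threshold_change_is_suspicious` and the default `max_factor` of 10 are assumptions for illustration, and nothing in this log says such a check exists or has been agreed as an action item.

```python
def threshold_change_is_suspicious(
    old: int, new: int, max_factor: float = 10.0
) -> bool:
    """Flag a change when the new value differs from the old one by more
    than max_factor in either direction (hypothetical guard)."""
    if old <= 0 or new <= 0:
        # Non-positive limits are treated as invalid in this sketch.
        return True
    ratio = max(old, new) / min(old, new)
    return ratio > max_factor


# The change from this incident: 1000 req/min replaced by 10 req/min.
print(threshold_change_is_suspicious(old=1000, new=10))    # True
# A modest adjustment stays below the assumed factor of 10.
print(threshold_change_is_suspicious(old=1000, new=1200))  # False
```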


Actions

Immediate

  • Identify source of errors @Dave
  • Find recent changes @Dave
  • Prepare rollback @Dave

Resolution

  • Rollback config change @Dave
  • Verify API functioning @Bob
  • Post community update @Carol

Resolution Summary

Mitigation Applied

Rolled back the rate limiter configuration to the previous known-good version.

Verification

  • API error rate back to baseline (<0.1%)
  • Sample API calls succeeding
  • No user reports of issues post-fix
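
The two thresholds mentioned in this log (alert above 5%, baseline below 0.1%) can be expressed as a small check over sampled responses. This is a sketch under stated assumptions: which status codes count as errors is not defined here, so treating 429 and 5xx as errors is an assumption, and `error_rate` and `classify` are hypothetical helper names rather than part of any monitoring setup.

```python
ALERT_THRESHOLD = 0.05  # alert fired when the error rate exceeded 5%
BASELINE = 0.001        # verification target: error rate below 0.1%


def error_rate(status_codes: list[int]) -> float:
    """Share of responses counted as errors (assumption: 429 and 5xx)."""
    if not status_codes:
        raise ValueError("need at least one response to compute a rate")
    errors = sum(1 for code in status_codes if code == 429 or code >= 500)
    return errors / len(status_codes)


def classify(rate: float) -> str:
    """Map an error rate onto the two thresholds from this log."""
    if rate > ALERT_THRESHOLD:
        return "alerting"
    if rate < BASELINE:
        return "baseline"
    return "elevated"


# Illustrative samples, not real traffic from the incident.
during_incident = [429] * 95 + [200] * 5
after_rollback = [200] * 1999 + [503]

print(classify(error_rate(during_incident)))  # alerting
print(classify(error_rate(after_rollback)))   # baseline
```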

Communications Sent

| Time | Channel | Summary |
| --- | --- | --- |
| 15:00 UTC | Discord #announcements | "Investigating API issues" |
| 15:45 UTC | Discord #announcements | "API issues resolved" |
| 16:00 UTC | Twitter | Brief update (optional) |

Post-Incident

  • Post-mortem scheduled for: 2024-01-17 15:00 UTC
  • Post-mortem document created
  • Action items assigned (pending post-mortem)

Links & Evidence

  • DataDog dashboard: [link]
  • Config PR that caused issue: [link]
  • Rollback PR: [link]

See Incident-Response-Policy for severity definitions and process.