Incident Response Policy
Overview
This policy defines how we respond to security and operational incidents. It covers roles, severity classification, and the steps from detection through post-incident review.
For role structures and on-call options, see Roles-and-Staffing.
Roles
One person can hold multiple roles. Roles can be reassigned during an incident as needed.
Detector
The person who identifies the incident. Their job is to notify responders and hand off. They don't need to fix it.
Incident Leader
Coordinates the response. Assigns tasks, makes decisions, ensures the process is followed. Escalates to Decision Makers when needed.
Scribe
Documents everything in the Incident Log. Maintains timestamps (UTC), captures decisions and rationale. This is a focused role. Don't also assign the Scribe to fix things.
Communication Manager
Handles internal and external communications. Drafts updates, coordinates with PR if needed, manages community channels.
Subject Matter Experts (SMEs)
Technical specialists called in based on incident type (smart contracts, infrastructure, security, etc.).
Decision Makers
Senior leadership for high-stakes decisions. Define who these are for your protocol (founders, security lead, legal, etc.).
Severity Levels
When in doubt, choose the higher severity. A false P1 creates noise. A missed P1 costs funds.
P1 - Critical
| Aspect | Details |
|---|---|
| Impact | Loss of funds, critical systems down, active exploit |
| Response | Immediate. Core team. Scale as needed. Decision Makers involved. |
| Examples | Active exploit, private key compromise, critical smart contract vulnerability, production down |
P2 - High
| Aspect | Details |
|---|---|
| Impact | High impact to production, potential fund loss under specific conditions |
| Response | Immediate |
| Examples | Major vulnerability (not actively exploited), significant outage, DDoS on core services |
P3 - Moderate
| Aspect | Details |
|---|---|
| Impact | Medium impact, no fund loss likely |
| Response | Within hours |
| Examples | Minor vulnerability, degraded performance, non-critical service down |
P4 - Low
| Aspect | Details |
|---|---|
| Impact | Low impact, no fund loss |
| Response | Can be scheduled |
| Examples | Minor bugs, display issues, non-urgent fixes |
P5 - Info
| Aspect | Details |
|---|---|
| Impact | Informational, often from automated systems |
| Response | No immediate action |
| Examples | Expiring certificates, resource spikes, maintenance notices |
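The "when in doubt, choose the higher severity" rule can be sketched in code. This is a minimal illustration only: the `Severity` enum, the `RESPONSE` mapping, and the `pick_severity` helper are hypothetical names, and the response strings are copied from the tables above.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from this policy; a lower number is more severe."""
    P1 = 1  # Critical
    P2 = 2  # High
    P3 = 3  # Moderate
    P4 = 4  # Low
    P5 = 5  # Info


# Expected response per level, taken from the tables in this section.
RESPONSE = {
    Severity.P1: "Immediate. Core team. Scale as needed. Decision Makers involved.",
    Severity.P2: "Immediate",
    Severity.P3: "Within hours",
    Severity.P4: "Can be scheduled",
    Severity.P5: "No immediate action",
}


def pick_severity(candidates):
    """Return the most severe of the plausible levels (hypothetical helper)."""
    levels = list(candidates)
    if not levels:
        raise ValueError("at least one candidate severity is required")
    return min(levels)  # lowest number wins, i.e. the higher severity


# A responder unsure between P2 and P3 treats the incident as P2.
chosen = pick_severity([Severity.P3, Severity.P2])
print(chosen.name, "->", RESPONSE[chosen])
```

Using an ordered enum keeps the tie-break rule to one line and makes it hard to accidentally round down.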
Response Process
Step 1: Detection
Incidents can be detected via:
- Monitoring alerts (Grafana, Datadog, on-chain monitors, etc.)
- Community reports (Discord, Twitter, Telegram)
- Team members noticing something wrong
- Bug bounty reports
- Security audits
- Partner notifications
The Detector's job: get the right people involved, fast.
Don't know who to call? Contact your incident response on-call team (e.g., DevOps or SecOps) via the on-call system or a team-wide email address (e.g., team-security@company.com). These serve as fallbacks when the Detector is unfamiliar with escalation paths.
Step 2: Coordination
Detector responsibilities:
- Start a call (Zoom/Meet/Huddle)
- Create a private channel (#incident-[brief-description])
- Alert responders via your alerting system or direct contact
- Provide all known information
- Hand off to an Incident Leader (get explicit acknowledgment)
For P1 incidents: Keep the group minimal initially. Alert Decision Makers immediately.
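The channel naming pattern `#incident-[brief-description]` is easier to follow consistently if the name is generated rather than typed by hand. The helper below is a sketch under assumptions: `incident_channel_name` is a hypothetical function, and the 80-character default limit is a guess that should be adjusted to whatever your chat tool allows.

```python
import re


def incident_channel_name(description: str, max_len: int = 80) -> str:
    """Build a channel name like '#incident-oracle-price-feed-stale'.

    max_len is an assumed limit, not a value taken from this policy.
    """
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    if not slug:
        raise ValueError("description must contain at least one letter or digit")
    # Truncate, then drop any hyphen left dangling at the cut point.
    return f"#incident-{slug}"[:max_len].rstrip("-")


print(incident_channel_name("Oracle price feed stale!"))
```

Keep the description short and free of sensitive detail, since channel names may be visible to people outside the response group.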
Incident Leader responsibilities:
- Pull in relevant SMEs
- Assign a Scribe → they create an Incident Log
- Assign Communication Manager(s)
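One possible shape for the Scribe's Incident Log entries is shown below, with UTC timestamps and room for decisions and rationale as described under Roles. The `LogEntry` class and its field names are hypothetical; the Incident Log Template remains the reference for what to record.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class LogEntry:
    """A single Incident Log line (illustrative structure, not the template)."""
    author: str
    event: str
    rationale: str = ""
    # Timestamps default to the current time in UTC, as the policy requires.
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def render(self) -> str:
        # Normalize to UTC in case a caller passed another time zone.
        stamp = self.timestamp.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%SZ")
        line = f"{stamp} [{self.author}] {self.event}"
        return f"{line} (rationale: {self.rationale})" if self.rationale else line


entry = LogEntry("scribe", "Paused deposits", "limit exposure while investigating")
print(entry.render())
```

Making entries immutable (`frozen=True`) reflects that a log is append-only: corrections are new entries, not edits.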
Step 3: Investigation
Goal: Understand what's happening and assess impact.
- Collect logs, error messages, reproduction steps
- Identify affected services and scope of impact
- Confirm or adjust severity level
- Determine mitigation options
The Incident Leader assigns specific tasks to individuals. One task per person at a time keeps things focused.
Step 4: Resolution
Goal: Stop the bleeding first; apply the permanent fix later.
If a temporary fix (rollback, pause, disable feature) is faster than a full fix and reduces damage, do that first.
Checklist:
- Apply temporary mitigation
- Verify it's working
- Notify stakeholders
- Plan permanent fix with owner and timeline
See Runbooks for step-by-step guides for specific incident types.
Step 5: Monitoring
Goal: Confirm the fix actually worked.
- Verify immediately after deployment
- Monitor for at least a week
- Consider adding new alerts or test cases
- Document what monitoring is now in place
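The "at least a week" monitoring window can be checked mechanically before an incident is closed. This is a small sketch; `MIN_MONITORING` and `monitoring_complete` are hypothetical names, and the check assumes timezone-aware UTC timestamps.

```python
from datetime import datetime, timedelta, timezone

# Minimum monitoring period from this policy: at least a week.
MIN_MONITORING = timedelta(days=7)


def monitoring_complete(fix_deployed_at, now=None):
    """Return True once the fix has been observed for the minimum period."""
    now = now or datetime.now(timezone.utc)
    return now - fix_deployed_at >= MIN_MONITORING


deployed = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(monitoring_complete(deployed, now=datetime(2024, 1, 5, 12, 0, tzinfo=timezone.utc)))
```

The window is a floor, not a target: extend it if the fix touches code paths that are exercised rarely.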
Step 6: Post-Incident Review
Goal: Learn and prevent recurrence.
- Incident Leader schedules post-mortem (within a week of resolution)
- Scribe prepares Post-Mortem draft
- Team reviews timeline, identifies root causes, captures lessons
- Define action items with owners and deadlines
- Share with team (and community if appropriate)
All action items must have owners and deadlines. Track them to completion.
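The owner-and-deadline rule lends itself to a simple automated check during post-mortem follow-up. The sketch below is illustrative: `ActionItem` and `problems` are hypothetical names, and a real tracker would likely live in your issue system rather than in a script.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    """A post-mortem action item (illustrative structure)."""
    title: str
    owner: Optional[str] = None
    deadline: Optional[date] = None
    done: bool = False


def problems(item: ActionItem, today: date) -> List[str]:
    """List the ways an item violates the owner/deadline rule."""
    issues = []
    if not item.owner:
        issues.append("no owner")
    if item.deadline is None:
        issues.append("no deadline")
    elif not item.done and item.deadline < today:
        issues.append("overdue")
    return issues


items = [
    ActionItem("Add alert for stale oracle"),
    ActionItem("Rotate deployer key", owner="bob", deadline=date(2024, 1, 5)),
]
for item in items:
    print(item.title, problems(item, today=date(2024, 1, 10)))
```

Running a check like this on a schedule is one way to track items to completion instead of relying on memory.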
Communication Guidelines
- Internal: Regular updates in the incident channel. Frequency depends on severity.
- External: The Communication Manager drafts updates and gets approval before posting. See Communications for examples.
- Transparency: Default to sharing post-mortems publicly (redacting sensitive details).
Related Documents
- Roles-and-Staffing - Team structure options
- Communications - Public announcement guidance
- Incident Log Template - Fill out during incidents
- Post-Mortem Template - Complete after incidents
- Contacts - Critical contacts
- Runbooks - Step-by-step incident guides
Document Control
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | [DATE] | [AUTHOR] | Initial release |
Customize this policy for your protocol's tools, team structure, and communication channels.