Incident Response Policy

Overview

This policy defines how we respond to security and operational incidents. It covers roles, severity classification, and the steps from detection through post-incident review.

For role structures and on-call options, see Roles-and-Staffing.

Roles

One person can hold multiple roles. Roles can be reassigned during an incident as needed.

Detector

The person who identifies the incident. Their job is to notify responders and hand off. They don't need to fix it.

Incident Leader

Coordinates the response. Assigns tasks, makes decisions, ensures the process is followed. Escalates to Decision Makers when needed.

Scribe

Documents everything in the Incident Log. Timestamps every entry in UTC and captures decisions with their rationale. This is a dedicated role: don't also assign the Scribe to fix things.
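As an illustration of what the Scribe captures, here is a minimal sketch of an Incident Log in Python. The class and field names (`IncidentLog`, `LogEntry`, `rationale`) are hypothetical and not part of this policy; the only requirements taken from the text are UTC timestamps and recording decisions with their rationale.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LogEntry:
    timestamp: datetime  # timezone-aware, always UTC
    author: str
    event: str
    rationale: str = ""  # why a decision was made, if the entry records one


@dataclass
class IncidentLog:
    entries: list[LogEntry] = field(default_factory=list)

    def record(self, author: str, event: str, rationale: str = "") -> LogEntry:
        """Append an entry stamped with the current UTC time."""
        entry = LogEntry(datetime.now(timezone.utc), author, event, rationale)
        self.entries.append(entry)
        return entry

    def render(self) -> str:
        """Return the log as plain text, one line per entry, oldest first."""
        lines = []
        for e in self.entries:
            stamp = e.timestamp.strftime("%Y-%m-%d %H:%M:%SZ")
            line = f"{stamp} [{e.author}] {e.event}"
            if e.rationale:
                line += f" (rationale: {e.rationale})"
            lines.append(line)
        return "\n".join(lines)
```

In practice a shared document or a pinned thread in the incident channel serves the same purpose; what matters is that every entry carries a UTC timestamp and that decisions are logged together with the reasoning behind them.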

Communication Manager

Handles internal and external communications. Drafts updates, coordinates with PR if needed, manages community channels.

Subject Matter Experts (SMEs)

Technical specialists called in based on incident type (smart contracts, infrastructure, security, etc.).

Decision Makers

Senior leadership for high-stakes decisions. Define who these are for your protocol (founders, security lead, legal, etc.).

Severity Levels

When in doubt, choose the higher severity. A false P1 creates noise. A missed P1 costs funds.

P1 - Critical

Aspect	Details
Impact	Loss of funds, critical systems down, active exploit
Response	Immediate. Core team. Scale as needed. Decision Makers involved.
Examples	Active exploit, private key compromise, critical smart contract vulnerability, production down

P2 - High

Aspect	Details
Impact	High impact to production, potential fund loss under specific conditions
Response	Immediate
Examples	Major vulnerability (not actively exploited), significant outage, DDoS on core services

P3 - Moderate

Aspect	Details
Impact	Medium impact; fund loss unlikely
Response	Within hours
Examples	Minor vulnerability, degraded performance, non-critical service down

P4 - Low

Aspect	Details
Impact	Low impact, no fund loss
Response	Can be scheduled
Examples	Minor bugs, display issues, non-urgent fixes

P5 - Info

Aspect	Details
Impact	Informational, often from automated systems
Response	No immediate action
Examples	Expiring certificates, resource spikes, maintenance notices
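If you encode these levels in tooling (alert routing, ticket labels), the "when in doubt, choose the higher severity" rule can be sketched as below. This is an assumed representation, not a prescribed one: the names `Severity`, `RESPONSE`, and `pick_severity` are hypothetical, and the response strings are copied from the tables above.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Lower number = more severe, matching the P1..P5 labels.
    P1 = 1  # Critical
    P2 = 2  # High
    P3 = 3  # Moderate
    P4 = 4  # Low
    P5 = 5  # Info


# Expected response per level, as stated in the severity tables.
RESPONSE = {
    Severity.P1: "Immediate. Core team. Scale as needed. Decision Makers involved.",
    Severity.P2: "Immediate",
    Severity.P3: "Within hours",
    Severity.P4: "Can be scheduled",
    Severity.P5: "No immediate action",
}


def pick_severity(*candidates: Severity) -> Severity:
    """When unsure between levels, return the most severe one considered."""
    if not candidates:
        raise ValueError("at least one candidate severity is required")
    return min(candidates)
```

For example, a responder torn between P2 and P3 would call `pick_severity(Severity.P3, Severity.P2)` and get `Severity.P2`. The level can still be lowered later, when severity is confirmed or adjusted during investigation.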

Response Process

Step 1: Detection

Incidents can be detected via:

Monitoring alerts (Grafana, DataDog, on-chain monitors, etc.)
Community reports (Discord, Twitter, Telegram)
Team members noticing something wrong
Bug bounty reports
Security audits
Partner notifications

The Detector's job: get the right people involved, fast.

Don't know who to call? Contact your incident response on-call team (e.g., DevOps or SecOps) through the on-call system or a team-wide email address (e.g., team-security@company.com). These are the fallbacks when the Detector is unfamiliar with escalation paths.

Step 2: Coordination

Detector responsibilities:

Start a call (Zoom/Meet/Huddle)
Create a private channel (#incident-[brief-description])
Alert responders via your alerting system or direct contact
Provide all known information
Hand off to an Incident Leader (get explicit acknowledgment)

For P1 incidents: Keep the group minimal initially. Alert Decision Makers immediately.
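To keep channel names consistent under pressure, the #incident-[brief-description] pattern can be generated rather than typed by hand. The helper below is a sketch: `incident_channel_name` is a hypothetical function, and the 80-character cap is an assumption you should replace with the actual limit of your chat tool.

```python
import re


def incident_channel_name(description: str, max_length: int = 80) -> str:
    """Build an 'incident-<slug>' channel name from a brief description.

    The leading '#' is left out because chat tools usually add it on display.
    max_length is an assumed limit; adjust it for your chat tool.
    """
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    if not slug:
        raise ValueError("description must contain at least one letter or digit")
    name = f"incident-{slug}"
    # Truncate, and avoid ending on a dangling hyphen after the cut.
    return name[:max_length].rstrip("-")
```

For instance, `incident_channel_name("RPC outage: mainnet nodes")` gives `incident-rpc-outage-mainnet-nodes`. For an active exploit, keep the description neutral if channel names are visible outside the response group.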

Incident Leader responsibilities:

Pull in relevant SMEs
Assign a Scribe, who creates the Incident Log
Assign Communication Manager(s)

Step 3: Investigation

Goal: Understand what's happening and assess impact.

Collect logs, error messages, reproduction steps
Identify affected services and scope of impact
Confirm or adjust severity level
Determine mitigation options

The Incident Leader assigns specific tasks to individuals. One task per person at a time keeps things focused.
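The one-task-per-person rule is simple to track by hand, but if the Incident Leader uses a script or bot for assignments, it might look like the sketch below. `TaskBoard` and its methods are hypothetical names used only for illustration.

```python
class TaskBoard:
    """Tracks who is working on what, allowing one active task per person."""

    def __init__(self) -> None:
        self._active: dict[str, str] = {}  # person -> current task

    def assign(self, person: str, task: str) -> None:
        """Give a task to someone who has no active task."""
        if person in self._active:
            raise ValueError(
                f"{person} is already working on: {self._active[person]}"
            )
        self._active[person] = task

    def complete(self, person: str) -> str:
        """Mark the person's current task as done and return its description."""
        return self._active.pop(person)

    def active(self) -> dict[str, str]:
        """Return a copy of the current assignments."""
        return dict(self._active)
```

Rejecting a second assignment forces an explicit choice: finish or hand off the current task first, or pull in another responder.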

Step 4: Resolution

Goal: Stop the bleeding first; make the permanent fix later.

If a temporary fix (rollback, pause, disable feature) is faster than a full fix and reduces damage, do that first.

Checklist:

Apply temporary mitigation
Verify it's working
Notify stakeholders
Plan permanent fix with owner and timeline

See Runbooks for step-by-step guides for specific incident types.

Step 5: Monitoring

Goal: Confirm the fix actually worked.

Verify immediately after deployment
Monitor for at least a week
Consider adding new alerts or test cases
Document what monitoring is now in place

Step 6: Post-Incident Review

Goal: Learn and prevent recurrence.

Incident Leader schedules post-mortem (within a week of resolution)
Scribe prepares Post-Mortem draft
Team reviews timeline, identifies root causes, captures lessons
Define action items with owners and deadlines
Share with team (and community if appropriate)

Every action item must have an owner and a deadline. Track each one to completion.
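A lightweight check can flag action items that break this rule before the post-mortem is published. The sketch below assumes action items are kept as simple records; `ActionItem`, `incomplete_items`, and `overdue_items` are hypothetical names, and most issue trackers can enforce the same thing with required fields.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    title: str
    owner: Optional[str] = None
    deadline: Optional[date] = None
    done: bool = False


def incomplete_items(items: list[ActionItem]) -> list[str]:
    """Titles of items that are missing an owner or a deadline."""
    return [i.title for i in items if not i.owner or i.deadline is None]


def overdue_items(items: list[ActionItem], today: date) -> list[str]:
    """Titles of open items whose deadline has already passed."""
    return [
        i.title
        for i in items
        if not i.done and i.deadline is not None and i.deadline < today
    ]
```

Running `incomplete_items` when the post-mortem is drafted and `overdue_items` at regular intervals afterwards covers both halves of the rule: nothing is left unowned, and nothing quietly slips past its date.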

Communication Guidelines

Internal: Regular updates in the incident channel. Frequency depends on severity.
External: The Communication Manager drafts updates and gets approval before posting. See Communications for examples.
Transparency: Default to sharing post-mortems publicly (redacting sensitive details).

Document Control

Version	Date	Author	Changes
1.0	[DATE]	[AUTHOR]	Initial release

Customize this policy for your protocol's tools, team structure, and communication channels.