Lessons Learned

Security Specialist, Operations & Strategy, DevOps, SRE

Authored by: Dickson Wu, SEAL

A post-incident review that identifies lessons learned improves your project's incident response capabilities. Analyzing what went well and what could be improved leaves the team better prepared for future incidents.

Principles

The purpose of a post-incident review is improvement, not blame. The review should focus on:

What happened
Why the incident was possible
Why the response unfolded the way it did
What changes will materially reduce the chance or impact of recurrence

The best reviews distinguish between the primary cause, contributing factors, and response quality instead of collapsing everything into one narrative.

Best Practices

Shortly after the incident is resolved, review it together with everyone involved in handling it.
Record details about the incident, including the timeline, root cause, impact, and response efforts.
Assess the effectiveness of the incident response, highlighting areas where the team performed well and areas needing improvement.
Create action plans to address identified weaknesses and enhance strengths. Assign responsibilities and deadlines for implementing improvements.
Share the lessons learned with the ecosystem to promote awareness and improve overall security practices.
Revise incident response policies and procedures based on the lessons learned to ensure continuous improvement.

What a useful post-incident review captures

At minimum, the review should cover:

Summary: what happened, when it started, how long it lasted, and how it was resolved
Impact: services, users, funds, partners, or operations affected
Timeline: a UTC sequence of events and decisions
Root cause: the underlying failure that made the incident possible
Contributing factors: process, tooling, staffing, monitoring, or design issues that worsened the event
What went well: capabilities worth preserving
What went poorly: failures in readiness or execution
Where the team got lucky: conditions that helped this time but should not be relied upon
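
The sections above can be kept as a structured record so that gaps are easy to spot before the review is circulated. The following is a minimal sketch, not a prescribed format: the class names, field names, and helper methods are assumptions made for illustration, and timeline entries are assumed to carry timezone-aware timestamps.

```python
from __future__ import annotations

from dataclasses import dataclass, field, fields
from datetime import datetime, timezone


@dataclass
class TimelineEntry:
    at: datetime  # assumed timezone-aware; converted to UTC when rendered
    note: str


@dataclass
class PostIncidentReview:
    # Hypothetical field names that mirror the sections listed above.
    summary: str = ""
    impact: str = ""
    timeline: list[TimelineEntry] = field(default_factory=list)
    root_cause: str = ""
    contributing_factors: list[str] = field(default_factory=list)
    went_well: list[str] = field(default_factory=list)
    went_poorly: list[str] = field(default_factory=list)
    got_lucky: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the names of sections that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def timeline_utc(self) -> list[str]:
        """Render the timeline in chronological order as UTC ISO 8601 lines."""
        ordered = sorted(self.timeline, key=lambda entry: entry.at)
        return [
            f"{entry.at.astimezone(timezone.utc).isoformat()} {entry.note}"
            for entry in ordered
        ]
```

Calling missing_sections() on a draft gives a quick checklist of what still needs to be written, and timeline_utc() normalizes entries recorded in different local time zones into one UTC sequence.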

Action item quality

Corrective actions should be concrete enough to track. Good action items:

have a clear owner
have a deadline
describe a specific change
are tied to a real weakness observed during the incident

Bad action items are vague or purely aspirational, such as "communicate better" or "be more careful."
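
The criteria above can be turned into a simple check that runs before action items are accepted into a tracker. This is a rough sketch under stated assumptions: the ActionItem fields, the problems helper, and the list of vague phrases are all hypothetical and would need tuning for a real team. A substring match cannot judge whether a change is truly specific, so it only catches the obvious cases.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical phrases for illustration, taken from the examples above.
VAGUE_PHRASES = ("communicate better", "be more careful")


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    deadline: Optional[date] = None
    weakness: Optional[str] = None  # the observed weakness this item addresses


def problems(item: ActionItem) -> list[str]:
    """Return the reasons an action item is not yet concrete enough to track."""
    found = []
    if not item.owner:
        found.append("no owner")
    if item.deadline is None:
        found.append("no deadline")
    if not item.weakness:
        found.append("not tied to an observed weakness")
    text = item.description.lower()
    if any(phrase in text for phrase in VAGUE_PHRASES):
        found.append("vague wording")
    return found
```

An item such as ActionItem("Communicate better during incidents") fails all four checks, while an item with an owner, a deadline, a linked weakness, and a specific description returns an empty list.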

Questions worth asking

Was the incident detected as quickly as it should have been?
Did the severity level reflect actual impact?
Were the right people involved early enough?
Did the team have an appropriate runbook, or was too much improvised during the response?
Was the external communication cadence appropriate?
Did logs, dashboards, and evidence collection support investigation effectively?
What should change in monitoring, alerting, staffing, or access controls?
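
The question of detection speed is easier to answer with numbers taken from the timeline. The sketch below computes a few basic intervals from three timestamps. The function name, the choice of intervals, and the sample timestamps are assumptions for illustration only, and whether a given time to detect is acceptable remains a judgment for the team.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


def response_intervals(
    started: datetime, detected: datetime, resolved: datetime
) -> dict[str, timedelta]:
    """Compute basic response intervals from three timezone-aware timestamps."""
    for ts in (started, detected, resolved):
        if ts.tzinfo is None:
            raise ValueError("timestamps must be timezone-aware")
    if not started <= detected <= resolved:
        raise ValueError("expected started <= detected <= resolved")
    return {
        "time_to_detect": detected - started,
        "time_to_resolve": resolved - detected,
        "total_duration": resolved - started,
    }


# Illustrative timestamps only, not taken from a real incident.
started = datetime(2024, 3, 1, 10, 0, tzinfo=timezone.utc)
detected = datetime(2024, 3, 1, 10, 45, tzinfo=timezone.utc)
resolved = datetime(2024, 3, 1, 13, 0, tzinfo=timezone.utc)

intervals = response_intervals(started, detected, resolved)
print(intervals["time_to_detect"])  # 0:45:00
```

Requiring timezone-aware timestamps avoids silently mixing local times from responders in different regions, which is one reason to keep the timeline in UTC.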

When to hold the review

Run the review soon enough that details are still fresh, but after the situation is actually stable. For significant incidents, many teams find it useful to hold the review within about a week of resolution.

Outputs beyond the write-up

A good retrospective often drives updates to:

playbooks and runbooks
alert thresholds and monitoring coverage
access control or break-glass procedures
communication templates
training and tabletop exercises

The review should end with tangible changes, not just a document.