# Incident response
What to do when production breaks. This page is deliberately short: during an incident, nobody reads long pages.
## Severity levels
| Severity | Example | Response time | Who leads |
|---|---|---|---|
| SEV1 | OMS down, Shopify checkout broken, mass customer-facing failure | Within 15 minutes | Ray |
| SEV2 | Single supplier API failing, one channel’s orders delayed | Within 1 hour | On-call / CS lead |
| SEV3 | Chatbot responses wrong, handbook site slow | Within 24 hours | Tech |
| SEV4 | Cosmetic issues, minor UX bugs | Next working day | Normal process |
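The table above can also be encoded in code so alerting or paging tools use the same definitions. A minimal sketch, assuming nothing beyond the table itself; the `SEVERITY` mapping and `response_deadline` helper are illustrative names, not an existing tool.

```python
from datetime import datetime, timedelta

# Response-time targets and incident leads, mirroring the severity table.
SEVERITY = {
    "SEV1": {"respond_within": timedelta(minutes=15), "lead": "Ray"},
    "SEV2": {"respond_within": timedelta(hours=1), "lead": "On-call / CS lead"},
    "SEV3": {"respond_within": timedelta(hours=24), "lead": "Tech"},
    "SEV4": {"respond_within": timedelta(days=1), "lead": "Normal process"},
}

def response_deadline(severity: str, detected_at: datetime) -> datetime:
    """Return the time by which someone must have responded."""
    return detected_at + SEVERITY[severity]["respond_within"]
```

Keeping the targets in one place means a paging rule ("SEV1 unacknowledged past its deadline") can be checked mechanically instead of by memory.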
## First 5 minutes of a SEV1
- Post in Slack #general: "SEV1 in progress: [one sentence description]. Ray leading."
- Acknowledge in customer channels: if customers are affected, post a status update on social / email / chatbot
- Identify what changed in the last 24 hours (deploys, config, external vendor status)
- Start a shared incident doc in the #incidents channel (pinned; everyone watches)
- Decide: fix forward or roll back?
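The announcement in the first step is worth templating so it is copy-pasteable under stress. A small sketch; the function name is illustrative, and only the message shape comes from the checklist above.

```python
def sev1_announcement(description: str, lead: str = "Ray") -> str:
    """Build the #general Slack message for a SEV1, per the checklist."""
    return f"SEV1 in progress: {description}. {lead} leading."
```

For example, `sev1_announcement("OMS down")` produces `SEV1 in progress: OMS down. Ray leading.`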
## Don’ts
- Don’t blame. We run blameless incidents: “What system allowed this to happen?” beats “Who did this?”
- Don’t fix alone unless the fix is obvious and <5 minutes. Get someone to pair, even async.
- Don’t forget to update customers. A 2-minute status post calms 100 support tickets.
- Don’t skip the post-mortem. Every SEV1 and SEV2 gets a written review within 48 hours.
## Post-mortem template
For every SEV1 and SEV2, write one within 48 hours. Save to /operations/decisions/incident-YYYY-MM-DD.mdx.
Sections:
- What happened: plain-English timeline
- Impact: customers affected, revenue impact, duration
- Root cause: the technical and process causes
- What went well: things that worked in our favor
- What didn’t: gaps we found
- Action items: owner + deadline for each, tracked in Linear
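The sections above can be kept as a file skeleton so every post-mortem starts from the same structure. A sketch of what `/operations/decisions/incident-YYYY-MM-DD.mdx` could look like; the front-matter fields are an assumption, and only the section headings come from the template above.

```markdown
---
severity: SEV1
date: YYYY-MM-DD
---

## What happened
Plain-English timeline.

## Impact
Customers affected, revenue impact, duration.

## Root cause
The technical and process causes.

## What went well
Things that worked in our favor.

## What didn’t
Gaps we found.

## Action items
- [ ] Action (owner, deadline; tracked in Linear)
```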
## Incident log
Record every SEV1 and SEV2 here, with a link to its post-mortem.
No incidents logged yet.