# Incident response
What to do when production breaks. This page is deliberately short: during an incident, nobody reads long pages.
## Severity levels
| Severity | Example | Response time | Who leads |
|---|---|---|---|
| SEV1 | OMS down, Shopify checkout broken, mass customer-facing failure | Within 15 minutes | Ray |
| SEV2 | Single supplier API failing, one channel’s orders delayed | Within 1 hour | On-call / CS lead |
| SEV3 | Chatbot responses wrong, handbook site slow | Within 24 hours | Tech |
| SEV4 | Cosmetic issues, minor UX bugs | Next working day | Normal process |
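The table above can also be encoded in code so alerting or paging tools use the same definitions. A minimal sketch, assuming nothing beyond the table itself; the `SEVERITY` mapping and `response_deadline` helper are illustrative names, not an existing tool.

```python
from datetime import datetime, timedelta

# Response-time targets and incident leads, mirroring the severity table.
SEVERITY = {
    "SEV1": {"respond_within": timedelta(minutes=15), "lead": "Ray"},
    "SEV2": {"respond_within": timedelta(hours=1), "lead": "On-call / CS lead"},
    "SEV3": {"respond_within": timedelta(hours=24), "lead": "Tech"},
    "SEV4": {"respond_within": timedelta(days=1), "lead": "Normal process"},
}

def response_deadline(severity: str, detected_at: datetime) -> datetime:
    """Return the time by which someone must have responded."""
    return detected_at + SEVERITY[severity]["respond_within"]
```

Keeping the targets in one place means a paging rule ("SEV1 unacknowledged past its deadline") can be checked mechanically instead of by memory.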
## First 5 minutes of a SEV1
- Post in Slack #general: "SEV1 in progress: [one sentence description]. Ray leading."
- Acknowledge in customer channels: if customers are affected, post a status update on social / email / chatbot
- Identify what changed in the last 24 hours (deploys, config, external vendor status)
- Start a shared incident doc in the #incidents channel (pinned; everyone watches)
- Decide: fix forward or roll back?
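The announcement in the first step is worth templating so it is copy-pasteable under stress. A small sketch; the function name is illustrative, and only the message shape comes from the checklist above.

```python
def sev1_announcement(description: str, lead: str = "Ray") -> str:
    """Build the #general Slack message for a SEV1, per the checklist."""
    return f"SEV1 in progress: {description}. {lead} leading."
```

For example, `sev1_announcement("OMS down")` produces `SEV1 in progress: OMS down. Ray leading.`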
## Don’ts
- Don’t blame. We run blameless incidents: “What system allowed this to happen?” beats “Who did this?”
- Don’t fix alone unless the fix is obvious and <5 minutes. Get someone to pair, even async.
- Don’t forget to update customers. A 2-minute status post calms 100 support tickets.
- Don’t skip the post-mortem. Every SEV1 and SEV2 gets a written review within 48 hours.
## Post-mortem template
For every SEV1 and SEV2, write one within 48 hours. Save to /operations/decisions/incident-YYYY-MM-DD.mdx.
Sections:
- What happened: plain-English timeline
- Impact: customers affected, revenue impact, duration
- Root cause: the technical and process causes
- What went well: things that worked in our favor
- What didn’t: gaps we found
- Action items: owner + deadline for each, tracked in Linear
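The sections above can be kept as a file skeleton so every post-mortem starts from the same structure. A sketch of what `/operations/decisions/incident-YYYY-MM-DD.mdx` could look like; the front-matter fields are an assumption, and only the section headings come from the template above.

```markdown
---
severity: SEV1
date: YYYY-MM-DD
---

## What happened
Plain-English timeline.

## Impact
Customers affected, revenue impact, duration.

## Root cause
The technical and process causes.

## What went well
Things that worked in our favor.

## What didn’t
Gaps we found.

## Action items
- [ ] Action (owner, deadline; tracked in Linear)
```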
## Incident log
Record every SEV1 and SEV2 here, with a link to its post-mortem.
No incidents logged yet.