🔒 Internal Handbook — confidential. Do not share links or content with anyone outside G-Starlink.

Incident response

What to do when production breaks. This page is kept deliberately short, because in an incident nobody reads long pages.

Severity levels

| Severity | Example | Response time | Who leads |
| --- | --- | --- | --- |
| SEV1 | OMS down, Shopify checkout broken, mass customer-facing failure | Within 15 minutes | Ray |
| SEV2 | Single supplier API failing, one channel’s orders delayed | Within 1 hour | On-call / CS lead |
| SEV3 | Chatbot responses wrong, handbook site slow | Within 24 hours | Tech |
| SEV4 | Cosmetic issues, minor UX bugs | Next working day | Normal process |
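The response-time targets above can be encoded as data, for example in an alerting or reporting script. A minimal sketch, assuming nothing about existing tooling: the `SLA_MINUTES` mapping and `within_sla` helper are hypothetical names, and SEV4's "next working day" is approximated as 24 hours.

```python
from datetime import datetime, timedelta

# Response-time SLAs from the severity table, in minutes.
# SEV4 ("next working day") is approximated as 24 hours here.
SLA_MINUTES = {"SEV1": 15, "SEV2": 60, "SEV3": 24 * 60, "SEV4": 24 * 60}

def within_sla(severity: str, opened: datetime, acknowledged: datetime) -> bool:
    """True if the incident was acknowledged within its severity's SLA."""
    return acknowledged - opened <= timedelta(minutes=SLA_MINUTES[severity])
```

A SEV1 opened at 12:00 and acknowledged at 12:10 passes; one acknowledged at 12:20 does not.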

First 5 minutes of a SEV1

  1. Post in Slack #general: “SEV1 in progress: [one sentence description]. Ray leading.”
  2. Acknowledge in customer channels: if customers are affected, post a status update on social / email / chatbot
  3. Identify what changed in the last 24 hours (deploys, config, external vendor status)
  4. Start a shared incident doc in #incidents channel — pinned, everyone watches
  5. Decide: fix forward or roll back?
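Step 1 above can be templated so the announcement wording stays consistent under pressure. A minimal sketch, assuming a Slack incoming webhook; `post_to_slack` and its webhook URL are hypothetical, not an existing G-Starlink tool.

```python
import json
import urllib.request

def sev1_announcement(description: str, leader: str = "Ray") -> str:
    """Build the standard SEV1 announcement for Slack #general."""
    return f"SEV1 in progress: {description}. {leader} leading."

def post_to_slack(webhook_url: str, text: str) -> None:
    """Send a message via a Slack incoming webhook (hypothetical URL)."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

For example, `sev1_announcement("Shopify checkout returning 500s")` yields "SEV1 in progress: Shopify checkout returning 500s. Ray leading."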

Don’ts

  • Don’t blame. Blameless culture: “What system allowed this to happen?” beats “Who did this?”
  • Don’t fix alone unless the fix is obvious and <5 minutes. Get someone to pair, even async.
  • Don’t forget to update customers. A 2-minute status post calms 100 support tickets.
  • Don’t skip the post-mortem. Every SEV1 and SEV2 gets a written review within 48 hours.

Post-mortem template

For every SEV1 and SEV2, write one within 48 hours. Save to /operations/decisions/incident-YYYY-MM-DD.mdx.

Sections:

  • What happened: plain-English timeline
  • Impact: customers affected, revenue impact, duration
  • Root cause: the technical and process causes
  • What went well: things that worked in our favor
  • What didn’t: gaps we found
  • Action items: owner + deadline for each, tracked in Linear
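The naming convention and section list above can be turned into a skeleton generator so nobody starts a post-mortem from a blank page. A minimal sketch under those assumptions; `postmortem_path` and `postmortem_skeleton` are hypothetical helpers, not an existing script.

```python
from datetime import date

# Section headings from the post-mortem template above.
SECTIONS = [
    "What happened",
    "Impact",
    "Root cause",
    "What went well",
    "What didn't",
    "Action items",
]

def postmortem_path(incident_date: date) -> str:
    """File path following the /operations/decisions naming convention."""
    return f"/operations/decisions/incident-{incident_date.isoformat()}.mdx"

def postmortem_skeleton(title: str) -> str:
    """Markdown skeleton with one heading per template section."""
    body = "\n\n".join(f"## {section}\n\nTBD" for section in SECTIONS)
    return f"# Post-mortem: {title}\n\n{body}\n"
```

Writing the result of `postmortem_skeleton(...)` to `postmortem_path(date.today())` gives a dated .mdx file with every required section stubbed out.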

Incident log

Record every SEV1 and SEV2 here with link to the post-mortem.

No incidents logged yet.