
Incident Response

Detect

  • Monitoring alerts via ntfy (push notification)
  • CI failure email to it@liflode.com → Linear issue auto-created
  • User reports via hello@liflode.com
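
As an illustration of the first detection channel, the following is a minimal sketch of how a monitoring check might publish an alert through ntfy's HTTP publish endpoint. The server URL and topic name are hypothetical placeholders, not values from this runbook; the `Title` and `Priority` headers are standard ntfy publish headers. The function only builds the request, so it can be inspected before anything is sent.

```python
import urllib.request


def build_alert_request(server_url: str, topic: str, title: str,
                        message: str, priority: str = "5") -> urllib.request.Request:
    """Build (but do not send) an ntfy publish request.

    server_url and topic are assumptions for illustration; substitute the
    values used by the real monitoring setup.
    """
    url = f"{server_url.rstrip('/')}/{topic}"
    req = urllib.request.Request(url, data=message.encode("utf-8"), method="POST")
    req.add_header("Title", title)        # notification title
    req.add_header("Priority", priority)  # ntfy priorities run from 1 (min) to 5 (max)
    return req


# Hypothetical server and topic, for illustration only.
req = build_alert_request("https://ntfy.example.com", "alerts",
                          "API down", "Health check failed 3 times")
# To actually send it: urllib.request.urlopen(req, timeout=10)
```

Keeping request construction separate from sending makes the alert payload easy to check in a unit test without network access.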

Triage

Severity  Definition                       Response time
P0        Production down, data loss risk  Immediate
P1        Major feature broken             Within 2 hours
P2        Minor feature broken             Within 1 business day
P3        Cosmetic/UX issue                Next sprint
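
The severity table can also be expressed as a small lookup, which may be handy for tooling that labels issues or computes response deadlines. This is a sketch under assumptions: "1 business day" is interpreted here as the same clock time on the next Monday-to-Friday day, and "Next sprint" is treated as having no fixed deadline. Neither interpretation is stated in the table itself.

```python
from datetime import datetime, timedelta
from typing import Optional

# Severity -> (definition, response time), copied from the triage table.
SEVERITIES = {
    "P0": ("Production down, data loss risk", "Immediate"),
    "P1": ("Major feature broken", "Within 2 hours"),
    "P2": ("Minor feature broken", "Within 1 business day"),
    "P3": ("Cosmetic/UX issue", "Next sprint"),
}


def response_deadline(severity: str, reported_at: datetime) -> Optional[datetime]:
    """Return the latest time a response is due, or None if sprint-scheduled."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    if severity == "P0":
        return reported_at  # immediate
    if severity == "P1":
        return reported_at + timedelta(hours=2)
    if severity == "P2":
        # Assumption: business days are Monday to Friday; skip weekends.
        deadline = reported_at + timedelta(days=1)
        while deadline.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            deadline += timedelta(days=1)
        return deadline
    return None  # P3: handled in the next sprint, no fixed deadline


friday_afternoon = datetime(2024, 1, 5, 15, 0)  # a Friday
p2_due = response_deadline("P2", friday_afternoon)  # falls on Monday 2024-01-08
```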

Communicate

  • Create Linear issue with severity label
  • Notify rachel@liflode.com for P0/P1
  • Add a status comment to the Linear issue every 30 minutes for P0
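
The communication rules above can be summarized as a function of severity. The sketch below is illustrative only: the `CommsPlan` type and `comms_plan` function are hypothetical names, and using the severity string itself as the Linear label is an assumption.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional, Tuple


@dataclass(frozen=True)
class CommsPlan:
    linear_label: str                      # severity label on the Linear issue
    notify: Tuple[str, ...]                # people to notify directly
    status_interval: Optional[timedelta]   # how often to post a status comment


def comms_plan(severity: str) -> CommsPlan:
    """Map a severity (P0 to P3) to the communication steps listed above."""
    if severity not in {"P0", "P1", "P2", "P3"}:
        raise ValueError(f"unknown severity: {severity}")
    # P0 and P1 incidents are escalated by direct notification.
    notify = ("rachel@liflode.com",) if severity in {"P0", "P1"} else ()
    # Only P0 requires a recurring status comment.
    interval = timedelta(minutes=30) if severity == "P0" else None
    return CommsPlan(linear_label=severity, notify=notify, status_interval=interval)


plan = comms_plan("P0")
```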

Mitigate

  • Follow the relevant service runbook
  • When in doubt: revert and restore from backup
  • Backup location: Cloudflare R2 (see restic-restore.md for restore procedure)

Post-Mortem

After resolution:

  1. Write a post-mortem doc at docs/runbooks/<service>-<date>.md
  2. Root cause analysis: what happened, why, what was missed
  3. Action items: create Linear issues for each prevention measure
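
To keep post-mortem files consistently named, a small helper could generate the path and a starting skeleton. This is a sketch, not an established tool: the function names are hypothetical, and the ISO date format (YYYY-MM-DD) for `<date>` is an assumption, since the steps above do not specify one.

```python
from datetime import date
from pathlib import PurePosixPath


def postmortem_path(service: str, day: date) -> str:
    """Build docs/runbooks/<service>-<date>.md, assuming an ISO date."""
    return str(PurePosixPath("docs/runbooks") / f"{service}-{day.isoformat()}.md")


def postmortem_skeleton(service: str, day: date, severity: str) -> str:
    """Return a plain-text skeleton covering the post-mortem steps above."""
    lines = [
        f"Post-Mortem: {service} ({day.isoformat()}, {severity})",
        "",
        "Root cause analysis",
        "  What happened:",
        "  Why it happened:",
        "  What was missed:",
        "",
        "Action items (one Linear issue per prevention measure)",
        "  1.",
    ]
    return "\n".join(lines) + "\n"


# "api" is a placeholder service name for illustration.
path = postmortem_path("api", date(2024, 1, 5))
skeleton = postmortem_skeleton("api", date(2024, 1, 5), "P1")
```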