Incident Response
Detect
- Monitoring alerts via ntfy (push notification)
- CI failure emails to it@liflode.com (a Linear issue is auto-created for each)
- User reports via hello@liflode.com
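
As an illustration of the first detection channel, here is a minimal sketch of publishing an alert through ntfy's HTTP interface, where the message is the POST body and `Title` and `Priority` are request headers. The server URL, the topic name `example-alerts`, and the helper names are assumptions for illustration; the real topic and server are not stated in this runbook.

```python
import urllib.request

# Assumption: the public ntfy server. A self-hosted instance would use its own URL.
NTFY_SERVER = "https://ntfy.sh"


def build_alert(topic: str, message: str, title: str, priority: str = "default") -> urllib.request.Request:
    """Build (but do not send) an ntfy publish request."""
    return urllib.request.Request(
        f"{NTFY_SERVER}/{topic}",
        data=message.encode("utf-8"),  # ntfy reads the message from the request body
        headers={"Title": title, "Priority": priority},
        method="POST",
    )


def send_alert(request: urllib.request.Request) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Hypothetical usage: build an urgent alert for a placeholder topic.
alert = build_alert("example-alerts", "API health check failed", "P0 candidate", "urgent")
```

Building the request separately from sending it keeps the alert contents easy to inspect or log before anything goes over the network.
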
Triage
| Severity | Definition | Response time |
|---|---|---|
| P0 | Production down, data loss risk | Immediate |
| P1 | Major feature broken | Within 2 hours |
| P2 | Minor feature broken | Within 1 business day |
| P3 | Cosmetic/UX issue | Next sprint |
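
The table above can be encoded as data so that tooling can look up a response deadline. This is a sketch under stated assumptions: the names `SeverityPolicy`, `POLICIES`, and `respond_by` are hypothetical, and only P0 and P1 get a clock deadline, because "1 business day" and "next sprint" depend on a calendar this runbook does not define.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    definition: str
    response_time: str                    # wording taken from the triage table
    respond_within: Optional[timedelta]   # set only where the table implies a clock duration


POLICIES = {
    "P0": SeverityPolicy("Production down, data loss risk", "Immediate", timedelta(0)),
    "P1": SeverityPolicy("Major feature broken", "Within 2 hours", timedelta(hours=2)),
    "P2": SeverityPolicy("Minor feature broken", "Within 1 business day", None),
    "P3": SeverityPolicy("Cosmetic/UX issue", "Next sprint", None),
}


def respond_by(severity: str, detected_at: datetime) -> Optional[datetime]:
    """Return the latest response time, or None when no clock deadline applies."""
    policy = POLICIES[severity]
    if policy.respond_within is None:
        return None
    return detected_at + policy.respond_within
```

For example, `respond_by("P1", datetime(2024, 1, 1, 9, 0))` gives 11:00 on the same day, while P2 and P3 return `None` and are left to human scheduling.
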
Communicate
- Create Linear issue with severity label
- Notify rachel@liflode.com for P0/P1
- Add status comment every 30 min for P0
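
The communication rules above could be sketched as two small helpers. The function and constant names are hypothetical, and the sketch only computes who to notify and when the next update is due; it does not call Linear or send email.

```python
from datetime import datetime, timedelta
from typing import Optional

# Severities that require notifying rachel@liflode.com, per the list above.
NOTIFY_SEVERITIES = {"P0", "P1"}
# Status comment cadence for P0 incidents.
P0_UPDATE_INTERVAL = timedelta(minutes=30)


def needs_direct_notification(severity: str) -> bool:
    """True when the incident severity is P0 or P1."""
    return severity in NOTIFY_SEVERITIES


def next_status_update(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next status comment is due, or None if no cadence applies."""
    if severity != "P0":
        return None
    return last_update + P0_UPDATE_INTERVAL
```
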
Mitigate
- Follow the relevant service runbook
- When in doubt: revert and restore from backup
- Backup location: Cloudflare R2 (see restic-restore.md for restore procedure)
Post-Mortem
After resolution:
- Write post-mortem doc in `docs/runbooks/<service>-<date>.md`
- Root cause analysis: what happened, why, what was missed
- Action items: create Linear issues for each prevention measure
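
To keep post-mortem files consistent, the naming pattern and section outline above could be generated by a small helper. This is a sketch: the ISO `YYYY-MM-DD` date format, the template wording, and the function names are assumptions, since the runbook specifies only the `<service>-<date>.md` pattern.

```python
from datetime import date
from pathlib import PurePosixPath

RUNBOOK_DIR = PurePosixPath("docs/runbooks")


def postmortem_path(service: str, incident_date: date) -> PurePosixPath:
    """Build docs/runbooks/<service>-<date>.md, assuming an ISO-formatted date."""
    return RUNBOOK_DIR / f"{service}-{incident_date.isoformat()}.md"


def postmortem_skeleton(service: str, incident_date: date) -> str:
    """Return an empty post-mortem outline covering the sections listed above."""
    return "\n".join([
        f"Post-Mortem: {service} ({incident_date.isoformat()})",
        "",
        "Root cause analysis",
        "- What happened:",
        "- Why:",
        "- What was missed:",
        "",
        "Action items",
        "- (one Linear issue per prevention measure)",
        "",
    ])
```

With a hypothetical service named `api` and an incident on 5 March 2024, `postmortem_path` yields `docs/runbooks/api-2024-03-05.md`.
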