Incident Response Plan
Severity Levels
| Level | Description | Response Time | Examples |
|---|---|---|---|
| Sev0 | Complete outage, all users affected | 15 min | Health check failures, app unreachable |
| Sev1 | Major feature broken, significant user impact | 30 min | API error rate >5%, database connection failures |
| Sev2 | Degraded performance, limited user impact | 2 hours | Slow response times, memory pressure |
| Sev3 | Minor issue, workaround available | Next business day | UI glitch, non-critical background job failure |
Escalation Matrix
| Severity | First Responder | Escalation (if unresolved in 30 min) | Executive Notification |
|---|---|---|---|
| Sev0 | On-call engineer | Engineering lead + DevOps | CTO within 1 hour |
| Sev1 | On-call engineer | Engineering lead | Product manager |
| Sev2 | Any available engineer | Team lead | N/A |
| Sev3 | Assigned engineer | N/A | N/A |
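
The two tables above can also be kept as a small lookup for use in alert annotations or chat tooling. The following is a minimal sketch that only restates the table values; the function names `response_time` and `escalation_contact` are hypothetical and not part of any existing tooling.

```shell
# Hypothetical helpers that mirror the Severity Levels and Escalation Matrix tables.

# Print the response-time target for a severity level.
response_time() {
  case "$1" in
    Sev0) echo "15 min" ;;
    Sev1) echo "30 min" ;;
    Sev2) echo "2 hours" ;;
    Sev3) echo "Next business day" ;;
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# Print who to escalate to if the incident is unresolved after 30 minutes.
escalation_contact() {
  case "$1" in
    Sev0) echo "Engineering lead + DevOps" ;;
    Sev1) echo "Engineering lead" ;;
    Sev2) echo "Team lead" ;;
    Sev3) echo "N/A" ;;
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `response_time Sev1` prints `30 min` and `escalation_contact Sev0` prints `Engineering lead + DevOps`. If the tables change, the lookup needs the same update.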
Incident Workflow
1. Detection
- Azure Monitor alert fires (see Alerting)
- User reports via support channel
- Health check failure detected by deployment pipeline
2. Triage (First 5 Minutes)
- Acknowledge the alert
- Open a GitHub Issue using the incident-report template
- Assign severity level
- Notify the team via primary communication channel
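
The issue-creation step could be scripted with the GitHub CLI so that titles stay consistent under pressure. This is a sketch under several assumptions: `gh` is installed and authenticated, labels named `incident` and `Sev0` to `Sev3` already exist in the repository, and the template is selectable by the name `incident-report`. The function names are hypothetical, and `gh` may still prompt for fields the template leaves open.

```shell
# Sketch only: label names and the template name are assumptions,
# not values confirmed by this plan.

# Print a consistent issue title such as "[Sev1] API error rate above 5%".
incident_title() {
  case "$1" in
    Sev0|Sev1|Sev2|Sev3) printf '[%s] %s\n' "$1" "$2" ;;
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# Open the incident issue with the GitHub CLI.
# Arguments: severity level, one-line summary.
open_incident_issue() {
  title="$(incident_title "$1" "$2")" || return 1
  gh issue create \
    --title "$title" \
    --label "incident" \
    --label "$1" \
    --template "incident-report"
}
```

Run from inside the repository checkout, for example: `open_incident_issue Sev1 "API error rate above 5%"`.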
3. Investigate
- Check the relevant runbook
- Review Application Insights for errors and traces
- Check recent deployments in GitHub Actions
- Review Hangfire dashboard for failed jobs
4. Mitigate
- Apply the fix from the runbook
- If a recent deployment caused the issue, perform rollback:

      az webapp deployment slot swap \
        --name hcss-eventscore-api-prod \
        --resource-group hcss-rg-prod \
        --slot staging \
        --target-slot production

- If a database migration caused the issue, contact the database team
5. Resolve
- Confirm the fix via health check endpoints
- Monitor for 15 minutes to ensure stability
- Update the GitHub Issue with resolution details
- Close the incident
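
The 15-minute stability watch in the Resolve step can be approximated with a polling loop. The sketch below assumes `curl` is available and treats any non-200 response as instability; the function names, the check count, and the interval are illustrative choices, not values prescribed by this plan.

```shell
# Print the HTTP status code for a URL ("000" if the request itself fails).
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1" || true
}

# Poll a URL a fixed number of times, sleeping between checks.
# Arguments: URL, number of checks, seconds between checks.
# Returns non-zero on the first response that is not 200.
monitor_stability() {
  url="$1"
  checks="$2"
  interval="$3"
  i=0
  while [ "$i" -lt "$checks" ]; do
    status="$(http_status "$url")"
    if [ "$status" != "200" ]; then
      echo "check $((i + 1))/$checks failed: HTTP $status" >&2
      return 1
    fi
    i=$((i + 1))
    # Skip the sleep after the final check.
    if [ "$i" -lt "$checks" ]; then
      sleep "$interval"
    fi
  done
  echo "stable: $checks consecutive 200 responses"
}
```

With a placeholder base URL in `BASE_URL`, `monitor_stability "$BASE_URL/health/ready" 30 30` covers roughly 15 minutes. A failed check is a signal to reopen the investigation rather than close the incident.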
6. Post-Incident
- Schedule a blameless postmortem within 48 hours (Sev0/Sev1)
- Use the Postmortem Template
- Create follow-up action items as GitHub Issues
- Update runbooks with any new procedures learned
Health Check Endpoints
| Endpoint | Purpose | Expected Response |
|---|---|---|
| /health/live | App is running | 200 OK |
| /health/ready | App can serve requests | 200 OK |
| /health/detailed | Full diagnostics (DB, memory, jobs) | 200 OK with JSON |
| /api/ping | Basic connectivity | 200 OK |
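
The endpoints in the table can be checked in one pass during triage or after a fix. This is a minimal sketch that assumes `curl` is available; the base URL is passed in by the caller because this plan does not state the production hostname, and the function names are hypothetical.

```shell
# Print the HTTP status code for a URL ("000" if the request itself fails).
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1" || true
}

# Probe every health endpoint from the table and report one line per endpoint.
# Argument: base URL without a trailing slash.
# Returns non-zero if any endpoint does not answer 200.
check_endpoints() {
  base_url="$1"
  failures=0
  for path in /health/live /health/ready /health/detailed /api/ping; do
    status="$(http_status "$base_url$path")"
    if [ "$status" = "200" ]; then
      echo "OK   $path"
    else
      echo "FAIL $path (HTTP $status)"
      failures=$((failures + 1))
    fi
  done
  [ "$failures" -eq 0 ]
}
```

The sketch only compares status codes. It does not inspect the JSON body returned by `/health/detailed`, which would need a separate check once its schema is known.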