Incident Response Plan

Severity Levels

| Level | Description | Response Time | Examples |
|-------|-------------|---------------|----------|
| Sev0 | Complete outage, all users affected | 15 min | Health check failures, app unreachable |
| Sev1 | Major feature broken, significant user impact | 30 min | API error rate >5%, database connection failures |
| Sev2 | Degraded performance, limited user impact | 2 hours | Slow response times, memory pressure |
| Sev3 | Minor issue, workaround available | Next business day | UI glitch, non-critical background job failure |

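| Severity | First Responder | Escalation (if unresolved in 30 min) | Executive Notification |
|----------|-----------------|--------------------------------------|------------------------|
| Sev0 | On-call engineer | Engineering lead + DevOps | CTO within 1 hour |
| Sev1 | On-call engineer | Engineering lead | Product manager |
| Sev2 | Any available engineer | Team lead | N/A |
| Sev3 | Assigned engineer | N/A | N/A |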
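
For paging or chat-bot tooling, the escalation table above can be encoded as a small lookup. The following is a minimal sketch only: the function name `who_to_engage` and its output format are hypothetical, and it treats "unresolved in 30 min" as "30 or more minutes elapsed".

```shell
# Sketch: look up who to engage for a given severity, per the escalation table.
# The function name and plain-text output are illustrative assumptions.

# Usage: who_to_engage <Sev0|Sev1|Sev2|Sev3> <minutes_unresolved>
who_to_engage() {
  local severity="$1" minutes="${2:-0}"
  local responder escalation

  case "$severity" in
    Sev0) responder="On-call engineer";       escalation="Engineering lead + DevOps" ;;
    Sev1) responder="On-call engineer";       escalation="Engineering lead" ;;
    Sev2) responder="Any available engineer"; escalation="Team lead" ;;
    Sev3) responder="Assigned engineer";      escalation="" ;;  # no escalation path (N/A)
    *)    echo "unknown severity: $severity" >&2; return 2 ;;
  esac

  # Escalate only when an escalation path exists and 30 minutes have passed.
  if [ -n "$escalation" ] && [ "$minutes" -ge 30 ]; then
    echo "$escalation"
  else
    echo "$responder"
  fi
}
```

For example, `who_to_engage Sev1 45` prints `Engineering lead`, while `who_to_engage Sev3 45` still prints `Assigned engineer` because Sev3 has no escalation path.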

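If a recent deployment caused the issue, roll back by swapping the staging and production slots. This assumes the release went out via a slot swap, so the previous build is still sitting in the staging slot: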

```shell
az webapp deployment slot swap \
  --name hcss-eventscore-api-prod \
  --resource-group hcss-rg-prod \
  --slot staging \
  --target-slot production
```
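
After the swap it is worth confirming that the app has recovered before closing the incident. The sketch below polls the `/health/ready` endpoint until it returns 200. The function name `wait_until_ready`, the attempt count, and the delay are assumptions, not part of the plan, and the base URL must be supplied by the caller.

```shell
# Sketch: poll the readiness endpoint after a rollback until it returns 200.
# Function name, default attempt count, and default delay are illustrative assumptions.

# Usage: wait_until_ready <base_url> [max_attempts] [delay_seconds]
wait_until_ready() {
  local base_url="$1" max_attempts="${2:-10}" delay="${3:-15}"
  local attempt=1 status

  while [ "$attempt" -le "$max_attempts" ]; do
    # -w '%{http_code}' prints only the status code (000 if no response arrived).
    status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
      "${base_url}/health/ready") || true

    if [ "$status" = "200" ]; then
      echo "ready after ${attempt} attempt(s)"
      return 0
    fi

    echo "attempt ${attempt}/${max_attempts}: got HTTP ${status}" >&2
    # Pause between attempts, but not after the final one.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done

  echo "not ready after ${max_attempts} attempts" >&2
  return 1
}
```

Called as `wait_until_ready "https://<your-app-host>"`, it returns a non-zero status if the app never becomes ready, which can serve as the trigger for escalation.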

| Endpoint | Purpose | Expected Response |
|----------|---------|-------------------|
| `/health/live` | App is running | 200 OK |
| `/health/ready` | App can serve requests | 200 OK |
| `/health/detailed` | Full diagnostics (DB, memory, jobs) | 200 OK with JSON |
| `/api/ping` | Basic connectivity | 200 OK |
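
During triage, all four endpoints can be checked in one pass. This is a minimal sketch: the function name `check_health` and the output format are hypothetical, the base URL is passed in by the caller, and only the status code is checked (the JSON body of `/health/detailed` is not validated).

```shell
# Sketch: request each health endpoint once and report its HTTP status.
# Function name and output format are illustrative assumptions.

# Usage: check_health <base_url>
check_health() {
  local base_url="$1" failures=0 path status

  for path in /health/live /health/ready /health/detailed /api/ping; do
    # Capture only the status code; curl reports 000 when no response arrived.
    status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
      "${base_url}${path}") || true

    if [ "$status" = "200" ]; then
      echo "OK   ${path}"
    else
      echo "FAIL ${path} (HTTP ${status})"
      failures=$((failures + 1))
    fi
  done

  # Exit status is non-zero if any endpoint failed.
  [ "$failures" -eq 0 ]
}
```

A failing `/health/live` points toward a Sev0-style outage, while a passing `/health/live` with a failing `/health/ready` suggests the process is up but a dependency is not.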