Incident Response Plan

Severity Levels

Level | Description | Response Time | Examples
Sev0 | Complete outage, all users affected | 15 min | Health check failures, app unreachable
Sev1 | Major feature broken, significant user impact | 30 min | API error rate >5%, database connection failures
Sev2 | Degraded performance, limited user impact | 2 hours | Slow response times, memory pressure
Sev3 | Minor issue, workaround available | Next business day | UI glitch, non-critical background job failure
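
For scripts that page or remind responders, the response targets above can be encoded as a small lookup. This is a minimal sketch, not part of any existing tooling: the function name `response_target` is hypothetical, and the 2-hour target is expressed as 120 minutes.

```shell
# Hypothetical helper: print the first-response target for a severity level.
# Sev0-Sev2 print a number of minutes; Sev3 has no minute-based target in the
# table, so it prints the text "next business day".
response_target() {
  case "$1" in
    Sev0) echo 15 ;;
    Sev1) echo 30 ;;
    Sev2) echo 120 ;;
    Sev3) echo "next business day" ;;
    *)
      echo "unknown severity: $1" >&2
      return 1
      ;;
  esac
}
```

For example, `response_target Sev1` prints `30`, and an unrecognized level returns a non-zero status so that callers can fail loudly.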

Escalation Matrix

Severity | First Responder | Escalation (if unresolved in 30 min) | Executive Notification
Sev0 | On-call engineer | Engineering lead + DevOps | CTO within 1 hour
Sev1 | On-call engineer | Engineering lead | Product manager
Sev2 | Any available engineer | Team lead | N/A
Sev3 | Assigned engineer | N/A | N/A
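
The escalation rule can be sketched as a lookup plus a time check. The function names below are hypothetical, and treating "unresolved in 30 min" as "open for 30 minutes or more" is an assumption about where the boundary falls.

```shell
# Escalation contact per severity, copied from the matrix.
# "N/A" means the matrix defines no escalation for that level.
escalation_contact() {
  case "$1" in
    Sev0) echo "Engineering lead + DevOps" ;;
    Sev1) echo "Engineering lead" ;;
    Sev2) echo "Team lead" ;;
    Sev3) echo "N/A" ;;
    *) return 1 ;;
  esac
}

# Print who to escalate to once an incident has been open for 30 minutes or
# more (assumed boundary). Prints nothing if escalation is not yet due or
# does not apply to this severity.
#   $1 = severity, $2 = whole minutes the incident has been open
escalate_if_due() {
  severity="$1"
  minutes_open="$2"
  contact="$(escalation_contact "$severity")" || return 1
  if [ "$contact" != "N/A" ] && [ "$minutes_open" -ge 30 ]; then
    echo "$contact"
  fi
}
```

Calling `escalate_if_due Sev0 45` prints `Engineering lead + DevOps`, while `escalate_if_due Sev1 10` prints nothing because the 30-minute window has not elapsed.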

Incident Workflow

1. Detection

  • Azure Monitor alert fires (see Alerting)
  • User reports via support channel
  • Health check failure detected by deployment pipeline

2. Triage (First 5 Minutes)

  1. Acknowledge the alert
  2. Open a GitHub Issue using the incident-report template
  3. Assign severity level
  4. Notify the team via primary communication channel
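
Step 2 can be done from a terminal with the GitHub CLI. The sketch below assumes `gh` is installed, authenticated, and run from a checkout of the repository, and that the issue template's name is `incident-report`; the function names and the title format are hypothetical. Depending on the `gh` version and whether the terminal is interactive, `gh` may prompt for the issue body.

```shell
# Build a consistent issue title, e.g. "[Sev1] API error rate above 5%".
#   $1 = severity, $2 = one-line summary
incident_title() {
  printf '[%s] %s\n' "$1" "$2"
}

# Open the incident issue with the GitHub CLI.
# Assumption: the template is named "incident-report" in this repository.
open_incident() {
  gh issue create \
    --title "$(incident_title "$1" "$2")" \
    --template "incident-report"
}
```

Usage: `open_incident Sev1 "API error rate above 5%"`. Putting the severity in the title keeps it visible in issue lists without relying on labels that may not exist in the repository.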

3. Investigate

  1. Check the relevant runbook for the affected component
  2. Review Application Insights for errors and traces
  3. Check recent deployments in GitHub Actions
  4. Review Hangfire dashboard for failed jobs

4. Mitigate

  • Apply the fix from the runbook
  • If a recent deployment caused the issue, perform rollback:
    # Roll back by swapping the staging slot (assumed to hold the previous build)
    # into production
    az webapp deployment slot swap \
      --name hcss-eventscore-api-prod \
      --resource-group hcss-rg-prod \
      --slot staging \
      --target-slot production
  • If database migration caused the issue, contact the database team

5. Resolve

  1. Confirm the fix via health check endpoints
  2. Monitor for 15 minutes to ensure stability
  3. Update the GitHub Issue with resolution details
  4. Close the incident
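
The 15-minute stability watch in step 2 can be scripted as a repeated check that stops at the first failure. This is a minimal sketch: `monitor_stable` and `probe` are hypothetical names, the host in the usage note is a placeholder, and polling `/health/ready` once a minute is an assumed cadence.

```shell
# Run a check command repeatedly; succeed only if every run succeeds.
#   $1 = number of checks, $2 = seconds between checks, rest = command to run
monitor_stable() {
  checks="$1"
  interval="$2"
  shift 2
  i=0
  while [ "$i" -lt "$checks" ]; do
    "$@" || {
      echo "check $((i + 1)) of $checks failed" >&2
      return 1
    }
    i=$((i + 1))
    # Sleep between checks, but not after the last one.
    [ "$i" -lt "$checks" ] && sleep "$interval"
  done
  return 0
}

# Succeed only if the URL responds without an HTTP error status.
probe() {
  curl --fail --silent --show-error --max-time 10 --output /dev/null "$1"
}
```

Usage: `monitor_stable 16 60 probe "https://<app-host>/health/ready"` runs 16 checks 60 seconds apart, which spans 15 minutes from the first check to the last. A non-zero exit means the fix did not hold and the incident should stay open.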

6. Post-Incident

  1. Schedule a blameless postmortem within 48 hours (Sev0/Sev1)
  2. Use the Postmortem Template
  3. Create follow-up action items as GitHub Issues
  4. Update runbooks with any new procedures learned

Health Check Endpoints

Endpoint | Purpose | Expected Response
/health/live | App is running | 200 OK
/health/ready | App can serve requests | 200 OK
/health/detailed | Full diagnostics (DB, memory, jobs) | 200 OK with JSON
/api/ping | Basic connectivity | 200 OK
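
All four endpoints can be checked in one pass with `curl`. The sketch below is illustrative only: the function names are hypothetical, the base URL is supplied by the caller, and it checks only the status code, not the JSON body of `/health/detailed`.

```shell
# Print the HTTP status code for a URL. curl reports "000" when no response
# is received; "|| true" keeps a failed request from aborting the caller.
http_status() {
  curl --silent --output /dev/null --max-time 10 \
    --write-out '%{http_code}' "$1" || true
}

# Check every documented endpoint under a base URL.
# Prints one line per endpoint and returns non-zero if any is not 200.
check_endpoints() {
  base_url="$1"
  failed=0
  for path in /health/live /health/ready /health/detailed /api/ping; do
    status="$(http_status "$base_url$path")"
    if [ "$status" = "200" ]; then
      echo "OK   $path"
    else
      echo "FAIL $path (status $status)"
      failed=1
    fi
  done
  return "$failed"
}
```

Usage: `check_endpoints "https://<app-host>"`, where the host is a placeholder. Because the exit status reflects the overall result, the function can gate a pipeline step or a resolution checklist.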