
Runbook: API Down / Health Check Failure

Alert Trigger

  • Health Check Failure alert: /health/ready returns non-200 for 5 minutes
  • API Error Rate alert: >5% 5xx responses for 5 minutes

Diagnosis Steps

1. Verify the Outage

# -i prints the status line and headers, so a non-200 response is visible
curl -si https://hcss-eventscore-api-prod.azurewebsites.net/health/live
curl -si https://hcss-eventscore-api-prod.azurewebsites.net/health/ready
curl -si https://hcss-eventscore-api-prod.azurewebsites.net/health/detailed
curl -si https://hcss-eventscore-api-prod.azurewebsites.net/api/ping
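
The four checks above can also be run in one pass that reports only the HTTP status of each endpoint. The following is a minimal sketch, not part of the original runbook: the function name check_endpoints, the BASE_URL override, and the 10-second timeout are assumptions chosen for illustration.

```shell
# Base URL taken from the commands above; override BASE_URL to target another host.
BASE_URL="${BASE_URL:-https://hcss-eventscore-api-prod.azurewebsites.net}"

# Print "<status> <path>" for each endpoint.
# Returns non-zero if any endpoint answers with something other than 200.
check_endpoints() {
  local failed=0 path code
  for path in /health/live /health/ready /health/detailed /api/ping; do
    # -o /dev/null discards the body, -w prints only the status code,
    # --max-time (assumed value) keeps a hung connection from blocking the loop.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "${BASE_URL}${path}")
    echo "${code} ${path}"
    [ "${code}" = "200" ] || failed=1
  done
  return "${failed}"
}
```

Usage: check_endpoints || echo "at least one endpoint is failing". curl reports status 000 when it cannot connect at all, which the function also counts as a failure.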

2. Check Azure App Service Status

az webapp show --name hcss-eventscore-api-prod --resource-group hcss-rg-prod --query state
az webapp log tail --name hcss-eventscore-api-prod --resource-group hcss-rg-prod

3. Check Application Insights

  • Navigate to Application Insights in Azure Portal
  • Review Failures blade for exception spikes
  • Check Performance blade for latency issues
  • Review Live Metrics for real-time state

4. Check Recent Deployments

  • Review GitHub Actions for recent CI/CD runs
  • Check if a deployment happened shortly before the outage
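
If the GitHub CLI (gh) is installed and authenticated, recent runs can be listed from the terminal and their timestamps compared with the alert time. This is a sketch under assumptions: the runbook does not name the repository, so it is passed in as a hypothetical OWNER/REPO argument, and the helper name recent_runs is illustrative.

```shell
# List the most recent GitHub Actions runs for a repository as JSON.
# $1: repository in OWNER/REPO form (hypothetical; not named in this runbook)
# $2: number of runs to show (default 10)
recent_runs() {
  local repo="$1" limit="${2:-10}"
  # createdAt shows when each run started; compare it with the outage start time.
  gh run list --repo "${repo}" --limit "${limit}" \
    --json createdAt,conclusion,workflowName,displayTitle,headSha
}
```

Usage: recent_runs my-org/my-repo 5, where my-org/my-repo is a placeholder. A run whose createdAt falls shortly before the first failed health check is a candidate for rollback.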

Resolution Steps

App Service Not Responding

az webapp restart --name hcss-eventscore-api-prod --resource-group hcss-rg-prod

# Wait 60 seconds then verify
sleep 60
curl -s https://hcss-eventscore-api-prod.azurewebsites.net/health/ready
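
A fixed 60-second wait can be too short or needlessly long depending on how quickly the app starts. As an alternative, the readiness endpoint can be polled until it returns 200. This is a minimal sketch: the function name wait_for_ready and the default timeout and interval values are assumptions, not values from the original runbook.

```shell
# Poll a URL until it returns HTTP 200 or the timeout expires.
# $1: URL to poll
# $2: timeout in seconds (assumed default 120)
# $3: seconds between attempts (assumed default 5)
wait_for_ready() {
  local url="$1" timeout="${2:-120}" interval="${3:-5}" elapsed=0 code
  while [ "${elapsed}" -lt "${timeout}" ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "${url}")
    if [ "${code}" = "200" ]; then
      echo "ready after ~${elapsed}s"
      return 0
    fi
    sleep "${interval}"
    # elapsed counts sleep time only, so it is an approximation.
    elapsed=$((elapsed + interval))
  done
  echo "not ready after ${timeout}s (last status: ${code})" >&2
  return 1
}
```

Usage: wait_for_ready https://hcss-eventscore-api-prod.azurewebsites.net/health/ready 180. A non-zero return suggests the restart did not resolve the problem and the next resolution path should be tried.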

Bad Deployment (Rollback)

# If the release went out via a slot swap, swap again to restore the previous build
az webapp deployment slot swap \
--name hcss-eventscore-api-prod \
--resource-group hcss-rg-prod \
--slot staging \
--target-slot production

App Service Plan Issues

# Check resource usage (CpuPercentage and MemoryPercentage are App Service Plan metrics,
# so the resource is the plan, not the site)
az monitor metrics list \
--resource /subscriptions/{sub}/resourceGroups/hcss-rg-prod/providers/Microsoft.Web/serverfarms/hcss-asp-prod \
--metric CpuPercentage MemoryPercentage \
--interval PT5M

# Scale up if needed
az appservice plan update \
--name hcss-asp-prod \
--resource-group hcss-rg-prod \
--sku S2

Verification

  1. Health endpoints return 200
  2. Application Insights shows errors returning to baseline
  3. Monitor for 15 minutes for stability
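
The 15-minute stability watch in step 3 can be scripted so that a regression is caught without someone refreshing the endpoint by hand. The sketch below is illustrative only: the function name monitor_ready and the 30-second check interval are assumptions, and the 900-second default mirrors the 15 minutes stated above.

```shell
# Check a URL repeatedly for a fixed duration; stop at the first non-200 response.
# $1: URL to check
# $2: total duration in seconds (default 900, i.e. 15 minutes)
# $3: seconds between checks (assumed default 30)
monitor_ready() {
  local url="$1" duration="${2:-900}" interval="${3:-30}" elapsed=0 code
  while [ "${elapsed}" -lt "${duration}" ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "${url}")
    if [ "${code}" != "200" ]; then
      echo "unhealthy at ~${elapsed}s: status ${code}" >&2
      return 1
    fi
    sleep "${interval}"
    elapsed=$((elapsed + interval))
  done
  echo "stable for ${duration}s"
}
```

Usage: monitor_ready https://hcss-eventscore-api-prod.azurewebsites.net/health/ready. This covers only the health endpoint; the Application Insights error rate in step 2 still needs to be confirmed separately.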