Runbook: API Down / Health Check Failure
Alert Trigger
- Health Check Failure alert:
/health/readyreturns non-200 for 5 minutes - API Error Rate alert: >5% 5xx responses for 5 minutes
Diagnosis Steps
1. Verify the Outage
curl -s https://hcss-eventscore-api-prod.azurewebsites.net/health/live
curl -s https://hcss-eventscore-api-prod.azurewebsites.net/health/ready
curl -s https://hcss-eventscore-api-prod.azurewebsites.net/health/detailed
curl -s https://hcss-eventscore-api-prod.azurewebsites.net/api/ping
2. Check Azure App Service Status
az webapp show --name hcss-eventscore-api-prod --resource-group hcss-rg-prod --query state
az webapp log tail --name hcss-eventscore-api-prod --resource-group hcss-rg-prod
3. Check Application Insights
- Navigate to Application Insights in Azure Portal
- Review Failures blade for exception spikes
- Check Performance blade for latency issues
- Review Live Metrics for real-time state
4. Check Recent Deployments
- Review GitHub Actions for recent CI/CD runs
- Check if a deployment happened shortly before the outage
Resolution Steps
App Service Not Responding
az webapp restart --name hcss-eventscore-api-prod --resource-group hcss-rg-prod
# Wait 60 seconds then verify
sleep 60
curl -s https://hcss-eventscore-api-prod.azurewebsites.net/health/ready
Bad Deployment (Rollback)
# If using deployment slots, swap back
az webapp deployment slot swap \
--name hcss-eventscore-api-prod \
--resource-group hcss-rg-prod \
--slot staging \
--target-slot production
App Service Plan Issues
# Check resource usage
az monitor metrics list \
--resource /subscriptions/{sub}/resourceGroups/hcss-rg-prod/providers/Microsoft.Web/sites/hcss-eventscore-api-prod \
--metric "CpuPercentage,MemoryPercentage" \
--interval PT5M
# Scale up if needed
az appservice plan update \
--name hcss-asp-prod \
--resource-group hcss-rg-prod \
--sku S2
Verification
- Health endpoints return 200
- Application Insights shows errors returning to baseline
- Monitor for 15 minutes for stability