Alerting Configuration
Overview
Azure Monitor alerts are configured via infra/alerts/setup-alerts.sh. Alerts fire to the hcss-alerts-<env> action group which notifies the ops team via email.
Alert Rules
| # | Alert | Condition | Window | Severity | Runbook |
|---|---|---|---|---|---|
| 1 | API Error Rate | >5% 5xx responses | 5 min | Sev1 | API Down |
| 2 | Response Time | avg >3s | 10 min | Sev2 | Check Application Insights Performance blade |
| 3 | Health Check Failure | /health/ready non-200 | 5 min | Sev0 | API Down |
| 4 | Memory Pressure | Available memory <500MB | 10 min | Sev2 | High Memory |
| 5 | Database Errors | Any failed SQL dependency | 5 min | Sev1 | Database Issues |
Alert Response Actions
Sev0 - Health Check Failure
- Immediately check if the app is reachable:
curl https://hcss-eventscore-api-prod.azurewebsites.net/api/ping - If unreachable, restart:
az webapp restart --name hcss-eventscore-api-prod --resource-group hcss-rg-prod - If restart doesn't help, check for bad deployment and rollback
- Open Sev0 incident per Incident Response
Sev1 - API Error Rate / Database Errors
- Check Application Insights Failures blade for the specific error
- For database errors, verify SQL Server is accessible and not throttled
- Check Hangfire dashboard for failed jobs
- Open Sev1 incident if not self-resolving within 15 minutes
Sev2 - Response Time / Memory Pressure
- Check Application Insights Performance blade for slow endpoints
- For memory, check if the issue is gradual (leak) or sudden (spike)
- Consider scaling up/out if persistent
- Monitor for 30 minutes before escalating
Setup / Update
To create or update alert rules:
az login
./infra/alerts/setup-alerts.sh prod
Modifying Thresholds
Edit the thresholds in infra/alerts/setup-alerts.sh and re-run the script. Key values:
- Error rate percentage:
5(in the KQL query) - Response time (ms):
3000 - Memory threshold (bytes):
1468006400(~1.4GB, leaving 500MB free on B1) - Health check availability:
100
Action Group Configuration
The action group hcss-alerts-<env> is configured with:
- Email notification to ops team
To add additional notification channels (SMS, webhook, PagerDuty), update the action group in Azure Portal or modify the setup-alerts.sh script.