Runbook: High Memory Usage
Alert Trigger
- Memory Pressure alert: Available memory <500MB for 10 minutes
- Maps to the MemoryHealthCheck thresholds configured in the application
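The trigger condition can be sketched as a small shell check. This is an illustration only: the function name memory_pressure_sustained and the one-sample-per-minute input are assumptions, not the application's actual MemoryHealthCheck logic.

```shell
# Hypothetical helper: succeeds (exit 0) only when at least 10 samples are
# supplied and every one of them is below the threshold.
# Arguments: threshold in MB, then one available-memory sample (MB) per minute.
memory_pressure_sustained() {
  threshold_mb=$1
  shift
  # Require a full 10-minute window of samples before reporting pressure.
  [ "$#" -ge 10 ] || return 1
  for sample_mb in "$@"; do
    [ "$sample_mb" -lt "$threshold_mb" ] || return 1
  done
  return 0
}
```

A single sample at or above 500 MB inside the window resets the condition, which is why a brief dip does not fire the alert.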
Diagnosis Steps
1. Check Current Memory
curl -s https://hcss-eventscore-api-prod.azurewebsites.net/health/detailed | jq .
az monitor metrics list \
--resource /subscriptions/SUB_ID/resourceGroups/hcss-rg-prod/providers/Microsoft.Web/sites/hcss-eventscore-api-prod \
--metric MemoryWorkingSet AverageMemoryWorkingSet \
--interval PT5M
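The working-set metrics are reported in bytes, while the alert threshold is stated in MB, so a conversion helps when reading the output. A minimal sketch, assuming a hypothetical helper named bytes_to_mib; note that working set measures memory in use, not available memory, so compare it against the plan's total RAM rather than directly against the 500 MB threshold.

```shell
# Hypothetical helper: convert a byte count (the unit used by the
# MemoryWorkingSet metric) to whole MiB, rounding down.
bytes_to_mib() {
  echo $(( $1 / 1024 / 1024 ))
}
```

For example, `bytes_to_mib 1073741824` prints 1024.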
2. Check Application Insights
- Review Performance blade for memory trends
- Look for memory leak patterns (steadily increasing over time)
- Check if a specific endpoint is consuming excessive memory
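The "steadily increasing" pattern can be checked mechanically against a series of working-set samples. The sketch below is deliberately strict and uses a hypothetical function name; real traces are noisier because garbage collection produces a sawtooth, so treat a positive result as a prompt to investigate rather than proof of a leak.

```shell
# Hypothetical helper: succeeds when each sample is strictly greater than
# the one before it.
# Arguments: working-set samples in MB, oldest first.
is_steadily_increasing() {
  # Fewer than two samples cannot show a trend.
  [ "$#" -ge 2 ] || return 1
  previous=$1
  shift
  for current in "$@"; do
    [ "$current" -gt "$previous" ] || return 1
    previous=$current
  done
  return 0
}
```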
3. Check Cache Size
The application uses IMemoryCache with a limit of 1024 entries.
- High memory could indicate cache entries are storing large objects
- Review if cache compaction (25% threshold) is working
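As a rough sanity check on the compaction figure, the arithmetic below estimates how many entries should remain after one compaction pass. It assumes the 25% value means a quarter of the limit is evicted when the cache fills; the function name is hypothetical and the actual eviction behavior depends on how the cache is configured in the application.

```shell
# Hypothetical helper: entries expected to remain after one compaction pass.
# Arguments: cache entry limit, compaction percentage (whole number).
entries_after_compaction() {
  echo $(( $1 * (100 - $2) / 100 ))
}
```

With the figures from this runbook, `entries_after_compaction 1024 25` prints 768. An entry count that stays pinned at the limit suggests compaction is not running as expected.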
4. Check Hangfire Jobs
- Access Hangfire dashboard
- Check for stuck or long-running jobs that may hold references
Resolution Steps
Immediate Relief: Restart
az webapp restart --name hcss-eventscore-api-prod --resource-group hcss-rg-prod
sleep 60
curl -s https://hcss-eventscore-api-prod.azurewebsites.net/health/detailed | jq .entries.memory
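A fixed 60-second sleep may be too short or longer than necessary. One alternative is to poll until the health endpoint responds. The sketch below uses a hypothetical wait_until helper and assumes the endpoint returns a non-2xx status while the app is unhealthy or still starting.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Arguments: max attempts, delay in seconds between attempts, then the command.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    attempt=$((attempt + 1))
    # Pause only if another attempt remains.
    if [ "$attempt" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example (not executed here): poll up to 12 times, 10 seconds apart.
# wait_until 12 10 curl -sf -o /dev/null https://hcss-eventscore-api-prod.azurewebsites.net/health/detailed
```

The `-f` flag makes curl exit non-zero on HTTP error responses, which is what lets the loop distinguish a healthy reply from an error page.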
Scale Up the App Service Plan
# Current: B1 (1.75 GB RAM) or S1 (1.75 GB RAM)
# Scale to S2 (3.5 GB RAM) or S3 (7 GB RAM)
az appservice plan update \
--name hcss-asp-prod \
--resource-group hcss-rg-prod \
--sku S2
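To pick a target tier, the RAM figures in the comments above can be turned into a simple lookup. This is a sketch with a hypothetical function name; it covers only the Standard tiers listed here and ignores the memory reserved by the platform, so leave headroom when choosing.

```shell
# Hypothetical helper: print the smallest Standard SKU listed in this runbook
# whose total RAM covers the requested amount; fail if none is large enough.
# Argument: required memory in MB.
sku_for_memory_mb() {
  needed_mb=$1
  if [ "$needed_mb" -le 1792 ]; then
    echo S1   # 1.75 GB
  elif [ "$needed_mb" -le 3584 ]; then
    echo S2   # 3.5 GB
  elif [ "$needed_mb" -le 7168 ]; then
    echo S3   # 7 GB
  else
    return 1
  fi
}
```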
Scale Out (Add Instances)
az appservice plan update \
--name hcss-asp-prod \
--resource-group hcss-rg-prod \
--number-of-workers 2
Investigate Memory Leak
If memory increases steadily after restart:
- Enable Application Insights Profiler for memory snapshots
- Check for large file uploads held in memory (should use streaming)
- Review recent code changes for:
  - Unbounded collections
  - Missing IDisposable implementations
  - Large response objects not being GC'd
  - SignalR connections accumulating
Verification
- /health/detailed reports memory healthy
- Memory working set is below threshold
- Application is responsive (check response times)
- Monitor for 30 minutes to ensure memory is stable (not climbing)