
Monitoring & Alerting Setup Prompt (ChatGPT)

Alert fatigue is one of the biggest causes of missed incidents. This prompt designs alerts from the SLO backwards — starting with what the user experiences and working inward — rather than alerting on every metric. The runbook requirement ensures that whoever responds to an alert at 3am has immediate guidance on how to diagnose it. This variant is formatted for ChatGPT and optimised for GPT-4o and GPT-4 Turbo: it uses markdown formatting and system/user message separation.

Prompt Template
## System
You are an expert AI assistant. Respond using clear markdown formatting.

## User
You are a senior site reliability engineer specialising in observability.

Design a monitoring and alerting strategy for the following system:

System description: {{system_description}}
Tech stack: {{tech_stack}}
Monitoring tools: {{monitoring_tools}}
SLA requirements: {{sla}}
On-call setup: {{on_call}}

Provide:
1. **Key metrics to track** — list the top 10 metrics with thresholds for warning and critical alerts
2. **SLO definitions** — availability, latency, and error rate SLOs based on the stated SLA
3. **Alert rules** — for each critical metric: condition, severity, and action
4. **Dashboard panels** — describe 6-8 key panels for an operations dashboard
5. **Runbook outline** — for the top 3 most likely alerts, a short runbook with diagnosis steps
6. **Alert noise reduction** — strategies to avoid alert fatigue (grouping, deduplication, inhibition rules)

Variables

{{system_description}} — What the system does, e.g., "payment processing API handling 1000 req/sec"
{{tech_stack}} — Technologies in use, e.g., "Node.js, PostgreSQL, Redis, Kubernetes"
{{monitoring_tools}} — Monitoring stack, e.g., "Prometheus + Grafana + PagerDuty", "Datadog", "CloudWatch"
{{sla}} — Uptime and performance SLA, e.g., "99.9% uptime, p99 latency < 500ms"
{{on_call}} — On-call structure, e.g., "24/7 on-call rotation, max 5 pages/week target"

Example

Input
system_description: REST API for e-commerce checkout, processes credit card payments
tech_stack: Node.js, PostgreSQL, Stripe API, Redis, deployed on Kubernetes
monitoring_tools: Prometheus, Grafana, PagerDuty
sla: 99.9% uptime, p99 latency < 1 second
on_call: 24/7 rotation, target < 3 critical pages per week
Output
**Key Metrics:**
1. HTTP error rate (5xx): warn >1%, critical >5%
2. p99 response time: warn >800ms, critical >1500ms
3. Stripe API error rate: warn >2%, critical >10%
4. Database connection pool utilisation: warn >70%, critical >90%
5. Pod restart rate: warn >2/hour, critical >10/hour

**SLO Definitions:**
- Availability: 99.9% of requests return non-5xx within 30 days
- Latency: 99% of checkout requests complete within 1 second (matching the p99 SLA)
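An availability SLO like the one above is commonly monitored via its error-budget burn rate: how fast the allowed failure budget is being consumed. A minimal sketch (the function name and sample numbers are illustrative, not part of the prompt output):

```python
def burn_rate(errors: int, total: int, slo: float = 0.999) -> float:
    """Burn rate of the error budget: 1.0 means failing at exactly the budgeted rate."""
    if total == 0:
        return 0.0
    error_rate = errors / total
    budget = 1 - slo  # for 99.9%, 0.1% of requests may fail
    return error_rate / budget

# 30 errors out of 10,000 requests: consuming the budget at 3x the sustainable rate
rate = burn_rate(errors=30, total=10_000)
print(round(rate, 2))  # → 3.0
```

A common practice is to page only when the burn rate is high over both a short and a long window, which filters out brief blips while still catching sustained budget loss.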

FAQ

What is the difference between SLA, SLO, and SLI?
SLI (Service Level Indicator) is a measurement (e.g., request success rate). SLO (Service Level Objective) is your internal target (e.g., 99.9% success). SLA (Service Level Agreement) is the contractual commitment with consequences for breach. Set SLOs tighter than your SLAs to give yourself a buffer.
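That buffer can be made concrete as an error budget: the amount of downtime an SLO permits over its window. A minimal sketch, assuming a 30-day window (the function is illustrative):

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the given window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)

# 99.9% over 30 days allows roughly 43.2 minutes of downtime
print(round(error_budget_minutes(99.9), 1))  # → 43.2
```

Setting the internal SLO at, say, 99.95% against a 99.9% SLA roughly halves the budget you spend before the contractual commitment is at risk.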
How do I avoid alert storms during an outage?
Use alert inhibition rules: when a root-cause alert fires (e.g., database down), suppress downstream symptom alerts (e.g., all API errors). Configure alert grouping to send one notification per incident rather than one per alerting rule.
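The inhibition-then-grouping pipeline can be sketched in a few lines. This is a toy model of the logic, not a real Alertmanager configuration; the alert names and the inhibition map are hypothetical:

```python
from collections import defaultdict

# Hypothetical firing alerts, each tagged with the service it concerns.
alerts = [
    {"name": "DatabaseDown", "service": "postgres", "severity": "critical"},
    {"name": "HighErrorRate", "service": "api", "severity": "critical"},
    {"name": "HighLatency", "service": "api", "severity": "warning"},
]

# Inhibition map: while the root-cause alert fires, suppress its symptom alerts.
INHIBITS = {"DatabaseDown": {"HighErrorRate", "HighLatency"}}

def apply_inhibition(alerts):
    firing = {a["name"] for a in alerts}
    suppressed = set()
    for root, symptoms in INHIBITS.items():
        if root in firing:
            suppressed |= symptoms
    return [a for a in alerts if a["name"] not in suppressed]

def group_alerts(alerts):
    """Group surviving alerts by service: one notification per group, not per rule."""
    groups = defaultdict(list)
    for a in alerts:
        groups[a["service"]].append(a["name"])
    return dict(groups)

active = apply_inhibition(alerts)
print(group_alerts(active))  # → {'postgres': ['DatabaseDown']}
```

The two API alerts never page anyone because the database alert explains them; the on-call engineer gets one notification pointing at the root cause.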
What metrics should I collect from day one?
Start with the four golden signals: latency, traffic (requests/sec), error rate, and saturation (CPU/memory/disk). Everything else can be added when you identify a specific blind spot during an incident.
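The four golden signals can all be derived from a single window of request records plus a host utilisation sample. A minimal sketch, assuming an in-memory sample window (the data structures are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    status: int

# Hypothetical one-minute sample window of requests.
window = [Request(120, 200), Request(950, 200), Request(80, 500), Request(200, 200)]

def golden_signals(window, window_seconds=60, cpu_utilisation=0.72):
    latencies = sorted(r.latency_ms for r in window)
    p99_index = min(len(latencies) - 1, int(len(latencies) * 0.99))
    return {
        "latency_p99_ms": latencies[p99_index],          # latency
        "traffic_rps": len(window) / window_seconds,     # traffic
        "error_rate": sum(r.status >= 500 for r in window) / len(window),  # errors
        "saturation": cpu_utilisation,                   # saturation, from host metrics
    }

print(golden_signals(window))
```

In practice these come from a metrics backend (e.g., Prometheus histograms and counters) rather than raw request records, but the four quantities are the same.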
