Lesson 3 of 5
Setting Up Alert Rules
Estimated time: 8 minutes
Raw alerts are noisy. A CPU spike at 80% during a deploy is normal. The same spike at 3 AM on a quiet night is a problem. In this lesson, you'll create intelligent alert rules that reduce noise, add context, and tell you why something is happening — not just what.
The Alert Noise Problem
Most monitoring setups suffer from alert fatigue:
- Too many alerts: Every minor threshold breach fires a notification
- No context: "CPU > 80%" tells you nothing about cause
- No correlation: A database alert and an API latency alert fire separately, even though they're the same incident
- Wrong channel: Critical alerts go to the same place as informational ones
The goal: fewer alerts, each one actionable.
Define severity tiers
Set up tiered routing so critical alerts wake you up and informational ones wait for morning:
openclaw agent edit "DevOps Monitor" \
--append-message "
SEVERITY ROUTING:
🔴 CRITICAL (immediate action required):
- Service down / health check failing
- Error rate > 5% for 5+ minutes
- Disk usage > 95%
- Database connections exhausted
→ Send to #ops-critical + page on-call engineer
🟡 WARNING (investigate soon):
- CPU > 85% for 10+ minutes
- Memory > 80%
- Error rate 1-5%
- Response time p99 > 2s
→ Send to #ops-alerts
🟢 INFO (review in morning):
- Deployment completed
- Auto-scaling event
- SSL cert expiring in 30+ days
- Scheduled maintenance completed
→ Send to #ops-info (muted channel)
When in doubt, default to WARNING. Never suppress an alert entirely
— downgrade it to INFO instead."
Configure the channel routing:
openclaw alert-route add \
--name "Critical" \
--severity critical \
--channel slack --to "#ops-critical" \
--also-page "on-call"
openclaw alert-route add \
--name "Warning" \
--severity warning \
--channel slack --to "#ops-alerts"
openclaw alert-route add \
--name "Info" \
--severity info \
--channel slack --to "#ops-info"
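The routing commands above boil down to a severity-to-destination map, where only CRITICAL also pages someone. A minimal sketch of that logic in Python (the `route` function and `ROUTES` table are illustrative, not OpenClaw internals; channel names come from the commands above):

```python
# Tiered routing: each severity maps to a channel; only CRITICAL pages on-call.
ROUTES = {
    "critical": {"channel": "#ops-critical", "page": "on-call"},
    "warning":  {"channel": "#ops-alerts",   "page": None},
    "info":     {"channel": "#ops-info",     "page": None},
}

def route(severity: str) -> dict:
    # Unknown severities fall back to WARNING, matching the
    # "when in doubt, default to WARNING" rule above.
    return ROUTES.get(severity, ROUTES["warning"])

print(route("critical"))  # pages on-call
print(route("mystery"))   # falls back to the WARNING route
```

Note the deliberate asymmetry: an unrecognized severity is routed louder (WARNING) rather than quieter, so a misconfigured alert is seen rather than lost.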
Add AI enrichment to alerts
This is the key difference from plain alert forwarding. Instead of relaying raw alerts, OpenClaw analyzes them first:
openclaw agent edit "DevOps Monitor" \
--append-message "
ENRICHMENT RULES:
When you receive an alert, before sending to chat:
1. CHECK RECENT EVENTS (last 2 hours):
- Any deployments? (from GitHub/Vercel connection)
- Any config changes?
- Any other alerts that fired around the same time?
2. CHECK HISTORICAL PATTERNS:
- Has this alert fired before? When? What fixed it?
- Is this a known pattern (e.g., daily backup causes CPU spike)?
3. ANALYZE PROBABLE CAUSE:
- If deploy happened < 30 min ago: likely deploy-related
- If multiple services affected: likely infrastructure (network, DB, DNS)
- If single service affected: likely application-level
4. FORMAT THE ALERT:
🚨 [SEVERITY] Service: [name]
📊 Metric: [what triggered] → [current value]
🕐 Started: [time] (duration: [X min])
🔄 Recent changes: [deploy/config/none]
🤔 Probable cause: [AI analysis]
📋 Suggested action: [specific next step]
📎 Similar incidents: [link to past incidents if any]"
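Step 3 of the enrichment rules is a small decision tree: recent deploy beats everything, then multi-service vs. single-service. A hedged sketch of that heuristic (function name and dict fields are hypothetical, chosen for illustration):

```python
from datetime import datetime, timedelta

def probable_cause(alert, recent_deploys, related_alerts):
    """Apply the cause heuristics from the enrichment rules:
    deploy < 30 min ago -> deploy-related; multiple services
    affected -> infrastructure; otherwise application-level."""
    window = timedelta(minutes=30)
    if any(alert["time"] - d["time"] < window for d in recent_deploys):
        return "likely deploy-related"
    if len({a["service"] for a in related_alerts}) > 1:
        return "likely infrastructure (network, DB, DNS)"
    return "likely application-level"

alert = {"time": datetime(2025, 1, 10, 14, 32), "service": "api"}
deploys = [{"time": datetime(2025, 1, 10, 14, 25), "id": "#487"}]
print(probable_cause(alert, deploys, [alert]))  # likely deploy-related
```

The deploy check comes first on purpose: a deploy seven minutes before an alert is a stronger signal than any topology-based guess.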
Test with a simulated alert:
openclaw connect simulate Grafana \
--alert "High error rate" \
--metric "http_error_rate" \
--value "4.2%" \
--host "api-prod-01" \
--duration "8 minutes"
Expected enriched alert:
🟡 WARNING — api-prod-01
📊 Error rate: 4.2% (threshold: 1%)
🕐 Started: 14:32 UTC (8 minutes ago)
🔄 Recent changes:
• Deploy #487 at 14:25 UTC (7 min before alert)
"feat: add user search endpoint" by @sarah
🤔 Probable cause:
New deployment likely introduced a regression.
Error rate was 0.1% before deploy, spiked to 4.2% after.
Most errors are 500s on /api/users/search endpoint.
📋 Suggested actions:
1. Check logs: /logs api-prod-01 --since 14:25
2. Roll back if confirmed: /rollback #487
3. Or investigate: /ssh api-prod-01 "tail -f /var/log/app.log"
📎 Similar: Incident #34 (Jan 2) — same endpoint, fixed by
adding index on users.email
Set up alert deduplication
Prevent the same issue from flooding your channel:
openclaw alert-rule add \
--name "Deduplication" \
--group-by "host,metric" \
--window "15m" \
--action "If the same host+metric fires multiple times within 15 minutes,
send only the first alert. Append 'Still ongoing (X occurrences)' to the
original message every 15 minutes until resolved."
Before dedup:
14:32 🟡 CPU > 85% on web-prod-01
14:37 🟡 CPU > 85% on web-prod-01
14:42 🟡 CPU > 85% on web-prod-01
14:47 🟡 CPU > 85% on web-prod-01
After dedup:
14:32 🟡 CPU > 85% on web-prod-01
14:47 🟡 ↳ Still ongoing (4 occurrences, 15 min)
15:02 ✅ Resolved: CPU normalized at 72% on web-prod-01
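The dedup rule keys on (host, metric) and a 15-minute window. A sketch of what that bookkeeping looks like, assuming a simple first-seen-plus-counter model (the `Deduper` class is illustrative, not how OpenClaw stores state):

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=15)

class Deduper:
    """Suppress repeats of the same (host, metric) inside the window,
    counting occurrences for the 'Still ongoing' follow-up."""
    def __init__(self):
        self.seen = {}  # (host, metric) -> {"first": time, "count": n}

    def should_send(self, host, metric, now):
        key = (host, metric)
        entry = self.seen.get(key)
        if entry and now - entry["first"] < WINDOW:
            entry["count"] += 1
            return False  # suppressed; counted toward the summary line
        self.seen[key] = {"first": now, "count": 1}
        return True  # first alert, or window expired

d = Deduper()
t0 = datetime(2025, 1, 10, 14, 32)
sends = [d.should_send("web-prod-01", "cpu", t0 + timedelta(minutes=m))
         for m in (0, 5, 10, 15)]
print(sends)  # [True, False, False, True]
```

The fourth alert at the 15-minute mark falls outside the window, which is the moment the real rule posts its "Still ongoing" update instead of a fresh alert.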
Create smart correlation rules
Group related alerts into a single incident:
openclaw alert-rule add \
--name "Incident Correlation" \
--rule "If 3+ alerts fire within 5 minutes across different services,
group them as a single incident. Create an incident thread in #ops-critical
with:
1. All related alerts
2. Common factors (same host? same time? same recent deploy?)
3. Likely root cause based on correlation
4. Single action plan instead of per-alert suggestions"
Example correlated incident:
🔴 INCIDENT #42 — Multiple services affected
Related alerts (all started within 3 minutes):
• API error rate: 8.5% on api-prod-01
• Database connections: 95% utilized on db-prod-01
• Queue depth: 15,000 (normally <100) on worker-prod-01
🔄 Common factor: Deploy #487 at 14:25 UTC
🤔 Root cause: New search endpoint running unoptimized
queries, exhausting DB connection pool. API errors
and queue backlog are downstream effects.
📋 Action plan:
1. Roll back deploy #487: /rollback #487
2. Monitor DB connections: /metrics db-prod-01 connections
3. Once stable, investigate the query in staging
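The correlation test itself is cheap: count the alerts, check the time spread, check that more than one service is involved. A minimal sketch under those assumptions (the `correlate` function is hypothetical):

```python
from datetime import datetime, timedelta

def correlate(alerts, window=timedelta(minutes=5), min_alerts=3):
    """Group alerts as one incident if min_alerts+ fire within the
    window across more than one service; otherwise leave them separate."""
    if len(alerts) < min_alerts:
        return None
    alerts = sorted(alerts, key=lambda a: a["time"])
    spread = alerts[-1]["time"] - alerts[0]["time"]
    services = {a["service"] for a in alerts}
    if spread <= window and len(services) > 1:
        return {"services": sorted(services), "count": len(alerts)}
    return None  # not enough correlation; handle per-alert

t = datetime(2025, 1, 10, 14, 30)
burst = [
    {"service": "api",    "time": t},
    {"service": "db",     "time": t + timedelta(minutes=1)},
    {"service": "worker", "time": t + timedelta(minutes=3)},
]
print(correlate(burst))
```

This is exactly the shape of incident #42 above: three services inside a three-minute spread collapse into one grouped result.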
Add time-aware rules
Some alerts mean different things at different times:
openclaw alert-rule add \
--name "Time Context" \
--rule "Apply these time-based overrides:
BUSINESS HOURS (9am-6pm weekdays):
- CPU > 85%: WARNING (normal load)
- Deployment events: INFO
OFF-HOURS (nights, weekends):
- CPU > 85%: CRITICAL (unexpected load)
- Any deployment: WARNING (who's deploying at 2 AM?)
KNOWN MAINTENANCE WINDOWS:
- Suppress all alerts during scheduled maintenance
- Auto-resolve alerts that clear within 10 min of maintenance end
DAILY PATTERNS:
- Ignore backup-related CPU spikes (daily 2-3 AM)
- Ignore auto-scaling events during known traffic peaks"
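The business-hours override for the CPU rule is easy to express as a predicate on the current time. A sketch, assuming 9am-6pm Monday-Friday counts as business hours (the `cpu_severity` function is illustrative):

```python
from datetime import datetime

def cpu_severity(now: datetime) -> str:
    """Time-based override for a CPU > 85% alert: normal load
    during business hours, unexpected (and critical) off-hours."""
    business_hours = now.weekday() < 5 and 9 <= now.hour < 18
    return "warning" if business_hours else "critical"

print(cpu_severity(datetime(2025, 1, 8, 14, 0)))  # Wednesday 2 PM -> warning
print(cpu_severity(datetime(2025, 1, 8, 2, 0)))   # Wednesday 2 AM -> critical
```

One caveat the sketch glosses over: compare in the timezone your team actually works in, not UTC, or the "business hours" window will drift with the seasons.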
Alert Rule Cheat Sheet
| Rule | Effect |
|---|---|
| Severity routing | Critical → page, Warning → alert channel, Info → quiet channel |
| AI enrichment | Raw metric → context + cause + action |
| Deduplication | Same alert fires 10x → you see 1 message + updates |
| Correlation | 5 related alerts → 1 incident thread |
| Time awareness | CPU spike at 2 PM = normal, at 2 AM = critical |
Create alert templates
Create templates for your most common alert types:
openclaw alert-template add \
--name "High Error Rate" \
--format "
🚨 [{severity}] Error rate spike on {host}
Rate: {value} (threshold: {threshold})
Top errors:
{top_5_error_messages}
Recent deploys: {recent_deploys}
Suggested: {ai_suggestion}"
Templates ensure consistent formatting and make alerts scannable at a glance.
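The `{placeholder}` syntax in the template maps naturally onto Python-style format strings, which is a reasonable mental model for how the fields get filled in (the exact substitution mechanism OpenClaw uses is not documented here; this is a simplified two-line version of the template above):

```python
# Fill a simplified "High Error Rate" template with alert fields.
TEMPLATE = (
    "[{severity}] Error rate spike on {host}\n"
    "Rate: {value} (threshold: {threshold})"
)

msg = TEMPLATE.format(severity="WARNING", host="api-prod-01",
                      value="4.2%", threshold="1%")
print(msg)
```

Because every high-error-rate alert renders from the same skeleton, your eye learns where the rate, host, and suggestion live, which is what makes templated alerts scannable.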
Suppress alerts during deployments
Temporarily suppress non-critical alerts during deployments:
openclaw alert-rule add \
--name "Deploy Quiet" \
--rule "When a deployment starts, suppress WARNING-level alerts
for 10 minutes. If any alert persists after 10 minutes, escalate
to CRITICAL (the deploy may have caused a problem)."
This prevents false positives from deployment-related metric fluctuations while still catching real issues.
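The quiet-window rule has three outcomes: non-warnings pass through untouched, warnings inside the 10-minute window are held, and a warning that persists past the window escalates. A sketch of that decision (function name and return values are illustrative):

```python
from datetime import datetime, timedelta

QUIET = timedelta(minutes=10)

def handle_during_deploy(severity, alert_time, deploy_start):
    """Deploy quiet window: WARNINGs inside the 10-minute window are
    suppressed; one that persists past it escalates to CRITICAL."""
    if severity != "warning":
        return severity  # CRITICAL and INFO are never affected
    if alert_time - deploy_start < QUIET:
        return "suppressed"
    return "critical"  # persisted past the window: the deploy may be at fault

deploy = datetime(2025, 1, 10, 14, 25)
print(handle_during_deploy("warning", deploy + timedelta(minutes=5), deploy))
print(handle_during_deploy("warning", deploy + timedelta(minutes=12), deploy))
```

Escalating rather than merely un-suppressing is the important design choice: a metric still bad 10 minutes after a deploy is stronger evidence of a regression than the same metric on a quiet afternoon.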
What You Have Now
Your alert system is no longer a fire hose. Alerts are:
- Tiered — critical alerts page you, info waits for morning
- Enriched — each alert includes probable cause and suggested action
- Deduplicated — no more notification spam from flapping metrics
- Correlated — related alerts grouped into single incidents
- Time-aware — context-sensitive severity based on time of day
The next lesson turns these alerts into action with one-command remediation.
Why should you downgrade noisy alerts to INFO instead of suppressing them entirely?