DevOps Monitoring

Lesson 3 of 5

Setting Up Alert Rules

Estimated time: 8 minutes

Raw alerts are noisy. A CPU spike at 80% during a deploy is normal. The same spike at 3 AM on a quiet night is a problem. In this lesson, you'll create intelligent alert rules that reduce noise, add context, and tell you why something is happening — not just what.

The Alert Noise Problem

Most monitoring setups suffer from alert fatigue:

  • Too many alerts: Every minor threshold breach fires a notification
  • No context: "CPU > 80%" tells you nothing about cause
  • No correlation: A database alert and an API latency alert fire separately, even though they're the same incident
  • Wrong channel: Critical alerts go to the same place as informational ones

The goal: fewer alerts, each one actionable.

Define severity tiers

Set up tiered routing so critical alerts wake you up and informational ones wait for morning:

openclaw agent edit "DevOps Monitor" \
  --append-message "

SEVERITY ROUTING:
🔴 CRITICAL (immediate action required):
  - Service down / health check failing
  - Error rate > 5% for 5+ minutes
  - Disk usage > 95%
  - Database connections exhausted
  → Send to #ops-critical + page on-call engineer

🟡 WARNING (investigate soon):
  - CPU > 85% for 10+ minutes
  - Memory > 80%
  - Error rate 1-5%
  - Response time p99 > 2s
  → Send to #ops-alerts

🟢 INFO (review in morning):
  - Deployment completed
  - Auto-scaling event
  - SSL cert expiring in 30+ days
  - Scheduled maintenance completed
  → Send to #ops-info (muted channel)

When in doubt, default to WARNING. Never suppress an alert entirely
— downgrade it to INFO instead."

Configure the channel routing:

openclaw alert-route add \
  --name "Critical" \
  --severity critical \
  --channel slack --to "#ops-critical" \
  --also-page "on-call"

openclaw alert-route add \
  --name "Warning" \
  --severity warning \
  --channel slack --to "#ops-alerts"

openclaw alert-route add \
  --name "Info" \
  --severity info \
  --channel slack --to "#ops-info"
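The routing logic these three commands configure can be sketched as a simple lookup table. This is an illustrative Python sketch, not OpenClaw's implementation; the `page_on_call` field is a hypothetical name for the paging flag:

```python
# Illustrative sketch of the severity routing configured above.
# Channel names match the alert-route commands; everything else is assumed.
ROUTES = {
    "critical": {"channel": "#ops-critical", "page_on_call": True},
    "warning":  {"channel": "#ops-alerts",   "page_on_call": False},
    "info":     {"channel": "#ops-info",     "page_on_call": False},
}

def route(severity: str) -> dict:
    # Per the agent instructions: when in doubt, default to WARNING.
    # Alerts are never dropped, only routed to a quieter tier.
    return ROUTES.get(severity.lower(), ROUTES["warning"])
```

Note the fallback: an unrecognized severity lands in the WARNING channel rather than disappearing, which mirrors the "never suppress entirely" rule.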

Add AI enrichment to alerts

This is the key difference from a standard monitoring setup: instead of forwarding raw alerts, OpenClaw analyzes them first.

openclaw agent edit "DevOps Monitor" \
  --append-message "

ENRICHMENT RULES:
When you receive an alert, before sending to chat:

1. CHECK RECENT EVENTS (last 2 hours):
   - Any deployments? (from GitHub/Vercel connection)
   - Any config changes?
   - Any other alerts that fired around the same time?

2. CHECK HISTORICAL PATTERNS:
   - Has this alert fired before? When? What fixed it?
   - Is this a known pattern (e.g., daily backup causes CPU spike)?

3. ANALYZE PROBABLE CAUSE:
   - If deploy happened < 30 min ago: likely deploy-related
   - If multiple services affected: likely infrastructure (network, DB, DNS)
   - If single service affected: likely application-level

4. FORMAT THE ALERT:
   🚨 [SEVERITY] Service: [name]
   📊 Metric: [what triggered] → [current value]
   🕐 Started: [time] (duration: [X min])
   🔄 Recent changes: [deploy/config/none]
   🤔 Probable cause: [AI analysis]
   📋 Suggested action: [specific next step]
   📎 Similar incidents: [link to past incidents if any]"
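Steps 1–3 of the enrichment flow can be sketched roughly in code. This is a minimal illustration with hypothetical data shapes (`time`, `metric` keys); the real agent does this reasoning itself rather than running fixed rules:

```python
from datetime import datetime, timedelta

def enrich(alert: dict, recent_deploys: list, history: list) -> dict:
    """Sketch of enrichment steps 1-3 (hypothetical data shapes)."""
    now = datetime.fromisoformat(alert["time"])
    # 1. Recent events: deploys within the last 2 hours
    window = now - timedelta(hours=2)
    deploys = [d for d in recent_deploys
               if datetime.fromisoformat(d["time"]) >= window]
    # 2. Historical patterns: has this metric fired before?
    similar = [h for h in history if h["metric"] == alert["metric"]]
    # 3. Probable-cause heuristic from the rules above
    fresh = [d for d in deploys
             if now - datetime.fromisoformat(d["time"]) < timedelta(minutes=30)]
    cause = "likely deploy-related" if fresh else "likely application-level"
    return {**alert, "recent_deploys": deploys,
            "similar_incidents": similar, "probable_cause": cause}
```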

Test with a simulated alert:

openclaw connect simulate Grafana \
  --alert "High error rate" \
  --metric "http_error_rate" \
  --value "4.2%" \
  --host "api-prod-01" \
  --duration "8 minutes"

Expected enriched alert:

🟡 WARNING — api-prod-01

📊 Error rate: 4.2% (threshold: 1%)
🕐 Started: 14:32 UTC (8 minutes ago)

🔄 Recent changes:
  • Deploy #487 at 14:25 UTC (7 min before alert)
    "feat: add user search endpoint" by @sarah

🤔 Probable cause:
  New deployment likely introduced a regression.
  Error rate was 0.1% before deploy, spiked to 4.2% after.
  Most errors are 500s on /api/users/search endpoint.

📋 Suggested actions:
  1. Check logs: /logs api-prod-01 --since 14:25
  2. Roll back if confirmed: /rollback #487
  3. Or investigate: /ssh api-prod-01 "tail -f /var/log/app.log"

📎 Similar: Incident #34 (Jan 2) — same endpoint, fixed by
   adding index on users.email

Set up alert deduplication

Prevent the same issue from flooding your channel:

openclaw alert-rule add \
  --name "Deduplication" \
  --group-by "host,metric" \
  --window "15m" \
  --action "If the same host+metric fires multiple times within 15 minutes,
send only the first alert. Append 'Still ongoing (X occurrences)' to the
original message every 15 minutes until resolved."

Before dedup:

14:32  🟡 CPU > 85% on web-prod-01
14:37  🟡 CPU > 85% on web-prod-01
14:42  🟡 CPU > 85% on web-prod-01
14:47  🟡 CPU > 85% on web-prod-01

After dedup:

14:32  🟡 CPU > 85% on web-prod-01
14:47  🟡 ↳ Still ongoing (4 occurrences, 15 min)
15:02  ✅ Resolved: CPU normalized at 72% on web-prod-01
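The before/after behavior comes down to a per-(host, metric) window. Here is a minimal sketch of that logic, assuming in-memory state (the message wording is illustrative, not OpenClaw's exact format):

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=15)
_seen: dict = {}  # (host, metric) -> [window start, occurrence count]

def dedup(host: str, metric: str, ts: datetime):
    """Return a message to send, or None to suppress this occurrence."""
    key = (host, metric)
    if key not in _seen:
        _seen[key] = [ts, 1]
        return f"ALERT {metric} on {host}"
    start, count = _seen[key]
    _seen[key][1] = count + 1
    if ts - start >= WINDOW:
        _seen[key][0] = ts  # restart the window for the next update
        mins = int((ts - start).total_seconds() // 60)
        return f"Still ongoing ({count + 1} occurrences, {mins} min)"
    return None  # same host+metric inside the window: suppressed
```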

Create smart correlation rules

Group related alerts into a single incident:

openclaw alert-rule add \
  --name "Incident Correlation" \
  --rule "If 3+ alerts fire within 5 minutes across different services,
group them as a single incident. Create an incident thread in #ops-critical
with:
1. All related alerts
2. Common factors (same host? same time? same recent deploy?)
3. Likely root cause based on correlation
4. Single action plan instead of per-alert suggestions"

Example correlated incident:

🔴 INCIDENT #42 — Multiple services affected

Related alerts (all started within 3 minutes):
  • API error rate: 8.5% on api-prod-01
  • Database connections: 95% utilized on db-prod-01
  • Queue depth: 15,000 (normally <100) on worker-prod-01

🔄 Common factor: Deploy #487 at 14:25 UTC
🤔 Root cause: New search endpoint running unoptimized
   queries, exhausting DB connection pool. API errors
   and queue backlog are downstream effects.

📋 Action plan:
  1. Roll back deploy #487: /rollback #487
  2. Monitor DB connections: /metrics db-prod-01 connections
  3. Once stable, investigate the query in staging

Add time-aware rules

Some alerts mean different things at different times:

openclaw alert-rule add \
  --name "Time Context" \
  --rule "Apply these time-based overrides:

BUSINESS HOURS (9am-6pm weekdays):
  - CPU > 85%: WARNING (normal load)
  - Deployment events: INFO

OFF-HOURS (nights, weekends):
  - CPU > 85%: CRITICAL (unexpected load)
  - Any deployment: WARNING (who's deploying at 2 AM?)

KNOWN MAINTENANCE WINDOWS:
  - Suppress all alerts during scheduled maintenance
  - Auto-resolve alerts that clear within 10 min of maintenance end

DAILY PATTERNS:
  - Ignore backup-related CPU spikes (daily 2-3 AM)
  - Ignore auto-scaling events during known traffic peaks"
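The business-hours override for the CPU case can be sketched as follows. A minimal illustration, assuming 9am–6pm weekdays counts as business hours and ignoring maintenance windows and daily patterns:

```python
from datetime import datetime

def cpu_severity(ts: datetime) -> str:
    """The same CPU > 85% signal maps to different severities by time."""
    business_hours = ts.weekday() < 5 and 9 <= ts.hour < 18
    # Expected load during the day, suspicious load at night or on weekends
    return "warning" if business_hours else "critical"
```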

Alert Rule Cheat Sheet

Rule               Effect
Severity routing   Critical → page, Warning → alert channel, Info → quiet channel
AI enrichment      Raw metric → context + cause + action
Deduplication      Same alert fires 10x → you see 1 message + updates
Correlation        5 related alerts → 1 incident thread
Time awareness     CPU spike at 2 PM = normal, at 2 AM = critical

Create alert templates

Create templates for your most common alert types:

openclaw alert-template add \
  --name "High Error Rate" \
  --format "
🚨 [{severity}] Error rate spike on {host}

Rate: {value} (threshold: {threshold})
Top errors:
{top_5_error_messages}

Recent deploys: {recent_deploys}
Suggested: {ai_suggestion}"

Templates ensure consistent formatting and make alerts scannable at a glance.
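The `{field}` placeholders in the template behave like standard string interpolation. A minimal sketch of how such a template renders, using a shortened version of the format above (the field names match the template command; the rendering mechanism is an assumption):

```python
# Shortened version of the "High Error Rate" template above
TEMPLATE = (
    "🚨 [{severity}] Error rate spike on {host}\n"
    "Rate: {value} (threshold: {threshold})"
)

def render(alert: dict) -> str:
    # format_map fills each {field} placeholder from the alert dict
    return TEMPLATE.format_map(alert)
```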

Quiet alerts during deploys

Temporarily suppress non-critical alerts during deployments:

openclaw alert-rule add \
  --name "Deploy Quiet" \
  --rule "When a deployment starts, suppress WARNING-level alerts
for 10 minutes. If any alert persists after 10 minutes, escalate
to CRITICAL (the deploy may have caused a problem)."

This prevents false positives from deployment-related metric fluctuations while still catching real issues.
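The suppress-then-escalate behavior can be sketched as a function of when the alert started and how long the deploy has been running (minutes relative to deploy start; an illustrative sketch, not OpenClaw's implementation):

```python
def deploy_quiet(severity: str, started_min: float, now_min: float):
    """started_min/now_min: minutes after deploy the alert began / is now.

    Returns the effective severity, or None to suppress."""
    if severity != "warning":
        return severity          # CRITICAL always goes through
    if now_min <= 10:
        return None              # inside the quiet window: suppressed
    if started_min <= 10:
        return "critical"        # persisted past the window: escalate
    return "warning"             # began after the window: normal handling
```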

What You Have Now

Your alert system is no longer a fire hose. Alerts are:

  • Tiered — critical alerts page you, info waits for morning
  • Enriched — each alert includes probable cause and suggested action
  • Deduplicated — no more notification spam from flapping metrics
  • Correlated — related alerts grouped into single incidents
  • Time-aware — context-sensitive severity based on time of day

The next lesson turns these alerts into action with one-command remediation.

Knowledge Check

Why should you downgrade noisy alerts to INFO instead of suppressing them entirely?