DevOps Monitoring

Lesson 5 of 5

Creating Chat Runbooks

Estimated time: 5 minutes

Every team has tribal knowledge — the things the senior engineer knows that aren't written down anywhere. "When the queue backs up, check if the worker lost its Redis connection. If so, restart the worker, then flush the dead letter queue." In this lesson, you'll capture that knowledge as executable runbooks that any team member can run from chat.

The Problem with Traditional Runbooks

Most runbooks live in a wiki or Google Doc. They're:

  • Outdated: Written six months ago, and the infrastructure has changed since
  • Hard to find: Buried in a wiki with 200 other pages
  • Manual to execute: "SSH into server, run this command, check this output, then run that command"
  • Single-threaded: Only the person reading the wiki sees the steps, so nobody else can follow along or help

Chat runbooks fix all of this. They're discoverable (/runbook list), executable (one command), and shared (the whole team sees it happen in real time).

Create your first runbook

Start with a common scenario — high database connections:

openclaw runbook add \
  --name "DB Connection Pool Exhausted" \
  --trigger "/runbook db-pool" \
  --tags "database,performance,critical" \
  --steps '
Step 1: Check current connections
  - Command: SELECT count(*) FROM pg_stat_activity;
  - Expected: Under 90% of max_connections (check the limit with SHOW max_connections;)
  - If over 90%: continue to Step 2

Step 2: Identify connection hogs
  - Command: SELECT application_name, count(*)
    FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
  - Look for: One service holding too many connections

Step 3: Check for idle connections
  - Command: SELECT count(*) FROM pg_stat_activity
    WHERE state = '"'"'idle'"'"' AND state_change < now() - interval '"'"'5 minutes'"'"';
  - If > 20 idle connections: continue to Step 4

Step 4: Kill idle connections (with confirmation)
  - Command: SELECT pg_terminate_backend(pid) FROM pg_stat_activity
    WHERE state = '"'"'idle'"'"' AND state_change < now() - interval '"'"'5 minutes'"'"';
  - REQUIRES CONFIRMATION before executing

Step 5: Verify recovery
  - Command: SELECT count(*) FROM pg_stat_activity;
  - Expected: Connection count below 70% of max_connections
  - If still high: escalate to on-call DBA
' \
  --source "AWS Production" \
  --announce \
  --channel slack \
  --to "#ops-alerts"
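
The branch conditions in the steps above can be sketched as plain functions. This is only an illustration of the thresholds written in the runbook (90%, 20 idle connections, 70%), not how OpenClaw evaluates steps; the function names and return labels are hypothetical.

```python
def db_pool_next_action(total: int, max_connections: int, stale_idle: int) -> str:
    """Mirror the branch conditions written in the runbook steps."""
    usage = total / max_connections
    if usage <= 0.90:
        return "ok"              # Step 1: not over 90%, nothing to do
    if stale_idle > 20:
        return "terminate-idle"  # Step 3 leads to Step 4 (needs confirmation)
    return "investigate"         # Over 90% but few idle sessions: see Step 2


def recovered(total: int, max_connections: int) -> bool:
    """Step 5: recovery means usage is back under 70% of max_connections."""
    return total / max_connections < 0.70
```

With 95 of 100 connections in use and 30 of them idle for more than five minutes, the sketch returns "terminate-idle", which corresponds to the confirmation-gated Step 4.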

Test it by running the DB pool runbook from chat:

/runbook db-pool

OpenClaw executes each step, shows the output, and pauses at Step 4 for confirmation before killing connections.
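
To make the "pause for confirmation" behavior concrete, here is a minimal sketch of a step runner with a confirmation gate. It is a toy model under stated assumptions, not OpenClaw's implementation: the Step class, run_runbook function, and the sample outputs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], str]          # returns the output to show in chat
    requires_confirmation: bool = False


def run_runbook(steps: List[Step], confirm: Callable[[str], bool]) -> List[str]:
    """Run steps in order; stop at the first gated step the operator declines."""
    log = []
    for step in steps:
        if step.requires_confirmation and not confirm(step.name):
            log.append(f"{step.name}: skipped (not confirmed)")
            break
        log.append(f"{step.name}: {step.action()}")
    return log


steps = [
    Step("Check connections", lambda: "95 of 100 in use"),
    Step("Kill idle connections", lambda: "terminated 30", requires_confirmation=True),
    Step("Verify recovery", lambda: "65 of 100 in use"),
]
declined = run_runbook(steps, confirm=lambda name: False)
approved = run_runbook(steps, confirm=lambda name: True)
```

In the declined run the log ends at the gated step, so nothing destructive happens without an explicit yes. In a chat setting the confirm callback would wait for a team member's reply.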

Create a deployment verification runbook

Run this after every deploy to verify everything is healthy:

openclaw runbook add \
  --name "Post-Deploy Verification" \
  --trigger "/runbook post-deploy" \
  --tags "deployment,verification" \
  --auto-run-after "deployment" \
  --steps '
Step 1: Health check
  - Hit /health endpoint on all instances
  - Expected: 200 OK on every instance
  - If any fail: alert and suggest rollback

Step 2: Error rate check
  - Compare error rate to pre-deploy baseline (5 min window)
  - Expected: No increase > 0.5%
  - If elevated: show top new errors

Step 3: Response time check
  - Compare p50, p95, p99 to pre-deploy baseline
  - Expected: No increase > 20%
  - If degraded: show slowest endpoints

Step 4: Key transaction test
  - Run synthetic tests: login, search, checkout
  - Expected: All pass within SLA
  - If any fail: show failure details

Step 5: Summary
  - Overall: ✅ All clear / ⚠️ Issues detected / 🔴 Rollback recommended
  - Metrics comparison: before vs. after
  - If issues: suggested action
' \
  --source "AWS Production,Grafana" \
  --timeout "5m"

Auto-run after deploy

The --auto-run-after "deployment" flag triggers this runbook automatically when a deploy completes. No manual step needed — every deploy gets verified.
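
The baseline comparisons in Steps 2 and 3 can be sketched as follows. This is an illustration only: the metric key names are hypothetical, and it assumes the "0.5%" error-rate threshold means an absolute rise of 0.5 percentage points while the "20%" latency threshold is relative to the baseline.

```python
def verify_deploy(before: dict, after: dict) -> str:
    """Compare post-deploy metrics to the pre-deploy baseline."""
    issues = []
    # Step 2: error rate may rise by at most 0.5 percentage points
    # (assumption: the threshold is an absolute difference).
    if after["error_rate_pct"] - before["error_rate_pct"] > 0.5:
        issues.append("error rate")
    # Step 3: each latency percentile may grow by at most 20% over baseline.
    for pct in ("p50", "p95", "p99"):
        if after[pct] > before[pct] * 1.20:
            issues.append(pct)
    return "all clear" if not issues else "issues: " + ", ".join(issues)


baseline = {"error_rate_pct": 1.0, "p50": 100, "p95": 300, "p99": 800}
healthy = {"error_rate_pct": 1.2, "p50": 110, "p95": 320, "p99": 900}
degraded = {"error_rate_pct": 2.0, "p50": 110, "p95": 400, "p99": 900}
```

Here verify_deploy(baseline, healthy) reports "all clear", while the degraded sample is flagged for both error rate and p95 latency.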

Build a library of runbooks

Create runbooks for your team's most common scenarios:

openclaw runbook add \
  --name "High Memory Usage" \
  --trigger "/runbook high-memory" \
  --tags "memory,performance" \
  --steps '
Step 1: Identify top memory consumers
  - Command: ps aux --sort=-%mem | head -20
  - Look for: Processes using > 20% memory

Step 2: Check for memory leaks
  - Compare current RSS to 24h ago
  - If growing steadily: likely leak in [service name]

Step 3: Remediate
  - If leak confirmed: rolling restart of affected service
  - If spike (not leak): check for large payload processing
  - REQUIRES CONFIRMATION for restart

Step 4: Long-term
  - Create ticket: investigate memory leak in [service]
  - Add memory limit to container config
'
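
Step 2 and Step 3 distinguish a leak ("growing steadily") from a spike. One way to express that distinction is sketched below. The 80% and 1.5x cutoffs and the function name are assumptions chosen for illustration; the runbook itself does not define them.

```python
def classify_memory(rss_mb: list) -> str:
    """Classify RSS samples in MB, oldest first (for example hourly over 24h)."""
    if len(rss_mb) < 3:
        return "insufficient data"
    deltas = [b - a for a, b in zip(rss_mb, rss_mb[1:])]
    rising = sum(1 for d in deltas if d > 0)
    # Assumed meaning of "growing steadily": at least 80% of intervals rose.
    if rising / len(deltas) >= 0.8 and rss_mb[-1] > rss_mb[0]:
        return "likely leak"
    # Assumed meaning of "spike": latest sample is 1.5x above every earlier one.
    if rss_mb[-1] > 1.5 * max(rss_mb[:-1]):
        return "spike"
    return "stable"
```

A series such as 100, 110, 120, 130, 140, 150 is classified as a likely leak, while 100, 100, 100, 100, 300 is classified as a spike, which maps to the two branches in Step 3.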

Make runbooks discoverable

Your team needs to find the right runbook fast during an incident:

openclaw runbook index \
  --link-to-alerts \
  --auto-suggest

This does two things:

  1. Links runbooks to alerts: When a "high memory" alert fires, OpenClaw suggests /runbook high-memory in the alert message.

  2. Makes runbooks searchable:

/runbook list                  — show all runbooks
/runbook search "database"     — find DB-related runbooks
/runbook recent                — last 5 runbooks executed
/runbook history db-pool       — past executions + outcomes

Runbook suggestions in alerts

After linking, your enriched alerts automatically include the relevant runbook:

🟡 WARNING — Memory at 88% on api-prod-01
📋 Suggested runbook: /runbook high-memory

New team members don't need to know which runbook to use — the alert tells them.
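
Conceptually, a suggestion like this can come from matching words in the alert against each runbook's tags. The sketch below uses the triggers and tags defined earlier in this lesson; the matching logic itself is an assumption for illustration and is not necessarily how OpenClaw links alerts to runbooks.

```python
import re

# Triggers and tags copied from the runbooks created in this lesson.
RUNBOOKS = {
    "/runbook db-pool": {"database", "performance", "critical"},
    "/runbook post-deploy": {"deployment", "verification"},
    "/runbook high-memory": {"memory", "performance"},
}


def suggest_runbook(alert_text: str):
    """Return the trigger whose tags overlap most with words in the alert."""
    words = set(re.findall(r"[a-z0-9-]+", alert_text.lower()))
    best, best_score = None, 0
    for trigger, tags in RUNBOOKS.items():
        score = len(tags & words)
        if score > best_score:
            best, best_score = trigger, score
    return best  # None when no tag matches
```

For the alert text "WARNING: Memory at 88% on api-prod-01" this returns "/runbook high-memory". Tag choice matters: an alert that shares no word with any tag gets no suggestion.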

Set up runbook reviews

Runbooks go stale. Schedule a monthly review (the cron expression below fires at 10:00 on the 1st of every month, New York time):

openclaw cron add \
  --name "Runbook Audit" \
  --cron "0 10 1 * *" \
  --tz "America/New_York" \
  --message "Monthly runbook audit. For each runbook:

1. Last executed: when? Did it succeed?
2. Last updated: is it > 90 days old? Flag for review.
3. Coverage: are there recent incidents without a matching runbook?
4. Effectiveness: runbooks that were run but didn't resolve the issue

Generate a report:
- Runbooks needing update
- Gaps: incident types without runbooks
- Suggestion: new runbooks to create based on recent incidents" \
  --announce \
  --channel slack \
  --to "#ops-info"
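
The "older than 90 days" rule in item 2 of the audit can be checked mechanically. A minimal sketch, assuming you can export each runbook's last-updated date; the function name, input shape, and sample dates are hypothetical.

```python
from datetime import date, timedelta


def stale_runbooks(last_updated: dict, today: date, max_age_days: int = 90) -> list:
    """Return names of runbooks last updated more than max_age_days ago."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, updated in last_updated.items() if updated < cutoff)


# Sample data for illustration only.
updates = {
    "db-pool": date(2025, 1, 15),
    "post-deploy": date(2025, 3, 3),   # exactly 90 days before the audit date
    "high-memory": date(2025, 5, 20),
}
flagged = stale_runbooks(updates, today=date(2025, 6, 1))
```

Only db-pool is flagged here. A runbook updated exactly 90 days ago is not flagged, because the rule is "more than 90 days old".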

The Complete DevOps System

Here's everything you've built across this course:

  📡 Monitoring Stack          🤖 OpenClaw Gateway         💬 Chat Ops
  ┌─────────────────┐       ┌─────────────────────┐     ┌─────────────────┐
  │ Grafana/Datadog │──────>│ AI Enrichment:      │     │ Smart alerts    │
  │ CloudWatch      │       │ - Severity routing  │────>│ with context    │
  │ Custom webhooks │       │ - Cause analysis    │     │                 │
  └─────────────────┘       │ - Dedup + correlate │     │ /rollback       │
                            └─────────────────────┘     │ /scale          │
  🚀 CI/CD                          │                   │ /restart        │
  ┌─────────────────┐               │                   │ /logs           │
  │ GitHub Actions  │───────────────┘                   │ /runbook        │
  │ Vercel/Docker   │                                   └─────────────────┘
  └─────────────────┘                                            │
                                                                 ▼
  ☁️ Infrastructure                                     ┌─────────────────┐
  ┌─────────────────┐                                   │ Incident thread │
  │ AWS/GCP/SSH     │<──── remediation commands ────────│ with timeline,  │
  │                 │                                   │ actions, and    │
  └─────────────────┘                                   │ auto-resolution │
                                                        └─────────────────┘

Course Complete

You've built a chat-based DevOps monitoring system that:

  1. Connects monitoring, cloud, and CI/CD tools
  2. Enriches alerts with AI analysis and deployment correlation
  3. Reduces noise with deduplication, correlation, and time-awareness
  4. Responds with one-command remediation from chat
  5. Documents tribal knowledge as executable runbooks

The system gets better over time — every incident adds data for future AI analysis, and runbook audits keep your playbooks current.

Knowledge Check

What's the biggest advantage of runbooks in chat over runbooks in a wiki?