Lesson 4 of 5
Automated Incident Response
Estimated time: 6 minutes
Alerts tell you something is wrong. This lesson gives you the tools to fix it without leaving your chat app. You'll build one-command remediations that roll back deploys, scale infrastructure, restart services, and more — all with safety guardrails.
Safety first
Every remediation action in this lesson requires explicit confirmation before executing. OpenClaw will never auto-execute destructive actions (rollbacks, restarts, scaling) without your approval. The goal is speed, not recklessness.
Set up remediation commands
Create chat commands for common fixes:
openclaw command add \
--name "rollback" \
--trigger "/rollback" \
--source "GitHub,AWS Production" \
--message "The user wants to roll back a deployment.
1. Identify the deployment to roll back (by number, or default to latest)
2. Show what will change: commit range, affected services
3. Ask for explicit confirmation
4. Execute the rollback
5. Monitor for 5 minutes and report if metrics improve
SAFETY: Always show a diff of what will change before executing.
Never roll back more than one deployment at a time." \
--announce \
--channel slack \
--to "#ops-critical"
openclaw command add \
--name "scale" \
--trigger "/scale" \
--source "AWS Production" \
--message "The user wants to scale infrastructure.
Parse the request and:
1. Identify the service and desired scale
2. Show current vs. proposed state (instances, cost impact)
3. Ask for confirmation
4. Execute the scaling action
5. Monitor until new instances are healthy
SAFETY: Cap maximum scale at 3x current capacity.
For larger scales, require a reason." \
--announce \
--channel slack \
--to "#ops-alerts"
openclaw command add \
--name "restart" \
--trigger "/restart" \
--source "AWS Production" \
--message "The user wants to restart a service.
1. Identify the service
2. Show current health status
3. Confirm: rolling restart or full restart?
4. Execute with zero-downtime if possible (rolling)
5. Monitor health checks after restart
SAFETY: Default to rolling restart. Full restart only if explicitly requested.
Refuse to restart database services — suggest failover instead." \
--announce \
--channel slack \
--to "#ops-alerts"
Create diagnostic commands
Before you fix, you need to investigate. Set up quick diagnostics:
openclaw command add \
--name "logs" \
--trigger "/logs" \
--source "AWS Production" \
--message "Fetch and analyze recent logs.
Parse: /logs [service] [--since time] [--grep pattern]
1. Fetch the last 100 lines (or since the specified time)
2. Highlight errors and warnings
3. Group similar errors and count occurrences
4. Identify the most likely root cause
5. Suggest next steps
Format: Show the 5 most relevant log lines, then your analysis.
Don't dump raw logs — summarize intelligently." \
--announce \
--channel slack \
--to "#ops-alerts"
openclaw command add \
--name "metrics" \
--trigger "/metrics" \
--source "Grafana,AWS Production" \
--message "Show current metrics for a service.
Parse: /metrics [service] [metric-name]
Display:
- Current value vs. normal baseline
- Trend over last hour (↑ increasing, ↓ decreasing, → stable)
- If abnormal, when it started deviating
- Related metrics that might explain the anomaly" \
--announce \
--channel slack \
--to "#ops-alerts"
Build an incident workflow
When a critical alert fires, OpenClaw should automatically start an incident workflow:
openclaw agent edit "DevOps Monitor" \
--append-message "
INCIDENT WORKFLOW (for CRITICAL alerts):
1. Create an incident thread in #ops-critical
2. Set the thread topic: '🔴 [service] — [description]'
3. Post initial context (alert details, probable cause, timeline)
4. Suggest immediate actions with command shortcuts
5. Tag the on-call engineer
6. Start a timeline — log every action taken in the thread
7. When resolved, generate a summary with:
- Duration
- Root cause
- Actions taken
- What to fix long-term"
Here's what an incident thread looks like:
🔴 INCIDENT #43 — api-prod-01 error rate critical
📊 Error rate: 12.3% (threshold: 5%)
🕐 Started: 03:14 UTC
🔄 Last deploy: #492 at 03:01 UTC (@mike — "fix: cache TTL update")
🤔 Probable cause: Cache TTL change causing cache stampede.
All requests hitting DB directly after cache invalidation.
📋 Immediate actions:
/rollback #492 — revert the cache change
/scale api 4 — add capacity while investigating
/logs api-prod-01 — check error details
/metrics db-prod-01 — check DB connection pool
👤 On-call: @sarah (paged via PagerDuty)
─── Timeline ───
03:14 Alert fired: error rate > 5%
03:15 Escalated to CRITICAL (rate > 10%)
03:16 @sarah acknowledged
03:18 @sarah: /logs api-prod-01 --since 03:01
03:19 Logs show "connection pool exhausted" errors
03:20 @sarah: /rollback #492
03:20 Rolling back deploy #492...
03:22 Rollback complete. Monitoring...
03:25 ✅ Error rate back to 0.2%. Cache warming.
03:30 ✅ RESOLVED — Duration: 16 minutes
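The resolution summary's duration comes straight from the timeline: subtract the first entry's timestamp from the last. A sketch of that calculation, assuming same-day `HH:MM` timestamps at the start of each entry:

```python
from datetime import datetime

def incident_duration(timeline):
    """Minutes between the first and last timeline entries ('HH:MM message').

    Assumes the incident starts and resolves on the same day.
    """
    parse = lambda entry: datetime.strptime(entry.split()[0], "%H:%M")
    start, end = parse(timeline[0]), parse(timeline[-1])
    return int((end - start).total_seconds() // 60)
```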
Add auto-remediation for known issues
For well-understood problems, you can pre-approve automatic fixes:
openclaw auto-fix add \
--name "Disk Cleanup" \
--condition "disk_usage > 90%" \
--action "Clean Docker images and log rotation:
1. docker system prune -f (removes unused images)
2. journalctl --vacuum-size=500M (trim system logs)
3. Find and remove files in /tmp older than 7 days" \
--approval "auto" \
--notify "#ops-info" \
--max-runs "3 per day"
openclaw auto-fix add \
--name "Service Restart on OOM" \
--condition "service_oom_killed = true" \
--action "Rolling restart of the affected service" \
--approval "auto" \
--notify "#ops-alerts" \
--cooldown "30m"
Auto-fix guardrails
Auto-remediation is powerful but dangerous. Rules to follow:
- Only auto-fix idempotent actions (disk cleanup, service restart)
- Never auto-fix: rollbacks, scaling down, database operations
- Set cooldowns to prevent remediation loops
- Always notify a channel, even for auto-fixes
- Set a max-runs-per-day limit to prevent runaway automation
For anything not pre-approved, OpenClaw asks first:
🟡 Disk at 92% on web-prod-01
I can clean up:
• Docker images: ~4.2 GB
• Old logs: ~1.8 GB
• /tmp files > 7 days: ~600 MB
Total recovery: ~6.6 GB (→ would bring disk to 71%)
Run cleanup? Reply /approve or /deny
Command Reference
| Command | What It Does | Requires Confirmation |
|---|---|---|
| /rollback [#deploy] | Revert to previous deployment | Yes |
| /scale [service] [count] | Scale instances up or down | Yes |
| /restart [service] | Rolling restart of a service | Yes |
| /logs [service] | Fetch and analyze recent logs | No |
| /metrics [service] | Show current metrics + trends | No |
| /status | Overview of all services | No |
| /incident close | Resolve the current incident | Yes |
Add stack-specific commands:
# Clear Redis cache
openclaw command add \
--name "flush-cache" \
--trigger "/flush-cache" \
--source "AWS Production" \
--message "Flush the Redis cache for a specific service or pattern.
Parse: /flush-cache [service|pattern]
Confirm before executing. Show estimated cache rebuild time."
# Database failover
openclaw command add \
--name "db-failover" \
--trigger "/db-failover" \
--source "AWS Production" \
--message "Initiate database failover to read replica.
Show: current primary, replica lag, estimated downtime.
Require double confirmation for this action."
Set up automatic escalation if nobody responds:
openclaw escalation add \
--name "Default" \
--level-1 "on-call" --timeout "5m" \
--level-2 "team-lead" --timeout "15m" \
--level-3 "engineering-manager" --timeout "30m" \
--message "No one has acknowledged incident #{id} after {timeout}.
Escalating to {next_level}."
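The escalation ladder above can be read as: page level 1 immediately, then move up one level each time the current timeout expires without an acknowledgment. A sketch of that lookup, assuming per-level (not cumulative) timeouts:

```python
def who_to_page(levels, minutes_unacked):
    """Given (target, timeout_minutes) pairs, return who should be paged now."""
    elapsed = 0
    for target, timeout in levels:
        if minutes_unacked < elapsed + timeout:
            return target
        elapsed += timeout
    return levels[-1][0]  # ladder exhausted: stay at the top level
```

With the ladder from the command (`on-call` 5m, `team-lead` 15m, `engineering-manager` 30m), an incident unacknowledged for 7 minutes pages the team lead; at 25 minutes it reaches the engineering manager.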
What You Have Now
Your incident response toolkit:
- Remediation commands: /rollback, /scale, /restart — all with safety confirmations
- Diagnostic commands: /logs, /metrics — AI-analyzed, not raw dumps
- Incident workflow: automatic thread creation, timeline tracking, resolution summary
- Auto-remediation: pre-approved fixes for known issues, with guardrails
The next lesson turns your team's tribal knowledge into executable runbooks.
Why should auto-remediation be limited to idempotent actions?