Lesson 4 of 5
Automated Incident Response
Estimated time: 6 minutes
Alerts tell you something is wrong. This lesson gives you the tools to fix it without leaving your chat app. You'll build one-command remediations that roll back deploys, scale infrastructure, restart services, and more — all with safety guardrails.
Safety first
Every remediation action in this lesson requires explicit confirmation before executing. OpenClaw will never auto-execute destructive actions (rollbacks, restarts, scaling) without your approval. The goal is speed, not recklessness.
Set up remediation commands
Create chat commands for common fixes:
openclaw command add \
--name "rollback" \
--trigger "/rollback" \
--source "GitHub,AWS Production" \
--message "The user wants to roll back a deployment.
1. Identify the deployment to roll back (by number, or default to latest)
2. Show what will change: commit range, affected services
3. Ask for explicit confirmation
4. Execute the rollback
5. Monitor for 5 minutes and report if metrics improve
SAFETY: Always show a diff of what will change before executing.
Never roll back more than one deployment at a time." \
--announce \
--channel slack \
--to "#ops-critical"
openclaw command add \
--name "scale" \
--trigger "/scale" \
--source "AWS Production" \
--message "The user wants to scale infrastructure.
Parse the request and:
1. Identify the service and desired scale
2. Show current vs. proposed state (instances, cost impact)
3. Ask for confirmation
4. Execute the scaling action
5. Monitor until new instances are healthy
SAFETY: Cap maximum scale at 3x current capacity.
For larger scales, require a reason." \
--announce \
--channel slack \
--to "#ops-alerts"
openclaw command add \
--name "restart" \
--trigger "/restart" \
--source "AWS Production" \
--message "The user wants to restart a service.
1. Identify the service
2. Show current health status
3. Confirm: rolling restart or full restart?
4. Execute with zero-downtime if possible (rolling)
5. Monitor health checks after restart
SAFETY: Default to rolling restart. Full restart only if explicitly requested.
Refuse to restart database services — suggest failover instead." \
--announce \
--channel slack \
--to "#ops-alerts"
Create diagnostic commands
Before you fix, you need to investigate. Set up quick diagnostics:
openclaw command add \
--name "logs" \
--trigger "/logs" \
--source "AWS Production" \
--message "Fetch and analyze recent logs.
Parse: /logs [service] [--since time] [--grep pattern]
1. Fetch the last 100 lines (or since the specified time)
2. Highlight errors and warnings
3. Group similar errors and count occurrences
4. Identify the most likely root cause
5. Suggest next steps
Format: Show the 5 most relevant log lines, then your analysis.
Don't dump raw logs — summarize intelligently." \
--announce \
--channel slack \
--to "#ops-alerts"
openclaw command add \
--name "metrics" \
--trigger "/metrics" \
--source "Grafana,AWS Production" \
--message "Show current metrics for a service.
Parse: /metrics [service] [metric-name]
Display:
- Current value vs. normal baseline
- Trend over last hour (↑ increasing, ↓ decreasing, → stable)
- If abnormal, when it started deviating
- Related metrics that might explain the anomaly" \
--announce \
--channel slack \
--to "#ops-alerts"
Build an incident workflow
When a critical alert fires, OpenClaw should automatically start an incident workflow:
openclaw agent edit "DevOps Monitor" \
--append-message "
INCIDENT WORKFLOW (for CRITICAL alerts):
1. Create an incident thread in #ops-critical
2. Set the thread topic: '🔴 [service] — [description]'
3. Post initial context (alert details, probable cause, timeline)
4. Suggest immediate actions with command shortcuts
5. Tag the on-call engineer
6. Start a timeline — log every action taken in the thread
7. When resolved, generate a summary with:
- Duration
- Root cause
- Actions taken
- What to fix long-term"
Here's what an incident thread looks like:
🔴 INCIDENT #43 — api-prod-01 error rate critical
📊 Error rate: 12.3% (threshold: 5%)
🕐 Started: 03:14 UTC
🔄 Last deploy: #492 at 03:01 UTC (@mike — "fix: cache TTL update")
🤔 Probable cause: Cache TTL change causing cache stampede.
All requests hitting DB directly after cache invalidation.
📋 Immediate actions:
/rollback #492 — revert the cache change
/scale api 4 — add capacity while investigating
/logs api-prod-01 — check error details
/metrics db-prod-01 — check DB connection pool
👤 On-call: @sarah (paged via PagerDuty)
─── Timeline ───
03:14 Alert fired: error rate > 5%
03:15 Escalated to CRITICAL (rate > 10%)
03:16 @sarah acknowledged
03:18 @sarah: /logs api-prod-01 --since 03:01
03:19 Logs show "connection pool exhausted" errors
03:20 @sarah: /rollback #492
03:20 Rolling back deploy #492...
03:22 Rollback complete. Monitoring...
03:25 ✅ Error rate back to 0.2%. Cache warming.
03:30 ✅ RESOLVED — Duration: 16 minutes
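The resolution summary's duration comes straight from the timeline: subtract the first entry's timestamp from the last. A sketch of that calculation, assuming same-day `HH:MM` timestamps at the start of each entry:

```python
from datetime import datetime

def incident_duration(timeline):
    """Minutes between the first and last timeline entries ('HH:MM message').

    Assumes the incident starts and resolves on the same day.
    """
    parse = lambda entry: datetime.strptime(entry.split()[0], "%H:%M")
    start, end = parse(timeline[0]), parse(timeline[-1])
    return int((end - start).total_seconds() // 60)
```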
Add auto-remediation for known issues
For well-understood problems, you can pre-approve automatic fixes:
openclaw auto-fix add \
--name "Disk Cleanup" \
--condition "disk_usage > 90%" \
--action "Clean Docker images and log rotation:
1. docker system prune -f (removes unused images)
2. journalctl --vacuum-size=500M (trim system logs)
3. Find and remove files in /tmp older than 7 days" \
--approval "auto" \
--notify "#ops-info" \
--max-runs "3 per day"
openclaw auto-fix add \
--name "Service Restart on OOM" \
--condition "service_oom_killed = true" \
--action "Rolling restart of the affected service" \
--approval "auto" \
--notify "#ops-alerts" \
--cooldown "30m"
Auto-fix guardrails
Auto-remediation is powerful but dangerous. Rules to follow:
- Only auto-fix idempotent actions (disk cleanup, service restart)
- Never auto-fix: rollbacks, scaling down, database operations
- Set cooldowns to prevent remediation loops
- Always notify a channel, even for auto-fixes
- Set a max-runs-per-day limit to prevent runaway automation
For anything not pre-approved, OpenClaw asks first:
🟡 Disk at 92% on web-prod-01
I can clean up:
• Docker images: ~4.2 GB
• Old logs: ~1.8 GB
• /tmp files > 7 days: ~600 MB
Total recovery: ~6.6 GB (→ would bring disk to 71%)
Run cleanup? Reply /approve or /deny
Command Reference
| Command | What It Does | Requires Confirmation |
|---|---|---|
| /rollback [#deploy] | Revert to previous deployment | Yes |
| /scale [service] [count] | Scale instances up or down | Yes |
| /restart [service] | Rolling restart of a service | Yes |
| /logs [service] | Fetch and analyze recent logs | No |
| /metrics [service] | Show current metrics + trends | No |
| /status | Overview of all services | No |
| /incident close | Resolve the current incident | Yes |
Add stack-specific commands:
# Clear Redis cache
openclaw command add \
--name "flush-cache" \
--trigger "/flush-cache" \
--source "AWS Production" \
--message "Flush the Redis cache for a specific service or pattern.
Parse: /flush-cache [service|pattern]
Confirm before executing. Show estimated cache rebuild time."
# Database failover
openclaw command add \
--name "db-failover" \
--trigger "/db-failover" \
--source "AWS Production" \
--message "Initiate database failover to read replica.
Show: current primary, replica lag, estimated downtime.
Require double confirmation for this action."
Set up automatic escalation if nobody responds:
openclaw escalation add \
--name "Default" \
--level-1 "on-call" --timeout "5m" \
--level-2 "team-lead" --timeout "15m" \
--level-3 "engineering-manager" --timeout "30m" \
--message "No one has acknowledged incident #{id} after {timeout}.
Escalating to {next_level}."
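The escalation ladder above can be read as: page level 1 immediately, then move up one level each time the current timeout expires without an acknowledgment. A sketch of that lookup, assuming per-level (not cumulative) timeouts:

```python
def who_to_page(levels, minutes_unacked):
    """Given (target, timeout_minutes) pairs, return who should be paged now."""
    elapsed = 0
    for target, timeout in levels:
        if minutes_unacked < elapsed + timeout:
            return target
        elapsed += timeout
    return levels[-1][0]  # ladder exhausted: stay at the top level
```

With the ladder from the command (`on-call` 5m, `team-lead` 15m, `engineering-manager` 30m), an incident unacknowledged for 7 minutes pages the team lead; at 25 minutes it reaches the engineering manager.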
What You Have Now
Your incident response toolkit:
- Remediation commands: /rollback, /scale, /restart — all with safety confirmations
- Diagnostic commands: /logs, /metrics — AI-analyzed, not raw dumps
- Incident workflow: automatic thread creation, timeline tracking, resolution summary
- Auto-remediation: pre-approved fixes for known issues, with guardrails
The next lesson turns your team's tribal knowledge into executable runbooks.
Why should auto-remediation be limited to idempotent actions?