Incident Response Runbook Skill
Guides on-call engineers through structured incident triage, investigation, communication, and resolution.
A reusable skill package for Claude Code and Cowork.
When to use this skill
- Responding to production incidents and outages
- Triaging alerts and determining severity
- Coordinating incident communication with stakeholders
- Writing post-incident reviews and action items
What this skill does
Provides a structured incident response framework: assesses severity based on user impact and blast radius, guides systematic investigation of symptoms and root causes, templates stakeholder communication at each stage, and produces a post-incident review with timeline and action items.
How it works
1. Assess severity: classify by user impact (P0-P3), identify blast radius and affected services
2. Investigate: gather logs, metrics, and recent changes. Identify symptoms vs. root cause
3. Mitigate: apply immediate fix or rollback, communicate status to stakeholders
4. Close out: write post-incident review with timeline, root cause, and follow-up action items
Full Skill Definition
---
name: incident-runbook
description: "Guides on-call engineers through structured incident triage, investigation, communication, and resolution."
---
# Incident Runbook
## Overview
You are an SRE specializing in incident management, response coordination, and blameless post-mortems.
## Purpose
Help teams build structured incident response processes that minimize downtime and capture learnings.
## When to Use
When a team needs runbooks for common failure modes, an incident response plan, or a post-mortem template.
## Incident Response Process
### Step 1: Define Scope & Classify the Incident
Identify the affected system, blast radius, and user impact before acting. Then assign a severity (P0-P3) that reflects that user impact and blast radius, plus any data integrity risk.
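A minimal sketch of how the three classification inputs could be turned into a severity rank. The `Impact` fields, the `classify` function, and every threshold are illustrative assumptions rather than part of the skill; rank 0 stands for the most severe tier, whatever label a team gives it.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    users_affected_pct: float  # share of users seeing errors, 0-100
    services_affected: int     # blast radius: number of impacted services
    data_integrity_risk: bool  # possible data loss or corruption


def classify(impact: Impact) -> int:
    """Return a severity rank from 0 (most severe) to 3 (least severe).

    All thresholds below are assumed values for illustration only.
    """
    if impact.data_integrity_risk or impact.users_affected_pct >= 50:
        return 0
    if impact.users_affected_pct >= 10 or impact.services_affected >= 3:
        return 1
    if impact.users_affected_pct >= 1:
        return 2
    return 3
```

Treating any data integrity risk as top severity is one possible design choice; teams with different SLA tiers would likely tune both the inputs and the cut-offs.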
### Step 2: Build the Runbook
Create step-by-step diagnosis and remediation instructions: symptoms → triage commands → fix actions → verification steps → escalation paths.
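One way to keep the symptoms → triage → fix → verification → escalation order explicit is to model a runbook as a small data structure. The sketch below is an assumption about how that might look; the `Runbook` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    title: str
    symptoms: list[str] = field(default_factory=list)
    triage_commands: list[str] = field(default_factory=list)
    fix_actions: list[str] = field(default_factory=list)
    verification: list[str] = field(default_factory=list)
    escalation: list[str] = field(default_factory=list)

    def _sections(self) -> list[tuple[str, list[str]]]:
        # Order mirrors the flow described in Step 2.
        return [
            ("Symptoms", self.symptoms),
            ("Triage commands", self.triage_commands),
            ("Fix actions", self.fix_actions),
            ("Verification steps", self.verification),
            ("Escalation paths", self.escalation),
        ]

    def missing_sections(self) -> list[str]:
        """Names of sections that are still empty."""
        return [name for name, items in self._sections() if not items]

    def render(self) -> str:
        """Render the runbook as markdown with numbered steps."""
        lines = [f"# {self.title}"]
        for name, items in self._sections():
            lines.append(f"## {name}")
            lines.extend(f"{i}. {item}" for i, item in enumerate(items, 1))
        return "\n".join(lines)
```

Calling `missing_sections()` before publishing gives a quick check that no stage of the flow was skipped.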
### Step 3: Define Communication Plan
Specify who to notify, status page updates, customer communication templates, and internal escalation procedures.
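A status update template could be as simple as a format string filled in at each stage. This is a sketch under assumptions: the template wording, the field names, and the `status_update` helper are hypothetical, and the service name in the usage note is a placeholder.

```python
from datetime import datetime, timezone

# Hypothetical template; the wording and field names are assumptions.
STATUS_TEMPLATE = (
    "[{severity}] {service}: {state}\n"
    "Impact: {impact}\n"
    "Next update by: {next_update}"
)


def status_update(severity: str, service: str, state: str,
                  impact: str, next_update: datetime) -> str:
    """Render one status update; next_update must be timezone-aware."""
    stamp = next_update.astimezone(timezone.utc).strftime("%H:%M UTC")
    return STATUS_TEMPLATE.format(
        severity=severity, service=service, state=state,
        impact=impact, next_update=stamp,
    )
```

For example, `status_update("P1", "checkout-api", "Investigating", "Elevated error rate", some_aware_datetime)` yields a three-line message whose last line commits to the time of the next update.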
### Step 4: Post-Mortem & Follow-Through
Generate a blameless post-mortem with: timeline, root cause analysis, contributing factors, what went well, and action items with owners. Schedule a follow-up review to verify that action items are completed and the fix is durable.
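The post-mortem fields listed above map naturally onto a record type, which also makes the follow-up review easy to automate. The following is a minimal sketch; `PostMortem`, `ActionItem`, and their methods are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ActionItem:
    description: str
    owner: str = ""     # empty string means nobody owns it yet
    done: bool = False


@dataclass
class PostMortem:
    title: str
    timeline: list[tuple[str, str]] = field(default_factory=list)  # (time, event)
    root_cause: str = ""
    contributing_factors: list[str] = field(default_factory=list)
    went_well: list[str] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)

    def unowned(self) -> list[ActionItem]:
        """Action items that still need an owner assigned."""
        return [a for a in self.action_items if not a.owner.strip()]

    def open_items(self) -> list[ActionItem]:
        """Action items to revisit at the follow-up review."""
        return [a for a in self.action_items if not a.done]
```

A post-mortem would typically be considered ready to circulate only once `unowned()` returns an empty list, and closed once `open_items()` does.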
## Error Handling
### Unknown Service Architecture
Ask for a service map or architecture diagram before writing runbooks. Generic runbooks are dangerous.
### Blame Language
Always use blameless language. Replace "X caused" with "the system allowed" in post-mortems.
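A rough heuristic for spotting blame phrasing in a draft is a pattern that looks for a capitalized word followed by a blame verb. The verb list and the `flag_blame` helper are assumptions, and the check will produce false positives (for example a sentence that starts with a system name), so it is a prompt for human review rather than a filter.

```python
import re

# Capitalized word followed by a blame verb; the verb list is an assumption.
BLAME_PATTERN = re.compile(
    r"\b([A-Z][a-z]+)\s+(caused|broke|failed to|forgot to)\b"
)


def flag_blame(text: str) -> list[str]:
    """Return phrases that look like they assign blame to a named subject."""
    return [m.group(0) for m in BLAME_PATTERN.finditer(text)]
```

Each flagged phrase is a candidate for rewording along the lines of "the system allowed".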
### Runbook Staleness
Recommend a periodic review cadence (quarterly) for all runbooks. Outdated runbooks are worse than no runbook because they create false confidence.
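A staleness check can be sketched as a comparison between each runbook's last review date and the review interval. Here "quarterly" is approximated as 90 days, and the `stale_runbooks` helper and the runbook names in the usage note are hypothetical.

```python
from datetime import date, timedelta

# "Quarterly" approximated as 90 days; adjust to the team's actual cadence.
REVIEW_INTERVAL = timedelta(days=90)


def stale_runbooks(last_reviewed: dict[str, date], today: date) -> list[str]:
    """Return names of runbooks whose last review is older than the interval."""
    return sorted(
        name
        for name, reviewed in last_reviewed.items()
        if today - reviewed > REVIEW_INTERVAL
    )
```

Given `{"db-failover": date(2024, 1, 1), "cache-flush": date(2024, 5, 1)}` and a `today` of 1 June 2024, only `db-failover` is reported as overdue.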
Summary
Guides on-call engineers through structured incident triage, investigation, communication, and resolution. Install this skill by placing the package in ~/.claude/skills/incident-runbook/ for personal use, or .claude/skills/incident-runbook/ for project-specific use.
FAQs
What severity levels does it use?
P0 (total outage), P1 (major feature broken), P2 (degraded experience), P3 (minor issue). Each maps to response time and escalation requirements.
Does it help with communication?
Yes. It templates status updates for internal teams, leadership, and customers at each stage of the incident.
Can I customize the severity definitions?
Yes. Edit the severity classification in Step 1 of the skill definition to match your team's SLA tiers and escalation policies.
Download & install
Install paths
Claude Code: personal (all projects)
~/.claude/skills/incident-runbook/SKILL.md
Claude Code: project-specific
.claude/skills/incident-runbook/SKILL.md
Cowork: skill plugin
Upload .skill.zip via Cowork plugin manager
Compatible with Claude Code, Cowork, and any SKILL.md-compatible agent platform.
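For the two Claude Code paths, installation amounts to copying `SKILL.md` into the right directory. The sketch below shows one way to script that; the `install_skill` helper and its `root` parameter are hypothetical, with `root` being the home directory for a personal install or the project directory for a project-specific one.

```python
import shutil
from pathlib import Path


def install_skill(source: Path, root: Path,
                  name: str = "incident-runbook") -> Path:
    """Copy a SKILL.md file into <root>/.claude/skills/<name>/SKILL.md."""
    target_dir = root / ".claude" / "skills" / name
    target_dir.mkdir(parents=True, exist_ok=True)  # create missing folders
    target = target_dir / "SKILL.md"
    shutil.copyfile(source, target)
    return target
```

Under these assumptions, `install_skill(Path("SKILL.md"), Path.home())` would perform a personal install and `install_skill(Path("SKILL.md"), Path.cwd())` a project-specific one.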
Skills in the registry are community starter templates provided as-is. skill.design and Designless do not guarantee accuracy, completeness, or fitness for any purpose. Always review, customize, and validate skills for your specific use case before deploying to production. You are responsible for the behavior of skills you install and use.