Incident Response Runbook Skill

Guides on-call engineers through structured incident triage, investigation, communication, and resolution.

A reusable skill package for Claude Code and Cowork.

When to use this skill

  • Responding to production incidents and outages
  • Triaging alerts and determining severity
  • Coordinating incident communication with stakeholders
  • Writing post-incident reviews and action items

What this skill does

Provides a structured incident response framework: assesses severity based on user impact and blast radius, guides systematic investigation of symptoms and root causes, templates stakeholder communication at each stage, and produces a post-incident review with timeline and action items.

How it works

  1. Assess severity: classify by user impact (P0-P3), identify blast radius and affected services
  2. Investigate: gather logs, metrics, and recent changes. Identify symptoms vs. root cause
  3. Mitigate: apply immediate fix or rollback, communicate status to stakeholders
  4. Close out: write post-incident review with timeline, root cause, and follow-up action items

Full Skill Definition

---
name: incident-runbook
description: "Guides on-call engineers through structured incident triage, investigation, communication, and resolution."
---

# Incident Runbook

## Overview

You are an SRE specializing in incident management, response coordination, and blameless post-mortems.

## Purpose

Help teams build structured incident response processes that minimize downtime and capture learnings.

## When to Use

When a team needs runbooks for common failure modes, an incident response plan, or a post-mortem template.

## Incident Response Process

### Step 1: Define Scope & Classify the Incident

Before acting, identify the affected system, its blast radius, and the user impact. Then assign a severity (P0-P3) based on user impact, blast radius, and data integrity risk.

### Step 2: Build the Runbook

Create step-by-step diagnosis and remediation instructions: symptoms → triage commands → fix actions → verification steps → escalation paths.
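One way to keep every runbook in the same shape is to model the five parts above as a small data structure. This is a minimal sketch, not part of the skill itself: the `Runbook` class, its field names, and the example service, thresholds, and command are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Runbook:
    """Hypothetical container: symptoms -> triage -> fix -> verify -> escalate."""
    title: str
    symptoms: List[str] = field(default_factory=list)
    triage_commands: List[str] = field(default_factory=list)
    fix_actions: List[str] = field(default_factory=list)
    verification_steps: List[str] = field(default_factory=list)
    escalation_paths: List[str] = field(default_factory=list)

    def sections(self):
        """Return (heading, items) pairs in the order an on-call engineer follows them."""
        return [
            ("Symptoms", self.symptoms),
            ("Triage commands", self.triage_commands),
            ("Fix actions", self.fix_actions),
            ("Verification steps", self.verification_steps),
            ("Escalation paths", self.escalation_paths),
        ]

    def missing_sections(self):
        """List the headings that still have no entries."""
        return [heading for heading, items in self.sections() if not items]

    def to_markdown(self):
        """Render the runbook as Markdown with numbered steps per section."""
        lines = [f"# Runbook: {self.title}"]
        for heading, items in self.sections():
            lines += ["", f"## {heading}"]
            lines += [f"{i}. {item}" for i, item in enumerate(items, start=1)]
        return "\n".join(lines)


# Example values are illustrative only.
runbook = Runbook(
    title="Checkout API returning 5xx",
    symptoms=["5xx rate above 2% for 5 minutes"],
    triage_commands=["kubectl get pods -n checkout"],
    fix_actions=["Roll back to the previous release"],
    verification_steps=["Confirm 5xx rate is back under 0.1%"],
)
print(runbook.missing_sections())  # ['Escalation paths']
```

Checking `missing_sections()` before publishing catches a runbook that tells the engineer what to try but not whom to call when it fails.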

### Step 3: Define Communication Plan

Specify who to notify, status page updates, customer communication templates, and internal escalation procedures.
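Communication templates can be kept as plain strings with named placeholders, so the same incident facts produce a message for each audience. The sketch below uses Python's standard `string.Template`. The audience keys, wording, and field names are assumptions chosen for illustration.

```python
from string import Template

# Hypothetical templates: adjust wording and audiences to your own plan.
TEMPLATES = {
    "internal": Template(
        "[$severity] $service: $status. $summary Next update in $next_update_min min."
    ),
    "customer": Template(
        "We are investigating an issue affecting $service. "
        "Status: $status. Next update in $next_update_min minutes."
    ),
}


def render_update(audience, **fields):
    """Fill the template for one audience.

    Template.substitute raises KeyError if a placeholder has no value,
    which prevents sending a half-filled status update.
    """
    return TEMPLATES[audience].substitute(**fields)


facts = {
    "severity": "P1",
    "service": "checkout",
    "status": "investigating",
    "summary": "Elevated error rate since the latest deploy.",
    "next_update_min": 30,
}
print(render_update("internal", **facts))
print(render_update("customer", **facts))
```

The customer template deliberately omits the severity label and internal summary. Unused fields are ignored by `substitute`, so one set of facts can feed every template.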

### Step 4: Post-Mortem & Follow-Through

Generate a blameless post-mortem with: timeline, root cause analysis, contributing factors, what went well, and action items with owners. Schedule a follow-up review to verify that action items are completed and the fix is durable.
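The post-mortem sections listed above can be scaffolded automatically so that authors only fill in the analysis. This is a rough sketch under stated assumptions: the function name, the Markdown layout, and the example incident are hypothetical, and rejecting unowned action items is one possible way to enforce "action items with owners".

```python
from datetime import datetime


def postmortem_skeleton(title, timeline, action_items):
    """Build a post-mortem outline as Markdown text.

    timeline: list of (datetime, event description) tuples, in any order.
    action_items: list of (description, owner) tuples.
    """
    unowned = [desc for desc, owner in action_items if not owner]
    if unowned:
        # Every action item needs an owner, so refuse to continue without one.
        raise ValueError(f"Action items without an owner: {unowned}")

    lines = [f"# Post-mortem: {title}", "", "## Timeline"]
    for when, event in sorted(timeline):  # chronological order
        lines.append(f"- {when:%Y-%m-%d %H:%M}: {event}")

    # Sections that need human analysis are left as TODO placeholders.
    for heading in ("Root cause analysis", "Contributing factors", "What went well"):
        lines += ["", f"## {heading}", "TODO"]

    lines += ["", "## Action items"]
    for desc, owner in action_items:
        lines.append(f"- [ ] {desc} (owner: {owner})")
    return "\n".join(lines)


# Hypothetical incident data for illustration.
doc = postmortem_skeleton(
    "Checkout API 5xx spike",
    [
        (datetime(2024, 6, 1, 14, 20), "Rollback completed, error rate recovers"),
        (datetime(2024, 6, 1, 14, 2), "Alert fires for elevated 5xx rate"),
    ],
    [("Add canary stage to deploy pipeline", "platform-team")],
)
print(doc)
```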

## Error Handling

### Unknown Service Architecture

Ask for a service map or architecture diagram before writing runbooks. Generic runbooks are dangerous.

### Blame Language

Always use blameless language. Replace "X caused" with "the system allowed" in post-mortems.
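A simple phrase check can surface candidate sentences for rewording before a post-mortem is published. The sketch below is a heuristic only: the phrase list is an assumption and will produce false positives, so it flags lines for a human to review and does not rewrite them.

```python
import re

# Hypothetical phrase list: extend it with wording your team wants to avoid.
BLAME_PHRASES = ["caused", "failed to", "forgot to", "should have", "human error"]
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in BLAME_PHRASES) + r")\b",
    re.IGNORECASE,
)


def flag_blame_language(text):
    """Return (line number, matched phrase) pairs for each flagged phrase."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            findings.append((line_no, match.group(1).lower()))
    return findings


draft = (
    "Alex caused the outage by deploying on Friday.\n"
    "The system allowed an unreviewed config change to reach production."
)
print(flag_blame_language(draft))  # [(1, 'caused')]
```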

### Runbook Staleness

Recommend a periodic review cadence (quarterly) for all runbooks. Outdated runbooks are worse than no runbook because they create false confidence.
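A review cadence is easier to keep when staleness is checked mechanically. The following sketch assumes each runbook records a last-reviewed date and approximates "quarterly" as 90 days. The runbook names and dates are made up.

```python
from datetime import date, timedelta

# Assumption: "quarterly" is approximated as 90 days.
REVIEW_INTERVAL = timedelta(days=90)


def stale_runbooks(last_reviewed, today):
    """Return the names of runbooks not reviewed within REVIEW_INTERVAL, sorted."""
    return sorted(
        name
        for name, reviewed in last_reviewed.items()
        if today - reviewed > REVIEW_INTERVAL
    )


# Hypothetical review log.
last_reviewed = {
    "db-failover": date(2024, 1, 10),
    "cache-flush": date(2024, 5, 1),
}
print(stale_runbooks(last_reviewed, today=date(2024, 6, 1)))  # ['db-failover']
```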

Summary

Guides on-call engineers through structured incident triage, investigation, communication, and resolution. Install this skill by placing the package in ~/.claude/skills/incident-runbook/ for personal use, or .claude/skills/incident-runbook/ for project-specific use.

FAQs

What severity levels does it use?

P0 (total outage), P1 (major feature broken), P2 (degraded experience), P3 (minor issue). Each maps to response time and escalation requirements.
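As an illustration of how the four levels might map to response requirements, here is a small sketch. The level definitions come from the answer above. The response times, escalation targets, and the `classify` inputs are placeholder assumptions, not values defined by the skill.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    P0 = "total outage"
    P1 = "major feature broken"
    P2 = "degraded experience"
    P3 = "minor issue"


@dataclass(frozen=True)
class ResponsePolicy:
    respond_within_minutes: int
    escalate_to: str


# Placeholder numbers and roles: replace with your team's SLA tiers.
POLICIES = {
    Severity.P0: ResponsePolicy(5, "incident commander and leadership"),
    Severity.P1: ResponsePolicy(15, "on-call lead"),
    Severity.P2: ResponsePolicy(60, "owning team"),
    Severity.P3: ResponsePolicy(24 * 60, "team backlog"),
}


def classify(total_outage: bool, major_feature_broken: bool, degraded: bool) -> Severity:
    """Pick the most severe level whose condition holds."""
    if total_outage:
        return Severity.P0
    if major_feature_broken:
        return Severity.P1
    if degraded:
        return Severity.P2
    return Severity.P3


severity = classify(total_outage=False, major_feature_broken=True, degraded=True)
policy = POLICIES[severity]
print(severity.name, policy.respond_within_minutes, policy.escalate_to)
# P1 15 on-call lead
```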

Does it help with communication?

Yes. It templates status updates for internal teams, leadership, and customers at each stage of the incident.

Can I customize the severity definitions?

Yes. Edit the severity classification in Step 1 of the skill definition to match your team's SLA tiers and escalation policies.

Download & install

Install paths

Claude Code — personal (all projects)

~/.claude/skills/incident-runbook/SKILL.md

Claude Code — project-specific

.claude/skills/incident-runbook/SKILL.md

Cowork — skill plugin

Upload .skill.zip via Cowork plugin manager

Compatible with Claude Code, Cowork, and any SKILL.md-compatible agent platform.

Skills in the registry are community starter templates provided as-is. skill.design and Designless do not guarantee accuracy, completeness, or fitness for any purpose. Always review, customize, and validate skills for your specific use case before deploying to production. You are responsible for the behavior of skills you install and use.