Delivery Manager 12 min

Incident Management for Delivery Leaders

When production breaks, the Delivery Manager's job isn't to fix the code — it's to ensure the response is coordinated, communication is clear, and the organisation learns from every incident to prevent recurrence.

The Delivery Manager's Role in Incidents

You're not the on-call engineer. You're not debugging the root cause. Your role during an incident is to ensure:

1. The right people are engaged and coordinated
2. Stakeholders are informed at the right cadence
3. The response follows a structured process (not a panic)
4. Recovery is prioritised over root cause analysis (fix first, learn later)
5. After recovery, the organisation learns and improves

The 2026 SRE best practices emphasise a five-stage model: Prepare, Detect, Respond, Recover, Learn. The Delivery Manager owns the "Respond" coordination and the "Learn" follow-through.

The Incident Response Framework

Severity Classification

Define severity levels before incidents happen. When production is down, you don't want to debate whether it's a P1 or P2.

P1 — Critical: Service is completely unavailable or data integrity is compromised. All customers affected. Revenue impact immediate.

  • Response: All-hands. War room activated. Stakeholder communication every 30 minutes.
  • Target resolution: < 1 hour

P2 — Major: Significant degradation affecting many customers. Core functionality impaired but workarounds exist.

  • Response: On-call team + relevant engineers. DM coordinates communication.
  • Target resolution: < 4 hours

P3 — Minor: Limited impact. Small subset of customers affected. Non-critical functionality impaired.

  • Response: On-call team handles. DM informed but not actively involved.
  • Target resolution: < 24 hours

P4 — Low: Cosmetic issues, minor bugs, no customer impact.

  • Response: Normal sprint work. No incident process needed.
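One way to stop the P1-versus-P2 debate before it starts is to write the levels down as data. The sketch below is illustrative only: `SeverityPolicy`, `POLICIES`, and `classify` are hypothetical names, the four yes/no questions are a simplification of the definitions above, and the hourly P2 cadence is borrowed from the internal update cadence described later in this playbook.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    label: str
    update_interval: Optional[timedelta]    # None where the playbook sets no cadence
    target_resolution: Optional[timedelta]  # None where the playbook sets no target
    dm_actively_involved: bool


# Values taken from the severity definitions above; the structure is an assumption.
POLICIES = {
    "P1": SeverityPolicy("Critical", timedelta(minutes=30), timedelta(hours=1), True),
    "P2": SeverityPolicy("Major", timedelta(hours=1), timedelta(hours=4), True),
    "P3": SeverityPolicy("Minor", None, timedelta(hours=24), False),
    "P4": SeverityPolicy("Low", None, None, False),
}


def classify(
    service_down: bool,
    data_integrity_at_risk: bool,
    core_function_impaired: bool,
    customers_affected: bool,
) -> str:
    """Map four yes/no answers to a severity level, checking the worst case first."""
    if service_down or data_integrity_at_risk:
        return "P1"
    if core_function_impaired:
        return "P2"
    if customers_affected:
        return "P3"
    return "P4"


severity = classify(
    service_down=False,
    data_integrity_at_risk=False,
    core_function_impaired=True,
    customers_affected=True,
)
print(severity, POLICIES[severity].target_resolution)
```

Your own questions and thresholds will differ. What matters is that they are agreed in advance, so that declaring severity during an incident becomes a lookup rather than a discussion.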

The Incident Commander Role

For P1 and P2 incidents, designate an Incident Commander (IC). This can be the Delivery Manager or a senior engineer — the key is that one person owns coordination:

  • Declares the incident and severity
  • Assembles the response team
  • Coordinates workstreams (diagnosis, fix, communication, customer support)
  • Makes decisions when the team disagrees on approach
  • Declares resolution and initiates the post-mortem

Communication During Incidents

Internal communication:

  • Dedicated Slack channel per incident (not the general channel)
  • Status updates every 30 minutes for P1, every hour for P2
  • Clear format: "Current status → What we're trying → Next update at [time]"
  • Tag stakeholders who need to know — don't make them ask
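If your team posts updates through a bot or script, the format and cadence above could be encoded along these lines. This is a minimal sketch: `status_update` and `UPDATE_INTERVAL` are hypothetical names, and posting the message to Slack is deliberately left out.

```python
from datetime import datetime, timedelta

# Cadence from the list above: every 30 minutes for P1, every hour for P2.
UPDATE_INTERVAL = {"P1": timedelta(minutes=30), "P2": timedelta(hours=1)}


def status_update(severity: str, current_status: str, trying: str, now: datetime) -> str:
    """Build one update in the 'status → trying → next update' format."""
    next_update = now + UPDATE_INTERVAL[severity]
    return (
        f"[{severity}] Current status: {current_status} → "
        f"What we're trying: {trying} → "
        f"Next update at {next_update:%H:%M}"
    )


print(
    status_update(
        "P1",
        "Checkout requests are failing",
        "Rolling back the last deployment",
        now=datetime(2025, 1, 1, 14, 0),
    )
)
```

Computing the next update time from the severity means the commitment is made in every message, which makes a missed update obvious.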

External communication (customers):

  • Status page updated within 15 minutes of detection
  • Honest about impact: "Some users are experiencing..." not "We're investigating an issue"
  • Estimated resolution time (even if uncertain): "We expect to resolve within 2 hours"
  • Resolution confirmation with brief explanation

Stakeholder communication:

  • Executive summary within 30 minutes of P1 declaration
  • Format: What happened → Customer impact → Current status → Expected resolution → What we need
  • Don't wait for full understanding before communicating — share what you know

Recovery Over Root Cause

During an active incident, the priority is restoring service — not understanding why it broke. Common recovery actions:

  • Roll back the last deployment
  • Scale up infrastructure
  • Failover to backup systems
  • Disable the problematic feature (feature flag)
  • Redirect traffic away from the affected component

Root cause analysis happens after recovery, in the post-mortem. Never delay recovery to investigate cause.

Post-Incident Learning

The Blameless Post-Mortem

Within 48 hours of resolution, run a blameless post-mortem. The goal is learning, not blame.

Structure:

1. Timeline: What happened, when, in what order (facts only)
2. Impact: Who was affected, for how long, what was the business cost
3. Root cause: Why did it happen? (Use "5 Whys" to dig deeper)
4. Contributing factors: What made detection slow? What made recovery hard?
5. Action items: What will we change to prevent recurrence?

Blameless principles:

  • Focus on the system, not the person ("The deployment pipeline didn't catch this" not "John deployed broken code")
  • Assume everyone acted with the best information available at the time
  • Ask "what" and "how" questions, not "who" and "why didn't you"
  • Publish the post-mortem widely — transparency builds trust

Action Item Follow-Through

Post-mortem actions are worthless if they're not completed. The Delivery Manager owns follow-through:

  • Every action has an owner and a deadline
  • Actions are tracked in the team's backlog (not a separate document that gets forgotten)
  • Review outstanding post-mortem actions in the weekly delivery review
  • Escalate overdue actions: if the same root cause triggers a second incident, the follow-through process has failed
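For the weekly delivery review, a short script over your backlog export can surface the actions that need escalating. The sketch below assumes a hypothetical `ActionItem` record with an owner, a deadline, and a done flag; how you load these from your tracker is up to you.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class ActionItem:
    title: str
    owner: str
    deadline: date
    done: bool = False


def overdue(actions: List[ActionItem], today: date) -> List[ActionItem]:
    """Return open actions whose deadline has passed, oldest deadline first."""
    return sorted(
        (a for a in actions if not a.done and a.deadline < today),
        key=lambda a: a.deadline,
    )


# Example data is invented for illustration.
actions = [
    ActionItem("Add alert on queue depth", "Priya", date(2025, 3, 1)),
    ActionItem("Document rollback steps", "Sam", date(2025, 2, 20)),
    ActionItem("Add canary stage to pipeline", "Lee", date(2025, 2, 10), done=True),
    ActionItem("Load test checkout service", "Ana", date(2025, 4, 1)),
]

for item in overdue(actions, today=date(2025, 3, 10)):
    print(f"OVERDUE: {item.title} (owner: {item.owner}, due {item.deadline})")
```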

Incident Metrics

Track over time:

  • MTTR (mean time to recovery) by severity: Are we getting faster at recovery?
  • Incident frequency: Are incidents becoming less common?
  • Repeat incidents: Are the same root causes recurring? (Indicates failed follow-through)
  • Detection time: How long between incident start and detection? (Indicates observability gaps)
  • Post-mortem completion rate: Are post-mortems happening within 48 hours?
  • Action completion rate: Are post-mortem actions being completed on time?
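Several of these metrics fall out of four timestamps per incident. The sketch below is one possible calculation, not a standard: the `Incident` record is hypothetical, and it assumes recovery time is measured from incident start to resolution (some teams measure from detection instead). The functions expect at least one incident.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, List, Optional

POST_MORTEM_WINDOW = timedelta(hours=48)  # the 48-hour target from this playbook


@dataclass
class Incident:
    severity: str
    started: datetime
    detected: datetime
    resolved: datetime
    post_mortem_held: Optional[datetime] = None  # None if no post-mortem yet


def mttr_by_severity(incidents: List[Incident]) -> Dict[str, timedelta]:
    """Mean time from incident start to resolution, grouped by severity."""
    grouped: Dict[str, List[float]] = {}
    for i in incidents:
        grouped.setdefault(i.severity, []).append((i.resolved - i.started).total_seconds())
    return {sev: timedelta(seconds=mean(secs)) for sev, secs in grouped.items()}


def mean_detection_time(incidents: List[Incident]) -> timedelta:
    """Mean gap between incident start and detection."""
    gaps = [(i.detected - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(gaps))


def post_mortem_on_time_rate(incidents: List[Incident]) -> float:
    """Share of incidents whose post-mortem was held within the 48-hour window."""
    on_time = sum(
        1
        for i in incidents
        if i.post_mortem_held is not None
        and i.post_mortem_held - i.resolved <= POST_MORTEM_WINDOW
    )
    return on_time / len(incidents)


# Example data is invented for illustration.
t0 = datetime(2025, 1, 1, 12, 0)
incidents = [
    Incident("P1", t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=60),
             t0 + timedelta(hours=25)),
    Incident("P1", t0, t0 + timedelta(minutes=20), t0 + timedelta(minutes=120)),
    Incident("P2", t0, t0 + timedelta(minutes=30), t0 + timedelta(minutes=180),
             t0 + timedelta(hours=75)),
]

print(mttr_by_severity(incidents))
print(mean_detection_time(incidents))
print(post_mortem_on_time_rate(incidents))
```

Track the trend rather than any single value: a handful of incidents per quarter is too small a sample to read much into one number.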

Building Incident Readiness

Don't wait for incidents to build your response capability:

Runbooks: Document recovery procedures for known failure modes. When production is down at 2am, engineers shouldn't be figuring out the rollback process from scratch.

On-call rotation: Ensure clear ownership of who responds first. Rotate fairly. Compensate appropriately.

Game Days: Periodically simulate incidents to practise the response process. Inject failures in staging and run through the full incident lifecycle.

Observability investment: You can't recover from what you can't detect. Invest in monitoring, alerting, and dashboards that surface problems before customers report them.

Communication templates: Pre-written templates for status page updates, stakeholder emails, and internal announcements. Fill in the specifics during the incident rather than composing from scratch under pressure.
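Templates can be as simple as text with named placeholders. The sketch below uses Python's standard `string.Template`; the template wording and the `render` helper are illustrative assumptions, not a prescribed format.

```python
from string import Template

# Hypothetical status page template; adapt the wording to your own voice.
STATUS_PAGE_TEMPLATE = Template(
    "Some users are experiencing $impact. "
    "We expect to resolve within $eta. "
    "Next update at $next_update."
)


def render(template: Template, **fields: str) -> str:
    """Fill in a template; substitute() raises KeyError if a field is missing."""
    return template.substitute(**fields)


print(
    render(
        STATUS_PAGE_TEMPLATE,
        impact="slow page loads",
        eta="2 hours",
        next_update="15:00",
    )
)
```

Because `substitute` fails on a missing field, a half-filled template cannot be published by accident under pressure.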

---

Download the [Escalation Framework template](/templates) to define your incident severity levels and response procedures.