---
name: monitoring-alerting-designer
version: 1.0.0
---

# Monitoring & Alerting Designer

Design comprehensive observability systems with SLO-based alerting, multi-burn-rate rules, alert fatigue reduction, and incident response integration for distributed systems and microservices.

## Structure

```
monitoring-alerting-designer/
├── SKILL.md     # Main skill prompt with complete instructions
└── INIT.md      # This file - initialization guide
```

## Files to Generate

None required - this is a prompt-only skill. The SKILL.md contains the complete monitoring and alerting design framework.

## Post-Init Steps

### Claude Code

```bash
# Copy the skill into the Claude Code skills directory (created if missing)
mkdir -p ~/.claude/skills
cp -r monitoring-alerting-designer ~/.claude/skills/
```

### Other AI Assistants

1. Open `SKILL.md`
2. Copy all content after the frontmatter (after the second `---`)
3. Paste into your AI assistant as a system prompt or initial context

## Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `{{slo_target}}` | `99.95` | Target SLO percentage for availability/reliability |
| `{{evaluation_window}}` | `30d` | Time window for SLO evaluation |
| `{{alert_burn_rate_critical}}` | `14.4` | Burn rate multiplier for critical alerts (14.4x consumes 2% of a 30d budget in 1h and would exhaust it in 50h) |
| `{{alert_burn_rate_warning}}` | `1.0` | Burn rate multiplier for warning alerts |
| `{{monitoring_platform}}` | `prometheus` | Target platform (prometheus, datadog, dynatrace, grafana) |
| `{{tracing_backend}}` | `jaeger` | Distributed tracing backend (jaeger, zipkin, tempo, datadog) |
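
The arithmetic behind `{{slo_target}}`, `{{evaluation_window}}`, and the burn-rate multipliers can be sketched as follows. This is a minimal illustration, not part of the skill itself; the function names are hypothetical, and it assumes a constant burn rate over the whole window.

```python
from datetime import timedelta


def error_budget_fraction(slo_target_pct: float) -> float:
    """Fraction of requests (or time) allowed to fail, e.g. 99.95 -> 0.0005."""
    return 1.0 - slo_target_pct / 100.0


def allowed_downtime(slo_target_pct: float, window: timedelta) -> timedelta:
    """Total downtime the error budget permits over the evaluation window."""
    return window * error_budget_fraction(slo_target_pct)


def time_to_exhaust(window: timedelta, burn_rate: float) -> timedelta:
    """How long a constant burn rate takes to spend the entire budget."""
    return window / burn_rate


def budget_consumed(window: timedelta, burn_rate: float, elapsed: timedelta) -> float:
    """Fraction of the budget spent after `elapsed` at a constant burn rate."""
    return burn_rate * (elapsed / window)
```

With the defaults above, `allowed_downtime(99.95, timedelta(days=30))` comes to roughly 21.6 minutes, `time_to_exhaust(timedelta(days=30), 14.4)` to roughly 50 hours, and `budget_consumed(timedelta(days=30), 14.4, timedelta(hours=1))` to roughly 0.02 (2% of the budget).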

## Core Capabilities

This skill helps you:

1. **SLO-Based Alerting Design**: Define SLOs, calculate error budgets, implement multi-burn-rate alerts
2. **Alert Fatigue Reduction**: Audit alerts, classify by actionability, implement noise reduction
3. **Distributed Tracing**: Select backends, configure OpenTelemetry, correlate traces with logs
4. **Incident Response Integration**: Define severity levels, configure routing, create runbook templates
5. **Dashboard Design**: Build persona-specific dashboards (executive, on-call, engineer)

## Example Usage

```
Design an SLO-based alerting strategy for our checkout service with 99.99%
availability and p99 latency < 500ms. We're getting 200+ alerts/day with
high false positive rates on traffic spikes. Show me multi-burn-rate alert
rules, threshold recommendations, and how to integrate with our incident
response workflow.
```

## Key Workflows

### Workflow 1: SLO-Based Alerting
- Define SLOs aligned with business needs
- Calculate error budgets
- Configure multi-burn-rate alerts (14.4x critical, 6x warning, 1x info)
- Generate Prometheus/Datadog alert rules
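
As a rough sketch of how the multi-burn-rate tiers could be evaluated, the snippet below checks each tier against a long and a short window, so an alert fires only while the budget is still being burned. The window pairs (1h/5m, 6h/30m, 3d/6h) are an assumption borrowed from the Google SRE Workbook's multiwindow approach, and `BurnRateTier`, `TIERS`, and `evaluate` are hypothetical names, not something SKILL.md is known to define.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BurnRateTier:
    severity: str
    burn_rate: float
    long_window: str   # window that proves the burn is significant
    short_window: str  # window that proves the burn is still happening


# Assumed tiers: multipliers from the workflow above, windows from the SRE Workbook.
TIERS = (
    BurnRateTier("critical", 14.4, "1h", "5m"),
    BurnRateTier("warning", 6.0, "6h", "30m"),
    BurnRateTier("info", 1.0, "3d", "6h"),
)


def evaluate(error_ratios: Mapping[str, float], slo_target_pct: float) -> Optional[str]:
    """Return the most severe tier whose long AND short windows both exceed
    burn_rate * error_budget, or None if no tier fires.

    `error_ratios` maps a window label (e.g. "1h") to the observed error ratio.
    """
    budget = 1.0 - slo_target_pct / 100.0
    for tier in TIERS:  # ordered from most to least severe
        threshold = tier.burn_rate * budget
        if (error_ratios[tier.long_window] > threshold
                and error_ratios[tier.short_window] > threshold):
            return tier.severity
    return None
```

In a Prometheus setup the same condition would normally live in alert rules rather than application code; the function only makes the threshold logic explicit.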

### Workflow 2: Alert Fatigue Reduction
- Audit existing alerts using a 5-point checklist
- Classify as Keep/Automate/Dashboard/Delete
- Apply noise reduction techniques (grouping, deduplication, inhibition)
- Measure improvement metrics
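
The noise reduction techniques named above can be illustrated with a small in-memory sketch. It assumes alerts are plain label dictionaries with hypothetical `alertname`, `service`, and `severity` keys; in practice a tool such as Alertmanager would usually handle grouping and inhibition through configuration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Alert = Dict[str, str]


def group_alerts(alerts: Iterable[Alert],
                 group_by: Tuple[str, ...]) -> Dict[Tuple[str, ...], List[Alert]]:
    """Drop exact duplicates, then bucket the remaining alerts by the chosen labels."""
    seen = set()
    groups: Dict[Tuple[str, ...], List[Alert]] = defaultdict(list)
    for alert in alerts:
        fingerprint = tuple(sorted(alert.items()))
        if fingerprint in seen:
            continue  # deduplication: identical label sets notify once
        seen.add(fingerprint)
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


def inhibit(alerts: List[Alert],
            source_severity: str = "critical",
            target_severity: str = "warning",
            equal: Tuple[str, ...] = ("service",)) -> List[Alert]:
    """Suppress target-severity alerts when a source-severity alert is firing
    with the same values for the `equal` labels."""
    firing = {
        tuple(a.get(label, "") for label in equal)
        for a in alerts
        if a.get("severity") == source_severity
    }
    return [
        a for a in alerts
        if not (a.get("severity") == target_severity
                and tuple(a.get(label, "") for label in equal) in firing)
    ]
```

Grouping by `("service",)` turns a burst of related alerts into one notification per service, while inhibition hides warnings that a critical alert on the same service already explains.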

### Workflow 3: Distributed Tracing
- Compare tracing backends (Jaeger, Zipkin, Tempo, Datadog, Dynatrace)
- Configure OpenTelemetry collector with tail-based sampling
- Instrument services with tracing
- Correlate traces with structured logs
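
One way to correlate traces with structured logs is to stamp every log line with the active trace ID. The sketch below uses only the Python standard library; `current_trace_id` is a hypothetical stand-in for whatever context a real tracing SDK (for example OpenTelemetry) would provide, and `JsonTraceFormatter` is an illustrative name.

```python
import contextvars
import json
import logging

# Hypothetical holder for the active trace ID; a tracing SDK would normally
# supply this from its own span context.
current_trace_id: contextvars.ContextVar = contextvars.ContextVar(
    "current_trace_id", default=""
)


class JsonTraceFormatter(logging.Formatter):
    """Render each record as one JSON object that carries the trace ID."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": current_trace_id.get(),
        })


def configure_logger(name: str) -> logging.Logger:
    """Attach the JSON formatter to a stream handler on the named logger."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonTraceFormatter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Once request-handling code calls `current_trace_id.set(...)` at the start of a request, every log line emitted during that request can be joined to its trace in the tracing backend by the shared `trace_id` field.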

### Workflow 4: Incident Response
- Define severity matrix (SEV-1 through SEV-4)
- Configure alert routing (PagerDuty, Slack, email)
- Create runbook templates
- Set up escalation paths
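
A severity matrix and escalation path can be expressed as plain data, as in this sketch. The channel assignments, responder names, and timings are illustrative assumptions only, not values taken from SKILL.md.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from severity to notification channels.
ROUTES: Dict[str, List[str]] = {
    "SEV-1": ["pagerduty", "slack"],
    "SEV-2": ["pagerduty"],
    "SEV-3": ["slack"],
    "SEV-4": ["email"],
}

# Hypothetical escalation path: (minutes without acknowledgement, responder).
ESCALATION: List[Tuple[int, str]] = [
    (0, "primary on-call"),
    (15, "secondary on-call"),
    (30, "engineering manager"),
]


def route(severity: str) -> List[str]:
    """Return the channels to notify for a severity level."""
    if severity not in ROUTES:
        raise ValueError(f"unknown severity: {severity!r}")
    return ROUTES[severity]


def escalation_targets(minutes_unacknowledged: int) -> List[str]:
    """Return everyone who should have been notified by now."""
    return [who for after, who in ESCALATION if minutes_unacknowledged >= after]
```

Keeping the matrix as data makes it easy to review alongside runbooks and to translate into PagerDuty or Alertmanager routing configuration later.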

### Workflow 5: Dashboard Design
- Executive dashboard (SLO status, error budget, incidents)
- On-call dashboard (active alerts, quick actions, escalation)
- Engineer dashboard (detailed metrics, traces, logs)

