claude-vault/skills/skill-creator/examples/monitoring-skill.md

# Example: Prometheus/Grafana Monitoring Skill

This is a complete example of a DevOps skill for monitoring setup.

---

```yaml
---
name: monitoring-setup
description: Creates Prometheus alerting rules, ServiceMonitors, and Grafana dashboards. Use when setting up observability for services.
argument-hint: [service-name] [metric-type: http|celery|custom]
allowed-tools: Read, Write, Edit, Glob, Grep
disable-model-invocation: true
---

# Monitoring Setup Generator

Generate production-ready Prometheus alerting rules, ServiceMonitors, and Grafana dashboards for service observability.

## When to Use

- Setting up monitoring for a new service
- Creating custom alerting rules
- Building Grafana dashboards
- Configuring metric scraping

## Prerequisites

- Prometheus Operator installed (kube-prometheus-stack)
- Grafana deployed with dashboard provisioning
- Service exposes `/metrics` endpoint

## Instructions

### Step 1: Parse Arguments

Extract from `$ARGUMENTS`:
- **service-name**: Name of the service to monitor
- **metric-type**: Type of metrics (`http`, `celery`, or `custom`); the templates in this example cover `http`
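
A minimal sketch of this parsing step, assuming `$ARGUMENTS` arrives as a single whitespace-separated string. The function name `parse_arguments`, the fallback to `http`, and the DNS-label check on the service name are illustrative assumptions, not part of the skill format.

```python
import re

VALID_METRIC_TYPES = {"http", "celery", "custom"}


def parse_arguments(arguments: str) -> tuple[str, str]:
    """Split the raw argument string into (service_name, metric_type)."""
    parts = arguments.split()
    if not parts:
        raise ValueError("service-name is required")
    service_name = parts[0]
    # Assumption: fall back to "http" when metric-type is omitted.
    metric_type = parts[1].lower() if len(parts) > 1 else "http"
    # Assumption: restrict names to DNS-label form (lowercase alphanumerics, hyphens).
    if not re.fullmatch(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?", service_name):
        raise ValueError(f"invalid service name: {service_name!r}")
    if metric_type not in VALID_METRIC_TYPES:
        raise ValueError(f"unknown metric type: {metric_type!r}")
    return service_name, metric_type
```

Rejecting bad input at this step keeps malformed names out of every manifest generated afterwards.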

### Step 2: Generate ServiceMonitor

Create `servicemonitor.yaml`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ${SERVICE_NAME}
  namespace: default
  labels:
    app: ${SERVICE_NAME}
    release: prometheus  # CRITICAL: must match the Prometheus serviceMonitorSelector (typically the Helm release name)
spec:
  selector:
    matchLabels:
      app: ${SERVICE_NAME}
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
      honorLabels: true
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: 'go_.*'
          action: drop  # Drop unnecessary Go runtime metrics
```
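
One way to fill the `${SERVICE_NAME}` placeholder is plain string replacement, sketched here. The helper name `render_template` and the sample service name are hypothetical. Replacing only the exact token matters because the alerting rules in this skill also contain Prometheus's own `{{ $labels.service }}` and `{{ $value }}` template syntax, which must reach the cluster unchanged.

```python
def render_template(template: str, service_name: str) -> str:
    """Replace only the literal ${SERVICE_NAME} token, leaving other '$' text alone."""
    return template.replace("${SERVICE_NAME}", service_name)


# Hypothetical two-line snippet mixing both placeholder styles.
snippet = 'name: ${SERVICE_NAME}\nsummary: "High error rate on {{ $labels.service }}"'
rendered = render_template(snippet, "checkout-api")
```

After rendering, `name:` carries the service name while the Prometheus template expression is untouched.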

### Step 3: Generate PrometheusRules

Create `prometheusrules.yaml`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ${SERVICE_NAME}-alerts
  namespace: cluster-monitoring
  labels:
    app: kube-prometheus-stack  # CRITICAL: Required label
    release: prometheus         # CRITICAL: Required label
spec:
  groups:
    - name: ${SERVICE_NAME}.availability
      rules:
        # High Error Rate
        - alert: ${SERVICE_NAME}HighErrorRate
          expr: |
            sum(rate(http_requests_total{service="${SERVICE_NAME}",status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total{service="${SERVICE_NAME}"}[5m])) by (service) > 0.05
          for: 5m
          labels:
            severity: critical
            service: ${SERVICE_NAME}
          annotations:
            summary: "High error rate on {{ $labels.service }}"
            description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
            runbook_url: "https://runbooks.example.com/${SERVICE_NAME}/high-error-rate"

        # High Latency
        - alert: ${SERVICE_NAME}HighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{service="${SERVICE_NAME}"}[5m])) by (le, service)
            ) > 1
          for: 5m
          labels:
            severity: warning
            service: ${SERVICE_NAME}
          annotations:
            summary: "High latency on {{ $labels.service }}"
            description: "P95 latency is {{ $value | humanizeDuration }} (threshold: 1s)"

        # Service Down
        - alert: ${SERVICE_NAME}Down
          expr: |
            up{job="${SERVICE_NAME}"} == 0
          for: 2m
          labels:
            severity: critical
            service: ${SERVICE_NAME}
          annotations:
            summary: "{{ $labels.job }} is down"
            description: "{{ $labels.instance }} has been down for more than 2 minutes"

    - name: ${SERVICE_NAME}.resources
      rules:
        # High Memory Usage
        - alert: ${SERVICE_NAME}HighMemoryUsage
          expr: |
            container_memory_working_set_bytes{container="${SERVICE_NAME}"}
            / container_spec_memory_limit_bytes{container="${SERVICE_NAME}"} > 0.85
          for: 5m
          labels:
            severity: warning
            service: ${SERVICE_NAME}
          annotations:
            summary: "High memory usage on {{ $labels.pod }}"
            description: "Memory usage is {{ $value | humanizePercentage }} of limit"

        # High CPU Usage
        - alert: ${SERVICE_NAME}HighCPUUsage
          expr: |
            sum(rate(container_cpu_usage_seconds_total{container="${SERVICE_NAME}"}[5m])) by (pod)
            / sum(container_spec_cpu_quota{container="${SERVICE_NAME}"}/container_spec_cpu_period{container="${SERVICE_NAME}"}) by (pod) > 0.85
          for: 5m
          labels:
            severity: warning
            service: ${SERVICE_NAME}
          annotations:
            summary: "High CPU usage on {{ $labels.pod }}"
            description: "CPU usage is {{ $value | humanizePercentage }} of limit"

        # Pod Restarts
        - alert: ${SERVICE_NAME}PodRestarts
          expr: |
            increase(kube_pod_container_status_restarts_total{container="${SERVICE_NAME}"}[1h]) > 3
          for: 5m
          labels:
            severity: warning
            service: ${SERVICE_NAME}
          annotations:
            summary: "{{ $labels.pod }} is restarting frequently"
            description: "{{ $value }} restarts in the last hour"
```
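
To make the HighLatency threshold concrete, here is a simplified model of the linear interpolation that `histogram_quantile` applies to cumulative buckets. It is a sketch only: it skips edge cases the real function handles (such as non-monotonic buckets), and the bucket bounds and counts are made-up sample data.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the +Inf bucket: report the last finite bound.
                return lower_bound
            if count == lower_count:
                return upper_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical per-bucket request counts over a 5m window.
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
alert_condition = p95 > 1  # mirrors the "> 1" comparison in the alert expression
```

With this sample data the estimate is about 0.81 seconds, so the condition is false and the alert would stay inactive.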

### Step 4: Generate Recording Rules

Create `recordingrules.yaml`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ${SERVICE_NAME}-recording
  namespace: cluster-monitoring
  labels:
    app: kube-prometheus-stack
    release: prometheus
spec:
  groups:
    - name: ${SERVICE_NAME}.recording
      interval: 30s
      rules:
        # Request Rate
        - record: ${SERVICE_NAME}:http_requests:rate5m
          expr: |
            sum(rate(http_requests_total{service="${SERVICE_NAME}"}[5m])) by (status, method, path)

        # Error Rate
        - record: ${SERVICE_NAME}:http_errors:rate5m
          expr: |
            sum(rate(http_requests_total{service="${SERVICE_NAME}",status=~"5.."}[5m]))

        # Latency Percentiles
        - record: ${SERVICE_NAME}:http_latency_p50:5m
          expr: |
            histogram_quantile(0.50,
              sum(rate(http_request_duration_seconds_bucket{service="${SERVICE_NAME}"}[5m])) by (le))

        - record: ${SERVICE_NAME}:http_latency_p95:5m
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{service="${SERVICE_NAME}"}[5m])) by (le))

        - record: ${SERVICE_NAME}:http_latency_p99:5m
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{service="${SERVICE_NAME}"}[5m])) by (le))
```
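
Service names often contain hyphens, which are not allowed in classic Prometheus metric names (`[a-zA-Z_:][a-zA-Z0-9_:]*`), so a name like `checkout-api` cannot be used verbatim in a `record:` field. The sketch below derives a safe prefix; the helper name `metric_prefix` and the underscore substitution are assumptions, not a convention defined by this skill.

```python
import re

METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def metric_prefix(service_name: str) -> str:
    """Turn a service name into a prefix usable in recording rule names."""
    prefix = re.sub(r"[^a-zA-Z0-9_]", "_", service_name)
    # Metric names may not start with a digit.
    if not prefix or prefix[0].isdigit():
        prefix = "_" + prefix
    return prefix


record_name = f"{metric_prefix('checkout-api')}:http_requests:rate5m"
```

The label matchers inside `expr` can keep the original hyphenated name, because label values are not restricted in this way.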

### Step 5: Generate Grafana Dashboard

Create `dashboard-configmap.yaml`:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ${SERVICE_NAME}-dashboard
  namespace: cluster-monitoring
  labels:
    grafana_dashboard: "1"  # CRITICAL: Required for Grafana to discover
data:
  ${SERVICE_NAME}-dashboard.json: |
    {
      "annotations": {
        "list": []
      },
      "editable": true,
      "fiscalYearStartMonth": 0,
      "graphTooltip": 0,
      "id": null,
      "links": [],
      "liveNow": false,
      "panels": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "prometheus"
          },
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 10,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "never",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  }
                ]
              },
              "unit": "reqps"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 0
          },
          "id": 1,
          "options": {
            "legend": {
              "calcs": ["mean", "max"],
              "displayMode": "table",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "multi",
              "sort": "desc"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prometheus"
              },
              "expr": "sum(rate(http_requests_total{service=\"${SERVICE_NAME}\"}[5m])) by (status)",
              "legendFormat": "{{status}}",
              "refId": "A"
            }
          ],
          "title": "Request Rate by Status",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "prometheus"
          },
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 10,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "never",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  }
                ]
              },
              "unit": "s"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 0
          },
          "id": 2,
          "options": {
            "legend": {
              "calcs": ["mean", "max"],
              "displayMode": "table",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "multi",
              "sort": "desc"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prometheus"
              },
              "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=\"${SERVICE_NAME}\"}[5m])) by (le))",
              "legendFormat": "p50",
              "refId": "A"
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prometheus"
              },
              "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=\"${SERVICE_NAME}\"}[5m])) by (le))",
              "legendFormat": "p95",
              "refId": "B"
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prometheus"
              },
              "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=\"${SERVICE_NAME}\"}[5m])) by (le))",
              "legendFormat": "p99",
              "refId": "C"
            }
          ],
          "title": "Request Latency Percentiles",
          "type": "timeseries"
        }
      ],
      "refresh": "30s",
      "schemaVersion": 38,
      "style": "dark",
      "tags": ["${SERVICE_NAME}", "http"],
      "templating": {
        "list": []
      },
      "time": {
        "from": "now-1h",
        "to": "now"
      },
      "timepicker": {},
      "timezone": "browser",
      "title": "${SERVICE_NAME} Dashboard",
      "uid": "${SERVICE_NAME}-dashboard",
      "version": 1,
      "weekStart": ""
    }
```
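
Because the two panels differ only in title, unit, position, and queries, the dashboard can also be assembled programmatically and serialized with `json.dumps`, which handles the quote escaping inside `expr`. This is a trimmed sketch: the helper names are hypothetical, most display options are omitted on the assumption that Grafana applies its defaults, and the ConfigMap is emitted as JSON, which `kubectl apply -f` accepts alongside YAML.

```python
import json


def timeseries_panel(panel_id, title, unit, x, targets):
    """Build one timeseries panel; targets is a list of (expr, legend) pairs."""
    datasource = {"type": "prometheus", "uid": "prometheus"}
    return {
        "id": panel_id,
        "type": "timeseries",
        "title": title,
        "datasource": datasource,
        "fieldConfig": {"defaults": {"unit": unit}, "overrides": []},
        "gridPos": {"h": 8, "w": 12, "x": x, "y": 0},
        "targets": [
            {"datasource": datasource, "expr": expr,
             "legendFormat": legend, "refId": chr(ord("A") + i)}
            for i, (expr, legend) in enumerate(targets)
        ],
    }


def dashboard_configmap(service):
    """Return the dashboard ConfigMap as a dict ready for json.dumps."""
    selector = f'{{service="{service}"}}'
    buckets = f"sum(rate(http_request_duration_seconds_bucket{selector}[5m])) by (le)"
    panels = [
        timeseries_panel(1, "Request Rate by Status", "reqps", 0, [
            (f"sum(rate(http_requests_total{selector}[5m])) by (status)", "{{status}}"),
        ]),
        timeseries_panel(2, "Request Latency Percentiles", "s", 12, [
            (f"histogram_quantile({q}, {buckets})", legend)
            for q, legend in (("0.50", "p50"), ("0.95", "p95"), ("0.99", "p99"))
        ]),
    ]
    dashboard = {
        "title": f"{service} Dashboard",
        "uid": f"{service}-dashboard",
        "schemaVersion": 38,
        "refresh": "30s",
        "tags": [service, "http"],
        "time": {"from": "now-1h", "to": "now"},
        "panels": panels,
    }
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": f"{service}-dashboard",
            "namespace": "cluster-monitoring",
            "labels": {"grafana_dashboard": "1"},
        },
        "data": {f"{service}-dashboard.json": json.dumps(dashboard, indent=2)},
    }
```

Calling `json.dumps(dashboard_configmap("checkout-api"), indent=2)` yields a manifest that can be written to a file and applied.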

## Patterns & Best Practices

### Alert Severity Levels

| Severity | Response Time | Example |
|----------|---------------|---------|
| critical | Immediate (page) | Service down, data loss risk |
| warning | Business hours | High latency, resource pressure |
| info | Next review | Approaching thresholds |

### PromQL Best Practices

- Use recording rules for complex queries
- Always include a `for` duration to avoid flapping
- Wrap counters in `rate()` or `increase()`; never alert on a raw counter value
- Include meaningful labels for routing
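
The effect of the `for` duration can be illustrated with a simplified state model: an alert is pending while its condition holds and only fires once the condition has held for the full duration. The function `alert_states` and the sample evaluation results are illustrative; real evaluation timing depends on the rule group interval.

```python
def alert_states(condition: list[bool], interval_s: int, for_s: int) -> list[str]:
    """Return the alert state after each evaluation of the rule."""
    states = []
    active_since = None  # evaluation time at which the condition first became true
    for i, is_true in enumerate(condition):
        now = i * interval_s
        if not is_true:
            # Any false evaluation resets the pending timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# Hypothetical results of nine one-minute evaluations with `for: 5m`.
samples = [True, True, False, True, True, True, True, True, True]
states = alert_states(samples, interval_s=60, for_s=300)
```

In this run the short burst at the start never fires; only the sustained breach reaches `firing`, on the last evaluation.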

### Dashboard Design

- Use consistent colors (green=good, red=bad)
- Include time range variables
- Add annotation markers for deployments
- Group related panels

## Validation

```bash
# Check the ServiceMonitor exists (scrape status is shown on the targets page)
kubectl get servicemonitor -n default

# Check PrometheusRules
kubectl get prometheusrules -n cluster-monitoring

# Check targets in Prometheus UI
# http://prometheus:9090/targets

# Check alerts
# http://prometheus:9090/alerts

# Verify dashboard in Grafana
# http://grafana:3000/dashboards
```

## Common Pitfalls

- **Missing Labels**: `release: prometheus` and `app: kube-prometheus-stack` are required for discovery
- **Wrong Namespace**: ServiceMonitor must be in the same namespace as the Service, or list the Service's namespace under `namespaceSelector`
- **Port Mismatch**: ServiceMonitor port must match Service port name
- **Dashboard Not Loading**: Check `grafana_dashboard: "1"` label on ConfigMap
- **High Cardinality**: Avoid labels with unbounded values (user IDs, request IDs)
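
The label and port pitfalls can be caught before applying anything. The sketch below works on manifests already loaded into dictionaries (for example by a YAML parser); the function name `check_manifests` and the sample manifests are hypothetical.

```python
def check_manifests(service_monitor: dict, service: dict, required_labels: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    labels = service_monitor.get("metadata", {}).get("labels", {})
    for key, value in required_labels.items():
        if labels.get(key) != value:
            problems.append(f"missing label {key}: {value}")
    # Endpoint ports refer to Service ports by name.
    port_names = {p.get("name") for p in service.get("spec", {}).get("ports", [])}
    for endpoint in service_monitor.get("spec", {}).get("endpoints", []):
        if endpoint.get("port") not in port_names:
            problems.append(f"port {endpoint.get('port')!r} not found on Service")
    return problems


# Hypothetical manifests: the Service names its port "web", the ServiceMonitor expects "http".
monitor = {
    "metadata": {"labels": {"app": "checkout-api"}},
    "spec": {"endpoints": [{"port": "http", "path": "/metrics"}]},
}
service = {"spec": {"ports": [{"name": "web", "port": 8080}]}}
problems = check_manifests(monitor, service, {"release": "prometheus"})
```

Here `problems` reports both the missing `release` label and the port name mismatch.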
```