# Example: Prometheus/Grafana Monitoring Skill
This is a complete example of a DevOps skill for monitoring setup.
---
name: monitoring-setup
description: Creates Prometheus alerting rules, ServiceMonitors, and Grafana dashboards. Use when setting up observability for services.
argument-hint: [service-name] [metric-type: http|celery|custom]
allowed-tools: Read, Write, Edit, Glob, Grep
disable-model-invocation: true
---
# Monitoring Setup Generator
Generate production-ready Prometheus alerting rules, ServiceMonitors, and Grafana dashboards for service observability.
## When to Use
- Setting up monitoring for a new service
- Creating custom alerting rules
- Building Grafana dashboards
- Configuring metric scraping
## Prerequisites
- Prometheus Operator installed (kube-prometheus-stack)
- Grafana deployed with dashboard provisioning
- Service exposes `/metrics` endpoint
## Instructions
### Step 1: Parse Arguments
Extract from `$ARGUMENTS`:
- **service-name**: Name of the service to monitor
- **metric-type**: Type of metrics (http, celery, custom)
### Step 2: Generate ServiceMonitor
Create `servicemonitor.yaml`:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ${SERVICE_NAME}
  namespace: default
  labels:
    app: ${SERVICE_NAME}
    release: prometheus  # CRITICAL: required for Prometheus to discover this monitor
spec:
  selector:
    matchLabels:
      app: ${SERVICE_NAME}
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
      honorLabels: true
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: 'go_.*'
          action: drop  # drop unnecessary Go runtime metrics
```
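The ServiceMonitor's `endpoints[].port` refers to a *named* port on the Service, not a port number. A matching Service might look like the following sketch (illustrative values; the only hard requirements are that the labels satisfy `matchLabels` and that one port is literally named `http`):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ${SERVICE_NAME}
  namespace: default
  labels:
    app: ${SERVICE_NAME}    # must satisfy the ServiceMonitor's matchLabels
spec:
  selector:
    app: ${SERVICE_NAME}
  ports:
    - name: http            # this name is what `port: http` matches
      port: 80
      targetPort: 8080
```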
### Step 3: Generate PrometheusRules
Create `prometheusrules.yaml`:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ${SERVICE_NAME}-alerts
  namespace: cluster-monitoring
  labels:
    app: kube-prometheus-stack  # CRITICAL: required label
    release: prometheus         # CRITICAL: required label
spec:
  groups:
    - name: ${SERVICE_NAME}.availability
      rules:
        # High Error Rate
        - alert: ${SERVICE_NAME}HighErrorRate
          expr: |
            sum(rate(http_requests_total{service="${SERVICE_NAME}",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="${SERVICE_NAME}"}[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
            service: ${SERVICE_NAME}
          annotations:
            summary: "High error rate on {{ $labels.service }}"
            description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
            runbook_url: "https://runbooks.example.com/${SERVICE_NAME}/high-error-rate"
        # High Latency
        - alert: ${SERVICE_NAME}HighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{service="${SERVICE_NAME}"}[5m])) by (le)
            ) > 1
          for: 5m
          labels:
            severity: warning
            service: ${SERVICE_NAME}
          annotations:
            summary: "High latency on {{ $labels.service }}"
            description: "P95 latency is {{ $value | humanizeDuration }} (threshold: 1s)"
        # Service Down
        - alert: ${SERVICE_NAME}Down
          expr: |
            up{job="${SERVICE_NAME}"} == 0
          for: 2m
          labels:
            severity: critical
            service: ${SERVICE_NAME}
          annotations:
            summary: "{{ $labels.service }} is down"
            description: "{{ $labels.instance }} has been down for more than 2 minutes"
    - name: ${SERVICE_NAME}.resources
      rules:
        # High Memory Usage
        - alert: ${SERVICE_NAME}HighMemoryUsage
          expr: |
            container_memory_working_set_bytes{container="${SERVICE_NAME}"}
            / container_spec_memory_limit_bytes{container="${SERVICE_NAME}"} > 0.85
          for: 5m
          labels:
            severity: warning
            service: ${SERVICE_NAME}
          annotations:
            summary: "High memory usage on {{ $labels.pod }}"
            description: "Memory usage is {{ $value | humanizePercentage }} of limit"
        # High CPU Usage
        - alert: ${SERVICE_NAME}HighCPUUsage
          expr: |
            sum(rate(container_cpu_usage_seconds_total{container="${SERVICE_NAME}"}[5m])) by (pod)
            / sum(container_spec_cpu_quota{container="${SERVICE_NAME}"} / container_spec_cpu_period{container="${SERVICE_NAME}"}) by (pod) > 0.85
          for: 5m
          labels:
            severity: warning
            service: ${SERVICE_NAME}
          annotations:
            summary: "High CPU usage on {{ $labels.pod }}"
            description: "CPU usage is {{ $value | humanizePercentage }} of limit"
        # Pod Restarts
        - alert: ${SERVICE_NAME}PodRestarts
          expr: |
            increase(kube_pod_container_status_restarts_total{container="${SERVICE_NAME}"}[1h]) > 3
          for: 5m
          labels:
            severity: warning
            service: ${SERVICE_NAME}
          annotations:
            summary: "{{ $labels.pod }} is restarting frequently"
            description: "{{ $value }} restarts in the last hour"
```
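Alert expressions are easy to get subtly wrong, so it can be worth unit-testing them with `promtool test rules`. Below is a hypothetical test for the high-error-rate alert, assuming the service name resolves to `payments` and that the `spec.groups` section has been extracted into a plain `rules.yaml` (promtool reads plain Prometheus rule files, not the PrometheusRule CRD wrapper); the series values and timings are illustrative:

```yaml
# tests.yaml -- hypothetical unit test, run with: promtool test rules tests.yaml
rule_files:
  - rules.yaml            # spec.groups extracted from the PrometheusRule
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{service="payments",status="500"}'
        values: '0+6x15'    # ~0.1 errors/s
      - series: 'http_requests_total{service="payments",status="200"}'
        values: '0+60x15'   # ~1 req/s, so the error ratio is ~9% (> 5%)
    alert_rule_test:
      - eval_time: 15m
        alertname: paymentsHighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
              service: payments
```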
### Step 4: Generate Recording Rules
Create `recordingrules.yaml`:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ${SERVICE_NAME}-recording
  namespace: cluster-monitoring
  labels:
    app: kube-prometheus-stack
    release: prometheus
spec:
  groups:
    - name: ${SERVICE_NAME}.recording
      interval: 30s
      rules:
        # Request Rate
        - record: ${SERVICE_NAME}:http_requests:rate5m
          expr: |
            sum(rate(http_requests_total{service="${SERVICE_NAME}"}[5m])) by (status, method, path)
        # Error Rate
        - record: ${SERVICE_NAME}:http_errors:rate5m
          expr: |
            sum(rate(http_requests_total{service="${SERVICE_NAME}",status=~"5.."}[5m]))
        # Latency Percentiles
        - record: ${SERVICE_NAME}:http_latency_p50:5m
          expr: |
            histogram_quantile(0.50,
              sum(rate(http_request_duration_seconds_bucket{service="${SERVICE_NAME}"}[5m])) by (le))
        - record: ${SERVICE_NAME}:http_latency_p95:5m
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{service="${SERVICE_NAME}"}[5m])) by (le))
        - record: ${SERVICE_NAME}:http_latency_p99:5m
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{service="${SERVICE_NAME}"}[5m])) by (le))
```
### Step 5: Generate Grafana Dashboard
Create `dashboard-configmap.yaml`:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ${SERVICE_NAME}-dashboard
  namespace: cluster-monitoring
  labels:
    grafana_dashboard: "1"  # CRITICAL: required for Grafana to discover
data:
  ${SERVICE_NAME}-dashboard.json: |
    {
      "annotations": { "list": [] },
      "editable": true,
      "fiscalYearStartMonth": 0,
      "graphTooltip": 0,
      "id": null,
      "links": [],
      "liveNow": false,
      "panels": [
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "fieldConfig": {
            "defaults": {
              "color": { "mode": "palette-classic" },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 10,
                "gradientMode": "none",
                "hideFrom": { "legend": false, "tooltip": false, "viz": false },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": { "type": "linear" },
                "showPoints": "never",
                "spanNulls": false,
                "stacking": { "group": "A", "mode": "none" },
                "thresholdsStyle": { "mode": "off" }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [ { "color": "green", "value": null } ]
              },
              "unit": "reqps"
            },
            "overrides": []
          },
          "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
          "id": 1,
          "options": {
            "legend": {
              "calcs": ["mean", "max"],
              "displayMode": "table",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": { "mode": "multi", "sort": "desc" }
          },
          "targets": [
            {
              "datasource": { "type": "prometheus", "uid": "prometheus" },
              "expr": "sum(rate(http_requests_total{service=\"${SERVICE_NAME}\"}[5m])) by (status)",
              "legendFormat": "{{status}}",
              "refId": "A"
            }
          ],
          "title": "Request Rate by Status",
          "type": "timeseries"
        },
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "fieldConfig": {
            "defaults": {
              "color": { "mode": "palette-classic" },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 10,
                "gradientMode": "none",
                "hideFrom": { "legend": false, "tooltip": false, "viz": false },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": { "type": "linear" },
                "showPoints": "never",
                "spanNulls": false,
                "stacking": { "group": "A", "mode": "none" },
                "thresholdsStyle": { "mode": "off" }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [ { "color": "green", "value": null } ]
              },
              "unit": "s"
            },
            "overrides": []
          },
          "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
          "id": 2,
          "options": {
            "legend": {
              "calcs": ["mean", "max"],
              "displayMode": "table",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": { "mode": "multi", "sort": "desc" }
          },
          "targets": [
            {
              "datasource": { "type": "prometheus", "uid": "prometheus" },
              "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=\"${SERVICE_NAME}\"}[5m])) by (le))",
              "legendFormat": "p50",
              "refId": "A"
            },
            {
              "datasource": { "type": "prometheus", "uid": "prometheus" },
              "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=\"${SERVICE_NAME}\"}[5m])) by (le))",
              "legendFormat": "p95",
              "refId": "B"
            },
            {
              "datasource": { "type": "prometheus", "uid": "prometheus" },
              "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=\"${SERVICE_NAME}\"}[5m])) by (le))",
              "legendFormat": "p99",
              "refId": "C"
            }
          ],
          "title": "Request Latency Percentiles",
          "type": "timeseries"
        }
      ],
      "refresh": "30s",
      "schemaVersion": 38,
      "style": "dark",
      "tags": ["${SERVICE_NAME}", "http"],
      "templating": { "list": [] },
      "time": { "from": "now-1h", "to": "now" },
      "timepicker": {},
      "timezone": "browser",
      "title": "${SERVICE_NAME} Dashboard",
      "uid": "${SERVICE_NAME}-dashboard",
      "version": 1,
      "weekStart": ""
    }
```
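A single misplaced comma in the embedded JSON will silently break dashboard provisioning, so it can be worth syntax-checking the payload before applying the ConfigMap. A minimal sketch using Python's stdlib `json.tool`; the `dashboard.json` written here is a stand-in for the generated dashboard file:

```shell
# Stand-in for the generated dashboard JSON
echo '{"title": "demo"}' > dashboard.json

# Exits non-zero (and prints an error) on malformed JSON
python3 -m json.tool dashboard.json > /dev/null && echo "valid JSON"
```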
## Patterns & Best Practices
### Alert Severity Levels
| Severity | Response Time | Example |
|---|---|---|
| critical | Immediate (page) | Service down, data loss risk |
| warning | Business hours | High latency, resource pressure |
| info | Next review | Approaching thresholds |
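These severity labels are only useful if something routes on them. A hypothetical Alertmanager routing fragment (the receiver names are placeholders, not part of this skill):

```yaml
# alertmanager.yaml (fragment) -- route alerts by the severity label
route:
  receiver: slack-default          # placeholder default receiver
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall   # critical pages immediately
    - matchers:
        - severity="warning"
      receiver: slack-warnings     # warnings reviewed during business hours
```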
### PromQL Best Practices
- Use recording rules for complex queries
- Always include a `for` duration to avoid flapping
- Use `rate()` for counters, never `increase()`, in alerts
- Include meaningful labels for routing
### Dashboard Design
- Use consistent colors (green=good, red=bad)
- Include time range variables
- Add annotation markers for deployments
- Group related panels
Validation
# Check ServiceMonitor is picked up
kubectl get servicemonitor -n default
# Check PrometheusRules
kubectl get prometheusrules -n cluster-monitoring
# Check targets in Prometheus UI
# http://prometheus:9090/targets
# Check alerts
# http://prometheus:9090/alerts
# Verify dashboard in Grafana
# http://grafana:3000/dashboards
## Common Pitfalls
- **Missing Labels**: `release: prometheus` and `app: kube-prometheus-stack` are required
- **Wrong Namespace**: the ServiceMonitor must be in the same namespace as the service, or use `namespaceSelector`
- **Port Mismatch**: the ServiceMonitor port must match the Service port name
- **Dashboard Not Loading**: check the `grafana_dashboard: "1"` label on the ConfigMap
- **High Cardinality**: avoid labels with unbounded values (user IDs, request IDs)
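A quick way to check for the last pitfall is to ask Prometheus which metric names carry the most series. A hypothetical ad-hoc query for the Prometheus UI (it touches every series, so avoid running it routinely on very large installations):

```promql
topk(10, count by (__name__)({__name__=~".+"}))
```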