# Example: Prometheus/Grafana Monitoring Skill

This is a complete example of a DevOps skill for monitoring setup.


---
name: monitoring-setup
description: Creates Prometheus alerting rules, ServiceMonitors, and Grafana dashboards. Use when setting up observability for services.
argument-hint: [service-name] [metric-type: http|celery|custom]
allowed-tools: Read, Write, Edit, Glob, Grep
disable-model-invocation: true
---

# Monitoring Setup Generator

Generate production-ready Prometheus alerting rules, ServiceMonitors, and Grafana dashboards for service observability.

## When to Use

- Setting up monitoring for a new service
- Creating custom alerting rules
- Building Grafana dashboards
- Configuring metric scraping

## Prerequisites

- Prometheus Operator installed (kube-prometheus-stack)
- Grafana deployed with dashboard provisioning
- Service exposes `/metrics` endpoint

## Instructions

### Step 1: Parse Arguments

Extract from `$ARGUMENTS`:
- **service-name**: Name of the service to monitor
- **metric-type**: Type of metrics (http, celery, custom)
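A minimal sketch of this parsing step, assuming `$ARGUMENTS` arrives as a single space-separated string (the variable names and example values are illustrative):

```shell
# Sketch: split "$ARGUMENTS" into its two fields, defaulting metric-type to "http".
ARGUMENTS="payments-api http"            # example invocation: payments-api http
SERVICE_NAME="${ARGUMENTS%% *}"          # first word: the service name
case "$ARGUMENTS" in
  *" "*) METRIC_TYPE="${ARGUMENTS#* }" ;; # second word: the metric type
  *)     METRIC_TYPE="http" ;;            # default when omitted
esac
echo "service=${SERVICE_NAME} type=${METRIC_TYPE}"
```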

### Step 2: Generate ServiceMonitor

Create `servicemonitor.yaml`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ${SERVICE_NAME}
  namespace: default
  labels:
    app: ${SERVICE_NAME}
    release: prometheus  # CRITICAL: Required for Prometheus to discover
spec:
  selector:
    matchLabels:
      app: ${SERVICE_NAME}
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
      honorLabels: true
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: 'go_.*'
          action: drop  # Drop unnecessary Go runtime metrics

```

### Step 3: Generate PrometheusRules

Create `prometheusrules.yaml`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ${SERVICE_NAME}-alerts
  namespace: cluster-monitoring
  labels:
    app: kube-prometheus-stack  # CRITICAL: Required label
    release: prometheus         # CRITICAL: Required label
spec:
  groups:
    - name: ${SERVICE_NAME}.availability
      rules:
        # High Error Rate
        - alert: ${SERVICE_NAME}HighErrorRate
          expr: |
            sum(rate(http_requests_total{service="${SERVICE_NAME}",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="${SERVICE_NAME}"}[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
            service: ${SERVICE_NAME}
          annotations:
            summary: "High error rate on {{ $labels.service }}"
            description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
            runbook_url: "https://runbooks.example.com/${SERVICE_NAME}/high-error-rate"

        # High Latency
        - alert: ${SERVICE_NAME}HighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{service="${SERVICE_NAME}"}[5m])) by (le)
            ) > 1
          for: 5m
          labels:
            severity: warning
            service: ${SERVICE_NAME}
          annotations:
            summary: "High latency on {{ $labels.service }}"
            description: "P95 latency is {{ $value | humanizeDuration }} (threshold: 1s)"

        # Service Down
        - alert: ${SERVICE_NAME}Down
          expr: |
            up{job="${SERVICE_NAME}"} == 0
          for: 2m
          labels:
            severity: critical
            service: ${SERVICE_NAME}
          annotations:
            summary: "{{ $labels.service }} is down"
            description: "{{ $labels.instance }} has been down for more than 2 minutes"

    - name: ${SERVICE_NAME}.resources
      rules:
        # High Memory Usage
        - alert: ${SERVICE_NAME}HighMemoryUsage
          expr: |
            container_memory_working_set_bytes{container="${SERVICE_NAME}"}
            / container_spec_memory_limit_bytes{container="${SERVICE_NAME}"} > 0.85
          for: 5m
          labels:
            severity: warning
            service: ${SERVICE_NAME}
          annotations:
            summary: "High memory usage on {{ $labels.pod }}"
            description: "Memory usage is {{ $value | humanizePercentage }} of limit"

        # High CPU Usage
        - alert: ${SERVICE_NAME}HighCPUUsage
          expr: |
            sum(rate(container_cpu_usage_seconds_total{container="${SERVICE_NAME}"}[5m])) by (pod)
            / sum(container_spec_cpu_quota{container="${SERVICE_NAME}"}/container_spec_cpu_period{container="${SERVICE_NAME}"}) by (pod) > 0.85
          for: 5m
          labels:
            severity: warning
            service: ${SERVICE_NAME}
          annotations:
            summary: "High CPU usage on {{ $labels.pod }}"
            description: "CPU usage is {{ $value | humanizePercentage }} of limit"

        # Pod Restarts
        - alert: ${SERVICE_NAME}PodRestarts
          expr: |
            increase(kube_pod_container_status_restarts_total{container="${SERVICE_NAME}"}[1h]) > 3
          for: 5m
          labels:
            severity: warning
            service: ${SERVICE_NAME}
          annotations:
            summary: "{{ $labels.pod }} is restarting frequently"
            description: "{{ $value }} restarts in the last hour"
```
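The `HighErrorRate` expression is a ratio of the 5xx request rate to the total request rate. As a sanity check, the same arithmetic in plain Python (the per-pod rates are hypothetical sample values):

```python
# Mirror of the HighErrorRate expression:
#   sum(rate(5xx)) / sum(rate(total)) > 0.05
rates_5xx = {"pod-a": 0.4, "pod-b": 0.2}   # req/s of 5xx responses (sample data)
rates_all = {"pod-a": 6.0, "pod-b": 4.0}   # req/s of all responses (sample data)

error_ratio = sum(rates_5xx.values()) / sum(rates_all.values())
fires = error_ratio > 0.05
print(round(error_ratio, 3), fires)  # → 0.06 True
```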

### Step 4: Generate Recording Rules

Create `recordingrules.yaml`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ${SERVICE_NAME}-recording
  namespace: cluster-monitoring
  labels:
    app: kube-prometheus-stack
    release: prometheus
spec:
  groups:
    - name: ${SERVICE_NAME}.recording
      interval: 30s
      rules:
        # Request Rate
        - record: ${SERVICE_NAME}:http_requests:rate5m
          expr: |
            sum(rate(http_requests_total{service="${SERVICE_NAME}"}[5m])) by (status, method, path)

        # Error Rate
        - record: ${SERVICE_NAME}:http_errors:rate5m
          expr: |
            sum(rate(http_requests_total{service="${SERVICE_NAME}",status=~"5.."}[5m]))

        # Latency Percentiles
        - record: ${SERVICE_NAME}:http_latency_p50:5m
          expr: |
            histogram_quantile(0.50,
              sum(rate(http_request_duration_seconds_bucket{service="${SERVICE_NAME}"}[5m])) by (le))

        - record: ${SERVICE_NAME}:http_latency_p95:5m
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{service="${SERVICE_NAME}"}[5m])) by (le))

        - record: ${SERVICE_NAME}:http_latency_p99:5m
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{service="${SERVICE_NAME}"}[5m])) by (le))
```
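The percentile rules estimate latency from cumulative histogram buckets. A toy illustration of the idea behind `histogram_quantile` for classic histograms (linear interpolation within the bucket containing the target rank; bucket data invented for the example):

```python
# Toy cumulative buckets: (upper bound `le` in seconds, cumulative observation count)
buckets = [(0.1, 50), (0.5, 90), (1.0, 100)]  # sample data

def quantile(q: float, buckets) -> float:
    """Simplified histogram_quantile: linear interpolation inside the target bucket."""
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return buckets[-1][0]

print(quantile(0.95, buckets))  # → 0.75
```

This is why percentile accuracy depends on bucket boundaries: the p95 here can only be located to within its bucket, then interpolated.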

### Step 5: Generate Grafana Dashboard

Create `dashboard-configmap.yaml`:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ${SERVICE_NAME}-dashboard
  namespace: cluster-monitoring
  labels:
    grafana_dashboard: "1"  # CRITICAL: Required for Grafana to discover
data:
  ${SERVICE_NAME}-dashboard.json: |
    {
      "annotations": {
        "list": []
      },
      "editable": true,
      "fiscalYearStartMonth": 0,
      "graphTooltip": 0,
      "id": null,
      "links": [],
      "liveNow": false,
      "panels": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "prometheus"
          },
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 10,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "never",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  }
                ]
              },
              "unit": "reqps"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 0
          },
          "id": 1,
          "options": {
            "legend": {
              "calcs": ["mean", "max"],
              "displayMode": "table",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "multi",
              "sort": "desc"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prometheus"
              },
              "expr": "sum(rate(http_requests_total{service=\"${SERVICE_NAME}\"}[5m])) by (status)",
              "legendFormat": "{{status}}",
              "refId": "A"
            }
          ],
          "title": "Request Rate by Status",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "prometheus"
          },
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 10,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "never",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  }
                ]
              },
              "unit": "s"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 0
          },
          "id": 2,
          "options": {
            "legend": {
              "calcs": ["mean", "max"],
              "displayMode": "table",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "multi",
              "sort": "desc"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prometheus"
              },
              "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=\"${SERVICE_NAME}\"}[5m])) by (le))",
              "legendFormat": "p50",
              "refId": "A"
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prometheus"
              },
              "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=\"${SERVICE_NAME}\"}[5m])) by (le))",
              "legendFormat": "p95",
              "refId": "B"
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prometheus"
              },
              "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=\"${SERVICE_NAME}\"}[5m])) by (le))",
              "legendFormat": "p99",
              "refId": "C"
            }
          ],
          "title": "Request Latency Percentiles",
          "type": "timeseries"
        }
      ],
      "refresh": "30s",
      "schemaVersion": 38,
      "style": "dark",
      "tags": ["${SERVICE_NAME}", "http"],
      "templating": {
        "list": []
      },
      "time": {
        "from": "now-1h",
        "to": "now"
      },
      "timepicker": {},
      "timezone": "browser",
      "title": "${SERVICE_NAME} Dashboard",
      "uid": "${SERVICE_NAME}-dashboard",
      "version": 1,
      "weekStart": ""
    }
```
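Because the dashboard JSON is embedded in YAML as a string, a cheap pre-commit check is to substitute the variable and confirm the result still parses as JSON. A sketch with an abbreviated fragment of the template and a hypothetical service name:

```python
import json
from string import Template

# Abbreviated fragment of the dashboard template above; "payments-api" is a
# hypothetical service name used only for illustration.
template = Template('{"title": "${SERVICE_NAME} Dashboard", "uid": "${SERVICE_NAME}-dashboard"}')
rendered = template.substitute(SERVICE_NAME="payments-api")
dashboard = json.loads(rendered)  # raises ValueError if substitution broke the JSON
print(dashboard["uid"])  # → payments-api-dashboard
```

`string.Template` uses the same `${NAME}` syntax as the manifests above, which makes it a convenient stand-in for whatever substitution mechanism the skill actually uses.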

## Patterns & Best Practices

### Alert Severity Levels

| Severity | Response Time | Example |
|----------|---------------|---------|
| critical | Immediate (page) | Service down, data-loss risk |
| warning | Business hours | High latency, resource pressure |
| info | Next review | Approaching thresholds |

### PromQL Best Practices

- Use recording rules for complex or frequently evaluated queries
- Always include a `for:` duration to avoid flapping alerts
- Prefer `rate()` for threshold alerts on counters; reserve `increase()` for counting discrete events (as in the pod-restart alert)
- Include meaningful labels for alert routing

### Dashboard Design

- Use consistent colors (green = good, red = bad)
- Include time-range variables
- Add annotation markers for deployments
- Group related panels

## Validation

```bash
# Check the ServiceMonitor is picked up
kubectl get servicemonitor -n default

# Check the PrometheusRules
kubectl get prometheusrules -n cluster-monitoring

# Check scrape targets in the Prometheus UI
# http://prometheus:9090/targets

# Check alerts
# http://prometheus:9090/alerts

# Verify the dashboard in Grafana
# http://grafana:3000/dashboards
```

## Common Pitfalls

- **Missing labels**: `release: prometheus` and `app: kube-prometheus-stack` are required for Prometheus to pick up the resources
- **Wrong namespace**: the ServiceMonitor must be in the same namespace as the service, or use `namespaceSelector`
- **Port mismatch**: the ServiceMonitor `port` must match the Service port *name*, not the port number
- **Dashboard not loading**: check the `grafana_dashboard: "1"` label on the ConfigMap
- **High cardinality**: avoid labels with unbounded values (user IDs, request IDs)
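The label pitfalls can be caught mechanically before applying. A sketch of a pre-apply check, assuming the generated manifests have already been parsed into dicts (the function name and sample manifest are illustrative):

```python
# Discovery labels required by each kind, per the pitfalls above.
REQUIRED = {
    "PrometheusRule": {"app": "kube-prometheus-stack", "release": "prometheus"},
    "ServiceMonitor": {"release": "prometheus"},
    "ConfigMap": {"grafana_dashboard": "1"},
}

def missing_discovery_labels(manifest: dict) -> list:
    """Return the discovery labels this manifest is missing (empty list = OK)."""
    labels = manifest.get("metadata", {}).get("labels", {})
    required = REQUIRED.get(manifest.get("kind", ""), {})
    return [k for k, v in required.items() if labels.get(k) != v]

rule = {"kind": "PrometheusRule", "metadata": {"labels": {"release": "prometheus"}}}
print(missing_discovery_labels(rule))  # → ['app']
```

Running a check like this in CI is cheaper than debugging why Prometheus silently ignores a rule file.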