# Example: Prometheus/Grafana Monitoring Skill

This is a complete example of a DevOps skill for monitoring setup.

---

```yaml
---
name: monitoring-setup
description: Creates Prometheus alerting rules, ServiceMonitors, and Grafana dashboards. Use when setting up observability for services.
argument-hint: [service-name] [metric-type: http|celery|custom]
allowed-tools: Read, Write, Edit, Glob, Grep
disable-model-invocation: true
---
```

# Monitoring Setup Generator

Generate production-ready Prometheus alerting rules, ServiceMonitors, and Grafana dashboards for service observability.

## When to Use

- Setting up monitoring for a new service
- Creating custom alerting rules
- Building Grafana dashboards
- Configuring metric scraping

## Prerequisites

- Prometheus Operator installed (kube-prometheus-stack)
- Grafana deployed with dashboard provisioning
- Service exposes a `/metrics` endpoint

## Instructions

### Step 1: Parse Arguments

Extract from `$ARGUMENTS`:

- **service-name**: Name of the service to monitor
- **metric-type**: Type of metrics (http, celery, custom)

### Step 2: Generate ServiceMonitor

Create `servicemonitor.yaml`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ${SERVICE_NAME}
  namespace: default
  labels:
    app: ${SERVICE_NAME}
    release: prometheus  # CRITICAL: Required for Prometheus to discover
spec:
  selector:
    matchLabels:
      app: ${SERVICE_NAME}
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
      honorLabels: true
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: 'go_.*'
          action: drop  # Drop unnecessary Go runtime metrics
```

### Step 3: Generate PrometheusRules

Create `prometheusrules.yaml`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ${SERVICE_NAME}-alerts
  namespace: cluster-monitoring
  labels:
    app: kube-prometheus-stack  # CRITICAL: Required label
    release: prometheus         # CRITICAL: Required label
spec:
  groups:
    - name:
        ${SERVICE_NAME}.availability
      rules:
        # High Error Rate
        - alert: ${SERVICE_NAME}HighErrorRate
          expr: |
            sum(rate(http_requests_total{service="${SERVICE_NAME}",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="${SERVICE_NAME}"}[5m]))
            > 0.05
          for: 5m
          labels:
            severity: critical
            service: ${SERVICE_NAME}
          annotations:
            summary: "High error rate on {{ $labels.service }}"
            description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
            runbook_url: "https://runbooks.example.com/${SERVICE_NAME}/high-error-rate"

        # High Latency
        - alert: ${SERVICE_NAME}HighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{service="${SERVICE_NAME}"}[5m])) by (le)
            ) > 1
          for: 5m
          labels:
            severity: warning
            service: ${SERVICE_NAME}
          annotations:
            summary: "High latency on {{ $labels.service }}"
            description: "P95 latency is {{ $value | humanizeDuration }} (threshold: 1s)"

        # Service Down
        - alert: ${SERVICE_NAME}Down
          expr: |
            up{job="${SERVICE_NAME}"} == 0
          for: 2m
          labels:
            severity: critical
            service: ${SERVICE_NAME}
          annotations:
            summary: "{{ $labels.service }} is down"
            description: "{{ $labels.instance }} has been down for more than 2 minutes"

    - name: ${SERVICE_NAME}.resources
      rules:
        # High Memory Usage
        - alert: ${SERVICE_NAME}HighMemoryUsage
          expr: |
            container_memory_working_set_bytes{container="${SERVICE_NAME}"}
            /
            container_spec_memory_limit_bytes{container="${SERVICE_NAME}"}
            > 0.85
          for: 5m
          labels:
            severity: warning
            service: ${SERVICE_NAME}
          annotations:
            summary: "High memory usage on {{ $labels.pod }}"
            description: "Memory usage is {{ $value | humanizePercentage }} of limit"

        # High CPU Usage
        - alert: ${SERVICE_NAME}HighCPUUsage
          expr: |
            sum(rate(container_cpu_usage_seconds_total{container="${SERVICE_NAME}"}[5m])) by (pod)
            /
            sum(container_spec_cpu_quota{container="${SERVICE_NAME}"}/container_spec_cpu_period{container="${SERVICE_NAME}"}) by (pod)
            > 0.85
          for: 5m
          labels:
            severity: warning
            service: ${SERVICE_NAME}
          annotations:
            summary: "High CPU usage on
              {{ $labels.pod }}"
            description: "CPU usage is {{ $value | humanizePercentage }} of limit"

        # Pod Restarts
        - alert: ${SERVICE_NAME}PodRestarts
          expr: |
            increase(kube_pod_container_status_restarts_total{container="${SERVICE_NAME}"}[1h]) > 3
          for: 5m
          labels:
            severity: warning
            service: ${SERVICE_NAME}
          annotations:
            summary: "{{ $labels.pod }} is restarting frequently"
            description: "{{ $value }} restarts in the last hour"
```

### Step 4: Generate Recording Rules

Create `recordingrules.yaml`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ${SERVICE_NAME}-recording
  namespace: cluster-monitoring
  labels:
    app: kube-prometheus-stack
    release: prometheus
spec:
  groups:
    - name: ${SERVICE_NAME}.recording
      interval: 30s
      rules:
        # Request Rate
        - record: ${SERVICE_NAME}:http_requests:rate5m
          expr: |
            sum(rate(http_requests_total{service="${SERVICE_NAME}"}[5m])) by (status, method, path)

        # Error Rate
        - record: ${SERVICE_NAME}:http_errors:rate5m
          expr: |
            sum(rate(http_requests_total{service="${SERVICE_NAME}",status=~"5.."}[5m]))

        # Latency Percentiles
        - record: ${SERVICE_NAME}:http_latency_p50:5m
          expr: |
            histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service="${SERVICE_NAME}"}[5m])) by (le))
        - record: ${SERVICE_NAME}:http_latency_p95:5m
          expr: |
            histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="${SERVICE_NAME}"}[5m])) by (le))
        - record: ${SERVICE_NAME}:http_latency_p99:5m
          expr: |
            histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="${SERVICE_NAME}"}[5m])) by (le))
```

### Step 5: Generate Grafana Dashboard

Create `dashboard-configmap.yaml`:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ${SERVICE_NAME}-dashboard
  namespace: cluster-monitoring
  labels:
    grafana_dashboard: "1"  # CRITICAL: Required for Grafana to discover
data:
  ${SERVICE_NAME}-dashboard.json: |
    {
      "annotations": { "list": [] },
      "editable": true,
      "fiscalYearStartMonth": 0,
      "graphTooltip": 0,
      "id": null,
      "links": [],
      "liveNow": false,
      "panels": [
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "fieldConfig": {
            "defaults": {
              "color": { "mode": "palette-classic" },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 10,
                "gradientMode": "none",
                "hideFrom": { "legend": false, "tooltip": false, "viz": false },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": { "type": "linear" },
                "showPoints": "never",
                "spanNulls": false,
                "stacking": { "group": "A", "mode": "none" },
                "thresholdsStyle": { "mode": "off" }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [ { "color": "green", "value": null } ]
              },
              "unit": "reqps"
            },
            "overrides": []
          },
          "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
          "id": 1,
          "options": {
            "legend": { "calcs": ["mean", "max"], "displayMode": "table", "placement": "bottom", "showLegend": true },
            "tooltip": { "mode": "multi", "sort": "desc" }
          },
          "targets": [
            {
              "datasource": { "type": "prometheus", "uid": "prometheus" },
              "expr": "sum(rate(http_requests_total{service=\"${SERVICE_NAME}\"}[5m])) by (status)",
              "legendFormat": "{{status}}",
              "refId": "A"
            }
          ],
          "title": "Request Rate by Status",
          "type": "timeseries"
        },
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "fieldConfig": {
            "defaults": {
              "color": { "mode": "palette-classic" },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 10,
                "gradientMode": "none",
                "hideFrom": { "legend": false, "tooltip": false, "viz": false },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": { "type": "linear" },
                "showPoints": "never",
                "spanNulls": false,
                "stacking": { "group": "A", "mode": "none" },
                "thresholdsStyle": { "mode": "off" }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [ { "color": "green",
                  "value": null } ]
              },
              "unit": "s"
            },
            "overrides": []
          },
          "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
          "id": 2,
          "options": {
            "legend": { "calcs": ["mean", "max"], "displayMode": "table", "placement": "bottom", "showLegend": true },
            "tooltip": { "mode": "multi", "sort": "desc" }
          },
          "targets": [
            {
              "datasource": { "type": "prometheus", "uid": "prometheus" },
              "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=\"${SERVICE_NAME}\"}[5m])) by (le))",
              "legendFormat": "p50",
              "refId": "A"
            },
            {
              "datasource": { "type": "prometheus", "uid": "prometheus" },
              "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=\"${SERVICE_NAME}\"}[5m])) by (le))",
              "legendFormat": "p95",
              "refId": "B"
            },
            {
              "datasource": { "type": "prometheus", "uid": "prometheus" },
              "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=\"${SERVICE_NAME}\"}[5m])) by (le))",
              "legendFormat": "p99",
              "refId": "C"
            }
          ],
          "title": "Request Latency Percentiles",
          "type": "timeseries"
        }
      ],
      "refresh": "30s",
      "schemaVersion": 38,
      "style": "dark",
      "tags": ["${SERVICE_NAME}", "http"],
      "templating": { "list": [] },
      "time": { "from": "now-1h", "to": "now" },
      "timepicker": {},
      "timezone": "browser",
      "title": "${SERVICE_NAME} Dashboard",
      "uid": "${SERVICE_NAME}-dashboard",
      "version": 1,
      "weekStart": ""
    }
```

## Patterns & Best Practices

### Alert Severity Levels

| Severity | Response Time | Example |
|----------|---------------|---------|
| critical | Immediate (page) | Service down, data loss risk |
| warning | Business hours | High latency, resource pressure |
| info | Next review | Approaching thresholds |

### PromQL Best Practices

- Use recording rules for complex queries
- Always include a `for` duration to avoid alert flapping
- Use `rate()` for counter-based ratios and throughput; reserve `increase()` for alerts on absolute counts over a window (e.g. the pod-restart alert above)
- Include meaningful labels for alert routing

### Dashboard Design

- Use consistent colors (green = good, red = bad)
- Include time range variables
- Add
  annotation markers for deployments
- Group related panels

## Validation

```bash
# Check the ServiceMonitor is picked up
kubectl get servicemonitor -n default

# Check the PrometheusRules
kubectl get prometheusrules -n cluster-monitoring

# Check targets in the Prometheus UI
# http://prometheus:9090/targets

# Check alerts
# http://prometheus:9090/alerts

# Verify the dashboard in Grafana
# http://grafana:3000/dashboards
```

## Common Pitfalls

- **Missing labels**: `release: prometheus` and `app: kube-prometheus-stack` are required for the rules to be loaded
- **Wrong namespace**: the ServiceMonitor's `namespaceSelector` must include the service's namespace, and the ServiceMonitor itself must live in a namespace Prometheus watches
- **Port mismatch**: the ServiceMonitor `port` must match the Service's port *name*, not its number
- **Dashboard not loading**: check the `grafana_dashboard: "1"` label on the ConfigMap
- **High cardinality**: avoid labels with unbounded values (user IDs, request IDs)
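## Appendix: Rendering the Templates

The `${SERVICE_NAME}` placeholders used throughout the manifests follow shell-style syntax, so Step 1's argument parsing can feed straight into Python's stdlib `string.Template`. A minimal sketch (the `render` helper and the `checkout-api` service name are illustrative, not part of the skill):

```python
from string import Template

# Abbreviated ServiceMonitor template; the real skill writes the full manifest.
SERVICEMONITOR_TMPL = """\
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ${SERVICE_NAME}
  labels:
    app: ${SERVICE_NAME}
    release: prometheus
"""

def render(template: str, service_name: str) -> str:
    # safe_substitute leaves any unknown ${...} placeholders untouched,
    # so Grafana's own ${...} dashboard variables would survive rendering
    return Template(template).safe_substitute(SERVICE_NAME=service_name)

manifest = render(SERVICEMONITOR_TMPL, "checkout-api")
print(manifest)
```

`safe_substitute` (rather than `substitute`) is the safer choice here precisely because the Grafana dashboard JSON may contain `${...}` tokens that are not meant for this renderer.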
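The latency recording rules and dashboard panels all lean on `histogram_quantile()`, which estimates a quantile by linear interpolation inside cumulative histogram buckets. A simplified stdlib-only sketch of that estimation (the bucket bounds and counts are made-up example data, and real Prometheus handles more edge cases):

```python
def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate the q-quantile from cumulative (le, count) buckets,
    interpolating linearly within the bucket containing the target rank."""
    total = buckets[-1][1]          # the +Inf bucket holds the total count
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return prev_le      # fall back to the last finite bound
            # linear interpolation between the bucket's bounds
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return prev_le

# Cumulative counts for le = 0.1s, 0.5s, 1s, +Inf (example data)
buckets = [(0.1, 600), (0.5, 900), (1.0, 990), (float("inf"), 1000)]
p95 = histogram_quantile(0.95, buckets)  # falls in the 0.5s-1s bucket
```

This also shows why bucket layout matters: the p95 alert above fires at 1s, so the histogram needs bucket bounds near 1s for the interpolated estimate to be meaningful.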