# Example: Prometheus/Grafana Monitoring Skill
This is a complete example of a DevOps skill for monitoring setup.
---
name: monitoring-setup
description: Creates Prometheus alerting rules, ServiceMonitors, and Grafana dashboards. Use when setting up observability for services.
argument-hint: [service-name] [metric-type: http|celery|custom]
allowed-tools: Read, Write, Edit, Glob, Grep
disable-model-invocation: true
---
# Monitoring Setup Generator
Generate production-ready Prometheus alerting rules, ServiceMonitors, and Grafana dashboards for service observability.
## When to Use
- Setting up monitoring for a new service
- Creating custom alerting rules
- Building Grafana dashboards
- Configuring metric scraping
## Prerequisites
- Prometheus Operator installed (kube-prometheus-stack)
- Grafana deployed with dashboard provisioning
- Service exposes `/metrics` endpoint
## Instructions
### Step 1: Parse Arguments
Extract from `$ARGUMENTS`:
- **service-name**: Name of the service to monitor
- **metric-type**: Type of metrics (http, celery, custom)
### Step 2: Generate ServiceMonitor
Create `servicemonitor.yaml`:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ${SERVICE_NAME}
  namespace: default
  labels:
    app: ${SERVICE_NAME}
    release: prometheus  # CRITICAL: required for Prometheus to discover this monitor
spec:
  selector:
    matchLabels:
      app: ${SERVICE_NAME}
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
      honorLabels: true
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: 'go_.*'
          action: drop  # drop unnecessary Go runtime metrics
```
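The ServiceMonitor's `endpoints[].port` refers to a *named* port on the Service, not a port number. A matching Service might look like the following sketch (illustrative values; the only hard requirements are that the labels satisfy `matchLabels` and that one port is literally named `http`):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ${SERVICE_NAME}
  namespace: default
  labels:
    app: ${SERVICE_NAME}    # must satisfy the ServiceMonitor's matchLabels
spec:
  selector:
    app: ${SERVICE_NAME}
  ports:
    - name: http            # this name is what `port: http` matches
      port: 80
      targetPort: 8080
```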
### Step 3: Generate PrometheusRules
Create `prometheusrules.yaml`:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ${SERVICE_NAME}-alerts
  namespace: cluster-monitoring
  labels:
    app: kube-prometheus-stack  # CRITICAL: required label
    release: prometheus         # CRITICAL: required label
spec:
  groups:
    - name: ${SERVICE_NAME}.availability
      rules:
        # High Error Rate
        - alert: ${SERVICE_NAME}HighErrorRate
          expr: |
            sum(rate(http_requests_total{service="${SERVICE_NAME}",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="${SERVICE_NAME}"}[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
            service: ${SERVICE_NAME}
          annotations:
            summary: "High error rate on {{ $labels.service }}"
            description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
            runbook_url: "https://runbooks.example.com/${SERVICE_NAME}/high-error-rate"
        # High Latency
        - alert: ${SERVICE_NAME}HighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{service="${SERVICE_NAME}"}[5m])) by (le)
            ) > 1
          for: 5m
          labels:
            severity: warning
            service: ${SERVICE_NAME}
          annotations:
            summary: "High latency on {{ $labels.service }}"
            description: "P95 latency is {{ $value | humanizeDuration }} (threshold: 1s)"
        # Service Down
        - alert: ${SERVICE_NAME}Down
          expr: |
            up{job="${SERVICE_NAME}"} == 0
          for: 2m
          labels:
            severity: critical
            service: ${SERVICE_NAME}
          annotations:
            summary: "{{ $labels.service }} is down"
            description: "{{ $labels.instance }} has been down for more than 2 minutes"
    - name: ${SERVICE_NAME}.resources
      rules:
        # High Memory Usage
        - alert: ${SERVICE_NAME}HighMemoryUsage
          expr: |
            container_memory_working_set_bytes{container="${SERVICE_NAME}"}
            / container_spec_memory_limit_bytes{container="${SERVICE_NAME}"} > 0.85
          for: 5m
          labels:
            severity: warning
            service: ${SERVICE_NAME}
          annotations:
            summary: "High memory usage on {{ $labels.pod }}"
            description: "Memory usage is {{ $value | humanizePercentage }} of limit"
        # High CPU Usage
        - alert: ${SERVICE_NAME}HighCPUUsage
          expr: |
            sum(rate(container_cpu_usage_seconds_total{container="${SERVICE_NAME}"}[5m])) by (pod)
            / sum(container_spec_cpu_quota{container="${SERVICE_NAME}"} / container_spec_cpu_period{container="${SERVICE_NAME}"}) by (pod) > 0.85
          for: 5m
          labels:
            severity: warning
            service: ${SERVICE_NAME}
          annotations:
            summary: "High CPU usage on {{ $labels.pod }}"
            description: "CPU usage is {{ $value | humanizePercentage }} of limit"
        # Pod Restarts
        - alert: ${SERVICE_NAME}PodRestarts
          expr: |
            increase(kube_pod_container_status_restarts_total{container="${SERVICE_NAME}"}[1h]) > 3
          for: 5m
          labels:
            severity: warning
            service: ${SERVICE_NAME}
          annotations:
            summary: "{{ $labels.pod }} is restarting frequently"
            description: "{{ $value }} restarts in the last hour"
```
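Alert expressions are easy to get subtly wrong, so it can be worth unit-testing them with `promtool test rules`. Below is a hypothetical test for the high-error-rate alert, assuming the service name resolves to `payments` and that the `spec.groups` section has been extracted into a plain `rules.yaml` (promtool reads plain Prometheus rule files, not the PrometheusRule CRD wrapper); the series values and timings are illustrative:

```yaml
# tests.yaml -- hypothetical unit test, run with: promtool test rules tests.yaml
rule_files:
  - rules.yaml            # spec.groups extracted from the PrometheusRule
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{service="payments",status="500"}'
        values: '0+6x15'    # ~0.1 errors/s
      - series: 'http_requests_total{service="payments",status="200"}'
        values: '0+60x15'   # ~1 req/s, so the error ratio is ~9% (> 5%)
    alert_rule_test:
      - eval_time: 15m
        alertname: paymentsHighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
              service: payments
```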
### Step 4: Generate Recording Rules
Create `recordingrules.yaml`:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ${SERVICE_NAME}-recording
  namespace: cluster-monitoring
  labels:
    app: kube-prometheus-stack
    release: prometheus
spec:
  groups:
    - name: ${SERVICE_NAME}.recording
      interval: 30s
      rules:
        # Request Rate
        - record: ${SERVICE_NAME}:http_requests:rate5m
          expr: |
            sum(rate(http_requests_total{service="${SERVICE_NAME}"}[5m])) by (status, method, path)
        # Error Rate
        - record: ${SERVICE_NAME}:http_errors:rate5m
          expr: |
            sum(rate(http_requests_total{service="${SERVICE_NAME}",status=~"5.."}[5m]))
        # Latency Percentiles
        - record: ${SERVICE_NAME}:http_latency_p50:5m
          expr: |
            histogram_quantile(0.50,
              sum(rate(http_request_duration_seconds_bucket{service="${SERVICE_NAME}"}[5m])) by (le))
        - record: ${SERVICE_NAME}:http_latency_p95:5m
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{service="${SERVICE_NAME}"}[5m])) by (le))
        - record: ${SERVICE_NAME}:http_latency_p99:5m
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{service="${SERVICE_NAME}"}[5m])) by (le))
```
### Step 5: Generate Grafana Dashboard
Create `dashboard-configmap.yaml`:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ${SERVICE_NAME}-dashboard
  namespace: cluster-monitoring
  labels:
    grafana_dashboard: "1"  # CRITICAL: required for Grafana to discover
data:
  ${SERVICE_NAME}-dashboard.json: |
    {
      "annotations": { "list": [] },
      "editable": true,
      "fiscalYearStartMonth": 0,
      "graphTooltip": 0,
      "id": null,
      "links": [],
      "liveNow": false,
      "panels": [
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "fieldConfig": {
            "defaults": {
              "color": { "mode": "palette-classic" },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 10,
                "gradientMode": "none",
                "hideFrom": { "legend": false, "tooltip": false, "viz": false },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": { "type": "linear" },
                "showPoints": "never",
                "spanNulls": false,
                "stacking": { "group": "A", "mode": "none" },
                "thresholdsStyle": { "mode": "off" }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [ { "color": "green", "value": null } ]
              },
              "unit": "reqps"
            },
            "overrides": []
          },
          "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
          "id": 1,
          "options": {
            "legend": {
              "calcs": ["mean", "max"],
              "displayMode": "table",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": { "mode": "multi", "sort": "desc" }
          },
          "targets": [
            {
              "datasource": { "type": "prometheus", "uid": "prometheus" },
              "expr": "sum(rate(http_requests_total{service=\"${SERVICE_NAME}\"}[5m])) by (status)",
              "legendFormat": "{{status}}",
              "refId": "A"
            }
          ],
          "title": "Request Rate by Status",
          "type": "timeseries"
        },
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "fieldConfig": {
            "defaults": {
              "color": { "mode": "palette-classic" },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 10,
                "gradientMode": "none",
                "hideFrom": { "legend": false, "tooltip": false, "viz": false },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": { "type": "linear" },
                "showPoints": "never",
                "spanNulls": false,
                "stacking": { "group": "A", "mode": "none" },
                "thresholdsStyle": { "mode": "off" }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [ { "color": "green", "value": null } ]
              },
              "unit": "s"
            },
            "overrides": []
          },
          "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
          "id": 2,
          "options": {
            "legend": {
              "calcs": ["mean", "max"],
              "displayMode": "table",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": { "mode": "multi", "sort": "desc" }
          },
          "targets": [
            {
              "datasource": { "type": "prometheus", "uid": "prometheus" },
              "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=\"${SERVICE_NAME}\"}[5m])) by (le))",
              "legendFormat": "p50",
              "refId": "A"
            },
            {
              "datasource": { "type": "prometheus", "uid": "prometheus" },
              "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=\"${SERVICE_NAME}\"}[5m])) by (le))",
              "legendFormat": "p95",
              "refId": "B"
            },
            {
              "datasource": { "type": "prometheus", "uid": "prometheus" },
              "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=\"${SERVICE_NAME}\"}[5m])) by (le))",
              "legendFormat": "p99",
              "refId": "C"
            }
          ],
          "title": "Request Latency Percentiles",
          "type": "timeseries"
        }
      ],
      "refresh": "30s",
      "schemaVersion": 38,
      "style": "dark",
      "tags": ["${SERVICE_NAME}", "http"],
      "templating": { "list": [] },
      "time": { "from": "now-1h", "to": "now" },
      "timepicker": {},
      "timezone": "browser",
      "title": "${SERVICE_NAME} Dashboard",
      "uid": "${SERVICE_NAME}-dashboard",
      "version": 1,
      "weekStart": ""
    }
```
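A single misplaced comma in the embedded JSON will silently break dashboard provisioning, so it can be worth syntax-checking the payload before applying the ConfigMap. A minimal sketch using Python's stdlib `json.tool`; the `dashboard.json` written here is a stand-in for the generated dashboard file:

```shell
# Stand-in for the generated dashboard JSON
echo '{"title": "demo"}' > dashboard.json

# Exits non-zero (and prints an error) on malformed JSON
python3 -m json.tool dashboard.json > /dev/null && echo "valid JSON"
```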
## Patterns & Best Practices
### Alert Severity Levels
| Severity | Response Time | Example |
|---|---|---|
| critical | Immediate (page) | Service down, data loss risk |
| warning | Business hours | High latency, resource pressure |
| info | Next review | Approaching thresholds |
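These severity labels are only useful if something routes on them. A hypothetical Alertmanager routing fragment (the receiver names are placeholders, not part of this skill):

```yaml
# alertmanager.yaml (fragment) -- route alerts by the severity label
route:
  receiver: slack-default          # placeholder default receiver
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall   # critical pages immediately
    - matchers:
        - severity="warning"
      receiver: slack-warnings     # warnings reviewed during business hours
```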
### PromQL Best Practices
- Use recording rules for complex queries
- Always include a `for` duration to avoid flapping
- Use `rate()` for counters, never `increase()`, in alerts
- Include meaningful labels for routing
### Dashboard Design
- Use consistent colors (green=good, red=bad)
- Include time range variables
- Add annotation markers for deployments
- Group related panels
Validation
# Check ServiceMonitor is picked up
kubectl get servicemonitor -n default
# Check PrometheusRules
kubectl get prometheusrules -n cluster-monitoring
# Check targets in Prometheus UI
# http://prometheus:9090/targets
# Check alerts
# http://prometheus:9090/alerts
# Verify dashboard in Grafana
# http://grafana:3000/dashboards
## Common Pitfalls
- **Missing Labels**: `release: prometheus` and `app: kube-prometheus-stack` are required
- **Wrong Namespace**: the ServiceMonitor must be in the same namespace as the service, or use `namespaceSelector`
- **Port Mismatch**: the ServiceMonitor port must match the Service port name
- **Dashboard Not Loading**: check the `grafana_dashboard: "1"` label on the ConfigMap
- **High Cardinality**: avoid labels with unbounded values (user IDs, request IDs)
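A quick way to check for the last pitfall is to ask Prometheus which metric names carry the most series. A hypothetical ad-hoc query for the Prometheus UI (it touches every series, so avoid running it routinely on very large installations):

```promql
topk(10, count by (__name__)({__name__=~".+"}))
```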