# Example: Prometheus/Grafana Monitoring Skill

This is a complete example of a DevOps skill for monitoring setup.

---

```yaml
---
name: monitoring-setup
description: Creates Prometheus alerting rules, ServiceMonitors, and Grafana dashboards. Use when setting up observability for services.
argument-hint: [service-name] [metric-type: http|celery|custom]
allowed-tools: Read, Write, Edit, Glob, Grep
disable-model-invocation: true
---

# Monitoring Setup Generator

Generate production-ready Prometheus alerting rules, ServiceMonitors, and Grafana dashboards for service observability.

## When to Use

- Setting up monitoring for a new service
- Creating custom alerting rules
- Building Grafana dashboards
- Configuring metric scraping

## Prerequisites

- Prometheus Operator installed (kube-prometheus-stack)
- Grafana deployed with dashboard provisioning
- Service exposes `/metrics` endpoint

## Instructions

### Step 1: Parse Arguments

Extract from `$ARGUMENTS`:
- **service-name**: Name of the service to monitor
- **metric-type**: Type of metrics (http, celery, custom)

### Step 2: Generate ServiceMonitor

Create `servicemonitor.yaml`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ${SERVICE_NAME}
  namespace: default
  labels:
    app: ${SERVICE_NAME}
    release: prometheus  # CRITICAL: Required for Prometheus to discover
spec:
  selector:
    matchLabels:
      app: ${SERVICE_NAME}
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
      honorLabels: true
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: 'go_.*'
          action: drop  # Drop unnecessary Go runtime metrics
```
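Steps 1 and 2 can be sketched together in Python: split the argument string, validate the metric type, then fill `${SERVICE_NAME}` in a template. This is only an illustration under stated assumptions. The helper names `parse_arguments` and `render`, the abbreviated template fragment, and the fallback to `http` when no metric type is given are all hypothetical and not part of the skill itself.

```python
from string import Template

VALID_METRIC_TYPES = ("http", "celery", "custom")

# Abbreviated fragment of the ServiceMonitor above, for illustration only.
SERVICEMONITOR_FRAGMENT = """\
metadata:
  name: ${SERVICE_NAME}
  labels:
    app: ${SERVICE_NAME}
"""


def parse_arguments(arguments: str) -> tuple[str, str]:
    """Split "<service-name> [metric-type]" into its two fields."""
    parts = arguments.split()
    if not parts or len(parts) > 2:
        raise ValueError("expected: <service-name> [http|celery|custom]")
    service_name = parts[0]
    # Assumption: metric-type falls back to "http" when omitted.
    metric_type = parts[1] if len(parts) == 2 else "http"
    if metric_type not in VALID_METRIC_TYPES:
        raise ValueError(f"unknown metric type: {metric_type}")
    return service_name, metric_type


def render(template_text: str, service_name: str) -> str:
    """Fill ${SERVICE_NAME}; safe_substitute leaves other $-tokens untouched."""
    return Template(template_text).safe_substitute(SERVICE_NAME=service_name)


service, metric_type = parse_arguments("checkout-api http")
print(render(SERVICEMONITOR_FRAGMENT, service))
```

`safe_substitute` is used rather than `substitute` so that Prometheus template tokens such as `{{ $labels.service }}` in the alerting rules pass through unchanged instead of raising a `KeyError`.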

### Step 3: Generate PrometheusRules

Create `prometheusrules.yaml`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ${SERVICE_NAME}-alerts
  namespace: cluster-monitoring
  labels:
    app: kube-prometheus-stack  # CRITICAL: Required label
    release: prometheus  # CRITICAL: Required label
spec:
  groups:
    - name: ${SERVICE_NAME}.availability
      rules:
        # High Error Rate
        # Aggregating by (service) keeps the label that {{ $labels.service }} reads in the annotations
        - alert: ${SERVICE_NAME}HighErrorRate
          expr: |
            sum(rate(http_requests_total{service="${SERVICE_NAME}",status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total{service="${SERVICE_NAME}"}[5m])) by (service) > 0.05
          for: 5m
          labels:
            severity: critical
            service: ${SERVICE_NAME}
          annotations:
            summary: "High error rate on {{ $labels.service }}"
            description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
            runbook_url: "https://runbooks.example.com/${SERVICE_NAME}/high-error-rate"

        # High Latency
        - alert: ${SERVICE_NAME}HighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{service="${SERVICE_NAME}"}[5m])) by (le, service)
            ) > 1
          for: 5m
          labels:
            severity: warning
            service: ${SERVICE_NAME}
          annotations:
            summary: "High latency on {{ $labels.service }}"
            description: "P95 latency is {{ $value | humanizeDuration }} (threshold: 1s)"

        # Service Down
        - alert: ${SERVICE_NAME}Down
          expr: |
            up{job="${SERVICE_NAME}"} == 0
          for: 2m
          labels:
            severity: critical
            service: ${SERVICE_NAME}
          annotations:
            summary: "{{ $labels.service }} is down"
            description: "{{ $labels.instance }} has been down for more than 2 minutes"

    - name: ${SERVICE_NAME}.resources
      rules:
        # High Memory Usage
        - alert: ${SERVICE_NAME}HighMemoryUsage
          expr: |
            container_memory_working_set_bytes{container="${SERVICE_NAME}"}
            / container_spec_memory_limit_bytes{container="${SERVICE_NAME}"} > 0.85
          for: 5m
          labels:
            severity: warning
            service: ${SERVICE_NAME}
          annotations:
            summary: "High memory usage on {{ $labels.pod }}"
            description: "Memory usage is {{ $value | humanizePercentage }} of limit"

        # High CPU Usage
        - alert: ${SERVICE_NAME}HighCPUUsage
          expr: |
            sum(rate(container_cpu_usage_seconds_total{container="${SERVICE_NAME}"}[5m])) by (pod)
            / sum(container_spec_cpu_quota{container="${SERVICE_NAME}"}/container_spec_cpu_period{container="${SERVICE_NAME}"}) by (pod) > 0.85
          for: 5m
          labels:
            severity: warning
            service: ${SERVICE_NAME}
          annotations:
            summary: "High CPU usage on {{ $labels.pod }}"
            description: "CPU usage is {{ $value | humanizePercentage }} of limit"

        # Pod Restarts
        - alert: ${SERVICE_NAME}PodRestarts
          expr: |
            increase(kube_pod_container_status_restarts_total{container="${SERVICE_NAME}"}[1h]) > 3
          for: 5m
          labels:
            severity: warning
            service: ${SERVICE_NAME}
          annotations:
            summary: "{{ $labels.pod }} is restarting frequently"
            description: "{{ $value }} restarts in the last hour"
```
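To see how the `> 0.05` threshold and `for: 5m` interact in the error-rate alert, the following Python sketch models the alert state at each rule evaluation. It is a simplified model, not Prometheus code: the function name `alert_states`, the 30-second evaluation interval, and the sample ratios are assumptions made for illustration.

```python
def alert_states(ratios, threshold=0.05, for_seconds=300, eval_interval=30):
    """Return the alert state after each evaluation of an error ratio.

    The alert is "pending" while the ratio has been above the threshold for
    less than `for_seconds`, "firing" once it has held that long, and
    "inactive" (with the timer reset) whenever the ratio drops back.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for i, ratio in enumerate(ratios):
        now = i * eval_interval
        if ratio > threshold:
            if active_since is None:
                active_since = now
            held_for = now - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            active_since = None
            states.append("inactive")
    return states


# 10% errors sustained: pending for ten evaluations, firing on the eleventh.
print(alert_states([0.10] * 11))

# A single healthy evaluation resets the timer, so this never fires.
print(alert_states([0.10] * 5 + [0.01] + [0.10] * 5))
```

This is why a short error spike does not page anyone: the ratio has to stay above 5% for the full five minutes before the state moves from pending to firing.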

### Step 4: Generate Recording Rules

Create `recordingrules.yaml`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ${SERVICE_NAME}-recording
  namespace: cluster-monitoring
  labels:
    app: kube-prometheus-stack
    release: prometheus
spec:
  groups:
    - name: ${SERVICE_NAME}.recording
      interval: 30s
      rules:
        # Record names must be valid metric names, which cannot contain "-":
        # replace hyphens in the service name with underscores in the `record` fields.

        # Request Rate
        - record: ${SERVICE_NAME}:http_requests:rate5m
          expr: |
            sum(rate(http_requests_total{service="${SERVICE_NAME}"}[5m])) by (status, method, path)

        # Error Rate
        - record: ${SERVICE_NAME}:http_errors:rate5m
          expr: |
            sum(rate(http_requests_total{service="${SERVICE_NAME}",status=~"5.."}[5m]))

        # Latency Percentiles
        - record: ${SERVICE_NAME}:http_latency_p50:5m
          expr: |
            histogram_quantile(0.50,
              sum(rate(http_request_duration_seconds_bucket{service="${SERVICE_NAME}"}[5m])) by (le))

        - record: ${SERVICE_NAME}:http_latency_p95:5m
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{service="${SERVICE_NAME}"}[5m])) by (le))

        - record: ${SERVICE_NAME}:http_latency_p99:5m
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{service="${SERVICE_NAME}"}[5m])) by (le))
```
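The latency rules above depend on `histogram_quantile`, which estimates a percentile from cumulative `le` buckets by linear interpolation inside the bucket that contains the target rank. The Python sketch below is a simplified model of that idea for classic histograms with non-negative bounds. It is not the Prometheus implementation, and the bucket values are invented for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0.0  # first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical buckets: 50 requests <= 0.1s, 90 <= 0.5s, 98 <= 1s, 100 in total.
example = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, example))
```

With these buckets the 95th-percentile rank (95 of 100) lands in the 0.5s to 1s bucket, so the estimate is interpolated between those two bounds. The result is only as precise as the bucket boundaries, which is worth remembering when choosing the `> 1` latency threshold.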

### Step 5: Generate Grafana Dashboard

Create `dashboard-configmap.yaml`:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ${SERVICE_NAME}-dashboard
  namespace: cluster-monitoring
  labels:
    grafana_dashboard: "1"  # CRITICAL: Required for Grafana to discover
data:
  ${SERVICE_NAME}-dashboard.json: |
    {
      "annotations": {
        "list": []
      },
      "editable": true,
      "fiscalYearStartMonth": 0,
      "graphTooltip": 0,
      "id": null,
      "links": [],
      "liveNow": false,
      "panels": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "prometheus"
          },
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 10,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "never",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  }
                ]
              },
              "unit": "reqps"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 0
          },
          "id": 1,
          "options": {
            "legend": {
              "calcs": ["mean", "max"],
              "displayMode": "table",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "multi",
              "sort": "desc"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prometheus"
              },
              "expr": "sum(rate(http_requests_total{service=\"${SERVICE_NAME}\"}[5m])) by (status)",
              "legendFormat": "{{status}}",
              "refId": "A"
            }
          ],
          "title": "Request Rate by Status",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "prometheus"
          },
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 10,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "never",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  }
                ]
              },
              "unit": "s"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 0
          },
          "id": 2,
          "options": {
            "legend": {
              "calcs": ["mean", "max"],
              "displayMode": "table",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "multi",
              "sort": "desc"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prometheus"
              },
              "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=\"${SERVICE_NAME}\"}[5m])) by (le))",
              "legendFormat": "p50",
              "refId": "A"
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prometheus"
              },
              "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=\"${SERVICE_NAME}\"}[5m])) by (le))",
              "legendFormat": "p95",
              "refId": "B"
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prometheus"
              },
              "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=\"${SERVICE_NAME}\"}[5m])) by (le))",
              "legendFormat": "p99",
              "refId": "C"
            }
          ],
          "title": "Request Latency Percentiles",
          "type": "timeseries"
        }
      ],
      "refresh": "30s",
      "schemaVersion": 38,
      "style": "dark",
      "tags": ["${SERVICE_NAME}", "http"],
      "templating": {
        "list": []
      },
      "time": {
        "from": "now-1h",
        "to": "now"
      },
      "timepicker": {},
      "timezone": "browser",
      "title": "${SERVICE_NAME} Dashboard",
      "uid": "${SERVICE_NAME}-dashboard",
      "version": 1,
      "weekStart": ""
    }
```

## Patterns & Best Practices

### Alert Severity Levels

| Severity | Response Time | Example |
|----------|---------------|---------|
| critical | Immediate (page) | Service down, data loss risk |
| warning | Business hours | High latency, resource pressure |
| info | Next review | Approaching thresholds |

### PromQL Best Practices

- Use recording rules for complex queries
- Always include a `for` duration to avoid flapping
- Wrap counters in `rate()` or `increase()`; never alert on a raw counter value
- Include meaningful labels for routing

### Dashboard Design

- Use consistent colors (green=good, red=bad)
- Include time range variables
- Add annotation markers for deployments
- Group related panels

## Validation

```bash
# Check the ServiceMonitor exists (scraping is confirmed on the targets page below)
kubectl get servicemonitor -n default

# Check the PrometheusRules exist
kubectl get prometheusrules -n cluster-monitoring

# Check targets in Prometheus UI
# http://prometheus:9090/targets

# Check alerts
# http://prometheus:9090/alerts

# Verify dashboard in Grafana
# http://grafana:3000/dashboards
```

## Common Pitfalls

- **Missing Labels**: `release: prometheus` and `app: kube-prometheus-stack` must match the selectors on your Prometheus resource; the `release` value is typically the Helm release name
- **Wrong Namespace**: ServiceMonitor must be in the same namespace as the Service, or list the Service's namespace in `namespaceSelector`
- **Port Mismatch**: ServiceMonitor `port` must match the port name in the Service spec
- **Dashboard Not Loading**: Check the `grafana_dashboard: "1"` label on the ConfigMap
- **High Cardinality**: Avoid labels with unbounded values (user IDs, request IDs)
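
As a rough worked example of the cardinality pitfall, the number of time series a metric can produce is bounded by the product of the distinct values of each label. The label counts below are hypothetical, chosen only to show the arithmetic; `max_series` is an illustrative helper, not a Prometheus function.

```python
import math


def max_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on series: the product of distinct values per label."""
    return math.prod(label_value_counts.values())


# Hypothetical counts for the labels used by the request-rate recording rule.
bounded = {"status": 5, "method": 4, "path": 50}
print(max_series(bounded))  # 1000

# Adding one unbounded label multiplies everything by its value count.
with_user_id = {**bounded, "user_id": 10_000}
print(max_series(with_user_id))  # 10000000
```

A single unbounded label turns a thousand series into ten million, which is why IDs belong in logs or traces rather than in metric labels.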
```