A Complete Guide to Prometheus, Grafana, and ServiceMonitors
How to deploy enterprise-grade monitoring for your Kubernetes workloads using Helm, the Prometheus Operator, and Azure Kubernetes Service
Why Monitoring Matters More Than Ever
In today’s cloud-native landscape, observability isn’t just a nice-to-have — it’s mission-critical. When your applications are distributed across multiple containers, pods, and nodes, understanding what’s happening inside your Kubernetes cluster becomes exponentially more complex. Without proper monitoring, you’re essentially flying blind.
I’ve spent countless hours debugging production issues that could have been prevented with proper monitoring in place. Today, I’m going to walk you through building a robust, production-ready monitoring stack on Azure Kubernetes Service (AKS) using Prometheus and Grafana — the de facto standards for Kubernetes monitoring.
What We’re Building
By the end of this guide, you’ll have:
- Prometheus collecting metrics from your entire Kubernetes cluster
- Grafana providing beautiful, actionable dashboards
- ServiceMonitor resources for automatic service discovery
- AlertManager for intelligent alerting
- Persistent storage to retain your monitoring data
- Production-ready configuration with proper security and scaling
The best part? Everything will be managed through Helm charts, making it reproducible and maintainable.
The Problem with DIY Monitoring
Before we dive in, let me share why this approach matters. Early in my career, I tried setting up Prometheus manually — writing custom ConfigMaps, managing discovery rules by hand, and wrestling with RBAC permissions. It was a nightmare to maintain.
The Prometheus Operator changed everything. It introduces custom Kubernetes resources like ServiceMonitor and PrometheusRule that make monitoring configuration declarative and GitOps-friendly. Instead of editing ConfigMaps, you define what you want to monitor using Kubernetes-native resources.
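Once the stack from Step 4 is installed, you can see these custom resource definitions for yourself; the operator registers them under the monitoring.coreos.com API group:

# List the operator's custom resource definitions
kubectl get crds | grep monitoring.coreos.com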
Prerequisites: What You’ll Need
Before we start, make sure you have:
- Azure CLI installed and configured with your subscription
- kubectl for Kubernetes cluster management
- Helm 3.x for package management
- An Azure subscription with permissions to create AKS clusters
If you’re missing any of these, the official documentation for each tool provides excellent installation guides.
Step 1: Creating Your AKS Cluster
Let’s start by creating a properly configured AKS cluster. I’m using specific settings that work well for monitoring workloads:
# Set up our environment variables
RESOURCE_GROUP="rg-monitoring"
CLUSTER_NAME="aks-monitoring"
LOCATION="East US"# Create the resource group
az group create --name $RESOURCE_GROUP --location "$LOCATION"# Create the AKS cluster with monitoring addon enabled
az aks create \
--resource-group $RESOURCE_GROUP \
--name $CLUSTER_NAME \
--node-count 2\
--node-vm-size Standard_DS2_v2 \
--enable-addons monitoring \
--generate-ssh-keys# Configure kubectl to use our new cluster
az aks get-credentials --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME
Why these settings matter:
- 2 nodes: Provides redundancy for our monitoring stack
- Standard_DS2_v2: Enough resources for Prometheus and Grafana
- monitoring addon: Enables Azure Monitor integration (bonus observability!)
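Before moving on, it's worth confirming that kubectl is actually pointed at the new cluster and that both nodes are ready:

# Sanity check: kubectl should now point at the new cluster
kubectl config current-context

# Both nodes should report Ready
kubectl get nodes -o wide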
Step 2: Setting Up Helm Repositories
Helm makes deploying complex applications like Prometheus incredibly simple. We’ll add the official repositories:
# Add the Prometheus community repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

# Add the Grafana repository
helm repo add grafana https://grafana.github.io/helm-charts

# Update to get the latest charts
helm repo update
The prometheus-community/kube-prometheus-stack chart is a game-changer. It includes everything we need: Prometheus, Grafana, AlertManager, node exporters, and the Prometheus Operator.
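If you want to see what you're getting before installing, Helm can list the available chart versions and the (very large) set of values you can override:

# List available chart versions
helm search repo prometheus-community/kube-prometheus-stack

# Browse the configurable values
helm show values prometheus-community/kube-prometheus-stack | less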
Step 3: Preparing the Monitoring Namespace
Organisation is key in Kubernetes. Let’s create a dedicated namespace for our monitoring stack:
# Create the monitoring namespace
kubectl create namespace monitoring

# Set it as our default to save typing
kubectl config set-context --current --namespace=monitoring

This separation provides better security boundaries and makes resource management easier.
Step 4: Installing the Prometheus Stack
Here’s where the magic happens. This single Helm command deploys our entire monitoring stack:
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.accessModes[0]=ReadWriteOnce \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=20Gi \
  --set grafana.persistence.enabled=true \
  --set grafana.persistence.size=10Gi

Let me break down these critical settings:
- serviceMonitorSelectorNilUsesHelmValues=false: This is crucial! It lets Prometheus discover any ServiceMonitor in the cluster, not just the ones carrying the Helm release label the chart applies by default.
- retention=30d: Keeps 30 days of metrics data
- storage=20Gi: Persistent storage for Prometheus data
- grafana.persistence.enabled=true: Ensures Grafana dashboards and settings survive pod restarts
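If you prefer keeping configuration in Git rather than a long chain of --set flags, the same settings can be expressed as a values file. This is a sketch only; the file name monitoring-values.yaml is my own choice:

# Write the values file
cat > monitoring-values.yaml <<'EOF'
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi
grafana:
  persistence:
    enabled: true
    size: 10Gi
EOF

# Install (or upgrade) using the file instead of --set flags
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring -f monitoring-values.yaml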
Step 5: Verifying Your Installation
Let’s make sure everything was deployed correctly:
# Check that all pods are running
kubectl get pods -n monitoring

# List the services that were created
kubectl get svc -n monitoring

# See what ServiceMonitors are already configured
kubectl get servicemonitors -n monitoring
You should see pods for Prometheus, Grafana, AlertManager, and various exporters, all in Running state. If any pods are stuck in Pending or CrashLoopBackOff, check the logs with kubectl logs <pod-name> -n monitoring.
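If the logs alone don't explain a stuck pod, the describe output and recent namespace events usually do:

# Show scheduling, volume, or image-pull problems for a specific pod
kubectl describe pod <pod-name> -n monitoring

# Recent events in the namespace, newest last
kubectl get events -n monitoring --sort-by=.lastTimestamp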
Step 6: Understanding ServiceMonitors
This is where ServiceMonitors shine. Instead of manually configuring Prometheus scrape targets, you create ServiceMonitor resources that automatically discover services to monitor.
Here’s an example ServiceMonitor for a hypothetical application:
# servicemonitor-example.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-servicemonitor
  namespace: monitoring
  labels:
    app: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
  namespaceSelector:
    matchNames:
    - default
    - my-app-namespace

Key concepts:
- selector: Matches services with specific labels
- endpoints: Defines which port and path to scrape
- namespaceSelector: Controls which namespaces to search in
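One gotcha worth calling out: port: metrics refers to the name of a port on the Service, not a port number, and the Service must carry labels that match the selector. As a sketch, a hypothetical Service for my-app (assuming pods labelled app: my-app already exist; the 9090 port is a placeholder) would need to look something like this:

# Hypothetical Service for my-app; the port *name* is what the ServiceMonitor references
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: default
  labels:
    app: my-app          # matched by the ServiceMonitor's selector
spec:
  selector:
    app: my-app
  ports:
  - name: metrics        # matched by the ServiceMonitor endpoint's `port: metrics`
    port: 9090
    targetPort: 9090
EOF

With those pieces matched up, the ServiceMonitor itself is ready to go.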
Apply it with:
kubectl apply -f servicemonitor-example.yaml

Step 7: Accessing Your Monitoring Stack
Now for the moment of truth — accessing our monitoring tools. For development, port-forwarding is the quickest way:
# Access Prometheus (background process)
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring &

# Access Grafana (background process)
kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring &

# Access AlertManager (background process)
kubectl port-forward svc/prometheus-kube-prometheus-alertmanager 9093:9093 -n monitoring &
Now you can access:
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000
- AlertManager: http://localhost:9093
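Before opening a browser, a quick curl confirms the servers are actually responding on the forwarded ports (both health endpoints are built into the respective servers):

# Prometheus health endpoint
curl -s http://localhost:9090/-/healthy

# Grafana health endpoint
curl -s http://localhost:3000/api/health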
Step 8: Logging into Grafana
Grafana generates a random admin password during installation. Here’s how to retrieve it:
# Get the admin password
kubectl get secret prometheus-grafana -n monitoring -o jsonpath="{.data.admin-password}" | base64 --decode; echo

Log in with:
- Username: admin
- Password: (the output from the command above)
Step 9: Exploring Pre-Built Dashboards
One of Grafana’s biggest advantages is its ecosystem of pre-built dashboards. Navigate to Dashboards → Browse to see what’s already available. You’ll find dashboards for:
- Kubernetes cluster overview
- Node metrics
- Pod resource usage
- Persistent volume monitoring
These dashboards are production-ready and provide immediate value.
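Under the hood (with the chart's default settings), these dashboards are shipped as ConfigMaps that a Grafana sidecar watches for by label, which is also the mechanism you can use to ship your own JSON dashboards. Treat the label key as an assumption to verify against your chart version:

# List the ConfigMaps the Grafana dashboard sidecar picks up
kubectl get configmaps -n monitoring -l grafana_dashboard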
Step 10: Creating a Sample Application
Let’s deploy a simple application to see ServiceMonitor discovery in action:
# sample-app.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      containers:
      - name: sample-app
        image: prom/node-exporter:latest
        ports:
        - containerPort: 9100
          name: metrics
---
apiVersion: v1
kind: Service
metadata:
  name: sample-app-service
  namespace: default
  labels:
    app: sample-app
spec:
  selector:
    app: sample-app
  ports:
  - name: metrics
    port: 9100
    targetPort: 9100
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sample-app-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: sample-app
  endpoints:
  - port: metrics
    interval: 30s
  namespaceSelector:
    matchNames:
    - default

Deploy it:
kubectl apply -f sample-app.yaml

Step 11: Verifying Automatic Discovery
Here’s the beautiful part — within 30 seconds, Prometheus should automatically discover your new application. Check this by:
- Opening Prometheus at http://localhost:9090
- Going to Status → Targets
- Looking for your sample-app-monitor target
If it shows as “UP”, congratulations! You’ve just experienced the power of ServiceMonitor-based discovery.
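If the target doesn't appear, two quick checks usually narrow the problem down:

# Does the Service actually have endpoints behind it?
kubectl get endpoints sample-app-service -n default

# Does the ServiceMonitor exist where Prometheus expects it?
kubectl get servicemonitor sample-app-monitor -n monitoring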
Step 12: Setting Up Production Access
Port-forwarding is great for development, but production needs proper ingress. Here’s how to expose services using LoadBalancer:
# prometheus-loadbalancer.yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus-external
  namespace: monitoring
spec:
  type: LoadBalancer
  ports:
  - port: 9090
    targetPort: 9090
  selector:
    app.kubernetes.io/name: prometheus
    prometheus: prometheus-kube-prometheus-prometheus
---
apiVersion: v1
kind: Service
metadata:
  name: grafana-external
  namespace: monitoring
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 3000
  selector:
    app.kubernetes.io/name: grafana

Apply and get external IPs:
kubectl apply -f prometheus-loadbalancer.yaml
kubectl get svc -n monitoring | grep LoadBalancer

Production tip: In real environments, consider using ingress controllers with TLS termination and authentication instead of direct LoadBalancer exposure.
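Even if you stick with a LoadBalancer, you can at least limit who can reach it. A minimal sketch, using the documentation range 203.0.113.0/24 as a placeholder for your own trusted CIDR:

# Restrict the external Prometheus Service to a trusted CIDR (placeholder range)
kubectl patch svc prometheus-external -n monitoring \
  -p '{"spec":{"loadBalancerSourceRanges":["203.0.113.0/24"]}}'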
Step 13: Adding Custom Alerts
Monitoring without alerting is just expensive logging. Let’s add some intelligent alerts:
# custom-alert-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-alerts
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
  - name: custom.rules
    rules:
    - alert: HighCPUUsage
      expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High CPU usage detected"
        description: "CPU usage is above 80% for more than 5 minutes"
    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Pod is crash looping"
        description: "Pod {{ $labels.pod }} is restarting frequently"

Apply the rules:
kubectl apply -f custom-alert-rules.yaml

These alerts will trigger when CPU usage exceeds 80% or when pods start crashing.
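To confirm the rules were actually picked up, check that the object exists and then look under Status → Rules in the Prometheus UI. One caveat: depending on your chart version and values, Prometheus may only select PrometheusRules carrying the Helm release label (e.g. release: prometheus) unless ruleSelectorNilUsesHelmValues is set to false, so adjust the labels above if the rules never appear.

# Confirm the PrometheusRule object exists
kubectl get prometheusrules -n monitoring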
Best Practices I’ve Learned
Through years of running Prometheus in production, here are some hard-earned lessons:
1. Resource Planning
Prometheus can be memory-hungry. Monitor your Prometheus pod’s resource usage and adjust requests/limits accordingly. A good starting point is 2GB RAM and 1 CPU core.
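Those requests can be set through the same chart values we used at install time. A sketch only; the limit figures are arbitrary starting values, not recommendations:

helm upgrade prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --reuse-values \
  --set prometheus.prometheusSpec.resources.requests.memory=2Gi \
  --set prometheus.prometheusSpec.resources.requests.cpu=1 \
  --set prometheus.prometheusSpec.resources.limits.memory=4Gi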
2. Retention Strategy
Don’t store metrics forever. 30 days is usually sufficient for most use cases. For longer-term storage, consider remote write to Azure Monitor or other TSDB solutions.
3. Label Hygiene
Be careful with high-cardinality labels. Labels like user IDs or request IDs can explode your metric cardinality and kill Prometheus performance.
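A quick way to see which metric names contribute the most series is to ask Prometheus itself (run while the port-forward from Step 7 is active):

# Top 10 metric names by series count
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=topk(10, count by (__name__)({__name__=~".+"}))'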
4. ServiceMonitor Organisation
Keep ServiceMonitors in the monitoring namespace for better organisation and RBAC control.
5. Backup Strategy
While we’ve enabled persistent storage, consider also backing up Grafana dashboards and Prometheus rules to Git repositories.
Troubleshooting Common Issues
ServiceMonitor Not Discovered
This is the most common issue I see. Check:
- Service labels match the ServiceMonitor selector
- The namespace selector is correct
- Prometheus operator has permissions to read the ServiceMonitor
Grafana Dashboards Show No Data
Usually, a data source issue:
- Verify the Prometheus data source URL in Grafana
- Check if metrics are being scraped in Prometheus
- Verify time range settings
High Memory Usage
Prometheus' memory usage is directly related to the number of series it’s scraping:
- Review your metric cardinality
- Consider reducing scrape intervals
- Implement metric relabeling to drop unnecessary metrics (see the sketch below)
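As a sketch of that last point, metricRelabelings on a ServiceMonitor endpoint can drop series before they are stored; here the Step 10 monitor is extended to drop Go garbage-collector metrics (the regex is just an example):

# Drop go_gc_* metrics from the sample-app scrape before ingestion
cat <<'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sample-app-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: sample-app
  endpoints:
  - port: metrics
    interval: 30s
    metricRelabelings:
    - sourceLabels: [__name__]
      regex: go_gc_.*
      action: drop
  namespaceSelector:
    matchNames:
    - default
EOF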
Production Considerations
Before taking this to production, consider:
Security
- Enable TLS for all communications
- Implement proper RBAC policies
- Use network policies to restrict traffic
- Enable audit logging
High Availability
- Run multiple Prometheus replicas
- Use Grafana clustering
- Implement proper backup strategies
Scaling
- Monitor Prometheus resource usage
- Consider federation for very large clusters
- Use recording rules for expensive queries (see the sketch below)
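For the recording-rules point, here is a minimal sketch reusing the PrometheusRule labels from Step 13 (the rule name and expression are illustrative, not prescriptive):

# Pre-compute per-instance CPU utilisation so dashboards query a cheap series
cat <<'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: recording-rules
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
  - name: recording.rules
    rules:
    - record: instance:node_cpu_utilisation:rate5m
      expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
EOF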
Integration
- Connect AlertManager to your incident management system
- Implement proper notification channels (Slack, PagerDuty, etc.)
- Set up escalation policies
Wrapping Up
Congratulations! You’ve just built a production-ready monitoring stack that rivals what you’d find at any major tech company. The combination of Prometheus, Grafana, and ServiceMonitors provides incredible power and flexibility for monitoring Kubernetes workloads.
Have questions about implementing this in your environment? Found this helpful? Drop a comment below or connect with me on LinkedIn. I love discussing observability and sharing experiences from the trenches.