
A Complete Guide to Prometheus, Grafana, and ServiceMonitors

May 25, 2025 · 9 min read

How to deploy enterprise-grade monitoring for your Kubernetes workloads using Helm, the Prometheus Operator, and Azure Kubernetes Service


Why Monitoring Matters More Than Ever

In today’s cloud-native landscape, observability isn’t just a nice-to-have — it’s mission-critical. When your applications are distributed across multiple containers, pods, and nodes, understanding what’s happening inside your Kubernetes cluster becomes exponentially more complex. Without proper monitoring, you’re essentially flying blind.

I’ve spent countless hours debugging production issues that could have been prevented with proper monitoring in place. Today, I’m going to walk you through building a robust, production-ready monitoring stack on Azure Kubernetes Service (AKS) using Prometheus and Grafana — the de facto standards for Kubernetes monitoring.

What We’re Building

By the end of this guide, you’ll have:

  • Prometheus collecting metrics from your entire Kubernetes cluster
  • Grafana providing beautiful, actionable dashboards
  • ServiceMonitor resources for automatic service discovery
  • AlertManager for intelligent alerting
  • Persistent storage to retain your monitoring data
  • Production-ready configuration with proper security and scaling

The best part? Everything will be managed through Helm charts, making it reproducible and maintainable.

The Problem with DIY Monitoring

Before we dive in, let me share why this approach matters. Early in my career, I tried setting up Prometheus manually — writing custom ConfigMaps, managing discovery rules by hand, and wrestling with RBAC permissions. It was a nightmare to maintain.

The Prometheus Operator changed everything. It introduces custom Kubernetes resources like ServiceMonitor and PrometheusRule that make monitoring configuration declarative and GitOps-friendly. Instead of editing ConfigMaps, you define what you want to monitor using Kubernetes-native resources.
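
Once the operator is installed (we'll do that in Step 4), you can see these custom resource definitions for yourself. A quick check, assuming the kube-prometheus-stack chart is already deployed:

# List the CRDs the Prometheus Operator registers
kubectl get crd | grep monitoring.coreos.com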

Prerequisites: What You’ll Need

Before we start, make sure you have:

  • Azure CLI installed and configured with your subscription
  • kubectl for Kubernetes cluster management
  • Helm 3.x for package management
  • An Azure subscription with permissions to create AKS clusters

If you’re missing any of these, the official documentation for each tool provides excellent installation guides.
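
A quick sanity check, assuming the tools are already on your PATH:

# Verify the prerequisite tooling
az --version
kubectl version --client
helm version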

Step 1: Creating Your AKS Cluster

Let’s start by creating a properly configured AKS cluster. I’m using specific settings that work well for monitoring workloads:

# Set up our environment variables
RESOURCE_GROUP="rg-monitoring"
CLUSTER_NAME="aks-monitoring"
LOCATION="East US"
# Create the resource group
az group create --name $RESOURCE_GROUP --location "$LOCATION"
# Create the AKS cluster with monitoring addon enabled
az aks create \
--resource-group $RESOURCE_GROUP \
--name $CLUSTER_NAME \
--node-count 2 \
--node-vm-size Standard_DS2_v2 \
--enable-addons monitoring \
--generate-ssh-keys
# Configure kubectl to use our new cluster
az aks get-credentials --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME

Why these settings matter:

  • 2 nodes: Provides redundancy for our monitoring stack
  • Standard_DS2_v2: Enough resources for Prometheus and Grafana
  • monitoring addon: Enables Azure Monitor integration (bonus observability!)

Step 2: Setting Up Helm Repositories

Helm makes deploying complex applications like Prometheus incredibly simple. We’ll add the official repositories:

# Add the Prometheus community repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
# Add the Grafana repository  
helm repo add grafana https://grafana.github.io/helm-charts
# Update to get the latest charts
helm repo update

The prometheus-community/kube-prometheus-stack chart is a game-changer. It includes everything we need: Prometheus, Grafana, AlertManager, node exporters, and the Prometheus Operator.

Step 3: Preparing the Monitoring Namespace

Organisation is key in Kubernetes. Let’s create a dedicated namespace for our monitoring stack:

# Create the monitoring namespace
kubectl create namespace monitoring
# Set it as our default to save typing
kubectl config set-context --current --namespace=monitoring

This separation provides better security boundaries and makes resource management easier.

Step 4: Installing the Prometheus Stack

Here’s where the magic happens. This single Helm command deploys our entire monitoring stack:

helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
--set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false \
--set prometheus.prometheusSpec.retention=30d \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.accessModes[0]=ReadWriteOnce \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=20Gi \
--set grafana.persistence.enabled=true \
--set grafana.persistence.size=10Gi

Let me break down these critical settings (a values-file alternative follows the list):

  • serviceMonitorSelectorNilUsesHelmValues=false: This is crucial! It lets Prometheus pick up ServiceMonitors from any namespace, not just those created with the chart's release label.
  • retention=30d: Keeps 30 days of metrics data
  • storage=20Gi: Persistent storage for Prometheus data
  • grafana.persistence.enabled=true: Ensures Grafana dashboards and settings survive pod restarts
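
If the long chain of --set flags gets unwieldy, the same configuration can live in a values file. A minimal sketch that mirrors the flags above (the file name values-monitoring.yaml is just an example):

# values-monitoring.yaml
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi
grafana:
  persistence:
    enabled: true
    size: 10Gi

You could then install or upgrade with helm upgrade --install prometheus prometheus-community/kube-prometheus-stack -n monitoring -f values-monitoring.yaml, which keeps the configuration versionable in Git.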

Step 5: Verifying Your Installation

Let’s make sure everything was deployed correctly:

# Check that all pods are running
kubectl get pods -n monitoring
# List the services that were created
kubectl get svc -n monitoring
# See what ServiceMonitors are already configured
kubectl get servicemonitors -n monitoring

You should see pods for Prometheus, Grafana, AlertManager, and various exporters, all in the Running state. If any pods are stuck in Pending or CrashLoopBackOff, check the logs with kubectl logs <pod-name> -n monitoring.

Step 6: Understanding ServiceMonitors

This is where ServiceMonitors shine. Instead of manually configuring Prometheus scrape targets, you create ServiceMonitor resources that automatically discover services to monitor.

Here’s an example ServiceMonitor for a hypothetical application:

# servicemonitor-example.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-servicemonitor
  namespace: monitoring
  labels:
    app: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
  namespaceSelector:
    matchNames:
      - default
      - my-app-namespace

Key concepts:

  • selector: Matches services with specific labels
  • endpoints: Defines which port and path to scrape
  • namespaceSelector: Controls which namespaces to search in

Apply it with:

kubectl apply -f servicemonitor-example.yaml
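
Remember that a ServiceMonitor selects Services (not Pods), and the port name in endpoints must match a named port on that Service. A minimal Service the selector above would match might look like this (the my-app names are hypothetical):

# my-app-service.yaml (hypothetical example)
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: my-app-namespace
  labels:
    app: my-app          # matched by the ServiceMonitor's selector
spec:
  selector:
    app: my-app
  ports:
    - name: metrics      # must match the port name in the ServiceMonitor endpoint
      port: 8080
      targetPort: 8080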

Step 7: Accessing Your Monitoring Stack

Now for the moment of truth — accessing our monitoring tools. For development, port-forwarding is the quickest way:

# Access Prometheus (background process)
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring &
# Access Grafana (background process)  
kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring &
# Access AlertManager (background process)
kubectl port-forward svc/prometheus-kube-prometheus-alertmanager 9093:9093 -n monitoring &

Now you can access:

  • Prometheus at http://localhost:9090
  • Grafana at http://localhost:3000
  • AlertManager at http://localhost:9093

Please go through this comprehensive guide on how to effectively visualise and interpret data in Grafana for Kubernetes monitoring.

Step 8: Logging into Grafana

Grafana generates a random admin password during installation. Here’s how to retrieve it:

# Get the admin password
kubectl get secret prometheus-grafana -n monitoring -o template='{{.data.admin-password | base64decode}}'

Log in with:

  • Username: admin
  • Password: (the output from the command above)
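
If you prefer to choose the admin password yourself, the chart exposes grafana.adminPassword. A sketch of how you might set it (the value shown is a placeholder; use a secret manager in real environments, and note that with persistence enabled it is simplest to decide this before the first install):

# Choose the Grafana admin password yourself (placeholder value shown)
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --reuse-values \
  --set grafana.adminPassword='ChangeMe123!'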

Step 9: Exploring Pre-Built Dashboards

One of Grafana’s biggest advantages is its ecosystem of pre-built dashboards. Navigate to Dashboards → Browse to see what’s already available. You’ll find dashboards for:

  • Kubernetes cluster overview
  • Node metrics
  • Pod resource usage
  • Persistent volume monitoring

These dashboards are production-ready and provide immediate value.
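
You can also add your own dashboards declaratively. The kube-prometheus-stack deploys Grafana with a sidecar that, by default, watches for ConfigMaps labelled grafana_dashboard and loads the dashboard JSON they contain. A minimal sketch, assuming you have exported a dashboard as my-dashboard.json:

# Create a ConfigMap that the Grafana dashboard sidecar will pick up
kubectl create configmap my-dashboard \
  --namespace monitoring \
  --from-file=my-dashboard.json
kubectl label configmap my-dashboard \
  --namespace monitoring \
  grafana_dashboard=1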

Step 10: Creating a Sample Application

Let’s deploy a simple application to see ServiceMonitor discovery in action:

# sample-app.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      containers:
        - name: sample-app
          image: prom/node-exporter:latest
          ports:
            - containerPort: 9100
              name: metrics
---
apiVersion: v1
kind: Service
metadata:
  name: sample-app-service
  namespace: default
  labels:
    app: sample-app
spec:
  selector:
    app: sample-app
  ports:
    - name: metrics
      port: 9100
      targetPort: 9100
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sample-app-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: sample-app
  endpoints:
    - port: metrics
      interval: 30s
  namespaceSelector:
    matchNames:
      - default

Deploy it:

kubectl apply -f sample-app.yaml

Step 11: Verifying Automatic Discovery

Here’s the beautiful part — within 30 seconds, Prometheus should automatically discover your new application. Check this by:

  1. Opening Prometheus at http://localhost:9090
  2. Going to Status → Targets
  3. Looking for your sample-app-monitor target

If it shows as “UP”, congratulations! You’ve just experienced the power of ServiceMonitor-based discovery.
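
You can also check from the command line, assuming the port-forward from Step 7 is still running:

# Ask Prometheus for its active targets and look for our app
curl -s 'http://localhost:9090/api/v1/targets?state=active' | grep sample-app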


Step 12: Setting Up Production Access

Port-forwarding is great for development, but production needs proper ingress. Here’s how to expose services using LoadBalancer:

# prometheus-loadbalancer.yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus-external
  namespace: monitoring
spec:
  type: LoadBalancer
  ports:
    - port: 9090
      targetPort: 9090
  selector:
    app.kubernetes.io/name: prometheus
    prometheus: prometheus-kube-prometheus-prometheus
---
apiVersion: v1
kind: Service
metadata:
  name: grafana-external
  namespace: monitoring
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 3000
  selector:
    app.kubernetes.io/name: grafana

Apply and get external IPs:

kubectl apply -f prometheus-loadbalancer.yaml
kubectl get svc -n monitoring | grep LoadBalancer

Production tip: In real environments, consider using ingress controllers with TLS termination and authentication instead of direct LoadBalancer exposure.
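
As a rough illustration of that approach, here's a sketch of an Ingress for Grafana. It assumes an NGINX ingress controller and a cert-manager ClusterIssuer named letsencrypt-prod already exist, and grafana.example.com is a placeholder hostname:

# grafana-ingress.yaml (sketch; hostname and issuer are placeholders)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  rules:
    - host: grafana.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-grafana
                port:
                  number: 80
  tls:
    - hosts:
        - grafana.example.com
      secretName: grafana-tls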

Step 13: Adding Custom Alerts

Monitoring without alerting is just expensive logging. Let’s add some intelligent alerts:

# custom-alert-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-alerts
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: custom.rules
      rules:
        - alert: HighCPUUsage
          expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High CPU usage detected"
            description: "CPU usage is above 80% for more than 5 minutes"

        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Pod is crash looping"
            description: "Pod {{ $labels.pod }} is restarting frequently"

Apply the rules:

kubectl apply -f custom-alert-rules.yaml

These alerts will trigger when CPU usage exceeds 80% or when pods start crashing.

Best Practices I’ve Learned

Through years of running Prometheus in production, here are some hard-earned lessons:

1. Resource Planning

Prometheus can be memory-hungry. Monitor your Prometheus pod’s resource usage and adjust requests/limits accordingly. A good starting point is 2GB RAM and 1 CPU core.
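
Those requests and limits can be set through the chart; for example (the numbers are a starting point, not a recommendation for every cluster):

# Give Prometheus explicit resource requests and limits
helm upgrade prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --reuse-values \
  --set prometheus.prometheusSpec.resources.requests.memory=2Gi \
  --set prometheus.prometheusSpec.resources.requests.cpu=1 \
  --set prometheus.prometheusSpec.resources.limits.memory=4Gi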

2. Retention Strategy

Don’t store metrics forever. 30 days is usually sufficient for most use cases. For longer-term storage, consider remote write to Azure Monitor or other TSDB solutions.

3. Label Hygiene

Be careful with high-cardinality labels. Labels like user IDs or request IDs can explode your metric cardinality and kill Prometheus performance.

4. ServiceMonitor Organisation

Keep ServiceMonitors in the monitoring namespace for better organisation and RBAC control.

5. Backup Strategy

While we’ve enabled persistent storage, consider also backing up Grafana dashboards and Prometheus rules to Git repositories.

Troubleshooting Common Issues

ServiceMonitor Not Discovered

This is the most common issue I see. Check the following (a few helper commands appear after the list):

  • Service labels match the ServiceMonitor selector
  • The namespace selector is correct
  • Prometheus operator has permissions to read the ServiceMonitor
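
A minimal set of commands for working through those checks (the operator deployment name may vary slightly with your Helm release name):

# Compare Service labels against the ServiceMonitor selector
kubectl get svc -n default --show-labels
kubectl get servicemonitors -A
# Look at the operator's logs for selector or RBAC problems
kubectl logs deploy/prometheus-kube-prometheus-operator -n monitoring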

Grafana Dashboards Show No Data

Usually, a data source issue:

  • Verify the Prometheus data source URL in Grafana
  • Check if metrics are being scraped in Prometheus
  • Verify time range settings

High Memory Usage

Prometheus' memory usage is directly related to the number of series it’s scraping:

  • Review your metric cardinality
  • Consider lengthening scrape intervals so targets are scraped less often
  • Implement metric relabeling to drop unnecessary metrics (see the sketch below)
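
Relabeling can be configured per ServiceMonitor via metricRelabelings. A minimal sketch that drops Go runtime metrics from our sample app (the regex is just an example):

# Drop metrics you don't need before Prometheus stores them
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sample-app-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: sample-app
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: metrics
      interval: 30s
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: "go_.*"
          action: drop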

Production Considerations

Before taking this to production, consider:

Security

  • Enable TLS for all communications
  • Implement proper RBAC policies
  • Use network policies to restrict traffic
  • Enable audit logging

High Availability

  • Run multiple Prometheus replicas
  • Use Grafana clustering
  • Implement proper backup strategies

Scaling

  • Monitor Prometheus resource usage
  • Consider federation for very large clusters
  • Use recording rules for expensive queries

Integration

  • Connect AlertManager to your incident management system
  • Implement proper notification channels such as Slack or PagerDuty (a minimal Slack sketch follows this list)
  • Set up escalation policies
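
For example, the operator's AlertmanagerConfig resource lets you define receivers declaratively. A minimal Slack sketch, assuming a Secret named slack-webhook with a key url holding your webhook URL (all names here are placeholders); depending on your chart version you may also need to adjust alertmanager.alertmanagerSpec.alertmanagerConfigSelector so the resource is picked up:

# alertmanager-slack.yaml (sketch; secret and channel are placeholders)
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: slack-notifications
  namespace: monitoring
spec:
  route:
    receiver: slack
    groupBy: ["alertname"]
  receivers:
    - name: slack
      slackConfigs:
        - channel: "#alerts"
          apiURL:
            name: slack-webhook
            key: url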

Wrapping Up

Congratulations! You’ve just built a production-ready monitoring stack that rivals what you’d find at any major tech company. The combination of Prometheus, Grafana, and ServiceMonitors provides incredible power and flexibility for monitoring Kubernetes workloads.

Have questions about implementing this in your environment? Found this helpful? Drop a comment below or connect with me on LinkedIn. I love discussing observability and sharing experiences from the trenches.

Written by Saraswathi Lakshman

Cloud Engineer | Azure | AWS | Kubernetes | Terraform | Security & Automation Expert. Transforming cloud infrastructure with automation, security, and scalability.
