KubernetesPrometheusAWS EKSObservability Nov 2024 12 min read

Setting Up Prometheus and Grafana on EKS: A Production Guide

How to install and configure the kube-prometheus-stack on AWS EKS, set up your first alerts, and build the observability foundation your cluster needs before it scales.

Why observability before scaling

The order matters. I've seen teams scale their EKS clusters to handle load before setting up proper monitoring, and they end up flying blind — CPUs are high but which pods? Memory is growing but is it a leak or normal growth? Setting up Prometheus and Grafana before you need them is like putting on a seatbelt before you drive, not after you crash. This is the setup I use across production EKS clusters.

Prerequisites

Installing the kube-prometheus-stack

Don't install Prometheus and Grafana separately. The kube-prometheus-stack Helm chart installs Prometheus, Grafana, Alertmanager, and a set of pre-built Kubernetes dashboards in one shot.

# Add the Prometheus community Helm repo
helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm repo update

# Create a dedicated namespace
kubectl create namespace monitoring

# Install the stack
helm install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set grafana.adminPassword=changeme \
  --set prometheus.prometheusSpec.retention=15d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=gp2 \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
Important: Change the grafana.adminPassword value and never commit it to your repo. Use a Kubernetes secret or pass it via CI/CD environment variable.

Verify the installation

kubectl get pods -n monitoring
# All pods should reach Running state within 2-3 minutes
# You should see: alertmanager, grafana, kube-state-metrics,
# node-exporter (one per node), prometheus-operator, prometheus

kubectl get svc -n monitoring

Accessing Grafana

# Port-forward for local access
kubectl port-forward svc/kube-prometheus-stack-grafana \
  3000:80 -n monitoring

# Open http://localhost:3000
# Username: admin / Password: what you set above

For production access, use an ingress with TLS rather than port-forward. The stack includes a Grafana service you can expose via an AWS ALB ingress controller.

The dashboards you actually need

The chart installs ~30 pre-built dashboards. The ones I keep pinned:

Setting up your first alert

Dashboards are for humans. Alerts are for when humans are asleep. This PrometheusRule fires when a pod has been in CrashLoopBackOff for 5 minutes:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
  - name: pod.rules
    rules:
    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[5m]) * 60 > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{{{ $labels.pod }}}} is crash looping"
        description: "Pod {{{{ $labels.pod }}}} has restarted more than once in 5 minutes"
Key detail: The release: kube-prometheus-stack label is required for Prometheus to discover and load this rule. Without it, the rule exists in the cluster but Prometheus ignores it completely.

Configuring Alertmanager for Slack

# values.yaml for helm upgrade
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname', 'namespace']
      receiver: 'slack-notifications'
    receivers:
    - name: 'slack-notifications'
      slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

Storage considerations on EKS

Prometheus stores metrics on disk. Plan for this before you run out of space in production:

# Check if EBS CSI driver is installed
kubectl get pods -n kube-system | grep ebs-csi

# If not, install via EKS add-on
aws eks create-addon \
  --cluster-name your-cluster \
  --addon-name aws-ebs-csi-driver \
  --region us-east-1

Get this stack running before your first production workload goes live. The 45% reduction in incident response time I've seen in practice comes almost entirely from having the right alerts already configured — not from looking at dashboards after something breaks.

← Back to all articles