## Why observability before scaling
The order matters. I've seen teams scale their EKS clusters to handle load before setting up proper monitoring, and they end up flying blind — CPUs are high but which pods? Memory is growing but is it a leak or normal growth? Setting up Prometheus and Grafana before you need them is like putting on a seatbelt before you drive, not after you crash. This is the setup I use across production EKS clusters.
## Prerequisites

- EKS cluster running (1.24+)
- `kubectl` configured and connected
- Helm 3 installed
- At least 2 worker nodes with 2 vCPU / 4GB RAM each for the monitoring stack
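A quick sanity check that the prerequisites are in place before installing anything:

```bash
# Confirm cluster access and count the worker nodes
kubectl get nodes

# Confirm Helm 3 is on the PATH
helm version --short
```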
## Installing the kube-prometheus-stack
Don't install Prometheus and Grafana separately. The kube-prometheus-stack Helm chart installs Prometheus, Grafana, Alertmanager, and a set of pre-built Kubernetes dashboards in one shot.
```bash
# Add the Prometheus community Helm repo
helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm repo update

# Create a dedicated namespace
kubectl create namespace monitoring

# Install the stack
helm install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set grafana.adminPassword=changeme \
  --set prometheus.prometheusSpec.retention=15d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=gp2 \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
```

Change the `grafana.adminPassword` value and never commit it to your repo. Use a Kubernetes secret or pass it via a CI/CD environment variable.
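One way to keep the password out of your values entirely is the chart's `grafana.admin.existingSecret` setting. A sketch, assuming the Grafana subchart's default secret keys (`admin-user` / `admin-password`); the secret name here is illustrative:

```bash
# Store the Grafana admin credentials in a secret (name is illustrative)
kubectl create secret generic grafana-admin-credentials \
  --namespace monitoring \
  --from-literal=admin-user=admin \
  --from-literal=admin-password="$(openssl rand -base64 24)"

# Point the chart at the secret instead of setting adminPassword inline
helm upgrade kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --reuse-values \
  --set grafana.admin.existingSecret=grafana-admin-credentials
```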
## Verify the installation
```bash
kubectl get pods -n monitoring
# All pods should reach Running state within 2-3 minutes
# You should see: alertmanager, grafana, kube-state-metrics,
# node-exporter (one per node), prometheus-operator, prometheus

kubectl get svc -n monitoring
```
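If you'd rather block until everything is ready than poll by hand, `kubectl wait` does the same check:

```bash
# Wait for every pod in the namespace to become Ready, up to 5 minutes
kubectl wait --for=condition=Ready pods --all \
  -n monitoring --timeout=300s
```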
## Accessing Grafana

```bash
# Port-forward for local access
kubectl port-forward svc/kube-prometheus-stack-grafana \
  3000:80 -n monitoring

# Open http://localhost:3000
# Username: admin / Password: what you set above
```

For production access, use an ingress with TLS rather than port-forward. The stack includes a Grafana service you can expose via the AWS Load Balancer Controller.
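A sketch of what that ingress could look like, assuming the AWS Load Balancer Controller is installed and you have an ACM certificate; the certificate ARN and hostname below are placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    # Placeholder ARN — substitute your own ACM certificate
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789012:certificate/example
spec:
  ingressClassName: alb
  rules:
    - host: grafana.example.com   # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: kube-prometheus-stack-grafana
                port:
                  number: 80
```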
## The dashboards you actually need

The chart installs ~30 pre-built dashboards. The ones I keep pinned:

- **Kubernetes / Compute Resources / Cluster** — CPU and memory usage across all namespaces
- **Kubernetes / Compute Resources / Pod** — per-pod resource usage vs requests/limits
- **Node Exporter / Nodes** — disk, network, and system-level metrics per node
- **Kubernetes / Persistent Volumes** — PVC usage and inode pressure
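These dashboards are backed by ordinary PromQL you can also run ad hoc in Grafana's Explore view. For example, this answers the "CPUs are high but which pods?" question directly (a sketch using cAdvisor's standard metric names):

```promql
# CPU cores in use, broken down by namespace and pod
sum by (namespace, pod) (
  rate(container_cpu_usage_seconds_total{container!=""}[5m])
)
```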
## Setting up your first alert
Dashboards are for humans. Alerts are for when humans are asleep. This PrometheusRule fires when a pod has been in CrashLoopBackOff for 5 minutes:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: pod.rules
      rules:
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[5m]) * 60 > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
            description: "Pod {{ $labels.pod }} has restarted at least once in the last 5 minutes"
```

The `release: kube-prometheus-stack` label is required for Prometheus to discover and load this rule. Without it, the rule exists in the cluster but Prometheus ignores it completely.
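To confirm Prometheus actually picked the rule up, you can query its rules API (the service name below assumes the release name used in the install above):

```bash
# Forward the Prometheus service locally
kubectl port-forward svc/kube-prometheus-stack-prometheus \
  9090:9090 -n monitoring &

# The alert should appear in the active rule set
curl -s http://localhost:9090/api/v1/rules | grep PodCrashLooping
```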
## Configuring Alertmanager for Slack
```yaml
# values.yaml for helm upgrade
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname', 'namespace']
      receiver: 'slack-notifications'
    receivers:
      - name: 'slack-notifications'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
            channel: '#alerts'
            title: '{{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
```
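Apply the values file with a `helm upgrade` against the same release; `--reuse-values` keeps everything from the original install and layers the Alertmanager config on top:

```bash
helm upgrade kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --reuse-values \
  -f values.yaml
```

As with the Grafana password, a real Slack webhook URL is a credential — keep it out of version control.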
## Storage considerations on EKS
Prometheus stores metrics on disk. Plan for this before you run out of space in production:
- Default retention in the chart is 10 days — I set it to 15 days for production
- Prometheus stores roughly 1–2 bytes per compressed sample on disk; a small cluster ingesting a few thousand samples per second works out to roughly 1–2GB per day
- Use `gp3` instead of `gp2` for EBS volumes — lower cost, better baseline performance
- Enable the EBS CSI driver on your EKS cluster before installing — the chart needs it for PVC provisioning
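EKS ships with a `gp2` StorageClass by default, so a `gp3` class has to exist before the install command can reference it by name. A minimal sketch (the class name is your choice):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```

With this applied, point `storageClassName` in the Helm install at `gp3` instead of `gp2`.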
```bash
# Check if the EBS CSI driver is installed
kubectl get pods -n kube-system | grep ebs-csi

# If not, install it via EKS add-on
aws eks create-addon \
  --cluster-name your-cluster \
  --addon-name aws-ebs-csi-driver \
  --region us-east-1
```
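One caveat worth knowing: the add-on also needs IAM permissions to manage EBS volumes, typically an IRSA role passed via `--service-account-role-arn` on the `create-addon` call. To confirm the add-on came up healthy:

```bash
# Status should read ACTIVE once the driver is ready to provision PVCs
aws eks describe-addon \
  --cluster-name your-cluster \
  --addon-name aws-ebs-csi-driver \
  --region us-east-1 \
  --query 'addon.status' \
  --output text
```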
Get this stack running before your first production workload goes live. The 45% reduction in incident response time I've seen in practice comes almost entirely from having the right alerts already configured — not from looking at dashboards after something breaks.