The mindset before the commands
Kubernetes troubleshooting is mostly about asking the right question before running the right command. The cluster is rarely lying to you — it's usually telling you exactly what went wrong, but in a format that requires some translation. After managing EKS clusters across dozens of microservices, these are the errors I've debugged most often and what they actually mean.
CrashLoopBackOff
This is the most common one and it always means the same thing: your container started, crashed, Kubernetes restarted it, and it crashed again. The backoff is just Kubernetes exponentially slowing down its restart attempts, capped at five minutes.
# First thing to run
kubectl describe pod <pod-name> -n <namespace>
# Look at previous container logs (the crashed instance)
kubectl logs <pod-name> --previous -n <namespace>
The --previous flag is the one people forget. The current container logs are often empty because it crashed before writing anything. The previous container's logs usually have the actual error.
Common causes:
- Application can't connect to a dependency (DB, Redis, external API) on startup
- Missing environment variable or misconfigured secret reference
- OOMKilled — container hit its memory limit immediately
- Wrong entrypoint or command in the Docker image
kubectl describe pod will show OOMKilled in the Last State section if memory was the cause. Exit code 137 means the container was killed with SIGKILL (128 + 9) — when a memory limit is set, that's almost always the OOM killer.
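If the pod has already been restarted, the exit code and reason of the crashed instance can be pulled straight from the pod status. A quick sketch, assuming a single-container pod (index 0):

```shell
# Exit code of the last terminated container (137 = SIGKILL, usually OOM)
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

# The reason field often says OOMKilled outright
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```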
Pending pods that never schedule
A pod stuck in Pending means the scheduler couldn't find a node to place it on.
kubectl describe pod <pod-name> -n <namespace>
# Look for Events section at the bottom
# "0/3 nodes are available" tells you exactly why
Common reasons from the Events output:
- Insufficient cpu or Insufficient memory — nodes don't have enough resources. Either scale up nodes or reduce pod requests.
- node(s) didn't match Pod's node affinity/selector — your nodeSelector or affinity rules don't match any available nodes
- node(s) had taint that the pod didn't tolerate — nodes are tainted, pod needs a toleration
- pod has unbound PersistentVolumeClaims — your PVC isn't binding. Check the PVC status separately.
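For the taint case, it helps to see what's actually on the nodes before writing a toleration. A sketch — the dedicated=gpu taint is a hypothetical example:

```shell
# List the taint keys on every node
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

# If a node carries e.g. dedicated=gpu:NoSchedule, the pod spec needs a matching toleration:
#   tolerations:
#   - key: "dedicated"
#     operator: "Equal"
#     value: "gpu"
#     effect: "NoSchedule"
```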
# Check node capacity vs requests
kubectl describe nodes | grep -A 5 "Allocated resources"
# Check PVC status
kubectl get pvc -n <namespace>
ImagePullBackOff
The container image couldn't be pulled. Either the image doesn't exist, the tag is wrong, or the cluster doesn't have credentials to pull from a private registry.
# Verify the image exists and tag is correct
kubectl describe pod <pod-name> | grep Image
# Check if imagePullSecret is configured
kubectl get secret -n <namespace>
kubectl describe serviceaccount default -n <namespace>
For private ECR on EKS — the node's IAM role needs ecr:GetAuthorizationToken, ecr:BatchGetImage, and ecr:GetDownloadUrlForLayer (all included in the managed AmazonEC2ContainerRegistryReadOnly policy). This is the most common cause in AWS environments.
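If fixing the node role isn't an option, a registry pull secret works as a fallback. A sketch — the account ID, region, and secret name ecr-pull are placeholders to substitute:

```shell
# ECR tokens expire after 12 hours, so this needs periodic refresh (e.g. via a CronJob)
kubectl create secret docker-registry ecr-pull \
  --docker-server=<account-id>.dkr.ecr.<region>.amazonaws.com \
  --docker-username=AWS \
  --docker-password="$(aws ecr get-login-password --region <region>)" \
  -n <namespace>

# Attach it to the default service account so pods pick it up automatically
kubectl patch serviceaccount default -n <namespace> \
  -p '{"imagePullSecrets": [{"name": "ecr-pull"}]}'
```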
Service not reachable
Pod is running but nothing can talk to it. This is almost always a label selector mismatch — or, less often, a targetPort that doesn't match the port the container actually listens on.
# Check if service selectors match pod labels
kubectl get svc <service-name> -n <namespace> -o yaml | grep selector
kubectl get pods -n <namespace> --show-labels
# Check if endpoints are populated
kubectl get endpoints <service-name> -n <namespace>
If kubectl get endpoints shows <none>, the service has no matching pods. Compare the selector in the Service with the labels on your pods — one character difference is enough.
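One way to make that comparison mechanical is to feed the Service's own selector back into kubectl as a label filter. A sketch — it assumes a recent kubectl that prints the selector map as JSON, and the tr pipeline is a quick-and-dirty conversion:

```shell
# Convert {"app":"myapp"} into app=myapp and list the pods it actually matches.
# An empty result confirms the mismatch.
SELECTOR=$(kubectl get svc <service-name> -n <namespace> \
  -o jsonpath='{.spec.selector}' | tr -d '{}" ' | tr ':' '=')
kubectl get pods -n <namespace> -l "$SELECTOR"
```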
OOMKilled — container killed by memory limit
# Find OOMKilled pods
kubectl get pods -A | grep OOMKilled
# Check actual memory usage vs limits
kubectl top pod <pod-name> -n <namespace>
kubectl describe pod <pod-name> | grep -A 3 Limits
Don't just increase the limit and move on. First understand why memory usage spiked. A memory leak in the application won't be fixed by a higher limit — you'll just OOMKill at a higher threshold.
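Once you do understand the usage pattern, the limit can be raised in place without editing YAML by hand. A sketch with a hypothetical deployment name my-app and illustrative values:

```shell
# Watch per-container memory before touching limits (requires metrics-server)
kubectl top pod <pod-name> -n <namespace> --containers

# Then adjust requests and limits on the deployment
kubectl set resources deployment my-app -n <namespace> \
  --requests=memory=256Mi --limits=memory=512Mi
```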
Debugging inside a running pod
# Exec into a running container
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
# Run a debug pod in the same namespace
kubectl run debug --image=busybox -it --rm -n <namespace> -- sh
# Check DNS resolution from inside the cluster
nslookup <service-name>.<namespace>.svc.cluster.local
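From that debug pod, pairing the DNS check with a direct HTTP request covers most "service unreachable" cases. A sketch, assuming the service exposes plain HTTP on port 80:

```shell
# busybox wget supports -qO-; a 2-second timeout keeps the check snappy
wget -qO- --timeout=2 http://<service-name>.<namespace>.svc.cluster.local:80/
# DNS resolves but wget times out -> likely selector/port mismatch
# DNS fails entirely -> CoreDNS problem or a namespace typo
```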
The five commands I run first on any broken cluster
kubectl get pods -A | grep -v Running # everything not healthy
kubectl get events -A --sort-by='.lastTimestamp' # recent cluster events
kubectl top nodes # node resource pressure
kubectl get nodes # node status
kubectl describe pod <broken-pod> # the actual error
In that order. The events log is the most underused tool in Kubernetes — it shows you what the cluster was doing in the last hour without having to know which pod to look at first.
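A variant of that events command worth memorizing filters to warnings only, which cuts through routine scheduling noise:

```shell
# Only Warning events, cluster-wide, most recent last
kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp'
```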