Kubernetes · EKS · DevOps · Jan 2025 · 11 min read

Kubernetes Troubleshooting: Common Errors and How to Fix Them

CrashLoopBackOff, Pending pods, ImagePullBackOff — a production-focused guide to diagnosing and fixing the most common Kubernetes errors with real commands.

The mindset before the commands

Kubernetes troubleshooting is mostly about asking the right question before running the right command. The cluster is rarely lying to you — it's usually telling you exactly what went wrong, but in a format that requires some translation. After managing EKS clusters across dozens of microservices, these are the errors I've debugged most often and what they actually mean.

CrashLoopBackOff

This is the most common error, and it always means the same thing: your container started, crashed, Kubernetes restarted it, and it crashed again. The backoff is just Kubernetes slowing down successive restart attempts.

# First thing to run
kubectl describe pod <pod-name> -n <namespace>

# Look at previous container logs (the crashed instance)
kubectl logs <pod-name> --previous -n <namespace>

The --previous flag is the one people forget. The current container logs are often empty because it crashed before writing anything. The previous container's logs usually have the actual error.

Common causes:

- The application exits immediately: bad command or entrypoint, missing environment variable, unreadable config file
- A failing liveness probe forcing repeated restarts
- The container exceeding its memory limit (OOMKilled)
- A dependency (database, downstream service) that isn't reachable at startup

Quick check: kubectl describe pod will show OOMKilled in the Last State section if memory was the cause. Exit code 137 means the process was killed with SIGKILL, which in practice is almost always the OOM killer.
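The 137 itself is worth decoding: Linux reports death-by-signal as 128 plus the signal number, and SIGKILL is signal 9. You can reproduce the exit code locally without a cluster:

```shell
# 128 + 9 (SIGKILL) = 137, the exit code of an OOM-killed container
sh -c 'kill -9 $$'
echo $?   # prints 137
```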

Pending pods that never schedule

A pod stuck in Pending means the scheduler couldn't find a node to place it on.

kubectl describe pod <pod-name> -n <namespace>
# Look for Events section at the bottom
# "0/3 nodes are available" tells you exactly why

Common reasons from the Events output:

- Insufficient cpu / Insufficient memory — no node has enough unreserved capacity for the pod's requests
- node(s) had untolerated taint — the pod is missing a toleration for a tainted node
- node(s) didn't match node selector/affinity — nodeSelector or affinity rules exclude every node
- pod has unbound immediate PersistentVolumeClaims — a PVC couldn't be provisioned or bound

# Check node capacity vs requests
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check PVC status
kubectl get pvc -n <namespace>
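If the Events blame insufficient CPU or memory, the fix is usually right-sizing the pod's requests to what the nodes can actually offer. A minimal sketch of where those values live (the name, image, and numbers here are illustrative, not from this article):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api            # illustrative name
spec:
  containers:
    - name: api
      image: nginx:1.25
      resources:
        requests:      # the scheduler places the pod based on these
          cpu: 250m
          memory: 256Mi
        limits:        # hard caps enforced at runtime
          cpu: 500m
          memory: 512Mi
```

The scheduler only looks at requests, not limits, so a pod requesting 4Gi will stay Pending on nodes with 3Gi free even if it would never actually use that much.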

ImagePullBackOff

The container image couldn't be pulled. Either the image doesn't exist, the tag is wrong, or the cluster doesn't have credentials to pull from a private registry.

# Verify the image exists and tag is correct
kubectl describe pod <pod-name> | grep Image

# Check if imagePullSecret is configured
kubectl get secret -n <namespace>
kubectl describe serviceaccount default -n <namespace>

For private ECR on EKS, the node's IAM role needs ecr:GetAuthorizationToken, ecr:BatchGetImage, and ecr:GetDownloadUrlForLayer. A missing ECR permission is the most common cause in AWS environments.
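On the IAM side, attaching the AWS-managed AmazonEC2ContainerRegistryReadOnly policy to the node role covers pulls; a minimal inline equivalent looks roughly like this:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchGetImage",
        "ecr:GetDownloadUrlForLayer"
      ],
      "Resource": "*"
    }
  ]
}
```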

Service not reachable

Pod is running but nothing can talk to it. This is almost always a label selector mismatch.

# Check if service selectors match pod labels
kubectl get svc <service-name> -n <namespace> -o yaml | grep selector
kubectl get pods -n <namespace> --show-labels

# Check if endpoints are populated
kubectl get endpoints <service-name> -n <namespace>

If kubectl get endpoints shows <none>, the service has no matching pods. Compare the selector in the Service with the labels on your pods — one character difference is enough.
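As a concrete sketch of what has to line up (all names here are illustrative), the Service selector must match the pod labels key for key:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web          # must match the pod's labels exactly
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Pod
metadata:
  name: web-0
  labels:
    app: web          # "app: webb" here would leave the endpoints empty
spec:
  containers:
    - name: web
      image: nginx:1.25
```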

OOMKilled — container killed by memory limit

# Find OOMKilled pods
kubectl get pods -A | grep OOMKilled

# Check actual memory usage vs limits
kubectl top pod <pod-name> -n <namespace>
kubectl describe pod <pod-name> | grep -A 3 Limits

Don't just increase the limit and move on. First understand why memory usage spiked. A memory leak in the application won't be fixed by a higher limit — you'll just OOMKill at a higher threshold.

Debugging inside a running pod

# Exec into a running container
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh

# Run a debug pod in the same namespace
kubectl run debug --image=busybox -it --rm -n <namespace> -- sh

# Check DNS resolution from inside the cluster
nslookup <service-name>.<namespace>.svc.cluster.local

Production note: On hardened clusters, exec access may be restricted by admission controllers or RBAC. Have a debug namespace with relaxed policies ready before you need it at 2am.
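When exec is blocked, or the image is distroless and has no shell at all, kubectl debug with ephemeral containers (GA since Kubernetes 1.25) is the fallback. The pod and container names below are placeholders:

```shell
# Attach a busybox ephemeral container that shares the target container's
# process namespace, so you can inspect its processes and filesystem
kubectl debug -it <pod-name> -n <namespace> --image=busybox --target=<container-name>

# Or copy the pod with a changed command for offline inspection
kubectl debug <pod-name> -n <namespace> --copy-to=debug-copy -- sleep 1d
```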

The five commands I run first on any broken cluster

kubectl get pods -A | grep -v Running          # everything not healthy
kubectl get events -A --sort-by='.lastTimestamp' # recent cluster events
kubectl top nodes                                # node resource pressure
kubectl get nodes                                # node status
kubectl describe pod <broken-pod>               # the actual error

In that order. The events log is the most underused tool in Kubernetes — it shows you what the cluster was doing in the last hour without having to know which pod to look at first.
