Local Kubernetes Dev — Part 15: Common problems and how to fix them
A field guide to the Kubernetes statuses you will meet every week — ImagePullBackOff, CrashLoopBackOff, Pending, OOMKilled and more — diagnosed with the same three commands and fixed with a short list of common causes.
Sooner or later kubectl get pods will show you something other than a cozy Running — maybe ImagePullBackOff or CrashLoopBackOff. That's normal: even experienced engineers see these statuses every week. The good news is that almost every Kubernetes problem is diagnosed with the same three commands, and most errors come down to a short list of common causes.
Memorize this "golden set" — you'll need it in every section below:
    # 1. Describe the pod: status, reason, events. ALWAYS the starting point
    kubectl describe pod <pod> -n myapp

    # 2. Application logs (and logs of a CRASHED container via --previous)
    kubectl logs <pod> -n myapp
    kubectl logs <pod> -n myapp --previous

    # 3. Timeline of events across the whole namespace
    kubectl get events -n myapp --sort-by=.lastTimestamp
kubectl describe shows the State field (the container's current state), Reason (why it's in that state), and the Events block at the bottom. Those are the three most important places to look. Let's go through the common statuses in order.
ImagePullBackOff / ErrImagePull
These two statuses are about an image that Kubernetes couldn't pull. First you see ErrImagePull (the first failed attempt), and after several retries with an increasing pause (backoff) the pod moves to ImagePullBackOff. The pod stays in the Waiting state and never starts.
Let's look at the reason:
    kubectl describe pod <pod> -n myapp
    # In the output, look for:
    # State: Waiting
    # Reason: ImagePullBackOff
    # Events: Failed to pull image "...": ... not found / unauthorized / no such host
Common causes:
- A typo in the image name or tag. The most basic and most common. Check image: in the manifest against what actually exists.
- A private registry you don't have access to. You need imagePullSecrets (for secrets, see chapter 10).
- Rate limiting on Docker Hub. Anonymous users are limited in how many pulls they can do; the message will say toomanyrequests.
- Network problems or a typo in the registry address: the message will say no such host.
The big gotcha specific to k3d
k3d nodes run containerd, the container runtime that ships with k3s, and its image store is separate from your Docker daemon's. This leads to a counterintuitive fact: the cluster doesn't see the image you just built locally with docker build. Docker has it, the cluster doesn't, and you get ImagePullBackOff even though "the image is built" (OneUptime: Docker images with k3d). This is just a quick summary for diagnosis; image delivery into k3d is covered in detail in the chapter on containerization.
There are two correct ways to deliver an image into the cluster.
Option 1. Import the image into the nodes directly:
    docker build -t myapp:dev .
    k3d image import myapp:dev -c dev
Option 2 (recommended). A local registry. When you create the cluster, bring up the built-in registry, and in your manifests reference the full name with the address and port (k3d: Using Image Registries):
    # the registry is created together with the cluster
    k3d cluster create dev --registry-create k3d-registry.localhost:5000

    # in the Deployment the full name matters: registry address + port + tag
    containers:
      - name: myapp
        image: k3d-registry.localhost:5000/myapp:dev
A common mistake here is an incomplete image name: writing myapp:dev instead of k3d-registry.localhost:5000/myapp:dev, or forgetting the port or address. The cluster will go looking for the image in the wrong place and fall into ImagePullBackOff again.
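To make the incomplete-name mistake concrete, here is a minimal shell sketch that checks whether an image reference names a registry at all. It assumes the usual Docker-style convention (the part before the first slash counts as a registry only if it contains a dot, a colon, or is localhost); has_registry is a hypothetical helper for illustration, not part of k3d or kubectl.

```shell
# has_registry REF: succeeds if REF starts with an explicit registry host.
# Assumed rule: the text before the first "/" is a registry only if it
# contains "." or ":" or equals "localhost".
has_registry() {
  ref=$1
  case "$ref" in
    */*) first=${ref%%/*} ;;   # text before the first slash
    *)   return 1 ;;           # no slash at all: a bare name such as myapp:dev
  esac
  case "$first" in
    localhost|*.*|*:*) return 0 ;;
    *) return 1 ;;
  esac
}

for image in myapp:dev k3d-registry.localhost:5000/myapp:dev; do
  if has_registry "$image"; then
    echo "$image -> explicit registry"
  else
    echo "$image -> no registry in the name"
  fi
done
```

Running a check like this over the image: values in your manifests is a quick way to spot a name that will not be looked up in the local registry.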
If you use Tilt, you can forget about importing: docker_build in the Tiltfile builds and delivers the image into the cluster for you. But there's a condition — the image name in the Tiltfile must exactly match the image: value in the manifest. If it doesn't match, Tilt builds one image while the Deployment asks for another, and you get the same ImagePullBackOff (more on Tilt in chapter 8).
CrashLoopBackOff
CrashLoopBackOff means: the container starts, crashes, the kubelet restarts it, it crashes again — and round and round. To avoid spinning on restarts pointlessly, the kubelet increases the pause between attempts exponentially: roughly 10s → 20s → 40s and so on, with a ceiling of 5 minutes (GKE: Troubleshoot CrashLoopBackOff).
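The doubling-with-a-ceiling schedule is easy to see in a few lines of shell. This is only a sketch of the arithmetic, using the figures quoted above (10 s start, 5 min cap); backoff_delays is a made-up name, and the kubelet does not expose its timers this way.

```shell
# backoff_delays N: print the first N restart delays in seconds,
# doubling each time and capping at 300 s (5 minutes).
backoff_delays() {
  attempts=$1
  delay=10
  cap=300
  out=""
  i=0
  while [ "$i" -lt "$attempts" ]; do
    out="$out${out:+ }$delay"          # append, space-separated
    delay=$((delay * 2))
    if [ "$delay" -gt "$cap" ]; then
      delay=$cap
    fi
    i=$((i + 1))
  done
  echo "$out"
}

backoff_delays 7   # prints: 10 20 40 80 160 300 300
```

In practice this means that after five or six crashes you wait the full five minutes between attempts, which is why reading the logs beats waiting for the next restart.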
The status by itself doesn't tell you why it crashed. The key diagnostic command is the logs of the previous, already-dead instance of the container:
    kubectl logs <pod> -n myapp --previous
Without --previous you'll see the logs of the freshly started container, which most likely hasn't had time to write anything. --previous pulls the stack trace from exactly the instance that crashed. Additionally, in kubectl describe pod look at the Last State: Terminated block and its Exit Code.
Common causes:
- A bug in the application: an exception at startup, a non-zero exit code. For our myapp the classic case is missing database connection variables (DB_HOST, DB_PASSWORD, etc.), and FastAPI crashes when it tries to connect to PostgreSQL at startup.
- A missing env variable or config (a ConfigMap/Secret isn't mounted).
- A dependency is unavailable: PostgreSQL hasn't come up yet, and the service doesn't know how to wait.
- OOM: the application ran out of memory (see the OOMKilled section below).
- An overly strict liveness probe kills the container before it has time to warm up. The fix is initialDelaySeconds or a separate startup probe (Kubernetes: Probes).
- The container exited with exit code 0. Counterintuitive, but for a long-running service this is also CrashLoopBackOff: a Deployment's pods always use restartPolicy: Always, so the kubelet restarts the container even though it "exited successfully." Usually the cause is an incorrect entrypoint/command that runs and exits instead of starting uvicorn.
A useful trick when the container crashes instantly and the logs are empty: temporarily override the command to "do nothing" so the container survives, and get inside it by hand.
    # temporary, in the Deployment, so the container doesn't crash and you can get inside
    command: ["sleep", "infinity"]

    kubectl exec -it <pod> -n myapp -- sh
    # inside: check env, try running uvicorn by hand, and read the real error
    env | grep DB_
    uvicorn app.main:app --host 0.0.0.0 --port 8080
Pending
A pod in the Pending status hasn't been assigned to any node yet — the scheduler couldn't find a place to put it. The command is the same:
    kubectl describe pod <pod> -n myapp
    # In Events, look for:
    # Warning FailedScheduling ... 0/1 nodes are available: 1 Insufficient cpu, 1 Insufficient memory
The FailedScheduling message usually names the cause directly (Kubernetes: Debug Pods):
- Not enough CPU or memory for the pod's requests: the most common cause on a local cluster. If you set myapp to request 4 CPUs but your k3d VM only has 2, the pod will never be scheduled.
- A nodeSelector / affinity mismatch: the pod asks for a node with a label that doesn't exist.
- Taints without tolerations: the node is "marked" in a way that keeps ordinary pods off it.
- An unbound PVC: the pod is waiting for a volume that can't be created.
- hostPort: the port is already taken on the node, so there's nowhere to place the pod.
Solutions, in order: lower requests to something reasonable, fix the selector/PVC, and if it's just a small local cluster — add nodes or recreate it bigger:
    # add agents (worker nodes) to an existing cluster
    k3d node create extra --cluster dev --role agent

    # or recreate the cluster with several agents
    k3d cluster delete dev
    k3d cluster create dev --agents 2 --registry-create k3d-registry.localhost:5000
For myapp, sensible requests on a local machine are something modest, for example 100m CPU and 128Mi of memory; don't blindly copy "production" numbers into a small cluster.
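The "Insufficient cpu" check comes down to comparing quantities in millicores. The sketch below is a deliberate simplification: it compares one request against one node's capacity and ignores what other pods and system components have already reserved. to_millicores and fits are hypothetical helper names.

```shell
# to_millicores QTY: convert a CPU quantity ("100m", "2", "0.5") to millicores.
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v v="$1" 'BEGIN { printf "%.0f\n", v * 1000 }' ;;
  esac
}

# fits REQUEST CAPACITY: succeeds if the request is no larger than the capacity.
fits() {
  [ "$(to_millicores "$1")" -le "$(to_millicores "$2")" ]
}

for request in 4 100m; do
  if fits "$request" 2; then
    echo "request $request on a 2-CPU node: schedulable"
  else
    echo "request $request on a 2-CPU node: Insufficient cpu"
  fi
done
```

The first case is the 4-CPU request from the list above, the second is the modest 100m suggested for myapp.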
OOMKilled
OOMKilled is "Out Of Memory Killed": the process exceeded its memory limit and the kernel killed it with a SIGKILL signal. It's easy to recognize by exit code 137 = 128 + 9, where 9 is the number of the SIGKILL signal (Komodor: OOMKilled / Exit Code 137).
    kubectl describe pod <pod> -n myapp
    # Last State: Terminated
    # Reason: OOMKilled
    # Exit Code: 137
Important: it's the Reason: OOMKilled field that distinguishes running out of memory from other cases of SIGKILL. OOM is often the hidden cause of the CrashLoopBackOff from the previous section: the container crashes on memory, restarts, and hits the limit again.
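The exit-code arithmetic (137 = 128 + 9) works for any signal, so it helps when you meet a code other than 137. A minimal sketch, with exit_signal as a made-up helper name:

```shell
# exit_signal CODE: print the signal number encoded in an exit code,
# or "none" when the code is 128 or lower (an ordinary exit, not a signal).
exit_signal() {
  if [ "$1" -gt 128 ]; then
    echo $(( $1 - 128 ))
  else
    echo none
  fi
}

exit_signal 137   # 9    (SIGKILL: an OOM kill, or any other kill -9)
exit_signal 143   # 15   (SIGTERM)
exit_signal 1     # none (the application exited on its own)
```

Keep in mind that the number alone only says which signal arrived; whether it was an OOM kill is still decided by the Reason field.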
There are two scenarios:
- Container-level OOM. The container exceeded its resources.limits.memory. The fix is either raising the limit or fixing a leak/inefficiency in the code.
- Node-level OOM. The whole node ran out of memory, and the kubelet starts evicting pods. On k3d this is especially sneaky: the nodes live inside a Docker VM (Docker Desktop, colima, etc.), and if your pods' combined limits exceed that VM's memory, their actual usage can exhaust the VM, so pods get killed even when each application is within its own limit.
For myapp it looks like this:
    resources:
      requests:
        memory: "128Mi"
      limits:
        memory: "256Mi"  # FastAPI + a couple of workers usually fit; watch for leaks
If you're hitting node-level OOM, either reduce the pods' combined limits or give the Docker VM more memory in its settings. Don't overcommit: the sum of all pods' limits should not exceed the VM's memory.
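The overcommit rule is a plain sum, so it is easy to check by hand or with a few lines of shell. The sketch below assumes you already have the limit values (they are the resources.limits.memory fields of your pods) and handles only the Mi and Gi suffixes; the helper names and the numbers are hypothetical.

```shell
# to_mi QTY: convert "256Mi" or "2Gi" to a whole number of Mi.
to_mi() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# overcommitted VM_MEMORY LIMIT...: succeeds when the limits add up to
# more than the VM has; returns 2 on an unsupported quantity.
overcommitted() {
  vm=$(to_mi "$1") || return 2
  shift
  total=0
  for limit in "$@"; do
    mi=$(to_mi "$limit") || return 2
    total=$(( total + mi ))
  done
  echo "limits total ${total}Mi, VM has ${vm}Mi"
  [ "$total" -gt "$vm" ]
}

# Hypothetical numbers: a 2Gi Docker VM and three pods.
if overcommitted 2Gi 256Mi 1Gi 1Gi; then
  echo "overcommitted: node-level OOM is possible"
fi
```

Here the limits add up to 2304Mi against a 2048Mi VM, so either one limit has to come down or the VM has to grow.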
The Service doesn't respond (selector / ports / readiness)
A separate genre of problems: the pods are Running, everything is "green," yet a request to the Service never reaches the application. Almost always the culprit is one of the three breaks in the Service → endpoints → Pod chain.
The first command is to look at the endpoints. If they're empty, the Service didn't find a single pod:
    kubectl get endpoints myapp -n myapp
    # NAME    ENDPOINTS   AGE
    # myapp   <none>      5m    <- bad: no pods behind the service

    # on newer clusters the same thing via EndpointSlices:
    kubectl get endpointslices -n myapp -l kubernetes.io/service-name=myapp
Cause 1. Selector mismatch. Labels and selectors are case-sensitive: app: Web and app: web are different things, and the Service simply won't pick up the pod. Check the pods' labels against the service's selector (OneUptime: Service not reaching pods):
    kubectl get pods -n myapp --show-labels
    kubectl get svc myapp -n myapp -o jsonpath='{.spec.selector}'
    # the service's selector must match the pods' labels character for character
Cause 2. Port mismatch. The targetPort in the Service must point to the containerPort the application actually listens on. For myapp that's 8080 (Kubernetes: Debug Pods):
    kubectl get svc myapp -n myapp -o yaml | grep -A3 ports
    # check targetPort against the containerPort in the Deployment (everywhere it's 8080 for us)
Cause 3. The readiness probe doesn't pass. This is the trickiest of the three. If the readiness probe is red, the pod shows 0/1 Ready, and Kubernetes removes it from the endpoints, so traffic doesn't flow. Meanwhile the container doesn't restart and in kubectl get pods looks like Running. This is precisely how readiness differs from liveness: a liveness failure restarts the container, while a readiness failure only takes it out of traffic (Kubernetes: Probes). The result is "the service silently doesn't respond, yet the pod seems alive."
    kubectl get pods -n myapp
    # NAME         READY   STATUS    RESTARTS
    # myapp-xxxx   0/1     Running   0          <- 0/1: readiness didn't pass
    kubectl describe pod <pod> -n myapp | grep -A5 Readiness
A good way to localize the problem is to bypass the Service and connect straight to the pod via port-forward. If that works but going through the Service doesn't, the issue is in the selector/endpoints/readiness, not the application:
    kubectl port-forward <pod> -n myapp 8080:8080
    # in a second terminal (port-forward keeps running in the first):
    curl http://localhost:8080/healthz

    # check the service's DNS name from inside the cluster
    kubectl run debug --rm -it --image=busybox:1.36 -n myapp -- \
      nslookup myapp.myapp.svc.cluster.local
More on Services, ports, and Ingress in chapter 11.
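For Cause 1, the "character for character" comparison can be sketched in shell. The input format here is an assumption: both the selector and the labels are given as comma-separated key=value lists, the way --show-labels prints labels; the selector from the jsonpath command would need to be rewritten into that form by hand. selector_matches is a hypothetical helper.

```shell
# selector_matches SELECTOR LABELS
# Both arguments are comma-separated key=value lists, e.g. "app=web,tier=api".
# Succeeds only if every selector pair appears verbatim (case-sensitive) in LABELS.
selector_matches() {
  selector=$1
  labels=",$2,"
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do
    case "$labels" in
      *",$pair,"*) ;;                 # exact pair present
      *) IFS=$old_ifs; return 1 ;;    # missing, or differs even only in case
    esac
  done
  IFS=$old_ifs
  return 0
}

if selector_matches "app=web" "app=Web,pod-template-hash=abc123"; then
  echo "the Service selects the pod"
else
  echo "no match: selector app=web vs label app=Web"
fi
```

The example prints the "no match" line: a single capital letter is enough to leave the endpoints empty.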
Tilt doesn't pick up changes
You saved a file, you expect Tilt to update the container instantly — but nothing happens, or Tilt does a full image rebuild every time instead of a fast Live Update. Let's figure out how this works.
A Live Update in the Tiltfile consists of steps, and the order matters (Tilt: Live Update Reference):
- fall_back_on(...): always first; lists files whose change forces a full rebuild (for example, requirements.txt).
- sync('./app', '/code/app'): copies the changed files into the running container (/code is the WORKDIR from the Dockerfile in chapter 6).
- run('...'): runs after all the syncs (for example, to reinstall dependencies).
- restarting the process: needed if the application can't hot-reload. For Kubernetes this isn't a separate step but the docker_build_with_restart wrapper from the restart_process extension (covered in detail in chapter 8); the built-in restart_container() step from the reference remains relevant mainly for Docker Compose.
    docker_build(
        'k3d-registry.localhost:5000/myapp', '.',
        live_update=[
            fall_back_on('requirements.txt'),        # 1) force a full rebuild
            sync('./app', '/code/app'),              # 2) copy the code
            run('pip install -r requirements.txt',
                trigger=['requirements.txt']),       # 3) run after sync
        ],
    )
Why changes "aren't picked up" or a full rebuild happens:
- The synced path is outside the build context. The rule is simple: "if Tilt is watching it, you can sync it," but you can only sync something that's inside the build context (the second argument to docker_build). Tilt ignores a file outside it.
- The file is in the context but not covered by any sync(): the change is there, but no rule tells Tilt to copy it.
- A file from fall_back_on changed: by design this triggers a full rebuild, not a Live Update.
- run() comes before sync(): the step order is broken.
And one separate, very common FastAPI trap: without restart_process, the synced code lands in the container but the process doesn't re-read it. The file is already new, but uvicorn keeps running the old code in memory — it looks exactly like "the changes didn't apply."
The solution depends on how the application is launched. The simplest option for myapp is to run uvicorn with auto-reload — then it picks up the synced files on its own and you don't need restart_process:
    # uvicorn re-reads the code after sync: Live Update without restarting the process
    CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8080", "--reload"]
If the application is launched without --reload (or it's custom_build / Docker Compose), restarting the process for a Kubernetes resource is done with the docker_build_with_restart wrapper from the restart_process extension (the same one as in chapter 8):
    load('ext://restart_process', 'docker_build_with_restart')

    docker_build_with_restart(
        'k3d-registry.localhost:5000/myapp', '.',
        entrypoint=['uvicorn', 'app.main:app', '--host', '0.0.0.0', '--port', '8080'],
        live_update=[
            sync('./app', '/code/app'),  # the code is synced, and the wrapper restarts the process
        ],
    )
k3d / Docker ate your disk or memory
After a couple of weeks of active development, you suddenly notice the disk is running out and Docker has swallowed tens of gigabytes. The culprits are accumulated images and build layers.
Why it "doesn't clean up by itself"
The kubelet has a built-in garbage collector for images, but it's lazy. By default imageGCHighThresholdPercent = 85%, imageGCLowThresholdPercent = 80%: until the node's disk usage exceeds 85%, cleanup doesn't run at all. Once it does, the kubelet deletes the least recently used images until usage drops back to 80%. Image GC runs roughly every 5 minutes, container GC every minute (Kubernetes: Garbage Collection).
The takeaway: space "piles up" on purpose, and it may look like GC is broken — but it's just waiting for the 85% threshold.
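The two thresholds can be turned into a small calculation. This is a sketch of the rule as described above, not of the kubelet's actual code; gc_plan is a made-up name, and the sizes are whole GiB to keep the shell arithmetic in integers.

```shell
HIGH=85   # imageGCHighThresholdPercent (default)
LOW=80    # imageGCLowThresholdPercent (default)

# gc_plan USED_GIB CAPACITY_GIB: print how many GiB image GC would try
# to free, or 0 if usage has not crossed the high threshold.
gc_plan() {
  used=$1
  capacity=$2
  percent=$(( used * 100 / capacity ))
  if [ "$percent" -le "$HIGH" ]; then
    echo 0
  else
    echo $(( used - capacity * LOW / 100 ))
  fi
}

gc_plan 80 100   # 0:  80% is below the 85% trigger, nothing is deleted
gc_plan 90 100   # 10: above 85%, free images until usage is back at 80%
```

So on a 100 GiB disk you can sit at 80 GiB of images indefinitely without the kubelet doing anything about it.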
Cleaning up Docker by hand
First, look at what's actually taking up space:
    docker system df      # summary: images, containers, volumes, build cache
    docker system df -v   # detailed, line by line
Then clean up incrementally (Docker: docker system prune):
    docker system prune                # stopped containers, unused networks,
                                       # dangling images, and build cache
    docker builder prune               # build cache only
    docker system prune -a             # ALL unused images, not just dangling ones
    docker system prune -a --volumes   # plus anonymous volumes (careful with data!)
Two important gotchas:
- docker system prune -a can wipe out images you still need. If you imported an image into k3d with k3d image import, an aggressive cleanup deletes the source from Docker. The nodes keep their own imported copy, but once that copy is gone (for example, after you recreate the cluster), you'll hit ImagePullBackOff until you rebuild the image and import it again.
- --volumes touches anonymous volumes; if they held data (for example, a local PostgreSQL for myapp), it's gone. By default prune doesn't touch volumes, and that's the right behavior.
The cleanest reset
If you want to reliably free everything the cluster took and start from a clean slate, it's easier to delete and recreate the cluster itself. This removes all the node containers, their images, and their layers in one go:
    k3d cluster delete dev
    k3d cluster create dev --registry-create k3d-registry.localhost:5000
About memory
k3d nodes' memory is bounded by the Docker VM's resources, just like in the OOMKilled section. If the cluster simply doesn't have enough memory, what you need to increase is not "limits in Kubernetes" but the memory allocated to the Docker VM in the Docker Desktop / colima settings. Pods live inside that VM and can't get more than it has.
The general algorithm for any problem
It's always the same: kubectl get pods shows the status → kubectl describe pod explains the cause in Reason/Events → kubectl logs --previous gives the details of the crash. Ninety percent of cases are something covered above. For the rest, the chapter on debugging and observability and the official documentation below will help.
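As a memory aid, the same routine can be written as a small lookup from status to the next command. The mapping is only a condensed reading of this chapter, next_step is a hypothetical helper, and <pod> and the myapp namespace are placeholders as everywhere above.

```shell
# next_step STATUS: print the command this chapter suggests running next.
next_step() {
  case "$1" in
    ImagePullBackOff|ErrImagePull|Pending|OOMKilled)
      echo "kubectl describe pod <pod> -n myapp" ;;
    CrashLoopBackOff)
      echo "kubectl logs <pod> -n myapp --previous" ;;
    Running)
      # Running but the Service is silent: check what stands behind it
      echo "kubectl get endpoints myapp -n myapp" ;;
    *)
      echo "kubectl get events -n myapp --sort-by=.lastTimestamp" ;;
  esac
}

next_step CrashLoopBackOff
next_step Pending
```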
Sources
- Debug Pods — Kubernetes
- Configure Liveness, Readiness and Startup Probes — Kubernetes
- Garbage Collection — Kubernetes
- Troubleshoot CrashLoopBackOff events — GKE
- Live Update Reference — Tilt
- Using Image Registries — k3d
- How to Fix OOMKilled Kubernetes Error (Exit Code 137) — Komodor
- How to Debug Kubernetes Service Not Reaching Pods — OneUptime
- How to Use Docker Images with k3d (k3s in Docker) — OneUptime
- docker system prune — Docker Docs