Local Kubernetes Dev — Part 15: Common problems and how to fix them

Sooner or later kubectl get pods will show you something other than a cozy Running — maybe ImagePullBackOff or CrashLoopBackOff. That's normal: even experienced engineers see these statuses every week. The good news is that almost every Kubernetes problem is diagnosed with the same three commands, and most errors come down to a short list of common causes.

Memorize this "golden set" — you'll need it in every section below:

bash

# 1. Describe the pod: status, reason, events (ALWAYS the starting point)
kubectl describe pod <pod> -n myapp

# 2. Application logs (and logs of a CRASHED container via --previous)
kubectl logs <pod> -n myapp
kubectl logs <pod> -n myapp --previous

# 3. Timeline of events across the whole namespace
kubectl get events -n myapp --sort-by=.lastTimestamp

kubectl describe shows the State field (the container's current state), Reason (why it's in that state), and the Events block at the bottom — those are the three most important places to look. Let's go through the common statuses in order.

ImagePullBackOff / ErrImagePull

These two statuses are about an image that Kubernetes couldn't pull. First you see ErrImagePull (the first failed attempt), and after several retries with an increasing pause (backoff) the pod moves to ImagePullBackOff. The pod stays in the Waiting state and never starts.

Let's look at the reason:

bash

kubectl describe pod <pod> -n myapp
# In the output, look for:
#   State:   Waiting
#   Reason:  ImagePullBackOff
#   Events:  Failed to pull image "...": ... not found / unauthorized / no such host

Common causes:

A typo in the image name or tag. The most basic and most common. Check image: in the manifest against what actually exists.
A private registry you don't have access to. You need imagePullSecrets (for secrets, see chapter 10).
Rate limiting on Docker Hub. Anonymous users are limited in how many pulls they can do; the message will say toomanyrequests.
Network problems or a typo in the registry address — no such host.
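The causes above can be told apart by the text of the Failed event. Here is a minimal sketch of that lookup in Python; the function name and hint wording are made up for illustration, and real kubelet messages vary by runtime and registry, so treat the substrings as a starting point rather than an exhaustive list.

```python
# Substrings taken from the causes listed above; real messages vary.
PULL_ERROR_HINTS = [
    ("toomanyrequests", "Docker Hub rate limit"),
    ("unauthorized", "private registry: check imagePullSecrets"),
    ("no such host", "wrong registry address or network problem"),
    ("not found", "typo in the image name or tag"),
]


def classify_pull_error(event_message: str) -> str:
    """Return a hint for the first known substring found in the event text."""
    lowered = event_message.lower()
    for needle, hint in PULL_ERROR_HINTS:
        if needle in lowered:
            return hint
    return "unknown: read the full Events block"


if __name__ == "__main__":
    print(classify_pull_error('Failed to pull image "myap:dev": not found'))
```

Paste the message from the Events block of kubectl describe pod into the function to get a first guess at which cause to check.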

The big gotcha specific to k3d

k3d nodes run containerd, a container runtime separate from Docker, and its image store is isolated from your Docker daemon. This leads to a counterintuitive fact: the cluster doesn't see the image you just built locally with docker build. Docker has it, the cluster doesn't, and you get ImagePullBackOff even though "the image is built" (OneUptime: Docker images with k3d). This is just a quick summary for diagnosis; image delivery into k3d is covered in detail in the chapter on containerization.

There are two correct ways to deliver an image into the cluster.

Option 1. Import the image into the nodes directly:

bash

docker build -t myapp:dev .
k3d image import myapp:dev -c dev

Option 2 (recommended). A local registry. When you create the cluster, bring up the built-in registry, and in your manifests reference the full name with the address and port (k3d: Using Image Registries):

bash

# the registry is created together with the cluster
k3d cluster create dev --registry-create k3d-registry.localhost:5000

yaml

# in the Deployment the full name matters: registry address + port + tag
containers:
  - name: myapp
    image: k3d-registry.localhost:5000/myapp:dev

A common mistake here is an incomplete image name: writing myapp:dev instead of k3d-registry.localhost:5000/myapp:dev, or forgetting the port or address. The cluster will go looking for the image in the wrong place and fall into ImagePullBackOff again.
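To see why the short name goes "to the wrong place," it helps to expand an image reference the way container runtimes conventionally do: a name without a registry host is assumed to live on Docker Hub. The sketch below is a simplified Python model of that convention, not the runtime's actual parser; the function name is hypothetical and digests (@sha256:...) are deliberately not handled.

```python
def normalize_image(ref: str) -> tuple[str, str, str]:
    """Split an image reference into (registry, repository, tag).

    Simplified model: digests (@sha256:...) are not handled.
    """
    first, slash, rest = ref.partition("/")
    # The first component counts as a registry only if it looks like a host.
    if slash and ("." in first or ":" in first or first == "localhost"):
        registry, remainder = first, rest
    else:
        registry, remainder = "docker.io", ref
    repository, colon, tag = remainder.rpartition(":")
    if not colon or "/" in tag:
        repository, tag = remainder, "latest"
    # Docker Hub keeps single-name images under the "library/" namespace.
    if registry == "docker.io" and "/" not in repository:
        repository = "library/" + repository
    return registry, repository, tag


if __name__ == "__main__":
    print(normalize_image("myapp:dev"))
    print(normalize_image("k3d-registry.localhost:5000/myapp:dev"))
```

The first call resolves to Docker Hub, where no myapp image of yours exists; only the second one points at the local registry.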

If you use Tilt, you can forget about importing: docker_build in the Tiltfile builds and delivers the image into the cluster for you. But there's a condition — the image name in the Tiltfile must exactly match the image: value in the manifest. If it doesn't match, Tilt builds one image while the Deployment asks for another, and you get the same ImagePullBackOff (more on Tilt in chapter 8).

CrashLoopBackOff

CrashLoopBackOff means: the container starts, crashes, the kubelet restarts it, it crashes again — and round and round. To avoid spinning on restarts pointlessly, the kubelet increases the pause between attempts exponentially: roughly 10s → 20s → 40s and so on, with a ceiling of 5 minutes (GKE: Troubleshoot CrashLoopBackOff).
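The doubling with a ceiling is easy to reproduce. A small Python sketch, assuming the 10-second base and 5-minute cap quoted above (the function name is made up for illustration):

```python
def crashloop_delays(restarts: int, base: int = 10, cap: int = 300) -> list[int]:
    """Approximate delay in seconds before each of the first `restarts` restarts."""
    return [min(base * 2 ** attempt, cap) for attempt in range(restarts)]


if __name__ == "__main__":
    print(crashloop_delays(7))
```

After the fifth restart the delay stays at the cap, which is why a pod that has been crashing for a while seems to "hang" for minutes between attempts.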

The status by itself doesn't tell you why it crashed. The key diagnostic command is the logs of the previous, already-dead instance of the container:

bash

kubectl logs <pod> -n myapp --previous

Without --previous you'll see the logs of the freshly started container, which most likely hasn't had time to write anything. --previous pulls the stack trace from exactly the instance that crashed. Additionally, in kubectl describe pod look at the Last State: Terminated block and its Exit Code.

Common causes:

A bug in the application — an exception at startup, a non-zero exit code. For our myapp the classic case is missing database connection variables (DB_HOST, DB_PASSWORD, etc.), and FastAPI crashes when it tries to connect to PostgreSQL at startup.
A missing env variable or config (a ConfigMap/Secret isn't mounted).
A dependency is unavailable — PostgreSQL hasn't come up yet, and the service doesn't know how to wait.
OOM — the application ran out of memory (see the OOMKilled section below).
An overly strict liveness probe kills the container before it has time to warm up. The fix is initialDelaySeconds or a separate startup probe (Kubernetes: Probes).
The container exited with exit code 0. Counterintuitive, but for a long-running service this also ends in CrashLoopBackOff: pods in a Deployment run with restartPolicy: Always, so the kubelet restarts the container even when the process "exited successfully." Usually the cause is an incorrect entrypoint/command that runs and exits instead of starting uvicorn.

A useful trick when the container crashes instantly and the logs are empty: temporarily override the command to "do nothing" so the container survives, and get inside it by hand.

yaml

# temporary, in the Deployment, so the container doesn't crash and you can get inside
command: ["sleep", "infinity"]

bash

kubectl exec -it <pod> -n myapp -- sh
# inside: check env, try running uvicorn by hand, and read the real error
env | grep DB_
uvicorn app.main:app --host 0.0.0.0 --port 8080

Pending

A pod in the Pending status hasn't been assigned to any node yet — the scheduler couldn't find a place to put it. The command is the same:

bash

kubectl describe pod <pod> -n myapp
# In Events, look for:
#   Warning  FailedScheduling  ... 0/1 nodes are available: 1 Insufficient cpu, 1 Insufficient memory

The FailedScheduling message usually names the cause directly (Kubernetes: Debug Pods):

Not enough CPU or memory for the pod's requests — the most common cause on a local cluster. If you set myapp to request 4 CPUs but your k3d VM only has 2, the pod will never be scheduled.
A nodeSelector / affinity mismatch — the pod asks for a node with a label that doesn't exist.
Taints without tolerations — the node is "marked" in a way that keeps ordinary pods off it.
An unbound PVC — the pod is waiting for a volume that can't be created.
hostPort — the port is already taken on the node, so there's nowhere to place the pod.

Solutions, in order: lower requests to something reasonable, fix the selector/PVC, and if it's just a small local cluster — add nodes or recreate it bigger:

bash

# add agents (worker nodes) to an existing cluster
k3d node create extra --cluster dev --role agent

# or recreate the cluster with several agents
k3d cluster delete dev
k3d cluster create dev --agents 2 --registry-create k3d-registry.localhost:5000

For myapp, sensible requests on a local machine are something modest, for example 100m CPU and 128Mi of memory; don't blindly copy "production" numbers into a small cluster.
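The scheduler's "Insufficient cpu / memory" check boils down to comparing requests with what the node still has free. The Python sketch below models only that comparison, under simplifying assumptions: the function names are hypothetical, only the m suffix for CPU and Ki/Mi/Gi for memory are parsed, and the "free" figures are whatever you read from kubectl describe node.

```python
MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def cpu_millicores(quantity: str) -> int:
    """Parse a CPU quantity such as "100m" or "4" into millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def memory_bytes(quantity: str) -> int:
    """Parse a memory quantity such as "128Mi" into bytes (Ki/Mi/Gi only)."""
    for suffix, factor in MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def fits(req_cpu: str, req_mem: str, free_cpu: str, free_mem: str) -> bool:
    """True if the pod's requests fit into the node's free resources."""
    return (cpu_millicores(req_cpu) <= cpu_millicores(free_cpu)
            and memory_bytes(req_mem) <= memory_bytes(free_mem))


if __name__ == "__main__":
    print(fits("4", "128Mi", "2", "4Gi"))     # the 4-CPU request from above
    print(fits("100m", "128Mi", "2", "4Gi"))  # the modest local values
```

Note that the check uses requests, not actual usage: a pod that requests 4 CPUs stays Pending on a 2-CPU node even if the application would idle at 1% load.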

OOMKilled

OOMKilled is "Out Of Memory Killed": the process exceeded its memory limit and the kernel killed it with a SIGKILL signal. It's easy to recognize by exit code 137 = 128 + 9, where 9 is the number of the SIGKILL signal (Komodor: OOMKilled / Exit Code 137).

bash

kubectl describe pod <pod> -n myapp
# Last State:  Terminated
# Reason:      OOMKilled
# Exit Code:   137

Important: it's the Reason: OOMKilled field that distinguishes running out of memory from other cases of SIGKILL. OOM is often the hidden cause of the CrashLoopBackOff from the previous section — the container crashes on memory, restarts, and hits the limit again.
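The 128 + N rule works for any signal, not just SIGKILL. A short Python sketch that decodes an exit code on a POSIX system (the function name is made up for illustration):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal-number convention."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            pass  # not a known signal number on this platform
    return f"application error (exit code {code})"


if __name__ == "__main__":
    for code in (0, 1, 137, 143):
        print(code, describe_exit_code(code))
```

The decoder can only say "SIGKILL" for 137; whether the kill was caused by memory still has to be read from Reason: OOMKilled in kubectl describe pod.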

There are two scenarios:

Container-level OOM. The container exceeded its resources.limits.memory. The fix is either raising the limit or fixing a leak/inefficiency in the code.
Node-level OOM. The whole node ran out of memory, and either the kubelet starts evicting pods or the kernel's OOM killer steps in. On k3d this is especially sneaky: the nodes live inside a Docker VM (Docker Desktop, colima, etc.), and if your pods' combined limits exceed that VM's memory, their actual usage can exhaust it, so pods get killed even when each application is within its own limit.

For myapp it looks like this:

yaml

resources:
  requests:
    memory: "128Mi"
  limits:
    memory: "256Mi"   # FastAPI + a couple of workers usually fit; watch for leaks

If you're hitting node-level OOM, either reduce the pods' combined limits or give the Docker VM more memory in its settings. Don't overcommit: the sum of all pods' limits should not exceed the VM's memory.

The Service doesn't respond (selector / ports / readiness)

A separate genre of problems: the pods are Running, everything is "green," yet a request to the Service never reaches the application. Almost always the culprit is one of the three breaks in the Service → endpoints → Pod chain.

The first command is to look at the endpoints. If they're empty, the Service didn't find a single pod:

bash

kubectl get endpoints myapp -n myapp
# NAME    ENDPOINTS   AGE
# myapp   <none>      5m      <- bad: no pods behind the service

# on newer clusters the same thing via EndpointSlices:
kubectl get endpointslices -n myapp -l kubernetes.io/service-name=myapp

Cause 1. Selector mismatch. Labels and selectors are case-sensitive: app: Web and app: web are different things, and the Service simply won't pick up the pod. Check the pods' labels against the service's selector (OneUptime: Service not reaching pods):

bash

kubectl get pods -n myapp --show-labels
kubectl get svc myapp -n myapp -o jsonpath='{.spec.selector}'
# the service's selector must match the pods' labels character for character
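The matching rule itself is plain string equality on every selector key. A minimal Python sketch of it, assuming a non-empty equality-based selector like the one a Service uses (the function name is hypothetical):

```python
def selector_matches(selector: dict[str, str], labels: dict[str, str]) -> bool:
    """True if every selector key/value pair appears verbatim in the pod labels."""
    return all(labels.get(key) == value for key, value in selector.items())


if __name__ == "__main__":
    pod_labels = {"app": "Web", "tier": "backend"}
    print(selector_matches({"app": "web"}, pod_labels))  # case differs
    print(selector_matches({"app": "Web"}, pod_labels))  # exact match
```

Extra labels on the pod don't matter; a single differing character in any selected key or value does.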

Cause 2. Port mismatch. The targetPort in the Service must point to the containerPort the application actually listens on. For myapp that's 8080 (Kubernetes: Debug Pods):

bash

kubectl get svc myapp -n myapp -o yaml | grep -A3 ports
# check targetPort against the containerPort in the Deployment (everywhere it's 8080 for us)

Cause 3. The readiness probe doesn't pass. This is the trickiest of the three. If the readiness probe is red, the pod shows 0/1 Ready, and Kubernetes removes it from the endpoints — traffic doesn't flow. Meanwhile the container doesn't restart and in kubectl get pods looks like Running. This is precisely how readiness differs from liveness: a liveness failure restarts the container, while a readiness failure only takes it out of traffic (Kubernetes: Probes). The result is "the service silently doesn't respond, yet the pod seems alive."

bash

kubectl get pods -n myapp
# NAME            READY   STATUS    RESTARTS
# myapp-xxxx      0/1     Running   0          <- 0/1: readiness didn't pass
kubectl describe pod <pod> -n myapp | grep -A5 Readiness
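The difference between the two probes fits into a tiny truth table. The Python sketch below is a deliberately simplified model (names are made up, and failure thresholds and timing are ignored): it only records what each probe result leads to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeOutcome:
    restarted: bool      # the kubelet restarts the container
    in_endpoints: bool   # the Service routes traffic to the pod


def probe_outcome(liveness_ok: bool, readiness_ok: bool) -> ProbeOutcome:
    """Liveness failure restarts the container; readiness failure only removes it from traffic."""
    return ProbeOutcome(
        restarted=not liveness_ok,
        in_endpoints=liveness_ok and readiness_ok,
    )


if __name__ == "__main__":
    # The "silent" case: alive, never restarted, but receiving no traffic.
    print(probe_outcome(liveness_ok=True, readiness_ok=False))
```

The middle case (liveness passes, readiness fails) is the one that shows up as 0/1 Running with zero restarts.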

A good way to localize the problem is to bypass the Service and go straight to the pod via port-forward. If that works but going through the Service doesn't, the issue is in the selector/endpoints/readiness, not the application:

bash

# port-forward blocks the terminal; run curl in a second one
kubectl port-forward <pod> -n myapp 8080:8080
curl http://localhost:8080/healthz

# check the service's DNS name from inside the cluster
kubectl run debug --rm -it --restart=Never --image=busybox:1.36 -n myapp -- \
  nslookup myapp.myapp.svc.cluster.local

More on Services, ports, and Ingress in chapter 11.

Tilt doesn't pick up changes

You saved a file, you expect Tilt to update the container instantly — but nothing happens, or Tilt does a full image rebuild every time instead of a fast Live Update. Let's figure out how this works.

A Live Update in the Tiltfile consists of steps, and the order matters (Tilt: Live Update Reference):

fall_back_on(...) — always first; lists files whose change forces a full rebuild (for example, requirements.txt).
sync('./app', '/code/app') — copies the changed files into the running container (/code is the WORKDIR from the Dockerfile in chapter 6).
run('...') — runs after all the syncs (for example, to reinstall dependencies).
restarting the process — needed if the application can't hot-reload. For Kubernetes this isn't a separate step but the docker_build_with_restart wrapper from the restart_process extension (covered in detail in chapter 8); the built-in restart_container() step from the reference remains relevant mainly for Docker Compose.

python

docker_build(
    'k3d-registry.localhost:5000/myapp', '.',
    live_update=[
        fall_back_on('requirements.txt'),   # 1) force a full rebuild
        sync('./app', '/code/app'),         # 2) copy the code
        # 3) run after sync; shown for the step order only: while
        #    requirements.txt is in fall_back_on, the full rebuild wins
        run('pip install -r requirements.txt',
            trigger=['requirements.txt']),
    ],
)

Why changes "aren't picked up" or a full rebuild happens:

The synced path is outside the build context. The rule is simple: "if Tilt is watching it, you can sync it," and Tilt only watches what is inside the build context (the second argument to docker_build). A file outside the context is ignored.
The file is in the context but not covered by any sync(): Tilt sees the change, but no rule tells it where to copy the file, so it falls back to a full rebuild.
A file from fall_back_on changed — by design this triggers a full rebuild, not a Live Update.
run() comes before sync() — the step order is broken.
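The first three rules can be condensed into one decision function. This is a simplified Python model of the behavior described above, not Tilt's implementation: the function name, the absolute example paths, and the returned strings are all assumptions made for illustration.

```python
from pathlib import PurePosixPath


def live_update_action(changed: str, context: str,
                       fall_back_on: list[str], sync_sources: list[str]) -> str:
    """Decide what a file change leads to, given paths relative to the build context."""
    path, ctx = PurePosixPath(changed), PurePosixPath(context)
    if not path.is_relative_to(ctx):
        return "ignored"        # outside the build context: not watched
    rel = path.relative_to(ctx)
    if rel in {PurePosixPath(f) for f in fall_back_on}:
        return "full rebuild"   # explicitly listed in fall_back_on
    if any(rel.is_relative_to(PurePosixPath(src)) for src in sync_sources):
        return "live update"    # covered by a sync() rule
    return "full rebuild"       # in the context, but no sync() matches


if __name__ == "__main__":
    args = ("/proj", ["requirements.txt"], ["./app"])
    for f in ("/proj/app/main.py", "/proj/requirements.txt",
              "/proj/README.md", "/shared/lib.py"):
        print(f, "->", live_update_action(f, *args))
```

If a file you expect to hot-reload lands in the "full rebuild" or "ignored" branch of this model, the fix is in the Tiltfile (context or sync paths), not in the application.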

And one separate, very common FastAPI trap: without restart_process, the synced code lands in the container but the process doesn't re-read it. The file is already new, but uvicorn keeps running the old code in memory — it looks exactly like "the changes didn't apply."

The solution depends on how the application is launched. The simplest option for myapp is to run uvicorn with auto-reload — then it picks up the synced files on its own and you don't need restart_process:

dockerfile

# uvicorn re-reads the code after sync: Live Update without restarting the process
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8080", "--reload"]

If the application is launched without --reload (or it's custom_build / Docker Compose), restarting the process for a Kubernetes resource is done with the docker_build_with_restart wrapper from the restart_process extension (the same one as in chapter 8):

python

load('ext://restart_process', 'docker_build_with_restart')

docker_build_with_restart(
    'k3d-registry.localhost:5000/myapp', '.',
    entrypoint=['uvicorn', 'app.main:app', '--host', '0.0.0.0', '--port', '8080'],
    live_update=[
        sync('./app', '/code/app'),   # the code is synced, and the wrapper restarts the process
    ],
)

k3d / Docker ate your disk or memory

After a couple of weeks of active development, you suddenly notice the disk is running out and Docker has swallowed tens of gigabytes. The culprits are accumulated images and build layers.

Why it "doesn't clean up by itself"

The kubelet has a built-in garbage collector for images, but it's lazy. By default imageGCHighThresholdPercent = 85%, imageGCLowThresholdPercent = 80%: until the node's disk usage exceeds 85%, cleanup doesn't run at all. Once it does, the kubelet deletes the least recently used images until usage drops back to 80%. Image GC runs roughly every 5 minutes, container GC every minute (Kubernetes: Garbage Collection).

The takeaway: space "piles up" on purpose, and it may look like GC is broken — but it's just waiting for the 85% threshold.
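The two thresholds translate into simple arithmetic. The Python sketch below models the behavior as described above, with the default 85/80 percentages; the function name is hypothetical and the real kubelet works on filesystem statistics rather than two integers.

```python
def image_gc_bytes_to_free(capacity: int, used: int,
                           high_percent: int = 85, low_percent: int = 80) -> int:
    """How much image GC tries to free: nothing until usage exceeds the high
    threshold, then enough to get back down to the low threshold."""
    if used * 100 <= capacity * high_percent:
        return 0
    return used - capacity * low_percent // 100


if __name__ == "__main__":
    GIB = 2**30
    print(image_gc_bytes_to_free(100 * GIB, 84 * GIB) // GIB)  # below threshold
    print(image_gc_bytes_to_free(100 * GIB, 90 * GIB) // GIB)  # above threshold
```

On a 100 GiB disk this means 84 GiB of images and layers can sit there indefinitely without any cleanup being triggered.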

Cleaning up Docker by hand

First, look at what's actually taking up space:

bash

docker system df          # summary: images, containers, volumes, build cache
docker system df -v       # detailed, line by line

Then clean up incrementally (Docker: docker system prune):

bash

docker system prune               # stopped containers, unused networks,
                                  # dangling images, and build cache
docker builder prune              # build cache only
docker system prune -a            # ALL unused images, not just dangling ones
docker system prune -a --volumes  # plus anonymous volumes (careful with data!)

Two important gotchas:

docker system prune -a can wipe out images the cluster needs. If you imported an image into k3d with k3d image import, an aggressive cleanup deletes the source from Docker. The copy already inside the nodes survives, but once the cluster is recreated or the node's own GC evicts the image, pods hit ImagePullBackOff until you rebuild and import it again.
--volumes touches anonymous volumes; if they held data (for example, a local PostgreSQL for myapp), it's gone. By default prune doesn't touch volumes, and that's the right behavior.

The cleanest reset

If you want to reliably free everything the cluster took and start from a clean slate, it's easier to delete and recreate the cluster itself. This removes all the node containers, their images, and their layers in one go:

bash

k3d cluster delete dev
k3d cluster create dev --registry-create k3d-registry.localhost:5000

About memory

k3d nodes' memory is bounded by the Docker VM's resources, just like in the OOMKilled section. If the cluster simply doesn't have enough memory, what you need to increase is not "limits in Kubernetes" but the memory allocated to the Docker VM in the Docker Desktop / colima settings. Pods live inside that VM and can't get more than it has.

The general algorithm for any problem

It's always the same: kubectl get pods shows the status → kubectl describe pod explains the cause in Reason/Events → kubectl logs --previous gives the details of the crash. Most cases are something covered above. For the rest, the chapter on debugging and observability and the official documentation below will help.
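If you want this sequence at your fingertips, it can be generated for any pod. A minimal Python sketch that only builds the argument lists (the function name is made up, and the default namespace is the myapp one used throughout this series):

```python
def triage_commands(pod: str, namespace: str = "myapp") -> list[list[str]]:
    """Return the diagnostic kubectl commands in the order they should be run."""
    ns = ["-n", namespace]
    return [
        ["kubectl", "get", "pods", *ns],
        ["kubectl", "describe", "pod", pod, *ns],
        ["kubectl", "logs", pod, *ns, "--previous"],
        ["kubectl", "get", "events", *ns, "--sort-by=.lastTimestamp"],
    ]


if __name__ == "__main__":
    for argv in triage_commands("myapp-xxxx"):
        print(" ".join(argv))
```

Each list can be passed as-is to subprocess.run if you prefer to execute the commands from a script rather than copy them.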

Sources