Local Kubernetes Dev — Part 13: Making your local setup truly production-like

By now myapp is already running in your local k3d cluster (cluster dev, namespace myapp), Tilt rebuilds the image on every change, and you have an Ingress and a PostgreSQL alongside it. Technically, everything works. But "works on my laptop" and "survives production" are two different things.

The good news: almost everything that separates a textbook manifest from a production-ready one can be tested right on your local setup. The behavior of probes, restarts triggered by liveness, how a rolling update proceeds, how the application shuts down on a signal — k3d reproduces all of this just like a real cluster. That means you can catch these traps at your own desk rather than in production on a Friday evening.

Earlier chapters (the first chapter, the one on production-like environments, and the one on manifests) all pointed here: this is where we finally give myapp the probes and requests/limits that were described there as hallmarks of production readiness. One at a time, we'll add to the Deployment the things that make a service resilient: resource limits, three kinds of probes, multiple replicas, graceful termination, and a zero-downtime rollout strategy. At the end comes a checklist.

Resource requests/limits: CPU and memory, OOMKilled

Every container can (and should) be described with two sets of numbers:

requests: how much of each resource the container is guaranteed to get. The scheduler (the component that decides which node to place a Pod on) uses these numbers: it looks for a node with enough unreserved capacity to cover the requests.
limits: the ceiling the container must not exceed. These are enforced on the node by the kubelet (the Kubernetes agent on every node) and the container runtime, through the Linux kernel's cgroups.

yaml

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 256Mi

Units. CPU: 1 is one core, 500m is "500 milli," i.e. half a core, with a minimum precision of 1m. Memory is in bytes, but it's more convenient to write it with suffixes: binary Ki/Mi/Gi (1 Mi = 1024 Ki) or decimal M/G.
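To make the units concrete, here is a minimal sketch that converts the quantities above into plain numbers. The helper names parse_cpu and parse_memory are hypothetical (they are not part of any Kubernetes client), and only the suffixes mentioned in this chapter plus lowercase k are handled.

```python
from decimal import Decimal

# Binary (power-of-two) and decimal memory suffixes used in Kubernetes quantities.
_MEMORY_SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3,
    "k": 1000, "M": 1000**2, "G": 1000**3,
}


def parse_cpu(quantity: str) -> Decimal:
    """Return the CPU quantity in cores, e.g. "500m" is half a core."""
    if quantity.endswith("m"):
        return Decimal(quantity[:-1]) / 1000
    return Decimal(quantity)


def parse_memory(quantity: str) -> int:
    """Return the memory quantity in bytes."""
    # Check two-letter suffixes first so "Mi" is not mistaken for "M".
    for suffix in sorted(_MEMORY_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * _MEMORY_SUFFIXES[suffix])
    return int(quantity)  # a bare number is already in bytes
```

With these helpers, parse_memory("256M") comes out smaller than parse_memory("256Mi"), which is the practical reason to pick one suffix family and stick to it.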

The key difference between CPU and memory is what happens when a limit is exceeded, and it explains half of all mysterious restarts:

CPU is a compressible resource. If a container hits its cpu limit, it simply gets throttled (slowed down). The container stays alive, it just runs slower.
Memory is NOT compressible. You can't take back memory that's already been handed out. So when the memory limit is exceeded, the kernel kills the process — this is the famous OOMKilled (Out Of Memory) with exit code 137.

So don't confuse the two behaviors: a CPU limit that's set too low gives you slowdowns, while a memory limit that's set too low gets the container killed and restarted on the spot. Details are in the official documentation, Resource Management for Pods and Containers.

A small trap: if you set a limit without a request, Kubernetes silently sets request = limit. Sometimes this inflates your demands on the scheduler more than you intended.

QoS classes: who gets killed first

Based on the presence of requests/limits, Kubernetes assigns the Pod a QoS class (Quality of Service), which determines who gets evicted first under memory pressure:

Guaranteed — all containers have request = limit for both CPU and memory. Evicted last.
Burstable — there are requests, but it's not Guaranteed. The typical case.
BestEffort — no requests and no limits. Evicted first.

In other words, a Pod with no resources at all (BestEffort) is the first candidate for eviction when a node runs short on memory. This is one more reason to always specify at least requests.
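The three rules above can be written down as a small classifier. This is a simplified sketch, not the kubelet's actual code: qos_class is a hypothetical helper, it takes each container's resources block as a plain dict, and it compares quantities as strings (so "1" and "1000m" would count as different).

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a Pod from its containers' `resources` dicts (simplified)."""
    any_set = False
    all_guaranteed = True
    for resources in containers:
        limits = resources.get("limits", {})
        # A limit without a request means the request defaults to the limit.
        requests = {**limits, **resources.get("requests", {})}
        if requests or limits:
            any_set = True
        for name in ("cpu", "memory"):
            if name not in limits or requests.get(name) != limits[name]:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"
```

Feeding it the resources block from the beginning of the chapter (100m/128Mi requests, 500m/256Mi limits) yields Burstable, which matches "the typical case" in the list.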

How to pick the numbers

Don't guess. The official Kubernetes blog advises you to measure first, then set. You can look at actual consumption right in k3d:

bash

kubectl top pods -n myapp

(kubectl top requires metrics-server; in k3d it's usually already enabled.)

A practical approach: start with a reasonable minimum (100m CPU and 128Mi memory) and tune based on what you measure. For memory, keep some headroom (for example, a request around P99 consumption plus ~20%, with the limit at or above that), because a memory limit that's too tight = OOMKilled. For CPU you can be more modest (P95 is a good reference), since the worst that can happen is throttling, not death.
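As an illustration of the P95/P99 rule of thumb, the sketch below turns a series of usage samples into suggested requests. It assumes you have already collected the samples yourself (for example by noting kubectl top output over time); the function names and the nearest-rank percentile method are choices made for this example, not something Kubernetes prescribes.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_requests(cpu_millicores: list[float], memory_mib: list[float]) -> dict:
    """Suggest requests: P95 for CPU, P99 plus 20% headroom for memory."""
    return {
        "cpu": f"{math.ceil(percentile(cpu_millicores, 95))}m",
        "memory": f"{math.ceil(percentile(memory_mib, 99) * 1.2)}Mi",
    }
```

The result is only as good as the samples: a few minutes of idle traffic on a laptop will suggest numbers that are far too low for a loaded service.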

Liveness probe: "the app is hung — restart it"

Kubernetes considers a container alive as long as its process hasn't exited. The problem: a process can be alive but hung — a deadlock, a stuck event loop, a leak after which the service stops responding. From the cluster's point of view everything is fine: the process is there.

The liveness probe solves exactly this. The kubelet periodically pokes the container, and if the probe fails failureThreshold times in a row, the kubelet kills the container and restarts it (by default, restartPolicy: Always).

For myapp we'll add a lightweight endpoint in FastAPI:

python

from fastapi import FastAPI

app = FastAPI()

@app.get("/healthz")
def healthz():
    # The check should be simple: is the process itself alive.
    return {"status": "ok"}

And we'll declare the probe in the Deployment:

yaml

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3

The key rule (especially emphasized by the Kubernetes blog): keep liveness simple. Don't reach out to the database or external APIs from it. A complex check produces false positives: a dependency goes down → liveness fails → the kubelet restarts a perfectly healthy container. For checking dependencies there's readiness (below). The behavior of probes is described in detail in Configure Liveness, Readiness and Startup Probes.

Readiness probe: "not ready to take traffic"

The readiness probe answers a different question: is the Pod ready to serve requests right now?

The difference from liveness is fundamental:

liveness fails → the container is restarted;
readiness fails → the Pod is marked unready and removed from the Service's endpoints (traffic via the Service no longer goes to it), but the container is NOT restarted.

In other words, readiness means "let's temporarily step away from taking traffic, but no need to restart." Ideal for: warmup/initialization after startup, and for the case where a dependency is temporarily unavailable (for example, PostgreSQL briefly dropping out).

For myapp it makes sense to tie readiness to actual readiness to work — including database availability:

python

@app.get("/ready")
def ready():
    # This is the right place to check that the connection to PostgreSQL is alive.
    # If the DB is unavailable, return 503: the Pod leaves the endpoints,
    # but is NOT restarted.
    ...
    return {"status": "ready"}

yaml

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5

Readiness and liveness can (and should) be used together for the same container: liveness watches the process itself via /healthz, readiness watches readiness to serve requests via /ready. The set of fields for the probes is the same; only the key differs (livenessProbe / readinessProbe).
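One possible shape for the dependency check behind /ready is sketched below, using only the standard library so it stays framework-neutral. Both function names are hypothetical, and a plain TCP connect is an assumption made for brevity: it only proves that the database port accepts connections, not that queries succeed, so a real check would more likely run a trivial query through the application's own connection pool.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def readiness_response(dependency_ok: bool) -> tuple[int, dict]:
    """Map a dependency check onto the HTTP status the readiness probe will see."""
    if dependency_ok:
        return 200, {"status": "ready"}
    # 503 fails the probe: the Pod leaves the endpoints but is not restarted.
    return 503, {"status": "not ready"}
```

The /ready handler would then return the status code and body produced by readiness_response. Keep the timeout well below the probe's own timeoutSeconds, otherwise the probe gives up before the check does.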

It's easy to verify on your local setup that readiness actually controls traffic: kubectl get endpointslices -n myapp — the Pod appears and disappears from the list depending on the probe's state.

Startup probe: for those that start slowly

Sometimes an application takes a long time to come up: it warms a cache, runs migrations, reads a large config. If you immediately put an aggressive liveness probe on such a container, you get a restart loop: the application is still starting, but liveness has already decided it's dead and killed it.

In the past this was patched with a large initialDelaySeconds on liveness, but that's a crutch: the delay is fixed and the same for everyone. The right tool is the startup probe.

The logic: until the startup probe succeeds, liveness and readiness are completely disabled. The application is allowed to come up in peace. As soon as startup passes, the other probes turn on. And if startup never passes within the allotted number of attempts (failureThreshold), the container is killed and restarted.

yaml

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10

This example gives the application up to 30 × 10 = 300 seconds to start, and liveness can stay frequent and strict — it simply won't interfere until the service comes up.
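The same arithmetic as a tiny helper, in case you want to try other combinations before editing the manifest. The function name is hypothetical, and the result is an approximation: it ignores timeoutSeconds and the small scheduling jitter of the kubelet.

```python
def startup_budget_seconds(failure_threshold: int, period_seconds: int,
                           initial_delay_seconds: int = 0) -> int:
    """Approximate time a container gets to start before the startup probe gives up."""
    return initial_delay_seconds + failure_threshold * period_seconds
```

For the probe above, startup_budget_seconds(30, 10) gives the 300 seconds mentioned in the text. Halving periodSeconds halves the budget unless failureThreshold is doubled to compensate.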

For reference, here are the common probe fields and their defaults (the same for all three types; note that successThreshold must stay at 1 for liveness and startup probes):

Field	Default	What it means
`initialDelaySeconds`	`0`	pause before the first check
`periodSeconds`	`10`	interval between checks
`timeoutSeconds`	`1`	timeout for a single check
`failureThreshold`	`3`	how many failures in a row = failure
`successThreshold`	`1`	how many successes = "ok" again

Probe types: httpGet (success is a response code of 200–399), tcpSocket (the port opened), exec (the command returned 0), grpc. All of this is from the official documentation on probes.
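To see how failureThreshold and successThreshold interact, here is a simplified model of the consecutive-result counting. It is a sketch of the idea rather than the kubelet's implementation; probe_outcomes and its parameters are names invented for this example.

```python
def probe_outcomes(results: list[bool], failure_threshold: int = 3,
                   success_threshold: int = 1, initially_ok: bool = True) -> list[bool]:
    """Return the probe's reported state after each raw check result."""
    ok = initially_ok
    successes = failures = 0
    states = []
    for passed in results:
        if passed:
            # A success resets the failure streak.
            successes, failures = successes + 1, 0
            if not ok and successes >= success_threshold:
                ok = True
        else:
            # A failure resets the success streak.
            failures, successes = failures + 1, 0
            if ok and failures >= failure_threshold:
                ok = False
        states.append(ok)
    return states
```

With the defaults, two failed checks followed by a success change nothing: only three failures in a row flip the state. For a liveness probe that flip means a container restart; for readiness it means leaving the endpoints until the probe succeeds again.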

Multiple replicas and graceful shutdown (SIGTERM)

A single replica (replicas: 1) means that any restart, rollout, or node failure = downtime. For fault tolerance you need more than one:

yaml

spec:
  replicas: 2

But just "adding replicas" isn't enough — you need the Pod to be able to shut down gracefully. Otherwise, with every rollout some requests will be dropped.

What happens when a Pod is deleted

When a Pod is deleted (rollout, scale down, node drain), the sequence is as follows (see Pod Lifecycle and the CNCF breakdown):

The Pod gets a deletionTimestamp, status → Terminating.
The preStop hook runs (if defined).
The kubelet sends SIGTERM to PID 1 of the container.
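The kubelet waits out the rest of terminationGracePeriodSeconds (default 30 seconds; the countdown starts when the Pod is marked Terminating, so time spent in preStop counts).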
If the process hasn't exited — SIGKILL (a hard kill).

There are two important subtleties here.

First subtlety: the application must catch SIGTERM. On receiving the signal, myapp should stop taking new requests, finish the current ones, close the PostgreSQL connections, and exit — all within the grace period. A common mistake: launching the application through a shell wrapper without exec. Then PID 1 is the shell, it doesn't forward SIGTERM, the application never sees the signal and gets SIGKILL'd when the grace period expires. In the Dockerfile, start the process "cleanly" (the exec form of CMD) so that uvicorn becomes PID 1 — we talked about this in the chapter on containerization. A modern uvicorn shuts down correctly on SIGTERM.
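uvicorn handles SIGTERM for you, but for a worker or script with a hand-written main loop the pattern looks roughly like the sketch below. GracefulShutdown, serve, and the two callbacks are hypothetical names; the handler deliberately does nothing except set a flag, and the loop does the actual cleanup.

```python
import signal
import time


class GracefulShutdown:
    """Flips `requested` to True when SIGTERM arrives."""

    def __init__(self) -> None:
        self.requested = False

    def install(self) -> None:
        # signal.signal must be called from the main thread.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame) -> None:
        # Only set a flag; the main loop does the actual cleanup.
        self.requested = True


def serve(shutdown: GracefulShutdown, handle_one_request, cleanup,
          idle_seconds: float = 0.1) -> None:
    """Process work until SIGTERM is seen, then clean up and return."""
    while not shutdown.requested:
        handle_one_request()   # hypothetical unit of work; finishes before we exit
        time.sleep(idle_seconds)
    cleanup()                  # e.g. close the PostgreSQL connections
```

None of this helps if the process is not PID 1 or the signal is never forwarded to it, which is why the exec form of CMD matters.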

Second subtlety: the race with endpoints. Removing the Pod from the Service's endpoints and sending SIGTERM happen in parallel, asynchronously. This means that kube-proxy/Ingress may still send traffic for some time to a Pod that has already received SIGTERM and started shutting down — and those requests will fail with 5xx right during the deploy.

The cure for the race: a preStop sleep

The classic solution (see Graceful shutdown in Kubernetes from Learnk8s) is to add a preStop hook with a small sleep. The Pod is already marked for deletion and is leaving the endpoints, but during the sleep the routing has time to update across the whole cluster, and only then does the application begin to shut down.

yaml

spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: myapp
      # ...
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 15"]

The recommended shutdown order: preStop sleep → SIGTERM → finish in-flight requests → close long-lived connections (DB, WebSocket) → exit.

Important: preStop counts toward the overall terminationGracePeriodSeconds budget. If you set sleep 30 with a grace period of 30 seconds, there will be no time left for the application itself to shut down — and it'll be killed with SIGKILL. Keep preStop noticeably smaller than the grace period (for example, sleep 15 with terminationGracePeriodSeconds: 30).
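The budget can be laid out as a rough timeline. This is an approximate model under the assumptions of this chapter (a fixed preStop sleep and a single container); the function names are made up for the example.

```python
def termination_timeline(grace_period_seconds: int = 30,
                         pre_stop_sleep_seconds: int = 0) -> list[tuple[int, str]]:
    """Approximate (second, event) pairs for one Pod deletion."""
    sigterm_at = min(pre_stop_sleep_seconds, grace_period_seconds)
    return [
        (0, "Pod marked Terminating; preStop hook starts"),
        (sigterm_at, "preStop done; SIGTERM sent"),
        (grace_period_seconds, "SIGKILL if the process is still running"),
    ]


def app_shutdown_budget(grace_period_seconds: int, pre_stop_sleep_seconds: int) -> int:
    """Seconds the application has between SIGTERM and SIGKILL."""
    return max(0, grace_period_seconds - pre_stop_sleep_seconds)
```

With sleep 15 and a 30-second grace period the application gets about 15 seconds to drain; with sleep 30 the budget drops to zero.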

Rolling update and why readiness is the key here

By default, a Deployment uses the RollingUpdate strategy: old Pods are replaced with new ones gradually, not all at once. It's controlled by two parameters (see Deployments):

maxSurge (default 25%) — how many Pods can be brought up above the desired number during the rollout;
maxUnavailable (default 25%) — how many Pods can be unavailable at the same time.

The alternative is Recreate: first kill all the old Pods, then bring up the new ones. That's guaranteed downtime on every deploy, and in production it's almost never done that way.

Why readiness is critical here. The rollout waits until the new Pod passes its readiness probe, and only then removes the old one. If there's no readiness probe, Kubernetes considers the Pod ready as soon as the container starts — and begins sending traffic to it even before myapp is actually ready to respond. The result: errors on every rollout. In other words, without readiness, zero-downtime is fundamentally impossible.

For a zero-downtime rollout for myapp:

yaml

spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  minReadySeconds: 5

maxUnavailable: 0 guarantees that the number of working Pods won't drop during the rollout; maxSurge: 1 brings up one new Pod, waits for its readiness, takes down one old one — and so on in a loop. minReadySeconds is an extra buffer: a Pod is considered available only N seconds after it became ready (protection against Pods that are "ready" and immediately crash).
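To check what a given pair of settings allows, the following sketch computes the upper bound on total Pods and the lower bound on available Pods during a rollout. rollout_bounds is a hypothetical helper; it follows the documented rounding (percentages round up for maxSurge and down for maxUnavailable) and does not model the edge case where both values resolve to zero.

```python
import math


def _resolve(value, replicas: int, round_up: bool) -> int:
    """Turn an absolute number or a percentage string into a Pod count."""
    if isinstance(value, str) and value.endswith("%"):
        exact = replicas * int(value[:-1]) / 100
        return math.ceil(exact) if round_up else math.floor(exact)
    return int(value)


def rollout_bounds(replicas: int, max_surge="25%", max_unavailable="25%") -> tuple[int, int]:
    """Return (max total Pods, min available Pods) during a rolling update."""
    surge = _resolve(max_surge, replicas, round_up=True)               # rounds up
    unavailable = _resolve(max_unavailable, replicas, round_up=False)  # rounds down
    return replicas + surge, replicas - unavailable
```

For the manifest above, rollout_bounds(2, 1, 0) gives at most 3 Pods and never fewer than 2 available. With 10 replicas and the 25% defaults the floor drops to 8, which is why an explicit maxUnavailable: 0 is worth spelling out.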

To check the rollout on your local setup:

bash

kubectl rollout status deployment/myapp -n myapp
kubectl get endpointslices -n myapp -w   # you can see Pods come in/out

Do this in your own k3d under load (even a simple while true; do curl ...): if 5xx errors appear at the moment of the rollout, then something in the readiness + preStop sleep + maxUnavailable: 0 combination is missing.
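If you would rather count the failures than eyeball curl output, a small poller like the one below can run in a second terminal during the rollout. It is a sketch using only the standard library; poll and failed are hypothetical names, and the URL you pass in depends on how your Ingress is exposed locally.

```python
import time
import urllib.error
import urllib.request
from collections import Counter


def poll(url: str, count: int, interval: float = 0.1, timeout: float = 2.0) -> Counter:
    """Send `count` GET requests to `url` and tally results by status code or "error"."""
    outcomes = Counter()
    for _ in range(count):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                outcomes[response.status] += 1
        except urllib.error.HTTPError as exc:
            outcomes[exc.code] += 1      # 4xx/5xx responses land here
        except OSError:
            outcomes["error"] += 1       # refused, reset or timed-out connection
        time.sleep(interval)
    return outcomes


def failed(outcomes: Counter) -> int:
    """Count 5xx responses and connection-level errors."""
    return sum(n for key, n in outcomes.items() if key == "error" or key >= 500)
```

Start it with something like poll("http://localhost:8080/healthz", count=600) (a placeholder address, adjust to your setup), trigger the rollout, and check that failed(...) stays at 0.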

Production-readiness checklist

Run myapp through this list — all of it is verifiable locally in k3d, even before production:

requests and limits are set on every container; the numbers were chosen with kubectl top, not by eye.
memory limit with headroom — missing the mark = OOMKilled (137). Remember: CPU throttles, memory kills.
The Pod is not BestEffort — there are at least requests (otherwise it's evicted first).
livenessProbe — simple, with no trips to the DB; it restarts a hung process.
readinessProbe — the gate for traffic and for the rolling update; this is the right place for dependency checks.
startupProbe — if the application starts slowly (instead of a large initialDelaySeconds).
replicas > 1 — otherwise any restart = downtime.
The application catches SIGTERM and shuts down gracefully; uvicorn is PID 1 (the exec form of CMD, no shell wrapper).
terminationGracePeriodSeconds + preStop sleep against the race with endpoints; preStop smaller than the grace period.
RollingUpdate with maxUnavailable: 0 and maxSurge: 1 for zero-downtime.
Parity with production: the same manifests (Helm/Kustomize), the same major version of Kubernetes locally and in production — packaging manifests is covered in the chapter on preparing for deployment and CI.
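Part of this list can be checked mechanically. The sketch below walks a Deployment manifest that has already been loaded into a Python dict and reports which items are missing. check_deployment is a hypothetical helper that covers only the structural items; it cannot tell whether the numbers are sensible or whether the application really handles SIGTERM.

```python
def check_deployment(deployment: dict) -> list[str]:
    """Return the checklist items a Deployment manifest (as a dict) still misses."""
    spec = deployment.get("spec", {})
    pod = spec.get("template", {}).get("spec", {})
    problems = []
    if spec.get("replicas", 1) < 2:
        problems.append("replicas should be greater than 1")
    rolling = spec.get("strategy", {}).get("rollingUpdate", {})
    if rolling.get("maxUnavailable") != 0:
        problems.append("maxUnavailable should be 0")
    for container in pod.get("containers", []):
        name = container.get("name", "?")
        resources = container.get("resources", {})
        for section in ("requests", "limits"):
            if not resources.get(section):
                problems.append(f"{name}: resources.{section} missing")
        for probe in ("livenessProbe", "readinessProbe"):
            if probe not in container:
                problems.append(f"{name}: {probe} missing")
        if "preStop" not in container.get("lifecycle", {}):
            problems.append(f"{name}: preStop hook missing")
    return problems
```

An empty list means the structural items are in place; the behavioral ones still need the rollout-under-load test described earlier.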

If every item is checked off, myapp no longer "works on my laptop" — it "survives production." And, importantly, you verified that ahead of time, on a cheap local cluster.

Sources