Kubernetes Health Probes
The container-level health-checking system that lets Kubernetes decide when to restart a container, when to stop sending it traffic, and when to give a slow-starter time to initialize. A core Workloads & Troubleshooting topic on the CKA exam. Synthesized from CKA Day 18 — Kubernetes Health Probes Explained.
The Problem: Kubernetes Is Blind Without Probes
By default, Kubernetes only knows whether a container’s main process is running. It has no insight into:
- Whether the application has finished initializing (database connections, cache warm-up)
- Whether the application is functionally healthy (responding to requests, not deadlocked)
- Whether the application is in a degraded state that warrants a restart
Probes are user-defined health checks that run inside or against containers. The kubelet on each node executes them and reacts according to their results.
The Three Probe Types
| Probe | Question It Answers | Action on Failure | Scope |
|---|---|---|---|
| Liveness | Is the container alive and should keep running? | Restart the container | Container |
| Readiness | Is the container ready to serve traffic? | Remove from Service Endpoints | Pod + Service |
| Startup | Has a slow-starting container finished booting? | Restart the container; liveness and readiness are held off until it succeeds | Container (guard) |
Golden Rule: Liveness protects the container (kill & restart), Readiness protects the Service (stop routing traffic), Startup protects slow starters (prevent premature death).
Probe Mechanisms
Kubernetes can check health in four ways. The mechanism is declared under the probe block (livenessProbe, readinessProbe, or startupProbe).
1. HTTP GET Probe
Sends an HTTP GET request to a specified path and port. Any response code between 200 and 399 is considered success.

    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
        httpHeaders:
        - name: Custom-Header
          value: probe
      initialDelaySeconds: 30
      periodSeconds: 10

| Field | Description |
|---|---|
| path | URL path to request |
| port | Container port (name or number) |
| httpHeaders | Optional headers to send |
| scheme | HTTP (default) or HTTPS |
Best for: Web applications, REST APIs, microservices with a dedicated health endpoint.
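The `scheme` field and the named-port form of `port` can be combined in one probe. A minimal sketch, assuming a hypothetical container that serves TLS on port 8443 and exposes it under the port name `https` (both the name and the number are illustrative, not from the source):

```yaml
# Container-level fragment; the port name "https" and number 8443 are assumptions.
ports:
- name: https
  containerPort: 8443
readinessProbe:
  httpGet:
    path: /healthz
    port: https        # refers to the named containerPort above
    scheme: HTTPS      # the kubelet skips certificate verification for probe requests
  periodSeconds: 10
```

Referring to the port by name keeps the probe valid if the port number changes later.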
2. TCP Socket Probe
Attempts to open a TCP connection to the specified port. Success = connection established.

    readinessProbe:
      tcpSocket:
        port: 5432
      initialDelaySeconds: 5
      periodSeconds: 5

Best for: Databases, caches, message queues, and any service where an open port implies readiness.
3. Exec Probe
Runs a command inside the container. Exit code 0 = success; any non-zero exit code = failure.

    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5

Best for: Legacy applications, custom validation logic, checking PID files, or verifying file existence.
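To watch an exec probe fail on purpose, the container has to create and later remove the file being checked. The sketch below follows the well-known busybox pattern; the Pod name and the sleep durations are illustrative assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: exec-liveness-demo   # hypothetical name
spec:
  containers:
  - name: app
    image: busybox
    args:
    - /bin/sh
    - -c
    # Healthy for 30s, then the file disappears and `cat` starts returning non-zero.
    - touch /tmp/healthy; sleep 30; rm -f /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5
```

Once the file is removed, the probe should fail on consecutive attempts and the RESTARTS count should begin to climb shortly afterwards.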
4. gRPC Probe
Native gRPC health checking using the standard gRPC Health Checking Protocol. Introduced as alpha in Kubernetes 1.23, enabled by default as beta in 1.24, and stable since 1.27.

    livenessProbe:
      grpc:
        port: 50051
      initialDelaySeconds: 10

Best for: gRPC-first microservices where an HTTP endpoint would be artificial.
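The `grpc` block also accepts an optional `service` field, which is passed as the service name in the health check request. A small sketch, where the service name `liveness` is a hypothetical value the application would have to recognize:

```yaml
livenessProbe:
  grpc:
    port: 50051          # must be a number; named ports are not supported for gRPC probes
    service: liveness    # hypothetical name sent in the health check request
  initialDelaySeconds: 10
  periodSeconds: 10
```

This lets one gRPC server answer liveness and readiness checks differently by using a different `service` value in each probe. The container must implement the gRPC health checking service for the probe to succeed.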
Probe Parameters (Timing & Thresholds)
Every probe shares these fields. Understanding their interaction is critical for both production tuning and the CKA exam.
| Parameter | Default | Description |
|---|---|---|
| initialDelaySeconds | 0 | Seconds to wait after container start before the first probe |
| periodSeconds | 10 | How often to run the probe, in seconds |
| timeoutSeconds | 1 | Seconds to wait for a response before counting it as failed |
| successThreshold | 1 | Consecutive successes required to transition from Failure → Success (must be 1 for liveness and startup probes) |
| failureThreshold | 3 | Consecutive failures required to transition from Success → Failure |
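For reference, here is a probe with every timing field written out at its default value. This is only a sketch to show where the fields sit relative to the mechanism block; the `/ready` path and port 8080 are assumptions:

```yaml
readinessProbe:
  httpGet:                 # mechanism block
    path: /ready
    port: 8080
  initialDelaySeconds: 0   # default: start probing as soon as the container starts
  periodSeconds: 10        # default
  timeoutSeconds: 1        # default
  successThreshold: 1      # default
  failureThreshold: 3      # default
```

Omitting all five timing fields yields the same behavior, so in practice only the values that differ from the defaults need to be written.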
Time-to-Action Math
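For a container that never passes its probe, a rough estimate of the time from container start until the probe triggers its action is:

    total_wait ≈ initialDelaySeconds + (periodSeconds × failureThreshold)

This is an approximation: the first probe runs when initialDelaySeconds expires, so the final failure can land up to one period earlier, and each failed attempt can take up to timeoutSeconds to be counted.
Example:

    initialDelaySeconds: 30
    periodSeconds: 10
    failureThreshold: 3

Total time before liveness restart ≈ 30 + (10 × 3) = 60 seconds
Exam Trap: Many candidates assume `failureThreshold: 3` means 3 seconds. It means 3 consecutive failed probes, spaced periodSeconds apart.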
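Applying the formula above to a different set of values may help it stick. The numbers below are arbitrary illustrations, and the path and port are assumptions:

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
  failureThreshold: 4
  # Estimate if the app never becomes healthy:
  #   15 + (20 x 4) = 95 seconds before the kubelet restarts the container.
  # failureThreshold contributes 4 probe attempts (80s here), not 4 seconds.
```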
Liveness Probe Deep Dive
Purpose
Detect when a container has entered a broken but running state — infinite loops, deadlocks, memory leaks that haven’t caused an OOMKill, or thread starvation.
What Happens on Failure?
- kubelet marks the container as failed
- kubelet kills the container process (SIGTERM, then SIGKILL after grace period)
- kubelet creates a new container from the same image
- The Pod stays on the same node and keeps its IP, because only the container is restarted, not the Pod
- Restart count increments (`kubectl get pod` shows it in the RESTARTS column)
YAML Example

    apiVersion: v1
    kind: Pod
    metadata:
      name: liveness-demo
    spec:
      containers:
      - name: app
        image: myapp:v1
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3

Danger: Liveness Without Startup Probe
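If a container takes 2 minutes to start but liveness begins after 10 seconds with failureThreshold: 3 and the default periodSeconds of 10, the kubelet kills it after roughly 30 to 40 seconds, long before it can become healthy. The container then restarts and is killed again, which is the classic “crash loop” caused by misconfigured probes.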
Fix: Add a startup probe with a generous failureThreshold.
Readiness Probe Deep Dive
Purpose
Determine whether a container is ready to accept traffic. An application may be running but not yet usable (e.g., loading configuration, warming caches, waiting for a leader election).
What Happens on Failure?
- kubelet marks the Pod as `NotReady`
- The Pod’s endpoint is marked not ready in the Service’s EndpointSlice (and its IP moves to notReadyAddresses in the legacy Endpoints object), so it stops being a routing target
- kube-proxy stops routing new traffic to this Pod
- Existing connections are NOT terminated — only new requests are affected
- Once readiness succeeds again, the IP is re-added automatically
YAML Example

    apiVersion: v1
    kind: Pod
    metadata:
      name: readiness-demo
    spec:
      containers:
      - name: app
        image: myapp:v1
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1
          failureThreshold: 3

Readiness and Deployments
During a rolling update:
- New Pods must pass readiness before the Deployment counts them as “available”
- Old Pods are scaled down only as new Pods become ready, within the limits set by maxSurge and maxUnavailable
- If readiness never succeeds, the rollout stalls — this is a common deployment failure mode
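The interaction between readiness and a rolling update can be sketched with a Deployment manifest. The name, image tag, label, and numeric values below are illustrative assumptions; the fields themselves are standard Deployment spec fields:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                      # hypothetical name
spec:
  replicas: 3
  progressDeadlineSeconds: 600   # default; after this the rollout is reported as not progressing
  minReadySeconds: 5             # a new Pod must stay ready this long to count as available
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                # at most one extra Pod above replicas during the update
      maxUnavailable: 0          # never drop below 3 available Pods
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: app
        image: myapp:v2          # hypothetical image tag
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          periodSeconds: 5
          failureThreshold: 3
```

With `maxUnavailable: 0`, an old Pod is removed only after a replacement has passed its readiness probe, so a new version that never becomes ready leaves the old Pods serving traffic while the rollout stalls.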
Readiness and Autoscaling
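For CPU-based scaling, the HPA sets aside Pods that are not yet ready when it computes average utilization, so a Pod that is Running but NotReady does not contribute to the metric. Because such a Pod also receives no Service traffic, the ready Pods carry the full load and report higher utilization, which can lead to scaling that looks unexpected. Source: CKA Day 17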
Startup Probe Deep Dive
Purpose
Give slow-starting containers (JVM apps, ML model loading, large dependency downloads) enough time to initialize without being killed by aggressive liveness checks.
How It Works
- While the startup probe is running, liveness and readiness probes are disabled
- Once the startup probe succeeds, liveness and readiness begin their normal cycles
- If the startup probe fails `failureThreshold` times in a row, the container is killed and restarted according to the Pod’s restartPolicy
- If no startup probe is defined, liveness and readiness begin as soon as the container starts, after their own initialDelaySeconds
YAML Example

    apiVersion: v1
    kind: Pod
    metadata:
      name: slow-start
    spec:
      containers:
      - name: app
        image: large-java-app:v1
        startupProbe:
          httpGet:
            path: /healthz
            port: 8080
          failureThreshold: 30
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 10
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          periodSeconds: 5

Time budget: 30 × 10s = 300s (5 minutes) to start. After that, liveness checks every 10s and readiness every 5s.
Full Pod Example: All Three Probes

    apiVersion: v1
    kind: Pod
    metadata:
      name: probes-demo
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx:alpine
        ports:
        - containerPort: 80
        startupProbe:
          httpGet:
            path: /
            port: 80
          failureThreshold: 30
          periodSeconds: 10
        # Stock nginx serves / but returns 404 for /healthz and /ready,
        # so all three probes target / here. Use dedicated endpoints in a real app.
        livenessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 3
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1
          failureThreshold: 3

Probes in the Kubernetes Control Loop
Probes are not isolated features — they participate in the cluster-wide state machine:

    ┌─────────────┐     ┌─────────────┐     ┌───────────────┐
    │   kubelet   │────▶│ API Server  │────▶│ EndpointSlice │
    │   (runs     │     │  (stores    │     │  Controller   │
    │   probes)   │     │   status)   │     │               │
    └─────────────┘     └─────────────┘     └───────┬───────┘
                                                    │
                               ┌────────────────────┘
                               ▼
                        ┌─────────────┐
                        │ kube-proxy  │
                        │ (programs   │
                        │  iptables)  │
                        └─────────────┘

- kubelet runs probes on the node
- kubelet updates the Pod’s `status.containerStatuses` via the API Server
- EndpointSlice controller watches Pod readiness; ready Pods are added to EndpointSlices
- kube-proxy reads EndpointSlices and updates `iptables`/`ipvs` rules
- Deployment controller counts available replicas based on readiness state
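To see this loop in action, the probed Pod needs a Service in front of it. A minimal sketch, assuming a hypothetical Service named `web` that selects the `app: web` label used by the probes-demo Pod above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web          # hypothetical name
spec:
  selector:
    app: web         # matches the label on the probes-demo Pod
  ports:
  - port: 80
    targetPort: 80
```

While the Pod's readiness probe passes, its IP should appear in `kubectl get endpoints web`; when readiness fails, the IP should drop out of that list without the Pod being restarted.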
Troubleshooting Matrix
| Symptom | Likely Cause | Diagnostic Command |
|---|---|---|
| Pod restarts every 60s | Liveness probe too aggressive; app needs more initialDelaySeconds or a startup probe | kubectl describe pod <name> → look at Events |
| Pod Running but not receiving traffic | Readiness probe failing; app not fully initialized | kubectl get endpoints <svc> → check if Pod IP is listed |
| Rolling update stuck | New Pods never become ready | kubectl rollout status deployment/<name> |
| HPA scaling unexpectedly high | NotReady Pods not counted; utilization calculated on fewer replicas | kubectl get hpa → check CURRENT vs TARGET |
| CrashLoopBackOff | Liveness or startup probe failing repeatedly; or app actually crashing | kubectl logs --previous <pod> |
| Probe works locally but fails in cluster | Path/port mismatch; container listens on 127.0.0.1 instead of 0.0.0.0 | kubectl exec -it <pod> -- curl localhost:8080/healthz |
CKA Speed Patterns
- Write probes from memory: You will see YAML-writing questions. Memorize the structure: `probeType: { mechanism: { ... }, timingFields }`
- Check events first: `kubectl describe pod <name> | grep -i probe`
- Endpoint check: `kubectl get endpoints <svc>` shows whether readiness is working
- Restart count: `kubectl get pod <name>` → `RESTARTS` column tells you if liveness is firing
- No imperative probe support: `kubectl run` cannot add probes. Use YAML manifests or `kubectl create` with `--dry-run=client -o yaml` and edit.
Production Best Practices
| Practice | Rationale |
|---|---|
| Separate /healthz and /ready | Liveness checks “not deadlocked”; readiness checks “dependencies up”. They often test different things. |
| Always use startup probes for slow apps | JVM, .NET, ML models — anything with >30s startup time. |
| Keep liveness more conservative than readiness | Liveness restarts the container, so it should fire only on real, sustained failure; brief dependency hiccups are better handled by readiness, which only pauses traffic. |
| Don’t probe external dependencies in liveness | If a database is down, you don’t want to restart every app Pod. Keep liveness local. |
| Set timeoutSeconds realistically | Default 1s is too aggressive for remote endpoints or busy containers. |
| Log probe traffic separately | Exclude health-check endpoints from application request logs to reduce noise. |
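Several of these practices can be combined in one container spec. The fragment below is a sketch only; the paths, port, and numeric values are assumptions to adapt, not recommendations from the source:

```yaml
# Container-level fragment; paths, port, and numbers are assumptions.
livenessProbe:
  httpGet:
    path: /healthz     # local check only: process responsive, no dependency calls
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 3    # above the 1s default to tolerate a busy container
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready       # may check dependencies such as the database
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
```

Keeping the two endpoints separate means a database outage only fails `/ready`, which pauses traffic to the Pod, while `/healthz` keeps passing and avoids a wave of needless restarts.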
Related Pages
- Pod Fundamentals — The unit that probes inspect
- Deployment, ReplicaSet & Replication Controller — Rolling updates depend on readiness
- Kubernetes Services — Readiness controls EndpointSlice membership
- Kubernetes Architecture — How kubelet, API Server, and kube-proxy coordinate
- Kubernetes Autoscaling — HPA counts ready replicas
- Kubernetes Resource Requests and Limits — Resource context for probe behavior
- CKA Certification — Exam domains where probes appear
- CKA Study Roadmap — Day 18: Health Probes
- CKA Day 18 — Kubernetes Health Probes Explained — Source video
Tags: kubernetes health-probes liveness readiness startup kubelet cka devops production troubleshooting