Kubernetes Health Probes

The container-level health-checking system that lets Kubernetes decide when to restart a container, when to stop sending it traffic, and when to give a slow-starter time to initialize. A core Workloads & Troubleshooting topic on the CKA exam. Synthesized from CKA Day 18 — Kubernetes Health Probes Explained.

The Problem: Kubernetes Is Blind Without Probes

By default, Kubernetes only knows whether a container’s main process is running. It has no insight into:

  • Whether the application has finished initializing (database connections, cache warm-up)
  • Whether the application is functionally healthy (responding to requests, not deadlocked)
  • Whether the application is in a degraded state that warrants a restart

Probes are user-defined health checks that run inside or against containers. The kubelet on each node executes them and reacts according to their results.

The Three Probe Types

| Probe | Question It Answers | Action on Failure | Scope |
| --- | --- | --- | --- |
| Liveness | Is the container alive and should keep running? | Restart the container | Container |
| Readiness | Is the container ready to serve traffic? | Remove from Service Endpoints | Pod + Service |
| Startup | Has a slow-starting container finished booting? | Restart the container; liveness and readiness stay disabled until it succeeds | Container (guard) |

Golden Rule: Liveness protects the container (kill & restart), Readiness protects the Service (stop routing traffic), Startup protects slow starters (prevent premature death).


Probe Mechanisms

Kubernetes can check health in four ways. The mechanism is declared under the probe block (livenessProbe, readinessProbe, or startupProbe).

1. HTTP GET Probe

Sends an HTTP GET request to a specified path and port. Any response code between 200 and 399 is considered success.

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
    httpHeaders:
      - name: Custom-Header
        value: probe
  initialDelaySeconds: 30
  periodSeconds: 10

| Field | Description |
| --- | --- |
| path | URL path to request |
| port | Container port (name or number) |
| httpHeaders | Optional headers to send |
| scheme | HTTP (default) or HTTPS |

Best for: Web applications, REST APIs, microservices with a dedicated health endpoint.
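
The example above uses a numeric port and the default scheme. As a minimal sketch, assuming a container that serves TLS on a port named https (the container name, image, port name, and port number are all placeholders), the probe can reference the port by name and switch to HTTPS:

```yaml
# Container fragment; belongs under spec.containers in a Pod spec.
# Name, image, and port values are hypothetical.
- name: app
  image: myapp:v1
  ports:
    - name: https            # named port, referenced by the probe below
      containerPort: 8443
  livenessProbe:
    httpGet:
      path: /healthz
      port: https            # resolved to containerPort 8443
      scheme: HTTPS          # default is HTTP
    periodSeconds: 10
```

With scheme: HTTPS the kubelet does not verify the server certificate, so a self-signed certificate is enough for the probe to succeed.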

2. TCP Socket Probe

Attempts to open a TCP connection to the specified port. Success = connection established.

readinessProbe:
  tcpSocket:
    port: 5432
  initialDelaySeconds: 5
  periodSeconds: 5

Best for: Databases, caches, message queues, and any service where an open port implies readiness.

3. Exec Probe

Runs a command inside the container. Exit code 0 = success; any non-zero exit code = failure.

livenessProbe:
  exec:
    command:
      - cat
      - /tmp/healthy
  initialDelaySeconds: 5
  periodSeconds: 5

Best for: Legacy applications, custom validation logic, checking PID files, or verifying file existence.
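
The command list is executed directly, not through a shell, so operators such as && or pipes only work when a shell is invoked explicitly. A minimal sketch, assuming the image ships sh and that the two file paths (both hypothetical) are what the application maintains while healthy:

```yaml
livenessProbe:
  exec:
    command:
      - sh
      - -c
      # Succeeds only if the marker file exists AND the PID file is non-empty.
      - test -f /tmp/healthy && test -s /var/run/app.pid
  initialDelaySeconds: 5
  periodSeconds: 5
```

Each run starts a new process inside the container, so it is worth keeping the command cheap.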

4. gRPC Probe

Native gRPC health checking using the standard gRPC Health Checking Protocol. Introduced as alpha in Kubernetes 1.23, enabled by default as beta from 1.24, and stable since 1.27.

livenessProbe:
  grpc:
    port: 50051
  initialDelaySeconds: 10

Best for: gRPC-first microservices where an HTTP endpoint would be artificial.
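
The grpc block also accepts an optional service field, whose value is sent as the service name in the health-check request. A sketch, assuming the server registers a health service under the hypothetical name liveness:

```yaml
livenessProbe:
  grpc:
    port: 50051          # must be a number; gRPC probes take no named ports
    service: liveness    # hypothetical name; omit to query overall server health
  initialDelaySeconds: 10
  periodSeconds: 10
```

Using different service names for liveness and readiness lets one gRPC server answer both kinds of probe on the same port.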


Probe Parameters (Timing & Thresholds)

Every probe shares these fields. Understanding their interaction is critical for both production tuning and the CKA exam.

| Parameter | Default | Description |
| --- | --- | --- |
| initialDelaySeconds | 0 | Seconds to wait after container start before the first probe |
| periodSeconds | 10 | How often to run the probe |
| timeoutSeconds | 1 | Seconds to wait for a response before counting it as failed |
| successThreshold | 1 | Consecutive successes required to transition from Failure → Success (must be 1 for liveness and startup probes) |
| failureThreshold | 3 | Consecutive failures required to transition from Success → Failure |

Time-to-Action Math

A rough upper bound on the time from container start until a probe triggers its action, assuming every probe fails:

total_wait ≈ initialDelaySeconds + (periodSeconds × failureThreshold)

Example:

  • initialDelaySeconds: 30
  • periodSeconds: 10
  • failureThreshold: 3

Total time before liveness restart = 30 + (10 × 3) = 60 seconds

Exam Trap: Many candidates assume failureThreshold: 3 means 3 seconds. It means 3 consecutive failed probes, each one periodSeconds apart.
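
To make the arithmetic concrete, the same three values are annotated below in a probe block. The path and port are placeholders, and the timeline in the comments is an approximation: the kubelet's own scheduling decides exactly when the first probe fires, so the third failure lands on the order of 50 to 60 seconds after container start.

```yaml
livenessProbe:
  httpGet:
    path: /healthz           # placeholder endpoint
    port: 8080
  initialDelaySeconds: 30    # no probe result counts during the first 30s
  periodSeconds: 10          # afterwards, one probe about every 10s
  timeoutSeconds: 1          # a probe with no answer within 1s counts as failed
  failureThreshold: 3        # 3 consecutive failures trigger the restart
# Approximate timeline if /healthz never succeeds:
#   ~30-40s  probe 1 fails
#   +10s     probe 2 fails
#   +10s     probe 3 fails -> threshold reached, container is restarted
```

A larger timeoutSeconds stretches the timeline further, because a hung probe is only counted as failed once its timeout expires.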


Liveness Probe Deep Dive

Purpose

Detect when a container has entered a broken but running state — infinite loops, deadlocks, memory leaks that haven’t caused an OOMKill, or thread starvation.

What Happens on Failure?

  1. kubelet marks the container as failed
  2. kubelet kills the container process (SIGTERM, then SIGKILL after grace period)
  3. kubelet creates a new container from the same image
  4. The Pod stays on the same node and keeps its IP; only the container is replaced, not the Pod
  5. Restart count increments (kubectl get pod shows RESTARTS)

YAML Example

apiVersion: v1
kind: Pod
metadata:
  name: liveness-demo
spec:
  containers:
  - name: app
    image: myapp:v1
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3

Danger: Liveness Without Startup Probe

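If a container takes 2 minutes to start but liveness begins after 10 seconds with failureThreshold: 3 (and the default periodSeconds of 10), the third failure arrives within about 40 seconds and the container is killed long before it can become healthy. After the restart the same thing happens again: the classic “crash loop” caused by misconfigured probes.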

Fix: Add a startup probe with a generous failureThreshold.


Readiness Probe Deep Dive

Purpose

Determine whether a container is ready to accept traffic. An application may be running but not yet usable (e.g., loading configuration, warming caches, waiting for a leader election).

What Happens on Failure?

  1. kubelet marks the Pod as NotReady
  2. The Pod’s IP is removed from the Service’s EndpointSlice (and Endpoints object)
  3. kube-proxy stops routing new traffic to this Pod
  4. Existing connections are NOT terminated — only new requests are affected
  5. Once readiness succeeds again, the IP is re-added automatically

YAML Example

apiVersion: v1
kind: Pod
metadata:
  name: readiness-demo
spec:
  containers:
  - name: app
    image: myapp:v1
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      failureThreshold: 3

Readiness and Deployments

During a rolling update:

  • New Pods must pass readiness before the Deployment counts them as “available”
  • Old Pods are scaled down only within the limits set by maxUnavailable and maxSurge, so their replacement waits on new Pods becoming ready
  • If readiness never succeeds, the rollout stalls — this is a common deployment failure mode
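
A sketch of how these rollout rules map onto a Deployment spec. The name, image tag, and all numeric values are assumptions chosen for illustration, not recommendations:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                        # hypothetical name
spec:
  replicas: 3
  minReadySeconds: 10              # a Pod must stay ready this long to count as available
  progressDeadlineSeconds: 300     # rollout is reported as not progressing after 5 minutes
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                  # at most one extra Pod during the update
      maxUnavailable: 0            # never drop below 3 available Pods
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: app
        image: myapp:v2            # hypothetical new version
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          periodSeconds: 5
          failureThreshold: 3
```

With maxUnavailable: 0, an old Pod is removed only after a new Pod has passed readiness, so a readiness probe that never succeeds leaves the old version serving while the rollout stalls.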

Readiness and Autoscaling

HPA counts only ready replicas when calculating current utilization. A Pod that is Running but NotReady does not count toward the replica target, which can cause HPA to scale up unnecessarily. Source: CKA Day 17


Startup Probe Deep Dive

Purpose

Give slow-starting containers (JVM apps, ML model loading, large dependency downloads) enough time to initialize without being killed by aggressive liveness checks.

How It Works

  • While the startup probe is running, liveness and readiness probes are disabled
  • Once the startup probe succeeds, liveness and readiness begin their normal cycles
  • If the startup probe fails failureThreshold times in a row, the container is killed and restarted according to the Pod’s restartPolicy
  • If no startup probe is defined, liveness and readiness begin as soon as the container starts, subject only to their own initialDelaySeconds

YAML Example

apiVersion: v1
kind: Pod
metadata:
  name: slow-start
spec:
  containers:
  - name: app
    image: large-java-app:v1
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5

Time budget: 30 × 10s = 300s (5 minutes) to start. After that, liveness checks every 10s and readiness every 5s.


Full Pod Example: All Three Probes

apiVersion: v1
kind: Pod
metadata:
  name: probes-demo
  labels:
    app: web
spec:
  containers:
  - name: nginx
    image: nginx:alpine
    ports:
    - containerPort: 80
    startupProbe:
      httpGet:
        path: /
        port: 80
      failureThreshold: 30
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /          # stock nginx serves only /; /healthz would return 404 and trigger restarts
        port: 80
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 3
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /          # use /ready only if the server configuration defines it
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      failureThreshold: 3
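
To observe the readiness probe's effect on traffic, the Pod needs a Service in front of it. A minimal sketch, assuming a Service named web-svc (a hypothetical name) that selects the app: web label from the Pod above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-svc          # hypothetical name
spec:
  selector:
    app: web             # matches the label on probes-demo
  ports:
  - port: 80             # port exposed by the Service
    targetPort: 80       # containerPort on the Pod
```

Once both objects exist, kubectl get endpoints web-svc lists the Pod's IP only while its readiness probe is passing.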

Probes in the Kubernetes Control Loop

Probes are not isolated features — they participate in the cluster-wide state machine:

┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│    kubelet    │────▶│  API Server   │────▶│ EndpointSlice │
│ (runs probes) │     │(stores status)│     │  Controller   │
└───────────────┘     └───────────────┘     └───────┬───────┘
                                                    │
                              ┌─────────────────────┘
                              ▼
                      ┌───────────────┐
                      │  kube-proxy   │
                      │  (programs    │
                      │   iptables)   │
                      └───────────────┘

  1. kubelet runs probes on the node
  2. kubelet updates the Pod’s status (containerStatuses and the Ready condition) via the API Server
  3. EndpointSlice controller watches Pod readiness; ready Pods are added to EndpointSlices
  4. kube-proxy reads EndpointSlices and updates iptables/ipvs rules
  5. Deployment controller counts available replicas based on readiness state
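
The status that flows through this loop can be pictured as follows. This is a hand-written excerpt in the shape of kubectl get pod <name> -o yaml, with illustrative values rather than captured output; the container name app is a placeholder:

```yaml
# Illustrative Pod status for a container that is running but failing readiness.
status:
  phase: Running
  conditions:
  - type: ContainersReady
    status: "False"          # at least one container is not ready
  - type: Ready
    status: "False"          # the Pod's endpoint is treated as not ready
  containerStatuses:
  - name: app
    ready: false             # latest readiness probe result
    started: true            # startup probe (if any) has passed
    restartCount: 2          # incremented on every container restart, including liveness kills
```

The Ready condition is what the EndpointSlice and Deployment controllers act on; restartCount is the value shown in the RESTARTS column of kubectl get pod.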

Troubleshooting Matrix

| Symptom | Likely Cause | Diagnostic Command |
| --- | --- | --- |
| Pod restarts every 60s | Liveness probe too aggressive; app needs more initialDelaySeconds or a startup probe | kubectl describe pod <name> → look at Events |
| Pod Running but not receiving traffic | Readiness probe failing; app not fully initialized | kubectl get endpoints <svc> → check if Pod IP is listed |
| Rolling update stuck | New Pods never become ready | kubectl rollout status deployment/<name> |
| HPA scaling unexpectedly high | NotReady Pods not counted; utilization calculated on fewer replicas | kubectl get hpa → check CURRENT vs TARGET |
| CrashLoopBackOff | Liveness or startup probe failing repeatedly; or app actually crashing | kubectl logs --previous <pod> |
| Probe works locally but fails in cluster | Path/port mismatch; container listens on 127.0.0.1 instead of 0.0.0.0 | kubectl exec -it <pod> -- curl localhost:8080/healthz |

CKA Speed Patterns

  • Write probes from memory: You will see YAML-writing questions. Memorize the structure: probeType: { mechanism: { ... }, timingFields }
  • Check events first: kubectl describe pod <name> | grep -i probe
  • Endpoint check: kubectl get endpoints <svc> shows whether readiness is working
  • Restart count: kubectl get pod <name> → the RESTARTS column tells you if liveness is firing
  • No imperative probe support: kubectl run has no flag for adding probes. Generate a manifest with kubectl run or kubectl create plus --dry-run=client -o yaml, then add the probe block by hand.

Production Best Practices

| Practice | Rationale |
| --- | --- |
| Separate /healthz and /ready | Liveness checks “not deadlocked”; readiness checks “dependencies up”. They often test different things. |
| Always use startup probes for slow apps | JVM, .NET, ML models: anything with >30s startup time. |
| Make liveness more tolerant than readiness | A restart is more disruptive than a pause in traffic. Liveness should fire only on real failure (for example, with a higher failureThreshold), while readiness can react quickly to brief dependency hiccups. |
| Don’t probe external dependencies in liveness | If a database is down, you don’t want to restart every app Pod. Keep liveness local. |
| Set timeoutSeconds realistically | Default 1s is too aggressive for remote endpoints or busy containers. |
| Log probe traffic separately | Exclude health-check endpoints from application request logs to reduce noise. |
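
One way several of these practices might combine in a single container is sketched below. The endpoints, image, and every numeric value are assumptions for illustration; in particular, giving liveness a larger failure budget than readiness is a tuning choice, made here so that a struggling Pod is taken out of rotation well before it is restarted.

```yaml
# Container fragment; belongs under spec.containers in a Pod spec.
- name: app
  image: myapp:v1            # hypothetical image
  livenessProbe:
    httpGet:
      path: /healthz         # local check only: is the process responsive?
      port: 8080
    periodSeconds: 10
    timeoutSeconds: 3        # above the 1s default, for a busy container
    failureThreshold: 6      # roughly 60s of consecutive failures before a restart
  readinessProbe:
    httpGet:
      path: /ready           # may also verify dependencies such as the database
      port: 8080
    periodSeconds: 5
    timeoutSeconds: 3
    failureThreshold: 2      # out of rotation after roughly 10s of failures
```

Because /healthz does not touch external dependencies, a database outage makes the Pod NotReady without causing a restart.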


Tags: kubernetes health-probes liveness readiness startup kubelet cka devops production troubleshooting