
Kafka on Kubernetes: The Hard Parts Nobody Warns You About

An honest account of what Kafka is, why ZooKeeper had to go, and what it actually takes to run a KRaft cluster on Kubernetes without it silently falling apart.

☕ 10 min read  ·  February 18, 2026
Daniel Okoronkwo
Writing about systems, infrastructure, and software

What Kafka Actually Is

Before we talk about running Kafka on Kubernetes and all the ways it will punish you for getting one property wrong, it helps to understand what Kafka actually is and why it exists at all.

Apache Kafka is a distributed event streaming platform. That description sounds like it came from a product page, so here is a more grounded version: Kafka is a durable, ordered, replayable log that multiple services can write to and read from independently, at different speeds, without any of them needing to know the others exist.

The origin story is practical. LinkedIn built Kafka in 2010 because they were drowning in data that needed to move between systems: activity feeds, metrics, notifications. None of the existing tools handled the throughput or durability they needed. Kafka was open-sourced in 2011, donated to Apache in 2012, and by 2014 several of its original creators had left LinkedIn to found Confluent.

Today Kafka sits at the core of some of the most data-intensive systems in the world. Uber uses it to process trip data and dispatch events in real time. Netflix relies on it for operational monitoring and internal pipelines. Financial institutions use Kafka for trade event streaming, fraud detection, and audit logging where every event must be persisted and replayable.

Key intuition: Kafka is not a message broker in the traditional sense. It is a distributed commit log. That difference matters enormously once you start building systems on top of it.

Kafka vs Traditional Message Queues

Most engineers encounter message queues before they encounter Kafka, and the mental model they bring from RabbitMQ or SQS can actively mislead them.

In a traditional queue, a message has a lifecycle: produced, delivered, acknowledged, deleted. The queue is a transient holding area. Once a consumer processes a message, the system forgets it ever existed.

Kafka works differently. Messages are retained for a configurable period regardless of consumption. Consumers track their own offsets, effectively bookmarks into the log. Kafka does not care what consumers do with messages.

This allows multiple independent consumer groups to read the same data at different speeds. It allows replay. It allows new services to catch up on historical data. These properties are why Kafka shows up so often in financial and data-heavy systems.
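For example, two consumer groups can read the same topic independently, each keeping its own offset, so neither affects the other. A quick illustration with the stock console consumer (the topic and group names here are made up):

# Group "analytics" reads the topic from the beginning...
kafka-console-consumer --bootstrap-server localhost:9092 \
  --topic orders --group analytics --from-beginning

# ...while group "billing" reads the same topic at its own pace.
# Neither group's position affects the other; the log itself is untouched.
kafka-console-consumer --bootstrap-server localhost:9092 \
  --topic orders --group billing --from-beginning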

Kafka is not simpler than a queue. It requires more infrastructure, more configuration, and more operational knowledge. But once you need durable event history, multiple consumers, or replayability, traditional queues start feeling like the wrong tool.

ZooKeeper: The Necessary Crutch That Overstayed Its Welcome

Early Kafka needed a place to store cluster metadata: broker registrations, topic configuration, partition leadership, and controller election state. Building a distributed coordination system from scratch was not realistic, so Kafka used Apache ZooKeeper.

ZooKeeper solved a real problem. It provided strong consistency, leader election, and coordination. Kafka brokers registered themselves in ZooKeeper, and the controller used it to manage cluster state.

Over time, problems accumulated.

ZooKeeper was a second distributed system to deploy, secure, monitor, and debug. At scale, it became a performance ceiling. Controller failover required reading all metadata from ZooKeeper, which could take tens of seconds in large clusters. Operational complexity became a barrier.

The honest summary: ZooKeeper solved a real problem in 2010. By 2020 it had become the part of Kafka that nobody enjoyed operating.

KRaft: Kafka Learns to Govern Itself

KRaft (Kafka Raft) is the result of KIP-500. It replaces ZooKeeper with a Raft-based consensus system built directly into Kafka.

In KRaft mode, a subset of Kafka nodes act as controllers. These controllers maintain a metadata log stored in an internal Kafka topic called __cluster_metadata. Raft keeps this log consistent across controller nodes.

If the active controller fails, another controller already has the metadata locally and can take over almost immediately. Failover times drop from tens of seconds to milliseconds.

Kafka nodes can run as brokers only, controllers only, or both. For production, controllers and brokers are often separated. For small or staging clusters, combined mode is simpler and perfectly acceptable.
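In server.properties terms, the choice comes down to a single setting (a sketch; the rest of the configuration is omitted here):

# Combined mode: the node is both a broker and a controller (used later in this article)
process.roles=broker,controller

# Dedicated controller: the node only participates in the metadata quorum
# process.roles=controller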

Important: The cluster ID generated by kafka-storage random-uuid is not the familiar hyphenated UUID format. It is a 22-character, URL-safe base64 encoding of a UUID, and Kafka's tooling expects exactly that shape. Do not hand-roll one unless you know exactly what you are doing.
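Generating one is a single command against any Kafka installation or image; run it once and keep the value somewhere durable:

# Run once, then store the value (e.g. in a ConfigMap, as shown later)
kafka-storage random-uuid
# Example output: 3F148FdeTqGDnpoKGdyX4A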

Why Kubernetes Makes Kafka Hard

Kubernetes is built around stateless workloads. Pods are disposable. Identity does not matter.

Kafka is stateful. Each broker has:

  • a persistent node ID

  • local disk with partition data

  • advertised network identity

This is why Kafka must run as a StatefulSet, not a Deployment. StatefulSets give each pod a stable identity like kafka-0, kafka-1, kafka-2.

Kafka on Kubernetes also requires two services:

  • A headless service for stable DNS identities

  • A ClusterIP service for client traffic

They serve different purposes and are both required.

The Hidden 20% That Makes or Breaks Everything

This is where many guides quietly assume knowledge you might not have. Let’s make it explicit.

  1. publishNotReadyAddresses: The DNS Deadlock

Kubernetes normally publishes pod DNS records only after readiness probes pass. Kafka brokers cannot become ready until they can form a KRaft quorum. Quorum requires DNS. DNS requires readiness.

That is a deadlock.

The fix is:

publishNotReadyAddresses: true

This allows DNS records to exist before readiness succeeds.

  2. podManagementPolicy: Parallel

StatefulSets default to creating pods one at a time. KRaft requires multiple controllers to start together: when the controllers come up, they exchange messages over the network to form a quorum and decide which controller leads the metadata log and which ones follow. Set:
podManagementPolicy: Parallel

Without this, the cluster never forms quorum.

  3. enableServiceLinks: false

Kubernetes injects environment variables for every service. If you have a service named kafka, Kubernetes injects KAFKA_PORT.

The Confluent image interprets this as Kafka configuration and behaves unpredictably.

Disable it:

enableServiceLinks: false

  4. Two Cluster ID Variables

Confluent’s entrypoint requires both:

  • CLUSTER_ID

  • KAFKA_CLUSTER_ID

They must contain the same value.

How All of This Comes Together in Practice

Up to this point, we have talked about KRaft, StatefulSets, headless services, and controller quorum as concepts. Let’s ground them in a concrete configuration.

The following setup describes a three-node Kafka KRaft cluster running in Kubernetes. Every design choice in this configuration directly addresses one of the failure modes discussed earlier.

Step 1: One Cluster ID, Forever

KRaft clusters require a stable cluster ID that never changes across broker restarts (pod restarts, in this setup).

We generate it once and store it in a ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-cluster-config
  namespace: infra
data:
  cluster.id: "3F148FdeTqGDnpoKGdyX4A"

This value is injected into:

  • the init container (for storage formatting)

  • the Kafka container (for runtime validation)

Both CLUSTER_ID and KAFKA_CLUSTER_ID reference the same value to satisfy the Confluent entrypoint and Kafka itself.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
  namespace: infra
spec:
  serviceName: kafka-headless
  podManagementPolicy: Parallel
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      enableServiceLinks: false # prevents k8s injecting KAFKA_PORT and similar vars
      securityContext:
        fsGroup: 1000
      initContainers:
        - name: init-kafka
          image: confluentinc/cp-kafka:8.0.0
          command:
            - /bin/bash
            - -c
            - |
              if [ ! -f /var/lib/kafka/data/meta.properties ]; then
                NODE_ID=$((${HOSTNAME##*-} + 1))
                echo "Generating minimal kafka properties for storage format..."
                cat > /tmp/kafka-format.properties << EOF
              process.roles=broker,controller
              node.id=${NODE_ID}
              controller.quorum.voters=1@kafka-0.kafka-headless.infra.svc.cluster.local:9093,2@kafka-1.kafka-headless.infra.svc.cluster.local:9093,3@kafka-2.kafka-headless.infra.svc.cluster.local:9093
              controller.listener.names=CONTROLLER
              listeners=PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093
              advertised.listeners=PLAINTEXT://${HOSTNAME}.kafka-headless.infra.svc.cluster.local:9092
              log.dirs=/var/lib/kafka/data
              EOF
                
                echo "What's inside"
                cat /tmp/kafka-format.properties

                echo "Formatting KRaft storage with cluster ID: $KAFKA_CLUSTER_ID"
                kafka-storage format \
                  --config /tmp/kafka-format.properties \
                  --cluster-id "$KAFKA_CLUSTER_ID" \
                  --ignore-formatted
              else
                echo "Storage already formatted, skipping..."
              fi
          env:
            - name: KAFKA_CLUSTER_ID
              valueFrom:
                configMapKeyRef:
                  name: kafka-cluster-config
                  key: cluster.id
          volumeMounts:
            - name: kafka-data
              mountPath: /var/lib/kafka/data
          securityContext:
            runAsUser: 1000
      containers:
        - name: kafka
          image: confluentinc/cp-kafka:8.0.0
          ports:
            - containerPort: 9092
            - containerPort: 9093
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: KAFKA_CLUSTER_ID
              valueFrom:
                configMapKeyRef:
                  name: kafka-cluster-config
                  key: cluster.id
            - name: CLUSTER_ID
              valueFrom:
                configMapKeyRef:
                  name: kafka-cluster-config
                  key: cluster.id

            # KRaft
            - name: KAFKA_PROCESS_ROLES
              value: "broker,controller"
            - name: KAFKA_CONTROLLER_QUORUM_VOTERS
              value: "[email protected]:9093,[email protected]:9093,[email protected]:9093"

            # Listeners
            - name: KAFKA_LISTENERS
              value: "PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093"
            - name: KAFKA_LISTENER_SECURITY_PROTOCOL_MAP
              value: "PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT"
            - name: KAFKA_INTER_BROKER_LISTENER_NAME
              value: "PLAINTEXT"
            - name: KAFKA_CONTROLLER_LISTENER_NAMES
              value: "CONTROLLER"

            # Retention — 1 day for cost saving
            - name: KAFKA_LOG_RETENTION_HOURS
              value: "24"
            - name: KAFKA_LOG_RETENTION_BYTES
              value: "1073741824" # 1GB cap per partition as safety net
            - name: KAFKA_LOG_SEGMENT_BYTES
              value: "268435456" # 256MB segments
            - name: KAFKA_LOG_RETENTION_CHECK_INTERVAL_MS
              value: "300000"

            # Replication
            - name: KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR
              value: "3"
            - name: KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR
              value: "3"
            - name: KAFKA_TRANSACTION_STATE_LOG_MIN_ISR
              value: "2"
            - name: KAFKA_MIN_INSYNC_REPLICAS
              value: "2"
            - name: KAFKA_DEFAULT_REPLICATION_FACTOR
              value: "3"

            # Performance
            - name: KAFKA_NUM_PARTITIONS
              value: "3"
            - name: KAFKA_AUTO_CREATE_TOPICS_ENABLE
              value: "true"
            - name: KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS
              value: "3000"

            # JVM — minimal footprint
            - name: KAFKA_HEAP_OPTS
              value: "-Xmx256m -Xms256m"
            - name: KAFKA_JVM_PERFORMANCE_OPTS
              value: "-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:+DisableExplicitGC"

          command:
            - /bin/bash
            - -c
            - |
              export KAFKA_NODE_ID=$((${POD_NAME##*-} + 1))
              export KAFKA_ADVERTISED_LISTENERS="PLAINTEXT://${POD_NAME}.kafka-headless.${POD_NAMESPACE}.svc.cluster.local:9092"
              exec /etc/confluent/docker/run
          resources:
            requests:
              memory: 384Mi
              cpu: 150m
            limits:
              memory: 512Mi
              cpu: 350m
          volumeMounts:
            - name: kafka-data
              mountPath: /var/lib/kafka/data
          readinessProbe:
            tcpSocket:
              port: 9092
            initialDelaySeconds: 60
            periodSeconds: 10
            failureThreshold: 6
          # Remove livenessProbe entirely OR set it very conservatively
          livenessProbe:
            tcpSocket:
              port: 9092
            initialDelaySeconds: 300 # 5 full minutes before first liveness check
            periodSeconds: 30
            failureThreshold: 10 # must fail 10 times (~5 more minutes) before killing
      volumes:
        - name: kafka-data
          # emptyDir keeps costs down, but its contents do not survive pod rescheduling;
          # for durable production storage, use volumeClaimTemplates with a PVC instead.
          emptyDir:
            sizeLimit: 5Gi # caps disk usage per pod

Step 2: Stable Network Identity

The headless service provides stable DNS names:

apiVersion: v1
kind: Service
metadata:
  name: kafka-headless
  namespace: infra
spec:
  clusterIP: None
  publishNotReadyAddresses: true # the bone of contention
  selector:
    app: kafka
  ports:
    - name: internal
      port: 9092
    - name: controller
      port: 9093

This allows pods like kafka-0.kafka-headless.infra.svc.cluster.local to exist immediately, enabling quorum formation during startup.
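If you want to see the effect, a quick check from a debug pod with DNS tooling (busybox or similar, an assumption here) resolves the per-pod records even while the Kafka pods are still unready:

# Resolves even before the readiness probe passes, thanks to publishNotReadyAddresses
nslookup kafka-0.kafka-headless.infra.svc.cluster.local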

The separate ClusterIP service exists purely for client access.
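That service is not shown elsewhere in this article, so here is a minimal sketch. The name kafka is an assumption here, and, as noted earlier, a service with that name is exactly why enableServiceLinks: false is set on the pods:

apiVersion: v1
kind: Service
metadata:
  name: kafka # clients bootstrap via kafka.infra.svc.cluster.local:9092
  namespace: infra
spec:
  type: ClusterIP
  selector:
    app: kafka
  ports:
    - name: client
      port: 9092

Clients only use this service for the initial bootstrap; after fetching metadata they connect directly to the per-pod advertised listeners.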

Step 3: Parallel Startup and Identity

By default, a StatefulSet waits for each pod to start and become ready before creating the next one, continuing until the desired number of replicas is reached.

The StatefulSet uses:

podManagementPolicy: Parallel

so that all pods in the StatefulSet start at the same time rather than one at a time, which lets the controllers find each other and form a quorum during startup.

Each pod derives its Kafka node.id from its ordinal:

NODE_ID=$((${HOSTNAME##*-} + 1))

This ensures:

  • stable identity

  • consistent controller quorum mapping

  • repeatable restarts
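Concretely, that arithmetic produces the following mapping, which matches the controller.quorum.voters string exactly:

# Pod        node.id   quorum voter entry
# kafka-0 -> 1      -> 1@kafka-0.kafka-headless.infra.svc.cluster.local:9093
# kafka-1 -> 2      -> 2@kafka-1.kafka-headless.infra.svc.cluster.local:9093
# kafka-2 -> 3      -> 3@kafka-2.kafka-headless.infra.svc.cluster.local:9093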

Step 4: Formatting Storage Exactly Once

The init container formats the data directory only if meta.properties does not exist.

initContainers:
  - name: init-kafka
    image: confluentinc/cp-kafka:8.0.0
    command:
      - /bin/bash
      - -c
      - |
        if [ ! -f /var/lib/kafka/data/meta.properties ]; then
          NODE_ID=$((${HOSTNAME##*-} + 1))
          echo "Generating minimal kafka properties for storage format..."
          cat > /tmp/kafka-format.properties << EOF
        process.roles=broker,controller
        node.id=${NODE_ID}
        controller.quorum.voters=1@kafka-0.kafka-headless.infra.svc.cluster.local:9093,2@kafka-1.kafka-headless.infra.svc.cluster.local:9093,3@kafka-2.kafka-headless.infra.svc.cluster.local:9093
        controller.listener.names=CONTROLLER
        listeners=PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093
        advertised.listeners=PLAINTEXT://${HOSTNAME}.kafka-headless.infra.svc.cluster.local:9092
        log.dirs=/var/lib/kafka/data
        EOF
          
          echo "What's inside"
          cat /tmp/kafka-format.properties

          echo "Formatting KRaft storage with cluster ID: $KAFKA_CLUSTER_ID"
          kafka-storage format \
            --config /tmp/kafka-format.properties \
            --cluster-id "$KAFKA_CLUSTER_ID" \
            --ignore-formatted
        else
          echo "Storage already formatted, skipping..."
        fi
    env:
      - name: KAFKA_CLUSTER_ID
        valueFrom:
          configMapKeyRef:
            name: kafka-cluster-config
            key: cluster.id
    volumeMounts:
      - name: kafka-data
        mountPath: /var/lib/kafka/data
    securityContext:
      runAsUser: 1000

This guarantees:

  • idempotent restarts, so the controller quorum mapping stays consistent

  • no accidental re-initialization

  • no split-brain metadata

Step 5: Conservative Probes and Safety Nets

Liveness probes are intentionally delayed by five minutes. Kafka is allowed to breathe, form quorum, and stabilize before Kubernetes intervenes.

A PodDisruptionBudget ensures at least two brokers are always available.
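The budget itself is small; a sketch (the name kafka-pdb is assumed) looks like this:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
  namespace: infra
spec:
  minAvailable: 2 # voluntary disruptions may never take the cluster below 2 of 3 brokers
  selector:
    matchLabels:
      app: kafka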

Closing

What a Healthy Cluster Looks Like

When everything is configured correctly, all nodes start, discover each other via DNS, elect a Raft leader, and bring brokers online.

This command confirms success:

kafka-metadata-quorum \
  --bootstrap-server kafka-0.kafka-headless.infra.svc.cluster.local:9092 \
  describe --status

A healthy cluster shows:

  • LeaderId not -1

  • MaxFollowerLag of 0

  • all voters present
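For reference, healthy output looks roughly like this (illustrative values only; exact fields and formatting vary between Kafka versions):

ClusterId:              3F148FdeTqGDnpoKGdyX4A
LeaderId:               2
LeaderEpoch:            15
HighWatermark:          1204
MaxFollowerLag:         0
MaxFollowerLagTimeMs:   0
CurrentVoters:          [1,2,3]
CurrentObservers:       []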

That is what “done” looks like.