What Kafka Actually Is
Before we talk about running Kafka on Kubernetes and all the ways it will punish you for getting one property wrong, it helps to understand what Kafka actually is and why it exists at all.
Apache Kafka is a distributed event streaming platform. That description sounds like it came from a product page, so here is a more grounded version: Kafka is a durable, ordered, replayable log that multiple services can write to and read from independently, at different speeds, without any of them needing to know the others exist.
The origin story is practical. LinkedIn built Kafka in 2010 because they were drowning in data that needed to move between systems: activity feeds, metrics, notifications. None of the existing tools handled the throughput or durability they needed. Kafka was open-sourced in 2011, became a top-level Apache project in 2012, and by 2014 several of its original creators had left LinkedIn to found Confluent.
Today Kafka sits at the core of some of the most data-intensive systems in the world. Uber uses it to process trip data and dispatch events in real time. Netflix relies on it for operational monitoring and internal pipelines. Financial institutions use Kafka for trade event streaming, fraud detection, and audit logging where every event must be persisted and replayable.
Kafka vs Traditional Message Queues
Most engineers encounter message queues before they encounter Kafka, and the mental model they bring from RabbitMQ or SQS can actively mislead them.
In a traditional queue, a message has a lifecycle: produced, delivered, acknowledged, deleted. The queue is a transient holding area. Once a consumer processes a message, the system forgets it ever existed.
Kafka works differently. Messages are retained for a configurable period regardless of consumption. Consumers track their own offsets, effectively bookmarks into the log. Kafka does not care what consumers do with messages.
This allows multiple independent consumer groups to read the same data at different speeds. It allows replay. It allows new services to catch up on historical data. These properties are why Kafka shows up so often in financial and data-heavy systems.
Kafka is not simpler than a queue. It requires more infrastructure, more configuration, and more operational knowledge. But once you need durable event history, multiple consumers, or replayability, traditional queues start feeling like the wrong tool.
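To make the consumer-group model concrete, here is a hedged sketch using the stock console consumer. The topic, group names, and bootstrap address are illustrative; the point is that two groups read the same log independently, and a new group can replay it from the beginning.
# Group "billing" consumes new records as they arrive and commits its own offsets.
kafka-console-consumer \
  --bootstrap-server kafka:9092 \
  --topic orders \
  --group billing
# A different (new) group reads the very same records from the start of the retained log,
# without affecting the billing group's committed offsets.
kafka-console-consumer \
  --bootstrap-server kafka:9092 \
  --topic orders \
  --group analytics \
  --from-beginning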
ZooKeeper: The Necessary Crutch That Overstayed Its Welcome
Early Kafka needed a place to store cluster metadata: broker registrations, topic configuration, partition leadership, and controller election state. Building a distributed coordination system from scratch was not realistic, so Kafka used Apache ZooKeeper.
ZooKeeper solved a real problem. It provided strong consistency, leader election, and coordination. Kafka brokers registered themselves in ZooKeeper, and the controller used it to manage cluster state.
Over time, problems accumulated.
ZooKeeper was a second distributed system to deploy, secure, monitor, and debug. At scale, it became a performance ceiling. Controller failover required reading all metadata from ZooKeeper, which could take tens of seconds in large clusters. Operational complexity became a barrier.
KRaft: Kafka Learns to Govern Itself
KRaft (Kafka Raft) is the result of KIP-500. It replaces ZooKeeper with a Raft-based consensus system built directly into Kafka.
In KRaft mode, a subset of Kafka nodes act as controllers. These controllers maintain a metadata log stored in an internal Kafka topic called __cluster_metadata. Raft keeps this log consistent across controller nodes.
If the active controller fails, another controller already has the metadata locally and can take over almost immediately. Failover times drop from tens of seconds to milliseconds.
Kafka nodes can run as brokers only, controllers only, or both. For production, controllers and brokers are often separated. For small or staging clusters, combined mode is simpler and perfectly acceptable.
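In configuration terms, the split is a single property. A sketch of the relevant server.properties lines, illustrative rather than a complete configuration:
# Combined mode: the node acts as both broker and KRaft controller.
process.roles=broker,controller
# Dedicated roles: run controllers and brokers as separate nodes.
# On a controller-only node:
process.roles=controller
# On a broker-only node:
process.roles=broker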
One caveat before moving on: kafka-storage random-uuid does not produce a standard UUID. It is a Kafka-specific value. Do not generate one manually unless you know exactly what you are doing.
Why Kubernetes Makes Kafka Hard
Kubernetes is built around stateless workloads. Pods are disposable. Identity does not matter.
Kafka is stateful. Each broker has:
- a persistent node ID
- local disk with partition data
- an advertised network identity
This is why Kafka must run as a StatefulSet, not a Deployment. StatefulSets give each pod a stable identity like kafka-0, kafka-1, kafka-2.
Kafka on Kubernetes also requires two services:
- A headless service for stable per-pod DNS identities
- A ClusterIP service for client traffic
They serve different purposes and are both required.
The Hidden 20% That Makes or Breaks Everything
This is where many guides quietly assume knowledge you might not have. Let’s make it explicit.
- publishNotReadyAddresses: The DNS Deadlock
Kubernetes normally publishes pod DNS records only after readiness probes pass. Kafka brokers cannot become ready until they can form a KRaft quorum. Quorum requires DNS. DNS requires readiness.
That is a deadlock.
The fix is:
publishNotReadyAddresses: true
This allows DNS records to exist before readiness succeeds.
- podManagementPolicy: Parallel
StatefulSets default to creating pods one at a time. KRaft requires multiple controllers to start together: when the controllers start, they exchange messages over the network to form a quorum and decide which controller is the leader and which are followers. Set:
podManagementPolicy: Parallel
Without this, the cluster never forms quorum.
- enableServiceLinks: false
Kubernetes injects environment variables for every service. If you have a service named kafka, Kubernetes injects KAFKA_PORT.
The Confluent image interprets this as Kafka configuration and behaves unpredictably.
Disable it:
enableServiceLinks: false
- Two Cluster ID Variables
Confluent’s entrypoint requires both:
- CLUSTER_ID
- KAFKA_CLUSTER_ID
They must contain the same value.
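A quick way to confirm both variables actually landed in a running broker pod, assuming kubectl access and the pod naming used later in this article:
# Both lines should print the same Kafka cluster ID.
kubectl exec -n infra kafka-0 -- env | grep -E '^(KAFKA_)?CLUSTER_ID='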
How All of This Comes Together in Practice
Up to this point, we have talked about KRaft, StatefulSets, headless services, and controller quorum as concepts. Let’s ground them in a concrete configuration.
The following setup describes a three-node Kafka KRaft cluster running in Kubernetes. Every design choice in this configuration directly addresses one of the failure modes discussed earlier.
Step 1: One Cluster ID, Forever
KRaft clusters require a stable cluster ID that never changes across broker (pod, in this context) restarts.
We generate it once and store it in a ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
name: kafka-cluster-config
namespace: infra
data:
cluster.id: "3F148FdeTqGDnpoKGdyX4A"
This value is injected into:
- the init container (for storage formatting)
- the Kafka container (for runtime validation)
Both CLUSTER_ID and KAFKA_CLUSTER_ID reference the same value to satisfy the Confluent entrypoint and Kafka itself. The full StatefulSet below ties all of the earlier fixes together:
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: kafka
namespace: infra
spec:
serviceName: kafka-headless
podManagementPolicy: Parallel
replicas: 3
selector:
matchLabels:
app: kafka
template:
metadata:
labels:
app: kafka
spec:
enableServiceLinks: false # prevents k8s injecting KAFKA_PORT and similar vars
securityContext:
fsGroup: 1000
initContainers:
- name: init-kafka
image: confluentinc/cp-kafka:8.0.0
command:
- /bin/bash
- -c
- |
if [ ! -f /var/lib/kafka/data/meta.properties ]; then
NODE_ID=$((${HOSTNAME##*-} + 1))
echo "Generating minimal kafka properties for storage format..."
cat > /tmp/kafka-format.properties << EOF
process.roles=broker,controller
node.id=${NODE_ID}
controller.quorum.voters=1@kafka-0.kafka-headless.infra.svc.cluster.local:9093,2@kafka-1.kafka-headless.infra.svc.cluster.local:9093,3@kafka-2.kafka-headless.infra.svc.cluster.local:9093
controller.listener.names=CONTROLLER
listeners=PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093
advertised.listeners=PLAINTEXT://${HOSTNAME}.kafka-headless.infra.svc.cluster.local:9092
log.dirs=/var/lib/kafka/data
EOF
echo "What's inside"
cat /tmp/kafka-format.properties
echo "Formatting KRaft storage with cluster ID: $KAFKA_CLUSTER_ID"
kafka-storage format \
--config /tmp/kafka-format.properties \
--cluster-id "$KAFKA_CLUSTER_ID" \
--ignore-formatted
else
echo "Storage already formatted, skipping..."
fi
env:
- name: KAFKA_CLUSTER_ID
valueFrom:
configMapKeyRef:
name: kafka-cluster-config
key: cluster.id
volumeMounts:
- name: kafka-data
mountPath: /var/lib/kafka/data
securityContext:
runAsUser: 1000
containers:
- name: kafka
image: confluentinc/cp-kafka:8.0.0
ports:
- containerPort: 9092
- containerPort: 9093
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: KAFKA_CLUSTER_ID
valueFrom:
configMapKeyRef:
name: kafka-cluster-config
key: cluster.id
- name: CLUSTER_ID
valueFrom:
configMapKeyRef:
name: kafka-cluster-config
key: cluster.id
# KRaft
- name: KAFKA_PROCESS_ROLES
value: "broker,controller"
- name: KAFKA_CONTROLLER_QUORUM_VOTERS
value: "[email protected]:9093,[email protected]:9093,[email protected]:9093"
# Listeners
- name: KAFKA_LISTENERS
value: "PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093"
- name: KAFKA_LISTENER_SECURITY_PROTOCOL_MAP
value: "PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT"
- name: KAFKA_INTER_BROKER_LISTENER_NAME
value: "PLAINTEXT"
- name: KAFKA_CONTROLLER_LISTENER_NAMES
value: "CONTROLLER"
# Retention — 1 day for cost saving
- name: KAFKA_LOG_RETENTION_HOURS
value: "24"
- name: KAFKA_LOG_RETENTION_BYTES
value: "1073741824" # 1GB cap per partition as safety net
- name: KAFKA_LOG_SEGMENT_BYTES
value: "268435456" # 256MB segments
- name: KAFKA_LOG_RETENTION_CHECK_INTERVAL_MS
value: "300000"
# Replication
- name: KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR
value: "3"
- name: KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR
value: "3"
- name: KAFKA_TRANSACTION_STATE_LOG_MIN_ISR
value: "2"
- name: KAFKA_MIN_INSYNC_REPLICAS
value: "2"
- name: KAFKA_DEFAULT_REPLICATION_FACTOR
value: "3"
# Performance
- name: KAFKA_NUM_PARTITIONS
value: "3"
- name: KAFKA_AUTO_CREATE_TOPICS_ENABLE
value: "true"
- name: KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS
value: "3000"
# JVM — minimal footprint
- name: KAFKA_HEAP_OPTS
value: "-Xmx256m -Xms256m"
- name: KAFKA_JVM_PERFORMANCE_OPTS
value: "-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:+DisableExplicitGC"
command:
- /bin/bash
- -c
- |
export KAFKA_NODE_ID=$((${POD_NAME##*-} + 1))
export KAFKA_ADVERTISED_LISTENERS="PLAINTEXT://${POD_NAME}.kafka-headless.${POD_NAMESPACE}.svc.cluster.local:9092"
exec /etc/confluent/docker/run
resources:
requests:
memory: 384Mi
cpu: 150m
limits:
memory: 512Mi
cpu: 350m
volumeMounts:
- name: kafka-data
mountPath: /var/lib/kafka/data
readinessProbe:
tcpSocket:
port: 9092
initialDelaySeconds: 60
periodSeconds: 10
failureThreshold: 6
# Remove livenessProbe entirely OR set it very conservatively
livenessProbe:
tcpSocket:
port: 9092
initialDelaySeconds: 300 # 5 full minutes before first liveness check
periodSeconds: 30
failureThreshold: 10 # must fail 10 times (~5 more minutes) before killing
volumes:
- name: kafka-data
emptyDir:
sizeLimit: 5Gi # caps disk usage per pod
Step 2: Stable Network Identity
The headless service provides stable DNS names:
apiVersion: v1
kind: Service
metadata:
name: kafka-headless
namespace: infra
spec:
clusterIP: None
publishNotReadyAddresses: true # the bone of contention
selector:
app: kafka
ports:
- name: internal
port: 9092
- name: controller
port: 9093
This allows pods like kafka-0.kafka-headless.infra.svc.cluster.local to exist immediately, enabling quorum formation during startup.
The separate ClusterIP service exists purely for client access.
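That client-facing service is not shown elsewhere in this article, so here is a minimal sketch of what it might look like; the name kafka and the single client port are assumptions:
apiVersion: v1
kind: Service
metadata:
  name: kafka
  namespace: infra
spec:
  selector:
    app: kafka
  ports:
    - name: client
      port: 9092
      targetPort: 9092
Clients use it only for the initial bootstrap connection; after that they follow the advertised per-pod addresses from the headless service.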
Step 3: Parallel Startup and Identity
By default, a StatefulSet waits for each pod to start and become ready before creating the next one, continuing until the desired number of replicas is reached.
The StatefulSet uses:
podManagementPolicy: Parallel
so that all pods in the StatefulSet start at the same time rather than waiting for each one to become ready before the next is created. This is what allows the controllers to find each other and form a quorum during initial startup.
Each pod derives its Kafka node.id from its ordinal:
NODE_ID=$((${HOSTNAME##*-} + 1))
This ensures:
- stable identity
- consistent controller quorum mapping
- repeatable restarts
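The ${HOSTNAME##*-} expansion is terse, so here is a worked example of the mapping it produces and how it lines up with the controller.quorum.voters list:
# ${HOSTNAME##*-} strips everything up to the last "-", leaving the pod ordinal.
# The +1 shifts it to match the voter IDs:
#   kafka-0 -> node.id 1 (1@kafka-0.kafka-headless...)
#   kafka-1 -> node.id 2 (2@kafka-1.kafka-headless...)
#   kafka-2 -> node.id 3 (3@kafka-2.kafka-headless...)
HOSTNAME=kafka-2
NODE_ID=$((${HOSTNAME##*-} + 1))
echo "$NODE_ID"   # prints 3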
Step 4: Formatting Storage Exactly Once
The init container formats the data directory only if meta.properties does not exist.
initContainers:
- name: init-kafka
image: confluentinc/cp-kafka:8.0.0
command:
- /bin/bash
- -c
- |
if [ ! -f /var/lib/kafka/data/meta.properties ]; then
NODE_ID=$((${HOSTNAME##*-} + 1))
echo "Generating minimal kafka properties for storage format..."
cat > /tmp/kafka-format.properties << EOF
process.roles=broker,controller
node.id=${NODE_ID}
controller.quorum.voters=1@kafka-0.kafka-headless.infra.svc.cluster.local:9093,2@kafka-1.kafka-headless.infra.svc.cluster.local:9093,3@kafka-2.kafka-headless.infra.svc.cluster.local:9093
controller.listener.names=CONTROLLER
listeners=PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093
advertised.listeners=PLAINTEXT://${HOSTNAME}.kafka-headless.infra.svc.cluster.local:9092
log.dirs=/var/lib/kafka/data
EOF
echo "What's inside"
cat /tmp/kafka-format.properties
echo "Formatting KRaft storage with cluster ID: $KAFKA_CLUSTER_ID"
kafka-storage format \
--config /tmp/kafka-format.properties \
--cluster-id "$KAFKA_CLUSTER_ID" \
--ignore-formatted
else
echo "Storage already formatted, skipping..."
fi
env:
- name: KAFKA_CLUSTER_ID
valueFrom:
configMapKeyRef:
name: kafka-cluster-config
key: cluster.id
volumeMounts:
- name: kafka-data
mountPath: /var/lib/kafka/data
securityContext:
runAsUser: 1000
This guarantees:
- idempotent restarts, so that quorum mapping can be maintained
- no accidental re-initialization
- no split-brain metadata
Step 5: Conservative Probes and Safety Nets
Liveness probes are intentionally delayed by five minutes. Kafka is allowed to breathe, form quorum, and stabilize before Kubernetes intervenes.
A PodDisruptionBudget ensures at least two brokers are always available.
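The budget itself is small. A minimal sketch, matching the app: kafka label on the StatefulSet; the name is illustrative:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
  namespace: infra
spec:
  minAvailable: 2   # voluntary disruptions may never take the cluster below two brokers
  selector:
    matchLabels:
      app: kafka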
Closing: What a Healthy Cluster Looks Like
When everything is configured correctly, all nodes start, discover each other via DNS, elect a Raft leader, and bring brokers online.
This command confirms success:
kafka-metadata-quorum \
--bootstrap-server kafka-0.kafka-headless.infra.svc.cluster.local:9092 \
describe --status
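Those DNS names only resolve inside the cluster, so the simplest place to run the check is one of the broker pods, assuming kubectl access:
kubectl exec -n infra kafka-0 -- kafka-metadata-quorum \
  --bootstrap-server kafka-0.kafka-headless.infra.svc.cluster.local:9092 \
  describe --status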
A healthy cluster shows:
- LeaderId not -1
- MaxFollowerLag of 0
- all voters present
That is what “done” looks like.
