Apache Kafka is a distributed log-based stream processing platform that combines Pubsub semantics with durable ordered storage. Producers append records to named topics, while consumers read those partitions at their own pace, enabling loose coupling between services.
Core pieces
- Brokers store topic partitions and replicate logs across the cluster for durability.
- Topics + partitions provide scalability by sharding ordered logs; consumers track offsets per partition.
- Consumer groups balance load by ensuring only one consumer in a group processes a partition at a time.
- Schema registry / serializers (e.g. Avro, Protobuf) keep producers and consumers compatible.
Typical uses
- Event streaming between microservices, change data capture pipelines, and ETL into data lakes/warehouses.
- Buffering bursty workloads so downstream systems ingest at stable rates.
- Real-time analytics when paired with Kafka Streams, ksqlDB, or external processors (Flink, Spark, Beam).
Ecosystem
- Kafka Connect manages scalable source/sink connectors without bespoke consumers or producers.
- Kafka Streams embeds stateful stream processing inside applications without a separate cluster.
- ksqlDB provides a SQL-like interface for defining persistent queries over topics.
Operational notes
- Requires ZooKeeper (Kafka < 3.3) or the built-in KRaft controller quorum (Kafka 3.3+) for metadata.
- Plan for disk-heavy workloads; retention policies are per-topic (time and/or size), and compaction keeps the latest value per key.
- Monitor broker health via lag metrics, ISR size, and controller elections; automate rebalancing when adding brokers.
See also: