Setting up Apache Kafka with Docker

In this article we will look at what Apache Kafka is, its components, and simple steps for setting up Apache Kafka with ZooKeeper using Docker.

Introduction

Apache Kafka is an open-source distributed event streaming platform developed by the Apache Software Foundation. It is designed for building real-time data pipelines and streaming applications. Kafka is known for its high throughput, fault tolerance, durability, and low latency, making it a popular choice for handling large volumes of data in real-time or near-real-time scenarios.

Here are some key features and components of Apache Kafka:

  • Publish-Subscribe Messaging System: Kafka follows a publish-subscribe model where producers publish messages to topics, and consumers subscribe to topics to receive those messages. This decouples producers and consumers, allowing for scalability and flexibility.
  • Topics: Messages in Kafka are organized into topics, which act as message categories. Producers write messages to topics, and consumers read messages from topics. Topics can have multiple partitions, which enable parallel processing and distribution of data.
  • Brokers: Kafka brokers are servers that store and manage the data. Brokers work together to form a Kafka cluster. They are responsible for receiving, storing, and serving messages to consumers.
  • Partitions: Each topic can be divided into multiple partitions, which allow Kafka to horizontally distribute the data across brokers. Partitions help in achieving high throughput and parallel processing.
  • Replication: Kafka provides data replication for fault tolerance. Each partition can have one or more replicas distributed across different brokers. This ensures data durability and availability in case of broker failures.
  • Producers: Producers are responsible for sending messages to Kafka topics. They can publish messages to one or more topics and choose which partition to write to.
  • Consumers: Consumers are applications or processes that subscribe to Kafka topics and read messages. Kafka supports both real-time stream processing and batch processing through its consumer APIs.
  • Consumer Groups: Consumers can be organized into consumer groups, where each group processes messages independently. Kafka ensures that each message is consumed by only one consumer within a consumer group, allowing for parallel processing of data (see the command sketch after this list).
  • Connectors: Kafka Connect is a framework for building and running connectors that move data between Kafka and other data sources or sinks (e.g., databases, file systems, cloud services). It simplifies the integration of external systems with Kafka.
  • Stream Processing: Kafka Streams is a library for building real-time stream processing applications. It enables developers to process and transform data streams from Kafka topics and produce results back to Kafka or other sinks.
  • Scalability: Kafka is highly scalable and can handle large volumes of data. You can add more brokers to a cluster to increase capacity as needed.
  • Durability and Data Retention: Kafka retains data for a configurable period, allowing consumers to replay past messages. Data can be archived to long-term storage for compliance or analytics.
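To make these ideas concrete, here is a minimal sketch using the console tools that ship with every Kafka distribution. It assumes a broker reachable at localhost:9092 (we will start one via Docker below), and the topic and group names are made up for illustration:

# Create a topic named "orders" with 3 partitions
kafka-topics.sh --create --topic orders --partitions 3 \
  --replication-factor 1 --bootstrap-server localhost:9092

# A producer publishes messages to the topic (type lines, Ctrl+C to quit)
kafka-console-producer.sh --topic orders --bootstrap-server localhost:9092

# Run this in two terminals with the same group id: Kafka splits the three
# partitions between the two group members, so each message is handled by
# exactly one consumer in the group
kafka-console-consumer.sh --topic orders --group order-processors \
  --bootstrap-server localhost:9092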

Kafka is widely used in various industries, including finance, e-commerce, social media, and more, for use cases such as log aggregation, real-time monitoring, data warehousing, and event-driven architectures. Its ability to handle real-time data streams efficiently has made it a fundamental component in many modern data processing and analytics pipelines.

Installing and Running Kafka Using Docker

Pre-requisite: Install Docker & Docker Compose

First of all, you must install Docker using one of the following methods, based on your operating system:

  • Install Docker Desktop on Mac.

  • Install Docker Desktop on Windows.

  • Install Docker on Linux (choose your distro from the left-hand menu).

After successfully installing Docker, follow these steps to set up Docker Compose:

  • For Mac & Windows, if you have installed Docker Desktop, then Docker Compose is included as part of those desktop installs.

  • For Linux, follow the installation steps in the official Docker Compose documentation (complete all the steps). You can verify both installations with the commands below.
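Before moving on, confirm that both tools are available on your machine:

# Both commands should print a version number
docker --version
docker-compose --version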

How to run Kafka with Docker and Docker Compose

Create a folder on your machine to keep the files related to the Kafka Docker Compose setup.

Create a new file named docker-compose.yml in that directory and add the following content to it:

				
# docker-compose.yml
version: "3.7"
services:
  zookeeper:
    restart: always
    image: docker.io/bitnami/zookeeper:3.8
    ports:
      - "2181:2181"
    volumes:
      - "zookeeper-volume:/bitnami"
    environment:
      - ALLOW_ANONYMOUS_LOGIN=yes
  kafka:
    restart: always
    image: docker.io/bitnami/kafka:3.3
    ports:
      - "9093:9093"
    volumes:
      - "kafka-volume:/bitnami"
    environment:
      - KAFKA_BROKER_ID=1
      - KAFKA_CFG_ZOOKEEPER_CONNECT=zookeeper:2181
      - ALLOW_PLAINTEXT_LISTENER=yes
      - KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=CLIENT:PLAINTEXT,EXTERNAL:PLAINTEXT
      - KAFKA_CFG_LISTENERS=CLIENT://:9092,EXTERNAL://:9093
      - KAFKA_CFG_ADVERTISED_LISTENERS=CLIENT://kafka:9092,EXTERNAL://localhost:9093
      - KAFKA_CFG_INTER_BROKER_LISTENER_NAME=CLIENT
    depends_on:
      - zookeeper
volumes:
  kafka-volume:
  zookeeper-volume:

In the above config file we are using the Bitnami Docker images for Kafka and ZooKeeper. Note that the broker is configured with two listeners: CLIENT on port 9092, advertised as kafka:9092 for applications running inside the Docker network, and EXTERNAL on port 9093, advertised as localhost:9093 for clients running on your host machine.
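As a quick illustration of the two listeners, once the stack is running (see the docker-compose up step below) you can point the same tool at either address. This is only a sketch, assuming the stock Kafka CLI tools; it simply lists topics:

# From inside the Docker network, use the CLIENT listener address
docker-compose exec kafka kafka-topics.sh --list --bootstrap-server kafka:9092

# From your host machine, use the EXTERNAL listener address
# (this assumes the Kafka CLI tools are also installed locally)
kafka-topics.sh --list --bootstrap-server localhost:9093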

What is Zookeeper?

Apache ZooKeeper is a distributed coordination service often used in conjunction with Apache Kafka to manage and maintain the metadata and configuration information required by Kafka brokers and to ensure the reliability and stability of a Kafka cluster. While Kafka itself handles the distributed storage and processing of data, ZooKeeper plays a critical role in managing the distributed infrastructure and providing coordination services.

Here are some of the key responsibilities of Apache ZooKeeper in an Apache Kafka ecosystem:

Cluster Coordination: ZooKeeper helps in maintaining a reliable and consistent view of the Kafka cluster by electing leaders, coordinating distributed operations, and ensuring that all brokers are aware of the cluster’s state.

Leader Election: Kafka uses ZooKeeper for leader election among broker nodes. Each partition in Kafka has a leader, and ZooKeeper helps in selecting and maintaining the leader, ensuring that one broker is responsible for reading and writing data to a partition at any given time.

Configuration Management: Kafka stores configuration information, such as topic and partition details, in ZooKeeper. This centralizes the configuration management, making it easier to apply changes consistently across the Kafka cluster.

Dynamic Broker Registration: ZooKeeper assists in dynamic broker registration and discovery. When a Kafka broker starts or stops, it registers or deregisters itself with ZooKeeper, allowing other brokers and clients to discover the current set of active brokers.

Health Monitoring: ZooKeeper provides a way to monitor the health of Kafka brokers. By regularly checking the status of ZooKeeper nodes, administrators can detect broker failures and take appropriate actions, such as reassigning partitions.

Synchronization and Locking: ZooKeeper offers synchronization primitives like locks and semaphores, which Kafka can use to coordinate activities among brokers and clients. These primitives are helpful in scenarios where distributed coordination is required.

Metadata Storage: Kafka stores critical metadata, such as topic and partition information, broker configurations, and access control lists (ACLs), in ZooKeeper. This metadata is essential for the proper functioning of the Kafka cluster.
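For example, once the stack described above is running, you can peek at the broker registrations Kafka keeps in ZooKeeper. This is a sketch using zkCli.sh, the client bundled with the ZooKeeper image; /brokers/ids is the standard znode path Kafka uses for broker registration:

# List the ids of the brokers currently registered in ZooKeeper
docker-compose exec zookeeper zkCli.sh -server localhost:2181 ls /brokers/ids

# Show the registration data for broker 1 (the id set via KAFKA_BROKER_ID)
docker-compose exec zookeeper zkCli.sh -server localhost:2181 get /brokers/ids/1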

It’s important to note that, starting with Apache Kafka version 2.8.0, Kafka has been working toward removing its dependency on ZooKeeper through a feature called KRaft mode. In KRaft mode, Kafka replaces ZooKeeper with an internal metadata quorum, making clusters more self-contained and simpler to operate. The maturity of KRaft mode depends on the specific Kafka version you are using, so refer to the official Kafka documentation and release notes for the most up-to-date information on ZooKeeper’s role in your deployment.
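As a rough sketch only, a single-node KRaft setup with the same Bitnami image might look like the snippet below. The KAFKA_ENABLE_KRAFT flag and the KAFKA_CFG_* variables follow Bitnami's naming conventions, but the exact variables can differ between image versions, so verify each one against the Bitnami Kafka image documentation before relying on this:

# Hypothetical single-node KRaft setup (no ZooKeeper service at all);
# check every variable against the Bitnami image docs for your tag
version: "3.7"
services:
  kafka:
    image: docker.io/bitnami/kafka:3.3
    ports:
      - "9093:9093"
    environment:
      - KAFKA_ENABLE_KRAFT=yes
      - KAFKA_CFG_NODE_ID=1
      - KAFKA_CFG_PROCESS_ROLES=broker,controller
      - KAFKA_CFG_CONTROLLER_QUORUM_VOTERS=1@kafka:9094
      - KAFKA_CFG_CONTROLLER_LISTENER_NAMES=CONTROLLER
      - KAFKA_CFG_LISTENERS=CLIENT://:9092,CONTROLLER://:9094,EXTERNAL://:9093
      - KAFKA_CFG_ADVERTISED_LISTENERS=CLIENT://kafka:9092,EXTERNAL://localhost:9093
      - KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=CLIENT:PLAINTEXT,CONTROLLER:PLAINTEXT,EXTERNAL:PLAINTEXT
      - KAFKA_CFG_INTER_BROKER_LISTENER_NAME=CLIENT
      - ALLOW_PLAINTEXT_LISTENER=yes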

Continuing with the setup…

After the Docker Compose file is created, you can start your Kafka and ZooKeeper services using the following command (run it from the directory containing the file):

				
docker-compose up -d

Now you have successfully set up and started a Kafka container.
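To verify the setup end to end, you can run a quick smoke test with the console tools bundled in the Bitnami Kafka image (the topic name demo is made up):

# Confirm both containers are running
docker-compose ps

# Create a test topic and publish one message to it
docker-compose exec kafka kafka-topics.sh --create --topic demo \
  --partitions 1 --replication-factor 1 --bootstrap-server localhost:9092
docker-compose exec -T kafka bash -c \
  'echo "hello kafka" | kafka-console-producer.sh --topic demo --bootstrap-server localhost:9092'

# Read the message back (press Ctrl+C to stop the consumer)
docker-compose exec kafka kafka-console-consumer.sh --topic demo \
  --from-beginning --bootstrap-server localhost:9092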

In our next step we will see how we can publish a sample message to a topic and receive it using a simple NestJS (Node.js) application.

Refer to the official Kafka documentation (https://kafka.apache.org/documentation/) to read more about Kafka's architecture.
