Apache Kafka Tutorial - Architecture
Overview
In this tutorial, we will look at the Apache Kafka architecture, the internal workings of Apache Kafka, and its components.
What is Apache Kafka?
Apache Kafka is a high-throughput distributed streaming platform. It is publish-subscribe messaging rethought as a distributed commit log.
A streaming platform has three key capabilities:
- Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
- Store streams of records in a fault-tolerant, durable way.
- Process streams of records as they occur.
Traditional Messaging Limitations and Challenges:
- Limited scalability: Collecting and distributing data as messages relies on a central messaging broker, which is often a bottleneck.
- Small message sizes: Large messages put severe strain on message brokers, and this is a challenge because you may not be able to control the size of messages coming from some systems.
- Required rapid consumption: A messaging environment depends on consumers being able to consume messages at a reasonable rate.
- Not fault tolerant at the application level: If a consumer loses a message or processes it incorrectly, it is extremely difficult to get it back for reprocessing.
These are typical enterprise challenges when it comes to handling data sets that keep growing and move faster and faster through systems.
Is there a better way to handle fast-growing streams of messages in a distributed application?
Well, it just so happens that in 2010 LinkedIn asked that same question, and this is where Kafka comes in: it started as an internal project at LinkedIn, and in 2011 it was made public.
Incidentally, you may be wondering why LinkedIn named their solution Kafka. The name refers to the German-language writer Franz Kafka, whose work famously inspired an adjective based on his name ("Kafkaesque"). They named their solution after the author whose name best described the situation they were hoping to escape from.
Apache Kafka's Architecture
Apache Kafka is truly a messaging system. More specifically, it is a publish-subscribe messaging system. In a pub/sub system, there are publishers of messages and subscribers to messages. A publisher creates some data and sends it to a specific location, where an interested and authorized subscriber can retrieve the message and process it.
In Kafka, we call these traditional publishers something slightly different: producers. The subscribers we call consumers. A producer sends its messages to a specific location; in Kafka, this location is referred to as a topic, which is really a collection or grouping of messages.
The same goes for consumers: a consumer retrieves messages based on the topic it is interested in.
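To make the producer-topic relationship concrete, here is a minimal producer sketch using Kafka's Java client. The broker address localhost:9092 and the topic name "orders" are assumptions made for illustration:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // The producer sends a message to the "orders" topic (a hypothetical name);
        // the broker stores it, and any interested consumer can later retrieve it.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "order-1", "order created"));
        }
    }
}
```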
The messages and their topics need to be kept somewhere; after all, they are physical containers of data. The place where Kafka keeps and maintains topics is called the broker.
The Kafka broker is a software process, also referred to as an executable or daemon service, that runs on a machine, physical or virtual. A synonym for a broker is a server.
The Apache Kafka Cluster
How the Kafka broker handles messages in its topics is what gives Kafka its high-throughput capability. Achieving high throughput is largely a function of how well a system can distribute its load and efficiently process it on multiple nodes in parallel.
With Apache Kafka, you can scale out the number of brokers as much as needed to achieve the required levels of throughput, without affecting existing producer and consumer applications.
A Kafka cluster is just a grouping of brokers (worker nodes).
A distributed system is one that consists of multiple independent resources, also known as workers or nodes, sometimes even called worker nodes. The reason there are multiple nodes is to spread the work around, and all of the available worker nodes need coordination to ensure consistency and optimal progress.
Apache Zookeeper
To manage the cluster, Kafka makes use of Apache ZooKeeper. Apache ZooKeeper is a coordination service for distributed applications that enables synchronization across a cluster. ZooKeeper serves as a centralized service for metadata about vast clusters of distributed nodes: bootstrap and runtime configuration information, health and synchronization status, and cluster and quorum group membership, including the roles of elected nodes.
Kafka Components
Message
At a high level, a Kafka message has a timestamp that is set when the message is received by a Kafka broker. Furthermore, each message received gets a unique identifier. The combination of the timestamp and the identifier forms the message's placement in the sequence of messages received within a topic.
The message itself has a binary payload of data, which is what the producers and consumers really care about. From the consumer's perspective, it simply reads messages from a topic.
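As a sketch of how these parts surface in Kafka's Java client, a consumer can inspect the timestamp, identifier (offset), and payload of each record it reads:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;

public class MessageAnatomy {
    // Print the parts of a Kafka message described above, as exposed by the Java client.
    static void inspect(ConsumerRecord<String, String> record) {
        System.out.println("timestamp: " + record.timestamp()); // when the message was stamped
        System.out.println("offset: " + record.offset());       // its sequential identifier within the topic's partition
        System.out.println("payload: " + record.value());       // the (deserialized) data consumers care about
    }
}
```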
Topics
In Kafka, a topic is a logical entity, something that virtually spans the entire cluster of brokers. It is a category name or collection point for messages: producers send messages to a topic, and consumers retrieve messages from it, like a mailbox.
Behind the scenes, for each topic, the Kafka cluster maintains one or more physical log files in which the messages are stored.
Event sourcing
When a producer sends a message to a Kafka topic, the message is appended to a time-ordered, sequential stream. Each message represents an event, or fact, made available to potential consumers. These events are immutable: once they are received into a topic, they cannot be changed. It is the job of the consumer to reconcile the messages when it reads and processes them. This style of maintaining data as events is an architectural style known as event sourcing.
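As a minimal illustration of the event-sourcing idea (using plain Java and hypothetical deposit/withdraw events rather than a live topic), a consumer derives current state by folding over the immutable events it reads:

```java
import java.util.List;

public class EventSourcingSketch {
    public static void main(String[] args) {
        // Immutable events, in the order they would be read from a topic.
        List<String> events = List.of("deposit:100", "withdraw:30", "deposit:50");

        // The consumer reconciles the events into current state; the events never change.
        int balance = 0;
        for (String event : events) {
            String[] parts = event.split(":");
            int amount = Integer.parseInt(parts[1]);
            balance += parts[0].equals("deposit") ? amount : -amount;
        }
        System.out.println("Current balance: " + balance); // prints 120
    }
}
```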
The Offset
The offset is a placeholder:
- It marks the position of the last read message.
- It is maintained by the Kafka consumer.
- It corresponds to the message identifier.
You can think of it like a bookmark that maintains the last read position. In the case of a Kafka topic, that is the last read message.
When a consumer wants to read from a topic, it must establish a connection with a broker. Upon connecting, the consumer decides what messages it wants to consume. As it reads through the sequence of messages, it will inevitably come to the last message in the topic and move its offset accordingly.
Now, another consumer reading from the same topic could be at a different place in that topic. It could have already read the messages from the beginning and simply be waiting for more messages to arrive so it can read and process them. The key here is that each consumer knows where it left off and can choose to advance its position, stay put, or go back and reread a previously read message, all without the producer, brokers, or other consumers needing to know or care. When new messages arrive, the connected consumer receives an event indicating there is a new message, and it can advance its position once it retrieves the new message. When the last message in the topic has been read and processed, the consumer can set its offset, and at that point it is caught up.
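Here is a minimal consumer sketch using Kafka's Java client that reads from the hypothetical "orders" topic and commits its offset as it goes; the broker address and group name are illustrative assumptions:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "tutorial-group");          // hypothetical consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");         // we commit the offset manually below
        props.put("auto.offset.reset", "earliest");       // start from the beginning if no offset is stored

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders")); // hypothetical topic
            while (true) {
                // poll() returns any new messages and advances the consumer's position.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
                consumer.commitSync(); // persist the offset so the consumer can resume where it left off
            }
        }
    }
}
```

Committing the offset only after processing gives at-least-once behavior: if the consumer crashes, it resumes from the last committed position and may reread a few messages.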
Kafka Partition
- Each topic has one or more partitions.
- A partition is the basis on which Kafka can:
  - Scale: Apache Kafka can be scaled horizontally, which increases throughput.
  - Become fault tolerant: Apache Kafka maintains replicated copies of data, so if any broker in the cluster goes down, it does not affect the working of the cluster. This is done by setting the replication factor to greater than 1.
  - Achieve higher levels of throughput.
- Each partition is maintained on one or more brokers.
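As a sketch, assuming a locally running broker, a topic with multiple partitions and a replication factor can be created with Kafka's Java AdminClient; the topic name and counts are illustrative:

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        try (AdminClient admin = AdminClient.create(props)) {
            // A topic named "orders" with 3 partitions, each replicated to 2 brokers
            // (a replication factor of 2 requires at least 2 brokers in the cluster).
            NewTopic topic = new NewTopic("orders", 3, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```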
Partitioning trade-offs
- The more partitions, the greater the ZooKeeper overhead.
  - With large partition counts, ensure ZooKeeper capacity.
- Message ordering becomes more complex.
  - Use a single partition for global ordering.
  - Otherwise, handle ordering in the consumer or via message keys, as sketched below.
- The more partitions, the longer the leader fail-over time.
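To sketch the key-based ordering point: with Kafka's Java client, the default partitioner hashes the message key, so messages sharing a key land in the same partition and preserve their relative order. The topic and key names here are illustrative:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PerKeyOrdering {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Both messages use the key "customer-42", so they hash to the same
            // partition of the (hypothetical) "orders" topic and keep their order,
            // even though the topic may have many partitions.
            producer.send(new ProducerRecord<>("orders", "customer-42", "order created"));
            producer.send(new ProducerRecord<>("orders", "customer-42", "order shipped"));
        }
    }
}
```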
Message Retention Policy
- Apache Kafka retains all published messages regardless of consumption.
- The retention period is configurable.
  - The default is 168 hours, or seven days.
- The retention period is defined on a per-topic basis.
- Physical storage resources can constrain message retention.
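As a sketch, the per-topic retention period can be adjusted through the topic-level retention.ms setting, here via the Java AdminClient's incrementalAlterConfigs; the broker address and topic name are illustrative:

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
            // retention.ms is a per-topic setting; 604800000 ms = 168 hours = 7 days (the default).
            AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "604800000"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Collections.singletonMap(topic, Collections.singleton(op)))
                 .all().get();
        }
    }
}
```

Once a message's age exceeds the retention period, the broker deletes it regardless of whether any consumer has read it.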