Top Apache Kafka Interview Questions and Answers (2024)
Without a doubt Kafka is considered to have been the ultimate choice of data processing
pipelines over the years, the open-source message broker project developed by Apache Software
Foundation.
Apache Kafka is quickly becoming the chosen messaging platform for a distributed
messaging network which is unbelievably scalable. The ability to ingeste data at a lightening
pace makes it an ideal option to create
complex pipelines for data processing.
Kafka's success has brought a variety of work opportunities and career prospects around
it.In this post, questions from Kafka Interviews will be answered for Experienced and Freshers.
We're trying to share our experience
and learn how to help you make progress in your career.
Apache Kafka Tutorial :
- Apache Kafka Architecture
- Install Apache Kafka (Zookeeper+Broker) / Kafka Tool
- Spring Boot Apache Kafka String Message Example
- Spring Boot Apache Kafka JSON Message Example
- Apache Kafka Interview Questions and Answers
Q: What is Kafka ?
Ans:
Apache Kafka is a high-throughput, fault-tolerant and highly scalable distributed messaging system designed by LinkedIn. It's a publish-subscribe messaging rethought as distributed commit log
It was the idea of Jay Kreps, Neha Narkhede and Jun Rao, a group of LinkedIn engineers who were working on data streaming tasks. The event data from the LinkedIn websites and the entire infrastructure were ingested into the architecture of Lambda to be harnessed by Hadoop and other real-time processing systems.The company has encountered low latency problems in the processing of real-time events during this process. Kafka's answer was. It was planned in 2010, and released in 2011. Using Kafka, LinkedIn was able to send large streams of message to clusters in Hadoop. It later became a project by Apache.
Q: What are challenges with traditional messaging system ?
Ans:
Businesses have multiple options when it comes to choosing a messaging system. However,
traditional messaging systems come with several challenges. Here are the most common ones.
- Single Point of Failure : The traditional messaging systems are designed for a topology of the Hub and Spoke. All messages are stored in this design on a central server, or broker. So, at any given point in time each client application connects to one server or broker. This design can turn out to be a single point of failure, as all the topics are queued in the central hub. Even if you add a standby to the primary broker, the client application will only be able to connect to one node.
- Difficulties in horizontal scaling : While Hub and spoken architectures have developed into multi-node networks, horizontal scaling is not allowed by the architecture. A single client application connects at a given time to a single node. When you increase the number of nodes, the amount of internode traffic that writes and reads processes also increases.
- Monolithic architecture : Traditional messaging systems are designed to address a monolithic architecture's data challenges. Nevertheless, most companies now operate a clustered, distributed computing system. Large Commodity Hardware clusters can not be scaled horizontally in this design. Moreover, messages will wait in queues.
Checkout our related posts :
Q: What is difference between Message Oriented Middleware vs Kafka ?
Ans:
The main distinction between Message Middleware and Kafka is that consumers would never immediately receive messages. They have to explicitly ask for a message when they are ready to handle.
Until Apache Kafka was implemented, Message Oriented Middleware (MOM) such as Apache Qpid, RabbitMQ, Microsoft Message Queue, and IBM MQ Series were used to share messages across multiple components; While these products are ideal for the implementation of the publisher / subscriber (Publisher / Sub) model, they are not explicitly designed to handle large streams of data from thousands of publishers. Most of the MOM software has an asynchronous contact broker that discloses the Advanced Message Queuing Protocol (AMQP).Kafka is built from the ground up to deal with millions of events created in rapid succession of firehose-style ones. It guarantees low latency, "at-least-once," delivery of messages to consumers. Kafka also advocates data retention for offline users, ensuring that the data can be stored in either real-time or offline mode.
Kafka is designed to be a distributed commit log, building further on the persistence and retention. Unlike relational databases, it can have a permanent record of all transactions which can be played back to restore a system's state. The main point to remember is that the data is stored in an order that can be deterministically read for long term. Kafka provides redundancy due to the distributed architecture which ensures high data availability even when one of the servers faces disruption.
Q: Elaborate on architecture of Apache Kafka?
Ans:
The Kafka architecture consists of four basic components namely Brokers, Consumers, Producers and Zookeepers. The basic element is a Message which is actually a payload of bytes. Streams of messages belonging to a specific category is called a Topic. Each message is identified by its index called offset.
- Topics : Kafka stores data in Topics. Each topic is partitioned as a set of segment files that are equal in size. Messages in the partition are organized into an immutable order sequence. Each topic has a minimum of one partition. The backup of a partition is called a Replica.
- Brokers : A broker is a stateless server that stores the published messages. A set of servers are also called clusters. One instance of Kafka broker can handle TB of messages and hundreds of thousands of reads and writes per second without impacting the performance. As brokers are stateless, a zookeeper is used to maintain the cluster state.
- Zookeepers : The role of a zookeeper is to co-ordinate and manage Kafka brokers. When a new broker is started or an existing broker is stopped, the zookeeper sends a notification to producers and consumers. Based on these notifications, producers and consumers communicate with brokers accordingly. Many distributed systems such as Apache Hadoop, Neo4J and HBase are using Zookeeper to successfully run their projects.
- Producers : A producer is the element that publishes messages to a topic. In technical terms, producers push data to brokers. Producers doesn't need an acknowledgement from a broker. They continuously push data to brokers as fast as the broker can handle. Every time a new broker is started, Kafka producers search for that new broker and send messages to that broker.
- Consumers : A consumer is a Kafka element that subscribes to one or more topics to consume messages from brokers. Consumers use partition offset values to maintain how many messages have been consumed. This partition offset value is obtained from the zookeeper. When a particular message offset is acknowledged by a consumer, it means all the prior messages are consumed.
Q: How does Apache Kafka messaging system work?
Ans:
Kafka supports both publish-subscribe messaging system as well as queue based messaging system. Here is an example of workflow in the publish-subscribe system
- Producers are responsible for pushing data to brokers. Producers regularly send messages to topics.
- Brokers categorize these messages into specific topics and stores them in corresponding partitions. All the partitions share equal number of messages. When there are two messages, each partition stores one message.
- To pull data from these partitions, consumers first subscribe to a topic. Then the offset of the topic is provided to the consumer. This offset value is also saved in the zookeeper ensemble.
- Consumers request for messages in regular intervals. When new messages are posted by producers, they are forwarded to consumers.
- Consumers process these messages and send an acknowledgement to the broker after the processing of the messages is done
- After an acknowledgement is received, the broker updates the offset value with a new one.
- This workflow is repeated till the consumer stops requesting for message
- Consumers can rewind to any offset value to consume desired messages.
Q: How does Queue Messaging system work?
Ans:
The queue messaging system is similar to publish-subscribe system. The only difference is that instead of one, consumers come in groups to pull data from partitions. They are segregated into groups and data is published.
- Producers are responsible for pushing data to brokers. Producers regularly send messages to topics.
- Brokers categorize these messages into specific topics and stores them in corresponding partitions. All the partitions share equal number of messages. When there are two messages, each partition stores one message.
- To pull data from these partitions, consumers first subscribe to a topic. Then the offset of the topic is provided to the consumer. This offset value is also saved in the zookeeper ensemble.
- Brokers supply messages to the consumer until another consumer subscribes to the same topic. With the arrival of a new consumer, the broker shares the messages between the consumers. This process will be carried out until the number of consumers doesn't exceed the threshold for that partition. This group is called a consumer group
- After maximum number of consumers subscribe to a topic, the broker doesn't entertain new consumers till one of the consumer unsubscribes from that topic.
- Consumers request for messages in regular intervals. When new messages are posted by producers, they are forwarded to consumers.
Q: Explain one Apache Kafka Use Case?
Ans:
Here is an example of how Apache Kafka helps you to process streaming data in real time. Consider an instance wherein you have tomonitor trending twitter feeds and hashtags. You need a Kafka producer that collects data from Twitter, processes it, extracts hashtags and sends this data to the Kafka Ecosystem. You can install a data processing engine such as Apache Storm or Apache Spark.
Firstly, the Kafka producer collects data from the Twitter feed using the Twitter Streaming API. You can use any programming language to access the Twitter streaming API and get details of various subsets of public and protected Twitter data. You also need to sign up for a Twitter developer account. After signing up, you can get OAuth authentication details such as Customerkey, CustomerSecret, AccessToken and AccessTokenSecret. The Kafka producer can receive twitter feed and process it to extract hashtags. This information is sent to the Apache Storm/Spark ecosystem
Apache Kafka is used by some of the popular applications such as Twitter, Netflix, Oracle, Mozilla and LinkedIn.Q: Why Kafka is better than other messaging systems?
Ans:
There are several messaging systems that are alternate to Kafka. RabbitMQ, ActiveMQ and ZeroMQ are some of the popular ones
- RabbitMQ : RabbitMQ is a powerful messaging broker that was actually designed to implement Advanced Message Queuing Protocol (AMQP). This open-source tool is developed in Erlang and is easy to install and use. RabbitMQ supports both message persistence and replication. RabbitMQ comes with excellent routing capabilities based on rules. The performance is good. RabbitMQ is broker-centric which means it focuses on deliver guarantees between the consumer and the producers. Advanced capabilities such as routing, persistent messaging and load balancing can be performed with a few lines of code. However, RabbitMQ messaging system is not distributed. When you have an infrastructure that scales massively, RabbitMQ won't be able to match that capability
- ActiveMQ : ActiveMQ is another popular messaging system that is easy to implement and
use. It enjoys largest number of installations. The deployment supports both P2P and broker
topologies. With a few lines of
code, you can implement advanced capabilities. It uses Java Message Service specification.
ActiveMQ offers numerous options when it comes to clustering and distribution. ActiveMQ
enjoys strong documentation and
active support. It is highly scalable and handles tens of thousands of messages per second.
ActiveMQ is reliable and delivers high performance. While ActiveMQ offers more features, it is more suited for simple queue service (SQS). When it comes to distributed systems that massively scale up and down, ActiveMQfaces tough competition from newer technologies that deliver better performance and features. ActiveMQ writes messages to a journal before shipping them to consumers. It means the number of messages that it can store depends on the disk capacity. In case of high memory consumption, ActiveMQ pauses producers until the space is freed. In a distributed system wherein producers also act as consumers, the entire system can be locked up.
- ZeroMQ : ZeroMQ is a light weight messaging system that comes with strong documentation and active support. The tool is especially useful for instances wherein low latency and high throughout are required. It offers all advanced capabilities similar to RabbitMQ. However, the downside is that you have to combine various pieces of frameworks to create those solutions. While ZeroMQ offers strong documentation, there is a bit of learning curve.