Prepare in advance for your Kafka interview with the best possible Apache Kafka interview questions and answers compiled by our experts that will help you crack your Kafka interview and land a good job as an Apache Kafka Developer, Big Data Developer, etc. The following Apache Kafka interview questions discuss the key features of Kafka, how it differs from other messaging frameworks, partitions, broker and its usage, etc. Prepare well and crack your interview with ease and confidence!
Kafka is a messaging framework developed by apache foundation, which is to create the create the messaging system along with can provide fault tolerant cluster along with the low latency system, to ensure end to end delivery.
Below are the bullet points:
Kafka required other component such as the zookeeper to create a cluster and act as a coordination server
Kafka provide a reliable delivery for messages from sender to receiver apart from that it has other key features as well.
To utilize all this key feature, we need to configure the Kafka cluster properly along with the zookeeper configuration.
Now a days kafka is a key messaging framework, not because of its features even for reliable transmission of messages from sender to receiver, however, below are the key points which should consider.
Considering the above features Kafka is one of the best options to use in Bigdata Technologies to handle the large volume of messages for a smooth delivery.
There is plethora of use case, where Kafka fit into the real work application, however I listed below are the real work use case which is frequently using.
Above are the use case where predominately require a Kafka framework, apart from that there are other cases which depends upon the requirement and design.
Let’s talk about some modern source of data now a days which is a data—transactional data such as orders, inventory, and shopping carts — is being augmented with things such as clicking, likes, recommendations and searches on a web page. All this data is deeply important to analyze the consumers behaviors, and it can feed a set of predictive analytics engines that can be the differentiator for companies.
So, when we need to handle this kind of volume of data, we need Kafka to solve this problem.
Kafka process diagram comprises the below essential component which is require to setup the messaging infrastructure.
Communication between the clients and the servers is done with a simple, high-performance, language agnostic TCP protocol. This protocol is versioned and maintains backwards compatibility with older version
Topic is a logical feed name to which records are published. Topics in Kafka supports multi-subscriber model, so that topic can have zero, one, or many consumers that subscribe to the data written to it.
Every partition has an ordered and immutable sequence of records which is continuously appended to—a structured commit log. The Kafka cluster durably persists all published records—whether they have been consumed—using a configurable retention period.
Kafka topic is shared into the partitions, which contains messages in an unmodifiable sequence.
The offset is a unique identifier of a record within a partition. It denotes the position of the consumer in the partition. Consumers can read messages starting from a specific offset and can read from any offset point they choose.
Topic can also have multiple partition logs like the click-topic has in the image to the right. This allows for multiple consumers to read from a topic in parallel.
The answer to this question encompasses two main aspects – Partitions in a topic and Consumer Groups.
A Kafka topic is divided into partitions. The message sent by the producer is distributed among the topic’s partitions based on the message key. Here we can assume that the key is such that messages would get equally distributed among the partitions.
Consumer Group is a way to bunch together consumers so as to increase the throughput of the consumer application. Each consumer in a group latches to a partition in the topic. i.e. if there are 4 partitions in the topic and 4 consumers in the group then each consumer would read from a single partition. However, if there are 6 partitions and 4 consumers, then the data would be read in parallel from 4 partitions only. Hence its ideal to maintain a 1 to 1 mapping of partition to the consumer in the group.
Now in order to scale up processing at the consumer end, two things can be done:
Doing this would help read data from the topic in parallel and hence scale up the consumer from 2500 messages/sec to 10000 messages per second.
Dumb broker/Smart producer implies that the broker does not attempt to track which messages have been read by each consumer and only retain unread messages; rather, the broker retains all messages for a set amount of time, and consumers are responsible to track what all messages have been read.
Apache Kafka employs this model only wherein the broker does the work of storing messages for a time (7 days by default), while consumers are responsible for keeping track of what all messages they have read using offsets.
The opposite of this is the Smart Broker/Dumb Consumer model wherein the broker is focused on the consistent delivery of messages to consumers. In such a case, consumers are dumb and consume at a roughly similar pace as the broker keeps track of consumer state. This model is followed by RabbitMQ.
Kafka is a distributed system wherein data is stored across multiple nodes in the cluster. There is a high probability that one or more nodes in the cluster might fail. Fault tolerance means that the data is the system is protected and available even when some of the nodes in the cluster fail.
One of the ways in which Kafka provides fault tolerance is by making a copy of the partitions. The default replication factor is 3 which means for every partition in a topic, two copies are maintained. In case one of the broker fails, data can be fetched from its replica. This way Kafka can withstand N-1 failures, N being the replication factor.
Kafka also follows the leader-follower model. For every partition, one broker is elected as the leader while others are designated, followers. A leader is responsible for interacting with the producer/consumer. If the leader node goes down, then one of the remaining followers is elected as a leader.
Kafka also maintains a list of In Sync replicas. Say the replication factor is 3. That means there will be a leader partition and two follower partitions. However, the followers may not be in sync with the leader. The ISR shows the list of replicas that are in sync with the leader.
As we already know, a Kafka topic is divided into partitions. The data inside each partition is ordered and can be accessed using an offset. Offset is a position within a partition for the next message to be sent by the consumer. There are two types of offsets maintained by Kafka:
There are two ways to commit an offset:
Prior to Kafka v0.9, Zookeeper was being used to store topic offset, however from v0.9 onwards, the information regarding offset on a topic’s partition is stored on a topic called _consumer_offsets.
An ack or acknowledgment is sent by a broker to the producer to acknowledge receipt of the message. Ack level can be set as a configuration parameter in the Producer and it defines the number of acknowledgments the producer requires the leader to have received before considering a request complete. The following settings are allowed:
In this case, the producer doesn’t wait for any acknowledgment from the broker. No guarantee can be that the broker has received the record.
In this case, the leader writes the record to its local log file and responds back without waiting for acknowledgment from all its followers. In this case, the message can get lost only if the leader fails just after acknowledging the record but before the followers have replicated it, then the record would be lost.
In this case, a set leader waits for all entire sets of in sync replicas to acknowledge the record. This ensures that the record does not get lost as long as one replica is alive and provides the strongest possible guarantee. However it also considerably lessens the throughput as a leader must wait for all followers to acknowledge before responding back.
acks=1 is usually the preferred way of sending records as it ensures receipt of record by a leader, thereby ensuring high durability and at the same time ensures high throughput as well. For highest throughput set acks=0 and for highest durability set acks=all.
Producer is a client who send or publish the record. Producer applications write data to topics and consumer applications read from topics.
Messages sent by a producer to a topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
Consumer is a subscriber who consume the messages which predominantly stores in a partition. Consumer is a separate process and can be separate application altogether which run in individual machine.
If all the consumer falls into the same consumer group, then by using load balancer the message will be distributed over the consumer instances, if consumer instances falls in different group, than each message will be broadcast to all consumer group.
The working principle of Kafka follows the below order.
Apart from other benefits, below are the key advantages of using Kafka messaging framework.
Considering all the above advantages, Kafka is one of the most popular frameworks utilize in Micro service architecture, Big Data architecture, Enterprise Integration architecture, publish-subscribe architecture.
Considering the advantages, to setup and configure the Kafka ecosystem is bit difficult and one needs a good knowledge to implement, apart from that I listed some more use case.
Zookeeper is a distributed open source configuration, synchronization service along with the naming registry for distributed application.
Zookeeper is a separate component, which is not a mandatory component to implement with Kafka, however when we need to implement cluster, we have to setup as a coordination server.
Zookeeper plays a significant role when it comes to cluster management like fault tolerant and identify when one cluster down its replicate the messages to other cluster.
groupId = org.apache.spark artifactId = spark-streaming-kafka-0-10_2.11 version = 2.2.0 groupId = org.apache.zookeeper artifactId = zookeeper version=3.4.5
This dependency comes with child dependency which will download and add to the application as a part of parent dependency.
Kafka follows a pub-sub mechanism wherein producer writes to a topic and one or more consumers read from that topic. However, Reads in Kafka always lag behind Writes as there is always some delay between the moment a message is written and the moment it is consumed. This delta between Latest Offset and Consumer Offset is called Consumer Lag
There are various open source tools available to measure consumer lag e.g. LinkedIn Burrow. Confluent Kafka comes with out of the box tools to measure lag.
With Kafka messaging system, three different types of semantics can be achieved.
At max once: Wherein a messaging system will never duplicate a message but might miss out on some messages occasionally.
At least once: Wherein a messaging system will never miss a message but might duplicate some messages occasionally.
Exactly once: Where in it will deliver all the messages without any duplication.
Kafka transactions help achieve exactly once semantic between Kafka brokers and clients. In order to achieve this we need to set below properties at producer end – enable.idempotence=true and transactional.id=<some unique id>. We also need to call initTransaction to prepare the producer to use transactions. With these properties set, if the producer (characterized by producer id> accidentally sends the same message to Kafka more than once, then Kafka broker detects and de-duplicates it.
Kafka is a durable, distributed and scalable messaging system designed to support high volume transactions. Use cases that require a publish-subscribe mechanism at high throughput are a good fit for Kafka. In case you need a point to point or request/reply type communication then other messaging queues like RabbitMQ can be considered.
Kafka is a good fit for real-time stream processing. It uses a dumb broker smart consumer model with the broker merely acting as a message store. So a scenario wherein the consumer cannot be smart and requires a broker to smart instead is not a good fit for Kafka. In such a case, RabbitMQ can be used which uses a smart broker model with the broker responsible for consistent delivery of messages at a roughly similar pace.
Also in cases where protocols like AMQP, MQTT, and features like message routing are needed, in those cases, RabbitMQ is a better alternative over Kafka.
A producer publishes messages to one or more Kafka topics. The message contains information related to what topic and partition should the message be published to.
There are three different types of producer APIs –
Kafka messages are key-value pairs. The key is used for partitioning messages being sent to the topic. When writing a message to a topic, the producer has an option to provide the message key. This key determines which partition of the topic the message goes to. If the key is not specified, then the messages are sent to partitions of the topic in round robin fashion.
Note that Kafka orders messages only inside a partition, hence choosing the right partition key is an important factor in application design.
Kafka supports data replication within the cluster to ensure high availability. But enterprises often need data availability guarantees to span the entire cluster and even withstand site failures.
The solution to this is Mirror Maker – a utility that helps replicate data between two Kafka clusters within the same or different data centers.
MirrorMaker is essentially a Kafka consumer and producer hooked together. The origin and destination clusters are completely different entities and can have a different number of partitions and offsets, however, the topic names should be the same between source and a destination cluster. The MirrorMaker process also retains and uses the partition key so that ordering is maintained within the partition.
Apache Kafka is an open-source stream-processing software program developed by Linkedin and donated to the Apache Software Foundation.
The increase in popularity of Apache Kafka has led to an extensive increase in demand for professionals who are certified in the field of Apache Kafka. It is a highly appealing option for data integration as it contributes various unique attributes like unifies, low-latency, high-throughput platform to handle real-time data feeds. Other features such as scalability, low latency, data partitioning and its ability to handle numerous diverse consumers makes it more desirable for cases related to data integration. To mention, Apache Kafka has a market share of about 9.1%. It is the best opportunity to move ahead in your career.
There are many companies who use Apache Kafka. According to cwiki.apache.org, the top companies that use Kafka are LinkedIn, Yahoo, Twitter, Netflix, etc.
According to indeed.com, the average salary for apache kafka architect for Senior Technical Lead ranges from $101,298 per year to $148,718 per year for Enterprise Architect.
With a lot of research, we have brought you a few apache kafka interview questions that you might encounter in your upcoming interview. These apache kafka interview questions and answers for experienced and freshers alone will help you crack the apache kafka interview and give you an edge over your competitors. So, in order to succeed in the interview, you need to read, re-read and practice these apache kafka interview questions as much as possible.
If you wish to make a career and have Apache Kafka interviews lined up, then you need not fret. Take a look at the set of Apache Kafka interview questions assembled by experts. These kafka interview questions for experienced as well as freshers with detailed answers will guide you in a whole new manner to crack the Apache Kafka interviews. Stay focused on the essential interview questions on Kafka and prepare well to get acquainted with the types of questions that you may come across in your interview on Apache Kafka.
Hope these Kafka Interview Questions will help you to crack the interview. All the best!