# Kafka Interview Questions Big Data

Prepare in advance for your Kafka interview with this set of Apache Kafka interview questions and answers compiled by our experts. They will help you crack your Kafka interview and land a good job as an Apache Kafka developer, Big Data developer, and more. The following questions cover the key features of Kafka, how it differs from other messaging frameworks, partitions, brokers and their usage, and related topics. Prepare well and crack your interview with ease and confidence!


## Beginner

Kafka is a messaging framework developed by the Apache Software Foundation. It provides a fault-tolerant, low-latency messaging system in a clustered setup, ensuring end-to-end delivery of messages.

Below are the key points:

• Kafka is a messaging system that provides fault-tolerant capabilities to prevent message loss.
• Designed on the publish-subscribe model.
• Kafka supports both Java and Scala.
• Kafka originated at LinkedIn and became an open-source Apache project in 2011.
• Works seamlessly with Spark and other big data technologies.
• Supports cluster-mode operation.
• The Kafka messaging system can be used in web service or big data architectures.
• Kafka is easy to code and configure compared to other messaging frameworks.

Kafka requires another component, ZooKeeper, to create a cluster and act as a coordination server.

Kafka provides reliable delivery of messages from sender to receiver, and it has several other key features as well.

• Kafka is designed to achieve high-throughput, fault-tolerant messaging.
• Kafka provides built-in partitioning through Topics.
• It also provides replication.
• Kafka provides a queue that can handle high volumes of data and transfer messages from sender to receiver.
• Kafka persists messages on disk and can replicate messages across the cluster.
• Kafka works with ZooKeeper for coordination and synchronization with other services.
• Kafka has good built-in support for Apache Spark.

To utilize all these key features, we need to configure the Kafka cluster properly, along with the ZooKeeper configuration.

Nowadays Kafka is a key messaging framework, not only because of its features and reliable transmission of messages from sender to receiver; the points below should also be considered.

• Reliability − Kafka provides reliable delivery from publisher to subscriber with zero message loss.
• Scalability − Kafka achieves this through clustering along with the ZooKeeper coordination server.
• Durability − By using a distributed commit log, messages persist on disk.
• Performance − Kafka provides high throughput and low latency for both publishing and subscribing applications.

Considering the above features, Kafka is one of the best options in big data technologies for handling large volumes of messages smoothly.

There is a plethora of use cases where Kafka fits into real-world applications; the most frequently used ones are listed below.

• Metrics: Used for monitoring operational data gathered from distributed systems, for analysis or statistical processing.
• Log aggregation: Can be used across an organization to collect logs from multiple services, which are then consumed by downstream services for analytical operations.
• Stream processing: Kafka's strong durability is also very useful in the context of stream processing.
• Asynchronous communication: In microservices, keeping a huge system synchronous is undesirable, because it can render the entire application unresponsive and defeat the purpose of splitting into microservices in the first place. Kafka makes the data flow easier: it is distributed and highly fault-tolerant, and broker nodes are constantly monitored through services like ZooKeeper.
• Chat bots: Chat bots are a popular use case where reliable messaging is required for smooth delivery.
• Multi-tenant solutions: Multi-tenancy is enabled by configuring which topics can produce or consume data. There is also operational support for quotas.

The use cases above predominantly require a Kafka framework; beyond these, there are other cases that depend on the requirements and design.

Let's talk about modern sources of data. Transactional data such as orders, inventory, and shopping carts is being augmented with events such as clicks, likes, recommendations, and searches on a web page. All of this data is deeply important for analyzing consumer behavior, and it can feed a set of predictive analytics engines that can be a differentiator for companies.

• Supports low-latency message delivery.
• Handles real-time traffic.
• Provides fault-tolerance guarantees.
• Easy to integrate with Spark applications to process high volumes of messaging data.
• Able to create a cluster of messaging containers, monitored and supervised by a coordination server such as ZooKeeper.

So, when we need to handle this kind of data volume, Kafka is the tool to solve the problem.

The Kafka architecture comprises the essential components below, which are required to set up the messaging infrastructure:

• Topic
• Broker
• Zookeeper
• Partition
• Producer
• Consumer

Communication between the clients and the servers is done with a simple, high-performance, language-agnostic TCP protocol. This protocol is versioned and maintains backwards compatibility with older versions.

A topic is a logical feed name to which records are published. Topics in Kafka support a multi-subscriber model, so a topic can have zero, one, or many consumers that subscribe to the data written to it.

• A topic is a specific category that holds a stream of messages.
• A topic is split into partitions.
• Each topic has at least one partition.
• Each partition holds messages in an immutable, ordered sequence.
• Each message within a partition has an identifier called an offset.
• A topic has a name, which must be unique across the cluster.
• Producers need a topic to publish a payload.
• Consumers pull the same payload from the topic.
• For every topic, the cluster maintains a partitioned commit log.

Every partition is an ordered, immutable sequence of records that is continuously appended to: a structured commit log. The Kafka cluster durably persists all published records, whether or not they have been consumed, using a configurable retention period.

A Kafka topic is divided into partitions, which contain messages in an immutable sequence.

• A partition is a logical grouping of data.
• Partitions allow you to parallelize a topic by splitting its data across multiple brokers.
• A topic consists of one or more partitions.
• Each message within a partition has an identifier called an offset.
• Each partition can be placed on a separate machine, allowing multiple consumers to read a topic in parallel.

The offset is a unique identifier of a record within a partition. It denotes the position of the consumer in the partition. Consumers can read messages starting from a specific offset and can read from any offset point they choose.

• Each message in a partition has a unique sequence id called an offset.
• Offsets are maintained per partition.
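The relationship between topics, partitions, and offsets can be sketched with a small in-memory model. This is purely illustrative (not how Kafka is implemented): each partition behaves as an append-only log, and each appended message receives the next sequential offset in that partition.

```python
class Partition:
    def __init__(self):
        self.log = []            # append-only list of messages

    def append(self, message):
        offset = len(self.log)   # next sequential offset in this partition
        self.log.append(message)
        return offset

class Topic:
    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [Partition() for _ in range(num_partitions)]

topic = Topic("clicks", num_partitions=2)
print(topic.partitions[0].append("m1"))  # 0
print(topic.partitions[0].append("m2"))  # 1
print(topic.partitions[1].append("m3"))  # 0 -- offsets are per partition, not per topic
```

Note how the third message gets offset 0 again: offsets identify a position within one partition only, which is why ordering is guaranteed only inside a partition.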

A topic can have multiple partition logs, as with a click-stream topic split across several partitions. This allows multiple consumers to read from a topic in parallel.

• Brokers are the systems responsible for maintaining the published data.
• Each broker may have one or more partitions.
• Kafka uses multiple brokers to balance the load.
• Kafka brokers are stateless.
• E.g.: if there are N partitions in a topic and N brokers in the cluster, then each broker has one partition.
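The example in the last bullet can be sketched as a simple round-robin placement of partition leaders on brokers. This is a simplification for illustration only: real Kafka placement also accounts for replicas and rack awareness.

```python
def assign_partitions(num_partitions, brokers):
    """Round-robin sketch of spreading a topic's partitions across brokers.
    Shows leader placement only; replica placement is ignored here."""
    return {p: brokers[p % len(brokers)] for p in range(num_partitions)}

# 4 partitions, 4 brokers: each broker leads exactly one partition
print(assign_partitions(4, ["broker-1", "broker-2", "broker-3", "broker-4"]))

# 6 partitions, 4 brokers: two brokers lead two partitions each
print(assign_partitions(6, ["broker-1", "broker-2", "broker-3", "broker-4"]))
```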

## Intermediate

The answer to this question encompasses two main aspects – Partitions in a topic and Consumer Groups.

A Kafka topic is divided into partitions. The message sent by the producer is distributed among the topic’s partitions based on the message key. Here we can assume that the key is such that messages would get equally distributed among the partitions.

Consumer Group is a way to bunch together consumers so as to increase the throughput of the consumer application. Each consumer in a group latches onto one or more partitions in the topic: if there are 4 partitions in the topic and 4 consumers in the group, then each consumer reads from a single partition. However, if there are 6 partitions and 4 consumers, then the 6 partitions are shared among the 4 consumers, so some consumers read from more than one partition. Hence it is ideal to maintain a 1-to-1 mapping of partitions to consumers in the group.

Now in order to scale up processing at the consumer end, two things can be done:

1. No of partitions in the topic can be increased (say from existing 1 to 4).
2. A consumer group can be created with 4 instances of the consumer attached to it.

Doing this helps read data from the topic in parallel and hence scales up the consumer side, for example from 2,500 messages/sec to 10,000 messages/sec.
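The partition-to-consumer mapping described above can be sketched as a simple round-robin assignment. This is an illustration only; Kafka's actual assignors (range, round-robin, sticky) differ in details.

```python
def assign_to_group(partitions, consumers):
    """Sketch of partition -> consumer assignment within one consumer group.
    With more partitions than consumers, some consumers get several
    partitions; with fewer partitions, some consumers sit idle."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 6 partitions, 4 consumers: all 6 partitions are read, but two consumers
# handle two partitions each, so parallelism is capped at 4
print(assign_to_group(list(range(6)), ["c1", "c2", "c3", "c4"]))
```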

Dumb broker/smart consumer implies that the broker does not attempt to track which messages have been read by each consumer and retain only the unread ones; rather, the broker retains all messages for a set amount of time, and consumers are responsible for tracking which messages they have read.

Apache Kafka employs exactly this model: the broker does the work of storing messages for a set time (7 days by default), while consumers are responsible for keeping track of which messages they have read using offsets.

The opposite of this is the Smart Broker/Dumb Consumer model wherein the broker is focused on the consistent delivery of messages to consumers. In such a case, consumers are dumb and consume at a roughly similar pace as the broker keeps track of consumer state. This model is followed by RabbitMQ.

Kafka is a distributed system wherein data is stored across multiple nodes in the cluster. There is a high probability that one or more nodes in the cluster might fail. Fault tolerance means that the data in the system is protected and available even when some of the nodes in the cluster fail.

One of the ways in which Kafka provides fault tolerance is by making copies of the partitions. The default replication factor is 3, which means that for every partition in a topic, two additional copies are maintained. If one of the brokers fails, data can be fetched from a replica. This way Kafka can withstand N-1 broker failures, N being the replication factor.

Kafka also follows the leader-follower model. For every partition, one broker is elected as the leader while the others are designated as followers. The leader is responsible for interacting with producers and consumers. If the leader node goes down, one of the remaining followers is elected as the new leader.

Kafka also maintains a list of in-sync replicas (ISR). Say the replication factor is 3: there will be one leader partition and two follower partitions, but the followers may not all be in sync with the leader. The ISR is the list of replicas that are currently in sync with the leader.

As we already know, a Kafka topic is divided into partitions. The data inside each partition is ordered and can be accessed using an offset. An offset is the position within a partition of the next message to be read by the consumer. There are two types of offsets maintained by Kafka:

Current Offset

1. It is a pointer to the last record that Kafka has sent to the consumer in the most recent poll. This offset ensures that the consumer does not receive the same record twice.

Committed Offset

1. It is a pointer to the last record that a consumer has successfully processed. It plays an important role during partition rebalancing: when a new consumer gets assigned to a partition, it can use the committed offset to determine where to start reading records.

There are two ways to commit an offset:

1. Auto-commit: Enabled by default; it can be turned off by setting the property enable.auto.commit to false. Though convenient, it might cause duplicate records to be processed.
2. Manual-commit: This implies that auto-commit has been turned off and offset will be manually committed when the record has been processed.
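The interplay between the current and committed offsets can be sketched with a toy model (illustrative only, not the real client). If a consumer crashes after processing records but before committing, the replacement consumer resumes from the committed offset and reprocesses some records, which is why auto-commit can cause duplicates.

```python
class ConsumerPosition:
    def __init__(self):
        self.current = 0     # next record the broker will hand out (current offset)
        self.committed = 0   # restart point after a crash/rebalance (committed offset)

    def poll(self, n):
        records = list(range(self.current, self.current + n))
        self.current += n
        return records

    def commit(self):
        self.committed = self.current

    def restart(self):
        self.current = self.committed  # new consumer resumes from last commit

pos = ConsumerPosition()
pos.poll(5)         # processes records 0..4
pos.commit()        # committed offset = 5
pos.poll(3)         # processes 5..7, but crashes before committing
pos.restart()       # rebalance: reassigned consumer resumes at offset 5
print(pos.poll(3))  # [5, 6, 7] -- records 5..7 are delivered (and processed) again
```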

Prior to Kafka v0.9, ZooKeeper was used to store topic offsets; from v0.9 onwards, offset information for a topic's partitions is stored in an internal topic called __consumer_offsets.

An ack or acknowledgment is sent by a broker to the producer to acknowledge receipt of the message. Ack level can be set as a configuration parameter in the Producer and it defines the number of acknowledgments the producer requires the leader to have received before considering a request complete. The following settings are allowed:

• acks=0

In this case, the producer doesn't wait for any acknowledgment from the broker. No guarantee can be made that the broker has received the record.

• acks=1

In this case, the leader writes the record to its local log and responds without waiting for acknowledgment from all its followers. The record can be lost if the leader fails just after acknowledging it but before the followers have replicated it.

• acks=all

In this case, the leader waits for the entire set of in-sync replicas to acknowledge the record. This ensures that the record is not lost as long as at least one replica remains alive, and provides the strongest possible guarantee. However, it also considerably lowers throughput, as the leader must wait for all followers to acknowledge before responding.

acks=1 is usually the preferred way of sending records, as it ensures receipt by the leader, giving good durability while maintaining high throughput. For the highest throughput set acks=0; for the highest durability set acks=all.
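The durability trade-off can be made concrete with a toy simulation (not real Kafka code): under acks=1 the leader acknowledges before followers replicate, so a leader crash in that window loses an already-acknowledged record, while acks=all replicates to the full in-sync set before acknowledging.

```python
def send(acks, leader_dies_before_replication):
    """Simulate one produce with two followers; returns (acked, surviving_log)."""
    leader, followers = [], [[], []]
    leader.append("m1")                 # leader writes the record to its log
    if acks == "all":
        for f in followers:             # wait for the full in-sync replica set
            f.append("m1")
    else:                               # acks=1: ack right after the leader write
        if not leader_dies_before_replication:
            for f in followers:
                f.append("m1")
    acked = True                        # producer sees a successful ack
    if leader_dies_before_replication:
        return acked, followers[0]      # a follower is elected the new leader
    return acked, leader

print(send("1", leader_dies_before_replication=True))    # (True, [])     acked but lost
print(send("all", leader_dies_before_replication=True))  # (True, ['m1']) survives
```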

• A Kafka cluster is a group of more than one broker.
• A Kafka cluster has zero downtime when it is expanded.
• The cluster manages the persistence and replication of message data.
• The cluster offers strong durability due to its cluster-centric design.
• In a Kafka cluster, one of the brokers serves as the controller, which is responsible for managing the states of partitions and replicas and for performing administrative tasks like reassigning partitions.

A producer is a client that sends or publishes records. Producer applications write data to topics and consumer applications read from topics.

• A producer publishes messages to one or more Kafka topics.
• A producer sends data to the broker service.
• Whenever a producer publishes a message, the broker simply appends it to the last segment of the partition.
• A producer can also send messages to a topic of its choosing.

Messages sent by a producer to a topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.

A consumer is a subscriber that consumes messages, which are stored in partitions. A consumer is a separate process and can be a separate application altogether, running on an individual machine.

• A consumer can subscribe to one or more topics.
• A consumer maintains a counter of consumed messages: the offset value.
• If a consumer acknowledges a specific message offset, that implies it has consumed all previous messages.
• A consumer issues asynchronous pull requests to the broker, which keeps bytes ready for consumption.
• Consumer offset values were historically tracked with the help of ZooKeeper.

If all the consumers fall into the same consumer group, the messages are load-balanced over the consumer instances; if the consumer instances fall into different groups, each message is broadcast to every consumer group.

The working principle of Kafka follows the order below.

• Producers send messages to a topic at regular intervals.
• The broker stores the messages in the partitions configured for that topic.
• Kafka ensures that if a producer publishes two messages, both messages are received by the consumer.
• The consumer pulls messages from the allocated topic.
• Once the consumer digests a message, Kafka pushes the offset value to ZooKeeper.
• The consumer continuously signals Kafka, approximately every 100 ms, waiting for messages.
• The consumer sends an acknowledgment when a message is received.
• When Kafka receives an acknowledgment, it updates the offset to the new value and sends it to ZooKeeper. ZooKeeper maintains this offset value so that the consumer can read the next message correctly even during server outages.
• This flow keeps repeating as long as the request is live.

Apart from other benefits, below are the key advantages of using Kafka messaging framework.

• Low Latency.
• High throughput.
• Fault tolerant.
• Durability.
• Scalability.
• Support for real time streaming
• High concurrency.
• Message broker capabilities.
• Persistent capability.

Considering all the above advantages, Kafka is one of the most popular frameworks used in microservice, big data, enterprise integration, and publish-subscribe architectures.

Despite these advantages, setting up and configuring the Kafka ecosystem is a bit difficult and requires good knowledge to implement. Some limitations are listed below.

• Lack of a complete set of monitoring tools.
• No wildcard option for selecting topics.
• Cluster coordination requires a third-party service, ZooKeeper.
• Deep understanding is needed to handle Kafka's cluster-based infrastructure along with ZooKeeper.

ZooKeeper is a distributed, open-source configuration and synchronization service, along with a naming registry, for distributed applications.

ZooKeeper is a separate component and is not mandatory in every Kafka deployment; however, when we need to implement a cluster, we have to set it up as the coordination server. ZooKeeper is used for:

• Electing a controller
• Cluster membership management
• Topic configuration
• Quotas
• Access control (who is allowed to read from and write to a topic)

ZooKeeper plays a significant role in cluster management, such as fault tolerance: when a broker goes down, it helps coordinate promoting replicas on other brokers.

```
groupId = org.apache.spark
artifactId = spark-streaming-kafka-0-10_2.11
version = 2.2.0

groupId = org.apache.zookeeper
artifactId = zookeeper
version = 3.4.5
```

These dependencies come with transitive (child) dependencies, which are downloaded and added to the application as part of the parent dependency.

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
```

Kafka follows a pub-sub mechanism wherein a producer writes to a topic and one or more consumers read from that topic. However, reads in Kafka always lag behind writes, as there is always some delay between the moment a message is written and the moment it is consumed. This delta between the latest offset and the consumer offset is called consumer lag.

There are various open-source tools available to measure consumer lag, e.g. LinkedIn's Burrow. Confluent Kafka also comes with out-of-the-box tools to measure lag.
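The lag computation itself is simple arithmetic per partition, as the following sketch shows (partition numbers and offsets here are made up for illustration):

```python
def consumer_lag(latest_offsets, committed_offsets):
    """Per-partition lag = latest (log-end) offset minus the consumer's offset."""
    return {p: latest_offsets[p] - committed_offsets.get(p, 0)
            for p in latest_offsets}

# producer has written up to offsets 120 and 80; consumer has read to 100 and 80
print(consumer_lag({0: 120, 1: 80}, {0: 100, 1: 80}))  # {0: 20, 1: 0}
```

A lag of 0 on partition 1 means the consumer is fully caught up there; a steadily growing lag usually signals that consumers cannot keep pace with producers.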

With the Kafka messaging system, three different types of delivery semantics can be achieved.

At most once: Wherein a messaging system will never duplicate a message but might miss out on some messages occasionally.

At least once: Wherein a messaging system will never miss a message but might duplicate some messages occasionally.

Exactly once: Wherein it will deliver all the messages without any duplication.

Kafka transactions help achieve exactly-once semantics between Kafka brokers and clients. In order to achieve this we need to set the following properties at the producer end: enable.idempotence=true and transactional.id=<some unique id>. We also need to call initTransactions to prepare the producer to use transactions. With these properties set, if the producer (identified by its producer id) accidentally sends the same message to Kafka more than once, the Kafka broker detects and de-duplicates it.
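The de-duplication idea can be sketched as follows. This is a simplified model, not the broker's actual implementation: the broker remembers the highest sequence number appended per producer id and discards retries that reuse an already-appended sequence.

```python
class Broker:
    def __init__(self):
        self.log = []
        self.last_seq = {}   # producer_id -> highest sequence number appended

    def append(self, producer_id, seq, message):
        if self.last_seq.get(producer_id, -1) >= seq:
            return False     # duplicate retry: already appended, so discard
        self.log.append(message)
        self.last_seq[producer_id] = seq
        return True

b = Broker()
print(b.append("p1", 0, "order-42"))  # True  -- first delivery is appended
print(b.append("p1", 0, "order-42"))  # False -- network retry is de-duplicated
print(len(b.log))                     # 1
```

This is why an idempotent producer can safely retry on timeouts without producing duplicates in the log.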

Kafka is a durable, distributed and scalable messaging system designed to support high volume transactions. Use cases that require a publish-subscribe mechanism at high throughput are a good fit for Kafka. In case you need a point to point or request/reply type communication then other messaging queues like RabbitMQ can be considered.

Kafka is a good fit for real-time stream processing. It uses a dumb broker/smart consumer model, with the broker merely acting as a message store. So a scenario wherein the consumer cannot be smart and requires the broker to be smart instead is not a good fit for Kafka. In such a case, RabbitMQ can be used, which uses a smart broker model with the broker responsible for consistent delivery of messages at a roughly similar pace.

Also in cases where protocols like AMQP, MQTT, and features like message routing are needed, in those cases, RabbitMQ is a better alternative over Kafka.

A producer publishes messages to one or more Kafka topics. The message contains information about which topic and partition it should be published to.

There are three different types of producer APIs –

1. Fire and forget – The simplest approach: it involves calling the send() method of the producer API to send the message to the broker. In this case, the application doesn't care whether the message is successfully received by the broker or not.
2. Synchronous producer – In this method, the calling application waits until it gets a response. In the case of success, we get a RecordMetadata object, and in the event of failure, we get an exception. However, note that this will limit your throughput because you are waiting for every message to get acknowledged.
3. Asynchronous producer – A better and faster way of sending messages to Kafka, this involves providing a callback function to receive the acknowledgment. The application doesn’t wait for success/failure and the callback function is invoked when the message is successfully acknowledged or in case of a failure.
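The three styles can be contrasted with a minimal sketch. The broker_send function below is a stand-in, it fakes the broker returning record metadata; real Kafka clients return a future from send() instead.

```python
import threading

def broker_send(message):
    # stand-in for the broker acknowledging a record with its metadata
    return {"topic": "orders", "partition": 0, "offset": 7}

def fire_and_forget(message):
    broker_send(message)                 # result is ignored entirely

def send_sync(message):
    return broker_send(message)          # block until metadata (or an exception)

def send_async(message, callback):
    t = threading.Thread(target=lambda: callback(broker_send(message)))
    t.start()
    return t                             # caller continues without waiting

results = []
fire_and_forget("m1")                    # no confirmation either way
results.append(send_sync("m2"))          # waits; throughput is limited
send_async("m3", results.append).join()  # callback fires on acknowledgment
print([r["offset"] for r in results])    # [7, 7]
```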

Kafka messages are key-value pairs. The key is used for partitioning messages being sent to the topic. When writing a message to a topic, the producer has an option to provide the message key. This key determines which partition of the topic the message goes to. If the key is not specified, then the messages are sent to partitions of the topic in round robin fashion.

Note that Kafka orders messages only inside a partition, hence choosing the right partition key is an important factor in application design.
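Partition selection by key can be sketched as below. Note the hash function is an assumption for illustration: the real Java client uses murmur2, and md5 here merely stands in for any stable hash; keyless messages are rotated round-robin (newer clients actually use sticky batching instead).

```python
import hashlib
import itertools

_round_robin = itertools.count()  # rotation for messages without a key

def choose_partition(key, num_partitions):
    if key is None:
        return next(_round_robin) % num_partitions   # no key: spread evenly
    digest = hashlib.md5(key.encode()).digest()      # stable stand-in hash
    return int.from_bytes(digest[:4], "big") % num_partitions

# the same key always lands on the same partition, preserving per-key order
p1 = choose_partition("user-42", 6)
p2 = choose_partition("user-42", 6)
print(p1 == p2)  # True
```

This is why the key choice matters: all messages for "user-42" share one partition and are therefore totally ordered relative to each other, while messages for different keys may interleave.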

Kafka supports data replication within the cluster to ensure high availability. But enterprises often need data availability guarantees to span the entire cluster and even withstand site failures.

The solution to this is Mirror Maker – a utility that helps replicate data between two Kafka clusters within the same or different data centers.

MirrorMaker is essentially a Kafka consumer and producer hooked together. The origin and destination clusters are completely different entities and can have a different number of partitions and offsets, however, the topic names should be the same between source and a destination cluster. The MirrorMaker process also retains and uses the partition key so that ordering is maintained within the partition.
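Conceptually the mirroring loop looks like the sketch below: consume from the origin, re-produce to a destination topic of the same name, and keep the message key so key-based partition ordering is preserved. The dict-based clusters and the toy partitioner are assumptions purely for illustration.

```python
def mirror(origin_topic, destination_cluster, partitioner):
    """Pipe every (key, value) record into the same-named destination topic,
    re-partitioning by the retained key."""
    for key, value in origin_topic["messages"]:
        part = partitioner(key)  # same key -> same partition rule at destination
        dest = destination_cluster.setdefault(origin_topic["name"], {})
        dest.setdefault(part, []).append((key, value))

origin = {"name": "orders", "messages": [("k1", "v1"), ("k2", "v2"), ("k1", "v3")]}
destination = {}
mirror(origin, destination, partitioner=lambda k: int(k[1:]) % 3)  # toy partitioner
print(destination["orders"])  # {1: [('k1', 'v1'), ('k1', 'v3')], 2: [('k2', 'v2')]}
```

Both records for key "k1" land on the same destination partition in their original order, even though the destination cluster's partition layout is independent of the origin's.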

## Description

Apache Kafka is an open-source stream-processing software program developed by Linkedin and donated to the Apache Software Foundation.

The increase in popularity of Apache Kafka has led to an extensive increase in demand for professionals who are certified in the field of Apache Kafka. It is a highly appealing option for data integration, as it offers unique attributes such as a unified, low-latency, high-throughput platform for handling real-time data feeds. Other features such as scalability, data partitioning, and its ability to handle numerous diverse consumers make it even more desirable for data integration use cases. To mention, Apache Kafka has a market share of about 9.1%. It is the best opportunity to move ahead in your career.

There are many companies who use Apache Kafka. According to cwiki.apache.org, the top companies that use Kafka are LinkedIn, Yahoo, Twitter, Netflix, etc.

According to indeed.com, the average salary for an Apache Kafka architect ranges from $101,298 per year for a Senior Technical Lead to $148,718 per year for an Enterprise Architect.

With a lot of research, we have brought you a few apache kafka interview questions that you might encounter in your upcoming interview. These apache kafka interview questions and answers for experienced and freshers alone will help you crack the apache kafka interview and give you an edge over your competitors. So, in order to succeed in the interview, you need to read, re-read and practice these apache kafka interview questions as much as possible.

If you wish to make a career and have Apache Kafka interviews lined up, then you need not fret. Take a look at the set of Apache Kafka interview questions assembled by experts. These kafka interview questions for experienced as well as freshers with detailed answers will guide you in a whole new manner to crack the Apache Kafka interviews. Stay focused on the essential interview questions on Kafka and prepare well to get acquainted with the types of questions that you may come across in your interview on Apache Kafka.

Hope these Kafka Interview Questions will help you to crack the interview. All the best!