# Kafka Interview Questions Big Data

Prepare in advance for your Kafka interview with this set of Apache Kafka interview questions and answers compiled by our experts. They will help you crack your Kafka interview and land a good job as an Apache Kafka developer, big data developer, and so on. The following Apache Kafka interview questions discuss the key features of Kafka, how it differs from other messaging frameworks, partitions, brokers and their usage, and more. Prepare well and crack your interview with ease and confidence!


## Beginner

Kafka is a messaging framework developed by the Apache Software Foundation. It provides a fault-tolerant, low-latency messaging cluster that ensures end-to-end delivery.

Key points:

• Kafka is a messaging system that provides fault tolerance to prevent message loss.
• It is designed on the publish-subscribe model.
• Kafka supports both Java and Scala.
• Kafka originated at LinkedIn and became an open-source Apache project in 2011.
• It works seamlessly with Spark and other big data technologies.
• It supports cluster-mode operation.
• The Kafka messaging system can be used in web service or big data architectures.
• Kafka is easy to code and configure compared to other messaging frameworks.

Kafka requires an additional component, ZooKeeper, to form a cluster; ZooKeeper acts as the coordination server.

Kafka provides reliable delivery of messages from sender to receiver, and it has other key features as well:

• Kafka is designed for high-throughput, fault-tolerant messaging services.
• Kafka provides built-in partitioning through Topics.
• It also provides replication.
• Kafka provides a queue that can handle high volumes of data and transfer messages from sender to receiver.
• Kafka persists messages on disk and can replicate them across the cluster.
• Kafka works with ZooKeeper for coordination and synchronization with other services.
• Kafka has good built-in support for Apache Spark.

To utilize all these key features, the Kafka cluster must be configured properly, along with the ZooKeeper configuration.

Nowadays Kafka is a key messaging framework, not only because of its features but also for its reliable transmission of messages from sender to receiver. The key points to consider are:

• Reliability − Kafka provides reliable delivery from publisher to subscriber with zero message loss.
• Scalability − Kafka achieves this through clustering, coordinated by ZooKeeper.
• Durability − Using a distributed commit log, messages persist on disk.
• Performance − Kafka provides high throughput and low latency for both publishing and subscribing applications.

Considering the above features, Kafka is one of the best options in big data technologies for handling large volumes of messages smoothly.

There is a plethora of use cases where Kafka fits into real-world applications; the most frequently used ones are listed below.

• Metrics: monitoring operational data gathered from distributed systems, for analysis and statistical processing.
• Log aggregation: collecting logs from multiple services across an organization, which consumer services then process analytically.
• Stream processing: Kafka's strong durability is also very useful in the context of stream processing.
• Asynchronous communication: in microservices, keeping a huge system synchronous is undesirable, because it can render the entire application unresponsive and defeat the purpose of splitting into microservices in the first place. Kafka makes the whole data flow easier: it is distributed, highly fault tolerant, and its broker nodes are constantly monitored through services like ZooKeeper.
• Chat bots: a popular use case wherever reliable messaging is required for smooth delivery.
• Multi-tenant solutions: multi-tenancy is enabled by configuring which topics can produce or consume data, with operational support for quotas.

The use cases above predominantly require a Kafka framework; beyond these, other cases depend on the requirements and design.

Let’s talk about modern sources of data: transactional data such as orders, inventory, and shopping carts is being augmented with events such as clicks, likes, recommendations, and searches on a web page. All this data is deeply important for analyzing consumer behavior, and it can feed a set of predictive analytics engines that can be the differentiator for companies. Kafka fits this workload because it can:

• Support low-latency message delivery.
• Handle real-time traffic.
• Assure fault tolerance.
• Integrate easily with Spark applications to process high volumes of messaging data.
• Create a cluster of messaging brokers, monitored and supervised by a coordination server such as ZooKeeper.

So, when we need to handle this kind of data volume, Kafka is the tool to solve the problem.

The Kafka process diagram comprises the following essential components, which are required to set up the messaging infrastructure:

• Topic
• Broker
• ZooKeeper
• Partition
• Producer
• Consumer

Communication between clients and servers is done with a simple, high-performance, language-agnostic TCP protocol. This protocol is versioned and maintains backwards compatibility with older versions.

A topic is a logical feed name to which records are published. Topics in Kafka support a multi-subscriber model, so a topic can have zero, one, or many consumers that subscribe to the data written to it.

• A topic is a specific category that holds a stream of messages.
• A topic is split into partitions.
• Each topic has at least one partition.
• Each partition contains messages (payloads) in an unmodifiable, ordered sequence.
• Each message within a partition has an identifier called an offset.
• A topic has a name, which must be unique across the cluster.
• Producers need a topic to publish payloads.
• Consumers pull those payloads from the topic.
• For every topic, the cluster maintains a log as described below.

Every partition is an ordered, immutable sequence of records that is continuously appended to: a structured commit log. The Kafka cluster durably persists all published records, whether or not they have been consumed, using a configurable retention period.
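The commit-log behaviour described above can be sketched in a few lines of Python. This is an illustrative model only, not Kafka's actual implementation: a partition is treated as an in-memory append-only list, and the list index plays the role of the offset.

```python
class Partition:
    """Toy model of one Kafka partition: an append-only, ordered log."""

    def __init__(self):
        self._log = []  # records are only ever appended, never modified

    def append(self, record):
        """Append a record and return its offset within this partition."""
        self._log.append(record)
        return len(self._log) - 1

    def read_from(self, offset):
        """Return all records at or after the given offset."""
        return self._log[offset:]


p = Partition()
for msg in ["m0", "m1", "m2"]:
    p.append(msg)

print(p.read_from(1))  # records from offset 1 onward
```

Because the log is append-only, a record's offset never changes, which is what lets a consumer resume from any position it has recorded.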

A Kafka topic is divided into partitions, which contain messages in an unmodifiable sequence.

• A partition is a logical grouping of data.
• Partitions allow you to parallelize a topic by splitting its data across multiple brokers.
• A topic groups one or more partitions.
• Each record within a partition has an identifier called an offset.
• Each partition can be placed on a separate machine, allowing multiple consumers to read the topic in parallel.
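To see how records spread across partitions, here is a sketch of key-based partitioning in Python. Kafka's default partitioner uses a murmur2 hash of the record key; the sketch below substitutes `hashlib.md5` purely for illustration, keeping the essential property that the same key always maps to the same partition.

```python
import hashlib


def partition_for(key, num_partitions):
    # Stand-in for Kafka's murmur2-based default partitioner: hash the
    # key deterministically, then take it modulo the partition count so
    # the same key always lands on the same partition.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


assignments = {k: partition_for(k, 3) for k in ["user-1", "user-2", "user-3"]}
print(assignments)
```

Because the mapping is deterministic, all records for one key land in one partition, which preserves per-key ordering.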

The offset is a unique identifier of a record within a partition. It also denotes the position of a consumer in the partition: consumers can read messages starting from a specific offset and can read from any offset point they choose.

• Each record in a partition has a unique sequential ID called the offset.
• Offsets are scoped per partition: each partition maintains its own offset sequence.

A topic can also have multiple partition logs, as in the click-topic example. This allows multiple consumers to read from a topic in parallel.

• Brokers are the servers responsible for maintaining the published data.
• Each broker may host one or more partitions.
• Kafka uses multiple brokers to balance load.
• Kafka brokers are stateless.
• For example, if a topic has N partitions and there are N brokers, each broker hosts one partition.
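The example above (N partitions spread over N brokers) can be illustrated with a simple round-robin assignment in Python. This is a simplification of Kafka's real replica-assignment algorithm, meant only to show how partitions spread across brokers:

```python
def assign_partitions(num_partitions, brokers):
    """Assign partition leaders to brokers round-robin (simplified)."""
    assignment = {b: [] for b in brokers}
    for p in range(num_partitions):
        leader = brokers[p % len(brokers)]
        assignment[leader].append(p)
    return assignment


# With 4 partitions and 4 brokers, each broker leads exactly one partition.
print(assign_partitions(4, ["broker-0", "broker-1", "broker-2", "broker-3"]))
```

With fewer brokers than partitions, the same scheme simply gives some brokers more than one partition, which is how Kafka spreads load.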

• A Kafka cluster is a group of more than one broker.
• A Kafka cluster can be expanded with zero downtime.
• The cluster manages the persistence and replication of message data.
• The cluster offers strong durability due to its cluster-centric design.
• In a Kafka cluster, one of the brokers serves as the controller, which is responsible for managing the states of partitions and replicas and for performing administrative tasks like reassigning partitions.

A producer is a client that sends or publishes records. Producer applications write data to topics, and consumer applications read from topics.

• A producer publishes messages to one or more Kafka topics.
• The producer sends data to the broker service.
• Whenever a producer publishes a message, the broker simply appends it to the last segment of the partition.
• A producer can also send messages to the topic of its choice.

Messages sent by a producer to a topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.

A consumer is a subscriber that consumes messages, which are predominantly stored in partitions. A consumer is a separate process and can be an entirely separate application running on an individual machine.

• A consumer can subscribe to one or more topics.
• The consumer maintains a counter for consumed messages: the offset value.
• When a consumer acknowledges a specific message offset, it implies that it has consumed all previous messages.
• Consumers issue asynchronous pull requests to the broker, which keeps bytes ready for consumption.
• The consumer offset value is tracked with the help of ZooKeeper.

If all consumer instances belong to the same consumer group, messages are load-balanced across those instances; if consumer instances belong to different consumer groups, each message is broadcast to every consumer group.
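These two delivery modes can be simulated with a small Python sketch (a hypothetical helper, not a Kafka API): within a group, a partition's records go to exactly one member, while every group receives every record.

```python
def deliver(partition, groups):
    """groups maps group name -> list of member ids. Returns which member
    of each group receives a record from the given partition (round-robin
    stand-in for Kafka's partition assignment within a group)."""
    return {group: members[partition % len(members)]
            for group, members in groups.items()}


groups = {"billing": ["b-1", "b-2"], "audit": ["a-1"]}
# Every group sees the record; inside a group only one member gets it.
print(deliver(0, groups))
print(deliver(1, groups))
```

Note how the single-member "audit" group receives every record (broadcast across groups), while records are split between the two "billing" members (load balancing within a group).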

The working principle of Kafka follows the order below.

• Producers send messages to a topic at regular intervals.
• The broker stores the messages in the partitions configured for that topic.
• Kafka ensures that if a producer publishes two messages, both are accepted by the consumer.
• The consumer pulls messages from its allocated topic.
• Once the consumer has digested a message, Kafka pushes the offset value to ZooKeeper.
• The consumer continuously signals Kafka (approximately every 100 ms), waiting for messages.
• The consumer sends an acknowledgement when a message is received.
• When Kafka receives the acknowledgement, it updates the offset to the new value and sends it to ZooKeeper. ZooKeeper maintains this offset value so that the consumer can read the next message correctly even during server outages.
• This flow keeps repeating as long as the request is live.
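The steps above can be sketched as a toy poll-process-commit loop in Python. This is a simulation of the cycle, not the real consumer API; the `committed` variable stands in for the offset that Kafka stores (via ZooKeeper, in older versions) so a consumer can resume correctly after a restart:

```python
log = ["msg-0", "msg-1", "msg-2", "msg-3"]  # the partition's record log
committed = 0  # last committed offset (the next record to read)


def poll(log, offset, max_records=2):
    """Return the next batch of records starting at the given offset."""
    return log[offset:offset + max_records]


while committed < len(log):
    batch = poll(log, committed)
    for record in batch:
        pass  # process the record here
    committed += len(batch)  # "acknowledge": advance the committed offset

print(committed)  # the committed offset now points past the last record
```

If the loop were interrupted and restarted, it would resume from `committed`, which is exactly why the stored offset survives server outages in the real system.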

Apart from other benefits, the key advantages of using the Kafka messaging framework are:

• Low latency.
• High throughput.
• Fault tolerance.
• Durability.
• Scalability.
• Support for real-time streaming.
• High concurrency.
• Message broker capabilities.
• Persistence capability.

Considering all the above advantages, Kafka is one of the most popular frameworks used in microservice, big data, enterprise integration, and publish-subscribe architectures.

Despite these advantages, the Kafka ecosystem is a bit difficult to set up and configure, and one needs good knowledge to implement it. Some limitations are listed below:

• Lack of monitoring tools.
• No wildcard option for selecting topics.
• Cluster coordination requires a third-party service, ZooKeeper.
• Handling Kafka's cluster-based infrastructure, along with ZooKeeper, requires deep understanding.

ZooKeeper is an open-source distributed configuration and synchronization service, along with a naming registry, for distributed applications.

ZooKeeper is a separate component and is not mandatory for every Kafka deployment; however, when we implement a cluster, it must be set up as the coordination server. Kafka uses ZooKeeper for:

• Electing a controller.
• Cluster management.
• Topic configuration.
• Quotas.
• Access control: who is allowed to read from and write to a topic.

ZooKeeper plays a significant role in cluster management, such as fault tolerance: when one broker goes down, it helps the cluster detect the failure so that messages are served from replicas on other brokers.

```
groupId = org.apache.spark
artifactId = spark-streaming-kafka-0-10_2.11
version = 2.2.0

groupId = org.apache.zookeeper
artifactId = zookeeper
version = 3.4.5
```

These dependencies come with child (transitive) dependencies, which are downloaded and added to the application as part of the parent dependency.

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
```

## Description

Apache Kafka is an open-source stream-processing software platform developed by LinkedIn and donated to the Apache Software Foundation.

The increase in popularity of Apache Kafka has led to an extensive increase in demand for professionals certified in Apache Kafka. It is a highly appealing option for data integration, as it offers unique attributes such as a unified, low-latency, high-throughput platform for handling real-time data feeds. Other features such as scalability, data partitioning, and the ability to handle numerous diverse consumers make it even more desirable for data integration use cases. Notably, Apache Kafka has a market share of about 9.1%. It is an excellent opportunity to move ahead in your career.

There are many companies who use Apache Kafka. According to cwiki.apache.org, the top companies that use Kafka are LinkedIn, Yahoo, Twitter, Netflix, etc.

According to indeed.com, the average salary for an Apache Kafka architect ranges from $101,298 per year for a Senior Technical Lead to $148,718 per year for an Enterprise Architect.

With a lot of research, we have brought you a few Apache Kafka interview questions that you might encounter in your upcoming interview. These Apache Kafka interview questions and answers for experienced professionals and freshers alike will help you crack the Apache Kafka interview and give you an edge over your competitors. So, in order to succeed in the interview, read, re-read, and practice these Apache Kafka interview questions as much as possible.

If you wish to make a career and have Apache Kafka interviews lined up, then you need not fret. Take a look at the set of Apache Kafka interview questions assembled by experts. These kafka interview questions for experienced as well as freshers with detailed answers will guide you in a whole new manner to crack the Apache Kafka interviews. Stay focused on the essential interview questions on Kafka and prepare well to get acquainted with the types of questions that you may come across in your interview on Apache Kafka.

Hope these Kafka Interview Questions will help you to crack the interview. All the best!