- High-throughput distributed messaging system
- Open-sourced by LinkedIn
- Implemented in Scala, with some Java
Additional Kafka concepts are defined here and by the Apache Kafka Project – http://kafka.apache.org/documentation.html.
Typical use cases of Kafka include:
- Real-time streaming and data-pipeline applications build on Kafka to ingest data continuously and emit it for consumption by the next stage.
- Kafka is used for log aggregation as an alternative to Flume and Scribe.
- Kafka can be used to “buffer” or “cache” data to decouple the upstream and downstream systems.
Why Use It
- Fast – a single Kafka node can handle tens of thousands of reads/writes per second
- Scalable – elastically & transparently expand without downtime
- Durable – messages are replicated to prevent data loss
When to Use It
- Kafka-supported messaging patterns:
  - Near-real-time data feeds
  - Batch consumers (very large buffer)
  - Message replay
  - Workloads that can tolerate some data loss
- At its heart, Kafka relies primarily on client implementations and settings to provide reliability. In particular, Kafka has no concept of end-to-end guarantees.
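Because reliability hinges on client-side configuration, a durability-leaning producer setup might start from settings like these. `acks` and `retries` are real Kafka producer configuration names; the plain-dict form (rather than a live client object) is just so the sketch stands alone without a broker, and the values shown are illustrative assumptions, not recommendations:

```python
# Sketch of a durability-oriented Kafka producer configuration.
# These are standard Kafka producer settings; exact tuning depends
# on your workload -- see the Apache Kafka documentation linked above.
producer_config = {
    "acks": "all",   # wait for all in-sync replicas to acknowledge a write
    "retries": 3,    # retry transient send failures instead of dropping data
}

# In a real client this dict would be passed to the producer constructor;
# it is left unattached here so the example runs without a broker.
```

Note that even with these settings, delivery guarantees only cover the path into Kafka; whether downstream consumers see every message still depends on consumer-side offset handling.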
Before deploying a Kafka instance, read the messaging platform guidelines set forth by the Tech Council.
Kafka Support Model
- Technical Consultancy Support
- Pack Creation
- Tools to help Pack customers manage their Kafka instances
- Upgrades & Patches to the Kafka Pack
- We do not provide operational support – each team owns, administers, and manages its own instance
Page Purpose: Provide an introduction and summary of high-level Kafka concepts. Note that Kafka is an open-source technology with a robust community and documentation library.
Kafka Apache Community Documentation is located here: http://kafka.apache.org/documentation.html
- Topic: Kafka stores each data feed in its own category, called a topic. Every client pushes data into and retrieves data from a topic. A topic serves a purpose similar to a queue in a publish-subscribe system, but differs from a queue in design.
- Partition: Every topic is divided into a number of partitions. Each partition is an ordered, immutable sequence of messages that is continually appended to, forming a commit log. Each message in a partition is assigned a sequential ID number called the offset, which uniquely identifies it within that partition. Partitions are the main building blocks of a topic and are used to distribute and load-balance the topic’s data. The partition is also the basic unit of replication for handling machine failures.
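The offset mechanics above can be sketched with a toy in-memory model (illustrative only; real Kafka partitions are disk-backed, retained by policy, and replicated across brokers):

```python
class Partition:
    """Toy model of a Kafka partition: an append-only log where
    each message receives a sequential, 0-based offset."""

    def __init__(self):
        self._log = []  # the append-only commit log

    def append(self, message):
        """Append a message and return the offset assigned to it."""
        self._log.append(message)
        return len(self._log) - 1  # offsets are sequential within the partition

    def read_from(self, offset):
        """Return every message at or after the given offset."""
        return self._log[offset:]

p = Partition()
first = p.append("m0")    # offset 0
second = p.append("m1")   # offset 1
tail = p.read_from(1)     # ["m1"]
```

Because the log is append-only and offsets are per-partition, a consumer can resume from any saved offset without the broker tracking per-message state.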
- Producer: Producers publish messages to a topic. A producer can be synchronous (it waits for message acknowledgement from the system) or asynchronous. Producer data is published to a topic’s partitions based on the partition key (by default the key is null, so data is distributed randomly, but also evenly). Kafka producers have a variety of configuration options to tune throughput, efficiency, and durability. More details later.
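The key-based routing described above can be sketched as follows. This is a simplified stand-in: `crc32` substitutes for the hash Kafka’s default partitioner actually uses (murmur2), and the null-key branch models the random-but-even spread mentioned above:

```python
import random
import zlib

def choose_partition(key, num_partitions):
    """Pick a partition for a message, mimicking default producer behaviour:
    null keys spread randomly (and therefore roughly evenly); non-null keys
    hash deterministically so the same key always lands on the same partition.
    crc32 is an illustrative stand-in for Kafka's real hash function."""
    if key is None:
        return random.randrange(num_partitions)
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# The same key always maps to the same partition, which is what
# preserves per-key ordering within a partition.
p1 = choose_partition("order-42", 6)
p2 = choose_partition("order-42", 6)
```

Keyed routing is why choosing a partition key with enough distinct values matters: a low-cardinality key can funnel most traffic into a few partitions.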
- Consumer: Consumers consume messages from a topic. Kafka consumers follow a pull model, in which a consumer requests data from the Kafka server. This benefits the consumer, which can consume at its own pace and does not lose data during peak periods (as long as the data retention at Kafka is set long enough). Multiple consumer threads on separate machines can consume from the same topic by forming a “consumer group” (“consumer group” and “consumer” may be used interchangeably; a group can include one or more consumer threads). Each consumer group is uniquely identified by a “Consumer ID”: if two consumer threads have the same consumer ID, they belong to the same consumer group. The benefit of a “consumer group” is that the consuming workload can be distributed easily and evenly over multiple machines, so consumer scalability is rarely a problem. Note that within a group, each partition is consumed by at most one consumer thread (though a single thread may consume several partitions), so message order in a Kafka topic is guaranteed only at the partition level; there is no absolute guarantee on message order across an entire Kafka topic (unless the topic has only one partition, which sacrifices parallelism). Also note that in order to saturate Kafka, it is the consumer group’s responsibility to run enough consumer threads – ideally one per partition of the topic to be consumed, since threads beyond the partition count will sit idle.
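The partition-to-thread distribution within a consumer group can be sketched like this. It is a simplified round-robin stand-in for Kafka’s built-in assignors (range, round-robin, sticky), but it shows the key invariant: each partition goes to exactly one consumer thread in the group, while one thread may own several partitions:

```python
def assign_partitions(partitions, consumers):
    """Distribute partitions round-robin across the consumer threads
    of one group. Each partition is owned by exactly one thread; if
    there are more threads than partitions, the extras receive nothing
    (i.e. they sit idle). Simplified illustration, not Kafka's actual
    assignment protocol."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment

# 4 partitions shared by 3 consumer threads: one thread ends up
# owning two partitions, the others one each.
a = assign_partitions([0, 1, 2, 3], ["thread-a", "thread-b", "thread-c"])
```

This is also why ordering holds only per partition: two partitions owned by different threads are read concurrently, with no ordering between them.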