Project Metamorphosis: Wir präsentieren die Event-Streaming-Plattform der nächsten Generation. Mehr erfahren

Kafka as your Data Lake - is it Feasible?

For a long time we discuss how much data we can keep in Kafka. Can we store data forever or do we remove data after a while and maybe having the history in a data lake on Object Storage or HDFS? With the advent of Tiered Storage in Confluent Enterprise Platform, storing data much longer in Kafka is much very feasible. So can we replace a traditional data lake with just Kafka? Maybe at least for the raw data? But what about accessing the data, for example using SQL?

KSQL allows for processing data in a streaming fashion using an SQL like dialect. But what about reading all data of a topic? You can reset the offset and still use KSQL. But there is another family of products, so-called query engines for Big Data. They originate from the idea of reading Big Data sources such as HDFS, object storage or HBase, using the SQL language. Presto, Apache Drill and Dremio are the most popular solutions in that space. Lately these query engines also added support for Kafka topics as a source of data. With that you can read a topic as a table and join it with information available in other data sources. The idea of course is not real-time streaming analytics but batch analytics directly on the Kafka topic, without having to store it in a big data storage.

This talk answers, how well these tools support Kafka as a data source. What serialization formats do they support? Is there some form of predicate push-down supported or do we have to always read the complete topic? How performant is a query against a topic, compared to a query against the same data sitting in HDFS or an object store? And finally, will this allow us to replace our data lake or at least part of it by Apache Kafka?


Guido Schmutz