3 Data Engineering Trends Using Kafka, Flink, and Iceberg

Apache Kafka, Apache Flink, and Apache Iceberg are popular technologies in the data ecosystem. Kafka moves data in real time, Flink processes that data as needed, and Iceberg gives you structured, explorable access to the data you store and query. All three have become hugely influential in how data systems are built.


The open source communities behind each of the three tools continue to add new features, and they often collaborate with one another. This means that best practices are constantly evolving, and that data professionals need to stay aware of industry trends, such as the recent growing interest in data governance.

Here are three trends I have recently observed in the Kafka, Flink, and Iceberg communities. Each presents new ways for engineers to manage data and meet application requirements.

Reimagining microservices as Flink streaming applications

A common way to process data is to use a microservice to pull data from Kafka, process it with the same or another microservice, and then write it back into Kafka or another queue. When you use Flink with Kafka instead, you can do all of this with a more reliable solution that offers lower latency, built-in fault tolerance, and stronger processing guarantees.


Flink can be configured to consume incoming data as a continuous stream rather than through individual, one-off pulls. By using Flink in place of these microservices, you also get its built-in correctness guarantees, such as exactly-once semantics: Flink's two-phase commit protocol lets developers guarantee that each event is processed exactly once, end to end. In other words, an event written to Kafka is processed exactly once across Kafka and Flink. The microservices Flink is best suited to replace are data-processing services that update operational analytics state.
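To make this concrete, here is a minimal sketch of a Flink job standing in for such a pipeline: it reads from one Kafka topic, applies the processing a microservice would have done, and writes to another topic with exactly-once delivery. The broker address, topic names, and the trivial map step are placeholders for illustration.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class OrderEnrichmentJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpointing is required for end-to-end exactly-once delivery.
        env.enableCheckpointing(60_000);

        // Continuously consume the input topic (placeholder names and addresses).
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("orders")
                .setGroupId("order-enrichment")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Transactional sink: records become visible only when a checkpoint
        // completes, via Flink's two-phase commit protocol.
        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("orders-enriched")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                .setTransactionalIdPrefix("order-enrichment")
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "orders")
                .map(String::toUpperCase) // stand-in for the former microservice's logic
                .sinkTo(sink);

        env.execute("order-enrichment");
    }
}
```

Note that downstream consumers of the output topic should read with isolation.level=read_committed so they only see records from committed transactions.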

Applying AI models to data quickly with Flink SQL

Kafka and Flink together let you move and process data in real time and create high-quality, reusable data streams. These capabilities are essential for complex real-time AI applications that need reliable, readily available data for real-time decision making. Consider the RAG (retrieval-augmented generation) pattern: whichever model you use, you can improve responses and mitigate hallucinations by supplementing prompts with timely, high-quality context.

Flink SQL allows you to call any model (e.g. OpenAI, Azure OpenAI, Amazon Bedrock) by writing a simple SQL statement. In practice, any AI model that exposes a REST API can be wired up so that Flink AI can use it while processing data streams, which also lets you plug in custom, self-developed models.

There are countless use cases for AI, but it is commonly used for classification, clustering, and regression. For example, it can be used for sentiment analysis of text or scoring of sales leads.
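As a rough illustration of what this looks like in practice, the sketch below registers a remote model and scores reviews from a Kafka topic, with the SQL executed through Flink's Java Table API. The table and model definitions are hypothetical, and the CREATE MODEL / ML_PREDICT statements follow Confluent's Flink SQL model-inference syntax; exact option names and availability vary by platform and Flink version.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ReviewSentimentJob {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Hypothetical Kafka-backed input stream of product reviews.
        tEnv.executeSql(
            "CREATE TABLE reviews (" +
            "  review_id STRING," +
            "  review_text STRING" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'reviews'," +
            "  'properties.bootstrap.servers' = 'localhost:9092'," +
            "  'scan.startup.mode' = 'earliest-offset'," +
            "  'format' = 'json')");

        // Register a remote model. The CREATE MODEL / ML_PREDICT syntax shown
        // here follows Confluent's Flink SQL model inference; check your
        // platform's documentation for the exact provider and connection options.
        tEnv.executeSql(
            "CREATE MODEL sentiment_model" +
            "  INPUT (review_text STRING)" +
            "  OUTPUT (sentiment STRING)" +
            "  WITH ('provider' = 'openai', 'task' = 'classification')");

        // Score each review as it arrives by invoking the model per row.
        tEnv.executeSql(
            "SELECT r.review_id, r.review_text, p.sentiment" +
            " FROM reviews AS r," +
            " LATERAL TABLE(ML_PREDICT('sentiment_model', r.review_text)) AS p")
            .print();
    }
}
```

In a real deployment the final SELECT would typically feed an INSERT INTO a results topic or table rather than printing to stdout.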

In addition to its AI capabilities, Flink works well with everyone's favorite streaming technology, Kafka. This is one of the reasons Flink remains the community's choice for stream processing.

Leveraging community-built Apache Iceberg tools

As more developers and organizations use Iceberg to manage large analytical datasets, especially data stored in data lakes and data warehouses, community contributions to Iceberg have grown significantly. For example, a migration tool has been built to easily move the Iceberg catalog from one cloud service provider to another. There are also tools to analyze the state of a specific Iceberg instance.

Another community contribution is the Puffin format, a file format for storing blobs of statistics and additional metadata about the data managed by Iceberg tables. The ability to read Iceberg data back into Flink is likewise the result of contributions from Flink and Iceberg committers.
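As a simple illustration of that round trip, the sketch below uses the Flink/Iceberg connector to register an Iceberg catalog and incrementally stream new snapshots of a table back into Flink. The catalog name, warehouse path, and table are placeholders; in production the warehouse would typically live in object storage behind a Hive, REST, or Glue catalog.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class IcebergStreamingRead {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Allow SQL hints so the streaming-read options below take effect
        // (enabled by default in newer Flink versions).
        tEnv.getConfig().getConfiguration()
                .setString("table.dynamic-table-options.enabled", "true");

        // Register an Iceberg catalog backed by a placeholder local warehouse path.
        tEnv.executeSql(
            "CREATE CATALOG lake WITH (" +
            "  'type' = 'iceberg'," +
            "  'catalog-type' = 'hadoop'," +
            "  'warehouse' = 'file:///tmp/iceberg-warehouse')");

        // Stream newly committed snapshots of an Iceberg table back into Flink.
        tEnv.executeSql(
            "SELECT * FROM lake.db.events " +
            "/*+ OPTIONS('streaming'='true', 'monitor-interval'='30s') */")
            .print();
    }
}
```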

As more contributors and solution companies join the broader Iceberg community, the value of your data will become more accessible than ever, no matter where it is in your data architecture. When combined with a Shift Left approach to Kafka/Flink applications and governance, Iceberg tables can help dramatically accelerate and scale how you build real-time analytics use cases.

To keep your Kafka, Flink, and Iceberg skills up to date, keep an eye on the steady stream of KIPs (Kafka Improvement Proposals), FLIPs (Flink Improvement Proposals), and Iceberg pull requests coming out of their respective communities. The strength of the three technologies' core capabilities, and the synergy between them, makes it well worth staying abreast of trends and developments in this growing field.

*Adi Polak is Director of Developer Experience Engineering at Confluent.
