Headless data architecture formalizes the data access layer at the center of the organization. It encompasses both streams and tables, providing consistent data access for operational and analytical use cases. Streams provide low-latency access for timely responses to events, while tables provide higher-latency but batch-efficient query capabilities. You simply select the processing head that best suits your needs and connect it to your data.
Building a headless data architecture requires understanding what you’re already doing deep inside your data analytics plane and shifting it to the left. Take the work you’re already doing downstream—cleaning, structuring, and mapping—and push it upstream to your source systems. Data consumers can rely on a single, standardized set of data, delivered via streams and tables, for all their operations, analytics, and beyond.
We implement headless data architecture using a shift left approach. The following chart shows the overall concept. By shifting work to the left, we can drastically reduce downstream costs.
The shift left approach provides a simpler and more cost-effective way to create, access, and use data, especially compared to traditional multi-hop approaches.
Multi-Hop and Medallion Data Architecture
Most organizations have an extract-transform-load (ETL) data pipeline, a data lake, a data warehouse, or a data lakehouse. Data analysts on the analytics plane need different specialized tools than software developers on the operations plane. This overall structure of “moving data from left to right” is commonly referred to as a multi-hop data architecture.
Medallion architecture is the most popular form of multi-hop architecture. It has three levels of data quality, named after the colors of Olympic medals: Bronze, Silver, and Gold. The Bronze tier acts as the data landing zone, Silver as the cleaned and well-defined data tier, and Gold as the business-level aggregated data sets.
The data landed in the Bronze tier is generally unstructured raw data. It is cleaned, structured, and standardized before being written to the Silver tier. There it is further aggregated, grouped, denormalized, and processed into Gold-tier, business-specific data sets, which feed dashboards and reports and provide training data for AI and machine learning models.
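To make the hops concrete, here is a minimal sketch of a medallion-style batch job in PySpark. The bronze path and the silver/gold table names are hypothetical, and each hop reads the previous tier and writes out another copy of the data.

```python
# A minimal sketch of a multi-hop (medallion) batch job; table and path
# names are hypothetical. Note that every hop writes another copy of the data.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-hops").getOrCreate()

# Bronze: raw landed data.
bronze = spark.read.json("s3://lake/bronze/orders/")

# Silver: cleaned, typed, and deduplicated.
silver = (
    bronze
    .dropDuplicates(["order_id"])
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .filter(F.col("order_id").isNotNull())
)
silver.write.mode("overwrite").saveAsTable("silver.orders")

# Gold: business-level aggregate for dashboards and reports.
gold = (
    spark.table("silver.orders")
    .groupBy(F.to_date("order_ts").alias("order_date"))
    .agg(F.sum("amount").alias("daily_revenue"))
)
gold.write.mode("overwrite").saveAsTable("gold.daily_revenue")
```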
Problems with multi-hop architecture
First, multi-hop architectures are slow, because most are implemented as periodically triggered batch processes. Data must land in the Bronze tier before the next hop can begin.
For example, if data is pulled into the Bronze tier at 15-minute intervals, each subsequent hop is also limited to 15-minute intervals, because data can only move to the next stage as fast as the slowest part allows. Even if you cut each hop to 1-minute intervals, data is still at least 3 minutes old by the time it is available in the Gold tier, excluding processing time.
Second, multi-hop architectures are expensive, because each hop creates another copy of the data, which requires processing power to load, process, and write to the next hop. These costs add up quickly.
Third, multi-hop architectures tend to be brittle, since different people often own different steps in the workflow, different source databases, and different end-use cases. Very strong coordination is needed to prevent the architecture from breaking down, which is difficult to scale in practice.
Fourth, multi-hop architectures lead to multiple similar data pipelines. Because data analysts are often responsible for acquiring their own data, and because working around fragmented ownership is easier than coordinating, each team tends to build its own customized pipeline, which results in a proliferation of similar but different pipelines and data sets. This problem becomes more common as companies grow, and it becomes difficult to find all the data sets that are already available.
This leads to the fifth problem: similar but different data sets. Why are there multiple data sets? Which one should you use? Is this data set still maintained, or is it a zombie data set that is still updated periodically but no longer supervised by anyone? The problem becomes acute when a critical computation relies on data sets that should be identical but are not, and the results conflict. Delivering conflicting reports, dashboards, or metrics to your customers erodes trust and, in the worst case, can damage your business or expose you to legal liability.
Even if all of the above were addressed (reducing latency, lowering costs, and eliminating redundant pipelines and data sets), the architecture would still offer nothing for the operational side to use. All the cleaning, structuring, and remodeling happens downstream of the ETL and is only useful to the data analytics domain, so operational systems would still have to do their own cleaning, structuring, and remodeling upstream.
Shift Left for Headless Data Architecture
Building a headless data architecture requires rethinking how organizations circulate, share, and manage data. In other words, it requires shifting left: take the ETL→Bronze→Silver work out of the downstream and move it upstream, inside a data product, much closer to the source.
While data sets generated by periodic ETL are, at best, minutes old, a stream-first approach gives data products sub-second freshness. Shifting left makes data more accessible, cheaper, and faster across the enterprise.
Building a Headless Data Architecture with Data Products
The top-level logical unit of data in a headless data architecture is the data product; you may already be familiar with this concept from the data mesh approach. In a headless data architecture, a data product consists of a stream (based on Apache Kafka) and an associated table (based on Apache Iceberg). Data written to the stream is automatically appended to the table, so the same data can be accessed either as a Kafka topic or as an Iceberg table.
The following figure shows a stream/table data product generated from a source system. Data is first written to the stream; it can then optionally be transformed before finally being materialized into an Iceberg table.
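As a minimal sketch of the "write to the stream first" step, the following uses the confluent_kafka Python client with a hypothetical orders topic and payload; in a real data product you would typically use a registered schema rather than ad hoc JSON.

```python
# A minimal sketch of writing an event to the data product's stream,
# assuming the confluent_kafka client and a hypothetical "orders" topic.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def on_delivery(err, msg):
    # Surface delivery failures instead of silently dropping writes.
    if err is not None:
        print(f"delivery failed: {err}")

order_event = {"order_id": "o-123", "customer_id": "c-42", "amount": 99.50}

# Keying by order_id keeps all updates for one order in the same partition.
producer.produce(
    "orders",
    key=order_event["order_id"],
    value=json.dumps(order_event),
    on_delivery=on_delivery,
)
producer.flush()
```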
You can use streams (Kafka topics) to drive low-latency business operations such as order management, dispatching, and financial transactions. You can also connect batch query heads to Iceberg tables to compute high-latency workloads such as daily reporting, customer analytics, and regular AI training.
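The two processing heads could look roughly like the following sketch: a low-latency Kafka consumer for the operational side and a Spark batch query over the Iceberg table for the analytical side. The topic name orders, the table name lake.orders, the dispatch_order handler, and a preconfigured Iceberg catalog are all assumptions for illustration.

```python
# Sketch of two processing heads over one data product; names are assumptions.
from confluent_kafka import Consumer
from pyspark.sql import SparkSession, functions as F

def dispatch_order(payload: bytes) -> None:
    # Hypothetical operational handler for a newly received order event.
    print("dispatching", payload)

# Streaming head: react to each order event with low latency.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "order-dispatch",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])
msg = consumer.poll(timeout=1.0)
if msg is not None and msg.error() is None:
    dispatch_order(msg.value())
consumer.close()

# Batch head: daily revenue report over the Iceberg table, assuming Spark is
# configured with the Iceberg runtime and a catalog that exposes lake.orders.
spark = SparkSession.builder.getOrCreate()
(
    spark.table("lake.orders")
    .groupBy(F.to_date("order_ts").alias("order_date"))
    .agg(F.sum("amount").alias("daily_revenue"))
    .show()
)
```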
Data products are trusted data sets intended to be shared with and reused by other teams and services, and they formalize the responsibilities, skills, and processes that streamline getting the data needed for work and services. Data products are also called "reusable data assets," but whatever you call them, the essence is the same: standardized, trusted data that can be shared and reused.
The logic for creating data products varies greatly depending on the source system. For example:
- Event-driven applications write their output directly to Kafka topics, which can be easily materialized as Iceberg tables. The data product creation logic may be minimal, for example masking or removing confidential fields.
- A typical request/response application uses change data capture (CDC) to extract data from the underlying database, convert it into events, and write it to a Kafka topic (see the connector sketch after this list). CDC events have a well-defined schema based on the source table, and additional transformations can be applied in the connector itself or with something more powerful such as Flink SQL.
- SaaS applications may need to periodically poll an endpoint using Kafka Connect to write to a stream.
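As an example of the CDC pattern above, the following sketch registers a Debezium Postgres connector through the Kafka Connect REST API. The Connect URL, database coordinates, and topic prefix are placeholders, and the property names should be checked against the Debezium version you run.

```python
# A hedged sketch of registering a Debezium Postgres CDC connector via the
# Kafka Connect REST API; hostnames, credentials, and the topic prefix are
# placeholders, and property names depend on your Debezium version.
import requests

connector = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "orders-db",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "********",
        "database.dbname": "orders",
        "table.include.list": "public.orders",
        "topic.prefix": "orders-src",
    },
}

resp = requests.post("http://connect:8083/connectors", json=connector, timeout=30)
resp.raise_for_status()
```

From that point on, committed changes to the included tables arrive in the stream as events whose schema is derived from the source table.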
The nice thing about stream-first data products is that you only need to write to the stream; there are no other requirements. You don't have to manage distributed transactions to write to the stream and the table simultaneously (which is quite hard to do correctly and can be slow). Instead, you create a dedicated Iceberg table from the stream, using Kafka Connect or a proprietary SaaS stream-to-table solution such as Confluent Tableflow. Fault tolerance and exactly-once writes help maintain data integrity, so you get the same results whether you read from the stream or from the table.
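A stream-to-table step with Kafka Connect might look like the following sketch, which registers an Iceberg sink connector for the orders topic. The connector class and iceberg.* option names follow the open-source Iceberg sink connector and are assumptions to verify against the version you deploy; Tableflow-style services replace this step entirely.

```python
# A hedged sketch of materializing the stream into an Iceberg table with a
# Kafka Connect sink connector. The class name and iceberg.* options follow
# the open-source Iceberg sink connector; verify them for your version.
import requests

sink = {
    "name": "orders-iceberg-sink",
    "config": {
        "connector.class": "org.apache.iceberg.connect.IcebergSinkConnector",
        "topics": "orders",
        "iceberg.tables": "lake.orders",
        "iceberg.catalog.type": "rest",
        "iceberg.catalog.uri": "http://iceberg-rest:8181",
    },
}

resp = requests.post("http://connect:8083/connectors", json=sink, timeout=30)
resp.raise_for_status()
```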
Selecting a Dataset for Shift Left
Shift left is not an all-or-nothing approach; it is highly modular and incremental. You can choose which workloads to shift left and which to leave as they are. You can set up and validate a shift-left solution in parallel, then switch over from the existing workflow once you are satisfied with the results. The process is as follows:
- Select data sets that are commonly used in the analytics plane. The more frequently they are used, the better candidates they are for shift left. Business-critical data that leaves little room for error (e.g., billing information) is also a good candidate.
- Identify the source of the data on the operational plane. This is the system that needs to produce the data stream. Note that if this system is already event-driven, a usable stream may already exist, in which case you can skip to step 4 below.
- Create a source-to-stream workflow in parallel with your existing ETL pipeline. You may need to transform database data into an event stream using a Kafka connector (e.g. CDC). Or you may choose to generate events directly into the stream. Just be sure to write the entire dataset to ensure consistency with the source database.
- Create a table from the stream. You can create an Iceberg table using Kafka Connect, or use an automated third-party proprietary service that provides Iceberg tables. Note that if you use Kafka Connect, a copy of the data is written to the Iceberg table; third-party services are expected to soon offer the ability to expose Kafka topics as Iceberg tables without creating another copy of the data.
- Plug the table into your existing data lake alongside the Silver-tier data. Now you can verify that the new Iceberg table matches your existing data set (a minimal validation sketch follows this list). Once you are satisfied with the results, you can migrate your analytics jobs off the batch-created table, deprecate it, and remove it at your convenience.
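The parallel-run validation could start as simply as the following PySpark sketch, assuming the stream-derived Iceberg table (lake.orders) and the existing batch-built table (silver.orders) share the same schema and are reachable from one Spark session; both names are hypothetical.

```python
# A minimal parallel-run validation sketch; table names are hypothetical and
# both tables are assumed to share the same schema.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

new_tbl = spark.table("lake.orders")     # stream-derived Iceberg table
old_tbl = spark.table("silver.orders")   # existing batch-built table

# Cheap first check: row counts.
print("new:", new_tbl.count(), "old:", old_tbl.count())

# Rows that appear on one side but not the other.
print("only in new:", new_tbl.exceptAll(old_tbl).count())
print("only in old:", old_tbl.exceptAll(new_tbl).count())
```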
Other headless data architecture considerations
As we discussed in the previous article, you can connect Iceberg tables to any compatible analytics endpoint without copying data. The same goes for data streams. In both cases, you simply select a processing head and connect it to the table or stream as needed.
Shift left also enables some powerful capabilities that are not available in copy-heavy multi-hop and medallion architectures: you can manage stream evolution and table evolution together at a single logical point and ensure that changes to the stream do not break the Iceberg table.
Because the work has moved left, out of the data analytics domain, data validation and testing can be integrated into the source application's deployment pipeline. This catches breaking changes before the code reaches production, instead of leaving breakage to be discovered downstream long after the fact.
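For example, a CI step in the source application's pipeline could ask Schema Registry whether a proposed event schema is compatible with what consumers already depend on. The registry URL, subject name, and schema file path below are assumptions; the /compatibility endpoint is part of the standard Schema Registry REST API.

```python
# A hedged sketch of a pre-deploy compatibility gate in the source
# application's CI pipeline; the registry URL, subject name, and schema file
# path are assumptions.
import json
import requests

REGISTRY = "http://schema-registry:8081"
SUBJECT = "orders-value"

with open("schemas/order_event.avsc") as f:
    proposed_schema = f.read()

resp = requests.post(
    f"{REGISTRY}/compatibility/subjects/{SUBJECT}/versions/latest",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": proposed_schema}),
    timeout=30,
)
resp.raise_for_status()
if not resp.json().get("is_compatible", False):
    raise SystemExit("schema change would break downstream consumers")
```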
Finally, since the table is derived from the stream, corrections only need to be made in one place: anything written to the stream propagates to the table. Streaming applications automatically receive the corrected data and can adjust accordingly, while periodic batch jobs that use the table must be identified and re-run. But that is work you would have to do in a traditional multi-hop architecture anyway.
Headless data architecture enables powerful data access across the organization, and the starting point is shifting left.
Source: www.itworld.co.kr