The Architecture Shift That Already Happened Once
In 2003, Google published the GFS paper. By 2006, Hadoop, its open-source descendant, was an Apache project with Yahoo as its largest backer. The core idea was simple: networks were slow, so bring compute to where the data lives. HDFS stored blocks on DataNodes. MapReduce ran computations locally. Storage and compute were tightly coupled because that was the only way to get acceptable performance.
Then networks got 400 times faster. S3 arrived, and suddenly local disk replication looked expensive and fragile by comparison. Spark proved you could separate compute from storage entirely. Data lakes followed the economics to object storage, and the shared-nothing architecture that made Hadoop successful became the constraint that killed it.
This same shift is now happening to data streaming.
What Apache Kafka Made Possible
Apache Kafka changed how organizations think about data. Before Kafka, data integration meant point-to-point connections, batch ETL jobs, and brittle synchronization logic. Kafka introduced the distributed commit log as a first-class architectural primitive. Producers and consumers decoupled. Events became the source of truth. Real-time processing became practical at scale.
The Kafka protocol is now the standard interface for streaming data. Confluent, Amazon MSK, Redpanda, and dozens of other implementations speak it. Kafka Streams, Apache Flink, and Spark Structured Streaming build on it. The ecosystem is mature and battle-tested.
But the protocol is not the constraint. The storage architecture is.
Traditional Kafka deployments couple storage and compute on the same broker nodes. Data lives on local disks. Scaling requires partition rebalancing. Recovery means moving data between nodes. Retention is limited by disk capacity. The operational complexity grows with cluster size, and the cost of long-term retention becomes prohibitive.
The protocol layer is commodity infrastructure. The innovation is happening at the storage layer.
Storage-Native: The Next Architecture
WarpStream proved the concept in 2023: you can build a Kafka-compatible system that writes directly to object storage with no local disks. Their Zero Disk Architecture eliminates interzone networking costs, removes partition rebalancing, and simplifies operations dramatically. Confluent responded with Freight. Aiven proposed KIP-1150. The entire industry is moving in this direction.
KafScale implements this architecture as open source and takes it one step further by opening the storage layer for direct access.
The difference matters. In most storage-native implementations, all reads still go through the broker layer. The storage format is an implementation detail. Consumers use the Kafka protocol for everything, whether they need millisecond latency or are running a weekly batch job.
KafScale stores data in S3 using a documented segment format (.kfs + .index). This format is part of the public specification. Any processor that understands the format can read directly from S3 without connecting to a broker.
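To make the direct-read path concrete, here is a minimal Python sketch of parsing a segment header from raw bytes. The field layout (magic, version, base offset, record count) and the `KFS0` magic are invented for illustration only; the real layout is whatever the published .kfs specification defines.

```python
import struct

# Hypothetical .kfs segment layout, for illustration only. The actual field
# order and sizes are defined by the KafScale format specification.
# header: magic (4s) | version (B) | base_offset (Q) | record_count (I)
HEADER = struct.Struct(">4sBQI")

def read_segment_header(buf: bytes) -> dict:
    """Parse the (hypothetical) fixed-size header of a .kfs segment."""
    magic, version, base_offset, count = HEADER.unpack_from(buf, 0)
    if magic != b"KFS0":
        raise ValueError("not a .kfs segment")
    return {"version": version, "base_offset": base_offset, "record_count": count}

# No broker involved: the same bytes could just as well come from
# s3.get_object(...)["Body"].read() as from a local buffer.
segment = HEADER.pack(b"KFS0", 1, 42_000, 500) + b"...records..."
print(read_segment_header(segment))
```

The point is not the particular fields but that a documented binary layout plus S3 access is the entire integration surface.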
Think about what this enables:
For streaming workloads, producers and consumers use the Kafka protocol through KafScale brokers. Low latency. Familiar APIs. Standard consumer groups. The real-time path works exactly as expected.
For analytical workloads, processors read .kfs segments directly from S3. No broker load. No consumer group coordination. No protocol overhead. The batch path bypasses the infrastructure designed for real-time.
The streaming path and the analytical path share the same data but use different access patterns optimized for their requirements.
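A sketch of the analytical path: given nothing but an S3 key listing, decide which segment objects cover an offset range. The key convention `<topic>/<partition>/<base_offset>.kfs` is an assumption made for this example, not the documented KafScale layout.

```python
# Broker-free batch reads: select segment objects for an offset range using
# only an S3 listing. No consumer group, no coordinator, no protocol session.
# The key naming scheme here is hypothetical.

def segments_for_range(keys: list[str], start: int, end: int) -> list[str]:
    """Return the segment keys whose offset span overlaps [start, end)."""
    bases = sorted((int(k.rsplit("/", 1)[1].removesuffix(".kfs")), k)
                   for k in keys if k.endswith(".kfs"))
    selected = []
    for i, (base, key) in enumerate(bases):
        # Each segment covers [base_offset, next segment's base_offset).
        next_base = bases[i + 1][0] if i + 1 < len(bases) else float("inf")
        if base < end and next_base > start:
            selected.append(key)
    return selected

listing = ["orders/0/0.kfs", "orders/0/1000.kfs", "orders/0/2000.kfs"]
print(segments_for_range(listing, 1500, 2500))
# a real reader would now issue s3.get_object for each selected key
```

A backfill job scales by running more copies of this logic against more partitions; broker capacity never enters the picture.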
The Hadoop Parallel Is Exact
The evolution from Hadoop to modern data lakes provides a precise template for what happens next in streaming.
| Hadoop Era (2006-2015) | Storage-Native Streaming (2025+) |
| --- | --- |
| HDFS blocks on DataNodes | .kfs segments in S3 |
| NameNode for metadata | etcd for topics and partitions |
| MapReduce/Spark reading HDFS | Iceberg Processor reading S3 |
| Hive tables on HDFS | Iceberg tables on .kfs data |
| Compute tied to storage nodes | Stateless brokers, independent processors |
The architectural insight is identical: when storage becomes cheap, durable, and accessible over fast networks, coupling compute to storage nodes creates unnecessary constraints. Disaggregation wins.
Hadoop's decline was not about the ideas being wrong. HDFS solved real problems. MapReduce enabled computations that were previously impossible. But the architecture assumed constraints that stopped being true. S3 changed the economics. Spark proved you could separate compute from storage. The ecosystem moved on.
Kafka's broker-centric architecture assumes similar constraints. Local disks for durability. Replication for fault tolerance. Partition assignment to specific nodes. These assumptions made sense when object storage was slow and expensive. They create unnecessary operational burden when S3 provides 11 nines of durability at $0.02 per gigabyte per month.
The Iceberg Processor: Bypassing the Broker
The KafScale Iceberg Processor demonstrates storage-native access in practice. It reads .kfs segments from S3, queries etcd for topic and partition metadata, converts data to Parquet, and writes to Apache Iceberg tables.
The broker is not involved in this path.
This architecture provides concrete operational benefits:
No broker load for batch reads. Historical analysis, backfills, and AI training jobs read from S3 without consuming broker CPU or network bandwidth. Streaming workloads continue unaffected.
No consumer group coordination. The processor tracks offsets in etcd. No rebalance storms when processors scale. No coordinator overhead. No session timeouts to tune.
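The offset-tracking claim can be sketched in a few lines. A plain dict stands in for the etcd key-value store, and the key scheme is hypothetical; the point is that one committed offset per partition replaces the entire consumer-group protocol.

```python
# Broker-free offset tracking, sketched with a dict standing in for etcd.
# The key scheme "iceberg/<topic>/<partition>" is illustrative, not the
# actual KafScale metadata layout.

class OffsetCheckpoint:
    def __init__(self, kv: dict):
        self.kv = kv  # in production: an etcd client

    def _key(self, topic: str, partition: int) -> str:
        return f"iceberg/{topic}/{partition}"

    def committed(self, topic: str, partition: int) -> int:
        """Next offset to read; 0 if this partition was never processed."""
        return int(self.kv.get(self._key(topic, partition), 0))

    def commit(self, topic: str, partition: int, next_offset: int) -> None:
        # Committing only after the Parquet write succeeds gives
        # at-least-once delivery with no rebalance protocol at all.
        self.kv[self._key(topic, partition)] = next_offset

ckpt = OffsetCheckpoint({})
start = ckpt.committed("orders", 0)    # 0 on first run
ckpt.commit("orders", 0, start + 500)  # processed 500 records
print(ckpt.committed("orders", 0))
```

Scaling processors means partitioning the key space, not renegotiating group membership.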
Unified batch and stream on one storage layer. Write once to the event log. Serve both real-time consumers and batch analytics from the same source. The dual-write patterns that plague Lambda architectures disappear.
Cost reduction at scale. S3 storage costs roughly one-tenth as much as equivalent replicated broker disk capacity. For workloads with long retention requirements, the savings compound. Analytical reads from S3 avoid broker compute costs entirely.
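That rough factor-of-ten claim is easy to sanity-check with back-of-envelope arithmetic. The prices below are illustrative assumptions (S3 standard at about $0.02/GB-month, broker-attached disk at about $0.08/GB-month, tripled by Kafka's typical replication factor of 3), not quoted figures:

```python
# Back-of-envelope retention economics under assumed, illustrative prices.
S3_PER_GB_MONTH = 0.02    # assumption: S3 standard storage
DISK_PER_GB_MONTH = 0.08  # assumption: broker-attached block storage
REPLICATION = 3           # typical Kafka replication factor

def monthly_cost(gb: float) -> tuple[float, float]:
    """(S3 cost, replicated broker-disk cost) per month for `gb` retained."""
    return gb * S3_PER_GB_MONTH, gb * DISK_PER_GB_MONTH * REPLICATION

s3, broker = monthly_cost(100_000)  # 100 TB of retained events
print(f"S3: ${s3:,.0f}/mo  broker disks: ${broker:,.0f}/mo  "
      f"ratio: {broker / s3:.0f}x")
```

Under these assumptions the gap is roughly an order of magnitude before counting the broker CPU and network freed up by moving batch reads to S3.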
The Iceberg Processor outputs data in Apache Iceberg format. Query engines like Trino, Spark, and Flink read Iceberg tables natively. The streaming event log becomes directly accessible to the entire analytical ecosystem without intermediate pipelines.
Why AI Agents Need This Architecture
Event sourcing is emerging as the foundation for agentic systems. When AI agents make decisions, they need context. That context comes from historical events. They need to understand what happened, in what order, and why the current state exists.
Traditional stream processing optimizes for latency. Milliseconds matter for fraud detection or trading systems. But AI agents reasoning over business context have different requirements. They need completeness. They need replay capability. They need to reconcile current state with historical actions.
Research from the Apache Flink community (FLIP-531) and platforms like Akka confirms this pattern. Agentic systems are distributed systems. Communication with LLMs is event-based. The ability to reproduce state at any point in time is foundational.
The insight from event sourcing applies directly: since everything in an event sourced system is captured as an immutable event, you can reliably reproduce the state of any agent at any given point in time. You know what happened and you know why.
Storage-native streaming makes this practical. The immutable log in S3 becomes the system of record that agents query, replay, and reason over. The Iceberg Processor converts that log to tables that analytical tools understand. Agents get complete historical context without competing with streaming workloads for broker resources.
Two-second latency for analytical access is acceptable when the alternative is incomplete context or degraded streaming performance. AI agents do not need sub-millisecond reads. They need the full picture.
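The replay idea is simple to demonstrate: agent state at any point in time is a fold over the immutable log. The event shapes below are toy examples, not a KafScale schema.

```python
# Minimal event-sourcing sketch: reconstruct state at any timestamp by
# replaying the immutable log through a pure transition function.
from functools import reduce

events = [
    {"ts": 1, "type": "order_placed",   "order": "A1", "amount": 120},
    {"ts": 2, "type": "order_placed",   "order": "A2", "amount": 75},
    {"ts": 3, "type": "order_refunded", "order": "A1", "amount": 120},
]

def apply(state: dict, event: dict) -> dict:
    """Pure transition: old state + event -> new state."""
    open_orders = dict(state["open"])
    if event["type"] == "order_placed":
        open_orders[event["order"]] = event["amount"]
    elif event["type"] == "order_refunded":
        open_orders.pop(event["order"], None)
    return {"open": open_orders}

def state_at(ts: int) -> dict:
    """Replay the log up to a timestamp: the agent's view at that moment."""
    return reduce(apply, (e for e in events if e["ts"] <= ts), {"open": {}})

print(state_at(2))  # both orders open
print(state_at(3))  # A1 refunded
```

Because the log is immutable, `state_at` is deterministic: an agent (or an auditor) asking "what did the system know at ts=2?" always gets the same answer.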
Apache Wayang as the Orchestration Layer
The storage-native architecture creates new possibilities for cross-platform optimization. When streaming data is accessible directly from S3, query optimizers can treat it like any other data source. Predicates push down to storage. Engine selection happens based on workload characteristics rather than system boundaries.
Apache Wayang provides the abstraction layer for this optimization. Instead of writing separate pipelines for Spark, Flink, and SQL engines, users express logic once in platform-agnostic operators. Wayang's cost-based optimizer determines the execution plan.
Academic evaluations of Wayang show that hybrid execution outperforms single-engine execution for mixed workloads, with speedups of up to an order of magnitude when the optimizer selects a specialized engine for each stage and minimizes data movement.
For streaming intelligence workloads, this means:
- Use PostgreSQL to filter data close to where it is stored
- Use Spark for compute-heavy transformations that benefit from distributed parallelism
- Use Java Streams for small collections or iterative control logic
- Use the Iceberg Processor for converting streams to analytical tables
The optimizer handles plan selection. Engineers focus on business logic. The underlying complexity of coordinating multiple engines disappears behind a unified abstraction.
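A toy version of that cost-based choice fits in a few lines. The cost constants are invented for illustration; the real Wayang optimizer estimates costs from operator cardinalities and platform profiles rather than a hard-coded table.

```python
# Toy cost-based engine selection in the spirit of Wayang's optimizer.
# Costs are invented: (fixed startup overhead, per-row processing cost).
COSTS = {
    "java_streams": (0.0,   0.5),    # no startup, but slow per row
    "postgresql":   (5.0,   0.05),   # cheap filtering near the data
    "spark":        (120.0, 0.001),  # heavy startup, massive parallelism
}

def pick_engine(rows: int) -> str:
    """Choose the engine with the lowest estimated cost for this stage."""
    return min(COSTS, key=lambda e: COSTS[e][0] + COSTS[e][1] * rows)

for rows in (10, 1000, 1_000_000):
    print(rows, "->", pick_engine(rows))
```

Even this caricature reproduces the pattern from the list above: in-process streams for tiny collections, the database for mid-sized filters, the distributed engine once parallelism pays for its startup cost.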
Scalytics Federated extends this capability to distributed environments where data cannot be centralized. When streaming data is stored in S3 with a documented format, federated processors access it without requiring broker connectivity across network boundaries. The event log becomes a portable, queryable artifact that works within data residency constraints.
The Open Source Foundation
KafScale is available under an open source license. The project includes:
- Kafka-compatible brokers with S3-native storage
- The .kfs segment format specification
- etcd-based metadata management
- The Iceberg Processor for lakehouse integration
The storage format is documented and stable. Teams build custom processors that read directly from S3. They integrate with existing Iceberg catalogs. They add output formats as requirements evolve.
This follows the pattern established by Apache Wayang and the broader open source data ecosystem. Open formats and open protocols create flexibility. Vendor-neutral foundations reduce lock-in risk. The community extends capabilities that no single vendor would prioritize.
The Industry Is Converging
Multiple vendors and open source projects are moving toward storage-native streaming:
Confluent announced Tableflow for direct Iceberg integration and is developing Freight for object storage.
Apache Flink is exploring Fluss, a streaming-native lakehouse layer.
Apache Kafka has multiple KIPs under discussion (1150, 1176, 1183) for diskless or tiered architectures.
WarpStream proved the Zero Disk Architecture commercially viable and was acquired by Confluent.
AutoMQ open-sourced a Kafka implementation on S3 with their S3Stream engine.
The common thread: streaming data belongs in open formats on object storage. Processing should decouple from broker infrastructure. The event log should be directly accessible to analytical workloads.
KafScale's contribution is opening the storage layer completely. The .kfs format is public. The Iceberg Processor demonstrates broker-bypass access. The architecture enables use cases that closed storage formats cannot support.
What Changes for Data Teams
Storage-native streaming affects how teams design and operate data infrastructure:
Retention becomes a business decision, not a capacity constraint. S3 storage is cheap. Keep events for months or years without provisioning additional broker capacity. Historical replay for debugging, compliance, or reprocessing becomes practical.
Scaling separates into independent dimensions. Add broker capacity for streaming throughput. Add processors for analytical workloads. Neither affects the other. Capacity planning simplifies.
Recovery accelerates. Stateless brokers have no data to restore. They reconnect to S3 and etcd, then resume serving requests. No partition reassignment. No data synchronization. Failover happens in seconds.
Analytical access stops competing with streaming. Batch jobs, AI training, and ad-hoc queries read from S3 directly. Streaming latency remains stable regardless of analytical load.
The Lambda architecture dies. One write path to the event log. Multiple read paths optimized for different access patterns. No dual-write synchronization. No consistency bugs between batch and streaming views.
The Evolution
The shift from broker-centric to storage-native streaming follows the same trajectory as the shift from HDFS to cloud data lakes. The timeline compresses because the ecosystem learned from the first transition.
Organizations running Kafka today face a choice: continue investing in broker-centric operations or start building toward storage-native architectures. The protocol compatibility of systems like KafScale means existing producers and consumers work unchanged. Migration happens at the infrastructure layer without application rewrites.
The streaming path through brokers remains essential for low-latency workloads. The analytical path through direct S3 access enables use cases that broker-mediated access cannot support efficiently.
Both paths share the same data. Both paths use open formats. The architecture adapts to requirements rather than forcing requirements to adapt to architectural constraints.
Storage-native streaming is not a future possibility. The components exist. The economics are clear. The industry is converging. The question is execution.
About Scalytics
Scalytics Federated provides federated data processing across Spark, Flink, PostgreSQL, and cloud-native engines through a single abstraction layer. Our cost-based optimizer selects the right engine for each operation, reducing processing time while eliminating vendor lock-in.
Scalytics Copilot extends this foundation with private AI deployment: running LLMs, RAG pipelines, and ML workloads entirely within your security perimeter. Data stays where it lives. Models train where data resides. No extraction, no exposure, no third-party API dependencies.
For organizations in healthcare, finance, and government, this architecture isn't optional: it's how you deploy AI while remaining compliant with HIPAA, GDPR, and DORA. Explore our open-source foundation: Scalytics Community Edition.
Questions? Reach us on Slack or schedule a conversation.
