The Architecture Shift That Already Happened Once
In 2003, Google published the GFS paper. By 2006, Hadoop, its open-source descendant, was an Apache project with Yahoo as its largest backer. The core idea was simple: networks were slow, so bring compute to where the data lives. HDFS stored blocks on DataNodes. MapReduce ran computations locally. Storage and compute were tightly coupled because that was the only way to get acceptable performance.
Then networks got 400 times faster. S3 arrived, and suddenly local disk replication looked expensive and fragile by comparison. Spark proved you could separate compute from storage entirely. Data lakes followed the economics to object storage, and the shared-nothing architecture that made Hadoop successful became the constraint that killed it.
This same shift is now happening to data streaming.
What Apache Kafka Made Possible
Apache Kafka changed how organizations think about data. Before Kafka, data integration meant point-to-point connections, batch ETL jobs, and brittle synchronization logic. Kafka introduced the distributed commit log as a first-class architectural primitive. Producers and consumers decoupled. Events became the source of truth. Real-time processing became practical at scale.
The Kafka protocol is now the standard interface for streaming data. Confluent, Amazon MSK, Redpanda, and dozens of other implementations speak it. Kafka Streams, Apache Flink, and Spark Structured Streaming build on it. The ecosystem is mature and battle-tested.
But the protocol is not the constraint. The storage architecture is.
Traditional Kafka deployments couple storage and compute on the same broker nodes. Data lives on local disks. Scaling requires partition rebalancing. Recovery means moving data between nodes. Retention is limited by disk capacity. The operational complexity grows with cluster size, and the cost of long-term retention becomes prohibitive.
The protocol layer is commodity infrastructure. The innovation is happening at the storage layer.
Storage-Native: The Next Architecture
WarpStream proved the concept in 2023: you can build a Kafka-compatible system that writes directly to object storage with no local disks. Their Zero Disk Architecture eliminates interzone networking costs, removes partition rebalancing, and simplifies operations dramatically. Confluent responded with Freight. Aiven proposed KIP-1150. The entire industry is moving in this direction.
KafScale implements this architecture as open source and takes it one step further by opening the storage layer for direct access.
The difference matters. In most storage-native implementations, all reads still go through the broker layer. The storage format is an implementation detail. Consumers use the Kafka protocol for everything, whether they need millisecond latency or are running a weekly batch job.
KafScale stores data in S3 using a documented segment format (.kfs + .index). This format is part of the public specification. Any processor that understands the format can read directly from S3 without connecting to a broker.
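To make the direct-read path concrete, here is a minimal Python sketch of parsing a segment header from raw bytes. The field layout (magic, version, base offset, record count) and the `KFS0` magic are invented for illustration only; the real layout is whatever the published .kfs specification defines.

```python
import struct

# Hypothetical .kfs segment layout, for illustration only. The actual field
# order and sizes are defined by the KafScale format specification.
# header: magic (4s) | version (B) | base_offset (Q) | record_count (I)
HEADER = struct.Struct(">4sBQI")

def read_segment_header(buf: bytes) -> dict:
    """Parse the (hypothetical) fixed-size header of a .kfs segment."""
    magic, version, base_offset, count = HEADER.unpack_from(buf, 0)
    if magic != b"KFS0":
        raise ValueError("not a .kfs segment")
    return {"version": version, "base_offset": base_offset, "record_count": count}

# No broker involved: the same bytes could just as well come from
# s3.get_object(...)["Body"].read() as from a local buffer.
segment = HEADER.pack(b"KFS0", 1, 42_000, 500) + b"...records..."
print(read_segment_header(segment))
```

The point is not the particular fields but that a documented binary layout plus S3 access is the entire integration surface.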
Think about what this enables:
For streaming workloads, producers and consumers use the Kafka protocol through KafScale brokers. Low latency. Familiar APIs. Standard consumer groups. The real-time path works exactly as expected.
For analytical workloads, processors read .kfs segments directly from S3. No broker load. No consumer group coordination. No protocol overhead. The batch path bypasses the infrastructure designed for real-time.
The streaming path and the analytical path share the same data but use different access patterns optimized for their requirements.
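A sketch of the analytical path: given nothing but an S3 key listing, decide which segment objects cover an offset range. The key convention `<topic>/<partition>/<base_offset>.kfs` is an assumption made for this example, not the documented KafScale layout.

```python
# Broker-free batch reads: select segment objects for an offset range using
# only an S3 listing. No consumer group, no coordinator, no protocol session.
# The key naming scheme here is hypothetical.

def segments_for_range(keys: list[str], start: int, end: int) -> list[str]:
    """Return the segment keys whose offset span overlaps [start, end)."""
    bases = sorted((int(k.rsplit("/", 1)[1].removesuffix(".kfs")), k)
                   for k in keys if k.endswith(".kfs"))
    selected = []
    for i, (base, key) in enumerate(bases):
        # Each segment covers [base_offset, next segment's base_offset).
        next_base = bases[i + 1][0] if i + 1 < len(bases) else float("inf")
        if base < end and next_base > start:
            selected.append(key)
    return selected

listing = ["orders/0/0.kfs", "orders/0/1000.kfs", "orders/0/2000.kfs"]
print(segments_for_range(listing, 1500, 2500))
# a real reader would now issue s3.get_object for each selected key
```

A backfill job scales by running more copies of this logic against more partitions; broker capacity never enters the picture.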
The Hadoop Parallel Is Exact
The evolution from Hadoop to modern data lakes provides a precise template for what happens next in streaming.
| Hadoop Era (2006-2015) | Storage-Native Streaming (2025+) |
| --- | --- |
| HDFS blocks on DataNodes | .kfs segments in S3 |
| NameNode for metadata | etcd for topics and partitions |
| MapReduce/Spark reading HDFS | Iceberg Processor reading S3 |
| Hive tables on HDFS | Iceberg tables on .kfs data |
| Compute tied to storage nodes | Stateless brokers, independent processors |
The architectural insight is identical: when storage becomes cheap, durable, and accessible over fast networks, coupling compute to storage nodes creates unnecessary constraints. Disaggregation wins.
Hadoop's decline was not about the ideas being wrong. HDFS solved real problems. MapReduce enabled computations that were previously impossible. But the architecture assumed constraints that stopped being true. S3 changed the economics. Spark proved you could separate compute from storage. The ecosystem moved on.
Kafka's broker-centric architecture assumes similar constraints. Local disks for durability. Replication for fault tolerance. Partition assignment to specific nodes. These assumptions made sense when object storage was slow and expensive. They create unnecessary operational burden when S3 provides 11 nines of durability at $0.02 per gigabyte per month.
The Iceberg Processor: Bypassing the Broker
The KafScale Iceberg Processor demonstrates storage-native access in practice. It reads .kfs segments from S3, queries etcd for topic and partition metadata, converts data to Parquet, and writes to Apache Iceberg tables.
The broker is not involved in this path.
This architecture provides concrete operational benefits:
No broker load for batch reads. Historical analysis, backfills, and AI training jobs read from S3 without consuming broker CPU or network bandwidth. Streaming workloads continue unaffected.
No consumer group coordination. The processor tracks offsets in etcd. No rebalance storms when processors scale. No coordinator overhead. No session timeouts to tune.
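The offset-tracking claim can be sketched in a few lines. A plain dict stands in for the etcd key-value store, and the key scheme is hypothetical; the point is that one committed offset per partition replaces the entire consumer-group protocol.

```python
# Broker-free offset tracking, sketched with a dict standing in for etcd.
# The key scheme "iceberg/<topic>/<partition>" is illustrative, not the
# actual KafScale metadata layout.

class OffsetCheckpoint:
    def __init__(self, kv: dict):
        self.kv = kv  # in production: an etcd client

    def _key(self, topic: str, partition: int) -> str:
        return f"iceberg/{topic}/{partition}"

    def committed(self, topic: str, partition: int) -> int:
        """Next offset to read; 0 if this partition was never processed."""
        return int(self.kv.get(self._key(topic, partition), 0))

    def commit(self, topic: str, partition: int, next_offset: int) -> None:
        # Committing only after the Parquet write succeeds gives
        # at-least-once delivery with no rebalance protocol at all.
        self.kv[self._key(topic, partition)] = next_offset

ckpt = OffsetCheckpoint({})
start = ckpt.committed("orders", 0)    # 0 on first run
ckpt.commit("orders", 0, start + 500)  # processed 500 records
print(ckpt.committed("orders", 0))
```

Scaling processors means partitioning the key space, not renegotiating group membership.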
Unified batch and stream on one storage layer. Write once to the event log. Serve both real-time consumers and batch analytics from the same source. The dual-write patterns that plague Lambda architectures disappear.
Cost reduction at scale. S3 storage costs roughly one-tenth as much as equivalent replicated broker disk capacity. For workloads with long retention requirements, the savings compound. Analytical reads from S3 avoid broker compute costs entirely.
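That rough factor-of-ten claim is easy to sanity-check with back-of-envelope arithmetic. The prices below are illustrative assumptions (S3 standard at about $0.02/GB-month, broker-attached disk at about $0.08/GB-month, tripled by Kafka's typical replication factor of 3), not quoted figures:

```python
# Back-of-envelope retention economics under assumed, illustrative prices.
S3_PER_GB_MONTH = 0.02    # assumption: S3 standard storage
DISK_PER_GB_MONTH = 0.08  # assumption: broker-attached block storage
REPLICATION = 3           # typical Kafka replication factor

def monthly_cost(gb: float) -> tuple[float, float]:
    """(S3 cost, replicated broker-disk cost) per month for `gb` retained."""
    return gb * S3_PER_GB_MONTH, gb * DISK_PER_GB_MONTH * REPLICATION

s3, broker = monthly_cost(100_000)  # 100 TB of retained events
print(f"S3: ${s3:,.0f}/mo  broker disks: ${broker:,.0f}/mo  "
      f"ratio: {broker / s3:.0f}x")
```

Under these assumptions the gap is roughly an order of magnitude before counting the broker CPU and network freed up by moving batch reads to S3.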
The Iceberg Processor outputs data in Apache Iceberg format. Query engines like Trino, Spark, and Flink read Iceberg tables natively. The streaming event log becomes directly accessible to the entire analytical ecosystem without intermediate pipelines.
Why AI Agents Need This Architecture
Event sourcing is emerging as the foundation for agentic systems. When AI agents make decisions, they need context. That context comes from historical events. They need to understand what happened, in what order, and why the current state exists.
Traditional stream processing optimizes for latency. Milliseconds matter for fraud detection or trading systems. But AI agents reasoning over business context have different requirements. They need completeness. They need replay capability. They need to reconcile current state with historical actions.
Research from the Apache Flink community (FLIP-531) and platforms like Akka confirms this pattern. Agentic systems are distributed systems. Communication with LLMs is event-based. The ability to reproduce state at any point in time is foundational.
The insight from event sourcing applies directly: since everything in an event sourced system is captured as an immutable event, you can reliably reproduce the state of any agent at any given point in time. You know what happened and you know why.
Storage-native streaming makes this practical. The immutable log in S3 becomes the system of record that agents query, replay, and reason over. The Iceberg Processor converts that log to tables that analytical tools understand. Agents get complete historical context without competing with streaming workloads for broker resources.
Two-second latency for analytical access is acceptable when the alternative is incomplete context or degraded streaming performance. AI agents do not need sub-millisecond reads. They need the full picture.
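The replay idea is simple to demonstrate: agent state at any point in time is a fold over the immutable log. The event shapes below are toy examples, not a KafScale schema.

```python
# Minimal event-sourcing sketch: reconstruct state at any timestamp by
# replaying the immutable log through a pure transition function.
from functools import reduce

events = [
    {"ts": 1, "type": "order_placed",   "order": "A1", "amount": 120},
    {"ts": 2, "type": "order_placed",   "order": "A2", "amount": 75},
    {"ts": 3, "type": "order_refunded", "order": "A1", "amount": 120},
]

def apply(state: dict, event: dict) -> dict:
    """Pure transition: old state + event -> new state."""
    open_orders = dict(state["open"])
    if event["type"] == "order_placed":
        open_orders[event["order"]] = event["amount"]
    elif event["type"] == "order_refunded":
        open_orders.pop(event["order"], None)
    return {"open": open_orders}

def state_at(ts: int) -> dict:
    """Replay the log up to a timestamp: the agent's view at that moment."""
    return reduce(apply, (e for e in events if e["ts"] <= ts), {"open": {}})

print(state_at(2))  # both orders open
print(state_at(3))  # A1 refunded
```

Because the log is immutable, `state_at` is deterministic: an agent (or an auditor) asking "what did the system know at ts=2?" always gets the same answer.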
Apache Wayang as the Orchestration Layer
The storage-native architecture creates new possibilities for cross-platform optimization. When streaming data is accessible directly from S3, query optimizers can treat it like any other data source. Predicates push down to storage. Engine selection happens based on workload characteristics rather than system boundaries.
Apache Wayang provides the abstraction layer for this optimization. Instead of writing separate pipelines for Spark, Flink, and SQL engines, users express logic once in platform-agnostic operators. Wayang's cost-based optimizer determines the execution plan.
Academic evaluations of Wayang show that hybrid execution outperforms single-engine execution for mixed workloads, with speedups of up to an order of magnitude when the optimizer selects a specialized engine for each stage and minimizes data movement.
For streaming intelligence workloads, this means:
- Use PostgreSQL to filter data close to where it is stored
- Use Spark for compute-heavy transformations that benefit from distributed parallelism
- Use Java Streams for small collections or iterative control logic
- Use the Iceberg Processor for converting streams to analytical tables
The optimizer handles plan selection. Engineers focus on business logic. The underlying complexity of coordinating multiple engines disappears behind a unified abstraction.
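A toy version of that cost-based choice fits in a few lines. The cost constants are invented for illustration; the real Wayang optimizer estimates costs from operator cardinalities and platform profiles rather than a hard-coded table.

```python
# Toy cost-based engine selection in the spirit of Wayang's optimizer.
# Costs are invented: (fixed startup overhead, per-row processing cost).
COSTS = {
    "java_streams": (0.0,   0.5),    # no startup, but slow per row
    "postgresql":   (5.0,   0.05),   # cheap filtering near the data
    "spark":        (120.0, 0.001),  # heavy startup, massive parallelism
}

def pick_engine(rows: int) -> str:
    """Choose the engine with the lowest estimated cost for this stage."""
    return min(COSTS, key=lambda e: COSTS[e][0] + COSTS[e][1] * rows)

for rows in (10, 1000, 1_000_000):
    print(rows, "->", pick_engine(rows))
```

Even this caricature reproduces the pattern from the list above: in-process streams for tiny collections, the database for mid-sized filters, the distributed engine once parallelism pays for its startup cost.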
Scalytics Federated extends this capability to distributed environments where data cannot be centralized. When streaming data is stored in S3 with a documented format, federated processors access it without requiring broker connectivity across network boundaries. The event log becomes a portable, queryable artifact that works within data residency constraints.
The Open Source Foundation
KafScale is available under an open source license. The project includes:
- Kafka-compatible brokers with S3-native storage
- The .kfs segment format specification
- etcd-based metadata management
- The Iceberg Processor for lakehouse integration
The storage format is documented and stable. Teams build custom processors that read directly from S3. They integrate with existing Iceberg catalogs. They add output formats as requirements evolve.
This follows the pattern established by Apache Wayang and the broader open source data ecosystem. Open formats and open protocols create flexibility. Vendor-neutral foundations reduce lock-in risk. The community extends capabilities that no single vendor would prioritize.
The Industry Is Converging
Multiple vendors and open source projects are moving toward storage-native streaming:
Confluent announced Tableflow for direct Iceberg integration and is developing Freight for object storage.
Apache Flink is exploring Fluss, a streaming-native lakehouse layer.
Apache Kafka has multiple KIPs under discussion (1150, 1176, 1183) for diskless or tiered architectures.
WarpStream proved the Zero Disk Architecture commercially viable and was acquired by Confluent.
AutoMQ open-sourced a Kafka implementation on S3 with their S3Stream engine.
The common thread: streaming data belongs in open formats on object storage. Processing should decouple from broker infrastructure. The event log should be directly accessible to analytical workloads.
KafScale's contribution is opening the storage layer completely. The .kfs format is public. The Iceberg Processor demonstrates broker-bypass access. The architecture enables use cases that closed storage formats cannot support.
What Changes for Data Teams
Storage-native streaming affects how teams design and operate data infrastructure:
Retention becomes a business decision, not a capacity constraint. S3 storage is cheap. Keep events for months or years without provisioning additional broker capacity. Historical replay for debugging, compliance, or reprocessing becomes practical.
Scaling separates into independent dimensions. Add broker capacity for streaming throughput. Add processors for analytical workloads. Neither affects the other. Capacity planning simplifies.
Recovery accelerates. Stateless brokers have no data to restore. They reconnect to S3 and etcd, then resume serving requests. No partition reassignment. No data synchronization. Failover happens in seconds.
Analytical access stops competing with streaming. Batch jobs, AI training, and ad-hoc queries read from S3 directly. Streaming latency remains stable regardless of analytical load.
The Lambda architecture dies. One write path to the event log. Multiple read paths optimized for different access patterns. No dual-write synchronization. No consistency bugs between batch and streaming views.
The Evolution
The shift from broker-centric to storage-native streaming follows the same trajectory as the shift from HDFS to cloud data lakes. The timeline compresses because the ecosystem learned from the first transition.
Organizations running Kafka today face a choice: continue investing in broker-centric operations or start building toward storage-native architectures. The protocol compatibility of systems like KafScale means existing producers and consumers work unchanged. Migration happens at the infrastructure layer without application rewrites.
The streaming path through brokers remains essential for low-latency workloads. The analytical path through direct S3 access enables use cases that broker-mediated access cannot support efficiently.
Both paths share the same data. Both paths use open formats. The architecture adapts to requirements rather than forcing requirements to adapt to architectural constraints.
Storage-native streaming is not a future possibility. The components exist. The economics are clear. The industry is converging. The question is execution.
About Scalytics
Scalytics Federated provides federated data processing across Spark, Flink, PostgreSQL, and cloud-native engines through a single abstraction layer. Our cost-based optimizer selects the right engine for each operation, reducing processing time while eliminating vendor lock-in.
Scalytics Copilot extends this foundation with private AI deployment: running LLMs, RAG pipelines, and ML workloads entirely within your security perimeter. Data stays where it lives. Models train where data resides. No extraction, no exposure, no third-party API dependencies.
For organizations in healthcare, finance, and government, this architecture isn't optional: it's how you deploy AI while remaining compliant with HIPAA, GDPR, and DORA. Explore our open-source foundation: Scalytics Community Edition.
Questions? Reach us on Slack or schedule a conversation.
