Summary
Enterprise data architectures have evolved through distinct generations, each shaped by operational constraints observed at scale. Apache Hadoop enabled distributed batch processing but imposed high latency and rigid execution models. Apache Spark improved performance and developer productivity through in-memory computation, while still assuming a single execution platform. Apache Wayang, which graduated to an Apache Software Foundation Top-Level Project in December 2025, represents the next architectural step: cost-based, cross-platform data processing designed for heterogeneous, regulated, and distributed environments.
This article traces that evolution using peer-reviewed research, benchmark results, Apache project records, and independent industry coverage.
TL;DR: Hadoop solved scale. Spark solved speed. Apache Wayang eliminates the glue code, duplicate pipelines, and data movement that result when analytics spans multiple platforms. Scalytics operationalizes this for enterprise environments.
Why This Evolution Matters
Large-scale data processing frameworks evolve in response to concrete system limitations encountered in production. Each generation solved a specific class of problems while revealing new constraints as data architectures grew more complex.
Apache Hadoop enabled distributed batch processing on commodity hardware. Apache Spark improved performance and programmability through in-memory execution. Apache Wayang emerged from academic research to address a later-stage problem: how to execute analytics efficiently across multiple heterogeneous data processing platforms without binding applications to a single system.
For architects, this evolution explains why established platforms increasingly struggle in diverse environments. For senior decision-makers, it explains rising infrastructure cost, growing integration complexity, and governance challenges as organizations accumulate specialized systems that were never designed to work together. This shift toward cross-platform data processing reflects how modern big data architectures operate in practice rather than in isolation.
Apache Hadoop: The First Generation of Big Data Systems
Apache Hadoop is based on the MapReduce programming model introduced by Google engineers Jeff Dean and Sanjay Ghemawat in 2004. Doug Cutting implemented the model in what became the Apache Hadoop project, with its first public release in September 2007.
Hadoop combined three foundational components:
- The Hadoop Distributed File System (HDFS) for distributed storage on commodity hardware
- MapReduce for parallel computation using a map-shuffle-reduce paradigm
- Built-in fault tolerance through data replication and automatic task re-execution
This architecture enabled organizations to process petabyte-scale datasets and established the foundation of the big data ecosystem. Early large-scale adopters included Yahoo, Facebook, and LinkedIn.
Hadoop's design assumptions were explicit. Workloads were batch-oriented. Latency was secondary to throughput. Intermediate results were written to disk between computation stages.
Documented Limitations of Apache Hadoop MapReduce
Extensive industry and academic analysis has identified structural limitations in Hadoop MapReduce.
Disk-centric execution. Each MapReduce stage reads data from disk, processes it, and writes results back to disk before the next stage begins. For multi-stage pipelines and iterative algorithms, this creates substantial I/O overhead. IBM and AWS comparative analyses document how this disk-bound execution model limits performance for workloads requiring repeated data passes.
Batch-only processing. Hadoop MapReduce does not support native streaming or continuous processing. Data must be collected, stored, and processed in discrete batches, typically following an ETL cycle. We examine this constraint in ETL vs ELT: The Data Wrangling Showdown.
Inefficient iterative algorithms. Machine learning algorithms such as k-means clustering and gradient descent require multiple passes over the same dataset. MapReduce handles these workloads inefficiently because each iteration requires a full disk read and write cycle.
High end-to-end latency. The map-shuffle-reduce execution cycle introduces significant latency even for moderately complex jobs. Comparative benchmarks show that workloads completing in minutes on newer engines can take hours on MapReduce.
Low-level programming abstraction. Developers must explicitly manage map and reduce logic, typically in Java, with no interactive execution model and limited high-level abstractions. DataCamp's analysis identifies this as a major productivity barrier.
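To make that last point concrete, here is the canonical word-count job written against Hadoop's Java MapReduce API, following the example from the Hadoop tutorial (input and output paths are supplied on the command line and are illustrative). Even this simple aggregation requires explicit key/value types, a driver class, and a disk-backed shuffle between the two phases.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts shuffled to each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output live on HDFS; intermediate map output is spilled to
    // local disk and shuffled before the reduce phase starts.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```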
As organizations began adopting machine learning, interactive analytics, and real-time use cases, these limitations became operational bottlenecks.
Apache Spark: The Second Generation of Big Data Processing
Apache Spark was developed at UC Berkeley's AMPLab beginning in 2009 and open-sourced in 2010. Spark was designed specifically to address Hadoop MapReduce's performance and usability constraints.
The central abstraction introduced by Spark is the Resilient Distributed Dataset (RDD). RDDs are immutable, partitioned collections that can be cached in memory and reconstructed using lineage information in case of failure. Instead of persisting intermediate results to disk, Spark retains data in memory across operations.
What Apache Spark Solved
In-memory computation. By processing data in memory rather than on disk, Spark removes the primary I/O bottleneck of MapReduce. Peer-reviewed benchmarks show Spark executing iterative algorithms up to two orders of magnitude faster than Hadoop MapReduce; the sketch after these points illustrates the caching pattern behind this difference.
Unified analytics engine. Spark provides a single platform supporting batch processing, SQL queries, stream processing, machine learning, and graph analytics. This reduced the need to maintain separate systems for different workload types.
Multi-language APIs. Spark supports Scala, Python, Java, and R, expanding accessibility beyond Java-only MapReduce development.
Interactive execution. Spark's interactive shells enable exploratory data analysis and iterative development workflows that were impractical with batch-oriented MapReduce.
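The following is a minimal sketch using Spark's Java RDD API (the file path, the filter, and the iteration count are illustrative, not a benchmark). The tokenized dataset is cached after its first materialization, so every subsequent pass reads partitions from memory rather than re-scanning storage, which is the access pattern MapReduce forces back to disk.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeSketch {
  public static void main(String[] args) {
    // local[*] keeps the sketch runnable on a laptop; spark-submit supplies
    // the real master in cluster deployments.
    SparkConf conf = new SparkConf().setAppName("iterative-sketch").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Load and tokenize once, then keep the RDD in memory.
    JavaRDD<String> words = sc.textFile("hdfs:///data/corpus.txt")  // illustrative path
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
        .cache();

    // Each pass reuses the cached partitions instead of re-reading the source.
    for (int i = 1; i <= 10; i++) {
      long longWords = words.filter(w -> w.length() > 7).count();
      System.out.println("iteration " + i + ": " + longWords);
    }

    sc.stop();
  }
}
```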
The ACM paper "Apache Spark: A Unified Engine for Big Data Processing" documents adoption across more than one thousand organizations, with publicly reported deployments exceeding eight thousand nodes. As of 2025, the project has more than 36,000 GitHub stars and over 1,900 contributors.
As a result, Apache Spark replaced MapReduce as the dominant execution engine in many large-scale analytics environments. For organizations evaluating distributed processing approaches, we compare architectural tradeoffs in Data Federation vs Data Centralization.
The Emerging Limitation: Single-Platform Assumptions
While Apache Spark addressed performance and programmability, it did not address a growing architectural reality encountered in most enterprise environments.
Modern data landscapes do not consist of a single processing system.
Organizations typically operate:
- Relational databases such as PostgreSQL and MySQL
- Distributed batch engines such as Apache Spark
- Stream processing systems such as Apache Flink and Kafka Streams
- Specialized systems for machine learning and graph analytics
- Data warehouses such as Snowflake, BigQuery, or Redshift
- Multiple storage backends across cloud and on-premise infrastructure
This fragmentation creates the data silo problem we analyze in Solving the AI Data Dilemma.
Research published in SIGMOD Record 2023 documents that users frequently run analytics at higher cost and lower efficiency because selecting and integrating the appropriate execution platforms is complex and error-prone.
When workloads span multiple systems, developers must manually orchestrate execution, manage data movement, and maintain platform-specific code paths. This challenge is commonly referred to in academic literature as the "zoo of platforms" problem.
Scenarios Where Single-Platform Models Break Down
The SIGMOD 2023 Apache Wayang paper identifies four recurring scenarios:
Platform independence. The same application must run on different platforms depending on data size, cost constraints, or deployment environment.
Opportunistic cross-platform execution. Performance gains are achieved by executing different stages of a pipeline on different platforms, such as large-scale preprocessing on Spark followed by local execution for final aggregation.
Mandatory cross-platform execution. Modern applications combine streaming ingestion, batch transformation, database queries, and machine learning, exceeding the capabilities of any single platform.
Polystore environments. Data resides in multiple heterogeneous storage systems that cannot be practically consolidated.
Neither Hadoop nor Spark addresses these scenarios directly.
Apache Wayang: A Research-Driven Framework for Cross-Platform Data Processing
Apache Wayang originated from research conducted at the Qatar Computing Research Institute and the Hasso Plattner Institute. The work was first presented at SIGMOD 2016 under the name Rheem and later published at VLDB 2018.
The central research question differed from that of Hadoop or Spark: how can data processing be optimized across multiple heterogeneous platforms without binding applications to a single execution engine?
This question emerged from observing enterprise environments where the optimal platform varies based on workload characteristics, data location, cost constraints, and regulatory requirements.
Platform-Independent Data Processing
Apache Wayang introduces a platform-agnostic abstraction layer for data analytics.
Users define data processing tasks using Java, Scala, Python, or SQL without specifying where execution should occur. Wayang constructs a logical plan, referred to as a Wayang plan, represented as a directed dataflow graph. Vertices represent platform-independent operators, and edges represent data flow.
This architecture explicitly separates:
- Application logic from platform-specific implementations
- Optimization decisions from user code
- Logical plans from physical execution
The framework currently supports Apache Spark, Apache Flink, PostgreSQL, GraphX, Giraph, and a native Java Streams executor. New platforms can be added by implementing operator mappings rather than rewriting applications.
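As a sketch of how this looks in code, the following is adapted from the project's published WordCount example for the Java API (plugin and operator names follow the documented API and may differ between releases; the input path is illustrative). Both the local Java backend and Spark are registered as candidate platforms, and the application itself never states where execution happens.

```java
import java.util.Arrays;
import java.util.Collection;

import org.apache.wayang.api.JavaPlanBuilder;
import org.apache.wayang.basic.data.Tuple2;
import org.apache.wayang.core.api.Configuration;
import org.apache.wayang.core.api.WayangContext;
import org.apache.wayang.java.Java;
import org.apache.wayang.spark.Spark;

public class WayangWordCount {
  public static void main(String[] args) {
    // Register candidate platforms; the optimizer chooses among them per operator.
    WayangContext wayangContext = new WayangContext(new Configuration())
        .withPlugin(Java.basicPlugin())
        .withPlugin(Spark.basicPlugin());

    JavaPlanBuilder planBuilder = new JavaPlanBuilder(wayangContext)
        .withJobName("WordCount")
        .withUdfJarOf(WayangWordCount.class);

    // A platform-independent Wayang plan: read, split, count.
    Collection<Tuple2<String, Integer>> wordCounts = planBuilder
        .readTextFile("file:///tmp/input.txt").withName("Load file")
        .flatMap(line -> Arrays.asList(line.split("\\W+"))).withName("Split words")
        .filter(token -> !token.isEmpty()).withName("Filter empty words")
        .map(word -> new Tuple2<>(word.toLowerCase(), 1)).withName("Attach counter")
        .reduceByKey(
            Tuple2::getField0,
            (t1, t2) -> new Tuple2<>(t1.getField0(), t1.getField1() + t2.getField1()))
        .withName("Add counters")
        .collect();  // only here does Wayang optimize the plan and pick platforms

    wordCounts.stream().limit(10).forEach(System.out::println);
  }
}
```

The same code runs unchanged whether the optimizer selects the local Java executor for a small input or Spark for a large one; the WordCount benchmarks discussed below describe exactly this switch.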
Cross-Platform Cost-Based Optimization
The core technical contribution of Apache Wayang is its cross-platform optimizer, documented in the VLDB Journal 2020.
Given a Wayang plan, the optimizer:
- Expands the plan by introducing all platform-specific execution alternatives for each operator
- Estimates costs including execution time, memory usage, and data movement between platforms
- Enumerates possible execution plans while applying lossless pruning techniques to manage the exponential search space
- Selects an optimal plan based on configurable objectives such as runtime, monetary cost, or energy consumption
Unlike workflow orchestrators that operate at the job level, Wayang performs optimization at the operator level. A single pipeline can execute different stages on different platforms, determined automatically by the optimizer.
For enterprises, this directly addresses a common failure mode: overprovisioning expensive platforms for workloads that do not require them while underutilizing existing systems better suited for specific stages.
Empirical Evidence and Benchmarks
Benchmark results published by the Apache Wayang project and peer-reviewed papers demonstrate measurable improvements.
Automatic platform selection. In WordCount benchmarks across dataset sizes from one gigabyte to eight hundred gigabytes, Wayang automatically routes small datasets to lightweight Java execution and larger datasets to distributed engines, avoiding unnecessary overhead.
Hybrid execution. For stochastic gradient descent, Wayang uses Apache Spark for preprocessing and data preparation, then transitions to local Java execution as the dataset size decreases. The benchmark documentation notes that this optimization is not readily apparent without specialized expertise but is implemented automatically by Wayang.
Cross-platform performance. The VLDB 2018 paper reports tasks executing more than one order of magnitude faster when using cross-platform execution compared to single-platform approaches.
Federated query execution. TPC-H benchmarks across HDFS, PostgreSQL, and object storage show Wayang executing joins and aggregations across systems without explicit user coordination or manual data movement. This federated execution model forms the foundation of Scalytics Federated, our enterprise platform for cross-platform analytics.
From Research to Production: Scalytics
Apache Wayang provides the research-backed foundation for cross-platform data processing. Scalytics, founded by the original inventors of Apache Wayang, operationalizes these capabilities for enterprise environments.
The New Stack reported in December 2025 that Scalytics uses Apache Wayang as the basis for federated data processing, enabling what it describes as a virtual data lake across heterogeneous systems.
While Wayang handles cross-platform optimization and execution, production deployments require additional capabilities that the open-source framework does not address: governance, security isolation, team collaboration, and operational control.
Operational Control and Repeatability
Enterprise data processing requires visibility across the full lifecycle of analytics jobs. Scalytics Federated adds centralized job management, execution monitoring, and scheduling for cross-platform workloads.
Jobs can be versioned, parameterized, and embedded into larger workflows while maintaining consistent behavior across executions. This repeatability is critical for regulated environments where analytics results must be explainable and reproducible over time, a requirement we address in detail in our DORA compliance guide.
Governance and Security Isolation
Access to data systems and execution platforms must be tightly controlled. Scalytics introduces role-based access control governing who can create, modify, execute, or reuse analytics jobs.
Credentials and connection details for underlying platforms remain isolated within the system. One team can define a job accessing sensitive systems while other teams execute that job without direct access to credentials or platform logins. This approach aligns with shift-left data architecture principles that move governance earlier in the pipeline.
Low-Code Pipeline Definition and Team Collaboration
While Apache Wayang exposes programmatic APIs, Scalytics Federated adds a low-code interface for composing and managing pipelines. Technical teams encapsulate complex logic once; other teams reuse these pipelines safely without embedding platform credentials into application code.
A job defined by one team can be referenced and reused by others as part of larger workflows, creating a shared catalog of trusted analytics building blocks. For organizations struggling with data silos, this eliminates duplicated logic across teams while maintaining governance.
Preserving Wayang's Cross-Platform Optimization
These enterprise features build on top of Apache Wayang's optimizer rather than replacing it. Wayang continues to determine how and where each operator executes across platforms. Scalytics focuses on managing how those executions are defined, governed, shared, and operated at scale.
This separation ensures research-backed optimization logic remains intact while making cross-platform data processing usable in production environments.
Why This Matters Now
The progression from Apache Hadoop to Apache Spark to Apache Wayang reflects a structural shift in data architecture design.
Hadoop addressed the challenge of processing data too large for a single machine.
Spark addressed the challenge of processing that data efficiently and interactively.
Wayang addresses the challenge of processing data that lives across multiple systems without forcing consolidation that may be impractical, costly, or restricted by regulation.
As data environments become more distributed, heterogeneous, and regulated, cross-platform data processing moves from an optimization to a necessity. Apache Wayang, now an Apache Software Foundation Top-Level Project, provides a research-validated foundation for this next generation of data architecture.
For a deeper exploration of how federated approaches enable AI workloads without centralization, see Federated Learning: The Ultimate Data Privacy Solution.
References
Primary Research Papers
- Apache Wayang: A Unified Data Analytics Framework (SIGMOD Record 2023)
- RHEEM: Enabling Cross-Platform Data Processing (VLDB 2018)
- RHEEMix in the Data Jungle: A Cost-Based Optimizer (VLDB Journal 2020)
- Apache Spark: A Unified Engine for Big Data Processing (Communications of the ACM 2016)
- MapReduce: Simplified Data Processing on Large Clusters (OSDI 2004)
Apache Project Resources
- Apache Wayang Official Site
- Apache Wayang Benchmarks
- ASF Top-Level Project Announcement (December 2025)
Industry Analysis
- The New Stack coverage of Apache Wayang and Scalytics (December 2025)
- IBM, AWS, and DataCamp comparative analyses of Hadoop MapReduce and Spark
About Scalytics
Scalytics Federated provides federated data processing across Spark, Flink, PostgreSQL, and cloud-native engines through a single abstraction layer. Our cost-based optimizer selects the right engine for each operation, reducing processing time while eliminating vendor lock-in.
Scalytics Copilot extends this foundation with private AI deployment: running LLMs, RAG pipelines, and ML workloads entirely within your security perimeter. Data stays where it lives. Models train where data resides. No extraction, no exposure, no third-party API dependencies.
For organizations in healthcare, finance, and government, this architecture isn't optional; it's how you deploy AI while remaining compliant with HIPAA, GDPR, and DORA. Explore our open-source foundation: Scalytics Community Edition
Questions? Reach us on Slack or schedule a conversation.
