Summary
Enterprise data architectures have evolved through distinct generations, each shaped by operational constraints observed at scale. Apache Hadoop enabled distributed batch processing but imposed high latency and rigid execution models. Apache Spark improved performance and developer productivity through in-memory computation, while still assuming a single execution platform. Apache Wayang, which graduated to an Apache Software Foundation Top-Level Project in December 2025, represents the next architectural step: cost-based, cross-platform data processing designed for heterogeneous, regulated, and distributed environments.
This article traces that evolution using peer-reviewed research, benchmark results, Apache project records, and independent industry coverage.
TL;DR: Hadoop solved scale. Spark solved speed. Apache Wayang eliminates the glue code, duplicate pipelines, and data movement that result when analytics spans multiple platforms. Scalytics operationalizes this for enterprise environments.
Why This Evolution Matters
Large-scale data processing frameworks evolve in response to concrete system limitations encountered in production. Each generation solved a specific class of problems while revealing new constraints as data architectures grew more complex.
Apache Hadoop enabled distributed batch processing on commodity hardware. Apache Spark improved performance and programmability through in-memory execution. Apache Wayang emerged from academic research to address a later-stage problem: how to execute analytics efficiently across multiple heterogeneous data processing platforms without binding applications to a single system.
For architects, this evolution explains why established platforms increasingly struggle in diverse environments. For senior decision-makers, it explains rising infrastructure cost, growing integration complexity, and governance challenges as organizations accumulate specialized systems that were never designed to work together. This shift toward cross-platform data processing reflects how modern big data architectures operate in practice rather than in isolation.
Apache Hadoop: The First Generation of Big Data Systems
Apache Hadoop is based on the MapReduce programming model introduced by Google engineers Jeff Dean and Sanjay Ghemawat in 2004. Doug Cutting implemented the model in what became the Apache Hadoop project, with its first public release in September 2007.
Hadoop combined three foundational components:
- The Hadoop Distributed File System (HDFS) for distributed storage on commodity hardware
- MapReduce for parallel computation using a map-shuffle-reduce paradigm
- Built-in fault tolerance through data replication and automatic task re-execution
This architecture enabled organizations to process petabyte-scale datasets and established the foundation of the big data ecosystem. Early large-scale adopters included Yahoo, Facebook, and LinkedIn.
Hadoop's design assumptions were explicit. Workloads were batch-oriented. Latency was secondary to throughput. Intermediate results were written to disk between computation stages.
Documented Limitations of Apache Hadoop MapReduce
Extensive industry and academic analysis has identified structural limitations in Hadoop MapReduce.
Disk-centric execution. Each MapReduce stage reads data from disk, processes it, and writes results back to disk before the next stage begins. For multi-stage pipelines and iterative algorithms, this creates substantial I/O overhead. IBM and AWS comparative analyses document how this disk-bound execution model limits performance for workloads requiring repeated data passes.
Batch-only processing. Hadoop MapReduce does not support native streaming or continuous processing. Data must be collected, stored, and processed in discrete batches, typically following an ETL cycle. We examine this constraint in ETL vs ELT: The Data Wrangling Showdown.
Inefficient iterative algorithms. Machine learning algorithms such as k-means clustering and gradient descent require multiple passes over the same dataset. MapReduce handles these workloads inefficiently because each iteration requires a full disk read and write cycle.
High end-to-end latency. The map-shuffle-reduce execution cycle introduces significant latency even for moderately complex jobs. Comparative benchmarks show that workloads completing in minutes on newer engines can take hours on MapReduce.
Low-level programming abstraction. Developers must explicitly manage map and reduce logic, typically in Java, with no interactive execution model and limited high-level abstractions. DataCamp's analysis identifies this as a major productivity barrier.
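To make that last point concrete, here is the canonical word-count job written against Hadoop's Java MapReduce API, following the example from the Hadoop tutorial (input and output paths are supplied on the command line and are illustrative). Even this simple aggregation requires explicit key/value types, a driver class, and a disk-backed shuffle between the two phases.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts shuffled to each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output live on HDFS; intermediate map output is spilled to
    // local disk and shuffled before the reduce phase starts.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```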
As organizations began adopting machine learning, interactive analytics, and real-time use cases, these limitations became operational bottlenecks.
Apache Spark: The Second Generation of Big Data Processing
Apache Spark was developed at UC Berkeley's AMPLab beginning in 2009 and open-sourced in 2010. Spark was designed specifically to address Hadoop MapReduce's performance and usability constraints.
The central abstraction introduced by Spark is the Resilient Distributed Dataset (RDD). RDDs are immutable, partitioned collections that can be cached in memory and reconstructed using lineage information in case of failure. Instead of persisting intermediate results to disk, Spark retains data in memory across operations.
What Apache Spark Solved
In-memory computation. By processing data in memory rather than on disk, Spark removes the primary I/O bottleneck of MapReduce. Peer-reviewed benchmarks show Spark executing iterative algorithms up to two orders of magnitude faster than Hadoop MapReduce; the sketch after these points illustrates the caching pattern behind this difference.
Unified analytics engine. Spark provides a single platform supporting batch processing, SQL queries, stream processing, machine learning, and graph analytics. This reduced the need to maintain separate systems for different workload types.
Multi-language APIs. Spark supports Scala, Python, Java, and R, expanding accessibility beyond Java-only MapReduce development.
Interactive execution. Spark's interactive shells enable exploratory data analysis and iterative development workflows that were impractical with batch-oriented MapReduce.
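The following is a minimal sketch using Spark's Java RDD API (the file path, the filter, and the iteration count are illustrative, not a benchmark). The tokenized dataset is cached after its first materialization, so every subsequent pass reads partitions from memory rather than re-scanning storage, which is the access pattern MapReduce forces back to disk.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeSketch {
  public static void main(String[] args) {
    // local[*] keeps the sketch runnable on a laptop; spark-submit supplies
    // the real master in cluster deployments.
    SparkConf conf = new SparkConf().setAppName("iterative-sketch").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Load and tokenize once, then keep the RDD in memory.
    JavaRDD<String> words = sc.textFile("hdfs:///data/corpus.txt")  // illustrative path
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
        .cache();

    // Each pass reuses the cached partitions instead of re-reading the source.
    for (int i = 1; i <= 10; i++) {
      long longWords = words.filter(w -> w.length() > 7).count();
      System.out.println("iteration " + i + ": " + longWords);
    }

    sc.stop();
  }
}
```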
The ACM paper "Apache Spark: A Unified Engine for Big Data Processing" documents adoption across more than one thousand organizations, with publicly reported deployments exceeding eight thousand nodes. As of 2025, the project has more than 36,000 GitHub stars and over 1,900 contributors.
As a result, Apache Spark replaced MapReduce as the dominant execution engine in many large-scale analytics environments. For organizations evaluating distributed processing approaches, we compare architectural tradeoffs in Data Federation vs Data Centralization.
The Emerging Limitation: Single-Platform Assumptions
While Apache Spark addressed performance and programmability, it did not address a growing architectural reality encountered in most enterprise environments.
Modern data landscapes do not consist of a single processing system.
Organizations typically operate:
- Relational databases such as PostgreSQL and MySQL
- Distributed batch engines such as Apache Spark
- Stream processing systems such as Apache Flink and Kafka Streams
- Specialized systems for machine learning and graph analytics
- Data warehouses such as Snowflake, BigQuery, or Redshift
- Multiple storage backends across cloud and on-premise infrastructure
This fragmentation creates the data silo problem we analyze in Solving the AI Data Dilemma.
Research published in SIGMOD Record 2023 documents that users frequently run analytics at higher cost and lower efficiency because selecting and integrating the appropriate execution platforms is complex and error-prone.
When workloads span multiple systems, developers must manually orchestrate execution, manage data movement, and maintain platform-specific code paths. This challenge is commonly referred to in academic literature as the "zoo of platforms" problem.
Scenarios Where Single-Platform Models Break Down
The SIGMOD 2023 Apache Wayang paper identifies four recurring scenarios:
Platform independence. The same application must run on different platforms depending on data size, cost constraints, or deployment environment.
Opportunistic cross-platform execution. Performance gains are achieved by executing different stages of a pipeline on different platforms, such as large-scale preprocessing on Spark followed by local execution for final aggregation.
Mandatory cross-platform execution. Modern applications combine streaming ingestion, batch transformation, database queries, and machine learning, exceeding the capabilities of any single platform.
Polystore environments. Data resides in multiple heterogeneous storage systems that cannot be practically consolidated.
Neither Hadoop nor Spark addresses these scenarios directly.
Apache Wayang: A Research-Driven Framework for Cross-Platform Data Processing
Apache Wayang originated from research conducted at the Qatar Computing Research Institute and the Hasso Plattner Institute. The work was first presented at SIGMOD 2016 under the name Rheem and later published at VLDB 2018.
The central research question differed from that of Hadoop or Spark: how can data processing be optimized across multiple heterogeneous platforms without binding applications to a single execution engine?
This question emerged from observing enterprise environments where the optimal platform varies based on workload characteristics, data location, cost constraints, and regulatory requirements.
Platform-Independent Data Processing
Apache Wayang introduces a platform-agnostic abstraction layer for data analytics.
Users define data processing tasks using Java, Scala, Python, or SQL without specifying where execution should occur. Wayang constructs a logical plan, referred to as a Wayang plan, represented as a directed dataflow graph. Vertices represent platform-independent operators, and edges represent data flow.
This architecture explicitly separates:
- Application logic from platform-specific implementations
- Optimization decisions from user code
- Logical plans from physical execution
The framework currently supports Apache Spark, Apache Flink, PostgreSQL, GraphX, Giraph, and a native Java Streams executor. New platforms can be added by implementing operator mappings rather than rewriting applications.
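As a sketch of how this looks in code, the following is adapted from the project's published WordCount example for the Java API (plugin and operator names follow the documented API and may differ between releases; the input path is illustrative). Both the local Java backend and Spark are registered as candidate platforms, and the application itself never states where execution happens.

```java
import java.util.Arrays;
import java.util.Collection;

import org.apache.wayang.api.JavaPlanBuilder;
import org.apache.wayang.basic.data.Tuple2;
import org.apache.wayang.core.api.Configuration;
import org.apache.wayang.core.api.WayangContext;
import org.apache.wayang.java.Java;
import org.apache.wayang.spark.Spark;

public class WayangWordCount {
  public static void main(String[] args) {
    // Register candidate platforms; the optimizer chooses among them per operator.
    WayangContext wayangContext = new WayangContext(new Configuration())
        .withPlugin(Java.basicPlugin())
        .withPlugin(Spark.basicPlugin());

    JavaPlanBuilder planBuilder = new JavaPlanBuilder(wayangContext)
        .withJobName("WordCount")
        .withUdfJarOf(WayangWordCount.class);

    // A platform-independent Wayang plan: read, split, count.
    Collection<Tuple2<String, Integer>> wordCounts = planBuilder
        .readTextFile("file:///tmp/input.txt").withName("Load file")
        .flatMap(line -> Arrays.asList(line.split("\\W+"))).withName("Split words")
        .filter(token -> !token.isEmpty()).withName("Filter empty words")
        .map(word -> new Tuple2<>(word.toLowerCase(), 1)).withName("Attach counter")
        .reduceByKey(
            Tuple2::getField0,
            (t1, t2) -> new Tuple2<>(t1.getField0(), t1.getField1() + t2.getField1()))
        .withName("Add counters")
        .collect();  // only here does Wayang optimize the plan and pick platforms

    wordCounts.stream().limit(10).forEach(System.out::println);
  }
}
```

The same code runs unchanged whether the optimizer selects the local Java executor for a small input or Spark for a large one; the WordCount benchmarks discussed below describe exactly this switch.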
Cross-Platform Cost-Based Optimization
The core technical contribution of Apache Wayang is its cross-platform optimizer, documented in the VLDB Journal 2020.
Given a Wayang plan, the optimizer:
- Expands the plan by introducing all platform-specific execution alternatives for each operator
- Estimates costs including execution time, memory usage, and data movement between platforms
- Enumerates possible execution plans while applying lossless pruning techniques to manage the exponential search space
- Selects an optimal plan based on configurable objectives such as runtime, monetary cost, or energy consumption
Unlike workflow orchestrators that operate at the job level, Wayang performs optimization at the operator level. A single pipeline can execute different stages on different platforms, determined automatically by the optimizer.
For enterprises, this directly addresses a common failure mode: overprovisioning expensive platforms for workloads that do not require them while underutilizing existing systems better suited for specific stages.
Empirical Evidence and Benchmarks
Benchmark results published by the Apache Wayang project and peer-reviewed papers demonstrate measurable improvements.
Automatic platform selection. In WordCount benchmarks across dataset sizes from one gigabyte to eight hundred gigabytes, Wayang automatically routes small datasets to lightweight Java execution and larger datasets to distributed engines, avoiding unnecessary overhead.
Hybrid execution. For stochastic gradient descent, Wayang uses Apache Spark for preprocessing and data preparation, then transitions to local Java execution as the dataset size decreases. The benchmark documentation notes that this optimization is not readily apparent without specialized expertise but is implemented automatically by Wayang.
Cross-platform performance. The VLDB 2018 paper reports tasks executing more than one order of magnitude faster when using cross-platform execution compared to single-platform approaches.
Federated query execution. TPC-H benchmarks across HDFS, PostgreSQL, and object storage show Wayang executing joins and aggregations across systems without explicit user coordination or manual data movement. This federated execution model forms the foundation of Scalytics Federated, our enterprise platform for cross-platform analytics.
From Research to Production: Scalytics
Apache Wayang provides the research-backed foundation for cross-platform data processing. Scalytics, founded by the original inventors of Apache Wayang, operationalizes these capabilities for enterprise environments.
The New Stack reported in December 2025 that Scalytics uses Apache Wayang as the basis for federated data processing, enabling what it describes as a virtual data lake across heterogeneous systems.
While Wayang handles cross-platform optimization and execution, production deployments require additional capabilities that the open-source framework does not address: governance, security isolation, team collaboration, and operational control.
Operational Control and Repeatability
Enterprise data processing requires visibility across the full lifecycle of analytics jobs. Scalytics Federated adds centralized job management, execution monitoring, and scheduling for cross-platform workloads.
Jobs can be versioned, parameterized, and embedded into larger workflows while maintaining consistent behavior across executions. This repeatability is critical for regulated environments where analytics results must be explainable and reproducible over time, a requirement we address in detail in our DORA compliance guide.
Governance and Security Isolation
Access to data systems and execution platforms must be tightly controlled. Scalytics introduces role-based access control governing who can create, modify, execute, or reuse analytics jobs.
Credentials and connection details for underlying platforms remain isolated within the system. One team can define a job accessing sensitive systems while other teams execute that job without direct access to credentials or platform logins. This approach aligns with shift-left data architecture principles that move governance earlier in the pipeline.
Low-Code Pipeline Definition and Team Collaboration
While Apache Wayang exposes programmatic APIs, Scalytics Federated adds a low-code interface for composing and managing pipelines. Technical teams encapsulate complex logic once; other teams reuse these pipelines safely without embedding platform credentials into application code.
A job defined by one team can be referenced and reused by others as part of larger workflows, creating a shared catalog of trusted analytics building blocks. For organizations struggling with data silos, this eliminates duplicated logic across teams while maintaining governance.
Preserving Wayang's Cross-Platform Optimization
These enterprise features build on top of Apache Wayang's optimizer rather than replacing it. Wayang continues to determine how and where each operator executes across platforms. Scalytics focuses on managing how those executions are defined, governed, shared, and operated at scale.
This separation ensures research-backed optimization logic remains intact while making cross-platform data processing usable in production environments.
Why This Matters Now
The progression from Apache Hadoop to Apache Spark to Apache Wayang reflects a structural shift in data architecture design.
Hadoop addressed the challenge of processing data too large for a single machine.
Spark addressed the challenge of processing that data efficiently and interactively.
Wayang addresses the challenge of processing data that lives across multiple systems without forcing consolidation that may be impractical, costly, or restricted by regulation.
As data environments become more distributed, heterogeneous, and regulated, cross-platform data processing moves from an optimization to a necessity. Apache Wayang, now an Apache Software Foundation Top-Level Project, provides a research-validated foundation for this next generation of data architecture.
For a deeper exploration of how federated approaches enable AI workloads without centralization, see Federated Learning: The Ultimate Data Privacy Solution.
References
Primary Research Papers
- Apache Wayang: A Unified Data Analytics Framework (SIGMOD Record 2023)
- RHEEM: Enabling Cross-Platform Data Processing (VLDB 2018)
- RHEEMix in the Data Jungle: A Cost-Based Optimizer (VLDB Journal 2020)
- Apache Spark: A Unified Engine for Big Data Processing (Communications of the ACM 2016)
- MapReduce: Simplified Data Processing on Large Clusters (OSDI 2004)
Apache Project Resources
- Apache Wayang Official Site
- Apache Wayang Benchmarks
- ASF Top-Level Project Announcement (December 2025)
Industry Analysis
- The New Stack coverage of Apache Wayang and Scalytics (December 2025)
- IBM, AWS, and DataCamp comparative analyses of Hadoop MapReduce and Spark
About Scalytics
Scalytics Federated provides federated data processing across Spark, Flink, PostgreSQL, and cloud-native engines through a single abstraction layer. Our cost-based optimizer selects the right engine for each operation, reducing processing time while eliminating vendor lock-in.
Scalytics Copilot extends this foundation with private AI deployment: running LLMs, RAG pipelines, and ML workloads entirely within your security perimeter. Data stays where it lives. Models train where data resides. No extraction, no exposure, no third-party API dependencies.
For organizations in healthcare, finance, and government, this architecture isn't optional; it's how you deploy AI while remaining compliant with HIPAA, GDPR, and DORA. Explore our open-source foundation: Scalytics Community Edition
Questions? Reach us on Slack or schedule a conversation.
