Hadoop, Spark and Wayang: Cross Platform Execution and Why It Matters

Alexander Alten

Distributed data processing has evolved from single engine batch systems to highly specialized compute platforms that address different workload requirements. Hadoop introduced large scale parallelism. Spark accelerated analytics by replacing disk based MapReduce with in-memory DAG execution. Flink delivered low latency streaming with strong state management. Each of these engines is powerful within its domain, but each remains bound to an isolated execution model.

Modern data architectures have outgrown this single engine paradigm. Workloads now span transactional databases, cloud warehouses, streaming systems, object stores, ML runtimes, and vector search layers. No single engine is optimal for all of these. This fragmentation creates a new class of problems that Spark or Flink alone cannot solve. Apache Wayang was created to fill this architectural gap.

Wayang introduces a cross platform execution layer that allows pipelines to run across multiple engines without code changes. It provides a cost based optimizer that selects the most efficient engine for each part of a workload. It brings execution to where the data resides and avoids unnecessary transfers. It is a response to the reality that heterogeneous data ecosystems are now the norm, not an exception.

This article explains why Spark is no longer sufficient as a universal execution engine, how Wayang extends the distributed processing stack, and why cross platform execution is becoming an architectural requirement.

The Evolution of Distributed Processing

From centralized batch jobs to cross-platform orchestration.

  • Generation 1 (2006+), Apache Hadoop: introduced large-scale parallelism by moving compute to data on disk (MapReduce). The limit: high latency from disk I/O.
  • Generation 2 (2014+), Apache Spark: optimized for speed by replacing disk with memory and DAG execution, and became the single-engine standard. The limit: data silos and data movement.
  • Generation 3 (now), Apache Wayang: optimizes across engines, orchestrating Spark, Flink, and databases in a single logical plan. The solution: cross-platform execution.

The Limits of Single Engine Distributed Processing

Spark provides strong performance for large batch workloads, iterative machine learning algorithms, and unified analytics through DataFrame and SQL APIs. It works well when the majority of a pipeline can be centralized in a Spark cluster. In practice, this requirement is often not met. Several factors contribute to this:

Data fragmentation across systems
Organizations now store operational data in relational databases, analytical data in cloud warehouses, events in streaming systems, and documents in object stores. Forcing all data into a Spark cluster requires constant extraction, loading, and synchronization. This increases cost, delays processing, and introduces compliance concerns.

Engine dependency and lock in
A pipeline written for Spark cannot automatically take advantage of other engines. If a workload contains a streaming section that fits Flink, or a filter that should be executed directly inside a database, Spark is not designed to hand off execution to these systems.

Inefficient data movement
Moving large datasets to a Spark cluster is often the wrong strategy. When a small SQL filter or join could be executed inside a database, shipping the entire dataset to Spark wastes resources. Similarly, when streaming workloads require Flink semantics, forcing them into Spark adds unnecessary latency.

Operational overhead
As pipelines grow more complex, teams often combine Spark with Flink, Kafka, JDBC engines, or Python ML runtimes. The glue code required to coordinate these systems increases complexity and is difficult to optimize consistently.

Spark remains a powerful engine, but it was never designed to act as an execution orchestrator for heterogeneous systems. This architectural gap is the motivation for cross platform execution.

Why Cross Platform Execution Is the Logical Next Step

In heterogeneous architectures, datasets live in different systems for valid reasons: compliance requirements, performance constraints, cost optimizations, domain ownership, or operational workflows. Instead of relocating data, the execution framework must adapt to where data lives.

Cross platform execution provides several key capabilities:

Engine selection
A logical operator is executed on the system best suited for it. Pushdown filters run inside relational databases. Streaming operators run on Flink. Heavy batch transforms run on Spark. The developer writes one pipeline. The optimizer handles engine selection.

Data locality
Operations are pushed to the system holding the data. This reduces transfers, lowers cost, and supports compliance requirements that prohibit unnecessary data duplication.

Unified optimization
Instead of optimizing a plan for one engine, the optimizer considers multiple engines, their costs, and the cost of movement between them. This yields execution plans that outperform single engine strategies.

Workload portability
Pipelines are written in a logical DSL independent of the execution engine. Organizations gain flexibility to switch engines or adopt new ones without rewriting pipelines.

Cross platform execution is not a replacement for Spark or Flink. It is the architectural layer above them that coordinates execution across a distributed ecosystem.
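
To make this concrete, the sketch below uses Apache Wayang's Java API, which implements this model and is discussed in the next section. It is a minimal sketch based on the published Wayang examples: the pipeline itself stays engine agnostic, and engine selection reduces to which plugins are registered on the context.

    import org.apache.wayang.api.JavaPlanBuilder;
    import org.apache.wayang.core.api.Configuration;
    import org.apache.wayang.core.api.WayangContext;
    import org.apache.wayang.java.Java;
    import org.apache.wayang.spark.Spark;

    public class OnePipelineManyEngines {
        public static void main(String[] args) {
            // One context, several candidate backends. The optimizer decides per
            // operator which registered engine executes it; changing engines means
            // changing the plugins below, not the pipeline definition.
            WayangContext context = new WayangContext(new Configuration())
                    .withPlugin(Java.basicPlugin())    // local, low-overhead backend
                    .withPlugin(Spark.basicPlugin());  // distributed backend
                    // other backends (e.g. Flink, JDBC) register the same way

            JavaPlanBuilder pipeline = new JavaPlanBuilder(context)
                    .withJobName("one-pipeline-many-engines");
            // ... logical operators are defined on `pipeline` exactly once ...
        }
    }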

Apache Wayang as the Cross Platform Execution Layer

Apache Wayang, created by the Scalytics team before becoming an Apache project, implements this cross platform model. It provides a unified abstraction over multiple engines and a cost based optimizer that evaluates execution alternatives.

Core capabilities include:

Logical operator model
Pipelines are expressed as logical operators such as Map, Reduce, Join, Transform, and ML tasks. These operators are independent of execution engines.

Execution backends
Wayang currently provides backends for Spark, Flink, Java, JDBC, Python, and other systems. Operators have multiple possible implementations, known as execution operators.

Cost based optimizer
For each operator, Wayang evaluates cost estimates for each backend. It selects the combination of engines that minimizes total execution cost, including data movement between systems.

Cross engine planning
A single pipeline can combine Spark for transformations, Flink for streaming segments, and JDBC for relational filters. The plan is optimized end to end.

Data locality preservation
Wherever possible, operations stay inside the system that owns the data. This minimizes transfers and improves compliance characteristics.

This approach solves a class of problems that Spark and Flink cannot address alone.
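
The following sketch shows these capabilities together on a small word count plan, modeled on the published Wayang example. The input path is a placeholder, and both the Java and Spark plugins are assumed to be on the classpath; which backend executes each operator is left to the cost based optimizer.

    import java.util.Arrays;
    import java.util.Collection;
    import org.apache.wayang.api.JavaPlanBuilder;
    import org.apache.wayang.basic.data.Tuple2;
    import org.apache.wayang.core.api.Configuration;
    import org.apache.wayang.core.api.WayangContext;
    import org.apache.wayang.java.Java;
    import org.apache.wayang.spark.Spark;

    public class LogicalOperatorsSketch {
        public static void main(String[] args) {
            WayangContext context = new WayangContext(new Configuration())
                    .withPlugin(Java.basicPlugin())   // lightweight local backend
                    .withPlugin(Spark.basicPlugin()); // distributed backend

            JavaPlanBuilder builder = new JavaPlanBuilder(context)
                    .withJobName("wordcount-sketch");

            // Logical operators: read, flatMap, filter, map, reduceByKey.
            // None of them is tied to a specific engine.
            Collection<Tuple2<String, Integer>> counts = builder
                    .readTextFile("file:///tmp/input.txt")  // placeholder path
                    .flatMap(line -> Arrays.asList(line.split("\\W+")))
                    .filter(token -> !token.isEmpty())
                    .map(word -> new Tuple2<>(word.toLowerCase(), 1))
                    .reduceByKey(
                            Tuple2::getField0,
                            (a, b) -> new Tuple2<>(a.getField0(), a.getField1() + b.getField1()))
                    .collect();

            counts.forEach(System.out::println);
        }
    }

In the published example, the builder is additionally pointed at the jar containing the lambdas via withUdfJarOf so that Spark can ship the user code; that detail is omitted here for brevity.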

Spark vs Wayang: A Technical Comparison

Execution Model
Spark: unified compute engine with a DAG scheduler.
Wayang: orchestrator across multiple engines including Spark.

Optimization Scope
Spark: optimization within Spark only.
Wayang: optimization across engines, including the cost of engine transitions.

Data Movement
Spark: often requires centralizing data into a Spark cluster.
Wayang: pushes computation to data sources whenever possible.

Flexibility
Spark: tied to Spark operators and DataFrame semantics.
Wayang: independent logical operator model with multiple execution alternatives.

Streaming Support
Spark: micro batching for structured streaming.
Wayang: can dispatch streaming segments to Flink or other engines natively.

Use Case Breadth
Spark: ETL, batch ML, iterative analytics.
Wayang: heterogeneous pipelines requiring multiple systems and data locality.

Wayang does not replace Spark. It integrates Spark into a broader execution fabric.

Benchmarking Wayang for Cross Platform Optimization

The official Apache Wayang benchmark suite demonstrates how its cost based optimizer improves execution performance by selecting the most efficient backend for each operator. These results compare fixed engine plans (Spark only or Java only) with Wayang’s dynamic selection across Spark, Java, and other backends.
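
Such a comparison can be reproduced with the Wayang API simply by varying which plugins are registered: a Spark only context, a Java only context, and a mixed context where the optimizer chooses per operator. The sketch below assumes the standard Java and Spark plugins; runPipeline is a hypothetical helper that submits the same logical plan on each context and measures wall clock time.

    import org.apache.wayang.core.api.Configuration;
    import org.apache.wayang.core.api.WayangContext;
    import org.apache.wayang.java.Java;
    import org.apache.wayang.spark.Spark;

    public class FixedVsDynamicPlans {
        public static void main(String[] args) {
            // Fixed plan: every operator must run on Spark.
            WayangContext sparkOnly = new WayangContext(new Configuration())
                    .withPlugin(Spark.basicPlugin());

            // Fixed plan: every operator must run on the local Java backend.
            WayangContext javaOnly = new WayangContext(new Configuration())
                    .withPlugin(Java.basicPlugin());

            // Dynamic plan: the cost-based optimizer chooses per operator,
            // e.g. lightweight filters on Java, large aggregations on Spark.
            WayangContext dynamic = new WayangContext(new Configuration())
                    .withPlugin(Java.basicPlugin())
                    .withPlugin(Spark.basicPlugin());

            for (WayangContext ctx : new WayangContext[]{sparkOnly, javaOnly, dynamic}) {
                long start = System.nanoTime();
                runPipeline(ctx);
                System.out.printf("elapsed: %.2f s%n", (System.nanoTime() - start) / 1e9);
            }
        }

        // Hypothetical helper: build a JavaPlanBuilder on `ctx`, define the same
        // logical plan as in the earlier sketch, and collect the result.
        static void runPipeline(WayangContext ctx) {
        }
    }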

1. Avoiding Spark Overhead for Small and Medium Tasks

Benchmarks consistently show that small scale operators such as filters, projections, or lightweight transformations execute faster on the Java backend than on Spark. Spark introduces nontrivial startup, scheduling, and shuffle overhead that dominates execution for these tasks. Wayang’s optimizer detects these cases and assigns such operators to Java, reducing overall latency compared to a Spark only plan.

2. Leveraging Spark for Large Distributed Operators

For high volume operators such as distributed joins or aggregations, Spark’s parallelism yields clear performance advantages. Wayang’s optimizer identifies these operators and dispatches them to Spark while keeping earlier or intermediate stages on more lightweight backends. This selective use of Spark outperforms both Spark only and Java only execution strategies.

3. Cross Engine Plans Often Outperform Single Engine Execution

Benchmarks highlight scenarios where combining engines results in faster execution than using any one engine alone. For example:

  • initial filtering on Java
  • heavy transformation on Spark
  • final reduction or post processing again on Java

This pattern minimizes overhead while still using distributed compute where needed. Wayang’s cost based planner identifies such mixed strategies automatically.

4. Minimizing Data Movement Across Backends

Data transfer between engines is a critical cost factor. Wayang’s optimizer incorporates data movement cost into its planning model. Benchmarks show that plans with unnecessary engine switches are pruned in favor of strategies that preserve data locality. This produces stable performance improvements across heterogeneous workloads.

5. Performance Improves as Workloads Become More Heterogeneous

The more varied the operators and data sources, the larger the performance gap between Wayang and fixed engine plans. This reflects the design goal of Wayang: optimizing pipelines that span multiple systems where no single engine is the right choice for all operations.

These results confirm that cross platform optimization is not theoretical. It provides measurable advantages in latency, resource usage, and execution efficiency compared to any fixed engine approach.

Practical Example: One Pipeline, Multiple Engines

Consider a pipeline that needs to:

  • Filter customer data stored in a cloud warehouse
  • Process clickstream events in real time
  • Apply transformations and joins on large historical datasets
  • Run a model for prediction
  • Write results back to transactional systems

In Spark alone, this would require:

  • Extracting relational data into Spark
  • Handling streaming through micro batches
  • Writing glue code for model integration
  • Managing resource allocation and consistency manually

In Wayang:

  • The relational filter is pushed to the database using JDBC
  • The streaming section is executed on Flink
  • The batch transformations run on Spark
  • The model inference runs on the engine best suited for it
  • Wayang generates a unified plan and coordinates dataflow across all engines

The developer writes one pipeline. The optimizer determines execution.
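
A hedged sketch of how such a plan can be assembled is shown below. The Java and Spark plugins follow the published Wayang examples; the Postgres plugin, the PostgresTableSource class, the readTable entry point, and the JDBC configuration keys are assumptions about Wayang's JDBC module and may differ between versions, and the Flink and model inference segments are indicated only as comments.

    import org.apache.wayang.api.JavaPlanBuilder;
    import org.apache.wayang.core.api.Configuration;
    import org.apache.wayang.core.api.WayangContext;
    import org.apache.wayang.java.Java;
    import org.apache.wayang.postgres.Postgres;                       // assumed module entry point
    import org.apache.wayang.postgres.operators.PostgresTableSource;  // assumed class name
    import org.apache.wayang.spark.Spark;

    public class HeterogeneousPipelineSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // JDBC property keys are assumptions; check your Wayang version.
            conf.setProperty("wayang.postgres.jdbc.url", "jdbc:postgresql://warehouse:5432/crm");
            conf.setProperty("wayang.postgres.jdbc.user", "analyst");
            conf.setProperty("wayang.postgres.jdbc.password", "secret");

            WayangContext context = new WayangContext(conf)
                    .withPlugin(Java.basicPlugin())
                    .withPlugin(Spark.basicPlugin())
                    .withPlugin(Postgres.plugin());  // enables pushdown into the database

            JavaPlanBuilder builder = new JavaPlanBuilder(context)
                    .withJobName("heterogeneous-pipeline");

            // Relational filter: eligible for pushdown into Postgres instead of
            // shipping the whole table to the cluster first. Field accessors on
            // Record are assumed here.
            builder.readTable(new PostgresTableSource("customers", "id", "country", "revenue"))
                    .filter(r -> "DE".equals(r.getField(1)))
                    .map(r -> r.getField(0) + "," + r.getField(2))
                    // Heavy joins and transformations over historical data would follow;
                    // with the Spark plugin registered, the optimizer can place them on Spark.
                    // Streaming ingestion (Flink) and model inference are omitted from this sketch.
                    .collect()
                    .forEach(System.out::println);
        }
    }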

The Optimizer in Action

How one logical pipeline becomes three physical execution plans.

  • The logical plan: the developer writes one pipeline (Java, Scala, or Python) and hands it to the Wayang cost-based optimizer.
  • PostgreSQL (projection and filter): pushed down to the source, reducing data transfer by 90% before the data leaves the database.
  • Apache Spark (batch ML training): heavy iterative model training is sent to the cluster for parallel processing.

How Wayang Enables Federated Processing

Federated processing requires computation to occur inside the environments that hold the data. Wayang provides this capability by design. Because it can dispatch operators to different engines and environments, it naturally supports cases where data cannot be moved due to regulation or governance.

Wayang enables:

  • Computation at the data source without relocation
  • Execution across regulated domains
  • Strong separation between logical pipelines and physical execution environments
  • Cross platform AI and analytics without centralization

These characteristics form the technical foundation for federated execution.

Scalytics Federated: Enterprise Implementation of Cross Platform and Federated Execution

Scalytics Federated extends Apache Wayang with enterprise features required for production and regulated environments. It adds:

  • Multi tenant governance
  • Security and access control
  • Observability and lineage for distributed pipelines
  • Federated execution across on prem, cloud, and edge systems
  • Optimized connectors for enterprise systems
  • Integration with private AI workloads

Scalytics Federated operationalizes the cross platform model and provides the controls necessary for ISVs, MSPs, and enterprises that require compliant and scalable distributed analytics.

Summary

Distributed processing has progressed from single engine batch computation to fast analytics and real time streaming. Spark and Flink remain essential tools, but they cannot address the complexity of modern heterogeneous data environments.

Apache Wayang introduces a cross platform execution layer that selects the best engine for each part of a workload, keeps computation close to data, and reduces overall execution cost. Scalytics Federated builds on Wayang to deliver this model at enterprise scale with support for federated processing across regulated domains.

In environments where data is fragmented, regulated, or distributed across multiple systems, cross platform execution is no longer optional. It is the new foundation for scalable analytics and AI.

About Scalytics

Scalytics builds on Apache Wayang, the cross-platform data processing framework created by our founding team and now an Apache Top-Level Project. Where traditional platforms require moving data to centralized infrastructure, Scalytics brings compute to your data—enabling AI and analytics across distributed sources without violating compliance boundaries.

Scalytics Federated provides federated data processing across Spark, Flink, PostgreSQL, and cloud-native engines through a single abstraction layer. Our cost-based optimizer selects the right engine for each operation, reducing processing time while eliminating vendor lock-in.

Scalytics Copilot extends this foundation with private AI deployment: running LLMs, RAG pipelines, and ML workloads entirely within your security perimeter. Data stays where it lives. Models train where data resides. No extraction, no exposure, no third-party API dependencies.

For organizations in healthcare, finance, and government, this architecture isn't optional; it is how you deploy AI while remaining compliant with HIPAA, GDPR, and DORA.

Explore our open-source foundation: Scalytics Community Edition

Questions? Reach us on Slack or schedule a conversation.