Distributed data processing has evolved from single engine batch systems to highly specialized compute platforms that address different workload requirements. Hadoop introduced large scale parallelism through MapReduce. Spark accelerated analytics by replacing disk based MapReduce with in memory DAG execution. Flink delivered low latency streaming with strong state management. Each of these engines is powerful within its domain, but each remains bound to an isolated execution model.
Modern data architectures have outgrown this single engine paradigm. Workloads now span transactional databases, cloud warehouses, streaming systems, object stores, ML runtimes, and vector search layers. No single engine is optimal for all of these. This fragmentation creates a new class of problems that Spark or Flink alone cannot solve. Apache Wayang was created to fill this architectural gap.
Wayang introduces a cross platform execution layer that allows pipelines to run across multiple engines without code changes. It provides a cost based optimizer that selects the most efficient engine for each part of a workload. It brings execution to where the data resides and avoids unnecessary transfers. It is a response to the reality that heterogeneous data ecosystems are now the norm, not an exception.
This article explains why Spark is no longer sufficient as a universal execution engine, how Wayang extends the distributed processing stack, and why cross platform execution is becoming an architectural requirement.
The Limits of Single Engine Distributed Processing
Spark provides strong performance for large batch workloads, iterative machine learning algorithms, and unified analytics through DataFrame and SQL APIs. It works well when the majority of a pipeline can be centralized in a Spark cluster. In practice, that assumption often fails, for several reasons:
Data fragmentation across systems
Organizations now store operational data in relational databases, analytical data in cloud warehouses, events in streaming systems, and documents in object stores. Forcing all data into a Spark cluster requires constant extraction, loading, and synchronization. This increases cost, delays processing, and introduces compliance concerns.
Engine dependency and lock in
A pipeline written for Spark cannot automatically take advantage of other engines. If a workload contains a streaming section that fits Flink, or a filter that should be executed directly inside a database, Spark is not designed to hand off execution to these systems.
Inefficient data movement
Moving large datasets to a Spark cluster is often the wrong strategy. When a small SQL filter or join could be executed inside a database, shipping the entire dataset to Spark wastes resources. Similarly, when streaming workloads require Flink semantics, forcing them into Spark adds unnecessary latency.
Operational overhead
As pipelines grow more complex, teams often combine Spark with Flink, Kafka, JDBC engines, or Python ML runtimes. The glue code required to coordinate these systems increases complexity and is difficult to optimize consistently.
Spark remains a powerful engine, but it was never designed to act as an execution orchestrator for heterogeneous systems. This architectural gap is the motivation for cross platform execution.
Why Cross Platform Execution Is the Logical Next Step
In heterogeneous architectures, datasets live in different systems for valid reasons: compliance requirements, performance constraints, cost optimizations, domain ownership, or operational workflows. Instead of relocating data, the execution framework must adapt to where data lives.
Cross platform execution provides several key capabilities:
Engine selection
A logical operator is executed on the system best suited for it. Pushdown filters run inside relational databases. Streaming operators run on Flink. Heavy batch transforms run on Spark. The developer writes one pipeline. The optimizer handles engine selection.
Data locality
Operations are pushed to the system holding the data. This reduces transfers, lowers cost, and supports compliance requirements that prohibit unnecessary data duplication.
Unified optimization
Instead of optimizing a plan for one engine, the optimizer considers multiple engines, their costs, and the cost of movement between them. This can yield execution plans that outperform any single engine strategy.
Workload portability
Pipelines are written in a logical DSL independent of the execution engine. Organizations gain flexibility to switch engines or adopt new ones without rewriting pipelines.
Cross platform execution is not a replacement for Spark or Flink. It is the architectural layer above them that coordinates execution across a distributed ecosystem.
Apache Wayang as the Cross Platform Execution Layer
Apache Wayang, created by the Scalytics team before becoming an Apache project, implements this cross platform model. It provides a unified abstraction over multiple engines and a cost based optimizer that evaluates execution alternatives.
Core capabilities include:
Logical operator model
Pipelines are expressed as logical operators such as Map, Reduce, Join, Transform, and ML tasks. These operators are independent of execution engines.
Execution backends
Wayang currently provides backends for Spark, Flink, Java, JDBC, Python, and other systems. Operators have multiple possible implementations, known as execution operators.
Cost based optimizer
For each operator, Wayang evaluates cost estimates across the available backends. It selects the combination of engines that minimizes total execution cost, including data movement between systems.
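In simplified form, the optimizer searches for the plan that minimizes

total_cost(plan) = Σ cost(operator, assigned backend) + Σ cost(data movement across engine boundaries)

This is a sketch of the model rather than Wayang's exact cost functions, which also draw on cardinality estimates for intermediate results.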
Cross engine planning
A single pipeline can combine Spark for transformations, Flink for streaming segments, and JDBC for relational filters. The plan is optimized end to end.
Data locality preservation
Wherever possible, operations stay inside the system that owns the data. This minimizes transfers and improves compliance characteristics.
This approach solves a class of problems that Spark and Flink cannot address alone.
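To make these pieces concrete, here is a minimal sketch in Wayang's Java API, following the WordCount example from the Apache Wayang documentation. Both the Java and Spark plugins are registered, so the backend for each operator is left to the optimizer; the input path is a placeholder.

```java
import java.util.Arrays;
import java.util.Collection;
import org.apache.wayang.api.JavaPlanBuilder;
import org.apache.wayang.basic.data.Tuple2;
import org.apache.wayang.core.api.Configuration;
import org.apache.wayang.core.api.WayangContext;
import org.apache.wayang.java.Java;
import org.apache.wayang.spark.Spark;

public class WordCount {
    public static void main(String[] args) {
        // Register two execution backends; the optimizer assigns each
        // logical operator to whichever backend it estimates is cheapest.
        WayangContext context = new WayangContext(new Configuration())
                .withPlugin(Java.basicPlugin())
                .withPlugin(Spark.basicPlugin());

        Collection<Tuple2<String, Integer>> counts = new JavaPlanBuilder(context)
                .withJobName("WordCount")
                .withUdfJarOf(WordCount.class)        // ships UDFs if Spark is chosen
                .readTextFile("file:/tmp/input.txt")  // placeholder input
                .flatMap(line -> Arrays.asList(line.split("\\W+")))
                .filter(token -> !token.isEmpty())
                .map(word -> new Tuple2<>(word.toLowerCase(), 1))
                .reduceByKey(
                        Tuple2::getField0,
                        (a, b) -> new Tuple2<>(a.getField0(), a.getField1() + b.getField1()))
                .collect();

        counts.forEach(System.out::println);
    }
}
```

Nothing in this plan names an engine; adopting a different backend is a matter of registering different plugins on the WayangContext, not rewriting operators.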
Spark vs Wayang: A Technical Comparison
| Dimension | Spark | Apache Wayang |
| --- | --- | --- |
| Execution model | Unified compute engine with a DAG scheduler | Orchestrator across multiple engines, including Spark |
| Optimization scope | Optimization within Spark only | Optimization across engines, including the cost of engine transitions |
| Data movement | Often requires centralizing data into a Spark cluster | Pushes computation to data sources whenever possible |
| Flexibility | Tied to Spark operators and DataFrame semantics | Independent logical operator model with multiple execution alternatives |
| Streaming support | Micro batching via Structured Streaming | Can dispatch streaming segments to Flink or other engines |
| Use case breadth | ETL, batch ML, iterative analytics | Heterogeneous pipelines requiring multiple systems and data locality |
Wayang does not replace Spark. It integrates Spark into a broader execution fabric.
Benchmarking Wayang for Cross Platform Optimization
The official Apache Wayang benchmark suite demonstrates how its cost based optimizer improves execution performance by selecting the most efficient backend for each operator. These results compare fixed engine plans (Spark only or Java only) with Wayang’s dynamic selection across Spark, Java, and other backends.
1. Avoiding Spark Overhead for Small and Medium Tasks
Benchmarks consistently show that small scale operators such as filters, projections, or lightweight transformations execute faster on the Java backend than on Spark. Spark introduces nontrivial startup, scheduling, and shuffle overhead that dominates execution for these tasks. Wayang’s optimizer detects these cases and assigns such operators to Java, reducing overall latency compared to a Spark only plan.
2. Leveraging Spark for Large Distributed Operators
For high volume operators such as distributed joins or aggregations, Spark’s parallelism yields clear performance advantages. Wayang’s optimizer identifies these operators and dispatches them to Spark while keeping earlier or intermediate stages on more lightweight backends. This selective use of Spark outperforms both Spark only and Java only execution strategies.
3. Cross Engine Plans Often Outperform Single Engine Execution
Benchmarks highlight scenarios where combining engines results in faster execution than using any one engine alone. For example:
- initial filtering on Java
- heavy transformation on Spark
- final reduction or post processing again on Java
This pattern minimizes overhead while still using distributed compute where needed. Wayang’s cost based planner identifies such mixed strategies automatically.
4. Minimizing Data Movement Across Backends
Data transfer between engines is a critical cost factor. Wayang’s optimizer incorporates data movement cost into its planning model. Benchmarks show that plans with unnecessary engine switches are pruned in favor of strategies that preserve data locality. This produces stable performance improvements across heterogeneous workloads.
5. Performance Improves as Workloads Become More Heterogeneous
The more varied the operators and data sources, the larger the performance gap between Wayang and fixed engine plans. This reflects the design goal of Wayang: optimizing pipelines that span multiple systems where no single engine is the right choice for all operations.
These results confirm that cross platform optimization is not theoretical. It provides measurable advantages in latency, resource usage, and execution efficiency compared to any fixed engine approach.
Practical Example: One Pipeline, Multiple Engines
Consider a pipeline that needs to:
- Filter customer data stored in a cloud warehouse
- Process clickstream events in real time
- Apply transformations and joins on large historical datasets
- Run a model for prediction
- Write results back to transactional systems
In Spark alone, this would require:
- Extracting relational data into Spark
- Handling streaming through micro batches
- Writing glue code for model integration
- Managing resource allocation and consistency manually
In Wayang:
- The relational filter is pushed to the database using JDBC
- The streaming section is executed on Flink
- The batch transformations run on Spark
- The model inference runs on the engine best suited for it
- Wayang generates a unified plan and coordinates dataflow across all engines
The developer writes one pipeline. The optimizer determines execution.
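As a sketch, the context for such a pipeline could register one plugin per engine, assuming the wayang-spark, wayang-flink, and wayang-postgres modules are on the classpath. The Flink and Postgres entry points shown here are assumptions based on those modules; verify them against the Wayang version in use.

```java
import org.apache.wayang.core.api.Configuration;
import org.apache.wayang.core.api.WayangContext;
import org.apache.wayang.java.Java;
import org.apache.wayang.spark.Spark;
// Assumed entry points from the wayang-flink and wayang-postgres modules:
import org.apache.wayang.flink.Flink;
import org.apache.wayang.postgres.Postgres;

public class CrossPlatformContext {
    public static WayangContext build() {
        // One context, several candidate backends. The optimizer assigns
        // each logical operator to a backend and inserts data conversions
        // wherever the plan crosses an engine boundary.
        return new WayangContext(new Configuration())
                .withPlugin(Java.basicPlugin())   // lightweight local stages
                .withPlugin(Spark.basicPlugin())  // heavy batch transformations
                .withPlugin(Flink.basicPlugin())  // streaming segments (assumed API)
                .withPlugin(Postgres.plugin());   // relational pushdown via JDBC (assumed API)
    }
}
```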
How Wayang Enables Federated Processing
Federated processing requires computation to occur inside the environments that hold the data. Wayang provides this capability by design. Because it can dispatch operators to different engines and environments, it naturally supports cases where data cannot be moved due to regulation or governance.
Wayang enables:
- Computation at the data source without relocation
- Execution across regulated domains
- Strong separation between logical pipelines and physical execution environments
- Cross platform AI and analytics without centralization
These characteristics form the technical foundation for federated execution.
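A minimal sketch of that separation, reusing the Java API from the earlier example: the logical pipeline is written once as a function of the context, and the physical environment is determined solely by the plugins registered on that context. The method name and file URLs are illustrative.

```java
import java.util.Collection;
import org.apache.wayang.api.JavaPlanBuilder;
import org.apache.wayang.core.api.Configuration;
import org.apache.wayang.core.api.WayangContext;
import org.apache.wayang.java.Java;
import org.apache.wayang.spark.Spark;

public class FederatedDemo {
    // The logical pipeline never names an engine.
    static Collection<String> nonEmptyLines(WayangContext context, String inputUrl) {
        return new JavaPlanBuilder(context)
                .withJobName("federated-demo")
                .withUdfJarOf(FederatedDemo.class)
                .readTextFile(inputUrl)
                .filter(line -> !line.isEmpty())
                .collect();
    }

    public static void main(String[] args) {
        // In-perimeter execution: only the embedded Java backend,
        // so data never leaves the environment that owns it.
        WayangContext local = new WayangContext(new Configuration())
                .withPlugin(Java.basicPlugin());

        // Cluster execution: the identical plan, now eligible for Spark.
        WayangContext cluster = new WayangContext(new Configuration())
                .withPlugin(Java.basicPlugin())
                .withPlugin(Spark.basicPlugin());

        nonEmptyLines(local, "file:/secure/zone/data.txt");
        nonEmptyLines(cluster, "hdfs://cluster/data.txt");
    }
}
```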
Scalytics Federated: Enterprise Implementation of Cross Platform and Federated Execution
Scalytics Federated extends Apache Wayang with enterprise features required for production and regulated environments. It adds:
- Multi tenant governance
- Security and access control
- Observability and lineage for distributed pipelines
- Federated execution across on prem, cloud, and edge systems
- Optimized connectors for enterprise systems
- Integration with private AI workloads
Scalytics Federated operationalizes the cross platform model and provides the controls necessary for ISVs, MSPs, and enterprises that require compliant and scalable distributed analytics.
Summary
Distributed processing has progressed from single engine batch computation to fast analytics and real time streaming. Spark and Flink remain essential tools, but they cannot address the complexity of modern heterogeneous data environments.
Apache Wayang introduces a cross platform execution layer that selects the best engine for each part of a workload, keeps computation close to data, and reduces overall execution cost. Scalytics Federated builds on Wayang to deliver this model at enterprise scale with support for federated processing across regulated domains.
In environments where data is fragmented, regulated, or distributed across multiple systems, cross platform execution is no longer optional. It is the new foundation for scalable analytics and AI.
About Scalytics
Scalytics Federated provides federated data processing across Spark, Flink, PostgreSQL, and cloud-native engines through a single abstraction layer. Our cost-based optimizer selects the right engine for each operation, reducing processing time while eliminating vendor lock-in.
Scalytics Copilot extends this foundation with private AI deployment: running LLMs, RAG pipelines, and ML workloads entirely within your security perimeter. Data stays where it lives. Models train where data resides. No extraction, no exposure, no third-party API dependencies.
For organizations in healthcare, finance, and government, this architecture isn't optional; it's how you deploy AI while remaining compliant with HIPAA, GDPR, and DORA. Explore our open-source foundation: Scalytics Community Edition
Questions? Reach us on Slack or schedule a conversation.
