Apache Wayang: The Complete Guide

Alexander Alten

Apache Wayang provides a systematic abstraction for cross platform data processing. It allows analytical tasks to be expressed once and executed across heterogeneous engines without rewriting logic. Wayang’s cost based optimizer evaluates execution alternatives and often improves performance, especially for mixed workloads. Benchmark results show that hybrid execution can outperform single engine execution by significant margins in certain tasks.

Scalytics Federated builds on Wayang’s abstraction to provide a virtual data layer and virtual data lake for ultra scale analytics in distributed, regulated and multi cloud environments. It operationalizes cross engine and federated execution, allowing organizations to run complex analytics without unnecessary data movement and with full governance.

This combination of abstraction and federated execution forms a robust foundation for modern data architectures.

Apache Wayang: The Foundation for Cross Platform and Federated Data Processing at Ultra Scale

Apache Wayang is an open source project designed to solve one of the most persistent challenges in data engineering: the growing fragmentation of data processing environments. Modern analytics increasingly span multiple engines, storage systems and execution models. Data resides in files, databases, data lakes, object stores and distributed systems. Processing happens on Spark clusters, Flink clusters, stream processors, vector databases, SQL engines, custom Java operators and cloud native compute layers. As organizations adopt the most suitable system for each purpose, their analytical infrastructure becomes complex, heterogeneous and difficult to optimize.

Apache Wayang addresses this fragmentation by introducing a systematic abstraction layer for data processing. Instead of building pipelines tied to specific engines, users express their logic in a platform agnostic Wayang plan. Wayang then maps that plan onto the available execution engines using a cost based optimizer that evaluates performance, data movement, cardinalities and execution alternatives. This capability is not simply a matter of portability. It enables the system to select the most efficient engines for each part of a workload, resulting in better performance, lower operational overhead and long term architectural flexibility.

This article explains the fundamental ideas behind Apache Wayang, how its optimizer works, how hybrid plans can outperform single engine execution and why this abstraction is essential for modern analytical architectures. It also introduces Scalytics Federated, which builds on Wayang’s principles to provide a virtual data layer and virtual data lake for federated, multi engine execution in enterprise environments. The combination of Wayang’s abstraction and Scalytics Federated’s federated execution allows organizations to run workloads across engines, clouds and data boundaries without rewriting pipelines or moving data unnecessarily.

Apache Wayang runs an optimization process that decides the right execution platform (for example, Apache Flink) on which to execute each operator in the Wayang plan so that overall execution time or monetary cost is reduced. All of this happens transparently to the user.

Cross Platform Data Processing and the Need for Abstraction

Most analytical workloads today require more than one execution engine. This need arises from architectural design, performance characteristics, data governance restrictions or the nature of the data itself. The following scenarios illustrate why cross platform data processing is now a standard requirement rather than an exception.

Platform independence

A single workload might run on Spark for large scale transformations, on Flink for low latency processing or on a lightweight Java Streams implementation when the dataset is small. Developers want the flexibility to switch engines without rewriting the application. This scenario is common when organizations migrate between technologies, optimize for cost or respond to new performance requirements.

Opportunistic cross platform execution

Different operators in a pipeline may be better suited to different engines. A filtering or projection step may run efficiently inside a relational database. Iterative machine learning or graph processing may run better on Spark. Small control flow logic may perform best using Java Streams. By allowing each operator to run where it performs best, execution time can be reduced significantly.

Mandatory cross platform execution

When a data store lacks support for a particular operation, such as a machine learning algorithm, data must be moved to another platform capable of performing the task. This creates necessary cross platform workflows where certain operations must occur on specialized engines.

Polystore and data lake environments

Large organizations increasingly adopt architectures where datasets span object stores, SQL engines, NoSQL databases, key value stores, warehouses, stream logs and vector databases. Analytical pipelines must operate across these distributed systems. Cross platform computation becomes a required design pattern.

Traditional approaches struggle with these requirements. Monolithic systems that integrate multiple engines internally are limited in scope and difficult to evolve. Manual integration forces teams to develop glue code, orchestration layers and custom connectors that are hard to maintain. Apache Wayang provides a systematic alternative by decoupling the logical description of computation from the physical execution strategy.

What Apache Wayang Provides

Apache Wayang introduces a general purpose abstraction layer for data processing that allows users to express logical queries or pipelines independently of the underlying engines. This separation follows the same architectural principle that made relational databases successful: users describe what they want to compute and a system determines how to execute it.

The key difference is that Wayang operates across multiple, heterogeneous engines rather than within a single system. Users write pipelines using platform agnostic operators such as Map, Filter, Reduce, Join, RepeatLoop or TableSource. Wayang takes this logical plan and constructs an execution strategy that may involve one or several engines.

This is achieved via several essential capabilities.

Logical plan and platform agnostic operators

The Wayang plan captures the intent of the computation. It contains no references to Spark, Flink or any other engine. Operators describe logical transformations that must be applied to the data.

Execution operators and platform mappings

Each Wayang operator can be implemented on one or more execution engines. Mapping rules define how a logical operator translates into a physical operator for a specific platform. For example, a Filter operator may map to a Spark RDD filter, a Flink stream filter or a Java Streams filter.
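As a rough sketch, these mapping rules can be thought of as a lookup table from logical operators to candidate engines. The class and platform names below are illustrative, not Wayang's actual types:

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch (not Wayang's actual classes): mapping rules list
// which execution engines can implement each platform-agnostic operator.
public class OperatorMappings {
    enum Platform { JAVA_STREAMS, SPARK, FLINK, POSTGRES }

    // A logical Filter can run on any engine; a TableSource only where
    // the data physically lives (here: Postgres).
    static final Map<String, List<Platform>> MAPPINGS = Map.of(
        "TableSource", List.of(Platform.POSTGRES),
        "Filter",      List.of(Platform.POSTGRES, Platform.SPARK,
                               Platform.FLINK, Platform.JAVA_STREAMS),
        "Map",         List.of(Platform.SPARK, Platform.FLINK,
                               Platform.JAVA_STREAMS),
        "ReduceBy",    List.of(Platform.SPARK, Platform.FLINK,
                               Platform.JAVA_STREAMS));

    static List<Platform> alternativesFor(String logicalOperator) {
        return MAPPINGS.getOrDefault(logicalOperator, List.of());
    }

    public static void main(String[] args) {
        System.out.println("Filter can run on: " + alternativesFor("Filter"));
    }
}
```

The optimizer later chooses among these alternatives; the mapping table only constrains what is possible.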

Data cardinality estimation

To make optimization decisions, Wayang must estimate the amount of data produced by each operator. Cardinality estimation allows the system to predict the cost of execution and data transfer.
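A simplified version of this idea multiplies an input cardinality by per-operator selectivities and propagates the estimate through the plan. The constants are invented for illustration; Wayang's actual estimator is more sophisticated:

```java
// Illustrative selectivity-based cardinality estimation (not Wayang's
// actual estimator): each operator scales its input cardinality by an
// estimated selectivity, and the estimate propagates downstream.
public class CardinalityEstimate {
    static long estimate(long inputRows, double... selectivities) {
        double rows = inputRows;
        for (double s : selectivities) rows *= s;   // e.g. a Filter keeps 10%
        return Math.round(rows);
    }

    public static void main(String[] args) {
        // 10M rows -> Filter (10% pass) -> Map (1:1) = ~1M rows downstream
        System.out.println(estimate(10_000_000L, 0.10, 1.0));
    }
}
```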

Cost based optimization

Wayang uses a cost model to evaluate the cost of executing operators on different engines. It considers computation cost, communication cost, memory, CPU usage and other metrics. The optimizer then enumerates possible plans and selects the most efficient one.
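To see why engine choice matters, consider a toy cost model with invented constants: a fixed startup overhead plus a per-tuple cost. Spark's cluster startup dominates on small inputs, while a single-node engine's per-tuple cost dominates on large ones:

```java
public class PlatformCost {
    // Toy cost model (invented constants, not Wayang's actual model):
    // fixed startup overhead plus a per-tuple processing cost.
    static double cost(double startupMs, double perTupleMs, long tuples) {
        return startupMs + perTupleMs * tuples;
    }

    static String cheaper(long tuples) {
        double spark = cost(8000, 0.0001, tuples); // high startup, cheap per tuple
        double java  = cost(5, 0.001, tuples);     // no cluster, costlier per tuple
        return spark < java ? "SPARK" : "JAVA_STREAMS";
    }

    public static void main(String[] args) {
        System.out.println(cheaper(10_000));      // small input favors Java Streams
        System.out.println(cheaper(100_000_000)); // large input favors Spark
    }
}
```

The crossover point depends entirely on the constants, which is why Wayang learns and calibrates its cost model rather than hard-coding one.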

Cross platform execution

Once the plan is generated, the executor orchestrates the execution across the selected engines. This may involve reading data from multiple stores, performing transformations on different systems and coordinating data movement between engines when necessary.

Together, these capabilities allow Apache Wayang to act as a unified execution abstraction for analytics, simplifying development and improving performance while reducing vendor lock in.

Apache Wayang: a Systematic Solution for Cross-Platform Data Processing

The research and industry communities have identified the need for a systematic solution that decouples applications from the underlying processing platforms and enables efficient cross platform data processing, transparently to applications and users.

The ultimate goal is to replicate the success of DBMSs for cross platform applications: users formulate platform agnostic data analytic tasks and an intermediate system decides on which platforms to execute each subtask with the goal of minimizing cost such as runtime or monetary cost.

Architecture of Apache Wayang

Wayang’s architecture follows a multi stage process that transforms a logical plan into an optimized physical plan.

1. Plan inflation

Given a logical operator, Wayang enumerates all platform specific execution alternatives. For example, a Map operator may have Spark, Flink and Java Streams implementations. Plan inflation expands the logical plan into a graph that contains every possible implementation route.

2. Cardinality and cost annotation

Wayang estimates the amount of data that flows between operators, the cost of executing each operator on each platform and the impact of data movement between platforms. These annotations form the foundation for optimization decisions.

3. Data movement analysis

Cross engine execution requires data transfer when the output of one engine is consumed by another. This cost is often the determining factor in plan selection. Wayang includes a model for data movement based on the ICDE 2019 research paper on cross platform movement optimization.
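The movement model can be sketched as a shortest-path search over a graph of data channels, where conversions between formats are weighted edges. The channel names and costs below are invented for illustration:

```java
import java.util.*;

// Illustrative sketch of a graph-based data movement model: channels
// (data formats) are nodes, conversions are weighted edges, and the
// cheapest conversion sequence is a shortest path.
public class MovementGraph {
    private final Map<String, Map<String, Double>> edges = new HashMap<>();

    record Step(String node, double dist) {}

    void addConversion(String from, String to, double cost) {
        edges.computeIfAbsent(from, k -> new HashMap<>()).put(to, cost);
    }

    // Dijkstra's algorithm over conversion costs.
    double cheapestConversion(String source, String target) {
        Map<String, Double> best = new HashMap<>();
        PriorityQueue<Step> pq = new PriorityQueue<>(Comparator.comparingDouble(Step::dist));
        best.put(source, 0.0);
        pq.add(new Step(source, 0.0));
        while (!pq.isEmpty()) {
            Step s = pq.poll();
            if (s.dist() > best.getOrDefault(s.node(), Double.MAX_VALUE)) continue; // stale entry
            if (s.node().equals(target)) return s.dist();
            for (var e : edges.getOrDefault(s.node(), Map.of()).entrySet()) {
                double d = s.dist() + e.getValue();
                if (d < best.getOrDefault(e.getKey(), Double.MAX_VALUE)) {
                    best.put(e.getKey(), d);
                    pq.add(new Step(e.getKey(), d));
                }
            }
        }
        return Double.POSITIVE_INFINITY; // no conversion path exists
    }

    public static void main(String[] args) {
        MovementGraph g = new MovementGraph();
        // Invented channel names and costs for illustration only.
        g.addConversion("PostgresResult", "JavaCollection", 3.0);
        g.addConversion("JavaCollection", "SparkRDD", 2.0);
        g.addConversion("PostgresResult", "CsvFile", 10.0);
        g.addConversion("CsvFile", "SparkRDD", 1.0);
        System.out.println(g.cheapestConversion("PostgresResult", "SparkRDD"));
    }
}
```

Here the two-hop route through a Java collection (cost 5) beats the file-based route (cost 11), which is the kind of decision the movement model automates.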

4. Plan enumeration

Wayang systematically enumerates possible execution plans using its annotated operator graph. Each plan is evaluated using the cost model. The optimizer selects the plan with the lowest expected cost.
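A naive version of this step for a linear pipeline can be written as an exhaustive search over operator-to-platform assignments, charging a transfer penalty whenever adjacent operators run on different platforms. Wayang's real enumerator prunes this space; the costs below are invented:

```java
// Illustrative exhaustive plan enumeration for a linear pipeline:
// assign each operator to a platform, summing compute costs plus a
// fixed transfer penalty at every platform switch.
public class PlanEnumeration {
    static double best(double[][] computeCost, double transferCost) {
        return best(computeCost, transferCost, 0, -1);
    }

    // computeCost[op][platform]; recursion over operators in plan order.
    private static double best(double[][] c, double t, int op, int prevPlatform) {
        if (op == c.length) return 0;
        double bestCost = Double.MAX_VALUE;
        for (int p = 0; p < c[op].length; p++) {
            double cost = c[op][p]
                + (prevPlatform >= 0 && prevPlatform != p ? t : 0)
                + best(c, t, op + 1, p);
            bestCost = Math.min(bestCost, cost);
        }
        return bestCost;
    }

    public static void main(String[] args) {
        // Invented costs: rows = operators (TableSource, Filter, Map),
        // columns = platforms (Postgres, Spark); 1e9 marks "unsupported".
        double INF = 1e9;
        double[][] costs = {
            {1, INF},   // TableSource: only Postgres
            {2, 20},    // Filter: cheap inside the database
            {INF, 5},   // Map: only Spark
        };
        System.out.println(best(costs, 4)); // hybrid plan: 1 + 2 + 4 + 5
    }
}
```

The cheapest plan here is hybrid: filter in Postgres, pay one transfer, map in Spark. Neither single-platform plan is even feasible.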

5. Execution

The selected plan is passed to the Wayang executor, which coordinates execution across the chosen platforms. The executor handles scheduling, operator invocation, data streaming and result retrieval.

This design transforms what would normally be a complex cross engine orchestration problem into an automated optimization process.

The key component of Apache Wayang is its cross platform optimizer. More concretely, Wayang’s optimizer tackles the problem of finding an execution plan able to run across multiple platforms that minimizes the execution cost of a given task. The following example illustrates this.

How Wayang Optimizes Execution

  0. Logical plan: the user writes platform-agnostic code (Map, Filter, Loop).
  1. Plan inflation: Wayang maps logical operators to all possible engine alternatives (Spark, Flink, Java).
  2. Cost and cardinality: Wayang estimates data volumes and computation costs for every possible route.
  3. Plan enumeration: the optimizer selects the path with the lowest combined data movement and compute cost.
  4. Hybrid execution: the executor orchestrates the workload, e.g. using Postgres for filtering and Spark for a join.

Figure 1 shows a standard Wayang plan for SGD-style algorithms when the initial data is stored in a database. In more detail, the input data points are read via a TableSource and filtered via a Filter operator. Then, they are (i) stored into a file for visualization using a CollectionSink and (ii) parsed using a Map, while the initial weights are read via a CollectionSource.

Figure 1: The SGD optimization cycle expressed as a RepeatLoop: (1) sample data points, (2) compute the gradients, (3) reduce and sum the partial results, (4) update the weights, then check for convergence. Initial data flows into the loop and the final weights flow out.

The main operations of the plan (i.e., sampling, computing the gradients of the sampled data point(s), and updating the weights) are repeated until convergence (i.e., the termination condition of RepeatLoop). The resulting weights are output in a collection.
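The loop structure above can be sketched in plain Java. This is an illustration of the RepeatLoop pattern on a one-parameter linear model, not Wayang code:

```java
import java.util.Random;

// Plain-Java sketch of the SGD loop in Figure 1: sample a data point,
// compute the gradient, update the weight, and repeat until the update
// falls below a tolerance (the RepeatLoop termination condition).
public class SgdSketch {
    // Fits y = w * x by stochastic gradient descent on squared error.
    static double fit(double[] xs, double[] ys, double lr, double tol, long seed) {
        Random rnd = new Random(seed);
        double w = 0;                                    // initial weight
        for (int iter = 0; iter < 100_000; iter++) {
            int i = rnd.nextInt(xs.length);              // 1. sample a data point
            double grad = 2 * (w * xs[i] - ys[i]) * xs[i]; // 2. compute gradient
            double step = lr * grad;                     // 3. reduce (trivial here)
            w -= step;                                   // 4. update weights
            if (Math.abs(step) < tol) return w;          // converged?
        }
        return w;
    }

    public static void main(String[] args) {
        double[] xs = {1, 2, 3, 4};
        double[] ys = {2, 4, 6, 8};                      // true weight = 2
        System.out.printf("w ~= %.3f%n", fit(xs, ys, 0.01, 1e-9, 42));
    }
}
```

In Wayang, each of these steps is a platform-agnostic operator, so the sampling and gradient computation can land on different engines than the tiny weight update.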

Given this input plan, the cross-platform optimizer passes the Wayang plan through several phases: plan inflation, operator cost estimation, data movement cost estimation and plan enumeration.

Figure 2 depicts the workflow of Wayang’s optimizer. First, given a Wayang plan, the optimizer passes the plan through a plan enrichment phase where it inflates the input plan by applying a set of mappings to actual execution operators. In other words, these mappings list how each of the platform-agnostic Wayang operators can be implemented on the different platforms with execution operators. The resulting inflated Wayang plan thus contains all of its execution alternatives. The optimizer then annotates the inflated plan with estimates for both data cardinalities and the costs of executing each execution operator. Next, it takes a graph-based approach [3] to determine how data can be moved most efficiently among different platforms and annotates the inflated plan accordingly. It then uses all these annotations to determine the optimal execution plan via an enumeration algorithm. Finally, the resulting execution plan is enacted by the executor of Apache Wayang on all the selected processing platforms.

For example, Wayang’s optimizer outputs the execution plan illustrated in Figure 3 for our SGD example in Figure 1.

Figure 2: The Wayang optimizer workflow: a platform-agnostic Wayang plan passes through (1) plan inflation (mapping logic to engine alternatives), (2) cost and cardinality estimation (data sizes and CPU load), (3) movement analysis (minimizing cross-platform transfer) and (4) enumeration (selecting the best plan based on costs), producing a physical execution plan.

This execution plan (Figure 3) is what Wayang produces for the SGD Wayang plan when Postgres, Spark, and JavaStreams are the only available platforms. The plan exploits Postgres to extract the desired data points, Spark’s high parallelism for the large input dataset and, at the same time, the low latency of JavaStreams for the small collection of weights. Also note the three additional execution operators introduced for data movement (Results2Stream, Broadcast) and for making data reusable (Cache).

Benchmarking Cross Platform Execution

Benchmarking is essential for understanding the value of multi engine execution. Apache Wayang provides official benchmark results for common analytical tasks. The use cases include word count, terasort and machine learning workloads such as logistic regression and stochastic gradient descent.

Benchmark methodology

The official Wayang benchmarks compare Spark only execution, Flink only execution, Java Streams only execution and hybrid execution with Wayang selecting engines per operator. These comparisons involve workloads of varying sizes and complexity. Metrics include runtime, data processed, shuffle volume, memory usage and efficiency of engine selection.

Single engine execution limitations

Single engine execution often performs well for certain workloads but poorly for others. Spark provides strong throughput on large datasets but higher latency on iterative or small input tasks. Flink provides low latency for streaming but may not optimize batch workloads as effectively. Java Streams performs well with small or medium data sets but lacks distributed scalability.

When a workload contains mixed characteristics, the choice of a single engine becomes suboptimal.

Hybrid execution outperforming single engines

Hybrid execution allows the system to use:

  • PostgreSQL or a database to filter data close to where it is stored
  • Spark for compute heavy stages that benefit from distributed parallelism
  • Java Streams for small collections or iterative control logic
  • Other engines where appropriate

According to the Wayang benchmark documentation, certain workloads achieve significant performance improvements when Wayang selects the optimal engine for each operator. Academic evaluations have demonstrated speedups of up to one order of magnitude for mixed workloads. These gains arise from selecting specialized engines for each stage and reducing unnecessary data movement.

Data movement reduction

Data movement is often the most expensive part of distributed analytics. Wayang’s optimizer includes explicit modeling for this cost and attempts to minimize it. Experiments published in research literature demonstrate that optimizing data movement can reduce execution time and improve scalability.

Federated benchmarking considerations

Federated execution is evaluated based on:

  • reduction of cross boundary data movement
  • compliance with locality constraints
  • throughput under multi region workloads
  • execution time when minimizing inter region transfer
  • cost in multi cloud deployments

Academic and industry experiments consistently show that minimizing data movement improves performance and reduces cost. Although results vary by workload, federated execution is especially advantageous for multi region or privacy constrained analytics.

Benchmark conclusion

Hybrid execution does not guarantee better performance for all workloads, but it frequently offers substantial improvements when workloads contain heterogeneous characteristics. Academic results confirm that speedups of up to an order of magnitude are possible for certain tasks.

Performance Comparison

SGD workload execution time (lower is better): Spark-only and Flink-only execution take on the order of minutes due to engine overhead, while Wayang’s hybrid plan completes in seconds. Why? By choosing the right engine for each specific operator (e.g., using Java Streams for small iterative loops instead of spinning up a full Spark job), Wayang reduces execution time by orders of magnitude.

We observe that the cross-platform optimizer allows Apache Wayang to run the SGD tasks more than one order of magnitude faster than any single-platform execution (Apache Spark, Apache Flink, or stand-alone Java): Apache Wayang can execute the SGD task in a few seconds, while all other processing platforms do so in the order of minutes!

Comparison with Other Frameworks

Wayang vs Spark

Spark is a distributed processing engine. Wayang is an abstraction layer that can use Spark when optimal. Wayang does not replace Spark. Instead, it allows Spark to be one of several engines.

Wayang vs Flink

Flink provides strong stream processing capabilities. Wayang integrates Flink as an execution backend when streaming semantics are required.

Wayang vs Beam

Beam provides a unified programming model but relies on runners for execution. Wayang provides a cost based optimizer and cross engine enumeration that selects optimal engines.

Wayang and Lakehouse architectures

Lakehouse systems focus on storage and metadata. Wayang focuses on computation. The two are complementary.

From Cross Platform Execution to Federated Execution

Cross platform execution addresses the problem of multiple engines inside a single environment. Federated execution extends this concept to environments where data, engines and compute resources are distributed across locations, administrative boundaries or regulatory domains.

Why federated execution is required

Federated execution is essential when:

  • data cannot be moved for regulatory reasons
  • data residency rules restrict centralization
  • different departments or organizations control different data sources
  • clouds or regions must be isolated
  • secure multi party analytics or federated learning are required

Traditional approaches rely on data exports or custom pipelines. These solutions are slow, error prone and difficult to scale.

How Wayang’s abstraction naturally extends to federation

Wayang already abstracts computation from execution engines. This abstraction can be extended to include:

  • constraints on data locality
  • restrictions on data movement
  • execution boundaries
  • federated optimizer rules

A federated environment can be viewed as a set of execution islands with limited or controlled data exchange. Wayang’s logical plan remains unchanged, but the optimizer must consider locality and privacy rules when selecting engines.
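Under stated assumptions (invented types, not an actual Scalytics or Wayang API), the pruning step of a federated optimizer might look like this: discard execution alternatives that would move a dataset outside its permitted region, then pick the cheapest of the remaining candidates.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch (invented types): locality-constrained candidate
// pruning for a federated optimizer.
public class FederatedPruning {
    record Candidate(String platform, String region, double cost) {}

    static List<Candidate> allowed(List<Candidate> candidates, String dataRegion) {
        // Locality rule: the operator must run in the data's own region.
        return candidates.stream()
                .filter(c -> c.region().equals(dataRegion))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Candidate> candidates = List.of(
            new Candidate("Spark",       "eu-west", 40.0),
            new Candidate("Spark",       "us-east", 25.0),  // cheaper, but...
            new Candidate("JavaStreams", "eu-west", 60.0));
        // ...the data must stay in eu-west, so the us-east option is pruned.
        List<Candidate> legal = allowed(candidates, "eu-west");
        Candidate best = legal.stream()
                .min((a, b) -> Double.compare(a.cost(), b.cost())).orElseThrow();
        System.out.println(best.platform() + " in " + best.region());
    }
}
```

The key point is that the cost model is applied only after the policy filter, so a cheaper but non-compliant plan can never win.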

Execution Scenarios in Which Wayang Leads

Mixed execution for ETL

A common ETL pipeline may involve reading from a database, filtering within the database engine, transforming data using Spark and aggregating results using Java Streams. Wayang’s optimizer can select this combination automatically.

Combining streaming and batch workflows

Some applications require batch preprocessing and streaming inference. Wayang can map stages of the workflow to engines optimized for each mode.

Federated analytics

Departments, business units or partners can execute joint analytics without moving data. Wayang provides the logical model and Scalytics Federated operationalizes it.

Machine learning in constrained environments

Iterative ML algorithms can run partly on local nodes and partly on distributed clusters.

Scalytics as a Virtual Data Layer and Virtual Data Lake

Scalytics extends Apache Wayang by addressing operational needs not covered by the research framework. It provides:

A virtual data layer

  • abstracts compute away from engines
  • exposes a unified execution interface
  • enables hybrid and federated computation
  • aligns with data residency and privacy constraints

A virtual data lake

  • provides a unified view of distributed storage
  • minimizes data movement
  • integrates with object stores, databases and streaming systems
  • enables consistent analytics without centralizing data

The Virtual Lakehouse

Wayang abstracts the complexity of different engines and table formats, creating a single virtual access layer. The unified virtual layer (Scalytics) sits on top of the Apache Wayang cross platform optimizer, which spans engines such as Apache Spark (batch and ML), Apache Flink (streaming) and PostgreSQL (relational), over physical table formats used by lakehouses such as Apache Hudi, Apache Iceberg and Delta Lake. Wayang plans execution across these silos without moving data unnecessarily.

Enterprise features

  • policy driven federated execution
  • security and governance
  • observability
  • resilience and workload orchestration
  • integration with machine learning systems

Scalytics operationalizes the academic innovations behind Apache Wayang for enterprise scale and federated environments.

Scalytics Federated

Scalytics Federated builds on this idea by providing:

  • a virtual data layer across distributed and isolated environments
  • a virtual data lake view of storage without centralizing data
  • federated pipeline execution
  • scheduling across heterogeneous and isolated resources
  • privacy preserving execution models
  • observability, governance and policy enforcement

This positions Scalytics Federated as the operational environment for federated analytics while Wayang provides the abstraction for cross engine execution.

What Do Apache Wayang’s Users Have to Do?

To begin using Wayang, users install the platform and activate the desired plugins. A simple program registers Java Streams, Spark or Flink plugins and expresses computation in a platform agnostic manner. Wayang determines the execution plan. When deployed inside Scalytics Federated, the same program can run across federated environments without modification.

Users simply declare their available processing platforms. For example, in Paul King’s blog post [1], users enable platform plugins via .withPlugin(Java.basicPlugin()) and .withPlugin(Spark.basicPlugin()). Additional platforms can be added in the same manner.

Adding Wayang on top of existing platforms requires only a few lines of Java code.
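A minimal program, adapted from the WordCount example in Wayang’s documentation, looks roughly like the following sketch. It is not standalone: it requires the Wayang dependencies on the classpath, and exact class names and signatures should be verified against the Wayang release in use:

```java
import java.util.Arrays;
import java.util.Collection;

import org.apache.wayang.api.JavaPlanBuilder;
import org.apache.wayang.basic.data.Tuple2;
import org.apache.wayang.core.api.WayangContext;
import org.apache.wayang.java.Java;
import org.apache.wayang.spark.Spark;

public class WordCount {
    public static void main(String[] args) {
        // Register the available platforms; Wayang's optimizer chooses among them.
        WayangContext context = new WayangContext()
                .withPlugin(Java.basicPlugin())
                .withPlugin(Spark.basicPlugin());

        Collection<Tuple2<String, Integer>> counts = new JavaPlanBuilder(context)
                .withJobName("WordCount")
                .readTextFile("file:/tmp/words.txt")   // platform-agnostic source
                .flatMap(line -> Arrays.asList(line.split("\\W+")))
                .filter(token -> !token.isEmpty())
                .map(word -> new Tuple2<>(word.toLowerCase(), 1))
                .reduceByKey(Tuple2::getField0,
                        (a, b) -> new Tuple2<>(a.getField0(), a.getField1() + b.getField1()))
                .collect();

        counts.forEach(System.out::println);
    }
}
```

Note that the computation itself never names an engine; switching platforms is only a matter of changing the registered plugins.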

Users simply specify their tasks in Apache Wayang in a platform-agnostic manner and let Wayang do the rest for them to achieve the best performance!

Expertise and Technical Provenance

Apache Wayang is developed under the Apache Software Foundation and is the product of over a decade of research into cross platform data processing, query optimization and execution strategies. The underlying optimizer is based on peer reviewed work presented at ICDE, SIGMOD and VLDB, and its architecture incorporates validated cost models, cardinality estimation strategies and cross platform enumeration techniques.

The Scalytics team includes members who have contributed to the academic foundation of Wayang and who continue to advance the state of distributed data systems, federated execution, virtual data layers and large scale analytical pipelines. Scalytics Federated operationalizes the principles of Apache Wayang by providing a unified execution abstraction, governance, scheduling, privacy controls and observability for multi cloud, hybrid and regulated data environments.

The content in this article is based on established research, reference implementations and production deployments of cross platform analytics systems, ensuring that all concepts presented are both technically sound and practically applicable.

Frequently Asked Questions

Is Wayang a replacement for Spark or Flink?

> No. Wayang uses Spark, Flink and other engines as execution backends.

Does Wayang support streaming workloads?

> Yes, through Flink or other streaming engines.

What determines whether engine A or engine B is used?

> The optimizer evaluates cost, data movement and operator characteristics.

Does federation require rewriting pipelines?

> No. The logical plan remains unchanged.

References

[1] Wayang with Groovy: https://blogs.apache.org/groovy/entry/using-groovy-with-apache-wayang

[2] Apache Wayang: https://wayang.apache.org/

[3] Sebastian Kruse, Zoi Kaoudi, Jorge-Arnulfo Quiané-Ruiz, Sanjay Chawla, Felix Naumann, Bertty Contreras-Rojas: Optimizing Cross-Platform Data Movement. ICDE 2019: 1642–1645

About Scalytics

Scalytics builds on Apache Wayang, the cross-platform data processing framework created by our founding team and now an Apache Top-Level Project. Where traditional platforms require moving data to centralized infrastructure, Scalytics brings compute to your data—enabling AI and analytics across distributed sources without violating compliance boundaries.

Scalytics Federated provides federated data processing across Spark, Flink, PostgreSQL, and cloud-native engines through a single abstraction layer. Our cost-based optimizer selects the right engine for each operation, reducing processing time while eliminating vendor lock-in.

Scalytics Copilot extends this foundation with private AI deployment: running LLMs, RAG pipelines, and ML workloads entirely within your security perimeter. Data stays where it lives. Models train where data resides. No extraction, no exposure, no third-party API dependencies.

For organizations in healthcare, finance, and government, this architecture isn't optional: it is how you deploy AI while remaining compliant with HIPAA, GDPR, and DORA. Explore our open-source foundation: Scalytics Community Edition.
