Apache Wayang: cross-platform data processing

The open execution layer for multi-platform analytics. No lock-in. No ETL. Apache-licensed.

What is Apache Wayang?

Apache Wayang is an open-source cross-platform data processing framework and Apache Software Foundation Top-Level Project. It provides a single pipeline abstraction, the WayangPlan, that sits above heterogeneous execution engines and coordinates work between them. Write once, run across Apache Spark, Apache Flink, PostgreSQL, Java Streams, and object storage backends. A cost-based optimizer decides at runtime which engine handles each operation, or splits work across several, based on data size and execution cost. No manual platform selection. No pipeline rewrites when infrastructure changes.
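To make the idea concrete, here is a minimal sketch of how a cost-based optimizer can assign each operator in a pipeline to the cheapest available engine. The cost figures, engine names, and functions below are illustrative assumptions, not Wayang's actual cost model or API:

```python
# Illustrative sketch (not Wayang's actual API): a cost-based optimizer
# that assigns each pipeline operator to the cheapest available engine.

# Hypothetical per-engine cost model: cost = startup overhead + per-row cost.
COST_MODEL = {
    "java-streams": {"startup": 0.0,    "per_row": 1.0},   # no cluster overhead, slow at scale
    "spark":        {"startup": 5000.0, "per_row": 0.01},  # high startup, cheap per row
    "postgres":     {"startup": 10.0,   "per_row": 0.1},
}

def cheapest_engine(rows, candidates):
    """Pick the engine with the lowest estimated cost for this operator."""
    def cost(engine):
        m = COST_MODEL[engine]
        return m["startup"] + m["per_row"] * rows
    return min(candidates, key=cost)

def plan(operators):
    """Assign every operator in a pipeline to an engine based on its input size."""
    return {op: cheapest_engine(rows, candidates)
            for op, (rows, candidates) in operators.items()}

# A toy pipeline: a small filter runs locally, a huge join belongs on Spark,
# a medium projection stays in the database.
pipeline = {
    "filter":  (1_000,       ["java-streams", "spark"]),
    "join":    (100_000_000, ["java-streams", "spark"]),
    "project": (50_000,      ["postgres", "spark"]),
}
print(plan(pipeline))
```

The point of the sketch is the shape of the decision: startup overhead dominates for small inputs, per-row cost dominates for large ones, so no single engine wins everywhere.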

Apache Wayang · Open-source cross-platform framework
One pipeline. Any platform.
No lock-in. No ETL overhead. Full control.
Write a single WayangPlan and let the cost-based optimizer decide whether to run it in Spark, Flink, Postgres, or a combination — based on your data size and infrastructure. No code changes required when platforms change.
[Diagram: a WayangPlan feeds a cost-model optimizer, which routes work across Apache Spark (distributed batch + ML), Apache Flink (stream processing), PostgreSQL (relational in-situ), Java Streams (single-node, small data), object storage (S3 / HDFS / local FS), and JDBC sources (any relational backend).]

Change one line to switch platforms · Mix platforms in a single pipeline · Data stays in place, no ETL · Apache Top-Level Project · Apache 2.0

The AI Readiness Check takes 3 minutes. The audit turns it into a 90-day roadmap.

The High Cost of Moving Data

Centralization was built for a world where storage was expensive and compute was cheap. Today, the equation has flipped. Moving data is the bottleneck.

Traditional ETL: The "Copy" Tax
Requires replicating data into a central lake. Incurs massive egress fees and creates governance silos.
Cloud egress cost: $90k / PB
Storage redundancy: 300% (3 copies)
Engineering time: 50% spent on maintenance
Time to insight: weeks

Scalytics Federated: Zero-Copy Intelligence
Leaves data where it lives. Pushes the query plan to the source. Pay only for the compute you use.
Cloud egress cost: >90% savings
Storage redundancy: 0% (in place)
Engineering time: 90% spent on innovation
Time to insight: minutes
* Based on standard AWS/Azure egress rates ($0.09/GB) and industry average data redundancy benchmarks.
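The $90k-per-petabyte figure follows directly from the quoted per-gigabyte rate. A quick arithmetic check, using decimal units (1 PB = 10^6 GB):

```python
# Verify the egress figure from the footnote above: $0.09 per GB of egress.
egress_rate_per_gb = 0.09      # USD, the standard AWS/Azure rate cited above
gb_per_pb = 1_000_000          # decimal units: 1 PB = 10^6 GB
cost_per_pb = egress_rate_per_gb * gb_per_pb
print(f"${cost_per_pb:,.0f} per PB")  # $90,000 per PB
```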

Performance validated across three real-world scenarios.

Official benchmarks · wayang.apache.org
Proven performance across three scenarios
Multi-platform execution. Measured. Documented. Open.
Published by the Apache Wayang project. Full methodology, datasets, and result charts available at the source links below each scenario.
~2× faster
vs. Postgres on TPC-H Q5
Federated relational query across multiple independent databases
Wayang splits TPC-H Q5 across Postgres and Spark automatically — selection and projection stay in Postgres, large joins execute in Spark. No data migration required. Matches Spark performance without the ETL overhead Spark needs to ingest the data first.
TPC-H benchmark 1GB → 100GB HDFS + Postgres + S3
Full results at wayang.apache.org →
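The split described above follows a pushdown pattern: operators that are cheap to run at the source (scans, selections, projections) stay in the database, while large joins and aggregations move to a distributed engine. A hypothetical rule-based sketch of that placement, not Wayang's actual planner:

```python
# Hypothetical sketch of the cross-engine split described above: pushdown-friendly
# operators stay in the source database; heavy operators scale out on Spark.

PUSHDOWN_OPS = {"scan", "select", "project"}   # cheap to run where the data lives

def place_operators(query_ops):
    """Assign each operator of a relational plan to Postgres or Spark."""
    placement = {}
    for op in query_ops:
        if op["kind"] in PUSHDOWN_OPS:
            placement[op["name"]] = "postgres"   # no raw data leaves the database
        else:
            placement[op["name"]] = "spark"      # large joins / aggregates scale out
    return placement

# Simplified shape of TPC-H Q5: scans and filters first, then multi-way joins.
q5 = [
    {"name": "scan_customer", "kind": "scan"},
    {"name": "filter_region", "kind": "select"},
    {"name": "project_cols",  "kind": "project"},
    {"name": "join_orders",   "kind": "join"},
    {"name": "join_lineitem", "kind": "join"},
    {"name": "agg_revenue",   "kind": "aggregate"},
]
print(place_operators(q5))
```

Only the filtered, projected intermediate result crosses the engine boundary, which is why the ETL ingestion step Spark would otherwise need disappears.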
10× faster
vs. MLlib and SystemML on large datasets
Machine learning cost reduction using multiple execution systems
Stochastic gradient descent benchmarked against MLlib (Spark) and SystemML (IBM). Wayang preprocesses in Spark, then switches to local Java for gradient computation as the dataset shrinks. The optimizer identifies this transition automatically — no user intervention required.
higgs ~11M points rcv1 677K points synthetic 88M points
Full results at wayang.apache.org →
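The automatic transition described above can be sketched as a size-threshold rule evaluated per iteration: while the working set is large, run distributed; once it shrinks below the point where cluster overhead dominates, drop to single-node execution. The threshold, engine names, and shrink schedule below are made-up illustrations:

```python
# Illustrative sketch of the mid-pipeline engine switch described above.
SINGLE_NODE_THRESHOLD = 1_000_000   # rows; hypothetical cut-over point

def choose_engine(n_rows):
    """Distributed engine for big working sets, local execution for small ones."""
    return "spark" if n_rows > SINGLE_NODE_THRESHOLD else "java-local"

def sgd_schedule(initial_rows, shrink_factor, iterations):
    """Record which engine each pass would run on as the sample shrinks."""
    schedule, rows = [], initial_rows
    for _ in range(iterations):
        schedule.append(choose_engine(rows))
        rows = int(rows * shrink_factor)   # e.g. aggressive subsampling per pass
    return schedule

# 88M synthetic points shrinking 10x per pass: the first passes run distributed,
# the rest local; the switch happens with no user intervention.
print(sgd_schedule(88_000_000, 0.1, 5))
```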
Auto-select
optimal platform per dataset size
Platform adaptation for big data analytics (WordCount, 1GB → 800GB)
Wayang consistently chose the fastest available platform — Java Streams for small datasets, Apache Spark for large — across Wikipedia abstracts from 1GB to 800GB. Zero manual platform selection, zero code changes between runs. Eliminates per-workload infrastructure benchmarking.
1GB → 800GB Spark vs Flink vs Java HDFS
Full results at wayang.apache.org →

Wayang and federated learning

Federated learning trains AI models across distributed data sources without centralizing the underlying data. The training computation runs where each dataset lives; only model updates, never raw records, move between nodes. This preserves data privacy, satisfies cross-border data residency requirements, and eliminates the egress and storage costs of assembling a centralized training corpus.
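A minimal federated-averaging sketch of this idea: each node computes a model update on its local data, and only the updates are shared and combined, weighted by node dataset size. This is a pure illustration of the FedAvg pattern, not Scalytics' or Wayang's implementation:

```python
# Minimal FedAvg sketch: gradients move, raw records never do.

def local_update(weights, local_data, lr=0.1):
    """One gradient step for a 1-D least-squares model, computed where the data lives."""
    grad = sum(2 * (weights * x - y) * x for x, y in local_data) / len(local_data)
    return weights - lr * grad

def federated_round(global_w, nodes):
    """Aggregate per-node updates, weighted by node dataset size (FedAvg)."""
    total = sum(len(d) for d in nodes)
    updates = [local_update(global_w, d) for d in nodes]   # raw rows stay put
    return sum(w * len(d) for w, d in zip(updates, nodes)) / total

# Two nodes each holding samples of y = 2x; neither node's data leaves it.
node_a = [(1.0, 2.0), (2.0, 4.0)]
node_b = [(3.0, 6.0)]
w = 0.0
for _ in range(50):
    w = federated_round(w, [node_a, node_b])
print(round(w, 3))  # converges toward the true slope, 2.0
```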

Apache Wayang is the right execution substrate for this because it was built for in-situ, distributed computation across heterogeneous systems. Scalytics uses Wayang's execution model to coordinate federated training across nodes, enforce data locality, and route aggregated gradients, not training data, through the system.

Capability comparison
Apache Wayang vs single-platform alternatives
Middleware, not a replacement. Orchestrates what you already run.
Wayang sits above your existing engines and coordinates work between them. It doesn't replace Spark or Flink — it makes them interoperable.
Capability | Apache Wayang | Apache Spark | Apache Flink | Postgres / JDBC
Cross-platform pipeline execution | ✓ Native | — Single engine | — Single engine | — Single engine
Automated cost-based optimizer | ✓ Cross-engine | ◑ Within Spark | ◑ Within Flink | ◑ Query planner
In-situ processing (no ETL) | ✓ Data stays in place | — Requires ingestion | — Requires ingestion | ◑ SQL sources only
Federated learning support | ✓ Designed for FL | — Not native | — Not native | — Not native
Platform portability | ✓ Change one line | — Rewrite required | — Rewrite required | — Rewrite required
Mix batch + streaming in one plan | ✓ Yes | ◑ Structured Streaming | ✓ Yes | — Batch only
Apache Top-Level Project (ASF) | ✓ Yes | ✓ Yes | ✓ Yes | — Not ASF
License | Apache 2.0 | Apache 2.0 | Apache 2.0 | PostgreSQL License
✓ full support  ·  ◑ partial / within-engine only  ·  — not supported
Apache Wayang is an Apache Software Foundation Top-Level Project. Scalytics provides enterprise support and integration. Apache®, Wayang®, Spark®, and Flink® are trademarks of the Apache Software Foundation.

Enterprise support for Apache Wayang

Production deployment, pipeline migration, and managed integration, from the team behind the project.

Scalytics Industry Solutions

Purpose-built for the hardest data challenges.

Trusted by Innovators

Scalytics gave us a secure, explainable way to scale private AI models across our organization into multiple European countries — fully aligned with GDPR and the AI Act.
CTO
Chief Technology Officer | Large Travel Agency
Building an efficient and scalable data architecture with Scalytics reduced our development time and costs dramatically. From demo to proof of concept in just 2 days, now I wonder why we waited so long.
SVP Data and AI
Senior Vice President | FinTech Asia
Easy-to-use API; makes training ML pipelines faster, as it can run on distributed platforms such as Spark.
SVP Data and AI
Senior Vice President | FinTech Asia
What I like the most about this platform is its ease of use. One has to only express the business logic within its API, and then the platform optimizes for the underlying system usage.
Haralampos G.
Research | Gartner Digital Markets
I could execute my Spark job on Flink by changing only one line of code. I also liked a lot the optimizer that can select the platform based on a cost model.
George P.
Engineering Manager | Gartner Digital Markets

Start with an AI Strategy Audit

4-6 week deep dive: ERP analysis, streaming architecture design, ROI projections, and implementation roadmap. Everything you need to make the right AI investment decisions.