Apache Wayang is an open-source cross-platform data processing framework and an Apache Software Foundation Top-Level Project. It provides a single pipeline abstraction, the WayangPlan, that sits above heterogeneous execution engines and coordinates work between them. Write a pipeline once and run it across Apache Spark, Apache Flink, PostgreSQL, Java Streams, and object storage backends. A cost-based optimizer decides at runtime which engine handles each operation, or splits work across several, based on data size and execution cost. No manual platform selection. No pipeline rewrites when infrastructure changes.
Apache Wayang · Open-source cross-platform framework
One pipeline. Any platform.
No lock-in. No ETL overhead. Full control.
Write a single WayangPlan and let the cost-based optimizer decide whether to run it in Spark, Flink, Postgres, or a combination — based on your data size and infrastructure. No code changes required when platforms change.
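A minimal sketch of what that looks like with the Wayang Java API (class and plugin names follow the upstream Wayang examples; exact packages may differ between releases). The registered plugins define which platforms the optimizer may choose from, and swapping or adding a platform is the advertised one-line change:

```java
import org.apache.wayang.core.api.Configuration;
import org.apache.wayang.core.api.WayangContext;
import org.apache.wayang.java.Java;
import org.apache.wayang.spark.Spark;

// Register the platforms the optimizer is allowed to use.
// Moving a pipeline to Flink is one line: add (or swap in)
// org.apache.wayang.flink.Flink.basicPlugin().
WayangContext context = new WayangContext(new Configuration())
        .withPlugin(Java.basicPlugin())    // single-node Java Streams
        .withPlugin(Spark.basicPlugin());  // distributed Apache Spark

// Plans built against this context are assigned to whichever registered
// platform the cost model estimates to be cheapest for each operator.
```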
Apache Spark · Distributed batch + ML
Apache Flink · Stream processing
PostgreSQL · Relational in-situ
WayangPlan · Cost-model optimizer
Java Streams · Single-node, small data
Object Storage · S3 / HDFS / local FS
JDBC sources · Any relational backend
Change one line to switch platforms
Mix platforms in a single pipeline
Data stays in place — no ETL
Apache Top-Level Project · Apache 2.0
The AI Readiness Check takes 3 minutes. The audit turns it into a 90-day roadmap.
Centralization was built for a world where storage was expensive and compute was cheap.
Today, the equation has flipped. Moving data is the bottleneck.
Traditional ETL
The "Copy" Tax
Requires replicating data into a central lake. Incurs massive egress fees and creates governance silos.
Cloud Egress Cost: $90k / PB
Storage Redundancy: 300% (3 Copies)
Engineering Time: 50% Maintenance
Time to Insight: Weeks
Scalytics Federated
Zero-Copy Intelligence
Leaves data where it lives. Pushes the query plan to the source. You pay only for the compute you use.
Cloud Egress Cost: >90% Savings
Storage Redundancy: 0% (In-Place)
Engineering Time: 90% Innovation
Time to Insight: Minutes
* Based on standard AWS/Azure egress rates ($0.09/GB) and industry average data redundancy benchmarks.
Performance validated across three real-world scenarios.
Published by the Apache Wayang project. Full methodology, datasets, and result charts available at the source links below each scenario.
~2× faster
vs. Postgres on TPC-H Q5
Federated relational query across multiple independent databases
Wayang splits TPC-H Q5 across Postgres and Spark automatically — selection and projection stay in Postgres, large joins execute in Spark. No data migration required. It matches standalone Spark performance without the ETL step Spark would otherwise need to ingest the data first.
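A sketch of how such a mixed plan could be configured, assuming the Wayang Postgres and Spark plugins; the JDBC property keys, host, and credentials below are illustrative placeholders, so check your release's reference configuration for the exact names:

```java
import org.apache.wayang.core.api.Configuration;
import org.apache.wayang.core.api.WayangContext;
import org.apache.wayang.postgres.Postgres;
import org.apache.wayang.spark.Spark;

// Tell Wayang where the TPC-H tables already live (illustrative keys and values).
Configuration configuration = new Configuration();
configuration.setProperty("wayang.postgres.jdbc.url",
        "jdbc:postgresql://db-host:5432/tpch");
configuration.setProperty("wayang.postgres.jdbc.user", "tpch_reader");
configuration.setProperty("wayang.postgres.jdbc.password", "secret");

// Register both engines. For a plan like TPC-H Q5, the optimizer keeps
// selections and projections inside Postgres and hands the large joins
// to Spark -- no manual routing and no bulk export of the tables.
WayangContext context = new WayangContext(configuration)
        .withPlugin(Postgres.plugin())
        .withPlugin(Spark.basicPlugin());
```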
Machine learning cost reduction using multiple execution systems
Stochastic gradient descent benchmarked against MLlib (Spark) and SystemML (IBM). Wayang preprocesses in Spark, then switches to local Java for gradient computation as the dataset shrinks. The optimizer identifies this transition automatically — no user intervention required.
Platform adaptation for big data analytics (WordCount, 1GB → 800GB)
Wayang consistently chose the fastest available platform — Java Streams for small datasets, Apache Spark for large — across Wikipedia abstracts from 1GB to 800GB. Zero manual platform selection, zero code changes between runs. Eliminates per-workload infrastructure benchmarking.
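The scenario boils down to the canonical Wayang WordCount, modeled here on the example in the Apache Wayang documentation; the input path is a placeholder, and method names such as readTextFile and reduceByKey may vary slightly by release:

```java
import java.util.Arrays;
import java.util.Collection;
import org.apache.wayang.api.JavaPlanBuilder;
import org.apache.wayang.basic.data.Tuple2;
import org.apache.wayang.core.api.Configuration;
import org.apache.wayang.core.api.WayangContext;
import org.apache.wayang.java.Java;
import org.apache.wayang.spark.Spark;

public class WordCount {
    public static void main(String[] args) {
        // Both platforms are available; the cost model picks per input size:
        // Java Streams while the file is small, Spark once it is not.
        WayangContext wayangContext = new WayangContext(new Configuration())
                .withPlugin(Java.basicPlugin())
                .withPlugin(Spark.basicPlugin());

        JavaPlanBuilder planBuilder = new JavaPlanBuilder(wayangContext)
                .withJobName("WordCount")
                .withUdfJarOf(WordCount.class);

        Collection<Tuple2<String, Integer>> wordCounts = planBuilder
                .readTextFile("file:/tmp/wikipedia-abstracts.txt")  // placeholder path
                .flatMap(line -> Arrays.asList(line.split("\\W+")))
                .filter(token -> !token.isEmpty())
                .map(word -> new Tuple2<>(word.toLowerCase(), 1))
                .reduceByKey(Tuple2::getField0,
                        (a, b) -> new Tuple2<>(a.getField0(), a.getField1() + b.getField1()))
                .collect();

        wordCounts.forEach(System.out::println);
    }
}
```

The same source runs unchanged whether the optimizer routes it to Java Streams or to Spark; only the registered plugins and the input size differ between runs.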
Federated learning trains AI models across distributed data sources without centralizing the underlying data. The training computation runs where each dataset lives; only model updates, never raw records, move between nodes. This preserves data privacy, satisfies cross-border data residency requirements, and eliminates the egress and storage costs of assembling a centralized training corpus.
Apache Wayang is the right execution substrate for this because it was built for in-situ, distributed computation across heterogeneous systems. Scalytics uses Wayang's execution model to coordinate federated training across nodes, enforce data locality, and route aggregated gradients, not training data, through the system.
Capability comparison
Apache Wayang vs single-platform alternatives
Middleware, not a replacement. Orchestrates what you already run.
Wayang sits above your existing engines and coordinates work between them. It doesn't replace Spark or Flink — it makes them interoperable.
| Capability | Apache Wayang | Apache Spark | Apache Flink | Postgres / JDBC |
| --- | --- | --- | --- | --- |
| Cross-platform pipeline execution | ✓ Native | — Single engine | — Single engine | — Single engine |
| Automated cost-based optimizer | ✓ Cross-engine | ◑ Within Spark | ◑ Within Flink | ◑ Query planner |
| In-situ processing — no ETL | ✓ Data stays in place | — Requires ingestion | — Requires ingestion | ◑ SQL sources only |
| Federated learning support | ✓ Designed for FL | — Not native | — Not native | — Not native |
| Platform portability | ✓ Change one line | — Rewrite required | — Rewrite required | — Rewrite required |
| Mix batch + streaming in one plan | ✓ Yes | ◑ Structured Streaming | ✓ Yes | — Batch only |
| Apache Top-Level Project (ASF) | ✓ Yes | ✓ Yes | ✓ Yes | — Not ASF |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 | PostgreSQL License |
✓ full support · ◑ partial / within-engine only · — not supported
Apache Wayang is an Apache Software Foundation Top-Level Project. Scalytics provides enterprise support and integration. Apache®, Wayang®, Spark®, and Flink® are trademarks of the Apache Software Foundation.
Enterprise support for Apache Wayang
Production deployment, pipeline migration, and managed integration, from the team behind the project.
Scalytics gave us a secure, explainable way to scale private AI models across our organization into multiple European countries — fully aligned with GDPR and the AI Act.
CTO Chief Technology Officer | Large Travel Agency
Building an efficient and scalable data architecture with Scalytics reduced our development time and costs dramatically. From demo to proof of concept in just 2 days, now I wonder why we waited so long.
SVP Data and AI Senior Vice President | FinTech Asia
Easy to use API, makes the training of ML pipelines faster as it can run on distributed platforms such as Spark for example.
SVP Data and AI Senior Vice President | FinTech Asia
What I like the most about this platform is its ease of use. One has to only express the business logic within its API, and then the platform optimizes for the underlying system usage.
Haralampos G. Research | Gartner Digital Markets
I could execute my Spark job on Flink by changing only one line of code. I also liked a lot the optimizer that can select the platform based on a cost model.
George P. Engineering Manager | Gartner Digital Markets
Start with an AI Strategy Audit
4-6 week deep dive: ERP analysis, streaming architecture design, ROI projections, and implementation roadmap. Everything you need to make the right AI investment decisions.