Apache Wayang: cross-platform data processing

The open execution layer for multi-platform analytics. No lock-in. No ETL. Apache-licensed.

What is Apache Wayang?

Apache Wayang is an open-source cross-platform data processing framework and Apache Software Foundation Top-Level Project. It provides a single pipeline abstraction, the WayangPlan, that sits above heterogeneous execution engines and coordinates work between them. Write once, run across Apache Spark, Apache Flink, PostgreSQL, Java Streams, and object storage backends. A cost-based optimizer decides at runtime which engine handles each operation, or splits work across several, based on data size and execution cost. No manual platform selection. No pipeline rewrites when infrastructure changes.
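To make the idea concrete, here is a minimal sketch of how a cost-based optimizer can assign each operator in a pipeline to the cheapest available engine. The cost figures, engine names, and functions below are illustrative assumptions, not Wayang's actual cost model or API:

```python
# Illustrative sketch (not Wayang's actual API): a cost-based optimizer
# that assigns each pipeline operator to the cheapest available engine.

# Hypothetical per-engine cost model: cost = startup overhead + per-row cost.
COST_MODEL = {
    "java-streams": {"startup": 0.0,    "per_row": 1.0},   # no cluster overhead, slow at scale
    "spark":        {"startup": 5000.0, "per_row": 0.01},  # high startup, cheap per row
    "postgres":     {"startup": 10.0,   "per_row": 0.1},
}

def cheapest_engine(rows, candidates):
    """Pick the engine with the lowest estimated cost for this operator."""
    def cost(engine):
        m = COST_MODEL[engine]
        return m["startup"] + m["per_row"] * rows
    return min(candidates, key=cost)

def plan(operators):
    """Assign every operator in a pipeline to an engine based on its input size."""
    return {op: cheapest_engine(rows, candidates)
            for op, (rows, candidates) in operators.items()}

# A toy pipeline: a small filter runs locally, a huge join belongs on Spark,
# a medium projection stays in the database.
pipeline = {
    "filter":  (1_000,       ["java-streams", "spark"]),
    "join":    (100_000_000, ["java-streams", "spark"]),
    "project": (50_000,      ["postgres", "spark"]),
}
print(plan(pipeline))
```

The point of the sketch is the shape of the decision: startup overhead dominates for small inputs, per-row cost dominates for large ones, so no single engine wins everywhere.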

Apache Wayang · Open-source cross-platform framework
One pipeline. Any platform.
No lock-in. No ETL overhead. Full control.
Write a single WayangPlan and let the cost-based optimizer decide whether to run it in Spark, Flink, Postgres, or a combination — based on your data size and infrastructure. No code changes required when platforms change.
[Diagram: a WayangPlan feeds a cost-model optimizer, which routes work across Apache Spark (distributed batch + ML), Apache Flink (stream processing), PostgreSQL (relational in-situ), Java Streams (single-node, small data), object storage (S3 / HDFS / local FS), and JDBC sources (any relational backend).]

Change one line to switch platforms · Mix platforms in a single pipeline · Data stays in place, no ETL · Apache Top-Level Project · Apache 2.0

The AI Readiness Check takes 3 minutes. The audit turns it into a 90-day roadmap.

The High Cost of Moving Data

Centralization was built for a world where storage was expensive and compute was cheap. Today, the equation has flipped. Moving data is the bottleneck.

Traditional ETL: The "Copy" Tax
Requires replicating data into a central lake. Incurs massive egress fees and creates governance silos.
Cloud egress cost: $90k / PB
Storage redundancy: 300% (3 copies)
Engineering time: 50% spent on maintenance
Time to insight: weeks

Scalytics Federated: Zero-Copy Intelligence
Leaves data where it lives. Pushes the query plan to the source. Pay only for the compute you use.
Cloud egress cost: >90% savings
Storage redundancy: 0% (in place)
Engineering time: 90% spent on innovation
Time to insight: minutes
* Based on standard AWS/Azure egress rates ($0.09/GB) and industry average data redundancy benchmarks.
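The $90k-per-petabyte figure follows directly from the quoted per-gigabyte rate. A quick arithmetic check, using decimal units (1 PB = 10^6 GB):

```python
# Verify the egress figure from the footnote above: $0.09 per GB of egress.
egress_rate_per_gb = 0.09      # USD, the standard AWS/Azure rate cited above
gb_per_pb = 1_000_000          # decimal units: 1 PB = 10^6 GB
cost_per_pb = egress_rate_per_gb * gb_per_pb
print(f"${cost_per_pb:,.0f} per PB")  # $90,000 per PB
```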

Performance validated across three real-world scenarios.

Official benchmarks · wayang.apache.org
Proven performance across three scenarios
Multi-platform execution. Measured. Documented. Open.
Published by the Apache Wayang project. Full methodology, datasets, and result charts available at the source links below each scenario.
~2× faster
vs. Postgres on TPC-H Q5
Federated relational query across multiple independent databases
Wayang splits TPC-H Q5 across Postgres and Spark automatically — selection and projection stay in Postgres, large joins execute in Spark. No data migration required. Matches Spark performance without the ETL overhead Spark needs to ingest the data first.
TPC-H benchmark 1GB → 100GB HDFS + Postgres + S3
Full results at wayang.apache.org →
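The split described above follows a pushdown pattern: operators that are cheap to run at the source (scans, selections, projections) stay in the database, while large joins and aggregations move to a distributed engine. A hypothetical rule-based sketch of that placement, not Wayang's actual planner:

```python
# Hypothetical sketch of the cross-engine split described above: pushdown-friendly
# operators stay in the source database; heavy operators scale out on Spark.

PUSHDOWN_OPS = {"scan", "select", "project"}   # cheap to run where the data lives

def place_operators(query_ops):
    """Assign each operator of a relational plan to Postgres or Spark."""
    placement = {}
    for op in query_ops:
        if op["kind"] in PUSHDOWN_OPS:
            placement[op["name"]] = "postgres"   # no raw data leaves the database
        else:
            placement[op["name"]] = "spark"      # large joins / aggregates scale out
    return placement

# Simplified shape of TPC-H Q5: scans and filters first, then multi-way joins.
q5 = [
    {"name": "scan_customer", "kind": "scan"},
    {"name": "filter_region", "kind": "select"},
    {"name": "project_cols",  "kind": "project"},
    {"name": "join_orders",   "kind": "join"},
    {"name": "join_lineitem", "kind": "join"},
    {"name": "agg_revenue",   "kind": "aggregate"},
]
print(place_operators(q5))
```

Only the filtered, projected intermediate result crosses the engine boundary, which is why the ETL ingestion step Spark would otherwise need disappears.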
10× faster
vs. MLlib and SystemML on large datasets
Machine learning cost reduction using multiple execution systems
Stochastic gradient descent benchmarked against MLlib (Spark) and SystemML (IBM). Wayang preprocesses in Spark, then switches to local Java for gradient computation as the dataset shrinks. The optimizer identifies this transition automatically — no user intervention required.
higgs ~11M points rcv1 677K points synthetic 88M points
Full results at wayang.apache.org →
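The automatic transition described above can be sketched as a size-threshold rule evaluated per iteration: while the working set is large, run distributed; once it shrinks below the point where cluster overhead dominates, drop to single-node execution. The threshold, engine names, and shrink schedule below are made-up illustrations:

```python
# Illustrative sketch of the mid-pipeline engine switch described above.
SINGLE_NODE_THRESHOLD = 1_000_000   # rows; hypothetical cut-over point

def choose_engine(n_rows):
    """Distributed engine for big working sets, local execution for small ones."""
    return "spark" if n_rows > SINGLE_NODE_THRESHOLD else "java-local"

def sgd_schedule(initial_rows, shrink_factor, iterations):
    """Record which engine each pass would run on as the sample shrinks."""
    schedule, rows = [], initial_rows
    for _ in range(iterations):
        schedule.append(choose_engine(rows))
        rows = int(rows * shrink_factor)   # e.g. aggressive subsampling per pass
    return schedule

# 88M synthetic points shrinking 10x per pass: the first passes run distributed,
# the rest local; the switch happens with no user intervention.
print(sgd_schedule(88_000_000, 0.1, 5))
```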
Auto-select
optimal platform per dataset size
Platform adaptation for big data analytics (WordCount, 1GB → 800GB)
Wayang consistently chose the fastest available platform — Java Streams for small datasets, Apache Spark for large — across Wikipedia abstracts from 1GB to 800GB. Zero manual platform selection, zero code changes between runs. Eliminates per-workload infrastructure benchmarking.
1GB → 800GB Spark vs Flink vs Java HDFS
Full results at wayang.apache.org →

Wayang and federated learning

Federated learning trains AI models across distributed data sources without centralizing the underlying data. The training computation runs where each dataset lives; only model updates, never raw records, move between nodes. This preserves data privacy, satisfies cross-border data residency requirements, and eliminates the egress and storage costs of assembling a centralized training corpus.
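A minimal federated-averaging sketch of this idea: each node computes a model update on its local data, and only the updates are shared and combined, weighted by node dataset size. This is a pure illustration of the FedAvg pattern, not Scalytics' or Wayang's implementation:

```python
# Minimal FedAvg sketch: gradients move, raw records never do.

def local_update(weights, local_data, lr=0.1):
    """One gradient step for a 1-D least-squares model, computed where the data lives."""
    grad = sum(2 * (weights * x - y) * x for x, y in local_data) / len(local_data)
    return weights - lr * grad

def federated_round(global_w, nodes):
    """Aggregate per-node updates, weighted by node dataset size (FedAvg)."""
    total = sum(len(d) for d in nodes)
    updates = [local_update(global_w, d) for d in nodes]   # raw rows stay put
    return sum(w * len(d) for w, d in zip(updates, nodes)) / total

# Two nodes each holding samples of y = 2x; neither node's data leaves it.
node_a = [(1.0, 2.0), (2.0, 4.0)]
node_b = [(3.0, 6.0)]
w = 0.0
for _ in range(50):
    w = federated_round(w, [node_a, node_b])
print(round(w, 3))  # converges toward the true slope, 2.0
```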

Apache Wayang is the right execution substrate for this because it was built for in-situ, distributed computation across heterogeneous systems. Scalytics uses Wayang's execution model to coordinate federated training across nodes, enforce data locality, and route aggregated gradients, not training data, through the system.

Capability comparison
Apache Wayang vs single-platform alternatives
Middleware, not a replacement. Orchestrates what you already run.
Wayang sits above your existing engines and coordinates work between them. It doesn't replace Spark or Flink — it makes them interoperable.
Capability | Apache Wayang | Apache Spark | Apache Flink | Postgres / JDBC
Cross-platform pipeline execution | ✓ Native | — Single engine | — Single engine | — Single engine
Automated cost-based optimizer | ✓ Cross-engine | ◑ Within Spark | ◑ Within Flink | ◑ Query planner
In-situ processing (no ETL) | ✓ Data stays in place | — Requires ingestion | — Requires ingestion | ◑ SQL sources only
Federated learning support | ✓ Designed for FL | — Not native | — Not native | — Not native
Platform portability | ✓ Change one line | — Rewrite required | — Rewrite required | — Rewrite required
Mix batch + streaming in one plan | ✓ Yes | ◑ Structured Streaming | ✓ Yes | — Batch only
Apache Top-Level Project (ASF) | ✓ Yes | ✓ Yes | ✓ Yes | — Not ASF
License | Apache 2.0 | Apache 2.0 | Apache 2.0 | PostgreSQL License
✓ full support  ·  ◑ partial / within-engine only  ·  — not supported
Apache Wayang is an Apache Software Foundation Top-Level Project. Scalytics provides enterprise support and integration. Apache®, Wayang®, Spark®, and Flink® are trademarks of the Apache Software Foundation.

Enterprise support for Apache Wayang

Production deployment, pipeline migration, and managed integration, from the team behind the project.

Scalytics Industry Solutions

Purpose-built for the hardest data challenges.

Trusted by Innovators

Scalytics gave us a secure, explainable way to scale private AI models across our organization into multiple European countries — fully aligned with GDPR and the AI Act.
CTO
Chief Technology Officer | Large Travel Agency
Building an efficient and scalable data architecture with Scalytics reduced our development time and costs dramatically. From demo to proof of concept in just 2 days, now I wonder why we waited so long.
SVP Data and AI
Senior Vice President | FinTech Asia
Easy-to-use API; makes training ML pipelines faster, as it can run on distributed platforms such as Spark.
SVP Data and AI
Senior Vice President | FinTech Asia
What I like the most about this platform is its ease of use. One has to only express the business logic within its API, and then the platform optimizes for the underlying system usage.
Haralampos G.
Research | Gartner Digital Markets
I could execute my Spark job on Flink by changing only one line of code. I also liked a lot the optimizer that can select the platform based on a cost model.
George P.
Engineering Manager | Gartner Digital Markets

Start with an AI Strategy Audit

4-6 week deep dive: ERP analysis, streaming architecture design, ROI projections, and implementation roadmap. Everything you need to make the right AI investment decisions.