Modern data architectures increasingly depend on heterogeneous execution environments: Spark for large-scale analytics, Flink for streaming, Postgres for transactional queries, Java for in-process computation, and GPUs or Python frameworks for AI workloads. Scalytics Federated unifies these systems under a single execution layer powered by Apache Wayang. Developers can extend this layer with custom operators, channels, and execution backends to integrate new platforms or optimize for specialized workloads.
This article introduces the core abstractions behind Scalytics Federated and prepares you for writing custom plugins. It is the first in a series aimed at engineers who want to extend the platform with new logical operators, cross-platform mappings, and optimized execution strategies.
Execution Plans: The Heart of Cross-Platform Optimization
At its core, Scalytics Federated builds a logical plan of operators that represent the analytical workflow. The optimizer evaluates multiple execution alternatives for each operator across available platforms and constructs an execution plan that minimizes cost, latency, resource usage, or domain-specific objectives.
A plan is therefore not just a DAG of operations; it is a search space that the optimizer navigates to resolve which operator should run where, how data should be exchanged between systems, and how to minimize platform-specific overheads.
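To make this concrete, here is a minimal word-count plan expressed with Apache Wayang's Java API, adapted from the example in the Wayang documentation. Treat it as a sketch: the builder methods follow Wayang's `api` module, but exact signatures can shift between versions.

```java
import java.util.Arrays;
import java.util.Collection;

import org.apache.wayang.api.JavaPlanBuilder;
import org.apache.wayang.basic.data.Tuple2;
import org.apache.wayang.core.api.Configuration;
import org.apache.wayang.core.api.WayangContext;
import org.apache.wayang.java.Java;
import org.apache.wayang.spark.Spark;

public class WordCount {
    public static void main(String[] args) {
        // Register two candidate platforms; the optimizer decides per
        // operator which one actually executes it.
        WayangContext wayangContext = new WayangContext(new Configuration())
                .withPlugin(Java.basicPlugin())
                .withPlugin(Spark.basicPlugin());

        // The chained calls below only build the logical plan; platform
        // selection and execution happen when collect() is invoked.
        Collection<Tuple2<String, Integer>> wordCounts = new JavaPlanBuilder(wayangContext)
                .withJobName("WordCount")
                .withUdfJarOf(WordCount.class)
                .readTextFile("file:/tmp/input.txt").withName("Load file")
                .flatMap(line -> Arrays.asList(line.split("\\W+"))).withName("Split words")
                .filter(token -> !token.isEmpty()).withName("Drop empty tokens")
                .map(word -> new Tuple2<>(word.toLowerCase(), 1)).withName("Attach counter")
                .reduceByKey(Tuple2::getField0,
                        (t1, t2) -> new Tuple2<>(t1.getField0(), t1.getField1() + t2.getField1()))
                .withName("Sum counters")
                .collect();

        wordCounts.forEach(System.out::println);
    }
}
```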
Logical Operators: The Platform-Independent Layer
Logical operators define what to compute, not how or where to compute it. These operators are backend-agnostic and represent conceptual building blocks such as:
- filters
- projections
- joins
- aggregations
- unions
- feature engineering steps
- ML-oriented transformations
A key design principle is separation of concerns:
- Logical operator: describes the transformation semantically
- Execution operator: implements that transformation on a specific backend
This separation allows a single logical operator (for example, Count) to be executed across many platforms: Spark, Flink, Java Collections, Postgres, or a custom system you plug in.
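Wayang's logical CountOperator illustrates how thin this layer is: it only declares its typed input and output. The sketch below is simplified from the real class, and constructor details may differ slightly across Wayang versions.

```java
import org.apache.wayang.core.plan.wayangplan.UnaryToUnaryOperator;
import org.apache.wayang.core.types.DataSetType;

// A logical Count: one input slot of Type, one output slot of Long.
// Note that no platform-specific code appears anywhere in the class.
public class CountOperator<Type> extends UnaryToUnaryOperator<Type, Long> {

    public CountOperator(DataSetType<Type> inputType) {
        // input slot type, output slot type, no broadcast inputs
        super(inputType, DataSetType.createDefault(Long.class), false);
    }
}
```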
Operator Interfaces: Defining Inputs, Outputs, and Semantics
Every logical operator specifies:
- the number and type of input slots
- the number and type of output slots
- the dataset type flowing through those slots
- any required configuration or parameters
Scalytics Federated provides abstract operator families such as:
- UnarySource / UnarySink – for operators with only an output (sources) or only an input (sinks)
- UnaryToUnaryOperator – for transformations on one input producing one output
- BinaryToUnaryOperator – for joins, merges, unions, or intersections
Binary operators require careful definition of input cardinalities and data types. Dataset types ensure correctness at compile time and enforce compatibility across channels.
This level of abstraction allows the optimizer to treat operators uniformly and enables automated translation into platform-specific execution paths.
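As a worked example, here is a hypothetical TokenizeOperator (not part of Wayang) that declares one String input slot, one String output slot, and a configuration parameter, using the UnaryToUnaryOperator family described above:

```java
import org.apache.wayang.core.plan.wayangplan.UnaryToUnaryOperator;
import org.apache.wayang.core.types.DataSetType;

// Hypothetical logical operator: splits strings by a configurable delimiter.
// One String input slot, one String output slot, plus an operator parameter.
public class TokenizeOperator extends UnaryToUnaryOperator<String, String> {

    private final String delimiterRegex;

    public TokenizeOperator(String delimiterRegex) {
        super(DataSetType.createDefault(String.class),  // input slot
              DataSetType.createDefault(String.class),  // output slot
              false);                                   // no broadcast inputs
        this.delimiterRegex = delimiterRegex;
    }

    public String getDelimiterRegex() {
        return this.delimiterRegex;
    }
}
```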
Input/Output Cardinality Abstract Operators
As suggested in the previous section, different operators require different numbers of input and output slots. Source operators need only an output slot because they receive no input, and sink operators need only an input slot because they do not pass results on to other operators. To classify operators accordingly, Wayang provides the UnaryToUnaryOperator, UnarySource, UnarySink, and BinaryToUnaryOperator classes. Each input and output slot is typed by a DataSetType, which tracks the type and structure of the data transferred between operators through that slot.
Going further, consider the abstract class BinaryToUnaryOperator. It takes three generic type parameters, corresponding to the operator's two input types and its single output type. By extending this class, you can model Join, Union, and Intersect operators.
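The shape of such an operator follows directly from the base class. The sketch below mirrors Wayang's IntersectOperator, where both inputs and the output share one type:

```java
import org.apache.wayang.core.plan.wayangplan.BinaryToUnaryOperator;
import org.apache.wayang.core.types.DataSetType;

// Modeled on Wayang's IntersectOperator: the three generic parameters of
// BinaryToUnaryOperator<InputType0, InputType1, OutputType> collapse to one.
public class IntersectOperator<Type> extends BinaryToUnaryOperator<Type, Type, Type> {

    public IntersectOperator(DataSetType<Type> type) {
        // input slot 0, input slot 1, output slot, no broadcast inputs
        super(type, type, type, false);
    }
}
```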

Operator Classes
The purpose of a Wayang operator class is to define a promised piece of functionality that can be implemented on different processing platforms; the Wayang community usually calls these platform-independent operators. An operator class does not describe how that functionality will be delivered, since that depends entirely on the underlying platform Wayang selects to run the operator.
Any operator class that extends one of these input/output cardinality abstract operators is a Wayang operator class; the CountOperator and IntersectOperator sketches above are examples of this pattern.

Channels: Connecting Operators Across Platforms
Channels describe how data moves between operators and across execution backends. A channel is a concrete representation of a dataset on a specific platform.
Examples:
- A Java Collection channel
- A Spark RDD or Dataset channel
- A Flink DataSet/DataStream channel
- A Postgres cursor or temporary table
Channels enforce compatibility: the output of one operator can only be consumed by another operator that supports that channel type. When compatibility does not exist, the optimizer inserts conversion channels that transform data from one representation into another.
This is how Scalytics Federated enables true cross-platform pipelines: channels serve as the glue between execution engines.
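Channel compatibility is declared by execution operators themselves. Continuing the hypothetical TokenizeOperator from earlier, its Java execution counterpart might advertise its channels as below. The descriptor constants follow real Wayang conventions, while the JavaExecutionOperator interface and its runtime method are omitted for brevity.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import org.apache.wayang.core.platform.ChannelDescriptor;
import org.apache.wayang.java.channels.CollectionChannel;
import org.apache.wayang.java.channels.StreamChannel;

// Hypothetical Java-platform counterpart of TokenizeOperator. A complete
// version would also implement JavaExecutionOperator and its runtime
// execution method, elided here.
public class JavaTokenizeOperator extends TokenizeOperator {

    public JavaTokenizeOperator(String delimiterRegex) {
        super(delimiterRegex);
    }

    // Can read a materialized Java Collection or a lazy Stream.
    public List<ChannelDescriptor> getSupportedInputChannels(int index) {
        return Arrays.asList(CollectionChannel.DESCRIPTOR, StreamChannel.DESCRIPTOR);
    }

    // Produces a lazy Stream for downstream Java operators; if the consumer
    // needs something else, the optimizer inserts a conversion.
    public List<ChannelDescriptor> getSupportedOutputChannels(int index) {
        return Collections.singletonList(StreamChannel.DESCRIPTOR);
    }
}
```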
Execution Operators: Platform-Specific Implementations

Execution operators implement the logic of a logical operator on a particular platform. For example:
- JavaCountOperator<T> implements Count using Java Collections
- FlinkCountOperator<T> implements Count using Flink APIs
- SparkJoinOperator implements a logical join via Spark SQL
Execution operators:
- define the platform they target
- specify the channel types they accept and produce
- implement the runtime execution logic
- expose metadata needed by the cost model (I/O size, CPU, memory, etc.)
During optimization, Scalytics Federated evaluates all available execution operators for each logical operator and chooses the most efficient alternative based on the full pipeline context—not just local cost.
This enables powerful cross-platform strategies, such as:
- preprocessing in Java
- joining large partitions in Spark
- aggregating in Postgres
- streaming updates through Flink
=> all automatically selected and orchestrated.

A platform-specific execution operator implements the behavior of a logical operator for a particular backend. For example, FlinkCountOperator<T> extends the logical CountOperator<T> and implements the FlinkExecutionOperator interface. In this case, the operator consumes a FlinkDatasetChannel, applies Flink’s native DataSet.count() method, and returns the result.
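In compressed form, such an operator looks roughly like the sketch below. The real Wayang class carries additional plumbing (lineage tracking, an optimization context in its execution method), so the channel choices and configuration key should be read as approximate; the FlinkExecutionOperator implementation is elided for the same reason as above.

```java
import java.util.Collections;
import java.util.List;

import org.apache.wayang.basic.operators.CountOperator;
import org.apache.wayang.core.platform.ChannelDescriptor;
import org.apache.wayang.core.types.DataSetType;
import org.apache.wayang.flink.channels.DataSetChannel;

// Flink execution operator for the logical Count. The omitted runtime method
// unwraps the incoming DataSetChannel, calls Flink's DataSet.count(), and
// wraps the result for downstream consumers.
public class FlinkCountOperator<Type> extends CountOperator<Type> {

    public FlinkCountOperator(DataSetType<Type> type) {
        super(type);
    }

    public List<ChannelDescriptor> getSupportedInputChannels(int index) {
        return Collections.singletonList(DataSetChannel.DESCRIPTOR);
    }

    public List<ChannelDescriptor> getSupportedOutputChannels(int index) {
        return Collections.singletonList(DataSetChannel.DESCRIPTOR);
    }

    // Cost-model hook: points the optimizer at configured load estimates
    // for this operator (key approximate).
    public String getLoadProfileEstimatorConfigurationKey() {
        return "wayang.flink.count.load";
    }
}
```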
Each logical operator may have multiple execution operators—one per supported platform. During optimization, Scalytics Federated evaluates these alternatives using cost estimates that reflect I/O patterns, data size, operator complexity, and platform characteristics. The optimizer selects the most efficient execution path based on the global optimization objective and the platforms available in the environment.
Bringing It All Together
Extending Scalytics Federated with custom operators, channels, and execution backends requires understanding how logical operators, execution operators, and channels interact within the optimizer. The abstractions described above form the foundation for integrating new platforms or enhancing existing ones.
When developers implement a new operator or backend, four components typically come together (sketched as a plugin after this list):
- a logical operator, describing what computation should occur
- one or more execution operators, mapping that logic to concrete platforms
- channels, defining how data is exchanged between operators and backends
- cost metadata, allowing the optimizer to choose the most efficient execution path
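A hedged sketch of that bundle, following the shape of Wayang's Plugin interface: the MyCountMapping and MyPlatform classes below are hypothetical placeholders for the mapping and platform you would implement.

```java
import java.util.Collection;
import java.util.Collections;

import org.apache.wayang.core.api.Configuration;
import org.apache.wayang.core.mapping.Mapping;
import org.apache.wayang.core.optimizer.channels.ChannelConversion;
import org.apache.wayang.core.platform.Platform;
import org.apache.wayang.core.plugin.Plugin;

// Hypothetical plugin bundling a custom backend for the optimizer.
// MyCountMapping and MyPlatform do not exist in Wayang; they stand in
// for the classes you would write.
public class MyPlatformPlugin implements Plugin {

    @Override
    public Collection<Mapping> getMappings() {
        // Logical-to-execution operator mappings for this backend.
        return Collections.singletonList(new MyCountMapping());
    }

    @Override
    public Collection<ChannelConversion> getChannelConversions() {
        // Conversions into and out of this backend's native channels
        // would be listed here.
        return Collections.emptyList();
    }

    @Override
    public Collection<Platform> getRequiredPlatforms() {
        return Collections.singletonList(MyPlatform.getInstance());
    }

    @Override
    public void setProperties(Configuration configuration) {
        // Cost metadata: load estimates the optimizer uses for this backend.
        configuration.setProperty("wayang.myplatform.count.load", "...");
    }
}
```

Registering the plugin then mirrors the built-in ones: new WayangContext().withPlugin(new MyPlatformPlugin()).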
These elements enable Scalytics Federated to build heterogeneous pipelines that run across Spark, Flink, SQL engines, Java, Python, or any additional system you introduce. By following these abstractions, developers can extend the platform with predictable behavior, strong performance characteristics, and seamless integration into the global optimization process.
This conceptual foundation prepares you to design plugins that align with Scalytics Federated's execution model and to integrate new systems into a unified cross-platform environment.
About Scalytics
Scalytics Federated provides federated data processing across Spark, Flink, PostgreSQL, and cloud-native engines through a single abstraction layer. Our cost-based optimizer selects the right engine for each operation, reducing processing time while eliminating vendor lock-in.
Scalytics Copilot extends this foundation with private AI deployment: running LLMs, RAG pipelines, and ML workloads entirely within your security perimeter. Data stays where it lives. Models train where data resides. No extraction, no exposure, no third-party API dependencies.
For organizations in healthcare, finance, and government, this architecture isn't optional; it's how you deploy AI while remaining compliant with HIPAA, GDPR, and DORA. Explore our open-source foundation: Scalytics Community Edition
Questions? Reach us on Slack or schedule a conversation.
