Scalytics Connect provides a cutting-edge solution for managing complex data pipelines, covering everything from data extraction and access to multiple data sources to careful data pre-processing and advanced feature engineering.
This article is the start of a series designed for developers and techies who want to take advantage of all the power Scalytics Connect has to offer. The main topic? Building and integrating custom data platform plugins. These aren’t your average plugins. They’re built with custom logical operators and complex mappings, and optimized for performance across multiple execution platforms. One of the best things about these plugins is their ability to set up conversion channels, which keeps output data types flexible and compatible with whatever platform you choose.
In this post, we’re going to dive into the key concepts of Scalytics Connect that are essential for understanding and realizing its wide range of capabilities. It’s important to understand that Scalytics Connect isn’t just a cross-platform processing framework; it’s the heart of complex, high-quality computations. We’re about to take a journey through the abstractions that drive the smooth integration of hundreds of technologies, while improving Blossom’s ability to optimize operators across multiple platforms.
Processing Execution Plan
The execution plan is not just a strategy; it’s the lifeblood powering the optimization journey of your data analytics pipeline. Hidden inside is a collection of smart operators, each designed to handle a unique workload. What boosts its performance? Advanced algorithms designed to navigate, fine-tune, and manage a massive number of operators, driving transformations that create diverse, usable implementations.
Scalytics Connect offers a wide range of operators in its toolbox, ranging from the fundamental components of relational databases to the fluidity of FlinkStreamProcessors. The brilliance lies in the underlying framework, which navigates complex abstractions to allow the comprehensive integration of multiple components. For developers, the workflow is straightforward: focus on implementing pre-defined interfaces while Scalytics Connect handles the orchestration.
This interface is at the core of the Blossom Plan. It provides a detailed description of each node and its components. When implementing it, developers specify the type of the operator and the details of its configuration for optimal processing. The methods that control the input and output slots are crucial to this process: they govern the data that each operator consumes and produces. Note that some binary operators, such as join operators, manage multiple input sources at the same time, while others, like replicate operators, spawn many output streams, each dedicated to a different operator.
At their core, input and output slots connect two operators, creating a producer-consumer relationship.
Input/Output Cardinality Abstract Operator
As suggested in the previous section, different operators require a different number of input/output slots. Source operators require only an output slot because they do not receive any input, and sink operators require only an input slot because they do not transmit results to other operators. To classify operators, Blossom provides the UnaryToUnaryOperator, UnarySource, UnarySink, and BinaryToUnaryOperator classes to handle each specific case. Input and output slots are defined by a DatasetType, which keeps track of the type and structure of the data transferred between operators through a slot.
Going further with this explanation, let's review the abstract class BinaryToUnaryOperator. It receives three generic types, corresponding to the operator's two input types and its single output type. By extending this class, a user can model Join, Union, and Intersect operators.
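To make the cardinality classes concrete, here is a minimal, self-contained sketch of the hierarchy described above. The class names mirror the article, but the fields and method signatures are illustrative assumptions, not the actual Blossom/Wayang API:

```java
import java.util.List;

// Simplified sketch of the input/output cardinality abstractions.
abstract class Operator {
    abstract int numInputSlots();
    abstract int numOutputSlots();
}

// A source produces data but consumes none: 0 inputs, 1 output.
abstract class UnarySource<Out> extends Operator {
    int numInputSlots()  { return 0; }
    int numOutputSlots() { return 1; }
}

// A sink consumes data but produces none: 1 input, 0 outputs.
abstract class UnarySink<In> extends Operator {
    int numInputSlots()  { return 1; }
    int numOutputSlots() { return 0; }
}

// One input, one output (e.g. a Map or Filter operator).
abstract class UnaryToUnaryOperator<In, Out> extends Operator {
    int numInputSlots()  { return 1; }
    int numOutputSlots() { return 1; }
}

// Two inputs, one output (e.g. Join, Union, Intersect).
abstract class BinaryToUnaryOperator<In0, In1, Out> extends Operator {
    int numInputSlots()  { return 2; }
    int numOutputSlots() { return 1; }
}

// A Join can then be modeled by extending BinaryToUnaryOperator.
class JoinOperator<L, R> extends BinaryToUnaryOperator<L, R, List<Object>> {}
```

The key design point carried over from the article: the cardinality lives in the abstract superclass, so every concrete operator inherits the correct slot counts for free.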
Blossom Operator Classes
Blossom operator classes are the actual nodes that compose a BlossomPlan. The purpose of a Blossom operator class is to define a promised functionality that can be implemented on different processing platforms; the Blossom community usually calls these platform-independent operators. Blossom operator classes do not describe how a specific functionality will be delivered: that depends entirely on the underlying platform Blossom uses to run the operator.
Any operator class that extends an input/output cardinality abstract operator is a Blossom operator class. Let's review the CountOperator Blossom operator class: CountOperator<Type> extends UnaryToUnaryOperator<Type, Long>, meaning that it receives a generic type and returns a Long value. The only restriction on platforms implementing this operator is that, at execution time, a CountOperator will receive a stream of Type elements and, after processing them, must return a single Long value. Any platform that wants to support CountOperator must follow that pattern.
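The CountOperator contract can be sketched in a few lines. This is a simplified, self-contained illustration of the pattern (stream of Type in, one Long out), with an in-memory implementation standing in for a real platform; names and signatures are assumptions, not the actual API:

```java
import java.util.stream.Stream;

// Minimal stand-in for the cardinality superclass.
abstract class UnaryToUnaryOperator<In, Out> {}

// The platform-independent contract: consume Type elements, return one Long.
abstract class CountOperator<Type> extends UnaryToUnaryOperator<Type, Long> {
    abstract Long evaluate(Stream<Type> input);
}

// A trivial in-memory implementation honoring that contract.
class InMemoryCountOperator<Type> extends CountOperator<Type> {
    Long evaluate(Stream<Type> input) {
        return input.count();
    }
}
```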
Channels
A Channel in Blossom is the interface that interconnects different sets of operators; in other words, a channel is the glue that connects one operator to another. Imagine an operator "Source" running in Java and reading tuples from a local file; the output of "Source" will be a collection of tuples. In this case, the output channel provided by Source is a Java Collection. A Java Collection channel can only be used as input to operators that accept the Java Collection format. To allow platforms other than Java to accept the output of Source, this Java Collection must be converted into another format.
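The conversion idea can be illustrated with a small sketch: a Collection-backed channel is converted into a different channel format (a Stream here, purely for illustration) so that a consumer expecting that other format can read the Source's output. All class names below are hypothetical, not the actual Blossom API:

```java
import java.util.Collection;
import java.util.stream.Stream;

// Common marker for all channel formats.
interface Channel {}

// Output of a Java operator: holds a plain Collection.
class CollectionChannel implements Channel {
    final Collection<?> data;
    CollectionChannel(Collection<?> data) { this.data = data; }
}

// A channel format another consumer might expect instead.
class StreamChannel implements Channel {
    final Stream<?> data;
    StreamChannel(Stream<?> data) { this.data = data; }
}

// Conversion operator bridging the two channel formats.
class CollectionToStreamConversion {
    StreamChannel convert(CollectionChannel in) {
        return new StreamChannel(in.data.stream());
    }
}
```

The producer keeps emitting its native format; only the conversion step knows about both formats, which is what lets new platforms plug in without touching existing operators.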
Execution Operators
A CountOperator cannot run unless it is given a specific behavior. An execution operator implements the procedure an operator follows to execute correctly on a specific platform. Let's look at two examples:
JavaCountOperator<Type> extends CountOperator<Type> and implements the JavaExecutionOperator interface. On the Java platform, the evaluate method gives this execution operator its behavior; notice from the code extract that the operator uses a Java Collection channel and, after casting the channel to a Collection, uses the standard Collection.size method to get the result:
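In the spirit of that extract, here is a hedged sketch of the Java-side behavior: the incoming channel carries a Java Collection, so counting reduces to Collection.size(). The channel and class shapes are simplified assumptions, not the real Wayang classes:

```java
import java.util.Collection;

// Simplified stand-in for a Java Collection channel.
class JavaChannel {
    final Collection<?> collection;
    JavaChannel(Collection<?> c) { this.collection = c; }
}

class JavaCountOperator<Type> {
    // Read the channel's Collection and count it via size().
    long evaluate(JavaChannel input) {
        return input.collection.size();
    }
}
```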
On the other hand, FlinkCountOperator<Type> extends CountOperator<Type> and implements the FlinkExecutionOperator interface. The code extract shows that in this case the channel must be a Flink DataSetChannel, and the operation is equally trivial, returning the result of Flink's DataSet.count method.
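The Flink counterpart follows the same shape: the channel carries a dataset, and counting delegates to its count() method. Since a real Flink DataSet needs a running Flink environment, the sketch below uses a hypothetical DataSetStub so it stays self-contained; every name here is illustrative:

```java
// Stand-in for a Flink DataSet (hypothetical, keeps the sketch runnable).
class DataSetStub<Type> {
    final long size;
    DataSetStub(long size) { this.size = size; }
    long count() { return size; }
}

// Simplified stand-in for a Flink dataset channel.
class FlinkChannel<Type> {
    final DataSetStub<Type> dataSet;
    FlinkChannel(DataSetStub<Type> d) { this.dataSet = d; }
}

class FlinkCountOperator<Type> {
    // Delegate counting to the dataset, as Flink's DataSet.count() would.
    long evaluate(FlinkChannel<Type> input) {
        return input.dataSet.count();
    }
}
```

Note how both execution operators fulfill the same CountOperator contract (elements in, one long out) while reading from platform-specific channels; that is exactly what makes the Blossom operator class platform-independent.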
The execution operators of a Blossom operator are all the implementation alternatives through which that operator can be included in an executable plan. To decide which alternative is most efficient for a given optimization goal, Scalytics Connect compares estimates of the resources required by the different execution operators on the available platforms.
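A minimal sketch of that selection step, under assumed cost models: each execution alternative reports an estimated cost for a given input cardinality, and the optimizer picks the cheapest. The cost formulas and names below are invented for illustration only:

```java
import java.util.Comparator;
import java.util.List;

interface ExecutionAlternative {
    String platform();
    double estimatedCost(long inputCardinality);
}

class JavaAlternative implements ExecutionAlternative {
    public String platform() { return "Java"; }
    // Assumed model: cheap to start, but single-threaded and linear.
    public double estimatedCost(long n) { return 1.0 * n; }
}

class FlinkAlternative implements ExecutionAlternative {
    public String platform() { return "Flink"; }
    // Assumed model: fixed startup overhead, then parallel processing.
    public double estimatedCost(long n) { return 10_000 + 0.1 * n; }
}

class Optimizer {
    // Pick the alternative with the lowest estimated cost for this input size.
    static ExecutionAlternative pick(List<ExecutionAlternative> alts, long n) {
        return alts.stream()
                   .min(Comparator.comparingDouble(
                           (ExecutionAlternative a) -> a.estimatedCost(n)))
                   .orElseThrow();
    }
}
```

With these toy models, small inputs favor the low-overhead Java alternative while large inputs amortize Flink's startup cost, which mirrors the trade-off the optimizer reasons about.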
In the next part of this tutorial, we will show how Scalytics Connect manages to optimize a plan, what pieces of code must be provided in our plugins to allow this, and a real-world example of a custom executor that includes a Postgres platform.
We enable you to make data-driven decisions in minutes, not days
Scalytics Connect delivers unmatched flexibility, seamless integration with all your AI and data tools, and an easy-to-use platform that frees you to focus on building high-performance data architectures to fuel your AI innovation.
Scalytics is powered by Apache Wayang, and we're proud to support the project. You can check out their public GitHub repo right here. If you're enjoying our software, show your love and support - a star ⭐ would mean a lot!
If you need professional support from our team of industry-leading experts, you can always reach out to us via Slack or Email.