Machine Learning (ML) is a branch of Artificial Intelligence (AI). The main idea of ML is to enable systems to learn from historical data in order to predict output values for new inputs, without being explicitly programmed for the task. With the growing volumes of data in today’s world, ML has gained unprecedented popularity; we achieve today what was unimaginable yesterday, from predicting cancer risk from mammogram images and patient records to polyglot AI translators. As a result, ML has become a key competitive differentiator for many companies, and ML-powered software has quickly become omnipresent in our lives. At the core of ML lies its dependence on data: the more data available, the more accurate the predictive models it can build [1].
The Rise of Distributed ML
While ML has become a powerful technology, its hunger for training data makes it hard to build ML models on a single machine. It is not unusual to see training datasets of hundreds of gigabytes to terabytes, such as in the Earth Observation domain. This has created the need to build ML models over distributed data, stored on multiple nodes, possibly across the globe.
Distributed ML trains models using multiple compute nodes, both to cope with larger training datasets and to improve performance and model accuracy [1, 2]. It thus enables organizations and individuals to draw meaningful conclusions from vast amounts of distributed training data. Healthcare, enterprise data analytics, and advertising are among the sectors that benefit most from distributed ML.
There are two fundamental ways to perform distributed ML: data parallelism and model parallelism.

In the data parallelism approach, the system horizontally partitions the input training data, usually creating as many partitions as there are available compute nodes (workers), and distributes each partition to a different worker. It then sends the same model (with the full set of features) to every worker, which in turn learns a local model using its data partition as input. The workers send their local models to a central location, where the system merges them into a single global model.
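As a minimal sketch of data parallelism, suppose the "model" each worker learns is simply a sample mean (standing in for any locally trained model); the coordinator then merges the local models with a size-weighted average. Real systems average gradients or model parameters instead, but the shape of the computation is the same:

```python
import statistics

def train_local(partition):
    # Each worker learns a "model" (here: the sample mean) from
    # its own horizontal data partition, plus the partition size.
    return (statistics.fmean(partition), len(partition))

def merge(local_models):
    # The coordinator merges local models into one global model,
    # weighting each by the size of its partition.
    total = sum(n for _, n in local_models)
    return sum(m * n for m, n in local_models) / total

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
workers = [data[0:2], data[2:4], data[4:6]]  # one partition per worker
global_model = merge([train_local(p) for p in workers])
print(global_model)  # 3.5, identical to training on all the data at once
```

Because the merge is size-weighted, the global model here is exactly what a single machine would have computed over the full dataset.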
The model parallelism approach, in contrast, partitions the model features and sends each model partition to a different worker, which in turn builds a partial model using the same input data. That is, the entire input training data is replicated across all workers. The system then brings these partial models to a central location and aggregates them into a single global model.
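A minimal sketch of model parallelism, assuming a linear model whose features and weights are partitioned column-wise across two workers: each worker computes a partial score over its own feature slice, and the coordinator sums the contributions:

```python
# Full feature vector and model weights, split column-wise across workers.
x = [1.0, 2.0, 3.0, 4.0]
w = [0.5, -1.0, 2.0, 0.25]

def partial_score(x_slice, w_slice):
    # Each worker contributes the dot product over its feature partition.
    return sum(xi * wi for xi, wi in zip(x_slice, w_slice))

slices = [(x[:2], w[:2]), (x[2:], w[2:])]  # two feature partitions
score = sum(partial_score(xs, ws) for xs, ws in slices)
print(score)  # 5.5, the same as the full dot product of x and w
```

The sum of partial dot products equals the full dot product, which is why a linear model's features can be partitioned this way without changing the result.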
Although powerful, distributed ML rests on a core assumption that limits its applicability: that one has control over, and access to, the entire training data. In an increasingly large number of cases, such as in the healthcare domain, one cannot have direct access to raw data, and distributed ML therefore cannot be applied.
The Emergence of Federated Learning
The concept of federated learning (FL) was first introduced by Google in 2017 [3], yet the concept of federated analytics and databases dates back to the 1980s [4]. Like federated databases, FL aims to bring computation to where the data is.
Federated learning (FL) is, at its core, a distributed ML approach. Unlike traditional distributed ML, however, raw data at the different workers never leaves them. The workers own the data; they alone control it and have direct access to it. Generally speaking, FL allows gaining experience from a more diverse set of datasets at independent, autonomous locations.
Ensuring data privacy is crucial in today’s world; societal awareness of it is rising as one of the main concerns of our increasingly data-driven world. Many governments, for example, have enacted laws such as the GDPR [5] and the CCPA [6] to control how data is stored and processed. FL enables organizations and individuals to train ML models across multiple autonomous parties without compromising data privacy, since the node where the data is stored remains its sole owner. During training, participants share only their learned local models, never their sensitive raw data, and can thereby leverage each other’s data to learn more robust ML models than with their own data alone.
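The training loop described above can be sketched as a toy federated-averaging round, assuming a one-parameter model and a squared-error objective; all names and hyperparameters here are illustrative, not the API of any particular FL framework:

```python
def local_update(w, private_data, lr=0.1, epochs=5):
    # Each participant refines the shared model on its own data.
    # The raw data never leaves its owner; only w is returned.
    for _ in range(epochs):
        for x in private_data:
            w -= lr * 2 * (w - x)  # gradient step on (w - x)^2
    return w

def federated_round(w, participants):
    # Only the learned local models cross the wire, never the data.
    local_models = [local_update(w, data) for data in participants]
    return sum(local_models) / len(local_models)

# Two participants with disjoint private datasets.
w = 0.0
for _ in range(20):
    w = federated_round(w, [[1.0, 2.0], [8.0, 9.0]])
print(w)  # converges near the global mean of all private data
```

Neither participant could learn a model near the global optimum from its own data alone; by exchanging only model parameters, both benefit from the other's data without ever seeing it.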
The beauty of FL is that it enables organizations and individuals to collaborate towards a common goal without sacrificing data privacy.
FL also leverages the two fundamental execution modes to build models across multiple participants: horizontal FL (analogous to data parallelism) and vertical FL (analogous to model parallelism).
What are the problems today?
The research and industry communities have already started to provide multiple systems in the arena of federated learning. TensorFlow Federated [7], Flower [8], and OpenFL [9] are just a few examples of such systems. All these systems allow organizations and individuals (users) to deploy their ML tasks in a simple and federated way using a single system interface.
Yet several open problems remain that these solutions do not tackle, such as preserving data privacy, model debugging, reducing wall-clock training time, and reducing the size of the trained model. All are important, but one is of crucial importance: supporting end-to-end pipelines.
Currently, users must know several big data systems well to be able to create their end-to-end pipelines, with technical knowledge ranging from data preparation techniques all the way to ML algorithms. They also need solid coding skills to put all the pieces (systems) together into an end-to-end pipeline. Federated learning only exacerbates the problem.
How does Scalytics Federated solve these problems?
Scalytics Federated is our federated AI and distributed data processing platform, built to let teams design and operate complete end-to-end pipelines without dealing with the complexity of multi-platform execution. It covers the full analytics lifecycle and executes pipelines in a federated way across clouds, data centers, and heterogeneous processing engines. Users focus solely on the logic of their applications; Scalytics Federated takes care of optimization, system decisions, and deployment workflows. The platform provides two straightforward interfaces: Python (FedPy) for data scientists and a graphical pipeline environment (FedUX) for broader engineering and operations teams.
The internal architecture of Scalytics Federated (formerly Blossom Sky)
Scalytics Federated lets users develop federated pipelines rapidly and execute them with high performance. Users specify their dataflows in either FedPy or FedUX; the platform then executes those pipelines in a federated way across any available data processing system, including hybrid, multi-cloud, and on-premises environments. Once defined, a pipeline is deployed across the most suitable combination of processing engines and environments, a selection driven by performance characteristics, resource availability, and execution constraints.

In short, Scalytics Federated gives users a simple way to develop federated data analytics that execute fast.
Code a pipeline with Apache Wayang
WordCount is a canonical example application for Apache Wayang [10], the first cross-platform data processing system, which our team originally created and contributed to the Apache Software Foundation. Scalytics Federated builds on this foundation and allows users to focus on their data tasks rather than on manually deciding which platform (for example, Java Streams or Spark) should execute each part of the pipeline. The platform automatically determines the execution plan based on characteristics such as dataset size, engine capabilities, and cluster capacity, powered by an AI-supported optimizer that handles cross-platform decisions end-to-end.
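For illustration, here is a plain-Python stand-in for the logical WordCount plan (flatMap a line into tokens, map each token to lowercase, reduce by key). This is deliberately not the Wayang API; in Wayang, the same logical plan is written once and executed unchanged on Java Streams, Spark, or another engine, depending on what the optimizer selects:

```python
import re
from collections import Counter

def word_count(lines):
    # Logical plan: flatMap(tokenize) -> map(lowercase) -> reduceByKey(sum)
    tokens = (t.lower() for line in lines for t in re.findall(r"\w+", line))
    return Counter(tokens)

counts = word_count(["To be or not to be", "that is the question"])
print(counts["to"], counts["be"])  # 2 2
```

The point of a cross-platform system is precisely that this logical plan stays the same while the physical execution engine changes underneath it.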

Scalytics Federated is an AI-driven data processing and query optimizer
At the core of Scalytics Federated is Apache Wayang, extended with our AI-driven optimizer to unify and optimize heterogeneous, federated data pipelines. The system can also evaluate and select the most suitable cloud provider and data processing engine for each stage of a pipeline. Scalytics Federated uses an intermediate representation between applications and the underlying processing platforms, which enables flexible composition of pipelines across multiple engines. The optimizer not only compiles and distributes pipeline fragments but also determines the best execution strategy to improve runtime and cost efficiency, including intelligent data movement across environments when required.
Scalytics Federated’s cross-platform executor
The platform includes a cloud-native executor that deploys and runs federated analytics pipelines across any supported cloud, cluster, or hybrid environment. Engineering teams can explicitly choose where workloads run or let the system make cost- and performance-aware decisions. The executor manages all data transfers required between cloud providers and processing platforms: while the optimizer decides what must move, the executor ensures that movement is efficient, secure, and aligned with budget and resource constraints.
Conclusion
Because of its architecture, optimizer, and executor, Scalytics Federated provides a complete federated lakehouse analytics framework from the start:
- Heterogeneous data sources: Scalytics Federated processes data over multiple, diverse sources in a seamless and unified manner.
- Multi-platform and hybrid-cloud execution: Each pipeline stage is dispatched to the most suitable cloud provider or processing engine to improve performance and reduce cost.
- Federated machine learning and AI: The platform includes mechanisms for federated model training and distributed AI, including a parameter server architecture for coordinated updates.
- Ease of use: Users concentrate on the analytical logic; Scalytics Federated handles optimization, deployment, and execution across distributed environments.
References
[1] Alon Y. Halevy, Peter Norvig, Fernando Pereira: The Unreasonable Effectiveness of Data. IEEE Intell. Syst. 24(2): 8-12 (2009).
[2] Diego Peteiro-Barral, Bertha Guijarro-Berdiñas: A survey of methods for distributed machine learning. Prog. Artif. Intell. 2(1): 1-11 (2013).
[3] Brendan McMahan, Daniel Ramage: Federated Learning: Collaborative Machine Learning without Centralized Training Data. Google AI Blog. April 6, 2017.
[4] Dennis Heimbigner, Dennis McLeod: A Federated Architecture for Information Management. ACM Trans. Inf. Syst. 3(3): 253-278 (1985).
[5] General Data Protection Regulation (GDPR): https://gdpr-info.eu/
[6] California Consumer Privacy Act (CCPA): https://oag.ca.gov/privacy/ccpa
[7] TensorFlow Federated: https://www.tensorflow.org/federated
[8] Flower: https://flower.dev/
[9] OpenFL: https://www.openfl.org/
[10] Apache Wayang: https://wayang.apache.org/
About Scalytics
Scalytics Federated provides federated data processing across Spark, Flink, PostgreSQL, and cloud-native engines through a single abstraction layer. Our cost-based optimizer selects the right engine for each operation, reducing processing time while eliminating vendor lock-in.
Scalytics Copilot extends this foundation with private AI deployment: running LLMs, RAG pipelines, and ML workloads entirely within your security perimeter. Data stays where it lives. Models train where data resides. No extraction, no exposure, no third-party API dependencies.
For organizations in healthcare, finance, and government, this architecture isn't optional; it's how you deploy AI while remaining compliant with HIPAA, GDPR, and DORA. Explore our open-source foundation: Scalytics Community Edition
Questions? Reach us on Slack or schedule a conversation.
