Federated Learning (Part II): The Blossom Framework

February 14, 2022
Dr. Jorge Quiané

This is the second post in our Federated Learning (FL) series. In our previous post, we introduced FL as a distributed machine learning (ML) approach in which the raw data held by each worker never leaves that worker. We now dive into Databloom Blossom, a federated data lakehouse analytics framework that provides a solution for federated learning.

The research and industry communities have already started to provide multiple systems in the arena of federated learning. TensorFlow Federated [1], Flower [2], and OpenFL [3] are just a few examples of such systems. All these systems allow organizations and individuals (users) to deploy their ML tasks in a simple and federated way using a single system interface.

What Is the Problem?

Yet, these solutions leave several open problems untackled, such as preserving data privacy, debugging models, reducing wall-clock training times, and reducing the size of trained models. All of them matter, but one is of crucial importance: supporting end-to-end pipelines. Currently, users must have good knowledge of several big data systems to be able to create their end-to-end pipelines. They must know everything from data preparation techniques to ML algorithms. Furthermore, users must also have good coding skills to put all the pieces (systems) together in a single end-to-end pipeline. The FL setting only exacerbates the problem.

Databloom Blossom Overview

Blossom Sky is a Federated Data Lakehouse Analytics platform that helps users build their end-to-end federated pipelines. Blossom covers the entire spectrum of analytics in end-to-end pipelines and executes them in a federated way. In particular, with Blossom, users can focus solely on the logic of their applications instead of worrying about system, execution, and deployment details.

Figure 1: Blossom Sky general architecture

Figure 1 illustrates the general architecture of Blossom. Overall, Blossom Sky comes with two simple interfaces for users to develop their pipelines: Python (FedPy) for data scientists and a graphical dashboard (FedUX) for users in general.

Blossom Sky lets users develop their federated data analytics with little effort and execute them fast.

In more detail, users specify their pipelines using either of these two interfaces, and Blossom Sky, in turn, runs them in a federated fashion on any cloud provider and data processing platform.

Listing 1: WordCount program in Blossom Sky

Listing 1 above shows a simple WordCount application in Blossom. The first three lines let the user register the platforms that Blossom may use (Java and Spark in our example). The remaining lines of code are the actual WordCount program. The beauty of Blossom is that the user does not have to decide which data processing platform (Java or Spark) the program runs on. Blossom decides the actual execution based on the characteristics of the input dataset and of the processing platforms (such as the size of the input dataset and the size of the Spark cluster). It does so via an AI-powered cross-platform optimizer and executor.
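As a rough illustration of the idea described above, the following plain-Python sketch separates the WordCount logic from the choice of platform. It is not the Blossom API: `choose_platform`, `word_count`, and the `SMALL_INPUT_LINES` threshold are hypothetical names and values invented for this example, and the real optimizer is described as AI-powered rather than a fixed rule.

```python
from collections import Counter

# Hypothetical threshold: below this many lines we assume a single-process
# engine is cheaper than starting a cluster job.
SMALL_INPUT_LINES = 10_000


def choose_platform(num_lines, cluster_workers):
    """Toy stand-in for the platform decision, based on input and cluster size."""
    if num_lines < SMALL_INPUT_LINES or cluster_workers == 0:
        return "java"
    return "spark"


def word_count(lines):
    """Platform-independent WordCount logic: split, lowercase, count."""
    counts = Counter()
    for line in lines:
        counts.update(word.lower() for word in line.split())
    return dict(counts)


lines = ["Blossom runs pipelines", "pipelines run on Blossom"]
platform = choose_platform(len(lines), cluster_workers=8)
result = word_count(lines)
```

The point of the split is that `word_count` never mentions a platform. Only the separate decision function looks at the input size and the cluster size.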

AI-Powered Query Optimizer

At the core of Blossom is Apache Wayang [4], the first cross-platform data processing system. Blossom leverages Apache Wayang and equips it with AI to unify and optimize heterogeneous (federated) data pipelines, and to select the right cloud provider and data processing platform to run the resulting pipelines. As a result, users can seamlessly run general data analytics and AI together on any data processing platform. Blossom’s optimizer provides an intermediate representation between applications and processing platforms, which allows it to flexibly compose users’ pipelines from multiple processing platforms. Besides translating users’ pipelines to the underlying processing platforms, the optimizer decides how to execute a pipeline so that its runtime is reduced, as well as how to move data from one processing platform (or cloud provider) to another.
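To make the optimizer's job concrete, here is a minimal sketch of cost-based platform assignment for a linear pipeline. It assumes each operator has an estimated cost per platform and that moving data between two different platforms adds a fixed penalty. The cost numbers, `OPERATOR_COST`, `MOVE_COST`, `plan_cost`, and `best_plan` are all made up for illustration, and the exhaustive search stands in for whatever search strategy the real optimizer uses.

```python
from itertools import product

# Hypothetical cost estimates (in seconds) for each operator on each platform.
OPERATOR_COST = {
    "read":  {"java": 0.5, "spark": 3.0},
    "split": {"java": 8.0, "spark": 2.0},
    "count": {"java": 9.0, "spark": 2.5},
    "write": {"java": 0.5, "spark": 1.5},
}
# Assumed fixed cost of moving data between two different platforms.
MOVE_COST = 2.0


def plan_cost(operators, assignment):
    """Total cost = operator costs + one move cost per platform switch."""
    cost = sum(OPERATOR_COST[op][p] for op, p in zip(operators, assignment))
    switches = sum(1 for a, b in zip(assignment, assignment[1:]) if a != b)
    return cost + switches * MOVE_COST


def best_plan(operators, platforms=("java", "spark")):
    """Enumerate every assignment and keep the cheapest one."""
    candidates = product(platforms, repeat=len(operators))
    return min(candidates, key=lambda a: plan_cost(operators, a))


pipeline = ["read", "split", "count", "write"]
plan = best_plan(pipeline)
```

With these numbers, the cheapest plan reads the input on one platform and then switches once, because the saving on the first operator outweighs the single data movement. Exhaustive enumeration grows exponentially with pipeline length, so it only serves to show the trade-off.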

Cross-Data-Platform Executor

Blossom Sky also comes with a cloud-native executor that allows users to deploy their federated data analytics on any cloud provider and data processing platform. They can choose their preferred cloud provider or data processing platform or let Blossom select the best cloud provider or data processing platform based on their time and monetary budget. In both cases, Blossom deploys and executes users’ federated pipelines on their behalf. More importantly, the executor takes care of any data transfer that must occur among cloud providers and data processing platforms. While the optimizer decides which data must be moved, the executor ensures the efficient movement of the data among different providers and data processing platforms.
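The budget-based selection described above can be pictured with a small sketch. It assumes the system has time and cost estimates for each provider/platform combination. The `Option` class, the `select_option` function, the provider names, and all numbers are hypothetical, and picking the cheapest feasible option is just one plausible policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    provider: str        # placeholder provider name, not a real product
    platform: str
    est_hours: float     # assumed runtime estimate
    est_cost_usd: float  # assumed monetary estimate


def select_option(options, max_hours, max_cost_usd):
    """Return the cheapest option within both budgets, or None if none fits."""
    feasible = [
        o for o in options
        if o.est_hours <= max_hours and o.est_cost_usd <= max_cost_usd
    ]
    if not feasible:
        return None
    # Cheapest first; ties broken by shorter estimated runtime.
    return min(feasible, key=lambda o: (o.est_cost_usd, o.est_hours))


options = [
    Option("cloud_a", "spark", est_hours=2.0, est_cost_usd=40.0),
    Option("cloud_b", "spark", est_hours=1.0, est_cost_usd=90.0),
    Option("cloud_c", "java", est_hours=6.0, est_cost_usd=10.0),
]
choice = select_option(options, max_hours=3.0, max_cost_usd=50.0)
```

A user who pins a provider or platform would simply pass a pre-filtered list of options to the same function.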

Blossom, a Federated Data Lakehouse Analytics Framework

Thanks to its design, optimizer, and executor, Blossom can provide a real federated data lakehouse analytics framework:

  • Heterogeneous Data Sources
    It can process data from (or over) multiple data sources in a seamless manner.
  • Multi-Platform and Hybrid Cloud Execution
    It automatically deploys each sub-part of a pipeline to the most relevant cloud provider and processing platform in a seamless manner to reduce costs and improve performance.
  • Federated Machine Learning and AI
    It comes with its own framework (including a parameter server) to run pipelines in a federated manner.
  • Ease of Use
    It allows users to focus on the logic of their applications by taking care of how to optimize, deploy, and execute their pipelines.
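For readers unfamiliar with the parameter-server pattern mentioned in the list above, the following sketch shows federated averaging on a toy one-parameter model. It does not reflect Blossom's actual implementation: `ParameterServer`, `local_update`, the learning rate, and the data are assumptions chosen to keep the example small. The property it illustrates is the one that defines FL: workers send model weights and example counts, never raw data.

```python
class ParameterServer:
    """Holds the global model and averages the workers' updates."""

    def __init__(self, initial_weights):
        self.weights = list(initial_weights)

    def aggregate(self, updates):
        """updates: list of (weights, num_examples) pairs sent by workers."""
        total = sum(n for _, n in updates)
        self.weights = [
            sum(w[i] * n for w, n in updates) / total
            for i in range(len(self.weights))
        ]
        return self.weights


def local_update(weights, data, lr=0.1):
    """One gradient step for the model y = w * x on a worker's private data."""
    w = weights[0]
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return [w - lr * grad], len(data)


# Each worker's raw (x, y) pairs stay local; both follow y = 2 * x.
worker_data = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
]

server = ParameterServer([0.0])
for _ in range(20):
    updates = [local_update(server.weights, data) for data in worker_data]
    server.aggregate(updates)
```

After a few rounds the global weight approaches 2, the slope shared by both workers' data, even though the server never saw a single data point.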


[1] TensorFlow Federated: https://www.tensorflow.org/federated
[2] Flower: https://flower.dev/
[3] OpenFL: https://www.openfl.org/
[4] Apache Wayang: https://wayang.apache.org/
