Distributed Systems and Federated AI Research

Enabling Unified Data Systems Integration via Wayang

SIGMOD Demo 2025

Apache Wayang is an open-source framework that unifies data analytics across disparate data sources and heterogeneous data systems. It decouples applications from underlying systems and provides an optimizer for better performance.

Grounding, Trust, and Relevance in LLM Reasoning

ZENODO (CERN) 2025

This paper addresses these challenges by introducing DSO-Agent (Distributed Streaming Orchestration Agent), a novel framework designed to enhance the reliability and verifiability of agentic reasoning when utilizing open LLMs and combinations of those (such as Mistral, Llama, Gemma, and DeepSeek models).

Apache Wayang: A Unified Data Analytics Framework

SIGMOD 2023

Explore Apache Wayang, an open-source data analytics framework that unites various data processing platforms, optimizing performance and reducing costs. Dive into the paper for insights on Wayang’s architecture and its seamless, integrated user experience.

P2A: Framework for Optimizing Data Science Pipelines

SIGMOD 2023

Our approach identifies DBMS-supported operations and translates them into SQL, leveraging DBMSes to accelerate data science workloads. The optimization target is twofold: First, to improve data loading, by reducing the amount of data to be transferred between runtimes.

Artificial intelligence to advance Earth observation

arXiv 2023

Earth observation (EO) is a prime instrument for monitoring land and ocean processes, studying the dynamics at work, and taking the pulse of our planet. This article gives a bird's eye view of the essential scientific tools and approaches informing and supporting the transition from raw EO data to usable EO-based information.

Navigating Compliance in Federated Data Processing

IEEE Data Engineering 2022

The processing of geo-distributed data is subject to data transfer regulations. In this paper, we present our work on a federated data processing system that can comply with these regulations. We also present research challenges and opportunities for the system to make compliance a truly first-class citizen.

ML-based Cross-Platform Query Optimization

ICDE 2020

Cost-based optimization is widely known to suffer from a major weakness: administrators spend a significant amount of time tuning the associated cost models. This problem is only exacerbated in cross-platform settings, as there are many more parameters that need to be tuned.

RHEEMix in the Data Jungle

VLDB Journal 2020

Data analytics are moving beyond the limits of a single platform. In this paper, we present the cost-based optimizer of Rheem, an open-source cross-platform system that copes with these new requirements.

Optimizing Cross-Platform Data Movements

ICDE 2019

Data analytics are moving beyond the limits of a single data processing platform. A cross-platform query optimizer is necessary to enable applications to run their tasks over multiple platforms efficiently and in a platform-agnostic manner.

Simplified Big Data Debugging for Dataflow Jobs

SoCC 2019

Although big data processing has become dramatically easier over the last decade, there has not been matching progress in big data debugging. It is estimated that users spend more than 50% of their time debugging their big data applications.

Enabling Cross-Platform Data Processing

VLDB 2018

Solving business problems increasingly requires going beyond the limits of a single data processing platform (platform for short), such as Hadoop or a DBMS. As a result, organizations typically perform tedious and costly tasks to juggle their code and data across different platforms.

Cross-Platform Data Analytics Made Easy

ICDE 2018

Many of today’s applications need several data processing platforms for complex analytics. Thus, recent systems have taken steps towards supporting cross-platform data analytics. Yet, current cross-platform systems lack ease of use, which is crucial for their adoption.

Challenges of Cross-Platform Data Processing

ICDE 2018

There is a zoo of data processing platforms which help users and organizations to extract value out of their data. Although each of these platforms excels in specific aspects, users typically end up running their data analytics on suboptimal platforms.

Building your Cross-Platform Application with RHEEM

CoRR 2018

Today, organizations typically perform tedious and costly tasks to juggle their code and data across different data processing platforms. Addressing this pain and achieving automatic cross-platform data processing is quite challenging because it requires considerable expertise in all the available data processing platforms.

Fast and scalable inequality joins

VLDB Journal 2017

Inequality joins, which join relations on inequality conditions, are used in various applications. Optimizing joins has been the subject of intensive research, ranging from efficient join algorithms such as sort-merge join to the use of efficient indices such as B+-tree, R*-tree, and Bitmap.

A Cost-based Optimizer for Gradient Descent Optimization

SIGMOD 2017

As the use of machine learning (ML) permeates diverse application domains, there is an urgent need to support a declarative framework for ML. Ideally, a user will specify an ML task in a high-level and easy-to-use language and the framework will invoke the appropriate algorithms and system configurations to execute it.

Road to Freedom in Big Data Analytics

EDBT 2016

The world is fast moving towards a data-driven society where data is the most valuable asset. Organizations need to perform very diverse analytic tasks using various data processing platforms. In doing so, they face many challenges; chiefly, platform dependence, poor interoperability, and poor performance when using multiple platforms.

Enabling Multi-Platform Task Execution

SIGMOD 2016

Many emerging applications, from domains such as healthcare and oil & gas, require several data processing systems for complex analytics. This demo paper showcases Rheem, a framework that provides multi-platform task execution for such applications.

BigDansing: A System for Big Data Cleansing

SIGMOD 2015

Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions.

Lightning Fast and Space Efficient Inequality Joins

VLDB 2015

Inequality joins, which join relational tables on inequality conditions, are used in various applications. A wide range of optimization methods exists for joins in database systems, from algorithms such as sort-merge join and band join to various indices such as B+-tree, R*-tree, and Bitmap.

Research at Scalytics

We are researchers with roots in the real world