Cross-Platform Analytics Benchmark: 10x Performance Gains

Dr. Zoi Kaoudi

Our federated data processing and analytics engine, Scalytics Federated, is a cross-platform optimizer that can seamlessly orchestrate multiple execution backends, including Postgres, Spark, Flink, Java Streams, and Python. In our benchmarks, Scalytics Federated delivers strong performance across several representative workloads: a complex relational query over dispersed data, a large-scale machine learning task, and a classical big data analytics job.

Scalytics Federated is built on Apache Wayang, the open source cross-platform data processing system originally created by our team. It enables data-agnostic applications and decentralized data processing, which are the foundations of federated learning and modern distributed analytics.
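
To make this concrete, the sketch below shows how an application built on the open-source Apache Wayang Java API registers several execution backends in a single context. The class and plugin names follow the upstream Wayang modules (the Flink entry point is our assumption); this is an illustrative sketch, not Scalytics-specific code. The cross-platform optimizer later decides which registered platform runs each operator.

import org.apache.wayang.core.api.Configuration;
import org.apache.wayang.core.api.WayangContext;
import org.apache.wayang.flink.Flink;
import org.apache.wayang.java.Java;
import org.apache.wayang.spark.Spark;

public class MultiPlatformContext {
    public static void main(String[] args) {
        // Register several execution backends with one context. Plans built
        // against this context are platform-agnostic: the optimizer decides,
        // per operator, which of the registered platforms runs it.
        WayangContext context = new WayangContext(new Configuration())
                .withPlugin(Java.basicPlugin())    // single-node Java Streams
                .withPlugin(Spark.basicPlugin())   // Apache Spark
                .withPlugin(Flink.basicPlugin());  // Apache Flink (assumed entry point)

        // Plans are then built against this context; see the WordCount sketch
        // later in this article for a complete example.
    }
}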

Dealing with Dispersed Data: Running a Relational Query Across Multiple Stores 

In this use case we address a common reality in enterprises: critical relational data is split across different systems. Instead of forcing teams to move everything into a single warehouse, Scalytics Federated runs a single query across the existing stores.

Datasets

We use the standard TPC-H benchmark [1], whose schema comprises eight relations. The six relations referenced by our query are distributed across systems as follows for this experiment:

  • lineitem and orders are stored in HDFS
  • customer, supplier, and region are stored in Postgres
  • nation resides in S3 or the local file system

Dataset sizes range from 1 GB to 100 GB to test scalability and robustness.

Query / task

We evaluate performance on TPC-H query 5, which joins all six relations and aggregates revenue per nation for a given region and year:

SELECT N_NAME, SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT)) AS REVENUE
FROM CUSTOMER, ORDERS, LINEITEM, SUPPLIER, NATION, REGION
WHERE C_CUSTKEY = O_CUSTKEY
  AND L_ORDERKEY = O_ORDERKEY
  AND L_SUPPKEY = S_SUPPKEY
  AND C_NATIONKEY = S_NATIONKEY
  AND S_NATIONKEY = N_NATIONKEY
  AND N_REGIONKEY = R_REGIONKEY
  AND R_NAME = 'ASIA'
  AND O_ORDERDATE >= DATE '1994-01-01'
  AND O_ORDERDATE < DATE '1994-01-01' + INTERVAL '1' YEAR
GROUP BY N_NAME
ORDER BY REVENUE DESC

Baselines

We compare Scalytics Federated against two widely used relational systems:

  • Apache Spark
  • Postgres

For a fair baseline, we load all datasets into each system under test (Spark or Postgres) and then execute the query there.

Results

Figure 1 shows the execution time in seconds for TPC-H query 5. The data transfer time required to move datasets into Spark or Postgres is not included in those measurements. As the figure shows, Scalytics Federated significantly outperforms Postgres while achieving runtimes close to Spark. In practice, however, Spark required additional time to extract the datasets from Postgres and move them into the cluster before execution could start.

Scalytics Federated achieves these results by combining Postgres and Spark intelligently. The optimizer performs selections and projections on the data that resides in Postgres, reducing the volume of data that needs to be moved. It then joins the resulting data with the relations in HDFS, placing the expensive join between lineitem and supplier on Spark to leverage distributed computation. All of this happens without the user having to specify where each operation should run.
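
To illustrate what pushing selections and projections into Postgres means in practice, here is a plain-JDBC sketch (hypothetical connection details, TPC-H column names) of the kind of reduced subquery that gets shipped to the database, so that only the few columns and rows the query actually needs ever leave Postgres. Scalytics Federated generates and places such sub-plans automatically; this sketch only spells out the idea.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class PushdownSketch {

    /** The narrow intermediate result that actually leaves Postgres. */
    record ReducedSupplier(long suppKey, long nationKey, String nationName) {}

    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; requires the Postgres JDBC driver.
        String url = "jdbc:postgresql://localhost:5432/tpch";

        // Only the projected columns of the rows surviving the ASIA filter are
        // transferred, instead of the full SUPPLIER/NATION/REGION tables.
        String pushedDownSubquery =
                "SELECT s.s_suppkey, s.s_nationkey, n.n_name " +
                "FROM supplier s, nation n, region r " +
                "WHERE s.s_nationkey = n.n_nationkey " +
                "  AND n.n_regionkey = r.r_regionkey " +
                "  AND r.r_name = ?";

        List<ReducedSupplier> reduced = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(url, "tpch", "tpch");
             PreparedStatement stmt = conn.prepareStatement(pushedDownSubquery)) {
            stmt.setString(1, "ASIA");
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    reduced.add(new ReducedSupplier(
                            rs.getLong("s_suppkey"),
                            rs.getLong("s_nationkey"),
                            rs.getString("n_name")));
                }
            }
        }

        // This small intermediate result (plus a similarly reduced customer
        // projection) is what gets joined with lineitem and orders from HDFS
        // on Spark in the distributed phase of the plan.
        System.out.println("Rows shipped out of Postgres: " + reduced.size());
    }
}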

With Scalytics Federated, organizations can execute relational analytics directly across their existing systems instead of restructuring everything around a single database.

Figure 1: TPC-H query 5 execution times for Scalytics Federated, Apache Spark, and Postgres.

Reducing Execution Costs for Machine Learning Tasks Using Multiple Systems

In this case study we explore how Scalytics Federated can reduce runtime and execution cost for machine learning workloads by leveraging multiple systems, even when all data initially resides in a single store.

We focus on stochastic gradient descent, a widely used algorithm for classification and regression.

Datasets
We use two real-world datasets from the UCI Machine Learning Repository:

  • higgs: ~11 million data points with 28 features each
  • rcv1: ~677 thousand data points with ~47 thousand features each

In addition, we construct a synthetic dataset of 88 million data points, each with 100 features, to stress the system.

Query / task
We train:

  • Classification models on higgs and on the synthetic dataset
  • A logistic regression model on rcv1

All three models use stochastic gradient descent with different loss functions:

  • Hinge loss (equivalent to a linear support vector machine) for the classification tasks
  • Logistic loss for the logistic regression task

Baselines
We compare Scalytics Federated against two popular machine learning libraries:

  • MLlib (Apache Spark)
  • SystemML (IBM)

Results
Figure 2 shows the runtime performance. For large datasets, Scalytics Federated outperforms both baselines by more than an order of magnitude. On the synthetic dataset, the largest of the three, neither MLlib nor SystemML completes the training task, while Scalytics Federated finishes.

Figure 2: Training runtime for Scalytics Federated, MLlib, and SystemML on the three datasets.

The key driver of this performance is the optimizer. It recognizes that a hybrid strategy works best: preprocessing and data preparation are executed with Spark, while the later stages of gradient descent operate on a much smaller sampled dataset that fits well on a single machine. Scalytics Federated then switches to local Java execution for that phase. This kind of cross-platform plan is difficult to design manually and requires deep experience in distributed systems. The optimizer in Scalytics Federated finds it automatically.
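
The sketch below illustrates only the single-machine phase of such a hybrid plan: a plain-Java hinge-loss SGD loop over an in-memory sample. It is a simplified illustration under the assumption that sampling and feature extraction already happened upstream (e.g., on Spark); it is not the Scalytics or Wayang implementation.

import java.util.List;
import java.util.Random;

public class LocalSgdSketch {

    /** One labeled, dense training example; label is -1 or +1. */
    record Example(double[] features, double label) {}

    /** Hinge-loss SGD over a small in-memory sample of the full dataset. */
    static double[] train(List<Example> sample, int epochs, double learningRate) {
        int dims = sample.get(0).features().length;
        double[] w = new double[dims];
        Random rnd = new Random(42);

        for (int epoch = 0; epoch < epochs; epoch++) {
            for (int step = 0; step < sample.size(); step++) {
                Example ex = sample.get(rnd.nextInt(sample.size()));
                double margin = ex.label() * dot(w, ex.features());
                if (margin < 1.0) {
                    // Sub-gradient step of the hinge loss: nudge w towards y * x.
                    for (int d = 0; d < dims; d++) {
                        w[d] += learningRate * ex.label() * ex.features()[d];
                    }
                }
            }
        }
        return w;
    }

    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int d = 0; d < a.length; d++) s += a[d] * b[d];
        return s;
    }

    public static void main(String[] args) {
        // Toy sample standing in for the data prepared and sampled upstream.
        List<Example> sample = List.of(
                new Example(new double[]{1.0, 2.0}, 1.0),
                new Example(new double[]{-1.5, -0.5}, -1.0));
        double[] model = train(sample, 10, 0.01);
        System.out.printf("w = [%.3f, %.3f]%n", model[0], model[1]);
    }
}

Swapping the hinge update for a logistic-loss gradient yields the variant used for the logistic regression task on rcv1.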

Optimizing Big Data Analytics by Adapting Platforms to Data and Task Characteristics 

In the third study we evaluate how Scalytics Federated adapts to data size and workload characteristics by switching between different execution platforms for a classic big data task.

Datasets
We use Wikipedia abstracts stored in HDFS, varying the dataset size from 1 GB up to 800 GB.

Query / task
We run WordCount, a widely used analytical task that counts the occurrences of each distinct word in a corpus. Variants of WordCount underpin text mining workloads such as term frequency analysis and word statistics used in search and NLP pipelines.

Baselines
We consider three platforms on which WordCount can be executed:

  • Apache Spark
  • Apache Flink
  • A single-node Java program

We then configure Scalytics Federated to automatically select among these platforms for each dataset size.
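
For concreteness, here is what expressing the task once looks like against the Apache Wayang Java API that Scalytics Federated builds on, adapted from the Wayang WordCount example. The HDFS path is a placeholder and the Flink plugin entry point is our assumption; no platform is hard-coded into the plan, so the optimizer is free to pick Java, Spark, or Flink for each run.

import java.util.Arrays;
import java.util.Collection;

import org.apache.wayang.api.JavaPlanBuilder;
import org.apache.wayang.basic.data.Tuple2;
import org.apache.wayang.core.api.Configuration;
import org.apache.wayang.core.api.WayangContext;
import org.apache.wayang.flink.Flink;
import org.apache.wayang.java.Java;
import org.apache.wayang.spark.Spark;

public class WordCountSketch {
    public static void main(String[] args) {
        // All three candidate platforms are registered; none appears in the plan.
        WayangContext context = new WayangContext(new Configuration())
                .withPlugin(Java.basicPlugin())
                .withPlugin(Spark.basicPlugin())
                .withPlugin(Flink.basicPlugin());

        JavaPlanBuilder planBuilder = new JavaPlanBuilder(context)
                .withJobName("WordCount")
                .withUdfJarOf(WordCountSketch.class);

        Collection<Tuple2<String, Integer>> wordCounts = planBuilder
                .readTextFile("hdfs://namenode/wikipedia/abstracts.txt")  // placeholder path
                .flatMap(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
                .filter(token -> !token.isEmpty())
                .map(word -> new Tuple2<>(word, 1))
                .reduceByKey(Tuple2::getField0,
                        (a, b) -> new Tuple2<>(a.getField0(), a.getField1() + b.getField1()))
                .collect();  // the optimizer picks the execution platform(s) here

        System.out.println("Distinct words: " + wordCounts.size());
    }
}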

Results
Figure 3 shows the runtime performance. Scalytics Federated consistently chooses the fastest available platform for each dataset size. For smaller inputs, the single-node Java program is often optimal. As the data grows, the optimizer shifts work to Flink or Spark where distributed execution pays off.

By modeling execution cost inside the optimizer, Scalytics Federated removes the need for users to guess or hard-code the “right” engine. There is no manual migration effort from one platform to another to gain performance or reduce cost. Users express their analytical tasks once; the system selects the best execution plan and platform combination.

Figure 3: WordCount runtime for Scalytics Federated, Apache Spark, Apache Flink, and single-node Java across dataset sizes.

Summary

Scalytics Federated is a federated data processing and analytics engine that leverages Apache Wayang to orchestrate multiple execution platforms. The benchmarks above demonstrate its ability to:

  • Run complex relational queries across dispersed data without centralizing everything in one system
  • Accelerate large-scale machine learning workloads by combining distributed and single-node execution
  • Adapt big data analytics to the most efficient platform for each dataset size and task

Instead of forcing all workloads onto a single engine, Scalytics Federated uses cost-based, cross-platform optimization to deliver predictable performance and efficiency on real enterprise workloads.

[1] TPC-H Benchmark, Transaction Processing Performance Council, https://www.tpc.org/tpch/

About Scalytics

Scalytics builds on Apache Wayang, the cross-platform data processing framework created by our founding team and now an Apache Top-Level Project. Where traditional platforms require moving data to centralized infrastructure, Scalytics brings compute to your data—enabling AI and analytics across distributed sources without violating compliance boundaries.

Scalytics Federated provides federated data processing across Spark, Flink, PostgreSQL, and cloud-native engines through a single abstraction layer. Our cost-based optimizer selects the right engine for each operation, reducing processing time while eliminating vendor lock-in.

Scalytics Copilot extends this foundation with private AI deployment: running LLMs, RAG pipelines, and ML workloads entirely within your security perimeter. Data stays where it lives. Models train where data resides. No extraction, no exposure, no third-party API dependencies.

For organizations in healthcare, finance, and government, this architecture isn't optional: it's how you deploy AI while remaining compliant with HIPAA, GDPR, and DORA. Explore our open-source foundation: Scalytics Community Edition.

Questions? Reach us on Slack or schedule a conversation.