Query optimization is a core component of any data management or analytics system. It determines how a query or task should be executed and produces the execution plan that the system will run. Traditionally, query optimization consists of three steps:
- enumerating possible execution plans,
- estimating the cost of each subplan to select the best one, and
- estimating the cardinalities of intermediate results, which directly influence cost predictions.
Recent work in data management has begun to use machine learning to improve these tasks. In this post, we focus on using ML to estimate the cost of subplans more accurately and efficiently.
In classical systems, cost models are built from mathematical formulas describing the cost of each operator. These formulas are then combined to estimate the runtime of a given query plan (see the toy sketch after the list below). Designing such a cost model in a federated environment, such as the one Scalytics Federated is built for, is exceptionally challenging and often results in suboptimal performance. There are several reasons:
- traditional optimizers assume linear behavior that does not reflect modern, distributed systems,
- they require statistics from all underlying platforms, which may not be available in federated settings, and
- they demand extensive fine-tuning to reflect real system behavior.
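To make the classical approach concrete, here is a minimal, purely illustrative sketch of formula-based costing. The operator set, constants, and formulas are all assumptions for illustration; they are not Scalytics' actual cost model.

```python
# Toy formula-based cost model: hand-written per-operator formulas,
# combined bottom-up over the plan. All constants and formulas here are
# illustrative placeholders, not values from any real system.

CPU_COST_PER_TUPLE = 1.0    # assumed relative CPU cost of one tuple
NET_COST_PER_TUPLE = 50.0   # assumed relative cost of shipping one tuple

def scan_cost(cardinality):
    return CPU_COST_PER_TUPLE * cardinality

def hash_join_cost(left_card, right_card, selectivity):
    # Simplified hash join: build side + probe side + output.
    output_card = left_card * right_card * selectivity
    return CPU_COST_PER_TUPLE * (left_card + right_card + output_card)

def transfer_cost(cardinality):
    # Moving an intermediate result between platforms in a federated plan.
    return NET_COST_PER_TUPLE * cardinality

# Estimated cost of scanning two tables on different platforms, shipping
# the smaller one, and joining them:
total = (scan_cost(1_000_000) + scan_cost(10_000)
         + transfer_cost(10_000)
         + hash_join_cost(1_000_000, 10_000, 1e-6))
```

Notice the linearity baked into every formula, and that every constant would need to be tuned per platform. Those are exactly the assumptions that break down in a federated setting.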
The plot below illustrates how a well-tuned cost model can improve performance by an order of magnitude, highlighting both the value and the difficulty of manual tuning.

Cost model comparison based on compute time: a well-tuned cost-based optimizer can improve federated execution performance by an order of magnitude, but developing such a model is tedious, brittle, and time-consuming.
To overcome these limitations, we explore replacing traditional cost formulas with an ML model that predicts the runtime of a (sub)plan [1]. Although appealing, this approach introduces two major challenges.
The first challenge arises from the enumeration process, which generates thousands or even millions of candidate plans. Each plan must be evaluated by the ML model, yet query plans are structured objects, while ML models require numerical feature vectors. Transforming every enumerated plan into a vector introduces substantial overhead, especially because optimization must run within milliseconds at query time.
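To see why this transformation is costly, consider a simple, hypothetical featurization: one-hot encode each operator plus a log-scaled cardinality estimate, flattened in a fixed traversal order. Real encoders, such as the one in [1], are richer; the operator set and plan size below are assumptions.

```python
import numpy as np

OPERATORS = ["scan", "filter", "map", "join", "reduce"]  # assumed operator set
MAX_OPS = 16              # assumed maximum plan size; shorter plans are zero-padded
SLOT = len(OPERATORS) + 1  # one-hot operator type + one cardinality feature

def encode_operator(op_name, est_cardinality):
    vec = np.zeros(SLOT)
    vec[OPERATORS.index(op_name)] = 1.0
    vec[-1] = np.log1p(est_cardinality)  # compress the cardinality range
    return vec

def encode_plan(plan):
    """plan: list of (op_name, est_cardinality) pairs in pre-order."""
    out = np.zeros(MAX_OPS * SLOT)
    for i, (op, card) in enumerate(plan[:MAX_OPS]):
        out[i * SLOT:(i + 1) * SLOT] = encode_operator(op, card)
    return out

vec = encode_plan([("scan", 1_000_000), ("filter", 200_000), ("reduce", 1)])
```

Doing this once is cheap; doing it for every one of millions of enumerated plans, inside a millisecond optimization budget, is not.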
The second challenge is the need for training data: query plans and their actual runtimes. As discussed in earlier work, our DataFarm methodology generates such data efficiently. With DataFarm, we produced high-quality training data in four hours, compared with forty hours required by a naive exhaustive approach.
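With labeled plans in hand, the learned cost model itself can be as simple as an off-the-shelf regressor. The sketch below uses a random forest on synthetic stand-in data; the actual model and features used in [1] may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Stand-ins for DataFarm output: plan feature vectors (as produced by an
# encoder like encode_plan above) paired with observed runtimes in ms.
# The data here is synthetic, purely to make the sketch runnable.
X_train = rng.random((5_000, 16 * 6))  # 5,000 labeled plan vectors
y_train = rng.random(5_000) * 10_000   # their measured runtimes (ms)

cost_model = RandomForestRegressor(n_estimators=100, random_state=0)
cost_model.fit(X_train, y_train)

# At optimization time, this prediction replaces the hand-written formulas:
predicted_runtime_ms = cost_model.predict(X_train[:1])
```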
To address the first challenge, we redesigned the enumeration algorithms to operate directly on vectors. This vector-based plan enumeration removes the need for repeated plan-to-vector transformations and enables the optimizer to use primitive operations and SIMD instructions to evaluate multiple candidate vectors in parallel. The result is a substantial reduction in optimization time for ML-based optimizers.
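The snippet below sketches two vector primitives in this spirit, under the same hypothetical encoding as above: expanding the platform choices for one operator across all candidates at once, and pruning a whole candidate matrix with a single batched model call. numpy dispatches both to SIMD-friendly kernels; the platform encoding and pruning threshold are assumptions, not the actual strategy.

```python
import numpy as np

# Two hypothetical vector primitives. Assume each operator slot in the plan
# vector carries a numeric "platform choice" feature at a known column.

def expand_platform_choices(candidates, platform_col, n_platforms):
    """From each candidate row, create one row per platform choice for the
    operator whose platform feature sits at `platform_col`."""
    expanded = np.repeat(candidates, n_platforms, axis=0)
    expanded[:, platform_col] = np.tile(np.arange(n_platforms), len(candidates))
    return expanded

def prune_candidates(candidates, cost_model, keep=64):
    """Score all candidates with ONE batched model call and keep the `keep`
    cheapest rows. `keep` is an assumed heuristic."""
    predicted = cost_model.predict(candidates)  # vectorized, SIMD-friendly
    order = np.argsort(predicted)[:keep]
    return candidates[order], predicted[order]
```

Because candidates never leave vector form, there is no per-plan object construction and no repeated featurization inside the enumeration loop.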
Vector-based plan enumeration can significantly accelerate query optimization in ML-driven optimizers.
With a library of vector primitives, we construct an efficient vector-based enumeration strategy. The figure below shows the general architecture of this ML-based optimizer: the logical plan is transformed once into a vector representation; the optimizer then performs vectorized enumeration and pruning, consulting the ML model to discard inefficient candidates; and the cheapest vector is selected and translated back into an executable plan.

How vector-based plans work
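Putting the pieces together, here is a minimal sketch of that loop, reusing the hypothetical helpers from the earlier snippets (encode_plan, expand_platform_choices, prune_candidates) plus an assumed decode_plan that turns the winning vector back into an executable plan:

```python
import numpy as np

def optimize(logical_plan, cost_model, platform_cols, n_platforms=3):
    """logical_plan: list of (op_name, est_cardinality) pairs; platform_cols:
    the columns holding each operator's platform-choice feature (assumed)."""
    base = encode_plan(logical_plan)          # transform ONCE into a vector
    candidates = base[None, :]
    for col in platform_cols:                 # vectorized enumeration...
        candidates = expand_platform_choices(candidates, col, n_platforms)
        candidates, _ = prune_candidates(candidates, cost_model)  # ...and ML pruning
    best = candidates[np.argmin(cost_model.predict(candidates))]
    return decode_plan(best)                  # back to an executable plan
```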
A learning-based optimizer can deliver considerable performance gains. In our preliminary results, for example, an ML-based optimizer selects better execution plans for k-means and achieves up to 7× faster runtimes than a highly tuned traditional cost-based optimizer.
Benchmarking vectorized cost plans after learning
We are currently integrating this architecture into the Scalytics Federated optimizer, which builds on Apache Wayang. The result will be faster query optimization, better execution plans, and faster runtimes for real federated workloads. Stay tuned!
References
[1] Zoi Kaoudi, Jorge-Arnulfo Quiané-Ruiz, Bertty Contreras-Rojas, Rodrigo Pardo-Meza, Anis Troudi, Sanjay Chawla: ML-based Cross-Platform Query Optimization. ICDE 2020: 1489-1500.
About Scalytics
Scalytics Connect provides federated data processing across Spark, Flink, PostgreSQL, and cloud-native engines through a single abstraction layer. Our cost-based optimizer selects the right engine for each operation, reducing processing time while eliminating vendor lock-in.
Scalytics Copilot extends this foundation with private AI deployment: running LLMs, RAG pipelines, and ML workloads entirely within your security perimeter. Data stays where it lives. Models train where data resides. No extraction, no exposure, no third-party API dependencies.
For organizations in healthcare, finance, and government, this architecture isn't optional: it's how you deploy AI while remaining compliant with HIPAA, GDPR, and DORA.
Explore our open-source foundation: Scalytics Community Edition
Questions? Reach us on Slack or schedule a conversation.
