Scalytics FAQ: Data Platform Questions Answered

Dr. Zoi Kaoudi

This article explains how Scalytics Federated handles regulated data, distributed data processing, access delegation, and performance optimization across heterogeneous systems. It highlights how federated execution simplifies complex data scenarios, reduces operational burden, and supports AI deployment without centralizing data. Organizations use Scalytics to work across dispersed data silos, unify incompatible technologies, and enforce strict governance rules.

How Scalytics Solves Data Regulation Challenges

Data regulation complexity continues to increase as organizations store information across countries, platforms, and governance domains. Scalytics addresses these challenges by processing data where it resides. This eliminates unnecessary movement of regulated data and supports compliance with GDPR, HIPAA, CCPA, and jurisdiction-specific rules.

A common question: can a single workload span regulated data in multiple jurisdictions without moving the raw data?

Example scenario:
An organization holds US customer data on a Spark cluster in New York and EU customer data in a SQL data warehouse in Paris. A data team must compute late fees by account size and country while maintaining full GDPR compliance.

How Scalytics executes this workload

The data engineer writes a single pipeline using Wayang operators, referencing the two data sources in a configuration file. Scalytics Federated then decomposes the pipeline into local execution plans. Each environment processes its portion of the data internally and produces compliant intermediate results.

Raw data never leaves its jurisdiction. Only aggregated or anonymized intermediate outputs are exchanged.
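The pattern above can be sketched in plain Python. This is an illustrative stand-in only, not the Scalytics or Wayang API; names such as `local_aggregate` are hypothetical, and the row layout is invented for the example.

```python
# Illustrative sketch of per-jurisdiction aggregation; not the Scalytics API.
def local_aggregate(records):
    """Runs inside each jurisdiction: reduce raw rows to per-group totals."""
    totals = {}
    for account_size, country, late_fee in records:
        key = (account_size, country)
        totals[key] = totals.get(key, 0.0) + late_fee
    return totals  # aggregated, non-identifiable intermediate result

# Raw rows stay local to each site; only the `totals` dicts would ever
# cross the wire as compliant intermediate outputs.
us_raw = [("large", "US", 12.5), ("small", "US", 3.0), ("large", "US", 7.5)]
eu_raw = [("large", "FR", 9.0), ("small", "DE", 4.0)]

us_intermediate = local_aggregate(us_raw)  # computed on the Spark cluster
eu_intermediate = local_aggregate(eu_raw)  # computed in the Paris warehouse
```

Each site produces a small summary keyed by account size and country; the raw customer rows never appear outside their region.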

Three compliant ways to merge results

Method 1: Remote Federated Merge
Intermediate results from each location are sent to a designated processing environment, such as one operated by the central data team. Scalytics merges these intermediate outputs using its execution engine and produces the summary table. No raw data crosses borders.

Method 2: Local Merge in Region A
Intermediate results from New York are transmitted to Paris and merged with locally processed results. This remains GDPR-compliant because only non-identifiable, aggregated data is exchanged.

Method 3: Local Merge in Region B
Intermediate results from Paris are transmitted to New York and merged there. This produces a consolidated table without moving raw data out of Europe.

Across all methods, Scalytics ensures that raw data remains in its original location and that all processing steps adhere to strict regulatory controls.
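The three methods differ only in where the merge runs, so a single merge routine illustrates all of them. This is a conceptual sketch, not the Scalytics engine; `merge_intermediates` is a hypothetical name.

```python
# Illustrative sketch: merging aggregated intermediates works the same
# whether it runs in a neutral site (Method 1), Paris (2), or New York (3).
def merge_intermediates(*partials):
    """Combine per-region aggregates; inputs contain no raw rows."""
    merged = {}
    for partial in partials:
        for key, value in partial.items():
            merged[key] = merged.get(key, 0.0) + value
    return merged

ny_partial = {("large", "US"): 20.0, ("small", "US"): 3.0}
paris_partial = {("large", "FR"): 9.0, ("small", "DE"): 4.0}

summary = merge_intermediates(ny_partial, paris_partial)
```

Because the inputs are already aggregated, the choice of merge location is a performance and cost decision rather than a compliance one.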

Scalytics does not require deploying additional third party compute engines into each data pool. The platform orchestrates processing using existing systems and avoids operational disruption.

How Scalytics Ensures Data Access Controls

Organizations often need to restrict access to sensitive datasets based on internal roles, project boundaries, or regulatory requirements.

Scalytics enforces strict access delegation. Users can only execute pipelines on datasets they have been granted permissions for. Authentication is handled through Scalytics Federated Studio, where administrators create working groups and assign access rights that mirror the organization's existing governance model.

This ensures that federated execution does not bypass internal controls. People can only process the data they are authorized to work with, even when the computation spans multiple systems or regions.
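The delegation rule above boils down to a subset check: a pipeline may run only if every dataset it references is granted through the user's working groups. The sketch below is hypothetical and does not reflect the Scalytics Federated Studio API; all names are invented.

```python
# Hypothetical sketch of access delegation via working groups.
WORKING_GROUPS = {
    "eu-analytics": {"datasets": {"paris_warehouse.customers"}},
    "us-analytics": {"datasets": {"ny_spark.customers"}},
}

def authorized(user_groups, pipeline_datasets):
    """A pipeline runs only if every referenced dataset is granted."""
    granted = set()
    for group in user_groups:
        granted |= WORKING_GROUPS.get(group, {}).get("datasets", set())
    return pipeline_datasets <= granted  # subset check over grants

# A user in only the EU group cannot launch a cross-region pipeline:
cross_region = {"paris_warehouse.customers", "ny_spark.customers"}
```

The same check applies regardless of where the computation physically executes, which is what keeps federated execution from bypassing internal controls.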

Do We Need a Master Data Management Layer?

Scalytics connects directly to existing data systems and does not require a dedicated Master Data Management layer. It integrates with current governance, catalog, and metadata systems while providing a unified execution layer across databases, warehouses, lakes, and distributed compute environments.

The platform does not replace enterprise MDM systems. It works alongside them and provides federated processing capabilities without duplicating data or creating new storage requirements.

How Scalytics Optimizes Data Processing Performance

Performance varies significantly when executing workloads across Spark, SQL engines, object stores, or custom platforms. Scalytics includes a cost-based optimizer that evaluates the most efficient execution strategy based on runtime, data distribution, hardware availability, and economic cost.

Example

If merging large intermediate results in a neutral third location would cause excessive memory usage or long processing times, the optimizer evaluates alternatives and selects a better location automatically. This includes prioritizing regions with lower compute cost, faster hardware, or more efficient execution engines.

The optimizer does not rely on manual tuning. It selects execution strategies that can outperform single-platform approaches and keeps performance predictable across federated environments.
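Cost-based site selection can be sketched as minimizing a simple cost model over candidate merge locations. This is a toy illustration: the weights, prices, and site names are invented and do not reflect the Scalytics optimizer's internals.

```python
# Toy sketch of cost-based merge-site selection; all numbers are invented.
CANDIDATE_SITES = [
    {"name": "neutral-dc", "transfer_cost_per_gb": 0.09, "compute_cost": 5.0},
    {"name": "paris",      "transfer_cost_per_gb": 0.02, "compute_cost": 4.0},
    {"name": "new-york",   "transfer_cost_per_gb": 0.05, "compute_cost": 2.0},
]

def estimate_cost(site, gb_in):
    # cost = network transfer for the inbound intermediates + compute price
    return gb_in * site["transfer_cost_per_gb"] + site["compute_cost"]

def pick_merge_site(gb_in):
    """Choose the candidate site with the lowest estimated total cost."""
    return min(CANDIDATE_SITES, key=lambda s: estimate_cost(s, gb_in))

best = pick_merge_site(gb_in=100)
```

Note how the answer shifts with data volume: for large intermediates the transfer term dominates, while for small ones cheap compute wins.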

How Much Effort Is Needed To Start With Scalytics?

Scalytics supports standard SQL and provides APIs for Java, Scala, and Python. Data teams familiar with platforms such as Apache Spark ramp up quickly because the processing model follows the same declarative principles.
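The declarative shape those teams already know is a filter-map-reduce chain. The snippet below shows that shape in plain Python builtins; it is not the Scalytics or Wayang Python API, and the data is invented.

```python
# The declarative filter -> map -> reduce shape, in plain Python builtins.
from functools import reduce

rows = [("acme", 120.0), ("acme", 30.0), ("globex", 45.0)]

late = filter(lambda r: r[1] > 40.0, rows)      # keep fees above threshold
fees = map(lambda r: r[1], late)                # project the fee column
total = reduce(lambda a, b: a + b, fees, 0.0)   # aggregate to one value
```

A Spark or Wayang pipeline expresses the same intent with its own operators, which is why the ramp-up is short.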

Scalytics Federated Studio also offers a low code interface where teams build pipelines using visual operators. This shortens onboarding time and reduces the complexity of working with distributed systems.

About Scalytics

Scalytics architects and troubleshoots mission-critical streaming, federated execution, and AI systems for scaling SMEs. When Kafka pipelines fall behind, SAP IDocs block processing, lakehouse sinks break, or AI pilots collapse under real load, we step in and make them run.

Our founding team created Apache Wayang (now an Apache Top-Level Project), the federated execution framework that orchestrates Spark, Flink, and TensorFlow where data lives and reduces ETL movement overhead.

We also built and actively maintain KafScale, a Kafka-compatible, stateless streaming system for data and large objects, designed for Kubernetes and object-storage (S3) backends. Elastic compute. No broker babysitting. No lock-in.

Our mission: Data stays in place. Compute comes to you. From data lakehouses to private AI deployment and distributed ML, all designed for security, compliance, and production resilience.

Questions? Join our open Slack community or schedule a consult.