MultiContext: Cross-Cluster Data Processing in Scalytics

Dr. Zoi Kaoudi

In this post, we introduce a new capability in Scalytics Connect: MultiContext. MultiContext extends ETL and data processing to organizations that operate across several sites and platforms, allowing a single data pipeline to be deployed across multiple locations while preserving data privacy and integrity.

Consider an organization with several departments, each running its own data processing engines, IT staff, data protection officers, and data engineers. Due to regulation and privacy policies, raw data cannot be centralized. Only aggregated data may be exported and processed further.

Departments A and B use Spark clusters with data in HDFS, CSV files, and a database accessible over JDBC. Department C uses a Flink cluster and another database for its workloads.

Scalytics Connect MultiContext Explained

Until now, this scenario was difficult to handle in a single pipeline. Scalytics Connect introduces MultiContext ETL data pipelines to address exactly this use case.

Current ETL systems struggle to integrate data that lives in multiple environments, such as databases, data warehouses, data lakes, HDFS, or S3. MultiContext data pipeline processing is a next-generation approach that queries and processes data directly inside its original environment, all driven from a single program in one JVM. This avoids constant data movement and centralization.
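
To make this concrete, here is a minimal sketch in the style of the ScalyticsContext and MultiContextPlanBuilder APIs introduced later in this post. Configurations, plugins, and paths are illustrative assumptions, not a prescribed setup:

// Two environments, one logical pipeline, one JVM. Each context keeps
// its data and computation local and writes results to its own sink.
val onPrem = new ScalyticsContext(confOnPrem)
  .withPlugin(Java.basicPlugin())
  .withPlugin(Spark.basicPlugin())
  .withTextFileSink("hdfs:///results/on-prem")

val cloud = new ScalyticsContext(confCloud)
  .withPlugin(Java.basicPlugin())
  .withPlugin(Flink.basicPlugin())
  .withTextFileSink("file:///results/cloud")

new MultiContextPlanBuilder(List(onPrem, cloud))
  .readTextFile(onPrem, "hdfs:///data/events.csv")
  .readTextFile(cloud, "file:///data/events.csv")
  .forEach(_.filter(_.nonEmpty)) // same transformation, run in place in each context
  .execute()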

MultiContext streamlines distributed data pipelines and supports near-real-time insights and operational decision making. It also improves data management by increasing agility, flexibility, and security, which makes it a core building block for next-generation ETL platforms and federated execution.

Examples of MultiContext Processing

  • Retail:
    Combine real-time sales data from point-of-sale systems with customer information stored in other systems, each processed in its own context. This supports immediate insight into customer behavior and performance by region or store; a rough code sketch follows this list.
  • Financial Services:
    Analyze transaction or portfolio data in a secure core banking system together with external reference or market data, without moving raw data into a single central cluster.
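
As a rough sketch of the retail scenario, using the API described in the next section (contexts, table names, and field positions are illustrative assumptions):

// Per-region point-of-sale data stays in its own regional context.
// Assumed sales record layout: (storeId, productId, amount).
new MultiContextPlanBuilder(List(eastCtx, westCtx))
  .readTable(eastCtx, posSalesEast)
  .readTable(westCtx, posSalesWest)
  .forEach(_.map(r => new Record(r.getField(0), r.getDouble(2)))) // (storeId, amount)
  .forEach(_.reduceByKey(
    _.getField(0).toString,
    (a, b) => new Record(a.getField(0), a.getDouble(1) + b.getDouble(1)) // revenue per store
  ))
  .execute()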


By avoiding lengthy data movement and multi-step transformation chains, MultiContext turns long-running pipelines into ones that deliver insights in minutes instead of days.

Use MultiContext in Scalytics Connect

The ScalyticsContext and MultiContextPlanBuilder APIs in Scalytics Connect allow developers to configure several execution contexts and then define one logical pipeline that spans them:

// Define one execution context per department
// Department A: Spark, with local files, HDFS, and JDBC access
val ctx1 = new ScalyticsContext(conf1)
  .withPlugin(Java.basicPlugin())
  .withPlugin(Spark.basicPlugin())
  .withPlugin(JDBC.basicPlugin())
  .withTextFileSink("file:///path/to/out1")

// Department B: also Spark, writing its results to HDFS
val ctx2 = new ScalyticsContext(conf2)
  .withPlugin(Java.basicPlugin())
  .withPlugin(Spark.basicPlugin())
  .withPlugin(JDBC.basicPlugin())
  .withTextFileSink("hdfs:///path/to/out2")

// Department C: Flink
val ctx3 = new ScalyticsContext(conf3)
  .withPlugin(Java.basicPlugin())
  .withPlugin(Flink.basicPlugin())
  .withPlugin(JDBC.basicPlugin())
  .withTextFileSink("hdfs:///path/to/out3")

Next, the MultiContextPlanBuilder describes the processing tasks across these contexts:

// Build multi context plan
val multi = new MultiContextPlanBuilder(List(ctx1, ctx2, ctx3))

// Read the customers table in each context and filter it in place
// (field 5 is assumed to hold a date)
val customers = multi
  .readTable(ctx1, customersDB1)
  .readTable(ctx2, customersDB2)
  .readTable(ctx3, customersDB3)
  .forEach(_.filter(r => r.getField(5).toString.contains("2022")))

// Read and parse the sales files in each context.
// Each line is assumed to be CSV of the form: date,customerId,amount
val sales = multi
  .readTextFile(ctx1, salesFile1)
  .readTextFile(ctx2, salesFile2)
  .readTextFile(ctx3, salesFile3)
  .forEach(
    _.map { line =>
        val v = line.split(",")
        new Record(v(0), v(1), v(2).toDouble, 1) // trailing 1 is a count for averaging
      }
      .filter(r => r.getField(0).toString.contains("2022"))
  )

// Join sales with customers in each context, then aggregate
val results = sales
  .combineEach(customers, (s, c) =>
    // Join on the customer id (assumed: field 1 of a sales record,
    // field 0 of a customer record) and keep the sales side of each pair
    s.join(_.getField(1).toString, c, _.getField(0).toString)
      .map(_.field0))
  .forEach(
    _.reduceByKey(
      _.getField(1).toString,
      (a, b) =>
        new Record(
          a.getField(0),
          a.getField(1),
          a.getDouble(2) + b.getDouble(2), // total amount
          a.getInt(3) + b.getInt(3)        // total count
        )
    )
    // average sale amount per customer
    .map(r => new Record(r.getField(1), r.getDouble(2) / r.getInt(3)))
  )
  .execute()
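
When execute() runs, each context executes its share of the plan locally and writes its partial result to the text file sink configured for it earlier (out1, out2, and out3), so raw data never has to leave its site.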

With a small amount of code, developers can execute in situ data processing across multiple sites. The installations at each site can be heterogeneous and may mix different engines such as Spark and Flink. Moreover, a single job can be issued to several Spark clusters from one JVM, which native Spark alone cannot do, since it allows only one SparkContext per JVM.
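
For contrast, here is a minimal sketch of the restriction in plain Spark (cluster URLs are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Plain Spark enforces a single active SparkContext per JVM
val scA = new SparkContext(new SparkConf().setMaster("spark://site-a:7077").setAppName("jobA"))
val scB = new SparkContext(new SparkConf().setMaster("spark://site-b:7077").setAppName("jobB"))
// ^ fails: only one SparkContext may run per JVM (see SPARK-2243)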

Conclusion

MultiContext is a practical step toward federated and in situ data processing. It allows organizations to use their existing distributed infrastructure, keep data local for privacy and compliance, and still define one logical pipeline that runs across all sites.

Scalytics Connect extends the work started with Apache Wayang into a product that data engineering teams can adopt directly. MultiContext is one of the core features that enable cross-platform pipelines without data consolidation.

About Scalytics

Scalytics builds on Apache Wayang, the cross-platform data processing framework created by our founding team and now an Apache Top-Level Project. Where traditional platforms require moving data to centralized infrastructure, Scalytics brings compute to your data—enabling AI and analytics across distributed sources without violating compliance boundaries.

Scalytics Federated provides federated data processing across Spark, Flink, PostgreSQL, and cloud-native engines through a single abstraction layer. Our cost-based optimizer selects the right engine for each operation, reducing processing time while eliminating vendor lock-in.

Scalytics Copilot extends this foundation with private AI deployment: running LLMs, RAG pipelines, and ML workloads entirely within your security perimeter. Data stays where it lives. Models train where data resides. No extraction, no exposure, no third-party API dependencies.

For organizations in healthcare, finance, and government, this architecture isn't optional: it's how you deploy AI while remaining compliant with HIPAA, GDPR, and DORA. Explore our open-source foundation: Scalytics Community Edition

Questions? Reach us on Slack or schedule a conversation.