MultiContext: Cross-Cluster Data Processing in Scalytics

Dr. Zoi Kaoudi

In this post, we introduce a new capability in Scalytics Connect: MultiContext. MultiContext extends ETL and data processing for organizations that operate across several sites and platforms. It enables data pipeline deployment across multiple locations while preserving data privacy and integrity.

Consider an organization with several departments, each running its own data processing engines, IT staff, data protection officers, and data engineers. Due to regulation and privacy policies, raw data cannot be centralized. Only aggregated data may be exported and processed further.

Departments A and B use Spark clusters with data in HDFS, CSV files, and a database accessible over JDBC. Department C uses a Flink cluster and another database for its workloads.

Scalytics Connect MultiContext Explained

Until now, this scenario was difficult to handle in a single pipeline. Scalytics Connect introduces MultiContext ETL data pipelines to address exactly this use case.

Current ETL systems struggle to integrate data that lives in multiple environments such as databases, data warehouses, data lakes, HDFS, or S3. MultiContext data pipeline processing is a next generation approach that queries and processes data directly inside its original environment, driven from a single program in one JVM. This avoids constant data movement and centralization.

MultiContext streamlines distributed data pipelines and supports near real time insights and operational decision making. It also improves data management by increasing agility, flexibility, and security, which makes it a core building block for next generation ETL platforms and federated execution.

Examples of MultiContext Processing

  • Retail:
    Combine real time sales data from point of sale systems with customer information stored in other systems, each processed in its own context. This supports immediate insight into customer behavior and performance by region or store.
  • Financial Services:
    Analyze transaction or portfolio data in a secure core banking system together with external reference or market data, without moving raw data into a single central cluster.


By avoiding lengthy data movement and multi step transformation chains, MultiContext lets pipelines that once took days deliver insights in minutes.

Use MultiContext in Scalytics Connect

The ScalyticsContext and MultiContextPlanBuilder APIs in Scalytics Connect allow developers to configure several execution contexts and then define one logical pipeline that spans them:

// Define one execution context per department
val ctx1 = new ScalyticsContext(conf1) // Department A: Spark over HDFS, CSV, JDBC
  .withPlugin(Java.basicPlugin())
  .withPlugin(Spark.basicPlugin())
  .withPlugin(JDBC.basicPlugin())
  .withTextFileSink("file:///path/to/out1")

val ctx2 = new ScalyticsContext(conf2) // Department B: Spark
  .withPlugin(Java.basicPlugin())
  .withPlugin(Spark.basicPlugin())
  .withPlugin(JDBC.basicPlugin())
  .withTextFileSink("hdfs:///path/to/out2")

val ctx3 = new ScalyticsContext(conf3) // Department C: Flink
  .withPlugin(Java.basicPlugin())
  .withPlugin(Flink.basicPlugin())
  .withPlugin(JDBC.basicPlugin())
  .withTextFileSink("hdfs:///path/to/out3")

Next, the MultiContextPlanBuilder describes the processing tasks across these contexts:

// Build multi context plan
val multi = new MultiContextPlanBuilder(List(ctx1, ctx2, ctx3))

// Read each department's customer table and, in every context,
// keep only records referencing 2022 (assumed to live in field 5)
val customers = multi
  .readTable(ctx1, customersDB1)
  .readTable(ctx2, customersDB2)
  .readTable(ctx3, customersDB3)
  .forEach(_.filter(r => r.getField(5).toString.contains("2022")))

// Read each department's sales file and parse and filter it in place
val sales = multi
  .readTextFile(ctx1, salesFile1)
  .readTextFile(ctx2, salesFile2)
  .readTextFile(ctx3, salesFile3)
  .forEach(
    _.map { line =>
        // Assumed CSV layout: date,customerId,amount
        val v = line.split(",")
        new Record(v(0), v(1), v(2).toDouble, 1) // trailing 1 is a count used for averaging
      }
      .filter(r => r.getField(0).toString.contains("2022")) // keep 2022 sales only
  )

val results = sales
  // Join each site's sales with its customers on the customer id
  // (assumed to be field 1 of a sale and field 0 of a customer),
  // keeping the sales-side record of every matching pair
  .combineEach(customers, (s, c) =>
    s.join(r => r.getField(1), c, r => r.getField(0)).map(_.field0))
  .reduceByKey(
    _.getField(1),
    (a, b) =>
      new Record(
        a.getField(0),
        a.getField(1),
        a.getDouble(2) + b.getDouble(2),
        a.getInt(3) + b.getInt(3)
      )
  )
  // Average sale amount per customer: total amount divided by count
  .map(r => new Record(r.getField(1), r.getDouble(2) / r.getInt(3)))
  .execute()

With a small amount of code, developers can execute in situ data processing across multiple sites. Installations at each site can be heterogeneous and may include different engines such as Spark and Flink. A single job can be issued to several Spark clusters from one JVM, which is not possible with native Spark alone because it does not allow more than one Spark context per JVM.
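To make the Record-level logic of the pipeline easier to follow, here is a minimal sketch of the same per-site aggregation written against plain Scala collections, with no Scalytics API involved. The CSV layout (date, customer id, amount) and the `SalesAverageSketch` name are assumptions for illustration only:

```scala
// Plain-Scala sketch of the per-site aggregation: parse sales lines,
// keep 2022 sales, and compute the average sale amount per customer.
object SalesAverageSketch {
  final case class Sale(date: String, customerId: String, amount: Double)

  def averagePerCustomer(lines: Seq[String]): Map[String, Double] =
    lines
      .map { line =>
        val v = line.split(",")
        Sale(v(0), v(1), v(2).toDouble)
      }
      .filter(_.date.contains("2022"))             // same role as the pipeline's filter
      .groupBy(_.customerId)                       // same role as reduceByKey's key
      .map { case (id, sales) =>
        id -> sales.map(_.amount).sum / sales.size // total amount / count
      }
}
```

In the real pipeline this computation runs inside each context against local data; only the aggregated averages leave a site.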

Conclusion

MultiContext is a practical step toward federated and in situ data processing. It allows organizations to use their existing distributed infrastructure, keep data local for privacy and compliance, and still define one logical pipeline that runs across all sites.

Scalytics Connect extends the work started with Apache Wayang into a product that data engineering teams can adopt directly. MultiContext is one of the core features that enables cross platform pipelines without data consolidation.

About Scalytics

Scalytics architects and troubleshoots mission-critical streaming, federated execution, and AI systems for scaling SMEs. When Kafka pipelines fall behind, SAP IDocs block processing, lakehouse sinks break, or AI pilots collapse under real load, we step in and make them run.

Our founding team created Apache Wayang (now an Apache Top-Level Project), the federated execution framework that orchestrates Spark, Flink, and TensorFlow where data lives and reduces ETL movement overhead.

We also invented and actively maintain KafScale, a Kafka-compatible, stateless data and large object streaming platform designed for Kubernetes and object storage backends such as S3. Elastic compute. No broker babysitting. No lock-in.

Our mission: Data stays in place. Compute comes to you. From data lakehouses to private AI deployment and distributed ML, all designed for security, compliance, and production resilience.

Questions? Join our open Slack community or schedule a consult.