MultiContext: Cross-Cluster Data Processing in Scalytics

Dr. Zoi Kaoudi

In this post, we introduce a new capability in Scalytics Connect: MultiContext. MultiContext extends ETL and data processing to organizations that operate across several sites and platforms, allowing a single data pipeline to be deployed across multiple locations while preserving data privacy and integrity.

Consider an organization with several departments, each running its own data processing engines, IT staff, data protection officers, and data engineers. Due to regulation and privacy policies, raw data cannot be centralized. Only aggregated data may be exported and processed further.

Departments A and B use Spark clusters with data in HDFS, CSV files, and a database accessible over JDBC. Department C uses a Flink cluster and another database for its workloads.

Scalytics Connect MultiContext Explained

Until now, this scenario was difficult to handle in a single pipeline. Scalytics Connect introduces MultiContext ETL data pipelines to address exactly this use case.

Current ETL systems struggle to integrate data that lives in multiple environments, such as databases, data warehouses, data lakes, HDFS, or S3. MultiContext data pipeline processing is a next-generation approach that queries and processes data directly inside its original environment, all driven from a single program in one JVM. This avoids constant data movement and centralization.
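
To make this concrete, here is a minimal sketch in the style of the ScalyticsContext and MultiContextPlanBuilder APIs introduced later in this post. Configurations, plugins, and paths are illustrative assumptions, not a prescribed setup:

// Two environments, one logical pipeline, one JVM. Each context keeps
// its data and computation local and writes results to its own sink.
val onPrem = new ScalyticsContext(confOnPrem)
  .withPlugin(Java.basicPlugin())
  .withPlugin(Spark.basicPlugin())
  .withTextFileSink("hdfs:///results/on-prem")

val cloud = new ScalyticsContext(confCloud)
  .withPlugin(Java.basicPlugin())
  .withPlugin(Flink.basicPlugin())
  .withTextFileSink("file:///results/cloud")

new MultiContextPlanBuilder(List(onPrem, cloud))
  .readTextFile(onPrem, "hdfs:///data/events.csv")
  .readTextFile(cloud, "file:///data/events.csv")
  .forEach(_.filter(_.nonEmpty)) // same transformation, run in place in each context
  .execute()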

MultiContext streamlines distributed data pipelines and supports near-real-time insights and operational decision making. It also improves data management by increasing agility, flexibility, and security, which makes it a core building block for next-generation ETL platforms and federated execution.

Examples of MultiContext Processing

  • Retail:
    Combine real-time sales data from point-of-sale systems with customer information stored in other systems, each processed in its own context. This supports immediate insight into customer behavior and performance by region or store; a rough code sketch follows this list.
  • Financial Services:
    Analyze transaction or portfolio data in a secure core banking system together with external reference or market data, without moving raw data into a single central cluster.
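
As a rough sketch of the retail scenario, using the API described in the next section (contexts, table names, and field positions are illustrative assumptions):

// Per-region point-of-sale data stays in its own regional context.
// Assumed sales record layout: (storeId, productId, amount).
new MultiContextPlanBuilder(List(eastCtx, westCtx))
  .readTable(eastCtx, posSalesEast)
  .readTable(westCtx, posSalesWest)
  .forEach(_.map(r => new Record(r.getField(0), r.getDouble(2)))) // (storeId, amount)
  .forEach(_.reduceByKey(
    _.getField(0).toString,
    (a, b) => new Record(a.getField(0), a.getDouble(1) + b.getDouble(1)) // revenue per store
  ))
  .execute()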


By avoiding lengthy data movement and multi-step transformation chains, MultiContext turns long-running pipelines into ones that deliver insights in minutes instead of days.

Use MultiContext in Scalytics Connect

The ScalyticsContext and MultiContextPlanBuilder APIs in Scalytics Connect allow developers to configure several execution contexts and then define one logical pipeline that spans them:

// Define one execution context per department
// Department A: Spark, with local files, HDFS, and JDBC access
val ctx1 = new ScalyticsContext(conf1)
  .withPlugin(Java.basicPlugin())
  .withPlugin(Spark.basicPlugin())
  .withPlugin(JDBC.basicPlugin())
  .withTextFileSink("file:///path/to/out1")

// Department B: also Spark, writing its results to HDFS
val ctx2 = new ScalyticsContext(conf2)
  .withPlugin(Java.basicPlugin())
  .withPlugin(Spark.basicPlugin())
  .withPlugin(JDBC.basicPlugin())
  .withTextFileSink("hdfs:///path/to/out2")

// Department C: Flink
val ctx3 = new ScalyticsContext(conf3)
  .withPlugin(Java.basicPlugin())
  .withPlugin(Flink.basicPlugin())
  .withPlugin(JDBC.basicPlugin())
  .withTextFileSink("hdfs:///path/to/out3")

Next, the MultiContextPlanBuilder describes the processing tasks across these contexts:

// Build multi context plan
val multi = new MultiContextPlanBuilder(List(ctx1, ctx2, ctx3))

// Read the customers table in each context and filter it in place
// (field 5 is assumed to hold a date)
val customers = multi
  .readTable(ctx1, customersDB1)
  .readTable(ctx2, customersDB2)
  .readTable(ctx3, customersDB3)
  .forEach(_.filter(r => r.getField(5).toString.contains("2022")))

// Read and parse the sales files in each context.
// Each line is assumed to be CSV of the form: date,customerId,amount
val sales = multi
  .readTextFile(ctx1, salesFile1)
  .readTextFile(ctx2, salesFile2)
  .readTextFile(ctx3, salesFile3)
  .forEach(
    _.map { line =>
        val v = line.split(",")
        new Record(v(0), v(1), v(2).toDouble, 1) // trailing 1 is a count for averaging
      }
      .filter(r => r.getField(0).toString.contains("2022"))
  )

// Join sales with customers in each context, then aggregate
val results = sales
  .combineEach(customers, (s, c) =>
    // Join on the customer id (assumed: field 1 of a sales record,
    // field 0 of a customer record) and keep the sales side of each pair
    s.join(_.getField(1).toString, c, _.getField(0).toString)
      .map(_.field0))
  .forEach(
    _.reduceByKey(
      _.getField(1).toString,
      (a, b) =>
        new Record(
          a.getField(0),
          a.getField(1),
          a.getDouble(2) + b.getDouble(2), // total amount
          a.getInt(3) + b.getInt(3)        // total count
        )
    )
    // average sale amount per customer
    .map(r => new Record(r.getField(1), r.getDouble(2) / r.getInt(3)))
  )
  .execute()
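
When execute() runs, each context executes its share of the plan locally and writes its partial result to the text file sink configured for it earlier (out1, out2, and out3), so raw data never has to leave its site.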

With a small amount of code, developers can execute in situ data processing across multiple sites. The installations at each site can be heterogeneous and may mix different engines such as Spark and Flink. Moreover, a single job can be issued to several Spark clusters from one JVM, which native Spark alone cannot do, since it allows only one SparkContext per JVM.
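
For contrast, here is a minimal sketch of the restriction in plain Spark (cluster URLs are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Plain Spark enforces a single active SparkContext per JVM
val scA = new SparkContext(new SparkConf().setMaster("spark://site-a:7077").setAppName("jobA"))
val scB = new SparkContext(new SparkConf().setMaster("spark://site-b:7077").setAppName("jobB"))
// ^ fails: only one SparkContext may run per JVM (see SPARK-2243)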

Conclusion

MultiContext is a practical step toward federated and in situ data processing. It allows organizations to use their existing distributed infrastructure, keep data local for privacy and compliance, and still define one logical pipeline that runs across all sites.

Scalytics Connect extends the work started with Apache Wayang into a product that data engineering teams can adopt directly. MultiContext is one of the core features that enable cross-platform pipelines without data consolidation.

About Scalytics

Scalytics builds on Apache Wayang, the cross-platform data processing framework created by our founding team and now an Apache Top-Level Project. Where traditional platforms require moving data to centralized infrastructure, Scalytics brings compute to your data—enabling AI and analytics across distributed sources without violating compliance boundaries.

Scalytics Federated provides federated data processing across Spark, Flink, PostgreSQL, and cloud-native engines through a single abstraction layer. Our cost-based optimizer selects the right engine for each operation, reducing processing time while eliminating vendor lock-in.

Scalytics Copilot extends this foundation with private AI deployment: running LLMs, RAG pipelines, and ML workloads entirely within your security perimeter. Data stays where it lives. Models train where data resides. No extraction, no exposure, no third-party API dependencies.

For organizations in healthcare, finance, and government, this architecture isn't optional: it's how you deploy AI while remaining compliant with HIPAA, GDPR, and DORA. Explore our open-source foundation: Scalytics Community Edition

Questions? Reach us on Slack or schedule a conversation.