Summary: In this blog post, we answer key questions about the data processing capabilities of Scalytics, GDPR compliance, data access delegation and performance optimization. Learn how Scalytics can simplify complex data scenarios and empower your organization with effective data management and AI deployment. Explore the challenges of data regulation in today’s AI-driven world, and how Scalytics solves the problems of dispersed data silos and unifies different, incompatible data technologies.
How Does Scalytics Solve Data Regulation Challenges?
We often get asked how we tackle the rising challenge of data regulations, especially in an era where AI is gaining more and more importance and data is scattered across various silos and technologies, making data management increasingly complex. Questions like this:
We have US customer data on a Spark cloud in NYC, and EU customer data in an on-premises SQL data warehouse in Paris. The requirement is to find late fees charged by customer account size and country. How does Blossom process data to generate a summary table while meeting GDPR compliance?
Scalytics inherently offers data compliance (GDPR, HIPAA, CCPA, etc.) via its data federation technology. With Scalytics, the data engineer writes a single Wayang job composed of the selection, projection, and aggregation statements required for the query. This job simply declares the paths to the two different sources in a configuration file. Then, Scalytics creates two independent Wayang jobs and ships them to the two premises. Each platform, i.e., the Spark cluster in NYC and the SQL data warehouse in Paris, executes the query.
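To make this more concrete, here is a rough sketch of what such a single job could look like when written against the Apache Wayang Java API that powers Scalytics. The source URL, record layout, and field positions are illustrative placeholders only; in a real deployment, the two premises are resolved from the configuration file mentioned above rather than hard-coded.

```java
import java.util.Collection;

import org.apache.wayang.api.JavaPlanBuilder;
import org.apache.wayang.basic.data.Tuple2;
import org.apache.wayang.core.api.Configuration;
import org.apache.wayang.core.api.WayangContext;
import org.apache.wayang.java.Java;
import org.apache.wayang.spark.Spark;

public class LateFeeSummary {

    public static void main(String[] args) {
        // Register the execution platforms the optimizer may use; in the federated
        // setup, the NYC Spark cluster and the Paris warehouse are declared in the
        // external configuration file rather than hard-coded here.
        WayangContext context = new WayangContext(new Configuration())
                .withPlugin(Java.basicPlugin())
                .withPlugin(Spark.basicPlugin());

        JavaPlanBuilder plan = new JavaPlanBuilder(context)
                .withJobName("late-fees-by-account-size-and-country")
                .withUdfJarOf(LateFeeSummary.class);

        // One declarative pipeline: selection (late fees only), projection
        // (grouping key plus fee), and aggregation (sum per key). Scalytics
        // splits this into two sub-jobs, one per premise, at planning time.
        // The "scalytics://customers" URL and the CSV layout are placeholders.
        Collection<Tuple2<String, Float>> summary = plan
                .readTextFile("scalytics://customers")              // e.g. country,accountSize,lateFee
                .withName("read customer records")
                .map(line -> line.split(","))
                .withName("parse record")
                .filter(fields -> Float.parseFloat(fields[2]) > 0f) // selection: only charged late fees
                .withName("keep charged late fees")
                .map(fields -> new Tuple2<>(fields[0] + "/" + fields[1],
                                            Float.parseFloat(fields[2])))
                .withName("project (country/accountSize, fee)")
                .reduceByKey(Tuple2::getField0,
                             (a, b) -> new Tuple2<>(a.getField0(), a.getField1() + b.getField1()))
                .withName("sum fees per group")
                .collect();

        summary.forEach(row -> System.out.println(row.getField0() + "\t" + row.getField1()));
    }
}
```

Note that the pipeline itself never references NYC or Paris: the split into two location-bound sub-jobs happens entirely at planning time, driven by the source declarations in the configuration file.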
There are three ways to merge the results and generate the summary table, all of which are data-regulation compliant. Use the method that best fits your use case.
Method 1 - Remote Data Federation
When Scalytics executes federated data operations on multiple data sources at the same time, the intermediate results of those operations can be sent to the location where Scalytics is currently running. This location could be, for example, the central data team in the US or France. At this location, Scalytics combines and integrates these intermediate results into a summary table by using Apache Wayang's data processing capabilities. In essence, Scalytics processes data from the various sources, brings the computed intermediate results to a central location, and then merges them into a consolidated summary table.
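Conceptually, the merge at the central location boils down to combining the partial aggregates arriving from each premise, key by key. A minimal plain-Java sketch, assuming hypothetical record shapes (key = country/account size, value = locally summed late fees), looks like this:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical partial aggregates shipped from each premise:
// key = "country/accountSize", value = sum of late fees computed locally.
public final class SummaryMerger {

    public static Map<String, Float> merge(List<Map<String, Float>> partials) {
        Map<String, Float> summary = new HashMap<>();
        // Add each premise's partial sums into the global summary table.
        for (Map<String, Float> partial : partials) {
            partial.forEach((key, fee) -> summary.merge(key, fee, Float::sum));
        }
        return summary;
    }
}
```

Only these aggregated key/value pairs travel to the central location; the raw customer records stay on their respective premises.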
Method 2 - Local Data Federation
Scalytics also supports location-based data federation. The intermediate aggregated results generated in New York City (NYC) are transmitted to the other location, Paris, where they are merged with the data already present there, resulting in a consolidated dataset. In other words, the computed intermediate results from NYC are sent to Paris and combined into a merged dataset there. This method implies that the merged data is further processed in Paris, in a fully GDPR-compliant manner.
Alternatively, the intermediate aggregated results processed in Paris can be sent to New York City (NYC) for merging. In this case, the data computed in Paris is transferred to NYC, where it is combined into a single summary table. This approach allows for data consolidation without requiring raw data to leave its original location, ensuring data privacy and compliance. It also means that the intermediate results are further processed in NYC, for example to build a holistic view of market conditions in certain economic areas.
In all scenarios, Scalytics prioritizes data security and privacy by ensuring that raw data remains within its original location, adhering to strict compliance regulations. Unlike other solutions, Scalytics does not require the deployment of third-party execution engines on the data pools, simplifying the data management process. With Scalytics' advanced capabilities, organizations have the flexibility to choose the approach that suits their specific needs. Whether they opt for data aggregation at the platform where Scalytics operates, at the source platform, or even let Blossom's AI optimizer make the decision, Scalytics empowers users to effortlessly navigate complex data scenarios while maintaining the highest data security standards.
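In practice, this choice can be thought of as a single configuration knob. The property name and values below are purely illustrative, not actual Scalytics configuration keys; they only show how the three options map onto one setting:

```java
import org.apache.wayang.core.api.Configuration;

public class FederationConfigExample {
    public static void main(String[] args) {
        // Illustrative only: one knob deciding where intermediate results are merged.
        // "driver" -> Method 1: merge where Scalytics itself runs;
        // "source" -> Method 2: merge at one of the source premises (e.g. NYC or Paris);
        // "auto"   -> let the AI optimizer pick the cheapest location.
        Configuration configuration = new Configuration();
        configuration.setProperty("scalytics.federation.merge-location", "auto");
    }
}
```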
How Does Scalytics Ensure Data Access Control?
We have a scenario where we need to ensure that only specific team members have access to sensitive financial data while others should be restricted. How does Scalytics handle such stringent data access control requirements?
One of the most frequently asked questions is how Scalytics ensures stringent data access controls. Our platform operates on the principle of tight access delegation, where each user is granted access to specific data tables, mirroring the level of control available within your organization. We require user authentication only for our studio, Blossom Studio, enabling users to create working groups and manage access efficiently. This approach ensures that data access remains secure and controlled, minimizing the risk of unauthorized access and breaches while maintaining an intuitive and streamlined user experience. In practice, this means that the user executing a federated job must have access to the data sources included in their query, a process typically managed internally within organizations.
Do we have a holistic view of all datasets using Blossom, or do we need a Master Data Management (MDM)-type layer?
Scalytics offers the capability to connect with multiple data pools and platforms, making it unnecessary to implement a separate Master Data Management (MDM) layer solely for Scalytics. Our platform seamlessly integrates with existing data management systems, serving as a versatile and compliant solution for executing data pipelines, streamlining data operations, and ensuring data consistency across the organization.
How Does Scalytics' AI Optimizer Enhance Data Processing Efficiency?
We've been facing performance and reliability issues with our Spark and SQL instances. How does Scalytics' AI optimizer address these challenges, and can you share a specific case where it improved data processing efficiency for an organization with similar issues?
In scenarios where users may inadvertently make suboptimal decisions, Scalytics addresses performance concerns by leveraging its AI optimizer. For instance, in the example above, when the merge operation involves a large volume of intermediate results, performing it in a third location on the Java platform plugin could result in very long processing times or even out-of-memory exceptions. Scalytics' AI optimizer decides where each operation should take place, taking into consideration the runtime and/or the monetary costs involved. In one example, a classification task where Blossom's optimizer decided on the plan outperformed the single Java and Spark plugins by more than an order of magnitude.
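As a rough illustration of how this choice surfaces at the API level, the underlying Apache Wayang API lets you either pin a job to one platform plugin or register several and leave the placement to the cost-based optimizer. The sketch below assumes only the standard Java and Spark plugins:

```java
import org.apache.wayang.core.api.Configuration;
import org.apache.wayang.core.api.WayangContext;
import org.apache.wayang.java.Java;
import org.apache.wayang.spark.Spark;

public class OptimizerChoiceExample {
    public static void main(String[] args) {
        // Pinned: everything runs on the single-node Java executor, which can be
        // slow or run out of memory when intermediate results grow large.
        WayangContext javaOnly = new WayangContext(new Configuration())
                .withPlugin(Java.basicPlugin());

        // Optimizer-driven: with several platforms registered, the cost-based
        // optimizer places each operator where the estimated runtime (and/or
        // monetary cost) is lowest, e.g. Spark for a heavy merge.
        WayangContext optimized = new WayangContext(new Configuration())
                .withPlugin(Java.basicPlugin())
                .withPlugin(Spark.basicPlugin());
    }
}
```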
How Much Effort Is Needed to Start with Scalytics?
What programming options are available with Scalytics, and how user-friendly is the platform for our data team, especially if they have experience with tools like Apache Spark?
Blossom supports standard SQL, which makes it convenient to write analytical pipelines. Additionally, it comes with three programmatic APIs: a Scala-like Java API, a Scala API, and an SQL API. A Python API is on its way too! Writing pipelines from scratch involves a small learning curve, which is minimal for those already familiar with big data platforms such as Apache Spark. Scalytics also comes with Blossom Studio, where users can drag and drop operators to build their pipelines with low-code effort.
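For a feel of that learning curve, here is the classic word-count pipeline written with the Apache Wayang Java API that underlies Scalytics; the operators and flow will look immediately familiar to Spark users (the input path is a placeholder):

```java
import java.util.Arrays;
import java.util.Collection;

import org.apache.wayang.api.JavaPlanBuilder;
import org.apache.wayang.basic.data.Tuple2;
import org.apache.wayang.core.api.Configuration;
import org.apache.wayang.core.api.WayangContext;
import org.apache.wayang.java.Java;
import org.apache.wayang.spark.Spark;

public class WordCount {
    public static void main(String[] args) {
        WayangContext context = new WayangContext(new Configuration())
                .withPlugin(Java.basicPlugin())
                .withPlugin(Spark.basicPlugin());   // the optimizer picks the platform

        Collection<Tuple2<String, Integer>> counts = new JavaPlanBuilder(context)
                .withJobName("wordcount")
                .withUdfJarOf(WordCount.class)
                .readTextFile("file:///tmp/input.txt")
                .withName("load file")
                .flatMap(line -> Arrays.asList(line.split("\\W+")))
                .withName("split words")
                .filter(word -> !word.isEmpty())
                .withName("drop empty tokens")
                .map(word -> new Tuple2<>(word.toLowerCase(), 1))
                .withName("attach counter")
                .reduceByKey(Tuple2::getField0,
                             (a, b) -> new Tuple2<>(a.getField0(), a.getField1() + b.getField1()))
                .withName("sum counters")
                .collect();

        counts.forEach(c -> System.out.println(c.getField0() + ": " + c.getField1()));
    }
}
```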
About Scalytics
Experience seamless integration across diverse data sources, enabling true AI scalability and removing the roadblocks that obstruct your machine learning, data compliance, and data privacy solutions for AI. Break free from the limitations of the past and accelerate innovation with Scalytics Connect, paving the way for a distributed computing framework that empowers your data-driven strategies.
Apache Wayang: The Leading Java-Based Federated Learning Framework
Scalytics is powered by Apache Wayang, and we're proud to support the project. You can check out their public GitHub repo right here. If you're enjoying our software, show your love and support - a star ⭐ would mean a lot!
If you need professional support from our team of industry-leading experts, you can always reach out to us via Slack or email.