Virtual Data Lake for Utilities

Connect on-prem SCADA, historians, and remote OT systems into a unified virtual data lake. No data movement. Full compliance. Built for utility operations.

The Problem: Utility Data Is Everywhere — and Can't Move

Utilities operate across dozens of disconnected data systems. SCADA historians sit air-gapped in control centers. Smart meter data flows into cloud platforms. GIS, asset management, and market systems each have their own storage. Remote substations and DER assets generate telemetry that never leaves the edge. Field crews collect inspection data on tablets that sync intermittently.

Traditional data lake approaches demand consolidation: extract everything, transform it, load it into a central repository. For utilities, this model breaks down:

  • Regulatory constraints — NERC CIP, GDPR, and sector-specific rules often prohibit moving operational data
  • Network isolation — OT systems are air-gapped by design; IT/OT convergence doesn't mean data convergence
  • Latency requirements — grid operations need millisecond decisions, not overnight batch loads
  • Field data gaps — mobile workforce data lives in disconnected systems, unavailable for analytics until manually reconciled
  • Cost and complexity — migrating decades of historian data is a multi-year project that rarely delivers ROI

The result: analytics teams build brittle point-to-point integrations, shadow IT proliferates, ML models train on stale snapshots, and AI initiatives stall waiting for "clean" data that never arrives.

The Solution: A Virtual Data Lake That Queries Data in Place

Scalytics Copilot creates a virtual data lake — a unified query layer across distributed utility systems without moving or copying data.

Your data stays where it is: in the historian, the SCADA network, the cloud platform, the edge device, the field crew's tablet. Scalytics Copilot federates queries across these sources, handles transformations in-flight, and returns unified results. Compliance boundaries remain intact. Network isolation is preserved. Analytics and ML training run where the data lives.

This isn't abstraction for its own sake. The architecture reflects lessons from real utility transformation programs — including E.ON's migration from legacy Hadoop to cloud-native platforms, where centralized approaches repeatedly failed before federated methods proved viable.

How federated data processing works — the technology behind Scalytics

Architecture Overview: The Scalytics Federated Layer

Utilities no longer need to risk compliance or latency by centralizing data. Scalytics Copilot establishes a Virtual Data Lake—a unified, secure layer that integrates data in place across your entire energy infrastructure.

The architecture is built to mirror utility reality:

  • Data Sources Remain Secure: Data is accessed directly wherever it sits: in air-gapped on-premises systems (SCADA, PI Historian, GIS, asset databases), in cloud environments (Databricks, Snowflake, IoT platforms), at the edge (substations, DER, smart meters), or on mobile devices (field crews, work orders).
  • Central Federation: The Scalytics Copilot layer sits in the middle, handling all query translation, execution, and security governance.
  • Computation Moves, Data Stays: Arrows on the diagram illustrate the critical difference: Queries are federated and ML models are trained across these distributed sources. Only unified results and aggregated model gradients move, ensuring that sensitive operational data never leaves its secure domain.

This design delivers high-accuracy grid intelligence without the complexity, cost, or regulatory risk of traditional data migration projects.

Scalytics Copilot: The Virtual Data Lake for Utilities

Scalytics Copilot enables advanced grid analytics and AI/ML operations by unifying distributed data systems without costly, non-compliant data movement.

The Problem: Data Isolation

  • Regulatory Constraints: NERC CIP, GDPR, and sector-specific rules often prohibit moving operational data.
  • Network Isolation: OT systems are air-gapped; central consolidation is impossible.
  • Latency Requirements: Grid decisions need real-time data, not slow batch ETL.
  • Cost & Risk: Migrating decades of data is costly, complex, and high-risk.

Result: ML models use stale data, AI initiatives stall, and costly point-to-point integrations proliferate.

The Solution: Federated Query

  • Connect — Not Migrate: Query historians, SCADA, and cloud data in place.
  • Query as One: A single query layer handles in-flight transformation across sources.
  • Compliance by Architecture: Access controls are enforced at the query layer, preserving residency.
  • Decentralized ML: Train models on distributed data; only model updates aggregate.

Result: Unified insights, real-time analytics, and secure AI/ML operations that honor isolation boundaries.

Architecture Overview: The Scalytics Federated Layer

Diagram: Scalytics Copilot, the Virtual Data Lake (unified query and ML layer), sits at the center and connects to four source domains:

  • SCADA / Historians: air-gapped, time-series data
  • Cloud Data Platforms: smart meter (AMI), IoT platforms, market data
  • Edge / DER Assets: substations, distributed energy resources
  • Field Crews & Mobile Ops: work orders, inspection reports, GPS traces

Arrows represent query federation and model training, not raw data migration.

Key Utility Use Cases & Value

Grid Analytics Across IT/OT

Combine SCADA telemetry, weather, and market signals for real-time grid optimization without bridging air-gapped networks.

Asset Performance Management (APM)

Federate sensor streams, maintenance records, and inspection data to predict equipment failures before they cascade.

Compliance & Regulatory Reporting

Query distributed systems for required compliance reporting without staging data, minimizing exposure risk.

Federated AI/ML Operations

Train complex models on distributed, regulated data sources simultaneously, ensuring compliance while maximizing data access.

How It Works

1. Connect, not migrate: Scalytics Copilot connects to your existing systems: OSIsoft PI historians, cloud data warehouses, edge databases, mobile workforce platforms, and streaming systems. No data extraction. No schema redesign.

2. Query as one: Write a single query or ML pipeline. Scalytics Copilot determines optimal execution across sources, pushing computation to the data rather than pulling data to computation.

3. Train models on federated data: ML models access distributed datasets without centralization. Training runs across sources while raw data never leaves its secure environment; only model updates aggregate.

4. Govern consistently: Access controls, encryption, and audit trails apply uniformly across the virtual data lake. Compliance is enforced at the federation layer.

5. Scale incrementally: Start with one use case, such as outage prediction, demand forecasting, asset health, or crew optimization. Add data sources and workloads without rearchitecting.
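To make "pushing computation to the data" concrete, here is a minimal sketch in plain Python. It is not the Scalytics Copilot API: two in-memory SQLite databases stand in for a historian and an asset management system, and the table names, columns, and the `federated_asset_summary` helper are all hypothetical. Each source runs its own aggregation, and only the small per-asset results are joined in the middle.

```python
import sqlite3


def make_source(schema_sql, insert_sql, rows):
    """Create an in-memory SQLite database standing in for one source system."""
    conn = sqlite3.connect(":memory:")
    conn.execute(schema_sql)
    conn.executemany(insert_sql, rows)
    conn.commit()
    return conn


# Hypothetical stand-in for a historian holding raw time-series readings.
historian = make_source(
    "CREATE TABLE readings (asset_id TEXT, ts INTEGER, temp_c REAL)",
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("T1", 1, 70.0), ("T1", 2, 90.0), ("T2", 1, 60.0), ("T2", 2, 62.0)],
)

# Hypothetical stand-in for an asset management database.
asset_db = make_source(
    "CREATE TABLE maintenance (asset_id TEXT, open_work_orders INTEGER)",
    "INSERT INTO maintenance VALUES (?, ?)",
    [("T1", 3), ("T2", 0)],
)


def federated_asset_summary(historian_conn, asset_conn):
    """Push aggregation down to each source, then join the small results."""
    # Pushdown: the historian computes the per-asset average itself, so one
    # row per asset crosses the boundary instead of every raw reading.
    avg_temp = dict(historian_conn.execute(
        "SELECT asset_id, AVG(temp_c) FROM readings GROUP BY asset_id"
    ).fetchall())
    open_orders = dict(asset_conn.execute(
        "SELECT asset_id, open_work_orders FROM maintenance"
    ).fetchall())
    # The join on the shared key happens in the federation layer.
    return [
        {
            "asset_id": asset_id,
            "avg_temp_c": avg_temp[asset_id],
            "open_work_orders": open_orders.get(asset_id, 0),
        }
        for asset_id in sorted(avg_temp)
    ]


summary = federated_asset_summary(historian, asset_db)
for row in summary:
    print(row)
```

The design point is the shape of the traffic: four raw readings stay in the historian, and two aggregated rows leave it. A real deployment would also need query planning, authentication, and source-specific connectors, none of which are shown here.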

Utility Use Cases

Grid Analytics Across IT and OT

Combine SCADA telemetry, weather data, and market signals for real-time grid optimization — without bridging air-gapped networks.

See Smart Grid Intelligence for streaming grid use cases

Asset Performance Management

Federate maintenance records, sensor streams, and inspection reports to predict equipment failures before they cascade.

Regulatory Reporting

Query distributed systems for compliance reporting without staging data in intermediate warehouses.

Demand-Side Management

Integrate smart meter data, CRM systems, and weather feeds to model flexible load segments — data stays distributed, insights are unified.

Utility AI/ML Operations

Traditional MLOps assumes centralized data. Utility reality is different: training data lives in SCADA historians, asset databases, weather feeds, and field systems — often subject to strict residency requirements.

Scalytics Copilot enables federated ML operations:

Train on distributed data: Models access data across sources without extraction. A transformer health model can train on historian sensor data, maintenance records from the asset system, and inspection notes from field crews, all at the same time and without moving bytes.

Privacy-preserving training: Federated learning techniques allow model training across organizational or regulatory boundaries. Only model gradients move; raw operational data stays in place.

Consistent feature pipelines: Feature engineering runs identically across development and production environments, regardless of where source data resides. No "works on my laptop" surprises.

Reproducible experiments: Track which data sources, versions, and transformations contributed to each model. Audit trails span the entire virtual data lake.
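The idea that "only model gradients move" can be illustrated with a small federated gradient averaging loop (often called FedSGD). This is a teaching sketch under simplifying assumptions, not how Scalytics Copilot implements training: the model is a single weight fitting y ≈ w · x, the three `sites` and their values are made up, and there is no encryption or secure aggregation.

```python
def local_gradient(w, data):
    """Gradient of mean squared error for y ≈ w * x on one site's local data."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)


def federated_round(w, sites, lr):
    """Run one training round across all sites and return the updated weight.

    Each site reports only (gradient, sample_count). The raw (x, y) rows are
    read inside local_gradient and never passed to the aggregation step.
    """
    updates = [(local_gradient(w, data), len(data)) for data in sites]
    total = sum(count for _, count in updates)
    # Weight each site's gradient by how many samples produced it.
    avg_grad = sum(grad * count for grad, count in updates) / total
    return w - lr * avg_grad


# Hypothetical local datasets at three sites; every point follows y = 2x.
sites = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(4.0, 8.0), (5.0, 10.0)],
]

w = 0.0
for _ in range(200):
    w = federated_round(w, sites, lr=0.01)

print(f"learned weight: {w:.4f}")
```

Weighting by sample count makes the averaged gradient equal to the gradient that a single pooled dataset would produce, which is why the result matches centralized training in this simple case. Production systems typically add protections such as secure aggregation, since gradients themselves can leak information.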

Deep dive: Federated Learning for enterprise AI

Fleet & Mobile Workforce Integration

Field crews generate critical data — inspection reports, equipment photos, GPS traces, work order updates — but this information typically lives in disconnected mobile platforms, syncing hours or days later.

Scalytics Copilot brings field data into the virtual data lake:

Real-time crew visibility: Query work order status, crew locations, and job progress alongside grid state, enabling dynamic dispatch based on actual conditions.

Inspection data for predictive models: Field observations become training data for asset health models without manual export/import cycles. A technician's photo of corroded equipment feeds directly into failure prediction.

Optimize routing and scheduling: Combine outage predictions, asset priority scores, weather forecasts, and crew availability into unified optimization. Data from five systems, one query.

Close the loop: When a model predicts equipment failure, automatically generate work orders routed to the nearest qualified crew. Federated data makes closed-loop operations possible.
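As a rough illustration of the closed-loop step, the sketch below turns a failure prediction into a work order for the nearest qualified, available crew. Everything here is hypothetical: the field names, the 0.8 probability threshold, the crew records, and the use of straight-line (haversine) distance in place of real road routing.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def dispatch(prediction, crews, threshold=0.8):
    """Return a work order for the nearest qualified crew, or None."""
    if prediction["failure_probability"] < threshold:
        return None  # Below the (assumed) action threshold: no work order.
    qualified = [
        crew for crew in crews
        if crew["available"] and prediction["required_skill"] in crew["skills"]
    ]
    if not qualified:
        return None
    nearest = min(
        qualified,
        key=lambda crew: haversine_km(
            prediction["lat"], prediction["lon"], crew["lat"], crew["lon"]
        ),
    )
    return {"asset_id": prediction["asset_id"], "crew_id": nearest["crew_id"]}


# Hypothetical model output and crew roster.
prediction = {
    "asset_id": "TX-17", "failure_probability": 0.91,
    "required_skill": "transformer", "lat": 52.52, "lon": 13.40,
}
crews = [
    {"crew_id": "A", "skills": {"line"}, "available": True, "lat": 52.50, "lon": 13.42},
    {"crew_id": "B", "skills": {"transformer"}, "available": True, "lat": 52.60, "lon": 13.30},
    {"crew_id": "C", "skills": {"transformer"}, "available": False, "lat": 52.53, "lon": 13.41},
    {"crew_id": "D", "skills": {"transformer"}, "available": True, "lat": 53.00, "lon": 13.00},
]

order = dispatch(prediction, crews)
print(order)
```

Crew C is closest but unavailable and crew A lacks the skill, so the order goes to crew B. In practice the prediction, crew location, and availability would each come from a different system, which is where a federated query layer would come in.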

Why Utilities Choose Scalytics

Built by utility veterans: Our founding team led digital transformation at E.ON, architecting IoT platforms and cloud-native data infrastructure for connected energy assets. That hands-on experience shaped every design decision in Scalytics Copilot.

No rip-and-replace: Connect to existing historians, SCADA systems, workforce management platforms, and cloud lakes. Your infrastructure stays; analytics capabilities expand.

Compliance by architecture: Data residency, access controls, and audit requirements are enforced at the federation layer, not bolted on after the fact.

Transparent execution: See exactly which systems are queried, what transformations run where, and how results are assembled. No black-box magic.
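One way to picture enforcement "at the federation layer" with a visible execution plan is a gateway that checks policy and writes an audit entry before any source is contacted. The sketch below is illustrative only: `FederationGateway`, the role names, and the source names are invented for this example and do not describe the product's actual interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class FederationGateway:
    """Hypothetical gateway: checks policy, records an audit entry, returns a plan."""

    policies: dict  # role name -> set of source names that role may query
    audit_log: list = field(default_factory=list)

    def plan(self, role, sources):
        allowed = self.policies.get(role, set())
        denied = [source for source in sources if source not in allowed]
        # Every request is logged, whether or not it is permitted.
        self.audit_log.append(
            {"role": role, "sources": list(sources), "allowed": not denied}
        )
        if denied:
            raise PermissionError(f"{role} may not query: {', '.join(denied)}")
        # The returned plan states which systems will be touched.
        return [f"query {source} in place" for source in sources]


gateway = FederationGateway(
    policies={
        "analyst": {"cloud_ami"},
        "ops_engineer": {"cloud_ami", "scada_historian"},
    }
)

steps = gateway.plan("ops_engineer", ["scada_historian", "cloud_ami"])
print(steps)

try:
    gateway.plan("analyst", ["scada_historian"])
    denied = False
except PermissionError as err:
    denied = True
    print(err)
```

Because the check runs before planning, a denied request never reaches the source system, and the audit log still records the attempt.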

Connect your Databricks lakehouse to on-prem utility systems

Get Started

Most utilities start with a single high-value use case — grid analytics, asset health, or workforce optimization — and expand from there.

Talk to an engineer about your data architecture

About Scalytics

Scalytics builds on Apache Wayang, the cross-platform data processing framework created by our founding team and now an Apache Top-Level Project. Where traditional platforms require moving data to centralized infrastructure, Scalytics brings compute to your data—enabling AI and analytics across distributed sources without violating compliance boundaries.

Scalytics Connect provides federated data processing across Spark, Flink, PostgreSQL, and cloud-native engines through a single abstraction layer. Our cost-based optimizer selects the right engine for each operation, reducing processing time while eliminating vendor lock-in.
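To give a feel for what cost-based engine selection means, here is a toy sketch. It is not the Apache Wayang or Scalytics optimizer: the two-parameter cost model (fixed startup time plus time per million rows) and every number in `ENGINES` are invented for illustration.

```python
# Hypothetical cost parameters per engine; real optimizers use far richer models.
ENGINES = {
    "postgres": {"startup_s": 0.05, "per_million_rows_s": 4.0},
    "spark": {"startup_s": 8.0, "per_million_rows_s": 0.5},
}


def estimate_cost(engine, rows):
    """Estimated seconds to run an operation over `rows` rows on one engine."""
    params = ENGINES[engine]
    return params["startup_s"] + params["per_million_rows_s"] * rows / 1_000_000


def choose_engine(rows):
    """Pick the engine with the lowest estimated cost for this input size."""
    return min(ENGINES, key=lambda engine: estimate_cost(engine, rows))


for rows in (100_000, 50_000_000):
    print(rows, choose_engine(rows))
```

With these made-up numbers, small inputs favor the engine with low startup overhead and large inputs favor the one with better throughput, which is the basic trade-off a cost-based optimizer weighs for each operation.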

Scalytics Copilot extends this foundation with private AI deployment: running LLMs, RAG pipelines, and ML workloads entirely within your security perimeter. Data stays where it lives. Models train where data resides. No extraction, no exposure, no third-party API dependencies.

For organizations in healthcare, finance, and government, this architecture isn't optional; it's how you deploy AI while remaining compliant with HIPAA, GDPR, and DORA.

Explore our open-source foundation: Scalytics Community Edition

Questions? Reach us on Slack or schedule a conversation.

Scalytics Copilot:
Real-time intelligence. No data leaks.

Launch your data + AI transformation.
