Scalytics 3.5.0 Is Here: Enterprise-Grade vLLM Inference, Torch Resilience, and Deep Search Refined with Private-Data Training

Alexander Alten-Lorenz

Enterprises demanding scalable AI have a new ally: Scalytics 3.5.0. This release unifies three transformative capabilities—vLLM-powered streaming inference, rock-solid PyTorch model support, and our re-architected Deep Search engine—all wrapped in private-data fine-tuning and governance features that suit even the most regulated environments.

From llama.cpp to vLLM: High-Throughput, Context-Aware Inference

Our migration from the GGUF/llama.cpp stack to vLLM isn't just an engine swap; it's a paradigm shift. Scalytics 3.5.0 automatically detects each model's optimal context window, choosing between a vLLM backend or database-driven offload to balance latency and cost. Gone are static max_tokens caps: dynamic limits now adjust at runtime to match model architecture, enterprise policy, and GPU availability. Under the hood, a central router orchestrates token streams, back-pressure controls prevent SSE dropouts, and bitsandbytes precision modes enable sub-8-bit inference on commodity GPUs. Real-time VRAM estimation and adaptive GPU-recommendation logic ensure every model activates with the right resources, while cancelable downloads and progress callbacks keep teams moving without waiting on massive weights.
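
To make the sizing logic concrete, here's a deliberately simplified Python sketch of context-window detection and VRAM estimation, built on the Hugging Face transformers config API. The function names, fallback values, and overhead factor are illustrative assumptions, not our production heuristics:

    # Simplified sketch: read a model's context window from its Hugging Face
    # config and estimate VRAM from parameter count. Constants are illustrative.
    from transformers import AutoConfig

    def detect_context_window(model_id: str, policy_cap: int = 32_768) -> int:
        """Read the architecture's maximum positions and clamp to enterprise policy."""
        cfg = AutoConfig.from_pretrained(model_id)
        # Different architectures expose the limit under different attribute names.
        arch_max = getattr(cfg, "max_position_embeddings", None) \
            or getattr(cfg, "n_positions", None) or 4096
        return min(arch_max, policy_cap)

    def estimate_vram_gb(num_params: float, bits: int = 16, overhead: float = 1.2) -> float:
        """Weight memory plus a flat overhead factor for KV cache and activations."""
        return num_params * (bits / 8) / 1e9 * overhead

    print(detect_context_window("Qwen/Qwen2.5-7B-Instruct"))   # e.g. 32768
    print(f"{estimate_vram_gb(7e9, bits=4):.1f} GB")           # sub-8-bit estimate

In the shipping router, numbers like these feed the GPU-recommendation logic and the vLLM-versus-database backend choice.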

Torch Models That Just Work: Simplified Parsing, Pre-Flight Checks, and JSON Robustness

Custom PyTorch models demand reliability. In 3.5.0, we've rebuilt tokenizer pre-flight checks to catch misconfigurations before they block your pipelines. Deep Search's JSON parser now handles nested payloads gracefully, eliminating edge-case failures during multi-step reasoning. And vLLM no longer attempts to load embedding-only checkpoints, slashing initialization times by up to 40 percent. Coupled with restored hardware-monitoring probes, complete with granular GPU-health metrics and alert thresholds, your in-house models run predictably from first request through production scale-out. Alongside these fixes, the inference layer gains the following (code sketches follow the list):

  • Dynamic Context Window Detection: Automated sizing logic now differentiates between vLLM and database backends, guaranteeing optimal token budgets per model instance.
  • Adaptive Max-Token Configuration: We’ve removed legacy max_tokens artifacts, replacing them with runtime-calculated limits that respect both model and enterprise policies.
  • Robust Streaming & Error-Handling: Centralized vLLM stream processing in our inference router ensures partial responses flow uninterrupted, while new back-pressure controls eliminate SSE disconnects and buffer deadlocks.
  • Compatibility & Precision Tuning: Our vLLM service now supports bitsandbytes precision options, context-limit enforcement, and automatic GPU recommendation based on real-time VRAM estimation.
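
For reference, the vLLM side of this looks roughly like the snippet below in plain Python. The model choice, context limit, and sampling values are examples rather than shipped defaults, and older vLLM releases may additionally require load_format="bitsandbytes":

    # Illustrative vLLM setup: bitsandbytes quantization plus an enforced
    # context limit. Values are examples only, not shipped defaults.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen2.5-7B-Instruct",
        quantization="bitsandbytes",   # sub-8-bit weights on commodity GPUs
        max_model_len=8192,            # context-limit enforcement below the arch max
        gpu_memory_utilization=0.85,   # leave VRAM headroom for other processes
    )

    params = SamplingParams(temperature=0.2, max_tokens=512)  # runtime token budget
    for output in llm.generate(["Summarize the 3.5.0 release in one line."], params):
        print(output.outputs[0].text)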
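
And here's a minimal sketch of the kind of tokenizer pre-flight check described above. The individual checks are illustrative; the shipping validation covers more cases:

    # Hypothetical pre-flight check: verify the tokenizer loads, carries the
    # special tokens generation needs, and round-trips text without loss.
    from transformers import AutoTokenizer

    def preflight_tokenizer(model_id: str) -> list[str]:
        problems: list[str] = []
        try:
            tok = AutoTokenizer.from_pretrained(model_id)
        except Exception as exc:  # missing files, malformed tokenizer_config.json, ...
            return [f"tokenizer failed to load: {exc}"]
        if tok.eos_token_id is None:
            problems.append("no EOS token: generation cannot terminate cleanly")
        if tok.pad_token_id is None:
            problems.append("no pad token: batched inference may misbehave")
        sample = "Pre-flight check: émigré 你好"
        decoded = tok.decode(tok.encode(sample), skip_special_tokens=True)
        if decoded.strip() != sample:
            problems.append("encode/decode round-trip is lossy")
        return problems

    if issues := preflight_tokenizer("Qwen/Qwen2.5-7B-Instruct"):
        raise RuntimeError("tokenizer pre-flight failed: " + "; ".join(issues))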

[Screenshot: Scalytics Connect Dashboard]

Deep Search 3.5.0: The Graph-Based Foundation for Knowledge Discovery

Deep Search has evolved from keyword retrieval into an enterprise-grade reasoning framework. In version 3.5.0, we’ve re-architected our DeepSearchService around LangGraph principles and added the following (a minimal pipeline sketch follows the list):

  • Typed-State Configuration: QueryNode, URLProcessingNode, AnswerSynthesisNode, and LLMReasoningWorker now expose max_queries_per_hop, domain_blocklist, summary_word_threshold, coreference resolution, and cross-document clustering—all tunable via config.py.
  • Vector-Search Pre-Filtering: Early filtering of synthesis chunks ensures only high-relevance passages enter the LLM pipeline, reducing API calls and cost.
  • SSE Stability for Research Tasks: We’ve tracked down and eliminated a subtle Caddy buffer bug—our SSE streams for multi-step research now withstand high-throughput loads without silent failures.
  • CourtListener Integration: Legal teams can tap into courtroom opinions directly, expanding our provider pool beyond generic web scrapes.
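
To make the typed-state idea concrete, here's a minimal pipeline sketch using LangGraph's public StateGraph API. The state fields mirror the config.py tunables above; the node bodies and wiring are stubs, not the actual DeepSearchService:

    # Minimal LangGraph sketch of a typed-state research pipeline.
    # State fields mirror the config.py tunables; node bodies are stubs.
    from typing import TypedDict
    from langgraph.graph import StateGraph, START, END

    class ResearchState(TypedDict):
        question: str
        queries: list[str]
        summaries: list[str]
        max_queries_per_hop: int      # cap query fan-out per reasoning hop
        domain_blocklist: list[str]   # domains that must never be crawled
        summary_word_threshold: int   # skip summarization below this length

    def query_node(state: ResearchState) -> dict:
        # A real node would expand the question with an LLM; this stub just
        # respects the per-hop cap.
        return {"queries": [state["question"]][: state["max_queries_per_hop"]]}

    def url_processing_node(state: ResearchState) -> dict:
        # A real node would crawl results, skip blocklisted domains, and only
        # summarize pages above summary_word_threshold words.
        return {"summaries": [f"[stub summary for: {q}]" for q in state["queries"]]}

    def answer_synthesis_node(state: ResearchState) -> dict:
        return {}  # a real node would cluster and synthesize across documents

    graph = StateGraph(ResearchState)
    graph.add_node("query", query_node)
    graph.add_node("process", url_processing_node)
    graph.add_node("synthesize", answer_synthesis_node)
    graph.add_edge(START, "query")
    graph.add_edge("query", "process")
    graph.add_edge("process", "synthesize")
    graph.add_edge("synthesize", END)
    app = graph.compile()

A real multi-hop loop would use conditional edges rather than a straight line, but even this linear version shows where each tunable lives in the typed state.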

[Screenshot: AI Agents in Scalytics Connect]

Private-Data Fine-Tuning and Continuous Refinement

Your data stays yours. Scalytics 3.5.0 introduces end-to-end private-data training workflows with secure Hugging Face token management: check status, revoke or delete credentials on demand, and audit every action through a centralized dashboard. Model activation flows are JWT-secured, with full audit logs for compliance. A new /storage/info endpoint lets you monitor disk usage, full-text-search index sizes and lock states—vital for capacity planning in regulated deployments—while graceful shutdown and file-locking mechanisms prevent race conditions during updates.
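
For capacity dashboards and cron checks, hitting the new endpoint can be as simple as the sketch below. Only the /storage/info path comes from this release; the base URL, bearer-token auth, and JSON field names are placeholder assumptions:

    # Hypothetical client for the /storage/info endpoint. The path is real;
    # base URL, auth scheme, and response field names are assumptions.
    import os
    import requests

    BASE_URL = os.environ.get("SCALYTICS_URL", "https://connect.example.internal")
    TOKEN = os.environ["SCALYTICS_JWT"]  # admin flows are JWT-secured

    resp = requests.get(
        f"{BASE_URL}/storage/info",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    info = resp.json()

    # Assumed shape: overall disk usage plus per-index size and lock state.
    print(f"disk used: {info.get('disk_used_gb', '?')} GB")
    for idx in info.get("fts_indexes", []):
        print(f"index {idx.get('name')}: {idx.get('size_mb')} MB, locked={idx.get('locked')}")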

Scalytics 3.5.0 delivers the performance, stability, and governance that today’s enterprises require. Whether you’re building a mission-critical chatbot, automating research workflows, or refining proprietary data with state-of-the-art LLMs, this release gives you:

  • Streaming vLLM Inference with dynamic context-window detection and GPU-aware resource allocation
  • Torch Model Resilience through simplified tokenizer validation and robust JSON handling
  • Deep Search 3.5.0 powered by configurable LangGraph pipelines, vector pre-filtering, and new legal data sources
  • Private-Data Refinement workflows that blend on-premises security with continuous fine-tuning

[Screenshot: Torch model configuration]

Head over to scalytics.io/release-notes for the full changelog, or reach out for a private demo. With Scalytics 3.5.0, you own every aspect of your AI stack, on your terms and behind your firewall.

About Scalytics

Scalytics provides enterprise-grade infrastructure that enables deployment of compute-intensive workloads in any environment—cloud, on-premises, or dedicated data centers. Our platform, Scalytics Connect, delivers a robust, vendor-agnostic solution for running high-performance computational models while maintaining complete control over your infrastructure and intellectual assets.
Built on distributed computing principles and modern virtualization, Scalytics Connect orchestrates resource allocation across heterogeneous hardware configurations, optimizing for throughput and latency. Our platform integrates seamlessly with existing enterprise systems while enforcing strict isolation boundaries, ensuring your proprietary algorithms and data remain entirely within your security perimeter.

With features like autodiscovery and index-based search, Scalytics Connect delivers a forward-looking, transparent framework that supports rapid product iteration, robust scaling, and explainable AI. By combining agents, data flows, and business needs, Scalytics helps organizations overcome traditional limitations and fully take advantage of modern AI opportunities.

If you need professional support from our team of industry-leading experts, you can always reach out to us via Slack or email.