Research-grade answers demand more than a quick lookup. They require gathering evidence from multiple sources, verifying each fact, and weaving a clear narrative. Most QA benchmarks stop at simple fact retrieval or single-hop queries. SynthLink changes that by offering an open-source suite of 60 multi-step challenges that mirror real-world research workflows. It lets you test AI systems on exactly the kind of complex questions analysts tackle every day.
Traditional QA Benchmarks Fall Short
Traditional question-answering benchmarks have driven real progress in AI, but they simplify the problem far beyond what the real world demands. Many popular benchmarks pose questions against a single relevant document or a few paragraphs of context, so models can often find the answer through pattern matching or shallow reasoning. Research questions, by contrast, require an AI to investigate: break the question into smaller parts, gather information from different sources, and verify each piece of the puzzle. They demand multi-step (multi-hop) reasoning, where the answer isn't found in one place – the system must connect evidence scattered across different sources.
For example, a typical benchmark question might be, "What year did X invention occur?" – answerable with one lookup. A deep research question could be, "How did the invention of the printing press influence literacy rates and social movements that came later?" That question can't be answered with a single fact. Answering it means tracing change over time: reading about shifts in literacy and connecting those findings to events like the Protestant Reformation. QA benchmarks rarely test whether a model can perform this kind of synthesis and fact-checking, so a model can score well on them and still struggle to support real research or report writing.
SynthLink addresses this gap by focusing on difficult, analysis-heavy questions. It provides a way to measure how well AI systems perform on tasks that resemble real investigative research rather than simple lookups. This helps developers and product teams understand how their AI will behave on the practical, high-stakes questions real users ask when they need more than a quick answer.
How the Benchmark is Organized
SynthLink is a collection of 60 challenging questions designed to test an AI system's deep research capabilities. Each question requires multi-hop reasoning: the system must find and link information from multiple sources to arrive at the answer. The questions are organized into six broad categories, many grounded in real-world domains – historical analysis, economic shifts, environmental impacts, scientific breakthroughs, social movements, and future technologies. This diversity ensures that models are evaluated across a wide range of scenarios, from analyzing historical events to exploring scientific and technological trends.
What makes SynthLink unique is the nature of its tasks. These aren't simple Q&A pairs; they are scenarios that force an AI to act like a researcher. For each question, a model has to do several things:
- Iterative Linking: Find a chain of relevant information across documents or data sources to progressively answer the question.
- Synthesis: Combine insights from those sources into a cohesive, well-structured answer.
- Fact-Checking: Verify that each claim in the answer is backed by a source, reducing the risk of unsupported statements.
- Novel Connections: Draw connections that aren’t explicitly given in any single source (the model must infer or deduce them) without hallucinating facts.
In other words, SynthLink challenges AI systems to think more like a human researcher or analyst. The questions typically have several parts and require the AI to drill into sub-questions, gather evidence, and then compose a final answer. This tests an AI's ability not only to retrieve information but also to reason over it and explain what it has learned – a critical skill for decision support, report generation, or any domain where the process of finding the answer matters as much as the answer itself.
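To give a sense of what such an investigative loop can look like, here is a highly simplified sketch in Python. The search and generate functions are placeholders for a retrieval backend and a language model; SynthLink itself does not prescribe any particular agent implementation:

# Simplified sketch of the kind of iterative research loop SynthLink is
# designed to stress. `search` and `generate` are placeholders for a
# retrieval backend and a language model supplied by the caller.
def deep_research(question: str, search, generate, max_hops: int = 5) -> str:
    evidence = []                        # passages collected across hops
    query = question                     # start from the question itself
    for _ in range(max_hops):
        evidence.extend(search(query))   # iterative linking: fetch new sources
        follow_up = generate(
            f"Question: {question}\nEvidence so far: {evidence}\n"
            "What sub-question should be researched next? Reply DONE if none."
        )
        if follow_up.strip() == "DONE":
            break
        query = follow_up                # drill into the next sub-question
    # Synthesis: compose an answer grounded only in the gathered evidence.
    return generate(
        f"Question: {question}\nUsing only this evidence, write a sourced answer:\n{evidence}"
    )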
Each question also comes with an expected answer summary and a list of relevant topics. This allows a thorough review of an AI's answer and shows where it falls short of a well-sourced, correct response. SynthLink checks not just whether the answer is right or wrong, but also how it was derived.
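For illustration, a benchmark entry can be pictured roughly like this; the field names are assumptions made for the sake of the example, and the repository defines the exact schema:

# Rough picture of a SynthLink benchmark entry. Field names are
# illustrative; consult the repository for the actual schema.
example_question = {
    "question_id": "HIA-01",
    "category": "historical analysis",
    "question": "How did the invention of the printing press influence "
                "literacy rates and social movements that came later?",
    "expected_answer_summary": "Cheaper printed books spread literacy, "
                               "which helped fuel movements such as the Reformation.",
    "relevant_topics": ["printing press", "literacy", "Protestant Reformation"],
}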
Evaluation Across Five Metrics
To capture the many facets of complex research tasks, SynthLink uses a custom scoring system that evaluates each response along five dimensions. Every answer receives five metric scores (each ranging from 0 to 1), which are combined into an overall score. The metrics are:
- F1 Score (Answer Accuracy): Does the answer cover the right points? This compares the model's answer with the ground-truth summary at the token level and rewards answers that include all the important information (see the sketch after this list for how a token-level F1 can be computed).
- Precision@5 (Retrieval Relevance): Are the top sources the model used actually relevant? This metric checks the overlap between the model's top five retrieved documents and the known relevant sources for the question. A higher score means the system is finding useful information, not irrelevant passages.
- Reasoning Quality Score (RQS): Does the answer include all the necessary steps in the reasoning process? SynthLink defines a set of key steps or points that a good answer should cover, and RQS measures whether the model's answer included each one. It makes sure the answer doesn't skip important parts of the explanation.
- Fact-Checking Score (FCS): Are the answer's claims actually supported? FCS checks whether each claim in the answer is backed by the reference material, which reduces the likelihood of hallucinations and encourages the model to state only what it has evidence for.
- Iterative Efficiency (IE): How efficiently did the model arrive at the answer? Real research can take many steps, but an effective system should not wander. IE rewards models that reach the correct answer quickly – in practice, models that structure their search strategy well.
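To make the first two metrics concrete, here is a minimal sketch of token-level F1 and Precision@5 in Python. It illustrates the general technique, not necessarily the exact tokenization or normalization used by the SynthLink scoring script:

# Minimal sketch of token-level F1 and Precision@5. Illustrative only;
# the SynthLink scoring script may tokenize and normalize differently.
from collections import Counter

def token_f1(predicted: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two texts."""
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def precision_at_5(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the top five retrieved sources that are known to be relevant."""
    top5 = retrieved[:5]
    return sum(doc in relevant for doc in top5) / len(top5) if top5 else 0.0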
Together, these five metrics give a full picture of how well an AI system answers complex questions. For example, a model might score a high F1 (it included the right facts) but a low FCS (it cited incorrect sources or made unverifiable claims), suggesting some details were guessed. Another might retrieve good sources (high P@5) but fail to connect them correctly (low RQS). The combined scoring surfaces these differences. In SynthLink, the overall score is a weighted combination of the metrics: accuracy and reasoning carry the most weight because they matter most for deep research tasks, while retrieval, fact-checking, and efficiency contribute to a lesser degree. The result is a single 0-to-1 score that summarizes overall performance while still letting you inspect specific strengths and weaknesses.
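As a rough illustration of how such a weighted combination might be computed (the weights below are placeholders, not SynthLink's published weighting):

# Illustrative weighted aggregation of the five metric scores. The weights
# are placeholders; SynthLink's actual weighting emphasizes accuracy (F1)
# and reasoning quality (RQS) most heavily.
WEIGHTS = {"f1": 0.30, "rqs": 0.30, "p_at_5": 0.15, "fcs": 0.15, "ie": 0.10}

def overall_score(scores: dict[str, float]) -> float:
    """Combine per-metric scores (each in [0, 1]) into a single 0-1 score."""
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())

print(overall_score({"f1": 0.8, "rqs": 0.7, "p_at_5": 0.6, "fcs": 0.9, "ie": 0.5}))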
Integrating SynthLink into Your Workflow
SynthLink is available as an open-source project, which makes it easy for teams to use and integrate it into their AI development workflow. The SynthLink Catalog repository (MIT-licensed) provides everything needed to get started, including the benchmark questions, expected answers, and a scoring script. You can find the project on GitHub.
After you've cloned the repository, you can run the provided scoring script to evaluate your model's outputs on the benchmark, assuming the system you use allows deep search via an API, as Scalytics Connect does. The scoring script expects your model's predictions in a simple JSON format. Each entry should include the question ID, your model's predicted answer to that question, the list of documents or sources your model retrieved, and any other details such as the step-by-step iterations (if applicable). Here's an example entry:
{
  "question_id": "HIA-01",
  "predicted_answer": "The printing press made books cheaper, boosting literacy rates and eventually fueling movements like the Reformation.",
  "retrieved_docs": ["https://en.wikipedia.org/wiki/Printing_press", ...],
  "iterations": [ ... ],
  "sources_verified": [ ... ]
}
(The repository provides a template and example of this predictions format.) Once you have your model’s answers in this format, running the scoring script will automatically compute all five metric scores for each question and produce an aggregate score. The output includes a detailed CSV report for each question (so you can see where your model did well or struggled) and an overall summary of the results.
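Conceptually, the evaluation flow looks something like the sketch below. The file names and the score_question stub are placeholders to show the shape of the loop; in practice you would rely on the scoring script shipped with the repository:

# Conceptual sketch of the evaluation flow. File names and score_question
# are placeholders; use the repository's scoring script for real results.
import csv
import json

def score_question(prediction: dict, ground_truth: dict) -> dict:
    # Placeholder: the real script computes F1, Precision@5, RQS, FCS, and IE.
    return {"f1": 0.0, "p_at_5": 0.0, "rqs": 0.0, "fcs": 0.0, "ie": 0.0}

with open("predictions.json") as f:
    predictions = json.load(f)
with open("ground_truth.json") as f:
    ground_truth = {q["question_id"]: q for q in json.load(f)}

with open("report.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["question_id", "f1", "p_at_5", "rqs", "fcs", "ie"])
    writer.writeheader()
    for pred in predictions:
        scores = score_question(pred, ground_truth[pred["question_id"]])
        writer.writerow({"question_id": pred["question_id"], **scores})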
AI Research Relevance and Next Steps
If you're an AI developer or a product decision-maker, using SynthLink can give you confidence that a question-answering system can handle complex, realistic tasks. Whether you're building an AI assistant for researchers, a smart chatbot for business data, or a new search engine, the ability to handle multi-hop queries with verified, synthesized answers is essential. A high score on SynthLink means your system is good at finding the important information and making the right connections, not just passing off assumptions as facts. Benchmarks like this also help the community understand how open models behave, where their limits are, and how to make them better at complex reasoning, one question at a time.
About Scalytics
Built on distributed computing principles and modern virtualization, Scalytics Connect orchestrates resource allocation across heterogeneous hardware configurations, optimizing for throughput and latency. Our platform integrates seamlessly with existing enterprise systems while enforcing strict isolation boundaries, ensuring your proprietary algorithms and data remain entirely within your security perimeter.
With features like autodiscovery and index-based search, Scalytics Connect delivers a forward-looking, transparent framework that supports rapid product iteration, robust scaling, and explainable AI. By combining agents, data flows, and business needs, Scalytics helps organizations overcome traditional limitations and fully take advantage of modern AI opportunities.
If you need professional support from our team of industry leading experts, you can always reach out to us via Slack or Email.