April 2026
I have been building infrastructure for AI agents since before it was called agentic AI. Not demos. The actual thing: a knowledge layer that multiple agents can read from and write to simultaneously, without tearing itself apart under concurrent updates, without poisoning itself with stale facts, and without requiring every consumer to speak the same query language.
It took a long time. Longer than it should have, partly because the problems are genuinely hard, and partly because I kept reaching for tools that were almost right. Vector databases. Graph databases. Document stores. Each one solved a piece of the problem and introduced a new one.
I want to be specific about what broke and why, because the industry is about to repeat those mistakes at scale.
The Problem Nobody Admits
Multi-agent systems have a shared memory problem that most architectures paper over. Agents read stale state and act on it. Agents write conflicting updates with no arbitration. Agents store facts that were true three minutes ago but are not true anymore, and nobody has a mechanism to tell them. This is true for simple two-agent pipelines and it is true for the multi-agent LLM systems that enterprises are now deploying as part of data platform modernization programs.
The standard responses to this are: use a vector database with metadata filtering, use a graph database with versioned nodes, or inject fresh context at the start of every agent call and pretend the problem does not exist.
The third approach is by far the most popular. It is also the one that collapses at scale. Injecting a full context payload on every invocation is a token budget problem disguised as an architecture decision. At 10 agents it is fine. At 500 agents running concurrent tasks against the same knowledge base, you are paying for the same embeddings to be recomputed and injected hundreds of times per second.
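The arithmetic is easy to make concrete. A back-of-envelope sketch, with assumed numbers (an 8,000-token context payload, two invocations per second per agent, $3 per million input tokens) rather than anything measured:

```python
# Back-of-envelope cost of re-injecting a full context payload on every
# agent invocation. All numbers here are illustrative assumptions.

CONTEXT_TOKENS = 8_000        # tokens injected per invocation (assumed)
PRICE_PER_MTOK = 3.00         # USD per million input tokens (assumed)

def hourly_cost(agents: int, calls_per_sec: float) -> float:
    """USD per hour spent re-sending the same context payload."""
    tokens_per_hour = agents * calls_per_sec * 3_600 * CONTEXT_TOKENS
    return tokens_per_hour / 1_000_000 * PRICE_PER_MTOK

print(f"10 agents:  ${hourly_cost(10, 2):,.2f}/hour")    # $1,728.00/hour
print(f"500 agents: ${hourly_cost(500, 2):,.2f}/hour")   # $86,400.00/hour
```

The exact figures do not matter; the linear scaling does. Every agent added multiplies the cost of re-sending state the system already holds.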
The graph database approach is closer to right. But most graph databases are designed for human analysts running Cypher queries, not for agents that need to traverse a subgraph in under 50 milliseconds and report back to an orchestrator. The query interface is wrong. The latency profile is wrong. And the consistency model is wrong: most graph databases give you either full ACID semantics (expensive) or eventual consistency with no clear expiry semantics for stale facts.
Confluent arrived at the same diagnosis in early 2025, writing that the challenge facing agentic systems "isn't an AI problem, it's an infrastructure and data interoperability problem". Their answer was event-driven architecture on Kafka. That is correct as far as it goes. The part the industry has not addressed is what happens to the knowledge state those agents accumulate: how it expires, how it stays coherent across concurrent readers, and how an agent can traverse it without issuing a database query.
We have written about the broader shape of this problem in our work on Agentic RAG and decentralized data control and on why enterprise AI needs a proper runtime. The shared pattern across all of it: the retrieval layer is the bottleneck, and nobody is treating it as a first-class infrastructure problem.
Agents Already Know How to Navigate Filesystems
There is something agents are genuinely good at that we underuse. Agents trained on code navigate a filesystem without instruction. They issue ls, cat, grep, find. They follow directory hierarchies. They treat a directory tree as a navigable data structure, not as something that requires a query language.
Mintlify demonstrated this clearly with their ChromaFs work, published in March 2026. They replaced RAG over documentation with a virtual filesystem backed by Chroma. Session startup dropped from 46 seconds to 100 milliseconds. The reason is mechanical: instead of computing cosine distances over 768-dimensional vectors, agents just called grep. The behavior the model was already trained on turned out to be fast enough for most retrieval tasks.
It is worth reading their traffic data alongside this. AI coding agents now account for 45.3% of all documentation requests across Mintlify-powered sites, with Claude Code alone generating more requests than Chrome on Windows. The audience for documentation has already changed. The tooling to serve that audience has not caught up.
But ChromaFs starts with finished documents. Someone authored them. Someone organized them into a hierarchy. The AI arrives at the end to read what humans wrote.
That assumption breaks the moment your knowledge is generated by other agents. There is no human author. The "documents" are inference outputs, sensor readings, API responses, conversation transcripts. The hierarchy is not a folder tree someone designed. It is a graph. And in our work on real-time streaming architectures and event-driven agent systems, the graph is always live: edges appear and disappear in seconds, not days.
What agents actually need is not a filesystem over static documents. It is a filesystem over a live graph, one where the primitive operations are not "read this markdown file" but "traverse this edge," "enumerate the neighbors of this node," and "binary search this sorted node list."
Why Grep Beats RAG in Agentic Pipelines
RAG was designed for fuzzy human queries against stable corpora. Agents do not issue fuzzy queries: they execute plans, and a plan step either resolves exactly or it fails. Grep on a sorted binary partition gives you a definitive negative in O(log n), something cosine similarity cannot do: a score of 0.87 against the wrong chunk looks exactly like a score of 0.87 against the right one, so from the agent's perspective there is no "not found."
Mintlify measured this directly: replacing vector retrieval with filesystem-based grep dropped session latency from 46 seconds to 100 milliseconds and eliminated per-query compute cost entirely, because grep does not require embedding inference at read time.
The deeper issue is that RAG has no freshness model. Embeddings computed at index time stay in the index until someone reruns the pipeline, which means an agent in a live system is always operating against a snapshot of unknown age, with no mechanism to know which facts have expired and which have not.
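The "definitive negative" property is worth spelling out in code. A minimal sketch, with toy node IDs standing in for real 64-bit identifiers: an exact membership test over a sorted partition returns a hard yes or a hard no, which is something a similarity score cannot express.

```python
import bisect

# A sorted node-ID partition supports a definitive negative in O(log n):
# the ID is either present or it is not. Toy values for illustration.
node_ids = [0x11, 0x42, 0x7e, 0xa3, 0xff]   # sorted IDs (assumed data)

def lookup(ids: list[int], target: int) -> bool:
    """Exact membership test: a hard yes or a hard no, never a score."""
    i = bisect.bisect_left(ids, target)
    return i < len(ids) and ids[i] == target

print(lookup(node_ids, 0x7e))   # True  -- the fact exists
print(lookup(node_ids, 0x99))   # False -- a definitive negative

# A similarity search cannot say "no": it always returns the nearest
# chunk with a score, and the agent cannot tell a correct match from a
# near-miss without additional verification.
```

A plan step can branch on that boolean. It cannot safely branch on a cosine distance.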
KafScale as a Ramdisk for Graph Storage
A ramdisk is the simplest possible storage abstraction. You allocate a block of RAM, expose it as a block device, and mount it. Processes read and write bytes. The OS does not care that the backing store is memory instead of spinning disk. The interface is identical: open a file descriptor, seek to an offset, read N bytes. What changes is the latency.
This is the analogy that drove the architecture of KafScale when applied to graph storage.
The idea: represent the knowledge graph not as a labeled property graph with a Cypher API, but as a set of binary-packed record files. Nodes are fixed-width records. Edges are offset tables. Node IDs are 64-bit integers, sorted. The files are backed by Apache Kafka topics under the hood, which gives you append-only semantics, replayability, and exactly-once writes. But to any consumer, the interface looks like reading bytes from a mounted block device.
The consequence is that graph traversal becomes a binary search problem. To find node 7e3a9f, you do not issue a query to a graph engine. You binary search a sorted node list, the same operation your process would perform against a memory-mapped file. This runs in O(log n) with no network roundtrip, no query planner, no connection pool overhead.
Agents do not need to know this is a graph. They see a directory structure: /graph/nodes/, /graph/edges/by-source/, /graph/edges/by-target/. They use standard binary grep (searching fixed-width records by field offset) to locate nodes. The filesystem metaphor holds, but the underlying representation is dense binary, not human-readable text. For agents that communicate via MCP, KafScale exposes the same partitions as MCP-native tools: the Kafka-backed binary store and the MCP tool protocol share the same partition model, which means a Kafka MCP architecture does not require a translation layer between the agent's tool calls and the underlying storage.
This is architecturally distinct from GraphAr, the Apache project out of Alibaba that is standardizing a graph data file format using Parquet under the hood. GraphAr requires a Parquet reader to access any node. That is not what we want. We want agents to use grep. The goal is zero-dependency traversal: a binary-trained model, a mounted partition, and standard shell tools.
The design connects to our broader work on secure and scalable MCP/Kafka architectures, where the same Kafka-backed, partition-per-domain model underpins multi-tenant data isolation. The graph storage layer is the same pattern pushed down to the node level.
The Tombstone Problem
The hardest part of shared agent memory is not storage. It is expiry.
In a database, you delete a row. In a knowledge graph used by concurrent agents, you need to express that a fact was true until a specific time, is now superseded, and consumers who cached it should stop acting on it. That is not a delete. It is a tombstone.
LSM-tree databases like RocksDB and Cassandra handle this with tombstone records in the write-ahead log. When you delete a key, a deletion marker is appended to the log. Readers who encounter the tombstone know the value is gone. The original record is not physically removed until compaction runs. The log is always append-only.
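The read-path consequence of that design fits in a few lines. A deliberately minimal sketch; real engines like RocksDB and Cassandra layer memtables, SSTables, and compaction on top of this idea:

```python
# Minimal LSM-style read path over an append-only log with tombstones.
# Illustrative only: no memtable, no SSTables, no compaction.

TOMBSTONE = object()                  # sentinel deletion marker

log: list[tuple[str, object]] = []    # append-only (key, value) log

def put(key, value): log.append((key, value))
def delete(key):     log.append((key, TOMBSTONE))

def get(key):
    """Scan newest-to-oldest; a tombstone hides every older value."""
    for k, v in reversed(log):
        if k == key:
            return None if v is TOMBSTONE else v
    return None

put("queue-depth", 412)
put("queue-depth", 7)
delete("queue-depth")
print(get("queue-depth"))   # None -- the tombstone wins over older writes
```

Nothing is ever rewritten in place. The newest record for a key is the truth, and a tombstone is just the newest record saying "there is no truth here anymore."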
KafGraph, our distributed knowledge graph engine built on Kafka, applies this model directly to knowledge nodes.
When a fact expires or is superseded, we do not remove the node record. We append a tombstone event to the Kafka topic that backs that partition. The tombstone carries the node ID, the timestamp, and an optional supersession pointer (the ID of the replacement node, if one exists). Consumers that have cached the node receive the tombstone through their event subscription and invalidate their local copies.
The lease variant covers more cases. Nodes can be written with a TTL field. When the TTL expires, the system automatically generates a tombstone. This is exactly the lease semantics from distributed coordination systems like etcd and Google's Chubby: a resource is valid for a finite duration, and the holder must either renew or let it expire. The original formalization of lease semantics for distributed file cache consistency comes from Gray and Cheriton's 1989 paper, which established the principle that a lease is a time-bounded grant of validity. We borrowed the model because it maps cleanly onto knowledge that has a natural freshness window: API response data, queue depths, deployment states. You write them with a TTL and let the system handle the rest.
The result is what I call stable-with-tombstones semantics. At any point in time, an agent reading the graph knows exactly which facts are current (no tombstone) and which have expired (tombstone present). This is verifiably different from a vector database, where stale embeddings stay in the index indefinitely and silently pollute retrieval results.
This connects to a broader principle we have been developing in our work on federated learning and data integrity: the adversarial case for shared model state is poisoning, and the defense is always some form of provenance-with-expiry. Tombstones in a knowledge graph are the same class of solution applied to agent memory.
Part II: The First Chat with the Thing We Built
I want to share something that blew my mind: a Eureka moment, the kind Archimedes had, the moment that proved our ideas right. It happened just a few days ago, last weekend in March 2026. I do not usually write about sessions that feel unfinished. But this one belongs in the record.
KafClaw (docs: https://kafclaw.scalytics.io) is our agent runtime. It runs on KafScale (GitHub: https://github.com/KafScale/platform) as the communication layer and KafGraph as the memory layer. Its memory is the binary-packed node partition described above, backed by KafScale's long-term S3 storage, and it has been accumulating content since early 2023: over 1,100 documents from real Kafka consultations, unsorted, ingested in the order they arrived. Consumer group failures at 2am. Partition skew diagnoses. Dead letter queue investigations. Schema registry migrations. Two years of the messy, repetitive, specific problems that Kafka engineers actually face.
KafClaw does not retrieve from this corpus. It lives inside it. When it answers a question, it is traversing its own node graph. When it reasons about a pattern, it is reasoning over the shape of everything that has been written into it since it was initialized.
I had been testing the tombstone format for about an hour. At some point I stopped running benchmarks and started asking it questions.
What follows is a lightly edited transcript of that session. KafClaw responds as node://, because that is what it is: an address in its own graph.
Where This Goes
We are still building this. The binary format is mostly stable with some remaining work. The lease propagation model is in pre-alpha internal testing at small scale. The real challenge, teaching orchestrators to reason about TTL boundaries and context freshness, is ongoing.
This architecture is part of why we built KafScale and KafGraph the way we did, and why the Apache Wayang foundation underneath Scalytics matters: you need a query abstraction layer that can route work to the right execution backend depending on whether you are doing batch graph analytics, Flink streaming over live Kafka topics, or sub-10ms single-node lookup. Those are not the same workload. They should not hit the same code path.
The direction is clear. An agent deployed in a normal mid-size enterprise does not read one database. It has to navigate thousands of fragmented environments: SharePoint, shared folders, Splunk partitions, IT security logs, SCADA networks, BACnet controllers, IoT streams from industrial machines, medical equipment, and Kafka topics across a dozen teams. That list is not exhaustive. It is just what comes to mind immediately, and the reality in any real enterprise is harder.
This is the actual reason AI adoption is slower than the marketing suggests. Anthropic, OpenAI, xAI, Google: every major AI lab has built on the assumption that data can be centralized, or at least made accessible through a unified API. That assumption is not new. It is the same assumption that drove Edgar Codd's relational model at IBM in 1970: gather the data in one place, impose a schema, query it. That paper won Codd the Turing Award and defined how the industry would think about data for the next five decades. Apache Hadoop scaled the paradigm across commodity hardware. Databricks put a lakehouse on top of it. Snowflake moved it to the cloud. Every generation repackaged centralization as innovation, and none of them changed the fundamental premise. Now the AI wave has arrived and found the same paradigm waiting for it, dressed up in vector indexes instead of SQL tables.
Centralizing data so that an LLM can work with it is a fundamentally wrong starting point. It is not executable in practice, not in a utility company, not in a hospital network, not in a manufacturing plant, not in any organization that has been accumulating operational data across incompatible systems for sixty years. The data is not going to move. The agent has to go where the data is.
You cannot write a custom query integration for every silo. You cannot afford the compute to vectorize all of it. It's that simple, and that hard.
The tools to close that gap exist. KafScale gives you the memory-mapped binary store. KafClaw gives you the agentic runtime: sub-second boot time, an explicit intent framework, a clear security model. KafGraph gives you context collaboration, a replayable tombstone log, and lease semantics. That is the architecture that actually works and breaks the decades-old data centralization cycle.
What remains is making the agent ecosystem expect this interface instead of treating every knowledge access as a vector similarity search. That part is a standards problem. Standards problems are solved by building the thing and showing it works.
That is what that first session proved to us. The agent does not need a standard. It already knew the format, because the format is what it's made of.
--------
KafGraph and KafScale are part of the Scalytics open source stack. If you are building multi-agent systems with shared memory requirements, the Scalytics AI Readiness Assessment is a starting point for mapping where your current architecture stands.
Related reading from the Scalytics blog: Agentic RAG and decentralized data control | Streaming intelligence and real-time business decisions | Secure MCP/Kafka architecture for real-time analytics | The missing runtime for open models in the enterprise | Benchmarking contextual AI
About Scalytics
Our founding team created Apache Wayang (now an Apache Top-Level Project), the federated execution framework that orchestrates Spark, Flink, and TensorFlow where data lives and reduces ETL movement overhead.
We also invented and actively maintain KafScale (S3-Kafka-streaming platform), a Kafka-compatible, stateless data and large object streaming system designed for Kubernetes and object storage backends. Elastic compute. No broker babysitting. No lock-in.
Our mission: data stays in place. Compute comes to you. From data lakehouses to private AI deployment and distributed ML - all designed for security, compliance, and production resilience.
Questions? Join our open Slack community or schedule a consult.
