How We Cut Agent Debugging From 4 Hours to 5 Minutes with Kafka-First Architecture
Imagine this: your week starts smoothly, with good progress on every channel. Then, on Tuesday, an invoice-generation agent made 847 API calls to process a single request. The customer noticed when their monthly bill jumped 340%. We had logs. We had metrics. We had alerts. But we couldn't answer one question: which agent session caused this?
That debugging session took 4 hours. It should have taken 5 minutes.
This is the observability crisis in production agent systems. Here's how we solved it with Kafka-first architecture.
The Challenge: Traditional Observability Breaks for Agents
Standard APM tools assume short-lived, synchronous requests. A web request hits your API, processes in 200ms, returns a response. You trace it with request IDs and structured logs.
Multi-agent systems break every one of those assumptions:
- Sessions span hours or days - An agent researches competitors, drafts content, schedules posts across 6 hours
- Sub-agents create independent execution chains - The main agent delegates to research, writing, and publishing sub-agents
- Context lives outside your application - Vector stores, external APIs, LLM reasoning aren't in your database
- The same agent handles hundreds of parallel sessions - Logs mix unrelated work streams
When something fails, your engineer spends hours reconstructing the session timeline from scattered CloudWatch entries. You're not debugging. You're doing forensic archaeology.
What We Learned From 40+ Production Agent Teams
We interviewed engineering leads running agents in production. The pattern was consistent. They needed three capabilities:
1. Session replay: "What did agent session X do, step by step?" Not log searching. Chronological event replay with full context.
2. Decision context: "Why did the agent choose this action?" Not just the API call. The reasoning, the model output, the confidence score.
3. Pattern detection: "How many sessions hit this failure mode?" Not one-off debugging. Analysis across thousands of sessions to find systemic issues.
Traditional APM tools can't deliver this. Datadog and New Relic trace HTTP requests through microservices. They don't replay agent reasoning chains or correlate events across 6-hour sessions.
We needed observability built for agent workflows.
Why Kafka for Agent Events
Most teams start by pushing agent telemetry to existing logging infrastructure. JSON blobs to CloudWatch, parse later, hope grep works.
That approach fails when you need to:
- Correlate events across services - Agent calls vector DB, triggers embedding pipeline, invokes LLM
- Replay sessions chronologically - Logs from 5 services arrive out of order
- Analyze patterns at scale - CloudWatch Insights times out on 15-minute queries
We evaluated three approaches:
Option 1: Database writes (Postgres, MongoDB)
- Strong consistency, SQL queries
- Write throughput collapses at 100K+ events/min
- Schema migrations block deployments when you need new fields
Option 2: Existing logging (CloudWatch, Datadog)
- Zero infrastructure changes
- Built for debugging applications, not replaying agent sessions
- Query costs explode when analyzing thousands of sessions
Option 3: Kafka event streaming
- Handles 1M+ events/sec without performance degradation
- Preserves exact ordering per session (partition by session_id)
- Multiple consumers read the same stream (real-time dashboard + batch analysis)
- Requires running Kafka infrastructure
We chose Kafka because agent telemetry is event-driven. You don't need ACID transactions. You need guaranteed ordering and replay capability.
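That ordering guarantee comes from keyed partitioning: every event keyed by `session_id` hashes to the same partition, so Kafka preserves per-session order no matter how many sessions run in parallel. A minimal sketch of the idea (Kafka's default partitioner uses murmur2; MD5 here is only for illustration):

```python
import hashlib

def partition_for(session_id: str, num_partitions: int) -> int:
    # Deterministic hash: the same session always maps to the same
    # partition, so all its events land in one ordered log.
    digest = hashlib.md5(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Same session -> same partition, every time
assert partition_for("sess_a1b2c3", 20) == partition_for("sess_a1b2c3", 20)
```

Different sessions spread across partitions for parallelism, but no single session's events ever interleave out of order.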
Agent Memory Persistence: The Underserved Problem
Here's what we discovered: agent observability isn't just about tracing actions. It's about persisting memory.
An agent session builds context over time:
- Conversation history with the user
- Retrieved documents from vector search
- Tool call results (API responses, file reads)
- Reasoning chains and decision points
- Sub-agent delegation and results
Traditional logs capture outputs. They don't capture the full memory state that led to each decision.
Kafka as Agent Memory Store
When you partition Kafka topics by session_id, you get:
- Ordered memory timeline - Every event appended in sequence
- Replayable context - Reconstruct exact agent state at any point
- Persistent storage - Memory survives agent crashes and restarts
- Multi-consumer access - Real-time dashboard, batch analytics, audit logs
This transforms Kafka from a message queue into a memory persistence layer.
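Because the log is ordered, reconstructing agent memory is just a fold over the events. A sketch (the state shape and field names here are illustrative, not our actual SDK's model):

```python
def replay_state(events):
    # Fold the partition-ordered event log into agent memory state.
    # Illustrative state shape: a timeline plus the latest tool outputs.
    state = {"timeline": [], "tool_results": {}}
    for ev in events:
        state["timeline"].append(ev["event_id"])
        if ev["event_type"] == "tool_call":
            state["tool_results"][ev["payload"]["tool"]] = ev["payload"]["output"]
    return state

events = [
    {"event_id": "evt_1", "event_type": "llm_call", "payload": {}},
    {"event_id": "evt_2", "event_type": "tool_call",
     "payload": {"tool": "stripe.create_invoice",
                 "output": {"status": "draft"}}},
]
state = replay_state(events)
```

Stopping the fold at any event index gives you the agent's exact memory state at that point in the session, which is what makes step-by-step replay possible.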
Memory Retention Strategy
We use tiered storage to balance performance and cost:
Hot tier (7 days): Kafka brokers store recent events for fast random access. Active sessions need sub-second lookups for dashboard queries and real-time debugging. This tier handles 95% of debugging requests.
Warm tier (30 days): Kafka's tiered storage moves older events to S3. Still queryable through Kafka consumers, but with higher latency (2-5 seconds vs 50ms). Used for post-incident reviews and pattern analysis.
Cold tier (90+ days): Events exported to Parquet files in S3. Batch analysis only via Spark or Athena. Compliance and long-term trend detection. Query time measured in minutes, not seconds.
This three-tier approach gives us fast debugging for recent issues while maintaining audit history at low cost. Storage cost dropped 73% compared to keeping everything hot in Kafka brokers.
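With Kafka tiered storage (KIP-405) enabled on the brokers, the hot/warm split maps to per-topic retention settings. Roughly like this, assuming a broker with a RemoteStorageManager plugin configured for S3 (topic name and broker address from the examples in this post):

```shell
kafka-configs.sh --bootstrap-server kafka-1:9092 --alter \
  --entity-type topics --entity-name agent-events \
  --add-config 'remote.storage.enable=true,local.retention.ms=604800000,retention.ms=2592000000'
```

`local.retention.ms` keeps 7 days on broker disks (the hot tier); `retention.ms` caps total retention, local plus S3, at 30 days, after which the cold-tier Parquet export takes over.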
Implementation: The Architecture We Built
Here's the production system we're running today.
Event Schema
Every agent action produces a structured event:
```json
{
  "event_id": "evt_9x7k2m",
  "session_id": "sess_a1b2c3",
  "agent_id": "invoice-generator-v2",
  "timestamp": "2026-01-15T14:23:11.847Z",
  "event_type": "tool_call",
  "payload": {
    "tool": "stripe.create_invoice",
    "input": {"customer_id": "cus_xyz", "amount": 4500},
    "output": {"invoice_id": "inv_abc", "status": "draft"},
    "latency_ms": 1240
  },
  "context": {
    "parent_event_id": "evt_8k3j1n",
    "reasoning": "Customer requested invoice for milestone 2",
    "model": "claude-sonnet-4-5",
    "confidence": 0.94
  }
}
```
Design decisions:
- `session_id` as partition key guarantees ordering for replay
- `parent_event_id` links sub-agent calls to triggering events
- `reasoning` captures why the agent chose this action
- `context` is flexible (add fields without breaking consumers)
Producer Integration
We built a TypeScript SDK wrapping the Kafka producer:
```typescript
import { AgentTracer } from '@scalytics/agent-tracer';

const tracer = new AgentTracer({
  sessionId: request.sessionId,
  agentId: 'invoice-generator-v2',
  kafka: { brokers: ['kafka-1:9092', 'kafka-2:9092'] }
});

// Automatic event capture with reasoning
await tracer.trackToolCall(
  'stripe.create_invoice',
  { customer_id: 'cus_xyz', amount: 4500 },
  async () => {
    return await stripe.invoices.create({...});
  },
  { reasoning: 'Milestone 2 completion', confidence: 0.94 }
);

// Manual event for custom actions
await tracer.logDecision({
  action: 'skip_email',
  reasoning: 'Customer disabled notifications',
  confidence: 0.89
});
```
Producer gotchas we hit:
- Batching breaks ordering if you batch across sessions. We partition the producer pool by `session_id`.
- Fire-and-forget loses events during crashes. We use `acks=all` and idempotent producers. This adds 15ms latency but prevents silent data loss.
- Network failures block threads. We queue events in-memory and flush async. If Kafka is unavailable, events buffer for 60 seconds before the agent errors.
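The buffer-then-fail behavior in that last gotcha can be sketched like this (`BufferedEmitter` is a hypothetical stand-in for the SDK's internal queue, not its real API):

```python
import time
from collections import deque

class BufferedEmitter:
    """Sketch: queue events in memory while Kafka is unreachable; after
    max_buffer_s, surface an error instead of silently dropping telemetry."""

    def __init__(self, send, max_buffer_s=60):
        self.send = send                  # callable that publishes one event
        self.max_buffer_s = max_buffer_s
        self.buffer = deque()
        self.first_failure = None

    def emit(self, event):
        self.buffer.append(event)
        try:
            while self.buffer:
                self.send(self.buffer[0])  # raises if the broker is down
                self.buffer.popleft()
            self.first_failure = None      # fully drained: reset failure clock
        except ConnectionError:
            self.first_failure = self.first_failure or time.monotonic()
            if time.monotonic() - self.first_failure > self.max_buffer_s:
                raise RuntimeError("telemetry buffered > 60s; failing agent")
```

The key design point: the agent keeps working through transient broker outages, but a sustained outage becomes a loud failure rather than a silent gap in the session history.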
Consumer: Real-Time Dashboard
The ops dashboard consumes agent-events and updates Redis for fast queries:
```python
from kafka import KafkaConsumer
import redis
import json

consumer = KafkaConsumer(
    'agent-events',
    group_id='session-dashboard',
    bootstrap_servers=['kafka-1:9092'],
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)
redis_client = redis.Redis(host='redis-1', decode_responses=True)

for message in consumer:
    event = message.value
    session_key = f"session:{event['session_id']}"

    # Append to timeline
    redis_client.rpush(
        f"{session_key}:events",
        json.dumps(event)
    )

    # Update metadata
    redis_client.hset(session_key, mapping={
        'last_event': event['timestamp'],
        'event_count': redis_client.llen(f"{session_key}:events"),
        'status': event.get('status', 'active')
    })

    # 30-day TTL
    redis_client.expire(session_key, 2592000)
```
The dashboard shows:
- Live session timelines (events stream in real-time)
- Current agent state (active tool call, reasoning)
- Cost tracking (LLM token usage per session)
Why Redis instead of querying Kafka directly: Kafka optimizes for sequential writes, not random reads. Fetching "last 50 events for session X" requires scanning the partition. Redis gives O(1) lookups.
Consumer: Batch Analytics
A second consumer runs hourly for pattern detection:
```python
from kafka import KafkaConsumer
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import json

consumer = KafkaConsumer(
    'agent-events',
    group_id='analytics-batch',
    bootstrap_servers=['kafka-1:9092'],
    auto_offset_reset='earliest',
    consumer_timeout_ms=10000,
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)
events = [msg.value for msg in consumer]

spark = SparkSession.builder.appName('agent-analytics').getOrCreate()
df = spark.createDataFrame(events)

# Find expensive sessions
expensive = df.filter(
    df.event_type == 'llm_call'
).groupBy('session_id').agg(
    F.sum('payload.cost_usd').alias('total_cost')
).filter('total_cost > 10')

# Detect retry loops
loops = df.filter(
    df.event_type == 'tool_call'
).groupBy('session_id', 'payload.tool').agg(
    F.count('*').alias('call_count')
).filter('call_count > 50')
```
This pipeline caught three production issues in week one:
- Agent retrying failed Stripe calls without backoff (164 calls in 90 seconds)
- Vector search running every loop iteration instead of caching
- LLM prompt including full conversation history on every call (10x token cost)
Standard logs and APM tools don't surface these patterns.
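The retry-loop rule itself is simple enough to unit-test without a Spark cluster: it reduces to a counter over (session, tool) pairs. A plain-Python sketch of the same check:

```python
from collections import Counter

def find_retry_loops(events, threshold=50):
    # Same rule as the Spark job above: flag any (session, tool)
    # pair called more than `threshold` times.
    calls = Counter(
        (e["session_id"], e["payload"]["tool"])
        for e in events
        if e["event_type"] == "tool_call"
    )
    return {pair: n for pair, n in calls.items() if n > threshold}

events = [{"session_id": "s1", "event_type": "tool_call",
           "payload": {"tool": "stripe.create_invoice"}}] * 60
loops = find_retry_loops(events)  # {("s1", "stripe.create_invoice"): 60}
```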
Production Toolchain: What We Use Daily
KShark: Session Replay Tool
KShark consumes Kafka events and reconstructs session timelines:
```shell
kshark replay sess_a1b2c3 --from "2026-01-15T14:00:00Z"
```
Output shows:
- Chronological event list with timestamps
- Tool calls with inputs/outputs
- Agent reasoning at each decision point
- Sub-agent delegations and results
- Performance metrics (latency per action)
We use this for debugging customer issues and post-incident reviews.
KDiff: Session Comparison
KDiff compares two sessions to find behavioral differences:
```shell
kdiff compare sess_a1b2c3 sess_b4c5d6 --show-divergence
```
This helps answer:
- Why did session A succeed but session B fail?
- What decision point caused different outcomes?
- Which reasoning chain led to the error?
KafMirror: Cross-Region Replication
KafMirror replicates agent events to disaster recovery regions:
```shell
kafmirror replicate --source us-east-1 --target eu-west-1 \
  --topics agent-events --lag-threshold 10s
```
This ensures we can replay sessions even if the primary region fails.
The Results After 60 Days
Debugging time:
- Before: 2-4 hours reconstructing sessions from logs
- After: 5 minutes with KShark replay
Cost visibility:
- Before: Discovered overages 3 days later (AWS bill arrival)
- After: Slack alert within 10 minutes when session exceeds threshold
Incident response:
- Before: "Something broke, grep logs, good luck"
- After: Click session ID, view timeline, see exact failure event
Team confidence:
- Before: Engineers nervous to deploy agent changes
- After: Full visibility enables daily deployments
Production Gotchas We Learned
1. Topic retention surprises you
We started with 7-day retention. Someone needed to debug a 10-day-old session. The events were gone.
Solution: Tiered storage (7 days hot, 30 days warm, 90 days cold archive).
2. Partition count matters upfront
We started with 3 partitions. One chatty agent dominated partition 2, creating a hotspot.
Kafka can add partitions to an existing topic, but the key-to-partition mapping changes and per-session ordering breaks for existing data. We created a new topic with 20 partitions and migrated consumers.
Plan for growth from day one.
3. Event size grows over time
Early events: 500 bytes. Then someone added full LLM prompts to context. Events ballooned to 50KB, killing throughput.
Solution: 10KB event limit. Large payloads go to S3 with references in events.
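The offload rule can be sketched in a few lines (`put_object` stands in for your S3 client; the 10KB cap and key scheme mirror our setup, the function itself is illustrative):

```python
import json

def externalize_payload(event, put_object, limit_bytes=10_240):
    # Keep events under the 10KB cap: oversized payloads go to object
    # storage and the event carries only a reference.
    if len(json.dumps(event).encode("utf-8")) <= limit_bytes:
        return event
    key = f"payloads/{event['event_id']}.json"
    put_object(key, json.dumps(event["payload"]))
    return {**event, "payload": {"s3_ref": key}}

store = {}  # stand-in for S3
big = {"event_id": "evt_2", "payload": {"blob": "x" * 20000}}
slim = externalize_payload(big, store.__setitem__)
```

Consumers that need the full payload dereference `s3_ref`; everything else (replay, dashboards, pattern detection) stays fast on the small event.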
4. Consumer lag is your SLA
Dashboard consumers falling behind show stale session data. We monitor lag as a first-class metric.
Fix: Autoscale consumers based on lag (not CPU). Kubernetes HPA with custom metrics.
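The scaling rule behind that custom metric is one line of arithmetic (parameter names are illustrative, not our actual HPA config):

```python
import math

def desired_consumers(total_lag, target_lag_per_consumer, max_consumers=20):
    # HPA-style rule: run enough consumers that each carries at most
    # target_lag_per_consumer messages of backlog.
    need = math.ceil(total_lag / target_lag_per_consumer)
    return max(1, min(max_consumers, need))
```

In practice the useful ceiling is the topic's partition count, since extra consumers in a group beyond that sit idle.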
When NOT to Use This Approach
Kafka adds operational complexity. Don't use it if:
- Running fewer than 10 agents - CloudWatch is sufficient
- Sessions under 60 seconds - Standard APM works fine
- Existing streaming platform - Use Kinesis or Pub/Sub if you already run them
- No Kafka experience - Steep learning curve
Start simple. Graduate to Kafka when your tools break.
Scalytics: Production-Grade Infrastructure
Building this yourself requires:
- Kafka cluster management (brokers, ZooKeeper, schema registry)
- Consumer scaling and lag monitoring
- Tiered storage configuration
- Session replay tooling (KShark, KDiff)
- Cross-region replication
- Cost tracking and alerting
KafScale provides this as managed infrastructure and handles:
- Kafka operations and scaling
- Schema management and versioning
- Consumer orchestration
- Session replay dashboard
- Cost anomaly detection
KafClaw adds agent development tooling:
- Pre-built agent templates with tracing
- Local development environment
- Integration testing framework
- Production deployment pipeline
If you're running agents in production and debugging with grep, you're one incident away from a multi-hour outage.
The pattern we've validated:
- Stream agent events to Kafka (ordered, replayable, scalable)
- Real-time consumer updates dashboards (Redis for fast queries)
- Batch consumer detects patterns (Spark or BigQuery)
- Alert on cost and failure loops (Slack, PagerDuty)
This isn't the only way to build agent observability. But after 60 days in production handling 2M+ events per day, it's what works for the client.
Links:
- Confluent: Build Real-Time AI Agents with Kafka - Validation of the architectural rationale for Kafka in agent systems.
- Uber Engineering: Kafka Tiered Storage - Empirical evidence for cost-reduction and performance at scale.
- LinkedIn Engineering: Benchmarking Apache Kafka Scale - Proof of high-throughput capacity required for intensive agent telemetry.
- Splunk: Monitoring Generative AI - Contextualizing the visibility gap in traditional monitoring tools.
About Scalytics
Our founding team created Apache Wayang (now an Apache Top-Level Project), the federated execution framework that orchestrates Spark, Flink, and TensorFlow where data lives and reduces ETL movement overhead.
We also invented and actively maintain KafScale (S3-Kafka-streaming platform), a Kafka-compatible, stateless data and large object streaming system designed for Kubernetes and object storage backends. Elastic compute. No broker babysitting. No lock-in.
Our mission: Data stays in place. Compute comes to you. From data lakehouses to private AI deployment and distributed ML - all designed for security, compliance, and production resilience.
Questions? Join our open Slack community or schedule a consult.
