How We Cut Agent Debugging From 4 Hours to 5 Minutes with Kafka-First Architecture
Imagine this: your week starts smoothly, with good progress on every channel. Then, on Tuesday, an invoice-generation agent made 847 API calls to process a single request. The customer noticed when their monthly bill jumped 340%. We had logs. We had metrics. We had alerts. But we couldn't answer one question: which agent session caused this?
That debugging session took 4 hours. It should have taken 5 minutes.
This is the observability crisis in production agent systems. Here's how we solved it with Kafka-first architecture.
The Challenge: Traditional Observability Breaks for Agents
Standard APM tools assume short-lived, synchronous requests. A web request hits your API, processes in 200ms, returns a response. You trace it with request IDs and structured logs.
Multi-agent systems break every one of those assumptions:
- Sessions span hours or days - An agent researches competitors, drafts content, schedules posts across 6 hours
- Sub-agents create independent execution chains - The main agent delegates to research, writing, and publishing sub-agents
- Context lives outside your application - Vector stores, external APIs, LLM reasoning aren't in your database
- The same agent handles hundreds of parallel sessions - Logs mix unrelated work streams
When something fails, your engineer spends hours reconstructing the session timeline from scattered CloudWatch entries. You're not debugging. You're doing forensic archaeology.
What We Learned From 40+ Production Agent Teams
We interviewed engineering leads running agents in production. The pattern was consistent. They needed three capabilities:
1. Session replay: "What did agent session X do, step by step?" Not log searching. Chronological event replay with full context.
2. Decision context: "Why did the agent choose this action?" Not just the API call. The reasoning, the model output, the confidence score.
3. Pattern detection: "How many sessions hit this failure mode?" Not one-off debugging. Analysis across thousands of sessions to find systemic issues.
Traditional APM tools can't deliver this. Datadog and New Relic trace HTTP requests through microservices. They don't replay agent reasoning chains or correlate events across 6-hour sessions.
We needed observability built for agent workflows.
Why Kafka for Agent Events
Most teams start by pushing agent telemetry to existing logging infrastructure. JSON blobs to CloudWatch, parse later, hope grep works.
That approach fails when you need to:
- Correlate events across services - Agent calls vector DB, triggers embedding pipeline, invokes LLM
- Replay sessions chronologically - Logs from 5 services arrive out of order
- Analyze patterns at scale - CloudWatch Insights times out on 15-minute queries
We evaluated three approaches:
Option 1: Database writes (Postgres, MongoDB)
- Strong consistency, SQL queries
- Write throughput collapses at 100K+ events/min
- Schema migrations block deployments when you need new fields
Option 2: Existing logging (CloudWatch, Datadog)
- Zero infrastructure changes
- Built for debugging applications, not replaying agent sessions
- Query costs explode when analyzing thousands of sessions
Option 3: Kafka event streaming
- Handles 1M+ events/sec without performance degradation
- Preserves exact ordering per session (partition by session_id)
- Multiple consumers read the same stream (real-time dashboard + batch analysis)
- Requires running Kafka infrastructure
We chose Kafka because agent telemetry is event-driven. You don't need ACID transactions. You need guaranteed ordering and replay capability.
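That ordering guarantee comes from keyed partitioning: every event keyed by `session_id` hashes to the same partition, so Kafka preserves per-session order no matter how many sessions run in parallel. A minimal sketch of the idea (Kafka's default partitioner uses murmur2; MD5 here is only for illustration):

```python
import hashlib

def partition_for(session_id: str, num_partitions: int) -> int:
    # Deterministic hash: the same session always maps to the same
    # partition, so all its events land in one ordered log.
    digest = hashlib.md5(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Same session -> same partition, every time
assert partition_for("sess_a1b2c3", 20) == partition_for("sess_a1b2c3", 20)
```

Different sessions spread across partitions for parallelism, but no single session's events ever interleave out of order.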
Agent Memory Persistence: The Underserved Problem
Here's what we discovered: agent observability isn't just about tracing actions. It's about persisting memory.
An agent session builds context over time:
- Conversation history with the user
- Retrieved documents from vector search
- Tool call results (API responses, file reads)
- Reasoning chains and decision points
- Sub-agent delegation and results
Traditional logs capture outputs. They don't capture the full memory state that led to each decision.
Kafka as Agent Memory Store
When you partition Kafka topics by session_id, you get:
- Ordered memory timeline - Every event appended in sequence
- Replayable context - Reconstruct exact agent state at any point
- Persistent storage - Memory survives agent crashes and restarts
- Multi-consumer access - Real-time dashboard, batch analytics, audit logs
This transforms Kafka from a message queue into a memory persistence layer.
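Because the log is ordered, reconstructing agent memory is just a fold over the events. A sketch (the state shape and field names here are illustrative, not our actual SDK's model):

```python
def replay_state(events):
    # Fold the partition-ordered event log into agent memory state.
    # Illustrative state shape: a timeline plus the latest tool outputs.
    state = {"timeline": [], "tool_results": {}}
    for ev in events:
        state["timeline"].append(ev["event_id"])
        if ev["event_type"] == "tool_call":
            state["tool_results"][ev["payload"]["tool"]] = ev["payload"]["output"]
    return state

events = [
    {"event_id": "evt_1", "event_type": "llm_call", "payload": {}},
    {"event_id": "evt_2", "event_type": "tool_call",
     "payload": {"tool": "stripe.create_invoice",
                 "output": {"status": "draft"}}},
]
state = replay_state(events)
```

Stopping the fold at any event index gives you the agent's exact memory state at that point in the session, which is what makes step-by-step replay possible.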
Memory Retention Strategy
We use tiered storage to balance performance and cost:
Hot tier (7 days): Kafka brokers store recent events for fast random access. Active sessions need sub-second lookups for dashboard queries and real-time debugging. This tier handles 95% of debugging requests.
Warm tier (30 days): Kafka's tiered storage moves older events to S3. Still queryable through Kafka consumers, but with higher latency (2-5 seconds vs 50ms). Used for post-incident reviews and pattern analysis.
Cold tier (90+ days): Events exported to Parquet files in S3. Batch analysis only via Spark or Athena. Compliance and long-term trend detection. Query time measured in minutes, not seconds.
This three-tier approach gives us fast debugging for recent issues while maintaining audit history at low cost. Storage cost dropped 73% compared to keeping everything hot in Kafka brokers.
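With Kafka tiered storage (KIP-405) enabled on the brokers, the hot/warm split maps to per-topic retention settings. Roughly like this, assuming a broker with a RemoteStorageManager plugin configured for S3 (topic name and broker address from the examples in this post):

```shell
kafka-configs.sh --bootstrap-server kafka-1:9092 --alter \
  --entity-type topics --entity-name agent-events \
  --add-config 'remote.storage.enable=true,local.retention.ms=604800000,retention.ms=2592000000'
```

`local.retention.ms` keeps 7 days on broker disks (the hot tier); `retention.ms` caps total retention, local plus S3, at 30 days, after which the cold-tier Parquet export takes over.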
Implementation: The Architecture We Built
Here's the production system we're running today.
Event Schema
Every agent action produces a structured event:
```json
{
  "event_id": "evt_9x7k2m",
  "session_id": "sess_a1b2c3",
  "agent_id": "invoice-generator-v2",
  "timestamp": "2026-01-15T14:23:11.847Z",
  "event_type": "tool_call",
  "payload": {
    "tool": "stripe.create_invoice",
    "input": {"customer_id": "cus_xyz", "amount": 4500},
    "output": {"invoice_id": "inv_abc", "status": "draft"},
    "latency_ms": 1240
  },
  "context": {
    "parent_event_id": "evt_8k3j1n",
    "reasoning": "Customer requested invoice for milestone 2",
    "model": "claude-sonnet-4-5",
    "confidence": 0.94
  }
}
```
Design decisions:
- `session_id` as partition key guarantees ordering for replay
- `parent_event_id` links sub-agent calls to triggering events
- `reasoning` captures why the agent chose this action
- `context` is flexible (add fields without breaking consumers)
Producer Integration
We built a TypeScript SDK wrapping the Kafka producer:
```typescript
import { AgentTracer } from '@scalytics/agent-tracer';

const tracer = new AgentTracer({
  sessionId: request.sessionId,
  agentId: 'invoice-generator-v2',
  kafka: { brokers: ['kafka-1:9092', 'kafka-2:9092'] }
});

// Automatic event capture with reasoning
await tracer.trackToolCall(
  'stripe.create_invoice',
  { customer_id: 'cus_xyz', amount: 4500 },
  async () => {
    return await stripe.invoices.create({...});
  },
  { reasoning: 'Milestone 2 completion', confidence: 0.94 }
);

// Manual event for custom actions
await tracer.logDecision({
  action: 'skip_email',
  reasoning: 'Customer disabled notifications',
  confidence: 0.89
});
```
Producer gotchas we hit:
- Batching breaks ordering if you batch across sessions. We partition the producer pool by `session_id`.
- Fire-and-forget loses events during crashes. We use `acks=all` and idempotent producers. This adds 15ms latency but prevents silent data loss.
- Network failures block threads. We queue events in-memory and flush async. If Kafka is unavailable, events buffer for 60 seconds before the agent errors.
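The buffer-then-fail behavior in that last gotcha can be sketched like this (`BufferedEmitter` is a hypothetical stand-in for the SDK's internal queue, not its real API):

```python
import time
from collections import deque

class BufferedEmitter:
    """Sketch: queue events in memory while Kafka is unreachable; after
    max_buffer_s, surface an error instead of silently dropping telemetry."""

    def __init__(self, send, max_buffer_s=60):
        self.send = send                  # callable that publishes one event
        self.max_buffer_s = max_buffer_s
        self.buffer = deque()
        self.first_failure = None

    def emit(self, event):
        self.buffer.append(event)
        try:
            while self.buffer:
                self.send(self.buffer[0])  # raises if the broker is down
                self.buffer.popleft()
            self.first_failure = None      # fully drained: reset failure clock
        except ConnectionError:
            self.first_failure = self.first_failure or time.monotonic()
            if time.monotonic() - self.first_failure > self.max_buffer_s:
                raise RuntimeError("telemetry buffered > 60s; failing agent")
```

The key design point: the agent keeps working through transient broker outages, but a sustained outage becomes a loud failure rather than a silent gap in the session history.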
Consumer: Real-Time Dashboard
The ops dashboard consumes agent-events and updates Redis for fast queries:
```python
from kafka import KafkaConsumer
import redis
import json

consumer = KafkaConsumer(
    'agent-events',
    group_id='session-dashboard',
    bootstrap_servers=['kafka-1:9092'],
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)
redis_client = redis.Redis(host='redis-1', decode_responses=True)

for message in consumer:
    event = message.value
    session_key = f"session:{event['session_id']}"

    # Append to timeline
    redis_client.rpush(
        f"{session_key}:events",
        json.dumps(event)
    )

    # Update metadata
    redis_client.hset(session_key, mapping={
        'last_event': event['timestamp'],
        'event_count': redis_client.llen(f"{session_key}:events"),
        'status': event.get('status', 'active')
    })

    # 30-day TTL
    redis_client.expire(session_key, 2592000)
```
The dashboard shows:
- Live session timelines (events stream in real-time)
- Current agent state (active tool call, reasoning)
- Cost tracking (LLM token usage per session)
Why Redis instead of querying Kafka directly: Kafka optimizes for sequential writes, not random reads. Fetching "last 50 events for session X" requires scanning the partition. Redis gives O(1) lookups.
Consumer: Batch Analytics
A second consumer runs hourly for pattern detection:
```python
from kafka import KafkaConsumer
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import json

consumer = KafkaConsumer(
    'agent-events',
    group_id='analytics-batch',
    bootstrap_servers=['kafka-1:9092'],
    auto_offset_reset='earliest',
    consumer_timeout_ms=10000,
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)
events = [msg.value for msg in consumer]

spark = SparkSession.builder.appName('agent-analytics').getOrCreate()
df = spark.createDataFrame(events)

# Find expensive sessions
expensive = df.filter(
    df.event_type == 'llm_call'
).groupBy('session_id').agg(
    F.sum('payload.cost_usd').alias('total_cost')
).filter('total_cost > 10')

# Detect retry loops
loops = df.filter(
    df.event_type == 'tool_call'
).groupBy('session_id', 'payload.tool').agg(
    F.count('*').alias('call_count')
).filter('call_count > 50')
```
This pipeline caught three production issues in week one:
- Agent retrying failed Stripe calls without backoff (164 calls in 90 seconds)
- Vector search running every loop iteration instead of caching
- LLM prompt including full conversation history on every call (10x token cost)
Standard logs and APM tools don't surface these patterns.
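The retry-loop rule itself is simple enough to unit-test without a Spark cluster: it reduces to a counter over (session, tool) pairs. A plain-Python sketch of the same check:

```python
from collections import Counter

def find_retry_loops(events, threshold=50):
    # Same rule as the Spark job above: flag any (session, tool)
    # pair called more than `threshold` times.
    calls = Counter(
        (e["session_id"], e["payload"]["tool"])
        for e in events
        if e["event_type"] == "tool_call"
    )
    return {pair: n for pair, n in calls.items() if n > threshold}

events = [{"session_id": "s1", "event_type": "tool_call",
           "payload": {"tool": "stripe.create_invoice"}}] * 60
loops = find_retry_loops(events)  # {("s1", "stripe.create_invoice"): 60}
```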
Production Toolchain: What We Use Daily
KShark: Session Replay Tool
KShark consumes Kafka events and reconstructs session timelines:
```shell
kshark replay sess_a1b2c3 --from "2026-01-15T14:00:00Z"
```
Output shows:
- Chronological event list with timestamps
- Tool calls with inputs/outputs
- Agent reasoning at each decision point
- Sub-agent delegations and results
- Performance metrics (latency per action)
We use this for debugging customer issues and post-incident reviews.
KDiff: Session Comparison
KDiff compares two sessions to find behavioral differences:
```shell
kdiff compare sess_a1b2c3 sess_b4c5d6 --show-divergence
```
This helps answer:
- Why did session A succeed but session B fail?
- What decision point caused different outcomes?
- Which reasoning chain led to the error?
KafMirror: Cross-Region Replication
KafMirror replicates agent events to disaster recovery regions:
```shell
kafmirror replicate --source us-east-1 --target eu-west-1 \
  --topics agent-events --lag-threshold 10s
```
This ensures we can replay sessions even if the primary region fails.
The Results After 60 Days
Debugging time:
- Before: 2-4 hours reconstructing sessions from logs
- After: 5 minutes with KShark replay
Cost visibility:
- Before: Discovered overages 3 days later (AWS bill arrival)
- After: Slack alert within 10 minutes when session exceeds threshold
Incident response:
- Before: "Something broke, grep logs, good luck"
- After: Click session ID, view timeline, see exact failure event
Team confidence:
- Before: Engineers nervous to deploy agent changes
- After: Full visibility enables daily deployments
Production Gotchas We Learned
1. Topic retention surprises you
We started with 7-day retention. Someone needed to debug a 10-day-old session. The events were gone.
Solution: Tiered storage (7 days hot, 30 days warm, 90 days cold archive).
2. Partition count matters upfront
We started with 3 partitions. One chatty agent dominated partition 2, creating a hotspot.
Kafka can add partitions to an existing topic, but the key-to-partition mapping changes and per-session ordering breaks for existing data. We created a new topic with 20 partitions and migrated consumers.
Plan for growth from day one.
3. Event size grows over time
Early events: 500 bytes. Then someone added full LLM prompts to context. Events ballooned to 50KB, killing throughput.
Solution: 10KB event limit. Large payloads go to S3 with references in events.
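The offload rule can be sketched in a few lines (`put_object` stands in for your S3 client; the 10KB cap and key scheme mirror our setup, the function itself is illustrative):

```python
import json

def externalize_payload(event, put_object, limit_bytes=10_240):
    # Keep events under the 10KB cap: oversized payloads go to object
    # storage and the event carries only a reference.
    if len(json.dumps(event).encode("utf-8")) <= limit_bytes:
        return event
    key = f"payloads/{event['event_id']}.json"
    put_object(key, json.dumps(event["payload"]))
    return {**event, "payload": {"s3_ref": key}}

store = {}  # stand-in for S3
big = {"event_id": "evt_2", "payload": {"blob": "x" * 20000}}
slim = externalize_payload(big, store.__setitem__)
```

Consumers that need the full payload dereference `s3_ref`; everything else (replay, dashboards, pattern detection) stays fast on the small event.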
4. Consumer lag is your SLA
Dashboard consumers falling behind show stale session data. We monitor lag as a first-class metric.
Fix: Autoscale consumers based on lag (not CPU). Kubernetes HPA with custom metrics.
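The scaling rule behind that custom metric is one line of arithmetic (parameter names are illustrative, not our actual HPA config):

```python
import math

def desired_consumers(total_lag, target_lag_per_consumer, max_consumers=20):
    # HPA-style rule: run enough consumers that each carries at most
    # target_lag_per_consumer messages of backlog.
    need = math.ceil(total_lag / target_lag_per_consumer)
    return max(1, min(max_consumers, need))
```

In practice the useful ceiling is the topic's partition count, since extra consumers in a group beyond that sit idle.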
When NOT to Use This Approach
Kafka adds operational complexity. Don't use it if:
- Running fewer than 10 agents - CloudWatch is sufficient
- Sessions under 60 seconds - Standard APM works fine
- Existing streaming platform - Use Kinesis or Pub/Sub if you already run them
- No Kafka experience - Steep learning curve
Start simple. Graduate to Kafka when your tools break.
Scalytics: Production-Grade Infrastructure
Building this yourself requires:
- Kafka cluster management (brokers, ZooKeeper, schema registry)
- Consumer scaling and lag monitoring
- Tiered storage configuration
- Session replay tooling (KShark, KDiff)
- Cross-region replication
- Cost tracking and alerting
KafScale provides this as managed infrastructure and handles:
- Kafka operations and scaling
- Schema management and versioning
- Consumer orchestration
- Session replay dashboard
- Cost anomaly detection
KafClaw adds agent development tooling:
- Pre-built agent templates with tracing
- Local development environment
- Integration testing framework
- Production deployment pipeline
If you're running agents in production and debugging with grep, you're one incident away from a multi-hour outage.
The pattern we've validated:
- Stream agent events to Kafka (ordered, replayable, scalable)
- Real-time consumer updates dashboards (Redis for fast queries)
- Batch consumer detects patterns (Spark or BigQuery)
- Alert on cost and failure loops (Slack, PagerDuty)
This isn't the only way to build agent observability. But after 60 days in production handling 2M+ events per day, it's what works for the client.
Links:
- Confluent: Build Real-Time AI Agents with Kafka - Validation of the architectural rationale for Kafka in agent systems.
- Uber Engineering: Kafka Tiered Storage - Empirical evidence for cost-reduction and performance at scale.
- LinkedIn Engineering: Benchmarking Apache Kafka Scale - Proof of high-throughput capacity required for intensive agent telemetry.
- Splunk: Monitoring Generative AI - Contextualizing the visibility gap in traditional monitoring tools.
About Scalytics
Our founding team created Apache Wayang (now an Apache Top-Level Project), the federated execution framework that orchestrates Spark, Flink, and TensorFlow where data lives and reduces ETL movement overhead.
We also invented and actively maintain KafScale (S3-Kafka-streaming platform), a Kafka-compatible, stateless data and large object streaming system designed for Kubernetes and object storage backends. Elastic compute. No broker babysitting. No lock-in.
Our mission: Data stays in place. Compute comes to you. From data lakehouses to private AI deployment and distributed ML - all designed for security, compliance, and production resilience.
Questions? Join our open Slack community or schedule a consult.
