Multi-agent systems fail in production for predictable reasons: one agent gets overloaded and drops messages, retries create duplicate work, or agents deadlock waiting for each other. Framework-based orchestrators try to solve this with custom scheduling, rate limiting, and retry logic, then struggle when scale exceeds their design assumptions.
KafScale takes a different approach: leverage Kafka's primitives instead of reinventing them. This final part of the series shows how Kafka-first patterns provide production-grade guarantees without centralized coordination.
The Orchestrator Bottleneck
Traditional multi-agent systems use a central orchestrator:
```
Agent A ──┐
          ├──> Orchestrator ──> Agent B
Agent C ──┘
```
Problems at scale:
- Single point of failure: Orchestrator down = whole system down
- Backpressure doesn't propagate: Agent B slow? Orchestrator queues requests, eventually OOMs
- Retry complexity: Orchestrator must track which agent failed, persist state, schedule retries
- No replay: Agent B crashed during processing? Data lost.
Kafka's architecture is fundamentally different:
```
Agent A ─┐
         ├──> Topic (durable log) ──┬──> Agent B
Agent C ─┘                          └──> Agent D
```
Key insight: The topic is the orchestrator. But unlike a traditional orchestrator, it's distributed, durable, and scales horizontally.
Pattern 1: Consumer Groups for Load Balancing
When Agent B gets overwhelmed, KafScale uses Kafka consumer groups for automatic load distribution:
```python
# Two instances of Agent B, same consumer group
@kafscale.consumer(topic="tasks.pending", group="agent-b-workers")
def process_task(event):
    # Kafka guarantees each event goes to only one instance
    result = expensive_computation(event)
    emit(topic="tasks.completed", data=result)
```
Deploy 3 instances → Kafka distributes partitions:
- Instance 1: processes partitions 0, 1, 2
- Instance 2: processes partitions 3, 4, 5
- Instance 3: processes partitions 6, 7, 8
Instance 2 crashes → Kafka automatically rebalances:
- Instance 1: partitions 0, 1, 2, 3, 4
- Instance 3: partitions 5, 6, 7, 8
No orchestrator reconfiguration. No manual load balancing. Kafka handles it.
KafScale advantage: Add capacity by deploying more containers. Remove capacity by scaling down. Kafka's consumer group protocol handles coordination; your agents just consume and produce.
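The rebalance above follows the shape of Kafka's range assignor: partitions are split contiguously across the group's live members, with earlier members absorbing the remainder. A minimal sketch of that assignment logic (illustrative, not Kafka's actual implementation):

```python
def range_assign(partitions, consumers):
    """Contiguously split partitions across consumers, range-assignor style.

    Earlier consumers receive one extra partition when the count
    doesn't divide evenly.
    """
    consumers = sorted(consumers)
    base, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(consumers):
        count = base + (1 if i < extra else 0)
        assignment[consumer] = partitions[start:start + count]
        start += count
    return assignment

# Three healthy instances: 9 partitions split 3/3/3
print(range_assign(list(range(9)), ["i1", "i2", "i3"]))
# Instance 2 crashes → the group rebalances to 5/4
print(range_assign(list(range(9)), ["i1", "i3"]))
```

Running this reproduces the distribution in the text: with two survivors, instance 1 takes partitions 0–4 and instance 3 takes 5–8, with no central coordinator involved.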
Pattern 2: Backpressure Through Partition Lag
When Agent B can't keep up, orchestrators typically:
- Queue requests (until memory exhausted)
- Drop requests (data loss)
- Apply rate limiting (manual tuning, brittle)
Kafka provides automatic backpressure via partition lag:
```python
@kafscale.consumer(
    topic="tasks.pending",
    group="agent-b-workers",
    max_poll_records=10,       # Fetch at most 10 messages per poll
    session_timeout_ms=30000,  # 30s heartbeat
)
def process_task(event):
    result = slow_ml_inference(event)  # Takes ~2s per event
    emit(topic="results", data=result)
```
If processing can't keep up:
- Partition lag increases: Messages accumulate in Kafka
- Consumer stays healthy: It's still processing (just slowly)
- No data loss: Messages durably stored in Kafka
- Upstream visibility: Monitoring shows lag, trigger autoscaling
Compare to orchestrator: Agent B slow → orchestrator's queue grows → OOM crash → data loss.
KafScale advantage: Kafka's log is a shock absorber. Temporary slowdowns don't cause failures, they cause lag, which is observable and manageable.
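Lag itself is just arithmetic over offsets: for each partition, the log-end offset minus the group's committed offset. A hypothetical monitoring helper along those lines (function names and the threshold are ours, not a KafScale API):

```python
def partition_lag(end_offsets, committed_offsets):
    """Per-partition lag: messages written but not yet consumed."""
    return {
        partition: end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in end_offsets
    }

def should_scale_up(lag_by_partition, threshold=1000):
    """Trigger autoscaling when total lag crosses a threshold."""
    return sum(lag_by_partition.values()) > threshold

lag = partition_lag(
    end_offsets={0: 5200, 1: 4800},
    committed_offsets={0: 4000, 1: 4500},
)
print(lag)                   # {0: 1200, 1: 300}
print(should_scale_up(lag))  # True: total lag of 1500 exceeds 1000
```

In practice the offsets would come from the broker's consumer-group metadata; the point is that backpressure becomes a number you can alert and autoscale on.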
Pattern 3: Exactly-Once Semantics
Agent failures create duplicate processing risks. Consider fraud detection:
- Agent consumes transaction event
- Calls external API (risk score)
- Emits fraud decision to Kafka
- Agent crashes before committing Kafka offset
On restart, agent reprocesses the transaction → duplicate risk score API call → duplicate fraud decision.
KafScale uses Kafka transactions for processing:
```python
@kafscale.transactional_consumer(
    topic="transactions.pending",
    output_topic="fraud.decisions",
)
def detect_fraud(event):
    risk_score = call_risk_api(event["transaction"])  # External call
    decision = {
        "transaction_id": event["transaction"]["id"],
        "risk_score": risk_score,
        "decision": "block" if risk_score > 0.8 else "allow",
    }
    # Kafka commits offset + output message atomically
    return decision
```
Kafka guarantees:
- If the agent crashes after the API call but before the Kafka commit → offset not committed → event reprocessed → duplicate API call (unavoidable, since external calls sit outside the transaction)
- If the agent crashes after the Kafka commit → offset committed → event not reprocessed → no duplicate output
This is the idempotent producer plus transactional offset commits. Downstream agents see each decision exactly once, even if the detector crashes mid-processing.
KafScale advantage: Orchestrators rarely provide exactly-once semantics across agent boundaries. Kafka does, natively.
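The guarantee is easiest to see in a small simulation: a crash before the atomic commit repeats the external side effect, but never duplicates the committed output. This is an illustrative model of the semantics, not real Kafka transaction code:

```python
def run_detector(events, committed_offset, outputs, api_calls,
                 crash_before_commit_at=None):
    """Process events from committed_offset; output + offset commit atomically."""
    for i in range(committed_offset, len(events)):
        api_calls.append(events[i])       # External side effect: not transactional
        decision = {"id": events[i], "decision": "allow"}
        if i == crash_before_commit_at:
            return committed_offset       # Crash: nothing committed for this event
        outputs.append(decision)          # Atomic: output and offset together
        committed_offset = i + 1
    return committed_offset

events, outputs, api_calls = ["tx1", "tx2"], [], []
# First run crashes after the API call for tx2, before committing
offset = run_detector(events, 0, outputs, api_calls, crash_before_commit_at=1)
# Restart resumes from the committed offset and reprocesses tx2
offset = run_detector(events, offset, outputs, api_calls)
print(api_calls)                    # ['tx1', 'tx2', 'tx2']: duplicate external call
print([o["id"] for o in outputs])   # ['tx1', 'tx2']: exactly-once output
```

The asymmetry in the printed results is the whole story: at-least-once for the external call, exactly-once for what downstream agents observe.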
Pattern 4: Dead Letter Queues for Failures
Some events can't be processed: malformed data, external API down, invalid business logic. KafScale handles this with dead letter queues:
```python
@kafscale.consumer(
    topic="orders.pending",
    dlq_topic="orders.failed",
    max_retries=3,
)
def process_order(event):
    try:
        validate_order(event)
        charge_payment(event["payment_method"])
        emit(topic="orders.completed", data=event)
    except PaymentDeclined as e:
        # Business-level failure → send to DLQ with context
        raise IrrecoverableError(f"Payment declined: {e}")
    except TemporaryError:
        # Transient failure → retry
        raise
```
Retry flow:
- Processing fails → KafScale retries (max 3 times)
- Still failing → event moved to the orders.failed DLQ
- Original topic continues processing (no blocking)
DLQ consumption:
```python
# Separate agent monitors the DLQ
@kafscale.consumer(topic="orders.failed")
def handle_failed_orders(event):
    alert_ops_team(event)
    log_to_datadog(event)
    # Manual recovery possible:
    # - Fix the data issue
    # - Re-emit to orders.pending
```
KafScale advantage: DLQs are just Kafka topics. Query them with SQL (ksqlDB), replay them after fixing issues, route to different agents based on error type.
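The retry-then-DLQ flow reduces to a small routing loop. A sketch of the semantics (the exception names mirror the example above; the rest is illustrative, not KafScale internals):

```python
class IrrecoverableError(Exception): ...
class TemporaryError(Exception): ...

def consume_with_dlq(event, handler, dlq, max_retries=3):
    """Retry transient failures; route permanent ones to the DLQ topic."""
    for attempt in range(1, max_retries + 1):
        try:
            return handler(event)
        except IrrecoverableError as e:
            dlq.append({"event": event, "error": str(e)})  # No retry
            return None
        except TemporaryError as e:
            if attempt == max_retries:                     # Retries exhausted
                dlq.append({"event": event, "error": str(e)})
                return None

dlq, attempts = [], []

def flaky_handler(event):
    attempts.append(event)
    raise TemporaryError("payment gateway timeout")

consume_with_dlq({"order": 42}, flaky_handler, dlq)
print(len(attempts))    # 3: retried up to max_retries
print(dlq[0]["error"])  # payment gateway timeout
```

Note the asymmetry: irrecoverable errors skip straight to the DLQ, while transient ones consume the retry budget first, which is exactly why the original topic never blocks.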
Pattern 5: Saga Pattern for Multi-Agent Workflows
Complex workflows require coordination: booking flights and hotels and rental cars, with rollback if any step fails.
KafScale implements Saga pattern through event choreography:
```
user.booked_trip
        ↓
flight.booking.requested → [Flight Agent] → flight.booking.confirmed
                                          ↘ flight.booking.failed
        ↓
hotel.booking.requested → [Hotel Agent] → hotel.booking.confirmed
                                        ↘ hotel.booking.failed
        ↓
trip.booking.completed
```
Compensation on failure:
```python
@kafscale.consumer(topic="hotel.booking.failed")
def handle_hotel_failure(event):
    # Hotel failed → cancel the already-confirmed flight
    emit(topic="flight.cancellation.requested", data={
        "booking_id": event["flight_booking_id"],
        "reason": "compensating_transaction",
    })
    emit(topic="trip.booking.failed", data={
        "trip_id": event["trip_id"],
        "failed_step": "hotel",
        "status": "rolled_back",
    })
```
No orchestrator tracking state. Each agent:
- Consumes events from its input topic
- Performs local work
- Emits success or failure events
- Other agents react to failures (compensation)
KafScale advantage: Saga state is the Kafka log itself. To debug a failed booking, query the log; every step is recorded. To replay, re-emit the triggering event.
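Because the saga's state lives in the log, "where did this trip end up?" is a fold over its events. A hypothetical reconstruction (the event shapes are ours, modeled on the topics above):

```python
def trip_status(events):
    """Fold a trip's event stream into its current saga state."""
    status = {"flight": "pending", "hotel": "pending", "trip": "in_progress"}
    for topic in events:
        if topic == "flight.booking.confirmed":
            status["flight"] = "confirmed"
        elif topic == "hotel.booking.failed":
            status["hotel"] = "failed"
        elif topic == "flight.cancellation.requested":
            status["flight"] = "cancelling"  # Compensation in progress
        elif topic == "trip.booking.failed":
            status["trip"] = "rolled_back"
    return status

# Event log for one failed booking, in arrival order
log = [
    "flight.booking.confirmed",
    "hotel.booking.failed",
    "flight.cancellation.requested",
    "trip.booking.failed",
]
print(trip_status(log))
# {'flight': 'cancelling', 'hotel': 'failed', 'trip': 'rolled_back'}
```

No agent ever held this state; it is derived on demand from the same events that drove the choreography.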
Pattern 6: Request-Reply Without Blocking
Sometimes agents need synchronous-feeling interactions (ask Agent A a question, wait for reply). KafScale provides non-blocking request-reply:
```python
# Requester agent
@kafscale.consumer(topic="user.questions")
def ask_knowledge_agent(event):
    question = event["question"]
    correlation_id = str(uuid.uuid4())
    emit(topic="knowledge.requests", data={
        "correlation_id": correlation_id,
        "question": question,
        "reply_to": "user.answers",
    })
    # Don't block; continue processing other events.
    # The reply will arrive asynchronously on user.answers.

# Knowledge agent
@kafscale.consumer(topic="knowledge.requests")
def answer_question(event):
    answer = query_knowledge_base(event["question"])
    emit(topic=event["reply_to"], data={
        "correlation_id": event["correlation_id"],
        "answer": answer,
    })

# Requester consumes the reply
@kafscale.consumer(topic="user.answers")
def handle_reply(event):
    # Match correlation_id to the original request (local state store)
    original_request = state_store.get(event["correlation_id"])
    send_to_user(original_request["user_id"], event["answer"])
```
This is asynchronous request-reply: no blocking, no timeouts, full auditability.
KafScale advantage: Unlike HTTP request-reply, Kafka-based async messaging survives agent restarts. If the knowledge agent crashes mid-processing, it restarts and continues—no lost requests.
Pattern 7: Event Replay for Time Travel
Production issues often require "what if" analysis: what would the fraud detector have decided if we used the new model yesterday?
KafScale enables event replay:
```python
# Original fraud detector (v1 model)
@kafscale.consumer(topic="transactions", group="fraud-detector-v1")
def detect_fraud_v1(event): ...

# New fraud detector (v2 model, different consumer group)
@kafscale.consumer(topic="transactions", group="fraud-detector-v2")
def detect_fraud_v2(event): ...
```
To test v2 against yesterday's traffic:
- Deploy v2 with a new consumer group
- Set its offset to yesterday's starting point
- v2 replays events, emits decisions to the fraud.decisions.v2 topic
- Compare fraud.decisions.v1 vs fraud.decisions.v2
No data regeneration needed: Kafka retains the log (retention is configurable).
KafScale advantage: A/B test agent behavior, debug production issues, validate model updates, all without reprocessing live traffic.
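Once both versions have replayed the same retained log, comparing them is a join on transaction id over their output topics. A sketch of that comparison step (the models, thresholds, and event shapes are illustrative):

```python
def detect_v1(tx):
    """Old model: flat risk threshold."""
    return "block" if tx["risk"] > 0.8 else "allow"

def detect_v2(tx):
    """New model: stricter threshold for high-value transactions."""
    threshold = 0.6 if tx["amount"] > 1000 else 0.8
    return "block" if tx["risk"] > threshold else "allow"

def compare_decisions(transactions):
    """Replay the same events through both versions and diff the outcomes."""
    return {
        tx["id"]: (detect_v1(tx), detect_v2(tx))
        for tx in transactions
        if detect_v1(tx) != detect_v2(tx)
    }

replayed = [
    {"id": "t1", "risk": 0.7, "amount": 5000},  # v1 allows, v2 blocks
    {"id": "t2", "risk": 0.9, "amount": 50},    # Both block
]
print(compare_decisions(replayed))  # {'t1': ('allow', 'block')}
```

The diff is the deliverable: a concrete list of transactions where the model change would have altered the decision, computed entirely from replayed history.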
Putting It Together: Production Architecture
Here's a full KafScale multi-agent system handling e-commerce orders:
```
order.received (Kafka topic)
        ↓
[Inventory Agent] ──> inventory.checked
        ↓
[Fraud Agent] ──> fraud.cleared / fraud.flagged
        ↓
[Payment Agent] ──> payment.completed / payment.failed
        ↓
[Fulfillment Agent] ──> shipment.created
        ↓
[Notification Agent] ──> notification.sent
        ↓
order.completed (Kafka topic)
```
Operational characteristics:
- Each agent scales independently (consumer groups)
- Failures don't cascade (DLQs + retries)
- Full audit trail (every state transition in Kafka)
- No orchestrator to fail (choreography via topics)
- Replay for debugging (reprocess events from any timestamp)
Compare to orchestrator-based systems:
- Orchestrator becomes bottleneck at scale
- Orchestrator crash = system down
- No native replay (state in memory/database)
- Centralized retry logic (complex, brittle)
Why Kafka-First Wins
The patterns above aren't hacks; they're Kafka's design intent:
- Durability: Logs survive failures, enable replay
- Partitioning: Horizontal scaling without coordination
- Consumer groups: Built-in load balancing
- Transactions: Exactly-once guarantees across agents
- Decoupling: Agents don't know about each other
KafScale doesn't fight these primitives. The result: multi-agent systems that scale, survive failures, and remain debuggable in production.
From Theory to Practice
This series covered:
- Part 1: Cognitive Lenses (Interpreter/Exposer/Enhancer patterns)
- Part 2: Agent Abstraction (portable components, framework independence)
- Part 3: Kafka-First Patterns (backpressure, retries, coordination)
The common theme: Kafka isn't just a message bus anymore; it's the operational backbone for multi-agent systems. By aligning agent design with Kafka's strengths, KafScale enables production-grade deployments without orchestrator complexity.
Ready to build Kafka-first agents? Start with KafScale's starter templates: github.com/KafScale/onboarding-tutorials
About KafScale: Production-grade multi-agent systems on Kafka. Stateless agents, durable logs, zero orchestrator overhead. Learn more at kafscale.io
The full series
Part 1 - How KafScale transforms raw Kafka data into agent-ready context: KafClaw Agents as Cognitive Lenses
Part 2 - The Agent Abstraction Problem: Building portable components that survive framework churn
About Scalytics
Our founding team created Apache Wayang (now an Apache Top-Level Project), the federated execution framework that orchestrates Spark, Flink, and TensorFlow where data lives and reduces ETL movement overhead.
We also invented and actively maintain KafScale (S3-Kafka-streaming platform), a Kafka-compatible, stateless data and large object streaming system designed for Kubernetes and object storage backends. Elastic compute. No broker babysitting. No lock-in.
Our mission: Data stays in place. Compute comes to you. From data lakehouses to private AI deployment and distributed ML, all designed for security, compliance, and production resilience.
Questions? Join our open Slack community or schedule a consult.
