Kafka backup, disaster recovery, point-in-time restore

One spine.
Every Kafka.
Mirrored to S3.

kaf-mirror live-syncs Confluent, Apache Kafka, Redpanda, or MSK to a KafScale spine on S3. Immutable. Point-in-time recoverable. Apache 2.0 on both ends. Production stays untouched. The same infrastructure becomes your AI agent platform.

[live mirror · prod → bdr]

  Production (Confluent or any Kafka · source-agnostic) · untouched · real-time SLA preserved
    → kaf-mirror · live sync · franz-go · idempotent · regex map · 112 ms replication lag
    → KafScale · BDR spine · stateless brokers · .kfs to S3 · 412,803 segments
    → S3 · object lock · immutable · 11 nines durability

  Events mirrored: 84,217,000 · PITR window: unbounded · License: Apache 2.0

Replication is not backup

If you can delete prod with one command, your DR cluster deletes with it.

Cluster Linking, MirrorMaker 2, and MSK Replicator are excellent at one thing: keeping a hot standby in sync with production. None of them protect against a misfired `kafka-topics --delete`, a ransomware attack that walks through the producer credentials, or a retention misconfiguration that drops three weeks of audit data overnight. The deletion replicates. The corruption replicates. The compromised credentials replicate. A Kafka backup and disaster recovery strategy needs something replication cannot provide: an immutable, point-in-time, off-vendor copy of the log.

The three failure modes

Where hot replication quietly fails you.

Every production Kafka incident postmortem we have read in the last two years sits in one of three buckets. None of them are solved by the standby cluster you already pay for. They are solved by an immutable, time-indexed copy of the log that lives outside the production blast radius.

01 · Operator error

The accidental delete propagates.

An engineer runs `kafka-topics --delete` against production thinking it is staging. A `terraform apply` removes 14 topics. MirrorMaker 2 and Cluster Linking do exactly what they were configured to do: mirror the deletion to the DR cluster. Now you have two copies of nothing. KIP-382 itself notes MirrorMaker is "insufficient for many use cases, including backup, disaster recovery, and fail-over scenarios."

02 · Ransomware in the stream

Credentials get compromised.

A producer key leaks. An attacker writes malformed events into a payments topic, or worse, deletes consumer offsets and resets retention to one hour. The standby mirrors the writes. Replication is not isolation. The only defense is a copy that the production credentials cannot reach, stored under different keys, under retention rules production cannot modify.

03 · No rewind

Replication shows current state. Backup shows history.

A bug ships at 09:14 and writes corrupt events for 47 minutes before someone notices. You need the log as of 09:13. A hot standby gives you the log as of right now, with the same corruption. Point-in-time recovery requires immutable segments indexed by timestamp, which is exactly what KafScale .kfs segments on S3 with object lock provide.

The architecture

Production untouched. BDR on S3 economics.

The pattern uses two open-source components Scalytics already maintains. kaf-mirror is a high-performance franz-go replicator with an enterprise dashboard, AI anomaly detection, and built-in compliance reporting. KafScale is a stateless, Kafka-protocol-compatible streaming spine that writes immutable .kfs segments to S3. Together they form a backup-DR target that does not require a second production-sized cluster, does not lock you to a vendor, and does not double your storage bill.

  Source cluster (production · untouched): Confluent / Apache / MSK / Redpanda · real-time SLA preserved
    → Replication engine (kaf-mirror · live sync): franz-go · idempotent producer · same-name or regex mapping · dashboard · AI anomalies · audit
    → BDR target (KafScale + S3 object lock): stateless brokers · etcd metadata · .kfs segments · immutable · 11 nines durability

Point-in-time recovery works because every .kfs segment carries its commit timestamp in object metadata. To rewind a topic, KafScale lists segments under the topic prefix, filters to those that closed before the target timestamp, and exposes them as a recovered topic that consumers can replay. The procedure for the operator is one command: kafscale-cli restore --topic orders --to "2026-05-13T14:23:00Z". The data path is the same Kafka wire protocol your consumers already speak.
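
A minimal Go sketch of that segment-selection step, assuming an AWS S3 bucket, a topics/<name>/ prefix layout, and a commit-ts metadata key. The real key name and prefix layout come from the .kfs specification, not from this sketch.

// segment_select.go · hypothetical sketch of PITR segment selection,
// not the actual KafScale implementation.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	ctx := context.Background()
	// Restore target: the moment just before the bad deploy.
	target, err := time.Parse(time.RFC3339, "2026-05-13T14:23:00Z")
	if err != nil {
		log.Fatal(err)
	}

	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)

	bucket := "kafscale-bdr-segments" // hypothetical bucket name
	pager := s3.NewListObjectsV2Paginator(client, &s3.ListObjectsV2Input{
		Bucket: aws.String(bucket),
		Prefix: aws.String("topics/orders/"), // assumed prefix layout
	})
	for pager.HasMorePages() {
		page, err := pager.NextPage(ctx)
		if err != nil {
			log.Fatal(err)
		}
		for _, obj := range page.Contents {
			// Each segment carries its commit timestamp in object
			// metadata; HeadObject reads it without fetching the body.
			head, err := client.HeadObject(ctx, &s3.HeadObjectInput{
				Bucket: aws.String(bucket),
				Key:    obj.Key,
			})
			if err != nil {
				log.Fatal(err)
			}
			ts, err := time.Parse(time.RFC3339, head.Metadata["commit-ts"]) // assumed key
			if err != nil {
				continue // skip segments without a parseable timestamp
			}
			// Keep only segments that closed before the restore target.
			if ts.Before(target) {
				fmt.Println("recover:", *obj.Key)
			}
		}
	}
}
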
Three sources · one BDR target

Connect any Kafka. The mirror config is six lines.

kaf-mirror reads its base configuration from configs/default.yml and stores runtime job state in SQLite. The schema is identical for every Kafka-protocol source. What changes between Confluent, Apache, and Redpanda is the bootstrap address and the SASL mechanism. The target block points at the KafScale proxy DNS endpoint. Once both clusters are configured, replication jobs are created interactively with mirror-cli jobs add, or via the REST API. The actual schema below comes from the upstream scalytics/kaf-mirror repository.
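
For automation, the same job can be created over the REST API. The call below is only a hypothetical shape inferred from the CLI flags; the actual endpoint path and payload fields are defined in the upstream scalytics/kaf-mirror repository.

# Hypothetical REST equivalent of `mirror-cli jobs add`; endpoint path
# and JSON field names are assumptions, not the documented API.
curl -X POST "https://kaf-mirror.internal/api/v1/jobs" \
  -H "Authorization: Bearer ${MIRROR_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"source":"confluent-prod","target":"kafscale-bdr","topics":["orders.*","payments.*"],"mapping":"same-name"}'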

Confluent Cloud → KafScale BDR

SASL/PLAIN · API key auth · TLS required
configs/default.yml
# kaf-mirror · Confluent Cloud → KafScale BDR target
# Confluent issues SASL/PLAIN API keys. Confluent egress is
# charged per GB; increase fetch_max_bytes and linger_ms to
# amortize. Production traffic on Confluent is unaffected.
 
source_cluster:
  name: "confluent-prod"
  brokers: "pkc-xxxxx.eu-west-1.aws.confluent.cloud:9092"
  security:
    enabled: true
    sasl_mechanism: "PLAIN"
    sasl_username: "${CC_API_KEY}"
    sasl_password: "${CC_API_SECRET}"
    tls:
      enabled: true
 
target_cluster:
  name: "kafscale-bdr"
  brokers: "kafscale-proxy.bdr.svc.cluster.local:9092"
  security:
    enabled: true
    sasl_mechanism: "SCRAM-SHA-256"
    sasl_username: "${KAFSCALE_USER}"
    sasl_password: "${KAFSCALE_PASS}"
 
# Tune for Confluent egress economics
consumer:
  fetch_max_bytes: 52428800     # 50 MB batches
  auto_offset_reset: "earliest"
 
producer:
  idempotent: true
  acks: "all"
  compression: "zstd"
  linger_ms: 50
create the replication job
# Job creation via mirror-cli (interactive)
./mirror-cli login admin
./mirror-cli jobs add \
  --source confluent-prod \
  --target kafscale-bdr \
  --topics "orders.*,payments.*,audit.*" \
  --exclude "_confluent.*,__.*" \
  --mapping same-name

Apache Kafka (self-managed) → KafScale BDR

SASL/SCRAM-SHA-512 · mTLS · on-prem or self-hosted
configs/default.yml
# kaf-mirror · Apache Kafka → KafScale BDR target
# Self-managed Apache Kafka with SCRAM-SHA-512 and TLS.
# Works identically against Strimzi, AKHQ-managed clusters,
# and Confluent Platform (the self-hosted product).
 
source_cluster:
  name: "prod-apache-kafka"
  brokers: "kafka-prod-1.internal:9093,kafka-prod-2.internal:9093,kafka-prod-3.internal:9093"
  security:
    enabled: true
    sasl_mechanism: "SCRAM-SHA-512"
    sasl_username: "${KAFKA_USER}"
    sasl_password: "${KAFKA_PASS}"
    tls:
      enabled: true
      ca_cert: "/etc/kaf-mirror/ca.pem"
      verify_hostname: true
 
target_cluster:
  name: "kafscale-bdr"
  brokers: "kafscale-proxy.bdr.svc.cluster.local:9092"
  security:
    enabled: true
    sasl_mechanism: "SCRAM-SHA-256"
    sasl_username: "${KAFSCALE_USER}"
    sasl_password: "${KAFSCALE_PASS}"
 
producer:
  idempotent: true
  acks: "all"
  compression: "zstd"
create the replication job
./mirror-cli login admin
./mirror-cli jobs add \
  --source prod-apache-kafka \
  --target kafscale-bdr \
  --topics "orders.*,payments.*,audit.*" \
  --exclude "*.tmp,*.debug" \
  --mapping same-name \
  --batch-size 50000 \
  --compression zstd
 
# Inspect job state at any time
./mirror-cli jobs status full

Redpanda → KafScale BDR (regex namespace)

SCRAM-SHA-256 · regex topic mapping
configs/default.yml
# kaf-mirror · Redpanda → KafScale BDR target
# Redpanda speaks the Kafka wire protocol natively. The only
# config delta from Apache is the bootstrap address pattern.
# Mapped topics are namespaced under bdr.* on the target.
 
source_cluster:
  name: "redpanda-prod"
  brokers: "seed-0.redpanda-prod.internal:9092,seed-1.redpanda-prod.internal:9092,seed-2.redpanda-prod.internal:9092"
  security:
    enabled: true
    sasl_mechanism: "SCRAM-SHA-256"
    sasl_username: "${RP_USER}"
    sasl_password: "${RP_PASS}"
    tls:
      enabled: true
 
target_cluster:
  name: "kafscale-bdr"
  brokers: "kafscale-proxy.bdr.svc.cluster.local:9092"
  security:
    enabled: true
    sasl_mechanism: "SCRAM-SHA-256"
    sasl_username: "${KAFSCALE_USER}"
    sasl_password: "${KAFSCALE_PASS}"
 
producer:
  idempotent: true
  acks: "all"
  compression: "zstd"
create the replication job (regex mapping)
./mirror-cli login admin
./mirror-cli jobs add \
  --source redpanda-prod \
  --target kafscale-bdr \
  --topics "events.*" \
  --mapping regex \
  --pattern "events\.(.*)" \
  --replacement "bdr.events.\$1"
 
# events.orders   on Redpanda
# bdr.events.orders   on KafScale
Cost reality · planning estimate

BDR insurance for the cost of an S3 bucket.

Reference workload: 3 TB per day ingest, 30-day retention (90 TB), one mid-size mirrored estate. Storage and replication line items only. Networking egress and support contracts excluded. Source: vendor public pricing and AWS public pricing as of April 2026. The point is the ratio, not the exact dollar figure. Your numbers will move with throughput and region, but the column ordering does not.

| BDR approach | Storage cost / month | PITR | Source lock-in | License |
|---|---|---|---|---|
| Confluent Cluster Linking (Enterprise → Enterprise) | ~$38,000 (doubles primary) | No | Confluent both ends | Proprietary |
| MSK Replicator (cross-region) | ~$22,500 + cross-region transfer | No | MSK both ends | Proprietary |
| Redpanda Shadowing + WCR | ~$18,000 (Enterprise license required) | Whole-cluster only | Redpanda source | Source available, Enterprise paywall |
| MirrorMaker 2 → second open Kafka cluster | ~$12,000 (second cluster + ops) | No | None, but offset drift | Apache 2.0 |
| kaf-mirror → KafScale on S3 (this page) | ~$2,100 (90 TB at $0.023/GB) | Yes, per-topic timestamp | None, any Kafka source | Apache 2.0 on both ends |
▸ Confluent figures derived from published eCKU pricing and the cluster-linking line item.
▸ MSK figures from the AWS MSK pricing page, kafka.m5.large × 3 brokers, 30-day retention.
▸ S3 storage at $0.023 per GB-month (us-east-1). API charges trivial at this scale.
▸ Compute for kaf-mirror and KafScale brokers is a small Kubernetes deployment, typically under $300/month.
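
The storage line for this pattern is plain arithmetic from the footnoted rate:

  90 TB retained × 1,024 GB/TB        = 92,160 GB
  92,160 GB × $0.023 per GB-month     ≈ $2,120/month storage
  + kaf-mirror / KafScale compute     ≈ $300/month
  total                               ≈ $2,400/month, against ~$12,000 for the cheapest second-cluster alternative
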
Two outcomes · one deployment

BDR today. Agent spine tomorrow.

The same kaf-mirror plus KafScale infrastructure you deploy for backup and disaster recovery is the substrate AI agent workloads actually need. Production Kafka serves real-time SLAs in single-digit milliseconds. Agents need months of history, replayable decision logs, and long-retention prompt traces. Those two access patterns cannot share a cluster without one starving the other. So the BDR copy already solves the second problem: it is durable, queryable, isolated from production, and built for retention measured in years instead of hours. One Kubernetes deployment. One S3 bucket. Two strategic capabilities.

use case · BDR

Disaster recovery and compliance.

The reason this pattern gets budget approval. Immutable S3 segments, per-topic point-in-time restore, DORA Article 12 and NIS2 Annex II evidence packs from the same dashboard. Production stays untouched, the BDR cluster has no shared credentials, and the recovery procedure is one documented command.

  • Per-topic PITR via segment timestamps
  • S3 object lock, separate keys
  • Compliance reports daily, weekly, monthly
  • Restore tested by scheduler, not on the day
shared infrastructure

Same operator. Same bucket. Same credentials boundary.

Both outcomes deploy on the same Kubernetes operator, write to the same S3 bucket under different prefixes, and run under the same separation-of-duties model. No second platform team. No second procurement cycle. The BDR investment buys the agent platform at zero marginal infrastructure cost.

  • One KafScale operator on Kubernetes
  • One S3 bucket, prefix-namespaced workloads (layout sketched after this list)
  • One etcd ensemble for metadata
  • One audit trail across both use cases
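
A hypothetical prefix layout for that single bucket; the actual prefixes are whatever your operator configuration names them:

s3://kafscale-bdr-segments/
  bdr/orders/...        # immutable .kfs segments · object lock · BDR retention
  bdr/payments/...
  agents/traces/...     # agent decision logs · years of retention
  agents/replays/...    # replayable agent runs · no object lock required
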
use case · AI agents

Replay, retention, reasoning over history.

Agent workloads cannot run on the same brokers that serve real-time. Replay over months of history burns broker disk, fights production retention policies, and exposes the cluster to slow-consumer failure modes. KafScale segments live in S3 and are read directly by agents, processors, and replay tools, bypassing brokers entirely. The same data that protects you also powers the experimentation.

  • Decision logs, tool calls, prompt history
  • Replay any agent run from any point
  • Years of retention on S3 economics
  • Zero broker contention with production

Why this matters now

Most platform teams are being asked to support AI agent workloads on infrastructure that was designed for real-time streaming. Putting agent replay traffic on a Confluent or Apache cluster that also serves sub-10ms production paths is how you get on-call pages at 2 AM. The KafScale BDR target is already isolated, already durable, already on S3. Pointing agent workloads at it requires no new procurement and no new architecture review. See KafClaw for the agent runtime that runs on top of this spine.

Honest comparison · May 2026

Where this pattern fits, and where others fit better.

The Kafka BDR market is small but real, and every option has a niche where it is the right answer. Kannika Armory is the most direct comparison: a mature, purpose-built commercial product with strong DORA and NIS2 positioning. OSO kafka-backup is the closest open-source equivalent, although it is a CLI tool rather than a platform. The hyperscaler-native and Kafka-vendor-native options each work well when you are already committed to the matching ecosystem. The table below is what an honest procurement evaluation looks like.

| Solution | License | Point-in-time recovery | Source flexibility | Compliance UI | Doubles as agent platform | Best for |
|---|---|---|---|---|---|---|
| Kannika Armory | Commercial (EU) | Yes, native | Confluent, Apache Kafka | Yes, mature | No | Regulated EU teams who want a single-purpose, sales-supported BDR product and accept closed source. |
| OSO kafka-backup | MIT (Rust CLI) | Yes, ms precision | Any Kafka | No, CLI only | No | Single-team backups, manual runbooks, no compliance reporting requirement. |
| Confluent Cluster Linking | Proprietary | No, replication only | Confluent both ends | Confluent Console | No | Confluent-to-Confluent active-passive failover for the existing customer base. |
| MSK Replicator | Proprietary | No, replication only | MSK both ends | AWS Console | No | Single-AWS-account, same-region or cross-region MSK replication. |
| MirrorMaker 2 | Apache 2.0 | No, KIP-382 acknowledges this | Any Kafka | No | No | Migration, geo-replication, and hot standby where offset drift is acceptable. |
| Redpanda Shadowing + WCR | BSL, Enterprise paywall | Whole-cluster only | Redpanda source | Redpanda Console | No | Redpanda-native teams already on the Enterprise tier. |
| Lenses Stream Reactor S3 | Apache 2.0 (connector) | No, batch sink | Any Kafka via Connect | No | No | Teams already operating a Kafka Connect cluster who want a raw archive to S3. |
| Scalytics: kaf-mirror + KafScale | Apache 2.0 both ends | Yes, per-topic timestamp | Any Kafka wire protocol | Yes, with AI anomalies | Yes, same infrastructure | Teams who want open source on both ends, source-agnostic BDR, and one deployment that also serves AI agent workloads. |
▸ Kannika Armory ships a polished commercial product with strong DORA and NIS2 framing. Where Scalytics differs: Apache 2.0 on both ends, source-available before procurement, and the BDR target is also the agent platform.
▸ OSO kafka-backup is the closest open-source equivalent on the backup side and is excellent at its scope. Where Scalytics differs: live mirror instead of batch, web dashboard, compliance reports, and an integrated streaming target.
▸ MirrorMaker 2 is the right tool for replication. KIP-382 itself states it is insufficient for backup and disaster recovery.
▸ Cluster Linking, MSK Replicator, Redpanda Shadowing are vendor-native and require the matching vendor on both ends.
Information current as of May 2026. Vendor product pages reviewed, public pricing referenced where available.
DORA · NIS2 · ISO 27001

Evidence regulators ask for. Built into the stack, not bolted on.

DORA Article 12 requires financial entities to maintain immutable backups with rigorously tested recovery procedures. NIS2 Annex II requires comparable controls for essential and important entities. Both frameworks require evidence: the audit trail, the retention proof, the recovery test. kaf-mirror generates compliance reports daily, weekly, or monthly with full audit trails. KafScale stores the actual log segments in S3 under object lock. The combination produces regulator-grade evidence without a separate audit pipeline.
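
Object lock is enforced at bucket creation, outside any Kafka credential. A minimal sketch with the AWS CLI, assuming a hypothetical bucket name and a one-year default retention:

# Object lock must be enabled when the bucket is created
aws s3api create-bucket \
  --bucket kafscale-bdr-segments \
  --region eu-west-1 \
  --create-bucket-configuration LocationConstraint=eu-west-1 \
  --object-lock-enabled-for-bucket

# Compliance mode: nobody, including the bucket owner, can shorten retention
aws s3api put-object-lock-configuration \
  --bucket kafscale-bdr-segments \
  --object-lock-configuration \
    '{"ObjectLockEnabled":"Enabled","Rule":{"DefaultRetention":{"Mode":"COMPLIANCE","Days":365}}}'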

DORA · Article 12

Backup and restoration of ICT systems

  • Article 12(1) · Immutable backups: S3 Object Lock in compliance mode prevents deletion and modification regardless of credentials. Retention rules enforced at the bucket policy layer.
  • Article 12(2) · Restoration procedure: per-topic point-in-time restore via a documented kafscale-cli command. Reproducible. Tested via the kaf-mirror compliance scheduler.
  • Article 12(3) · Segregation: BDR target runs on separate credentials, separate keys, separate cluster, separate bucket. Production credentials cannot reach BDR storage.
  • Article 12(4) · Testing: the kaf-mirror Compliance tab generates monthly restore-test reports with timestamps and recovered offsets.
NIS2 · Annex II

Cybersecurity risk-management measures

  • Annex II(1)(d) · Business continuity: air-gappable BDR target in a separate availability zone, region, or on-premises cluster. KafScale runs without internet egress.
  • Annex II(1)(e) · Supply chain security: both kaf-mirror and KafScale are Apache 2.0, source available, with no proprietary control plane that can revoke access or change pricing.
  • 24-hour incident notification: kaf-mirror anomaly detection surfaces replication lag spikes, schema drift, and credential failures as alertable signals before they become incidents.
  • Audit trail: role-based access (admin/operator/monitor/compliance). Every state-modifying action is logged with actor, timestamp, and outcome.
Architecture commitments

Red lines, in plain text.

A buyer running BDR for a regulated process reads architectural commitments more carefully than they read marketing claims. Each line below is enforced in the build pipeline, documented in the source, and visible at the protocol layer.

  • Production traffic is never affected. kaf-mirror consumes via a read-only credential. The producer to KafScale is a separate process. Production brokers see one more consumer, nothing else.
  • The BDR target is reachable only from kaf-mirror. Network policies, distinct credentials, distinct encryption keys, distinct S3 bucket. A compromised producer key on prod cannot reach the BDR copy.
  • S3 segments are immutable under object lock. Compliance mode prevents deletion until the retention clock expires. Not even the bucket owner can override.
  • Point-in-time recovery uses segment commit timestamps, not offset mapping. Unlike MirrorMaker, there is no offset translation table that drifts under retention pressure.
  • The .kfs segment format is part of the public specification. If kaf-mirror or KafScale ever disappear, the segments in your bucket remain readable by any conformant reader.
  • No vendor control plane. Both tools run on your Kubernetes, your S3, your network. There is no external service to revoke access or rate-limit operations.
  • The license is Apache 2.0 on both sides. No BSL conversion clauses, no per-GiB usage fees, no commercial-use restrictions. Now and after a hypothetical acquisition.
Honest limits

What this pattern is not.

The same architectural choices that make this BDR pattern cheap also make it the wrong choice for a small set of workloads. We name them here because the cost of finding out after deployment is high.

not this

Sub-10ms latency failover

For synchronous trading or fraud-detection paths that require RPO=0 and millisecond failover, run a stretch cluster with synchronous replication across availability zones. KafScale produce latency is 200 to 500ms because writes commit to S3 before acknowledgement.

not this

Confluent feature replacement

ksqlDB, Schema Registry mirroring, Stream Designer, Connect-managed connectors. KafScale is transport and storage only. If those features are why you run Confluent, this is a BDR pattern alongside Confluent, not a replacement for it. Scalytics is a Confluent partner.

not this

Active-active multi-region

Active-active with offset-preserving bi-directional replication is a different problem with different tradeoffs. WarpStream Orbit and Confluent Cluster Linking with mirror topics both target that pattern. This pattern is active-passive BDR with an immutable rewindable copy.

Frequently asked before a briefing

The questions your CISO and your VP of Engineering will both ask.

Answered here so the briefing can go to your actual constraints. Repository links resolve to the upstream Apache 2.0 source. Architecture references map to public documentation.

Does this replace our Confluent or Apache Kafka cluster?
No. The pattern is explicitly additive. kaf-mirror sits next to your production cluster as one more read-only consumer. KafScale is a parallel spine for the BDR copy. Production traffic, ksqlDB, Connect, Stream Governance, and every other feature you rely on continues to work exactly as before. The first deployment can be one mirrored topic in a single namespace; the production team is not required to change a single line of producer code.
How does point-in-time recovery actually work?
KafScale segment writes carry their commit timestamp in S3 object metadata. The kafscale-cli restore command takes a topic name and a target timestamp, lists segments under the topic prefix in S3, filters to those that closed before the target time, and exposes them as a recovered topic that standard Kafka consumers can read. The procedure works without a running source cluster, which is why it qualifies as a backup rather than a replica.
What happens during a long target-cluster outage?
kaf-mirror does not require a separate disk buffer. The internal franz-go producer has an in-memory buffer with idempotent retries for transient interruptions. If KafScale is unavailable for an extended period, the consumer applies backpressure and stops advancing the offset on the source cluster. The source cluster itself acts as the durable buffer through its existing retention. When KafScale returns, replication resumes from the last committed offset, no data lost. This is documented in the kaf-mirror Resilience section.
Can this run air-gapped?
Yes. kaf-mirror and KafScale both run without internet egress. Container images mirror to your registry. The only requirements are a reachable S3-compatible endpoint (MinIO, Ceph, or on-prem object store) and a reachable etcd ensemble. Both can sit inside your network boundary or inside a classified enclave. The data path makes no external calls.
What is the realistic ingestion ceiling?
A single kaf-mirror replication job handles up to roughly 250 MB/s sustained with zstd compression and a 50 MB fetch batch. Larger estates run multiple jobs in parallel, partitioned by topic prefix. KafScale itself scales horizontally with stateless broker pods, and S3 is effectively unbounded for the throughput tier most BDR workloads need. The bottleneck is almost always source-cluster egress, not the BDR side.
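Partitioning by prefix reuses the flags already shown above; a sketch, assuming the same cluster names as the Apache example:

# Two parallel jobs, split by topic prefix (hypothetical partitioning)
./mirror-cli jobs add --source prod-apache-kafka --target kafscale-bdr \
  --topics "orders.*,payments.*" --mapping same-name
./mirror-cli jobs add --source prod-apache-kafka --target kafscale-bdr \
  --topics "audit.*,events.*" --mapping same-name
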
What does Scalytics provide on top of the open source?
Architecture review for BDR and DR strategy, hardening for regulated environments, DORA and NIS2 evidence packs mapped to your specific topics and retention, integration with Apache Wayang for federated analytics over the BDR data, and 24/7 enterprise support contracts. The repositories remain Apache 2.0. The product roadmap is independent of any specific support engagement.
Will offsets match between source and BDR?
For PITR and disaster recovery, exact offset preservation across vendors is not the design goal. kaf-mirror preserves order within a partition, preserves message keys, and uses idempotent production to keep duplicates rare. The recovered topic on KafScale has its own offset space, which is the expected behavior for a backup target. If your workload requires offset-preserving replication for active-active or hot failover, that is a different pattern with different tradeoffs and we will say so on the briefing call.
How long does a full pilot take?
Two weeks for a working pilot in a non-production environment, including kaf-mirror deployment, one KafScale cluster on your Kubernetes, three to five topics mirrored, and one restore-test report generated. Four to six weeks for a production rollout with compliance evidence aligned to your specific DORA or NIS2 control set. The architecture sprint covers both phases.
Next step

Replace the standby cluster with an immutable copy.

Forty-five minutes. Architecture review, BDR pattern mapped to your topology, kaf-mirror configuration for your specific source, compliance evidence pack for DORA or NIS2. Bring the failure scenario that keeps your on-call team awake.