47 GitHub Issues That Might Explain Your Iceberg Latency

Alexander Alten

Search GitHub for "Iceberg sink" and you'll find 344 open issues in the main repository alone. Add "Kafka Connect Iceberg" and you get another 89. "Flink Iceberg checkpoint" brings up 127 more. "Hudi streaming" pulls up over 1,200.

We read through them.

Not all of them. But enough to see the same failures repeating across organizations, cloud providers, and stack combinations. Silent commits. Orphaned files. Metadata that vanishes on restart. Latency that climbs from seconds to 30 minutes with no config change.

These 47 tell the story.

The Integration Tax Nobody Talks About

Every enterprise running Kafka eventually wants the same thing. Get streaming data into queryable tables. Iceberg won the format war. Confluent, Databricks, and Snowflake all support it. The architecture diagrams look clean.

Then you try to build it.

A 1 GiB/s streaming pipeline writing to Iceberg through connectors can cost $3.4 million annually in duplicate storage and transfer fees. That's before you run your first query. The data gets written to Kafka, copied to a connector, transformed, written to object storage, then registered in a catalog. Four hops. Four failure points. Four cost centers.

This isn't a vendor problem. It's an architecture problem.

[Figure: The Integration Tax. Why a 1 GiB/s pipeline costs $3.4M annually. Standard architecture, 4 hops: Kafka broker (storage cost, silent failures) → Connect cluster (compute cost, small files) → transform job (state cost) → Iceberg on S3 (ops cost). Storage-native with KafScale: direct stream (zero copy) → Iceberg table (1x storage).]

What Actually Breaks in Production

We're not speculating here. These are documented issues from teams running Kafka-to-Iceberg pipelines at scale.

Kafka Connect Silent Failures and Offset Hell

The Iceberg Kafka Connect Sink looks straightforward in demos. In production, teams discover the edge cases.

Silent coordinator failures. The connector fails silently when there's a mismatch between consumer group IDs. No coordinator gets elected. No commits happen. No error messages. Data flows in, nothing comes out. You find out from your analysts three hours later.

Dual offset tracking. Offsets are stored in two different consumer groups. One managed by the sink, one by Kafka Connect itself. Resetting offsets means resetting both. Miss one and you're either duplicating data or losing it.

Schema evolution crashes. Drop a column and recreate it with a different type. Something that happens in any evolving system. The sink crashes trying to write new data to the old column definition.

Version incompatibilities. The Avro converter wants version 1.11.4 but Iceberg 1.8.1 ships with 1.12.0. ClassNotFoundException at startup. Hope you like dependency debugging.

Timeout storms. Under load you'll see TimeoutException: Timeout expired after 60000ms while awaiting TxnOffsetCommitHandler. The task gets killed and won't recover without manual intervention.
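
If you ever have to do that dual offset reset by hand, the sketch below shows the general shape using Kafka's AdminClient. Both group names are illustrative, so check your Connect worker config and the sink's control-group setting for the real ones, and stop the connector first: offsets can only be altered while a group has no active members.

// Sketch: reset the same partition to the same offset in both consumer groups
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ResetBothGroups {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");

        // Illustrative names: one group is managed by Kafka Connect itself,
        // the other by the sink's commit coordinator.
        List<String> groups = List.of("connect-iceberg-sink", "cg-control-iceberg-sink");

        TopicPartition tp = new TopicPartition("events", 0);
        long target = 1_000_000L;  // the offset you actually want to resume from

        try (Admin admin = Admin.create(props)) {
            for (String group : groups) {
                Map<TopicPartition, OffsetAndMetadata> current =
                    admin.listConsumerGroupOffsets(group)
                         .partitionsToOffsetAndMetadata().get();
                System.out.printf("%s currently at %s%n", group, current.get(tp));

                // Reset the partition in *both* groups. Missing one is how you
                // end up with duplicates or gaps after the connector restarts.
                admin.alterConsumerGroupOffsets(group,
                    Map.of(tp, new OffsetAndMetadata(target))).all().get();
            }
        }
    }
}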

Flink and Iceberg Small File Apocalypse

Flink is the standard answer for streaming to Iceberg. It's also where teams learn about the small file problem.

Thousands of KB-sized files. Streaming jobs with frequent checkpoints create tiny files. Sometimes in the kilobyte range. Query performance collapses. Metadata overhead explodes. Your "real-time" pipeline becomes slower than a nightly batch job.

Compaction conflicts. Run compaction on the same partition your streaming job is writing to. You get write failures or data corruption from Iceberg's optimistic concurrency control. The standard workaround is partition-aware compaction that only touches cold partitions. That requires custom scheduling logic.

Checkpoints don't mean commits. Files appear in S3. Checkpoints complete successfully. But the metadata file doesn't update. Your data exists but is invisible to queries. This happens intermittently across multiple tables with no clear pattern.

Recovery deletes your metadata. Restart a failed job from checkpoint and you might get FileNotFoundException. The metadata files were cleaned up during a previous commit cycle. Your checkpoint references files that no longer exist.

Tencent's internal solution. Their team built a custom DataCompactionOperator that runs alongside the Flink Sink because the out-of-box behavior "will be invalid" for production streaming workloads. That's a major engineering org admitting the default path doesn't work.
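
That partition-aware workaround usually looks something like the sketch below: run Iceberg's rewriteDataFiles action on a schedule, but filter it to partitions the streaming writer has already moved past. This uses Iceberg's Spark actions API; the catalog name, table name, and the event_date partition column are assumptions, not anything the issues prescribe.

// Sketch: compact only cold partitions to avoid conflicts with the live writer
import java.time.LocalDate;

import org.apache.iceberg.Table;
import org.apache.iceberg.actions.RewriteDataFiles;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.spark.Spark3Util;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class ColdPartitionCompaction {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
            .appName("cold-partition-compaction")
            .getOrCreate();

        // Assumes a Spark catalog named "lake" is already configured for Iceberg.
        Table table = Spark3Util.loadIcebergTable(spark, "lake.analytics.streaming_events");

        // Everything older than yesterday is considered cold. Restricting the
        // rewrite this way keeps compaction commits from colliding with the
        // streaming job under Iceberg's optimistic concurrency control.
        String cutoff = LocalDate.now().minusDays(1).toString();

        RewriteDataFiles.Result result = SparkActions.get(spark)
            .rewriteDataFiles(table)
            .filter(Expressions.lessThan("event_date", cutoff))
            .option("target-file-size-bytes", String.valueOf(512L * 1024 * 1024))
            .execute();

        System.out.printf("Rewrote %d data files into %d%n",
            result.rewrittenDataFilesCount(), result.addedDataFilesCount());
    }
}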

Hudi Latency Cliff and Log File Growth

Hudi was built for incremental processing. Teams using it for streaming discover the operational overhead.

30 minute latency in production. A Kafka to Spark Structured Streaming to Hudi pipeline on AWS showed latency growing to 30 minutes. The team ran the same pipeline without Hudi and wrote directly to S3. Latency stayed under 2 minutes.

Version upgrades break ingestion. Upgrading from 0.12.1 to 0.13.0 caused ingestion to fail on the second micro-batch. Reverting fixed it. Every upgrade becomes a risk assessment.

Log files grow indefinitely. MOR tables accumulate log files that compaction doesn't clean up. The growing file count slows compaction further. That increases batch wait times and cascades into data latency. Labeled "priority:critical production down."

Performance collapse at high partition counts. A table with 20,000 partitions ingesting 5,000 records per second produced 30+ minute lag even with 20 Glue workers. The only fix that helped was reducing the partition count. Not always an option.

Connection pool exhaustion. Multiple streaming queries in one Spark job exhaust HTTP connections when metadata is enabled. Setting the pool to 1,000 connections "sometimes helps and sometimes does not."
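
For orientation, most of the knobs behind these issues live in the Spark writer options. Below is a minimal sketch of a Kafka-to-Hudi structured streaming job on a merge-on-read table; the field names, paths, and compaction settings are illustrative, and the option keys should be checked against the Hudi version you actually run.

// Sketch: Kafka -> Spark Structured Streaming -> Hudi MOR table
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.Trigger;

public class KafkaToHudi {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("kafka-to-hudi").getOrCreate();

        Dataset<Row> events = spark.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "events")
            .load();
        // ...deserialize the Kafka value column into event_id, ts, event_date, etc...

        StreamingQuery query = events.writeStream()
            .format("hudi")
            .option("hoodie.table.name", "events_mor")
            .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
            .option("hoodie.datasource.write.recordkey.field", "event_id")
            .option("hoodie.datasource.write.precombine.field", "ts")
            .option("hoodie.datasource.write.partitionpath.field", "event_date")
            // These two decide when log files get folded back into base files.
            // Leave them unwatched and MOR log files are exactly what piles up.
            .option("hoodie.compact.inline", "true")
            .option("hoodie.compact.inline.max.delta.commits", "5")
            .option("checkpointLocation", "s3://bucket/checkpoints/events_mor")
            .trigger(Trigger.ProcessingTime("1 minute"))
            .start("s3://bucket/hudi/events_mor");

        query.awaitTermination();
    }
}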

The Operational Reality

These aren't edge cases. They're Tuesday.

Running Flink in production requires significant DevOps expertise for deploying, upgrading, scaling, monitoring, and debugging. The total cost of ownership including operational overhead plus specialized engineering teams is higher than the infrastructure cost alone.

You're not just running a connector. You're running a distributed system that needs care and feeding.

The Approaches That Exist Today

Let's be clear about the options and what they actually provide.

Kafka Connect and Iceberg Sink (Open Source)

What it is: Community connector that reads from Kafka topics and writes Parquet files to Iceberg tables.

When it works: Lower throughput append-only workloads where occasional restart and offset reset is acceptable. Teams with Kafka Connect expertise already on staff.

The tradeoffs: Requires running and monitoring a Connect cluster. Exactly-once semantics require KIP-447 and Kafka 2.5+. Schema evolution needs careful handling. Debugging is non-trivial.

Apache Flink and Iceberg Sink (Open Source)

What it is: Streaming engine with native Iceberg integration for both batch and streaming writes.

When it works: Complex transformations before landing in Iceberg. Teams already running Flink for other workloads. Use cases requiring exactly-once guarantees with proper checkpoint configuration.

The tradeoffs: Operating Flink is a discipline unto itself. Small file problem requires compaction strategy. Checkpoint-to-commit semantics need deep understanding. Expect to build operational tooling.
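
The mental model worth internalizing: Iceberg's Flink sink stages data files between checkpoints and only commits them to the table when a checkpoint completes, so the checkpoint interval is simultaneously your freshness bound and your file-size knob. A minimal sketch, with the Kafka source wiring and table location left as placeholders:

// Sketch: checkpoint interval == Iceberg commit interval
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;

public class EventsToIceberg {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The sink collects data files between checkpoints and commits them when
        // the checkpoint completes: shorter interval = fresher table + smaller files.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        DataStream<RowData> events = buildEventStream(env);  // Kafka source omitted

        // Placeholder location; TableLoader can also load tables from a catalog.
        TableLoader tableLoader =
            TableLoader.fromHadoopTable("s3a://lakehouse/warehouse/analytics/streaming_events");

        FlinkSink.forRowData(events)
            .tableLoader(tableLoader)
            .append();

        env.execute("events-to-iceberg");
    }

    private static DataStream<RowData> buildEventStream(StreamExecutionEnvironment env) {
        // Kafka source and RowData conversion intentionally omitted from this sketch.
        throw new UnsupportedOperationException("wire up your Kafka source here");
    }
}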

Confluent Tableflow (Managed)

What it is: Confluent's managed service that represents Kafka topics as Iceberg tables without running connectors.

When it works: Confluent Cloud customers who want push-button Iceberg tables. Workloads where managed simplicity outweighs cost. Integration with supported catalogs like AWS Glue and Snowflake Open Catalog.

The constraints: Topics without a schema are not supported. Confluent managed storage doesn't sync to external catalogs. Flink on Confluent Cloud can't query the Iceberg tables directly. Upsert mode has a 30 billion unique key limit, and additional charges are coming in 2026.

Storage-Native Architectures (Emerging)

What they are: Systems that write directly to object storage in formats that can be exposed as both streaming logs and Iceberg tables without intermediate copying.

When they work: High throughput pipelines where the 4x storage multiplication of traditional architectures becomes cost prohibitive. Teams wanting to eliminate the connector layer entirely.

The tradeoffs: Newer ecosystem. Fewer managed options. Requires understanding of both streaming and lakehouse semantics.

Storage-Native Architecture with KafScale

Here's the pattern that sidesteps the connector problem entirely. Write streaming data directly to object storage in a format that analytical tools can read without broker involvement.

KafScale takes this approach. Kafka-compatible streaming with stateless brokers and S3-native storage. Apache 2.0 licensed. Developed with support from Scalytics, Inc. and NovaTechflow.

The architecture separates concerns completely. Brokers handle streaming writes. S3 holds the data. Processors read directly from S3 to build Iceberg tables. Streaming and analytics never compete for the same resources.

The Iceberg Processor

The KafScale Iceberg Processor reads .kfs segments directly from S3, bypasses brokers entirely, converts data to Parquet, and writes to Iceberg tables. It works with Unity Catalog, Apache Polaris, and AWS Glue.

Zero broker load for analytical workloads.

The processor format is documented and public. Teams can build custom processors without waiting for vendors to ship features.

# KafScale Iceberg Processor configuration
apiVersion: kafscale.io/v1
kind: IcebergProcessor
metadata:
  name: events-to-iceberg
spec:
  source:
    topics:
      - events
      - transactions
    s3:
      bucket: kafscale-data
      prefix: segments/
  sink:
    catalog:
      type: unity
      endpoint: https://<workspace>.cloud.databricks.com/api/2.1/unity-catalog/iceberg
      warehouse: /Volumes/main/default/warehouse
      credentials:
        secretRef: unity-catalog-credentials
    database: analytics
    tablePrefix: streaming_
  processing:
    parallelism: 4
    commitIntervalSeconds: 60

The processor queries etcd for topic and partition metadata, reads segments from S3, converts to Parquet, and commits to the Iceberg catalog. No Kafka Connect cluster. No Flink job. No broker resources consumed.
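
Since the format is public, a custom processor is conceptually a small loop: list unprocessed segments, convert each to Parquet, register the resulting files with the catalog. Here is a rough sketch of that loop against the Iceberg Java API, where SegmentReader and its helpers are hypothetical stand-ins for the .kfs handling rather than a shipped library.

// Sketch: custom processor loop -- list segments, write Parquet, commit to Iceberg
import java.util.List;
import java.util.Map;

import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.rest.RESTCatalog;

public class CustomIcebergProcessor {
    public static void main(String[] args) {
        // Any Iceberg REST catalog works here (Polaris, Unity, ...).
        RESTCatalog catalog = new RESTCatalog();
        catalog.initialize("lake", Map.of(
            "uri", "https://polaris.internal:8181/api/catalog",
            "warehouse", "s3://lakehouse/warehouse"));

        Table table = catalog.loadTable(TableIdentifier.of("analytics", "streaming_events"));

        // Hypothetical: lists .kfs segments not yet processed (metadata from etcd).
        List<String> segments = SegmentReader.listUnprocessed("kafscale-data", "segments/events/");

        for (String segment : segments) {
            // Hypothetical: decodes a segment and writes a Parquet file, returning
            // its location, byte size, and record count.
            SegmentReader.ParquetResult parquet = SegmentReader.toParquet(segment, table.schema());

            DataFile dataFile = DataFiles.builder(PartitionSpec.unpartitioned())  // unpartitioned for brevity
                .withPath(parquet.path())
                .withFileSizeInBytes(parquet.sizeBytes())
                .withRecordCount(parquet.recordCount())
                .withFormat(FileFormat.PARQUET)
                .build();

            // One atomic Iceberg commit per segment; batching commits is the obvious next step.
            table.newAppend().appendFile(dataFile).commit();
        }
    }

    /** Hypothetical stand-in for handling the documented .kfs segment format. */
    static final class SegmentReader {
        record ParquetResult(String path, long sizeBytes, long recordCount) {}

        static List<String> listUnprocessed(String bucket, String prefix) {
            throw new UnsupportedOperationException("implement against the .kfs spec");
        }

        static ParquetResult toParquet(String segment, Schema schema) {
            throw new UnsupportedOperationException("implement against the .kfs spec");
        }
    }
}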

Unity Catalog Integration

For Databricks users, the Iceberg Processor writes directly to Unity Catalog via the REST API:

# Unity Catalog configuration
catalog:
  type: unity
  endpoint: https://<workspace>.cloud.databricks.com/api/2.1/unity-catalog/iceberg
  auth:
    type: oauth
    client_id: ${DATABRICKS_CLIENT_ID}
    client_secret: ${DATABRICKS_CLIENT_SECRET}
  warehouse: /Volumes/main/default/warehouse

Tables appear in Unity Catalog with full governance. Databricks SQL, notebooks, and jobs query them like any other Iceberg table.

Apache Polaris Integration

For multi-cloud or vendor-neutral deployments, Apache Polaris provides an open source Iceberg REST catalog:

# Polaris catalog configuration
catalog:
  type: polaris
  endpoint: https://polaris.internal:8181/api/catalog
  warehouse: s3://lakehouse/warehouse
  credentials:
    secretRef: polaris-credentials

Snowflake, Databricks, Spark, Trino, and DuckDB can all query the same tables through Polaris.
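
Pointing Spark at that same Polaris endpoint, for example, takes a handful of catalog properties. These are Iceberg's standard REST catalog settings; the catalog name and the credential below are placeholders.

// Sketch: Spark reading the same tables through the Polaris REST catalog
import org.apache.spark.sql.SparkSession;

public class QueryThroughPolaris {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("query-through-polaris")
            // Register an Iceberg catalog named "polaris" backed by the REST endpoint.
            .config("spark.sql.catalog.polaris", "org.apache.iceberg.spark.SparkCatalog")
            .config("spark.sql.catalog.polaris.type", "rest")
            .config("spark.sql.catalog.polaris.uri", "https://polaris.internal:8181/api/catalog")
            .config("spark.sql.catalog.polaris.warehouse", "s3://lakehouse/warehouse")
            // Auth depends on the Polaris setup; OAuth2 client credentials shown here.
            .config("spark.sql.catalog.polaris.credential", "<client-id>:<client-secret>")
            .getOrCreate();

        // The same table the streaming processor commits to, queried like any other Iceberg table.
        spark.sql("SELECT count(*) FROM polaris.analytics.streaming_events").show();
    }
}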

When This Makes Sense

KafScale fits specific use cases. High throughput pipelines where connector overhead becomes cost prohibitive. Workloads where analytical queries shouldn't impact streaming performance. Teams that want to eliminate the connector layer entirely.

It's not a drop-in replacement for every Kafka workload. If you need sub-10ms latency or complex consumer group semantics, traditional Kafka still wins.

Cross-Platform Queries with Apache Wayang

Once your data lands in Iceberg, you still need to query it. Often alongside data from other sources. PostgreSQL reference tables. Spark DataFrames. Other Iceberg catalogs.

Apache Wayang handles this. A federated data processing framework that optimizes query execution across multiple platforms. Spark, Flink, PostgreSQL, Java Streams. All from a single logical plan.

The founding team at TU Berlin built Wayang to answer a specific question. Can you automatically select the optimal execution platform for each operator in a data pipeline?

The answer was yes. With 10x speedup for mixed workloads compared to single-engine execution.

Disclosure: The Scalytics team includes the original creators of Apache Wayang. We contributed the project to the Apache Software Foundation where it recently graduated as a Top-Level Project.

Practical Example

One example: streaming events live in Iceberg via KafScale, customer reference data lives in PostgreSQL, and you want to join them.

WayangContext wayang = new WayangContext()
    .withPlugin(Spark.basicPlugin())     // For Iceberg scans
    .withPlugin(Postgres.plugin())        // For reference data
    .withPlugin(Java.basicPlugin());      // For final aggregation

DataQuanta<Event> events = wayang
    .readIceberg("unity://catalog/analytics/streaming_events")
    .filter(e -> e.timestamp > cutoff);

DataQuanta<Customer> customers = wayang
    .readTable("jdbc:postgresql://prod/customers");

events.join(customers, Event::customerId, Customer::id)
    .groupBy(pair -> pair.customer.segment)
    .aggregate(Aggregators.count())
    .collect();


The result is a clean, federated processing architecture that you can actually operate without blowing the budget.

[Figure: Federated Insight. Joining streams and reference data without moving it: Apache Wayang optimizes events.join(customers) across an Iceberg stream holding billions of events and a Postgres database of customer profiles, selecting the execution platform automatically.]

So What's the Right Tool?

This post isn't arguing that Kafka Connect, Flink, or Tableflow are wrong choices for the task at hand. They're not.

Tableflow is the right answer for Confluent Cloud customers who want managed Iceberg tables without operational overhead. The simplicity is worth the constraints for many workloads.

Flink is the right answer for teams with streaming expertise who need complex transformations before landing data. The ecosystem and community support are unmatched.

Kafka Connect is the right answer for organizations already invested in the Connect ecosystem with operational patterns established.

The gap we address is what happens after the data lands. Querying across multiple catalogs. Joining streaming-derived tables with operational databases. Optimizing execution across platforms automatically.

That's the federated layer. That's what Apache Wayang provides. That's what KafScale simplifies at the storage layer.

Getting Started

If you're exploring Kafka-to-Iceberg architectures:

Audit your current data movement. How many times does each byte get copied? What's the actual cost per GB landed in Iceberg?

Map your catalog landscape. Unity Catalog? Glue? Polaris? Hive Metastore? Understanding where tables are registered determines your integration options.

Identify cross-platform query needs. If all queries run in Databricks you might not need federation. If analysts use multiple tools you do.

To explore KafScale:

KafScale is Apache 2.0 licensed. The Iceberg Processor, storage format documentation, deployment guides, and Kubernetes operator are all open source.

For enterprise federated processing:

Scalytics Federated provides visual pipeline editor, cross-platform query optimization with Apache Wayang, Unity Catalog and Polaris integration, and Kubernetes-native deployment.

Further Reading

The Scalytics team includes the original creators of Apache Wayang. We're Confluent and Databricks partners focused on federated data processing for enterprise AI.

About Scalytics

Scalytics builds on Apache Wayang, the cross-platform data processing framework created by our founding team and now an Apache Top-Level Project. Where traditional platforms require moving data to centralized infrastructure, Scalytics brings compute to your data—enabling AI and analytics across distributed sources without violating compliance boundaries.

Scalytics Federated provides federated data processing across Spark, Flink, PostgreSQL, and cloud-native engines through a single abstraction layer. Our cost-based optimizer selects the right engine for each operation, reducing processing time while eliminating vendor lock-in.

Scalytics Copilot extends this foundation with private AI deployment: running LLMs, RAG pipelines, and ML workloads entirely within your security perimeter. Data stays where it lives. Models train where data resides. No extraction, no exposure, no third-party API dependencies.

For organizations in healthcare, finance, and government, this architecture isn't optional. It's how you deploy AI while remaining compliant with HIPAA, GDPR, and DORA. Explore our open-source foundation: Scalytics Community Edition

Questions? Reach us on Slack or schedule a conversation.