Apache Wayang Proposals Mapped as a Knowledge Graph

Dr. Mirko Kämpf · CEO & co-founder · April 15, 2026

This year I mentor the Apache Wayang cohort in Google Summer of Code 2026, one of 185 organizations selected globally. Thirty-seven proposals arrived from 35 contributors in three weeks. The standard response is to read the titles, notice that 17 of them say "JDBC driver", and start filtering for originality. That is the wrong move.

Titles in GSoC proposals are written for discoverability, not precision. They cluster around known community gaps because contributors read the same wiki pages and issue trackers before they write. The real signal is in the body: which specific angle each contributor took, what they already understood about the codebase, where their thinking diverged from the crowd.

Reading 37 full proposals sequentially is how mentors normally surface that signal. It takes two days and produces notes that cannot be queried. We did something different. We ran the same pipeline Scalytics uses on operational data against the proposal corpus: extract structure, build a knowledge graph, compute pairwise similarity across all 666 pairs, cluster with KMeans, and render the result as a set of navigable views. The interactive graphs below are what came out.

What the knowledge graph shows

The force graph at the top maps 37 proposal nodes to 7 theme clusters. The edges between themes are weighted by shared concept co-occurrence, not editorial judgment. JDBC Driver and DataFrame API are the two dominant clusters and are also the most connected, because the underlying technical concepts - Java, Spark, query optimization - span both. Platform Registration and Optimization sits further out, connected primarily through Flink and cost modeling concepts that few other proposals mentioned.
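The edge-weighting idea can be sketched in a few lines: aggregate the concepts extracted under each theme, then weight a theme-to-theme edge by how many concepts the two themes share. The theme names and concept sets below are invented for the sketch, not the actual extracted data:

```python
from collections import Counter
from itertools import combinations

# Illustrative records: (theme, concepts extracted from one proposal).
# Names and concept sets are invented for this sketch.
proposals = [
    ("JDBC Driver", {"java", "spark", "query optimization"}),
    ("JDBC Driver", {"java", "calcite", "query optimization"}),
    ("DataFrame API", {"java", "spark", "query optimization", "python"}),
    ("Platform Registration", {"flink", "cost modeling"}),
]

# Aggregate the concepts mentioned anywhere under each theme.
theme_concepts = {}
for theme, concepts in proposals:
    theme_concepts.setdefault(theme, set()).update(concepts)

# Theme-to-theme edge weight = number of shared concepts.
edge_weights = Counter()
for a, b in combinations(sorted(theme_concepts), 2):
    shared = theme_concepts[a] & theme_concepts[b]
    if shared:
        edge_weights[a, b] = len(shared)
```

With these toy inputs, JDBC Driver and DataFrame API end up heavily connected (shared Java, Spark, and query-optimization concepts) while Platform Registration gets no edge at all, mirroring the layout described above.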

The concept graph below the force graph inverts the typical relevance signal on purpose. Node size represents rarity, not frequency. The largest circles are the concepts that appeared in only a few proposals - "instance-aware", "cross-platform optimizer", "logical plan" - because those are where the novel technical thinking lives. Small circles like "java" and "maven" are near-universal and therefore carry little signal about what distinguishes one proposal from another.
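One way to implement that sizing rule is plain inverse document frequency. The document counts below are illustrative, not the real tallies from the cohort:

```python
import math

# Illustrative document frequencies: concept -> number of the 37
# proposals mentioning it (invented counts, not the real tallies).
doc_freq = {
    "java": 33,
    "maven": 29,
    "logical plan": 4,
    "cross-platform optimizer": 3,
    "instance-aware": 2,
}
N_DOCS = 37

def node_size(concept, base=6.0):
    """Radius grows with rarity: plain IDF, so near-universal concepts
    shrink toward zero and rare, high-signal ones dominate the view."""
    return base * math.log(N_DOCS / doc_freq[concept])

sizes = {c: round(node_size(c), 2) for c in doc_freq}
```

Under this rule a concept mentioned in 2 of 37 proposals renders more than twenty times larger than one mentioned in 33 of 37, which is exactly the inversion the concept graph makes.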

Why 17 JDBC proposals are not 17 duplicate proposals

The TF-IDF analysis ran 833 features (1-2 gram tokens, minimum document frequency 2, maximum document frequency 0.85) across all 37 full proposal texts. Cosine similarity computed over all 666 pairwise combinations produced a mean of 0.141 and a median of 0.100. KMeans found its optimal silhouette score at k=4, not k=17. There are zero pairs above the 0.80 similarity threshold that would indicate near-duplication.
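A minimal sketch of that similarity and clustering step, using scikit-learn with the vectorizer settings quoted above. The corpus here is an invented stand-in for the 37 proposal texts (which are not reproduced in this post); note that for 37 documents the pairwise count is 37 × 36 / 2 = 666:

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative stand-in corpus; the real run used all 37 proposal texts.
docs = [
    "jdbc driver with calcite query parsing for wayang",
    "jdbc driver focused on resultset streaming in wayang",
    "jdbc driver with broad type mapping coverage for wayang",
    "dataframe api for python users of wayang",
    "dataframe api bridging spark and wayang",
    "iceberg backend as execution target for wayang",
    "trino backend as execution target for wayang",
    "instance aware platform registration and cost modeling for wayang execution",
]

# Same vectorizer settings as the analysis: 1-2 grams, min_df=2, max_df=0.85.
X = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.85).fit_transform(docs)

# Cosine similarity over every document pair.
sim = cosine_similarity(X)
pairs = [sim[i, j] for i, j in combinations(range(len(docs)), 2)]
print("mean:", round(float(np.mean(pairs)), 3),
      "median:", round(float(np.median(pairs)), 3))

# Choose k by silhouette score, as the analysis did (it landed on k=4).
best_k, best_s = None, -1.0
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_s:
        best_k, best_s = k, score
print("best k:", best_k)
```

The max_df cutoff is what drops near-universal terms (here, "wayang" appears in every document and is excluded), which is the same mechanism that keeps commodity vocabulary from inflating similarity scores in the real run.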

The scatter plot in the clustering section visualizes this. The 17 JDBC proposals form a cluster in 2D TruncatedSVD space, but it is a wide, dispersed blob - not a collapsed point. Each contributor found a different angle on the same unsolved problem: some proposed Calcite-based query parsing, others focused on ResultSet streaming, others on type-mapping coverage. The problem statement is shared. The proposed solutions are distinct.
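The 2D projection itself is a short step on top of the TF-IDF matrix. A minimal sketch with a stand-in corpus; TruncatedSVD is used rather than PCA because it accepts the sparse matrix directly:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative stand-in texts; the real run projected all 37 TF-IDF vectors.
docs = [
    "jdbc driver calcite query parsing",
    "jdbc driver resultset streaming",
    "jdbc driver type mapping coverage",
    "dataframe api python bindings",
    "iceberg and trino execution backends",
]

X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)

# Project the sparse TF-IDF matrix to two dimensions for plotting.
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
```

Each row of `coords` is one document's position in the scatter plot; a theme whose proposals were near-copies would collapse to almost identical rows.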

This matters for how we read the Apache Wayang roadmap. A topic that attracts 17 independent proposals in a single season is not noise. It is a reliable signal about where the community believes the project's current ceiling is. The mentor job is not to pick one and discard the rest. It is to understand what the 17 approaches collectively say about the problem, and which single angle has the best chance of shipping production-quality code in twelve weeks.

The approach behind the analysis

The pipeline that produced this page is the same one we use at Scalytics on enterprise operational data - support ticket corpora, incident report backlogs, internal RFC collections. The steps are the same: ingest structured documents, anonymize, extract concepts against a controlled vocabulary, compute similarity and clustering, render as a queryable graph rather than a flat list.
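The concept-extraction step can be sketched as whole-word matching against a controlled vocabulary. The vocabulary below is a made-up fragment, not the real project-specific one:

```python
import re

# Made-up fragment of a controlled vocabulary; the real one is
# project-specific and much larger.
VOCAB = {"jdbc", "dataframe", "iceberg", "calcite",
         "cost model", "instance-aware", "logical plan"}

def extract_concepts(text, vocab=VOCAB):
    """Return the vocabulary terms present in a document.

    Case-insensitive, whole-token match so that e.g. 'jdbc' does not
    fire inside an unrelated longer token."""
    lowered = text.lower()
    return {term for term in vocab
            if re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", lowered)}

concepts = extract_concepts("An instance-aware cost model for the JDBC driver")
```

Pinning extraction to a controlled vocabulary is what makes the downstream graph comparable across documents: every proposal is described in the same concept space, so co-occurrence and rarity are well-defined.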

GSoC proposals are an unusually clean public dataset for demonstrating this: real documents, community-generated, structured enough to analyze, small enough to reason about transparently. Scalytics Copilot on GitHub is one of the open-source components that power this kind of federated document analysis, but any public LLM does the trick, too. If you work on LLM-backed knowledge infrastructure or distributed data processing and want to talk through the architecture, the contact link in the embed above goes to the right place.


Knowledge graph — themes & proposals
Large nodes are themes. Small nodes are proposals. Edges connect proposals to their theme and themes to each other via shared concepts.

Seven themes, one roadmap

Each theme shows the mentor synthesis — the problem statement the cohort was collectively proposing to solve — with the key concepts and technologies surfaced.

Each theme as an integrated design

Read each theme as a composite — the integrated features, technologies, and concepts the cohort is collectively asking for.

Concept graph — what the cohort actually talks about
Larger nodes are rarer, higher-signal concepts — where the novel thinking lives. Small nodes are commodity tooling. Edge thickness = co-occurrence frequency.

Are the proposals actually duplicates?

Seventeen proposals for a JDBC driver. Eight for a DataFrame API. We ran TF-IDF (833 features, 1-2 grams) over the full text of all 37, cosine similarity on all 666 pairs, and KMeans on the vectors — ignoring titles entirely.

The answer is no. Zero near-duplicates above the 0.80 threshold. The cohort explored the same problem from 35 different directions.

Mean similarity: 0.141
Median similarity: 0.100
Near-duplicates ≥0.80: 0
KMeans best k: 4

Theme tightness: avg pairwise similarity inside each theme. Loose = distinct content behind shared vocabulary.
Proposal space — TruncatedSVD 2D
Wide blobs = 35 distinct angles. If the cohort were copying, each theme would collapse to a point.
The reframe

Thirty-five contributors took the same community-known problems and each found a different angle. The mentor job is not picking between copies. It is picking between perspectives.

How it was built

The same pipeline Scalytics runs on operational data — pointed at GSoC proposals instead.

Step 1
Ingest
Pull all public GSoC proposals mentioning Apache Wayang. Normalize and anonymize before any analysis.
Step 2
Knowledge graph
Extract themes, technologies, topics. Build proposal-to-theme edges and theme co-occurrence edges weighted by shared signal.
Step 3
Guidance
For each cluster, generate the mentor guidance a reader would produce after reading all 37 proposals back to back.
This is what Scalytics does at enterprise scale. We point the same pipeline at your operational data — support tickets, incident reports, RFCs — and produce the same kind of navigable map. Themes you didn't know you had. Connections you couldn't see.

What this cohort says about where Apache Wayang needs to go

Three themes account for 31 of 37 proposals: JDBC Driver (17), DataFrame API (8), and Datalake-Friendly Backends (6). Read together, they describe the same gap from three entry points. SQL tooling cannot reach Wayang because there is no driver. Python data engineers cannot use Wayang because the API is JVM-native. Modern lakehouse stacks based on Iceberg, Trino, and DataFusion are not first-class execution targets.

All three are interface gaps, not core engine gaps. The cross-platform optimizer at Wayang's core, the component that decides whether a given operator runs on Spark, Flink, or Postgres based on cost, is already production-quality. What is missing is the surface area that makes it reachable from the modern data stack.

The two Platform Registration and Optimization proposals are the exception. They address the optimizer itself: specifically, the assumption that all registered Spark or Flink instances are equivalent. Instance-aware routing would let Wayang make cost-based decisions that account for the actual capacity and latency of each registered platform. That is core engine work, and the two proposals addressing it had the highest internal similarity score in the cohort (0.792) - meaning they converged on the same technical understanding from independent starting points. That is a signal worth acting on.

About Scalytics

Scalytics architects and troubleshoots mission-critical streaming, federated execution, and AI systems for scaling SMEs. We help organizations turn streams into decisions - reliably, in real time, and under production load. When Kafka pipelines fall behind, SAP IDocs block processing, lakehouse sinks break, or AI pilots collapse under real load, we step in and make them run.

Our founding team created Apache Wayang (now an Apache Top-Level Project), the federated execution framework that orchestrates Spark, Flink, and TensorFlow where data lives and reduces ETL movement overhead.

We also invented and actively maintain KafScale (S3-Kafka-streaming platform), a Kafka-compatible, stateless data and large object streaming system designed for Kubernetes and object storage backends. Elastic compute. No broker babysitting. No lock-in.

Our mission: data stays in place. Compute comes to you. From data lakehouses to private AI deployment and distributed ML - all designed for security, compliance, and production resilience.

Questions? Join our open Slack community or schedule a consult.