This year I am mentoring the Apache Wayang cohort in Google Summer of Code 2026; Wayang is one of 185 organizations selected globally. Thirty-seven proposals arrived from 35 contributors in three weeks. The standard response is to read the titles, notice that 17 of them say "JDBC driver", and start filtering for originality. That is the wrong move.
Titles in GSoC proposals are written for discoverability, not precision. They cluster around known community gaps because contributors read the same wiki pages and issue trackers before they write. The real signal is in the body: which specific angle each contributor took, what they already understood about the codebase, where their thinking diverged from the crowd.
Reading 37 full proposals sequentially is how mentors normally surface that signal. It takes two days and produces notes that cannot be queried. We did something different. We ran the same pipeline Scalytics uses on operational data against the proposal corpus: extract structure, build a knowledge graph, compute pairwise similarity across all 666 pairs, cluster with KMeans, and render the result as a set of navigable views. The interactive graphs below are what came out.
What the knowledge graph shows
The force graph at the top maps 37 proposal nodes to 7 theme clusters. The edges between themes are weighted by shared concept co-occurrence, not editorial judgment. JDBC Driver and DataFrame API are the two dominant clusters and are also the most connected, because the underlying technical concepts - Java, Spark, query optimization - span both. Platform Registration and Optimization sits further out, connected primarily through Flink and cost modeling concepts that few other proposals mentioned.
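Co-occurrence edge weighting of this kind can be sketched in a few lines. The sketch below is illustrative, not the production pipeline: it assumes each proposal has already been reduced to a (theme, concept set) pair, and weights the edge between two themes by how many concepts appear under both.

```python
from collections import defaultdict
from itertools import combinations

def theme_edge_weights(proposals):
    """proposals: iterable of (theme, concept_set) pairs.
    Edge weight between two themes = number of shared concepts."""
    theme_concepts = defaultdict(set)
    for theme, concepts in proposals:
        theme_concepts[theme].update(concepts)
    weights = {}
    for a, b in combinations(sorted(theme_concepts), 2):
        shared = theme_concepts[a] & theme_concepts[b]
        if shared:  # omit edges with no co-occurring concepts
            weights[(a, b)] = len(shared)
    return weights
```

Under this weighting, JDBC Driver and DataFrame API become heavily connected as soon as concepts like Java and Spark show up in proposals filed under both themes, while a theme whose concepts few others mention stays on the periphery.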
The concept graph below the force graph inverts the typical relevance signal on purpose. Node size represents rarity, not frequency. The largest circles are the concepts that appeared in only a few proposals - "instance-aware", "cross-platform optimizer", "logical plan" - because those are where the novel technical thinking lives. Small circles like "java" and "maven" are near-universal and therefore carry little signal about what distinguishes one proposal from another.
Why 17 JDBC proposals are not 17 duplicate proposals
The TF-IDF analysis produced 833 features (1-2 gram tokens, minimum document frequency 2, maximum document frequency 0.85) across all 37 full proposal texts. Cosine similarity computed over all 666 pairwise combinations produced a mean of 0.141 and a median of 0.100. KMeans found its optimal silhouette score at k=4, not k=17. Not a single pair exceeds the 0.80 similarity threshold that would indicate near-duplication.
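The numeric core of that analysis fits in one function. This is a sketch with scikit-learn, mirroring the vectorizer settings quoted above (1-2 grams, min_df=2, max_df=0.85) and the silhouette-based choice of k; the exact preprocessing of the real pipeline is not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity

def similarity_stats(texts, k_range=range(2, 6)):
    # TF-IDF with the same settings the post quotes
    vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.85)
    X = vec.fit_transform(texts)

    # cosine similarity over every unordered pair (n choose 2)
    sim = cosine_similarity(X)
    pairs = sim[np.triu_indices_from(sim, k=1)]

    # pick k by best silhouette score, as described above
    best_k, best_score = None, -1.0
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score

    return float(pairs.mean()), float(np.median(pairs)), best_k
```

On the 37-proposal corpus this kind of computation yields the numbers above; with toy inputs the exact values will of course differ.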
The scatter plot in the clustering section visualises this. The 17 JDBC proposals form a cluster in 2D TruncatedSVD space, but it is a wide, dispersed blob - not a collapsed point. Each contributor found a different angle on the same unsolved problem: some proposed Calcite-based query parsing, others focused on ResultSet streaming, others on type-mapping coverage. The problem statement is shared. The proposed solutions are distinct.
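The 2D coordinates behind that scatter plot come from a truncated SVD of the TF-IDF matrix. A minimal sketch (the plotting itself is omitted; the vectorizer here uses default document-frequency settings rather than the corpus-tuned ones):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def project_2d(texts):
    """Reduce TF-IDF vectors to 2 components for plotting.
    TruncatedSVD works directly on sparse matrices, unlike PCA."""
    X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)
    return TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
```

A tight near-duplicate cluster would collapse to almost-identical coordinates in this space; the 17 JDBC proposals instead spread into the wide blob described above.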
This matters for how we read the Apache Wayang roadmap. A topic that attracts 17 independent proposals in a single season is not noise. It is a reliable signal about where the community believes the project's current ceiling is. The mentor job is not to pick one and discard the rest. It is to understand what the 17 approaches collectively say about the problem, and which single angle has the best chance of shipping production-quality code in twelve weeks.
The approach behind the analysis
The pipeline that produced this page is the same one we use at Scalytics on enterprise operational data - support ticket corpora, incident report backlogs, internal RFC collections. The steps are the same: ingest structured documents, anonymize, extract concepts against a controlled vocabulary, compute similarity and clustering, render as a queryable graph rather than a flat list.
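The front of that pipeline can be sketched as a pair of small functions. Everything here is a stand-in: the vocabulary is a hypothetical fragment, and the anonymization is a crude email scrub rather than the real redaction step.

```python
import re

# hypothetical fragment of a controlled vocabulary, for illustration only
VOCAB = {"jdbc", "spark", "flink", "calcite", "iceberg"}

def anonymize(text):
    # crude stand-in for the real anonymization step: scrub email addresses
    return re.sub(r"\S+@\S+", "[email]", text)

def extract_concepts(text, vocab=VOCAB):
    # tokenize and keep only terms from the controlled vocabulary
    tokens = set(re.findall(r"[a-z][a-z0-9]+", text.lower()))
    return tokens & vocab

def ingest(docs):
    # ingest -> anonymize -> extract; similarity and clustering follow downstream
    return [extract_concepts(anonymize(d)) for d in docs]
```

The point of the controlled vocabulary is consistency: two proposals that say "Apache Calcite" and "calcite-based parsing" land on the same concept node instead of two.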
GSoC proposals are an unusually clean public dataset for demonstrating this: real documents, community-generated, structured enough to analyse, small enough to reason about transparently. Scalytics Copilot on GitHub is one of the open-source components that power this kind of federated document analysis, but any capable public LLM does the trick, too. If you work on LLM-backed knowledge infrastructure or distributed data processing and want to talk through the architecture, the contact link in the embed above goes to the right place.
What this cohort says about where Apache Wayang needs to go
Three themes account for 31 of 37 proposals: JDBC Driver (17), DataFrame API (8), and Datalake-Friendly Backends (6). Read together, they describe the same gap from three entry points. SQL tooling cannot reach Wayang because there is no driver. Python data engineers cannot use Wayang because the API is JVM-native. Modern lakehouse stacks based on Iceberg, Trino, and DataFusion are not first-class execution targets.
All three are interface gaps, not core engine gaps. The cross-platform optimizer at Wayang's core - the component that decides whether a given operator runs on Spark, Flink, or Postgres based on cost - is already production-quality. What is missing is the surface area that makes it reachable from the modern data stack.
The two Platform Registration and Optimization proposals are the exception. They address the optimizer itself: specifically, the assumption that all registered Spark or Flink instances are equivalent. Instance-aware routing would let Wayang make cost-based decisions that account for the actual capacity and latency of each registered platform. That is core engine work, and the two proposals addressing it had the highest internal similarity score in the cohort (0.792) - meaning they converged on the same technical understanding from independent starting points. That is a signal worth acting on.
About Scalytics
Our founding team created Apache Wayang (now an Apache Top-Level Project), the federated execution framework that orchestrates Spark, Flink, and TensorFlow where data lives and reduces ETL movement overhead.
We also created and actively maintain KafScale, a Kafka-compatible, stateless streaming system for data and large objects, built for Kubernetes and S3-class object storage backends. Elastic compute. No broker babysitting. No lock-in.
Our mission: data stays in place. Compute comes to you. From data lakehouses to private AI deployment and distributed ML - all designed for security, compliance, and production resilience.
Questions? Join our open Slack community or schedule a consult.
