How Loomkindle Uses Agentic Discovery to Cut Schema Mapping Time by 80%
The traditional approach to building a semantic catalog starts with a data dictionary spreadsheet and 6-8 weeks of interviews. Analytics engineers work through source tables column by column, asking upstream teams what each field means, manually writing descriptions, and building a YAML metric registry that is outdated before the interviews are finished. We've watched teams do this on both sides — the team trying to build the catalog and the team being interviewed — and the result is almost always a catalog that captures 60-70% of the important context, with the remaining 30-40% locked in individual heads, and a maintenance debt that grows faster than the team can address it.
Agentic schema discovery is a different approach. Instead of starting from blank documentation and asking humans to fill it in, it starts from the data itself — column names, data types, value distributions, sample values, and the semantic context embedded in naming conventions — and uses automated reasoning to propose canonical mappings. In our early deployments, this approach produces an 80% reduction in initial schema mapping time. Here's how it works technically, and where the human decisions still live.
What the discovery agent actually reads
When Loomkindle connects to a warehouse (Snowflake, BigQuery, Redshift, or DuckDB), the schema discovery agent performs a structured analysis of each source table before proposing any mappings. It reads:
Structural signals
Column name, data type, nullable/not-null, primary key designation, and foreign key relationships. A column named account_uuid of type VARCHAR(36) with a not-null constraint and a uniqueness index is structurally consistent with a primary entity identifier. The agent uses this to propose that the column is likely a join key that should be tracked as a required field in any downstream contract.
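As a rough illustration of the structural check described above, here is a minimal sketch. The ColumnMeta shape, the suffix list, and the function name are assumptions made for this example, not Loomkindle's actual internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMeta:
    """Structural metadata for one column (hypothetical shape)."""
    name: str
    data_type: str
    nullable: bool
    is_unique: bool
    is_primary_key: bool = False


def looks_like_entity_identifier(col: ColumnMeta) -> bool:
    """Return True when structure alone suggests a primary entity identifier."""
    # Assumed naming suffixes and UUID-shaped types; a real rule set would be broader.
    key_like_name = col.name.lower().endswith(("_id", "_uuid", "_key"))
    uuid_like_type = col.data_type.upper().replace(" ", "") in {"VARCHAR(36)", "UUID"}
    # The constraint must actually be enforced: not-null plus uniqueness or a primary key.
    enforced = (not col.nullable) and (col.is_unique or col.is_primary_key)
    return enforced and (key_like_name or uuid_like_type)


account_uuid = ColumnMeta("account_uuid", "VARCHAR(36)", nullable=False, is_unique=True)
print(looks_like_entity_identifier(account_uuid))  # True
```

A check like this only nominates a column as a candidate join key. Whether it becomes a required field in a downstream contract is still decided in review.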
Distribution signals
The agent samples up to 10,000 rows from each column and analyzes cardinality, null rate, value range, and value distribution. A column with 4 distinct values, 0% null rate, and values active, churned, trial, paused is structurally consistent with a subscription status enum. The agent proposes an enumeration contract with exactly those accepted values.
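The distribution analysis can be sketched roughly as follows. The function names, the cardinality cutoff of 10, and the "each value repeats often enough" rule are illustrative assumptions; the sampled values are passed in as a plain list rather than fetched from a warehouse.

```python
from collections import Counter
from typing import Any, List, Optional, Sequence


def profile_column(values: Sequence[Any]) -> dict:
    """Summarize a sample of column values (None stands in for SQL NULL)."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "row_count": len(values),
        "null_rate": (1 - len(non_null) / len(values)) if values else 0.0,
        "cardinality": len(counts),
        "distinct_values": sorted(counts, key=str),
    }


def propose_enum(profile: dict, max_cardinality: int = 10) -> Optional[List[Any]]:
    """Propose accepted values when the sample looks like an enumeration."""
    few_values = 1 < profile["cardinality"] <= max_cardinality
    # Assumed guard: a tiny sample with all-distinct values is not evidence of an enum.
    repeated = profile["row_count"] >= 10 * profile["cardinality"]
    if profile["null_rate"] == 0 and few_values and repeated:
        return profile["distinct_values"]
    return None


sample = ["active", "churned", "trial", "paused"] * 250
print(propose_enum(profile_column(sample)))  # ['active', 'churned', 'paused', 'trial']
```

Because the proposal comes from a sample, a rare fifth status value could be missed, which is one reason the enumeration is a proposal for review rather than an enforced contract.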
Naming convention signals
Column names encode significant semantic information when read in context. monthly_recurring_revenue in a table called subscriptions is almost certainly an MRR field. created_at in any table is almost certainly a row creation timestamp. The agent applies a vocabulary of business domain terms to propose semantic categories for columns where the name carries clear intent.
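A minimal sketch of vocabulary matching might look like this. The three rules, the category labels, and the "high"/"medium" confidence levels are assumptions chosen to mirror the examples above; a production vocabulary would be much larger.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class NameRule:
    """One vocabulary entry (hypothetical shape)."""
    pattern: str                        # regex matched against the column name
    category: str                       # proposed semantic category
    table_hints: Tuple[str, ...] = ()   # tables where the name is unambiguous


VOCABULARY = (
    NameRule(r"^(mrr|monthly_recurring_revenue)$", "mrr", ("subscriptions",)),
    NameRule(r"^created_at$", "row_creation_timestamp"),
    NameRule(r"_(id|uuid)$", "identifier"),
)


def propose_semantic_category(table: str, column: str) -> Optional[Tuple[str, str]]:
    """Return (category, confidence) for the first matching rule, else None."""
    for rule in VOCABULARY:
        if re.search(rule.pattern, column.lower()):
            # A rule with no table hints applies in any table; otherwise the
            # table context has to agree for the proposal to be high confidence.
            in_context = not rule.table_hints or table.lower() in rule.table_hints
            return rule.category, ("high" if in_context else "medium")
    return None


print(propose_semantic_category("subscriptions", "monthly_recurring_revenue"))
print(propose_semantic_category("events", "acq_chnl_cd"))
```

Returning None for an unmatched name is deliberate in this sketch: a column the vocabulary cannot place is routed to human annotation rather than guessed at.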
Existing dbt metadata
When the warehouse already has dbt models, the agent reads schema.yml files for any column descriptions that have been written, any existing tests (not_null, accepted_values, relationships), and the model DAG structure. These existing annotations become high-confidence seeds for the catalog — the agent treats documented dbt columns as authoritative and focuses discovery effort on the undocumented columns.
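To make the seeding step concrete, here is a sketch that splits columns into documented seeds and undocumented candidates. It assumes the schema.yml file has already been parsed into a dictionary with the common models / columns / description / tests layout (some newer dbt versions name the tests key data_tests). The function names and the sample schema are hypothetical.

```python
from typing import Dict, List, Tuple

ColumnKey = Tuple[str, str]  # (model name, column name)


def declared_tests(column: dict) -> List[str]:
    """Names of the tests declared on a column, e.g. not_null or accepted_values."""
    raw = column.get("tests") or column.get("data_tests") or []
    # A test is either a bare string or a one-key mapping carrying its arguments.
    return [t if isinstance(t, str) else next(iter(t)) for t in raw]


def split_by_documentation(schema: dict) -> Tuple[Dict[ColumnKey, dict], List[ColumnKey]]:
    """Return (seeds, undocumented) for every column listed in the parsed file."""
    seeds: Dict[ColumnKey, dict] = {}
    undocumented: List[ColumnKey] = []
    for model in schema.get("models", []):
        for column in model.get("columns", []):
            key = (model["name"], column["name"])
            description = (column.get("description") or "").strip()
            tests = declared_tests(column)
            if description or tests:
                seeds[key] = {"description": description, "tests": tests}
            else:
                undocumented.append(key)
    return seeds, undocumented


schema = {
    "models": [{
        "name": "subscriptions",
        "columns": [
            {"name": "status",
             "description": "Subscription lifecycle state",
             "tests": ["not_null",
                       {"accepted_values": {"values": ["active", "churned", "trial", "paused"]}}]},
            {"name": "acq_chnl_cd"},
        ],
    }]
}
seeds, undocumented = split_by_documentation(schema)
print(sorted(seeds), undocumented)
```

This sketch only sees columns that appear in the file. Warehouse columns that are missing from schema.yml entirely would also count as undocumented and would need to be found by comparing against the warehouse's own column list.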
The proposal format: structured review, not automatic enforcement
The output of discovery is not an automatically-published catalog. It's a structured set of proposals that an analytics engineer reviews, modifies, and approves before they become canonical definitions. This distinction matters a lot.
The agent produces proposals in a format that looks like a pull request: each proposed mapping shows the column, the proposed semantic category, the confidence level, the supporting evidence (which signals contributed to the proposal), and the action required: approve, modify, or reject. High-confidence proposals (70-85% of the total on a typical SaaS warehouse) can be batch-approved in a review session. Low-confidence proposals surface the cases where human judgment is required, usually complex business logic encoded in ambiguously named columns.
In our experience, an analytics engineer working through a fresh 40-table Snowflake schema spends roughly 45 minutes reviewing and approving discovery output that would have taken 4-6 weeks to produce manually. The agent handles the obvious cases; the engineer handles the ambiguous ones. That's the 80% time reduction in practice.
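One way to picture the proposal record and the review split is the sketch below. The field names, the numeric confidence score, and the 0.9 threshold are assumptions for illustration. Note that nothing is approved until a reviewer calls approve, which matches the "structured review, not automatic enforcement" model.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass
class Proposal:
    """One proposed mapping, shaped like a line item in a pull request."""
    table: str
    column: str
    proposed_category: str
    confidence: float                                   # assumed 0.0-1.0 score
    evidence: List[str] = field(default_factory=list)   # signals behind the proposal
    status: str = "pending"                             # pending, approved, modified, rejected


def split_for_review(proposals: Iterable[Proposal],
                     threshold: float = 0.9) -> Tuple[List[Proposal], List[Proposal]]:
    """Partition pending proposals into batch candidates and judgment calls."""
    batch, needs_judgment = [], []
    for p in proposals:
        if p.status != "pending":
            continue
        (batch if p.confidence >= threshold else needs_judgment).append(p)
    return batch, needs_judgment


def approve(batch: Iterable[Proposal]) -> None:
    """Called by the reviewer; the agent never approves its own proposals."""
    for p in batch:
        p.status = "approved"


proposals = [
    Proposal("subscriptions", "status", "subscription_status_enum", 0.97,
             ["distribution: 4 distinct values", "naming: status"]),
    Proposal("subscriptions", "acq_chnl_cd", "unknown_code", 0.35,
             ["distribution: integers 1-8"]),
]
batch, needs_judgment = split_for_review(proposals)
approve(batch)
print([p.column for p in batch], [p.column for p in needs_judgment])
```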
Continuous discovery: staying current after the initial catalog
The initial catalog build is the starting point, not the goal. The value of agentic discovery compounds over time because it runs on every schema change, not just on initial connect.
When an upstream Snowflake table adds a column, renames a column, or changes a data type, the discovery agent:
- Detects the schema diff on the next incremental load run
- Analyzes the new or changed column using the same structural, distribution, and naming signals as the initial discovery
- Computes the blast radius: which existing catalog metrics depend on the affected columns
- Generates a structured change proposal that routes to the metric owners of affected downstream definitions
- Blocks promotion of the changed data until the proposal is reviewed and approved or the metric owner signs off on the impact
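The diff and blast-radius steps above can be sketched as follows. The schema snapshots are plain column-to-type dictionaries, and the metric dependency map is a hypothetical stand-in for whatever lineage the catalog actually stores.

```python
from typing import Dict, List, Set


def diff_schema(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two {column: data_type} snapshots of the same table."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "retyped": sorted(c for c in old.keys() & new.keys() if old[c] != new[c]),
    }


def blast_radius(diff: Dict[str, List[str]],
                 metric_dependencies: Dict[str, Set[str]]) -> List[str]:
    """Metrics that read a column that was removed or changed type."""
    # Newly added columns have no dependents yet, so they are excluded here.
    affected = set(diff["removed"]) | set(diff["retyped"])
    return sorted(m for m, cols in metric_dependencies.items() if cols & affected)


old = {"account_uuid": "VARCHAR(36)", "mrr": "NUMBER(12,2)", "status": "VARCHAR"}
new = {"account_uuid": "VARCHAR(36)", "mrr": "FLOAT", "status": "VARCHAR", "plan_tier": "VARCHAR"}
metrics = {
    "net_new_mrr": {"mrr", "status"},
    "active_accounts": {"account_uuid", "status"},
}

diff = diff_schema(old, new)
print(diff)
print(blast_radius(diff, metrics))  # ['net_new_mrr']
```

In this simple form a rename shows up as one removed column plus one added column. Pairing those back together as a rename would need an extra step, such as comparing the distribution profiles of the two columns.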
This continuous loop is what turns a one-time catalog build into a living system. The catalog doesn't go stale because every schema change triggers a discovery run and a review workflow. The analytics engineer's job shifts from "manually maintain documentation" to "review and approve proposals generated from the data itself."
Where the agent's confidence is lower — and why that's useful information
Not all columns produce high-confidence proposals. The cases where the agent produces low-confidence proposals are often exactly the cases where human documentation adds the most value. Specifically:
- Cryptic internal codes: A column named acq_chnl_cd with integer values 1-8 has a perfectly uniform distribution and no null values. The agent cannot infer that values 1-4 are organic acquisition channels and 5-8 are paid channels based on an internal convention established at company founding. Low-confidence proposal + flag for human documentation.
- Columns where business logic diverges from naming: A column named revenue that actually contains gross margin after partner fees (not revenue by any standard accounting definition) cannot be discovered from the column name alone. Low confidence, flag for owner review.
- Columns with high cardinality free text: A notes or description field with unique values in every row tells the agent very little about semantic intent. These are flagged as requiring human annotation rather than auto-proposed.
The low-confidence proposals are useful not just as flags for human attention but as a coverage metric. An analytics engineer can see at a glance how many columns in the catalog are well-documented versus still pending human annotation, which tables have the highest documentation debt, and which metric dependencies run through poorly-documented columns — meaning those metrics are the ones most at risk when the underlying schema changes.
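As a sketch of how that coverage view could be computed, assume each catalog column carries a review status and each metric lists the columns it depends on. Both structures, and the "pending" / "approved" labels, are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

ColumnKey = Tuple[str, str]  # (table, column)


def documentation_debt(status: Dict[ColumnKey, str]) -> Dict[str, float]:
    """Share of each table's columns still pending human annotation."""
    total: Dict[str, int] = defaultdict(int)
    pending: Dict[str, int] = defaultdict(int)
    for (table, _column), state in status.items():
        total[table] += 1
        if state == "pending":
            pending[table] += 1
    return {table: pending[table] / total[table] for table in total}


def at_risk_metrics(status: Dict[ColumnKey, str],
                    metric_dependencies: Dict[str, Set[ColumnKey]]) -> List[str]:
    """Metrics with at least one dependency on a pending column."""
    pending = {key for key, state in status.items() if state == "pending"}
    return sorted(m for m, cols in metric_dependencies.items() if cols & pending)


status = {
    ("subscriptions", "status"): "approved",
    ("subscriptions", "acq_chnl_cd"): "pending",
    ("accounts", "account_uuid"): "approved",
}
metrics = {
    "paid_signups": {("subscriptions", "acq_chnl_cd"), ("accounts", "account_uuid")},
    "active_accounts": {("accounts", "account_uuid"), ("subscriptions", "status")},
}
print(documentation_debt(status))        # {'subscriptions': 0.5, 'accounts': 0.0}
print(at_risk_metrics(status, metrics))  # ['paid_signups']
```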
The catalog as a living operational system
Schema mapping time is the visible metric — 80% reduction in initial build time. But the more durable value is what happens after the catalog is built: upstream schema changes surface as structured reviews in 45 minutes instead of causing debugging sessions days later. Metrics have named owners who receive change notifications automatically. Contract checks run against every incremental load. Freshness SLAs are attached to metrics that business teams actually depend on.
The catalog built through agentic discovery is not documentation. It's an operational system that actively monitors, proposes, and enforces the semantic consistency of your analytics stack. The analytics engineer is the decision-maker in that system — approving proposals, resolving ambiguities, owning the definitions that matter — but they're making decisions on structured inputs from automated analysis, not building documentation from scratch.
That shift — from documentation authors to proposal reviewers — is what reclaims the 60-70% of analytics engineering time currently spent on ETL maintenance. Not by eliminating the need for human judgment, but by applying it where it actually adds value.