June 21, 2026

Bootstrapping the Glossary: Let the Model Draft Its Own Domain Dictionary

We've argued before that the association layer — the dictionary mapping your users' words to your schema's names — is the highest-leverage artifact in a knowledge bot, and that you should write it by hand, backwards from the questions you most need answered. That advice holds. But it has a ceiling.

It works beautifully for the first dozen tables. It does not work for a thousand. And it really doesn't work when every new client arrives with a brand-new schema you've never seen, and the clock is running. Hand-authoring every alias for every table is weeks of tedious archaeology, most of it spent on tables nobody will ever query. At production scale, manual capture isn't disciplined — it's a bottleneck.

So you change the division of labor. The model drafts the glossary; the human curates it.

The model is good at this, given the right look at the table

Here's the move: for each table, hand the model the table's name, its columns, and a small sample of real rows, and ask it to produce a handful — three to six is plenty — of domain terms in your users' language. Not column names. The words a person would actually use.

This plays directly to what models are good at. Shown a table called H_OSOBA with columns for names, birthdates, and a title code, plus ten sample rows, a model will readily infer "this is the master record of people" and offer employee, person, staff, personnel. It's pattern recognition over structure and examples, which is exactly the kind of inference language models do well. You're not asking it to reason about your business; you're asking it to read a table and name it the way a human would.
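To make that concrete, here is a minimal sketch of the prompt for a single table and of the parsing of the reply. It is an illustration, not our production code: the function names, the prompt wording, and the JSON-array reply format are all assumptions, and the actual model call is left out because it depends on your provider.

```python
import json


def build_alias_prompt(table_name, columns, sample_rows, min_terms=3, max_terms=6):
    """Assemble a prompt asking for user-language terms for one table.

    The wording and the JSON reply format are illustrative assumptions.
    """
    header = " | ".join(columns)
    rows = "\n".join(
        " | ".join(str(row.get(col, "")) for col in columns) for row in sample_rows
    )
    return (
        f"Table: {table_name}\n"
        f"Columns: {', '.join(columns)}\n"
        f"Sample rows:\n{header}\n{rows}\n\n"
        f"List {min_terms} to {max_terms} terms a non-technical user would use "
        "to refer to this table. Use everyday words, not table or column names. "
        "Reply with a JSON array of strings only."
    )


def parse_terms(reply, max_terms=6):
    """Parse the JSON reply into lowercased, de-duplicated terms."""
    seen, terms = set(), []
    for item in json.loads(reply):
        if not isinstance(item, str):
            continue  # ignore anything that is not a plain string
        key = item.strip().lower()
        if key and key not in seen:
            seen.add(key)
            terms.append(key)
    return terms[:max_terms]
```

For H_OSOBA you would pass the table name, its column list, and ten sample rows as dictionaries, send the resulting string to whatever model client you use, and feed the reply to parse_terms.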

Input shaping is the actual craft

The naive version of this fails immediately, because you cannot pour a whole table into a prompt. A wide table with a million rows would blow the context window, cost a fortune, and bury the signal — the same context-flooding trap that ruins retrieval.

So the real skill is shaping the input. Cap it hard: at most around sixty columns, ten sample rows, a couple hundred characters per cell. That's enough for the model to understand what the table is — the column names carry most of the meaning, the sample rows confirm it — without drowning it. It's the lean-packet discipline applied to a different stage: give the model exactly enough to make the call, and not one token more.
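The caps are simple enough to sketch in a few lines. The limits below mirror the numbers above (sixty columns, ten rows, two hundred characters per cell); the function name and the rows-as-dictionaries shape are assumptions made for the example.

```python
from itertools import islice

MAX_COLUMNS = 60
MAX_ROWS = 10
MAX_CELL_CHARS = 200


def shape_table(columns, rows, max_columns=MAX_COLUMNS, max_rows=MAX_ROWS,
                max_cell_chars=MAX_CELL_CHARS):
    """Trim a table to a prompt-sized packet.

    `rows` is any iterable of dicts keyed by column name; only the first
    `max_rows` are read, so a lazy cursor is never fully consumed.
    """
    kept_columns = list(columns)[:max_columns]
    shaped_rows = []
    for row in islice(rows, max_rows):
        shaped = {}
        for col in kept_columns:
            value = row.get(col)
            text = "" if value is None else str(value)
            shaped[col] = text[:max_cell_chars]  # hard cut on long cells
        shaped_rows.append(shaped)
    return kept_columns, shaped_rows
```

Taking the first sixty columns is the crudest possible selection. If your wide tables put the meaningful columns at the end, you may want a smarter pick, but the hard ceiling is the part that matters.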

Guard against the obvious failure

There's a specific, predictable way this goes wrong: the model gives back column names dressed up as aliases. You ask for what people call the employee table and it hands you "TITUL_PRED_KOD." That's not a domain term; it's the very jargon the glossary exists to translate.

So you check the output. If a generated association matches a table or column name, or simply looks like an identifier (underscores, capitals, a code suffix), you re-run that table with a stricter prompt that spells out the difference between a technical name and a human word. It's a cheap, automatic guardrail: a tiny self-correction loop, the same instinct as the judge that re-checks a query. It keeps the auto-generated glossary from quietly poisoning itself.
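One way to implement that check, sketched under assumptions: a term counts as technical if it contains underscore-joined parts or if, once punctuation and case are stripped, it equals a table or column name. The `generate` callable is hypothetical; it stands in for your model call and takes a `strict` flag that selects the stricter prompt.

```python
import re

# Matches identifier-shaped strings such as TITUL_PRED_KOD or h_osoba.
IDENTIFIER_RE = re.compile(r"^[A-Za-z0-9]+(_[A-Za-z0-9]+)+$")


def _norm(text):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def looks_technical(term, schema_names):
    """True if the term is identifier-shaped or is a schema name in disguise."""
    if IDENTIFIER_RE.match(term.strip()):
        return True
    return _norm(term) in {_norm(name) for name in schema_names}


def draft_with_guard(table, columns, generate):
    """Draft terms once; on a technical-looking term, retry once in strict mode.

    `generate(strict=...)` is a hypothetical callable returning a list of terms.
    Anything still technical after the retry is dropped.
    """
    schema_names = [table, *columns]
    terms = generate(strict=False)
    if any(looks_technical(t, schema_names) for t in terms):
        terms = generate(strict=True)
    return [t for t in terms if not looks_technical(t, schema_names)]
```

The exact-match rule is deliberately blunt. A column literally named EMPLOYEE would cause the perfectly good alias "employee" to be rejected, so you may prefer to apply it only to names that contain underscores or abbreviations.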

A second, purely operational guard: generating associations for a thousand tables is a burst of model calls, and cloud providers push back on bursts. Throttle the rate, run in modest batches of ten to twenty tables, and retry with backoff on rate-limit and overload responses. Unglamorous, but it's the difference between a pipeline that finishes and one that dies halfway through a large schema.
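A rough sketch of that loop follows. `RetryableError` is a placeholder for whatever exception your provider's client raises on rate limiting or overload, and `generate` is again a hypothetical per-table model call; the batch size, pause, and backoff values are illustrative defaults, not recommendations.

```python
import random
import time


class RetryableError(Exception):
    """Placeholder for a provider's rate-limit or overload error."""


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def call_with_retry(fn, arg, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(arg), retrying retryable errors with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn(arg)
        except RetryableError:
            if attempt == max_retries:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))


def run_all(tables, generate, batch_size=15, pause=2.0, sleep=time.sleep):
    """Process tables in modest batches, pausing between batches to throttle."""
    results = {}
    for batch in batches(tables, batch_size):
        for table in batch:
            results[table] = call_with_retry(generate, table, sleep=sleep)
        sleep(pause)
    return results
```

Passing `sleep` in as a parameter is only there to make the loop testable without real waiting. In a long run you would also want to write results out after each batch, so a failure late in the schema does not cost you the earlier tables.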

Draft fast, curate where it counts

Auto-generation is a draft, not a verdict. It gets you most of the way across the whole schema in minutes instead of weeks — but the tables your real questions actually hit deserve a human pass. A domain expert skims the generated terms, fixes the ones the model guessed wrong, and adds the institutional words no sample of data could reveal (the internal nickname for a department, the acronym only insiders use).

This is the right shape for knowledge capture in general: the model drafts, the expert curates, the system persists. Auto-generation alone leaves errors in the long tail; pure hand-authoring never finishes. Together they get you a glossary that's broad and sharp, at a cost that actually scales.

And persistence matters as much here as in the hand-written case. Write the generated associations to a versioned file and re-seed them through a migration, so a database rebuild — which happens constantly in development — never wipes the work. The glossary lives in source control, not in a database that gets dropped on Friday.
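As a sketch of what "versioned file" can mean in practice, the snippet below writes the glossary as sorted JSON, so diffs in source control stay small and stable, and reads it back as (table, term) pairs that a migration could insert. The file layout and function names are assumptions; the migration itself depends on your framework and is not shown.

```python
import json
from pathlib import Path


def save_glossary(path, glossary):
    """Write {table: [terms]} as sorted, de-duplicated JSON for stable diffs."""
    normalized = {table: sorted(set(terms)) for table, terms in glossary.items()}
    text = json.dumps(normalized, indent=2, sort_keys=True, ensure_ascii=False)
    Path(path).write_text(text + "\n", encoding="utf-8")


def seed_rows(path):
    """Read the glossary file back as (table, term) pairs, ready for seeding."""
    glossary = json.loads(Path(path).read_text(encoding="utf-8"))
    return [(table, term) for table in sorted(glossary) for term in glossary[table]]
```

Because the output is sorted and de-duplicated, regenerating the same terms produces a byte-identical file, which keeps review noise down when the expert's corrections land in the same file.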

Where this goes next

Sample rows are the most available source of signal, but they're not the only one. The same pipeline can draw from wherever your domain's language already lives: data dictionaries, internal wikis, the documentation a long-serving team wrote years ago. Anywhere the meaning of H_OSOBA has already been written down in human terms, the bootstrapping process can mine it — turning scattered, half-forgotten documentation into a structured glossary the bot can actually use.

That's the real point. The most valuable knowledge in your organization is tacit and unstructured, spread across data, docs, and people's memory. Bootstrapping is how you extract it at scale without asking someone to type it all out by hand — assisted capture, curated by experts, persisted as an asset. It's what turns the association layer from a charming demo trick into something that survives contact with a real, thousand-table production database.

Facing a sprawling schema you can't possibly annotate by hand? Bootstrapping the glossary is how we make the association layer scale — model-drafted, expert-curated, version-controlled. Let's scope what it would take for yours.