
There's a moment in every AI-on-data project that we now recognize instantly.
Your team hooked Claude to the warehouse and immediately discovered that the agent writes confident, syntactically correct SQL that returns the wrong answer – because it doesn't know that "revenue" needs to exclude free trials, or that the orders table double-counts everything before a pipeline fix landed in March, or that "active user" has three definitions and nobody ever agreed on which one wins.
So someone on the team has the insight: the model needs business context. Not just schema access – actual institutional knowledge about what the data means.
This is the right insight. What happens next is where it gets spicy.
"How hard can it be?"
The natural instinct – especially at engineering-led companies – is to build it. And the logic is good: "We know our data better than anyone. We'll stand up a knowledge base, document our metrics, feed it to the model. Give us a couple of weeks."
Your team absolutely can build this. The question is what it actually costs – not in dollars, but in time, focus, and ongoing maintenance burden. We've now talked to enough teams who've gone down this road to describe the common stages with uncomfortable precision.
Stage 1: Enthusiasm. Someone creates a GitHub repo. The team fires up Claude Code and starts documenting metrics. There's energy. The top 20 metrics get written up, and someone builds a script to feed them into the LLM's context window. Accuracy improves noticeably on those 20 metrics. Your team has always wanted to build a semantic layer but never had leadership buy-in – and now that it's called "context," leadership is suddenly convinced it's worthwhile.
Stage 2: The long tail. Turns out 20 metrics isn't enough. Real users don't ask about "revenue" – they ask about "revenue from enterprise accounts excluding the Japan market for the last full quarter, and can you compare that to the same period last year but adjust for the pricing change we made in February?" Every question surfaces five more pieces of context nobody documented. And what even counts as a single metric – is there one revenue metric or fifteen? Requests in the #data-questions Slack channel pile up faster than the team can respond. Sound familiar?
Stage 3: The quality crisis. Someone discovers that three of the documented definitions are wrong. Not obviously wrong – subtly wrong, in ways that produced plausible but incorrect numbers for weeks. Subtly wrong is much worse than obviously wrong: one of those numbers made it into a board deck. The team realizes they need a way to test the context, not just write it. Someone starts building an eval framework. This was not in the original scope. (Cynically, some team members may even welcome it – another line for the AI-focused resume.)
Stage 4: The security crisis. One of the security engineers gets wind of what the data team has been up to and realizes there's no governance on what's in the knowledge base. The context layer was largely written by Claude using a single highly privileged user's credentials to pull from Slack, GitHub, Linear, Confluence, and HubSpot. It's got raw data, PII, private notes, the whole shebang. Whoops.
Stage 5: The maintenance problem. The data warehouse didn't freeze while you were documenting it. Since the initial context layer landed, your company has launched a major new business model in a new geography. New tables appeared, and schema migrations for the new business renamed key columns. The dbt model that defined qualified_lead changed. The context layer you built is already drifting. An engineer spends a week writing a drift detection script – something like the sketch below. It catches some things but not others.
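For a sense of what that script looks like – and why it's incomplete – here's a minimal sketch. It assumes context lives in YAML files that list the tables and columns they describe, plus a DB-API connection to a warehouse with a standard information_schema; the file layout and helper names are illustrative:

```python
import glob
import yaml  # pip install pyyaml

def warehouse_columns(conn):
    """Return {table: set(columns)} from the warehouse's information_schema."""
    cur = conn.cursor()
    cur.execute("SELECT table_name, column_name FROM information_schema.columns")
    schema = {}
    for table, column in cur.fetchall():
        schema.setdefault(table, set()).add(column)
    return schema

def check_drift(conn, context_dir="context"):
    """Flag documented tables/columns that no longer exist in the warehouse."""
    schema = warehouse_columns(conn)
    problems = []
    for path in glob.glob(f"{context_dir}/**/*.yaml", recursive=True):
        with open(path) as f:
            doc = yaml.safe_load(f) or {}
        for table, columns in (doc.get("tables") or {}).items():
            if table not in schema:
                problems.append(f"{path}: table '{table}' no longer exists")
            else:
                missing = set(columns) - schema[table]
                problems.extend(f"{path}: column '{table}.{c}' is gone" for c in missing)
    return problems
```

This flags dropped tables and renamed columns. What it can't flag is the harder kind of drift: a column that keeps its name while its meaning changes underneath it.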
Stage 6: The political problem. The data team documented how they define metrics. But finance has different definitions. Product has different definitions. Ops has definitions that differ in subtle but critical ways – and they're all correct for their respective purposes. Nobody anticipated that the context layer would need to capture whose definition to use when. This isn't a technical problem – it's an organizational one. Of course, the non-technical folks don't even have access to the context in GitHub, so they start spamming Slack with screenshots of problems. Now a project manager gets pulled in to hash out who owns which metric, plus a process for requesting updates from the data team. Tasks end up scattered across Slack, Jira, Google Sheets, and GitHub.
Stage 7: Quiet abandonment. The context layer exists, but it's 60% complete, partially stale, and nobody's job to maintain. This didn't work. The analytics team is still in the loop because 10%+ of answers still need review. You're back where you started – minus a quarter of engineering time, with a demoralized team watching its cool AI project get silently canned, and with a far higher volume of data questions, because stakeholders have gotten a taste of self-serve and now can't have it. Now what?

Not every team hits all seven stages. Some get stuck at stage 3 and course-correct. Some power through to something that works. But the pattern is common enough that we think it's worth naming: the data context trap.
Why this is harder than it looks
The reason teams underestimate this isn't a lack of talent – it's that the problem looks like a documentation project. It's actually three separate hard problems stacked on top of each other.
Problem 1: Collection is much harder than pointing Claude Code at a few sources
The instinct is to point an LLM at your existing docs and call it context. This gets you surprisingly far on day one – and then you discover why it's not enough.
Your context is scattered across dbt models, Confluence pages, Slack threads, SQL comments, and people's heads. Your LookML repo has hundreds of files and the LLM ignores most of them.
Different sources disagree with each other – and the model just picks whichever source it retrieved and runs with it. Without a scalable pipeline that can process all of the content and resolve conflicts, your context layer is just a pile of vibe-coded Markdown files.
And then there's permissions. Whoever builds the extraction pipeline – usually your most senior data engineer – has broad access. The context layer inherits that access implicitly. Every user of the AI agent can now surface institutional knowledge that was never meant to be broadly accessible. When security discovers this, the project stalls while you retrofit an authorization layer that was never in the original design.
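One way teams avoid that retrofit – sketched here with hypothetical names, not as a prescription – is to record the source's access policy on every snippet at collection time and filter per user at query time, instead of letting the pipeline's credentials decide what everyone sees:

```python
from dataclasses import dataclass, field

@dataclass
class ContextSnippet:
    text: str
    source: str  # provenance, e.g. "confluence://finance/revenue-def"
    allowed_groups: frozenset = field(default_factory=frozenset)

def visible_context(snippets, user_groups):
    """Only surface snippets the requesting user could have read at the source."""
    return [s for s in snippets if s.allowed_groups & user_groups]

snippets = [
    ContextSnippet("Revenue excludes free trials.", "dbt://models/revenue.yml",
                   frozenset({"all-employees"})),
    ContextSnippet("Q3 churn postmortem (exec-only).", "confluence://exec/churn",
                   frozenset({"exec"})),
]
# An analyst in "all-employees" sees only the first snippet.
print([s.text for s in visible_context(snippets, {"all-employees"})])
```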
Problem 2: Validating that context is right is hard
This is the part that bites hardest. Without context, an AI agent gives uncertain answers – and users learn to double-check. With wrong context, it gives confident answers that happen to be incorrect. The blast radius is much larger. And manually reviewing a hundred Markdown files to confirm each one is right is basically impossible.
Even if you do review them all, keeping context correct is genuinely hard. Definitions conflict across teams. Source tables change without notice. Business logic evolves quarter to quarter. You can't just write context once and ship it.
The answer is that you need continuous validation – real evals that test whether the context actually produces correct answers on real questions, run against live data, automatically. A wiki doesn't do this. A Notion page doesn't do this. Even a well-maintained dbt docs site doesn't do this. You need an expert-validated eval framework that comprehensively and continuously tests your knowledge base.
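As a rough sketch of what "eval" means here – the class and names are illustrative, and `agent` stands in for whatever system answers questions over your warehouse:

```python
from dataclasses import dataclass

@dataclass
class ContextEval:
    question: str          # a question real users ask
    ground_truth_sql: str  # a query the team already trusts, e.g. from a dashboard
    tolerance: float = 0.005  # allow 0.5% drift before alerting

def run_eval(ev, conn, agent):
    """Compare the agent's answer against the trusted query, on live data."""
    cur = conn.cursor()
    cur.execute(ev.ground_truth_sql)
    expected = float(cur.fetchone()[0])
    actual = float(agent.answer(ev.question))
    if abs(actual - expected) > ev.tolerance * abs(expected):
        raise AssertionError(
            f"{ev.question!r}: agent said {actual:,.2f}, trusted query says {expected:,.2f}"
        )

# Run nightly over every documented metric, e.g.:
# run_eval(ContextEval(
#     question="What was net revenue last month?",
#     ground_truth_sql="SELECT net_revenue FROM kpi_monthly WHERE month = '2025-05-01'",
# ), conn, agent)
```

The comparison itself is trivial; the point is that it runs automatically, so a stale definition surfaces as a failing eval instead of a confident wrong answer in someone's deck.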
Problem 3: Maintenance is a full-time job that nobody wants
Let's be honest about what maintaining a context layer actually means. Every time a schema changes, someone needs to check if the context is still valid. Every time a business definition evolves, someone needs to update the documentation and verify it. Every time a new table appears, someone needs to document what it contains and how it relates to existing tables.
This is unglamorous, ongoing work. It doesn't ship features. It doesn't produce impressive demos. It doesn't advance anyone's career. And yet, without it, the context layer starts decaying from day one. We've seen context layers go from "helpful" to "actively harmful" in as little as three months when maintenance lapses.
The companies that build this successfully in-house are the ones that can dedicate a full-time role to it. Most can't justify that – especially when the person maintaining the context layer could be doing actual analysis instead.

The compound problem
Here's what makes this especially tricky: you can't solve two of these problems and skate on the third. It's a three-legged stool – lose any one leg and the whole thing falls over.

If collection is incomplete, validation can't catch what's missing. If validation isn't automated, maintenance becomes manual review of every piece of context after every change. If maintenance lapses, the context you collected becomes actively dangerous.
You can't solve them independently. And solving them together – at the level of reliability needed for production AI agents – is a significant engineering effort. We know because it's what we've spent years building.
What we learned building Delphina
This is the part where we talk about what we built. We didn't start by building a product. We started by living this problem – first at Uber, then by working with design partners who were going through exactly the cycle we described above.
A few things we learned the hard way:
- Context has to be systematically extracted, not just authored. The majority of useful context already exists somewhere – in your dbt models, your dashboard definitions, your query logs, your existing documentation. Asking humans to write everything from scratch is why wikis die. The foundation should be automated extraction, with human input reserved for the knowledge that genuinely only lives in people's heads. But just pointing a coding agent at a repo isn't enough – you need a systematic, scalable extraction process (see the sketch after this list).
- Validation has to be continuous, not one-time. We generate evals from your existing dashboards and reports – the numbers your team already trusts. Then we run those evals continuously against live data. When the agent's answer drifts from the established ground truth, we can identify exactly which piece of context needs updating and why.
- Maintenance has to be automated by default. Schema changes, new tables, evolving definitions – the context layer has to monitor the data environment and update itself incrementally. Humans only enter the loop when human judgment is genuinely required: resolving conflicting definitions, confirming business logic changes, or validating edge cases.
- The org-wide collection problem has a solution. When domain experts can contribute through channels they already use, the knowledge actually flows. For many users, this means reviewing AI-suggested definitions in real time, confirming or correcting context in a lightweight interface, or validating answers in Slack. The trick isn't building a new tool for people to use. It's meeting them in their existing workflow.
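To make "systematic extraction" concrete, here's a minimal sketch that harvests the descriptions already sitting in dbt schema files into context snippets. Real pipelines also cover query logs, dashboards, and BI definitions; the project path below is illustrative:

```python
import glob
import yaml  # pip install pyyaml

def extract_dbt_context(project_dir="my_dbt_project"):
    """Yield (model, column, description) for everything dbt already documents."""
    for path in glob.glob(f"{project_dir}/models/**/*.yml", recursive=True):
        with open(path) as f:
            doc = yaml.safe_load(f) or {}
        for model in doc.get("models", []):
            if model.get("description"):
                yield model["name"], None, model["description"]
            for col in model.get("columns", []):
                if col.get("description"):
                    yield model["name"], col["name"], col["description"]

# Each snippet keeps its provenance, so it can be audited and re-extracted later.
for model, column, desc in extract_dbt_context():
    target = f"{model}.{column}" if column else model
    print(f"{target}: {desc}")
```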
When building it yourself makes sense
We want to be honest here – there are teams that build this in-house and make it work. They tend to share a few characteristics:
- They have engineers to spare. Not "we could theoretically reallocate someone" – actually spare, as in the opportunity cost of dedicating 2-3 engineers to context infrastructure for 6+ months is genuinely low.
- They treat it as a product, not a project. It gets a roadmap, an owner, and ongoing investment – not a one-time sprint.
- They've already solved the organizational problem. They have established processes for metric governance across teams, and the political work of aligning definitions is done or well underway.
If that's you – genuinely – then building in-house might be the right call. You know your data, you have the resources, and you're willing to commit to the ongoing maintenance.
For everyone else, it's an opportunity cost question
We built Delphina so your data team can focus on what they're actually good at.
Not because they can't build context infrastructure – but because they shouldn't have to.
If you're in month 1 of the cycle we described above and wondering whether to keep going – or if you're in month 5 and looking for a way out – we'd love to talk. Delphina can turn around a context layer in a couple of hours – and make sure it stays correct.