Digesting a codebase before a model reads it

Across every organisation I’ve worked in, documentation is either missing or, once written, out of date. So we’ve stopped treating it as something people maintain, and instead have agents regenerate it whenever the code changes.

TL;DR

  • estate-wiki is an internal, self-updating wiki for the Jollyes backend estate: one page per Bitbucket repo, regenerated by a scheduled agent that re-reads the code on a cadence and skips any repo that hasn’t changed. The docs are a generated artefact, so they don’t get the chance to rot.
  • Each repo read produces two views from one pass: a README human view and a CLAUDE.md machine view, rendered together with a structured facts blob that agents consume via an MCP.
  • To produce better outputs, key files that are too convoluted, big, or unsafe to feed to a model raw (such as an Airflow DAG or an SSIS package) are digested deterministically first.
  • There’s a fun symmetry between each file being pre-digested into an information-rich summary and the project as a whole condensing the estate into a summary other agents query: like Russian dolls of information.

The stack

estate-wiki runs a single, provider-agnostic review agent against the Jollyes Bitbucket estate inside an ephemeral ECS task. The model call sits behind a neutral interface, so the same agent runs on OpenAI, Anthropic, Bedrock, or others. It auto-discovers repos from the workspace, so the live wiki now covers a few hundred repos, and no config edit is needed when a new one appears.
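
To make the "neutral interface" idea concrete, here is a minimal sketch of what such a seam could look like. It is not the estate-wiki code: ModelClient, EchoClient, PROVIDERS and review_repo are hypothetical names, and the stand-in client only echoes so the example runs without any vendor SDK.

```python
from dataclasses import dataclass
from typing import Protocol


class ModelClient(Protocol):
    """Anything that can turn a prompt into text (hypothetical interface)."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class EchoClient:
    """Stand-in provider for local runs; a real one would wrap a vendor SDK."""

    name: str

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] summary of {len(prompt)} chars"


# Hypothetical registry: config names a provider, the agent only sees ModelClient.
PROVIDERS: dict[str, ModelClient] = {
    "openai": EchoClient("openai"),
    "anthropic": EchoClient("anthropic"),
}


def review_repo(repo_text: str, provider: str) -> str:
    client = PROVIDERS[provider]  # unknown provider raises KeyError on purpose
    return client.complete(f"Summarise this repository:\n{repo_text}")
```

Because the agent depends only on the complete method, swapping providers is a config change rather than a code change.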

Airflow DAG
        │
        ▼
ECS Fargate task (review agent, service token)
        │  clone → HEAD → skip-if-unchanged?
        ▼
scope to git-tracked files  ──►  digesters (.dtsx, DAGs)
        │
        ▼
ONE summarise pass ──► facts JSON + human view + machine view
        │  (model chosen by config: OpenAI · Anthropic · Bedrock)
        ▼
backend REST /api/private/*  ──►  Postgres (one row per repo)
        ▲
        └── read back by AI agents over the MCP

The agent never touches the database directly. It writes through /api/private/* with a service token, so the backend stays the single Postgres writer (and reader). The digesters box is the step I want to dwell on: it runs before the model and decides what the model is given to read.
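
As a rough illustration of the skip-if-unchanged check and the write path, the sketch below compares HEAD against the last reviewed commit and builds (but does not send) an authenticated request. The route /api/private/repos/{slug}, the PUT method, the bearer-token header and the payload keys are all assumptions for the example; the real API shape is not described here.

```python
import json
import urllib.request
from typing import Optional


def needs_review(head_sha: str, last_reviewed_sha: Optional[str]) -> bool:
    """Skip the model call entirely when HEAD has not moved since the last run."""
    return head_sha != last_reviewed_sha


def build_write_request(
    base_url: str, slug: str, payload: dict, token: str
) -> urllib.request.Request:
    # Hypothetical route under /api/private/; the real path is an assumption.
    url = f"{base_url}/api/private/repos/{slug}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",  # service token, never DB credentials
            "Content-Type": "application/json",
        },
    )
```

The agent would call needs_review right after cloning, and only run the digesters and the summarise pass when it returns True.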

Digesting files before reading them

Not every file is source code you can hand directly to a model. For both SSIS packages and Airflow DAGs, we have deterministic digesters that run before any LLM call. The model never sees these files raw, which is both safer and gives it better context.

For example, SSIS (.dtsx) packages are often huge XML documents (a single MAIN.dtsx can be ~800 KB) and may contain encrypted secrets. Passing the raw XML to a model would be slow, expensive, and could expose credentials. dtsxDigest parses the package into a compact, secret-masked JSON representation containing:

  • Connection managers (the actual source and destination systems)
  • Data-flow components in execution order
  • SQL executed by each step

The result reads like a concise “source → transform → destination” pipeline rather than hundreds of kilobytes of XML markup. Interestingly, even though the model could easily fit the entire file in context, and the transformation is a simple deterministic script, runs with a digested file produce noticeably better output. The digest seems to expose the semantic structure better than the XML does: it surfaces the key logic flows and shows the model what matters most for the documentation.
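
For a feel of what a digester like this does, here is a minimal sketch that covers only the connection-manager part. It is not dtsxDigest itself: the sample package is heavily simplified, the output keys are assumptions, and the masking regex is a deliberately crude illustration rather than a complete secret scrubber.

```python
import json
import re
import xml.etree.ElementTree as ET

DTS = "www.microsoft.com/SqlServer/Dts"  # XML namespace used by .dtsx packages
SECRET = re.compile(r"(?i)(password|pwd)\s*=\s*[^;]*")

# Heavily simplified stand-in for a real package.
SAMPLE = f"""<DTS:Executable xmlns:DTS="{DTS}" DTS:ObjectName="MAIN">
  <DTS:ConnectionManagers>
    <DTS:ConnectionManager DTS:ObjectName="StockDB" DTS:CreationName="OLEDB">
      <DTS:ObjectData>
        <DTS:ConnectionManager
          DTS:ConnectionString="Data Source=sql01;User ID=etl;Password=hunter2;"/>
      </DTS:ObjectData>
    </DTS:ConnectionManager>
  </DTS:ConnectionManagers>
</DTS:Executable>"""


def attr(el, name):
    """Read a DTS-namespaced attribute."""
    return el.get(f"{{{DTS}}}{name}")


def digest(xml_text: str) -> dict:
    root = ET.fromstring(xml_text)
    connections = []
    path = f"./{{{DTS}}}ConnectionManagers/{{{DTS}}}ConnectionManager"
    for cm in root.findall(path):
        inner = cm.find(f"./{{{DTS}}}ObjectData/{{{DTS}}}ConnectionManager")
        raw = attr(inner, "ConnectionString") if inner is not None else None
        connections.append({
            "name": attr(cm, "ObjectName"),
            "type": attr(cm, "CreationName"),
            # Mask anything that looks like a password before it leaves the parser.
            "connectionString": SECRET.sub(r"\1=***", raw or ""),
        })
    return {"package": attr(root, "ObjectName"), "connections": connections}


if __name__ == "__main__":
    print(json.dumps(digest(SAMPLE), indent=2))
```

A fuller digester would walk the data-flow components and SQL statements in the same way, emitting only names, ordering and masked text.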

I find it an interesting principle to think over: use cheap, fast and deterministic parsing to decide exactly what information reaches the model, and in what shape. The expensive LLM step only receives a clean, safe, structured, information-dense representation.
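
The same principle can be applied to Airflow DAGs. The sketch below is an assumption about how such a digester might work, not the actual implementation: it parses the DAG file with Python’s ast module (so nothing is imported or executed) and keeps only the DAG id, schedule, tasks and simple `>>` dependencies. The sample DAG and the output shape are made up for illustration.

```python
import ast

SAMPLE_DAG = '''
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="stock_sync", schedule="@hourly") as dag:
    extract = BashOperator(task_id="extract", bash_command="run extract")
    load = BashOperator(task_id="load", bash_command="run load")
    extract >> load
'''


def _kw(call: ast.Call, name: str):
    """Return a keyword argument's value if it is a literal, else None."""
    for kw in call.keywords:
        if kw.arg == name and isinstance(kw.value, ast.Constant):
            return kw.value.value
    return None


def digest_dag(source: str) -> dict:
    nodes = list(ast.walk(ast.parse(source)))  # parse only; never run the file
    digest = {"dag_id": None, "schedule": None, "tasks": [], "edges": []}
    var_to_task = {}

    for node in nodes:
        if isinstance(node, ast.Call) and _kw(node, "dag_id"):
            digest["dag_id"] = _kw(node, "dag_id")
            digest["schedule"] = _kw(node, "schedule")
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Call):
            task_id = _kw(node.value, "task_id")
            target = node.targets[0]
            if task_id and isinstance(target, ast.Name):
                var_to_task[target.id] = task_id
                digest["tasks"].append(
                    {"task_id": task_id, "operator": ast.unparse(node.value.func)}
                )

    for node in nodes:
        if isinstance(node, ast.BinOp) and isinstance(node.op, ast.RShift):
            left, right = node.left, node.right
            if isinstance(left, ast.BinOp):  # a >> b >> c parses as (a >> b) >> c
                left = left.right
            if isinstance(left, ast.Name) and isinstance(right, ast.Name):
                if left.id in var_to_task and right.id in var_to_task:
                    digest["edges"].append(
                        [var_to_task[left.id], var_to_task[right.id]]
                    )
    return digest
```

This only handles literal ids and plain variable-to-variable dependencies; dynamically generated tasks or list-based dependencies would need more work.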

Two views, one source

Each repo is one Postgres row holding the Q&A, both rendered views, and a JSONB facts blob. The agent extracts the facts and renders both markdown views in a single pass: a README-flavoured human view for the helpdesk and new starters, and a CLAUDE.md-flavoured machine view for developers and AI agents.

The facts blob does three jobs: it seeds both views, it’s cheap grounding handed to the Q&A agent so it needn’t re-read a whole repo per question, and it’s machine-consumable over the MCP. Fields include languages, endpoints, env vars, data stores, integrations, deploy target, owners and key files plus category-specific dags[] and ssisPackages[].
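
A rough sketch of how a facts blob with those fields might be typed and then handed to the Q&A agent as grounding. The field names are assumptions (only dags and ssisPackages are spelled as they appear above), and grounding_prompt is a hypothetical helper.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Facts:
    # Assumed field names; only dags and ssisPackages come from the text above.
    languages: list[str] = field(default_factory=list)
    endpoints: list[str] = field(default_factory=list)
    envVars: list[str] = field(default_factory=list)
    dataStores: list[str] = field(default_factory=list)
    integrations: list[str] = field(default_factory=list)
    deployTarget: str = ""
    owners: list[str] = field(default_factory=list)
    keyFiles: list[str] = field(default_factory=list)
    dags: list[dict] = field(default_factory=list)
    ssisPackages: list[dict] = field(default_factory=list)


def grounding_prompt(facts: Facts, question: str) -> str:
    """Give the Q&A model the compact facts instead of the whole repo."""
    blob = json.dumps(asdict(facts), indent=2, sort_keys=True)
    return f"Answer using only these facts:\n{blob}\n\nQuestion: {question}"
```

The same asdict output is what would be stored in the JSONB column and served over the MCP, so all three jobs share one structure.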

The same idea extends across the estate: the whole estate condenses into one summary that other agents call over the MCP. Files digested for the model, nested inside an estate digested for other agents.

An example: stock, end to end

The wiki has a built-in Q&A per page, where the agent responds to user questions from the facts blob. Conceptually that’s simple, as the information is self-contained.

A much harder question is one that spans the whole estate, and here the wiki MCP comes into its own. I asked Claude Code: how does a “linked” pack/single SKU work, end to end? (A single 390g can, 61329, and its 12-pack, 61338, are the same product sold two ways, but stocked only as 12-packs, and counted as singles!)

Claude Code first searched the wiki via the MCP for ‘stock’, ‘parent’ and ‘child’ and found the relevant repos. From there, the agent fanned out roughly 30 sub-agents across 17 repos, first to build the flow and then to adversarially verify each part of the final claim. The whole review cost a couple of million tokens. That was cheap because small models swept the repos and bigger models only opened and read the key files. We dug out a complex multi-system, multi-repo flow within 15 minutes.

(What’s more, the entire finding was then verified using the SQL MCP against our allocation and stock data, and double-checked against live point-of-sale (POS) APIs with a temporary token. Finally, the whole write-up was emailed to me from within the CLI using a new draft_email → send_email tool chain in the SQL MCP (from claude@jollyes.com)!)

Closing

The ‘win’ isn’t that an agent can write your documentation. It’s that the documentation is never stale, and is readily available to downstream consumers in a useful format.

The stock example is where the system stops being just documentation and starts becoming operational infrastructure. That level of analysis is only possible because the estate has already been indexed, digested, and made searchable. The MCP compresses hundreds of pages of organisational knowledge (derived from millions of lines of code) into a form that an agent can navigate in seconds.

The wiki and MCP serve complementary roles. The wiki keeps the documentation accurate and discoverable; the MCP turns that knowledge into a machine-accessible interface. Together they allow agents to self-serve information at low cost and high speed to complete tasks more efficiently and accurately.

Written on June 9, 2026