Digesting a codebase before a model reads it

In every organisation I’ve worked in, documentation is either missing or, once written, quickly out of date. So we’ve stopped treating it as something people maintain and made agents regenerate it whenever the code changes.

TL;DR

estate-wiki is an internal, self-updating wiki for the Jollyes backend estate: one page per Bitbucket repo, regenerated by a scheduled agent that re-reads the code on a cadence, skipping repos that haven’t changed. The docs are a generated artefact, so they cannot rot.
Each repo read produces two views from one pass: a README human view and a CLAUDE.md machine view, rendered together with a structured facts blob that agents consume via an MCP.
Key files that are too big, convoluted, or unsafe to feed in raw (such as DAGs or SSIS packages) are digested deterministically first, which improves the output.
There’s a fun symmetry between each file being pre-digested into an information-rich summary and the project as a whole condensing the estate into a summary other agents query: like Russian dolls of information.

The stack

estate-wiki runs a single, provider-agnostic review agent against the Jollyes Bitbucket estate inside an ephemeral ECS task. The model call sits behind a neutral interface, so the same agent runs on OpenAI, Anthropic, or others. It auto-discovers repos from the workspace, so the live wiki now covers a few hundred repos, and a new repo is picked up with no config edit.
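As a rough illustration of what "behind a neutral interface" can look like, here is a minimal sketch. Every name in it (ModelClient, EchoClient, PROVIDERS, client_from_config) is hypothetical, and the stand-in client only echoes its input; a real adapter would wrap the vendor SDK behind the same method.

```python
from dataclasses import dataclass
from typing import Protocol


class ModelClient(Protocol):
    """Neutral interface: the agent only ever calls complete()."""

    def complete(self, system: str, prompt: str) -> str: ...


@dataclass
class EchoClient:
    """Stand-in provider for local runs; a real one would wrap a vendor SDK."""

    name: str

    def complete(self, system: str, prompt: str) -> str:
        return f"[{self.name}] {prompt[:40]}"


# Hypothetical registry keyed by a config value; real entries would build
# provider-specific adapters instead of EchoClient.
PROVIDERS = {
    "openai": lambda: EchoClient("openai"),
    "anthropic": lambda: EchoClient("anthropic"),
}


def client_from_config(config: dict) -> ModelClient:
    """Pick the provider named in config, failing loudly on unknown names."""
    provider = config["provider"]
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider}")
    return PROVIDERS[provider]()


def summarise(client: ModelClient, digest: str) -> str:
    """The agent code depends only on the interface, never on a vendor."""
    return client.complete("Summarise this repo.", digest)
```

Swapping providers is then a config change rather than a code change: summarise(client_from_config({"provider": "anthropic"}), digest) and the same call with "openai" go through identical agent code.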

Airflow DAG
        │
        ▼
ECS Fargate task (review agent, service token)
        │  clone → HEAD → skip-if-unchanged?
        ▼
scope to git-tracked files  ──►  digesters (.dtsx, DAGs)
        │
        ▼
ONE summarise pass ──► facts JSON + human view + machine view
        │  (model chosen by config: OpenAI · Anthropic · Bedrock)
        ▼
backend REST /api/private/*  ──►  Postgres (one row per repo)
        ▲
        └── read back by AI agents over the MCP

The agent never touches the database directly. It writes through /api/private/* with a service token, so the backend stays the single Postgres writer (and reader). The digesters box is the step I want to dwell on: it runs before the model and decides what the model is given to read.
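The clone → HEAD → skip-if-unchanged step in the diagram can be sketched as below. This is an assumption-laden sketch, not the agent’s code: it assumes the backend row remembers the commit SHA of the last successful read, and the function names are made up for illustration. The git commands themselves (rev-parse HEAD and ls-files) are standard.

```python
import subprocess
from typing import Optional


def head_sha(repo_dir: str) -> str:
    """Return the commit SHA that HEAD points at in a local clone."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def tracked_files(repo_dir: str) -> list[str]:
    """Scope the read to git-tracked files, ignoring build output and junk."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "ls-files"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()


def should_regenerate(current_sha: str, stored_sha: Optional[str]) -> bool:
    """Regenerate when the repo has never been read or HEAD has moved.

    stored_sha is assumed to come from the backend row for this repo.
    """
    return stored_sha is None or stored_sha != current_sha
```

The comparison is deliberately cheap: an unchanged repo costs one clone and one string comparison, and no model call.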

Digesting files before reading them

Not every file is source code you can hand directly to a model. For both SSIS packages and Airflow DAGs, we have deterministic digesters that run before any LLM call. The model never sees these files raw, which is both safer and gives it better context.

For example, SSIS (.dtsx) packages are often huge XML documents (a single MAIN.dtsx can be ~800 KB) and may contain encrypted secrets. Passing the raw XML to a model would be slow, expensive, and could expose credentials. dtsxDigest parses the package into a compact, secret-masked JSON representation containing:

Connection managers (the actual source and destination systems)
Data-flow components in execution order
SQL executed by each step

The result reads like a concise “source → transform → destination” pipeline rather than hundreds of kilobytes of XML markup. Interestingly, even though the model could easily read the entire file into context, and the transformation is a simple deterministic script, runs with a digested file produce significantly better output. It seems the digest structures the data better for the model than the raw XML does: it organises the logic semantically, surfacing the key flows the documentation needs to capture.
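To make the idea concrete, here is a minimal sketch of a digester in the same spirit. It is not dtsxDigest itself: the sample package is a heavily simplified stand-in, the output shape is an assumption, and it only covers connection managers and Execute SQL tasks. It uses only the Python standard library.

```python
import re
import xml.etree.ElementTree as ET

DTS = "www.microsoft.com/SqlServer/Dts"
SQLTASK = "www.microsoft.com/sqlserver/dts/tasks/sqltask"

# Simplified stand-in for a real package, which would be far larger.
SAMPLE = f"""
<DTS:Executable xmlns:DTS="{DTS}" xmlns:SQLTask="{SQLTASK}" DTS:ObjectName="MAIN">
  <DTS:ConnectionManagers>
    <DTS:ConnectionManager DTS:ObjectName="Warehouse" DTS:CreationName="OLEDB">
      <DTS:ObjectData>
        <DTS:ConnectionManager
            DTS:ConnectionString="Data Source=dw01;User ID=etl;Password=hunter2;" />
      </DTS:ObjectData>
    </DTS:ConnectionManager>
  </DTS:ConnectionManagers>
  <DTS:Executables>
    <DTS:Executable DTS:ObjectName="Load stock">
      <DTS:ObjectData>
        <SQLTask:SqlTaskData SQLTask:SqlStatementSource="EXEC dbo.LoadStock" />
      </DTS:ObjectData>
    </DTS:Executable>
  </DTS:Executables>
</DTS:Executable>
"""

# Matches "Password=..." or "Pwd=..." up to the next semicolon.
SECRET = re.compile(r"(?i)\b(password|pwd)\s*=\s*[^;]*")


def mask(connection_string: str) -> str:
    """Replace any password value with *** before it can reach a model."""
    return SECRET.sub(lambda m: f"{m.group(1)}=***", connection_string)


def digest(xml_text: str) -> dict:
    """Reduce a package to its connections and the SQL each task runs."""
    root = ET.fromstring(xml_text.strip())
    connections = []
    for cm in root.findall(f"{{{DTS}}}ConnectionManagers/{{{DTS}}}ConnectionManager"):
        inner = cm.find(f"{{{DTS}}}ObjectData/{{{DTS}}}ConnectionManager")
        raw = inner.get(f"{{{DTS}}}ConnectionString", "") if inner is not None else ""
        connections.append({
            "name": cm.get(f"{{{DTS}}}ObjectName"),
            "type": cm.get(f"{{{DTS}}}CreationName"),
            "connection": mask(raw),
        })
    steps = []
    # Document order only; true execution order depends on precedence
    # constraints, which this sketch ignores.
    for task in root.iter(f"{{{DTS}}}Executable"):
        sql = task.find(f"{{{DTS}}}ObjectData/{{{SQLTASK}}}SqlTaskData")
        if sql is not None:
            steps.append({
                "name": task.get(f"{{{DTS}}}ObjectName"),
                "sql": sql.get(f"{{{SQLTASK}}}SqlStatementSource"),
            })
    return {"connections": connections, "steps": steps}
```

Calling digest(SAMPLE) returns one connection named Warehouse with its password masked, and one step carrying the SQL it executes. The masking happens in the parser, so a secret never depends on the model behaving well.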

I find it an interesting principle to think over: use cheap, fast and deterministic parsing to decide exactly what information reaches the model, and in what shape. The expensive LLM step only receives a clean, safe, structured, information-dense representation.
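The same principle applies to Airflow DAGs. One way to digest a DAG file without importing or running it is to walk its syntax tree. The sketch below is an assumption about how such a digester could work, not the one in estate-wiki: it only recognises literal dag_id and task_id keyword arguments and simple `a >> b` chains between named variables.

```python
import ast

# The DAG source is treated as text and is never imported or executed.
SAMPLE_DAG = '''
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="stock_sync", schedule="@daily") as dag:
    extract = BashOperator(task_id="extract", bash_command="run extract")
    load = BashOperator(task_id="load", bash_command="run load")
    extract >> load
'''


def _kwarg(call: ast.Call, name: str):
    """Return the literal value of keyword `name`, or None if absent."""
    for kw in call.keywords:
        if kw.arg == name and isinstance(kw.value, ast.Constant):
            return kw.value.value
    return None


def _chain(node) -> list:
    """Flatten `a >> b >> c` into ['a', 'b', 'c']; other shapes give []."""
    if isinstance(node, ast.Name):
        return [node.id]
    if (isinstance(node, ast.BinOp) and isinstance(node.op, ast.RShift)
            and isinstance(node.right, ast.Name)):
        left = _chain(node.left)
        return left + [node.right.id] if left else []
    return []


def digest_dag(source: str) -> dict:
    """Extract DAG ids, task ids and dependencies from DAG source text."""
    tree = ast.parse(source)
    calls = [n for n in ast.walk(tree) if isinstance(n, ast.Call)]
    dag_ids = [v for c in calls if (v := _kwarg(c, "dag_id")) is not None]

    # Map variable names to task ids, e.g. extract -> "extract".
    var_to_task = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Call):
            task_id = _kwarg(node.value, "task_id")
            if task_id is not None:
                for target in node.targets:
                    if isinstance(target, ast.Name):
                        var_to_task[target.id] = task_id

    # Resolve `upstream >> downstream` statements into task-id edges.
    edges = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Expr):
            names = _chain(node.value)
            for up, down in zip(names, names[1:]):
                if up in var_to_task and down in var_to_task:
                    edges.append([var_to_task[up], var_to_task[down]])

    return {"dag_ids": dag_ids, "tasks": sorted(var_to_task.values()), "edges": edges}
```

Because the file is only parsed, a DAG with heavy imports or side effects at module level is as safe to digest as an empty one. The cost is coverage: dynamically generated tasks are invisible to this approach.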

Two views, one source

Each repo is one Postgres row holding both rendered views plus a facts JSONB. The agent extracts the facts and renders both markdowns in a single pass: a README-flavoured human view for the helpdesk and new starters, and a CLAUDE.md-flavoured machine view for developers and AI agents.

The facts blob does three jobs: it seeds both views, it gives the Q&A agent cheap grounding so it needn’t re-read a whole repo per question, and it’s machine-consumable over the MCP. Fields include languages, endpoints, env vars, data stores, integrations, deploy target, owners, and key files, plus category-specific dags[] and ssisPackages[].

The same idea extends across the estate: the whole estate condenses into one summary that other agents call over the MCP. Files digested for the model, nested inside an estate digested for other agents.

An example: stock, end to end

The wiki has a built-in Q&A per page, where the agent responds to user questions from the facts blob. Conceptually that’s simple, as the information is self-contained.

A much harder question is one that spans the whole estate, and here the wiki MCP comes into its own. I asked Claude Code: how does a “linked” pack/single SKU work, end to end? (A single 390g can, 61329, and its 12-pack, 61338, are the same product sold two ways, but stocked only as 12-packs, and counted as singles!)

Claude Code first searched across the wiki via the MCP for ‘stock’, ‘parent’ and ‘child’ and found the relevant repos. From there, the agent fanned out roughly 30 sub-agents across 17 repos, first to build the flow and then to adversarially verify each part of the final claim. The whole review cost a couple of million tokens, which is cheap because small models did the broad sweep and bigger models only opened the key files. We dug out a complex multi-system, multi-repo flow within 15 minutes.

(What’s more, the entire finding was then verified using the SQL MCP against our allocation and stock data, and double-checked against live point-of-sale (POS) APIs with a temporary token. Finally, the whole write-up was emailed to me from claude@jollyes.com, from within the CLI, using a new draft_email → send_email tool chain in the SQL MCP!)

Closing

The ‘win’ isn’t that an agent can write your documentation. It’s that the documentation is never stale, in a format ready for agents to use. The wiki keeps it accurate and discoverable for people; the MCP turns the same knowledge into an interface for machines.

The stock example is where the system stops being documentation and starts becoming operational infrastructure. That level of analysis is only possible because the estate has already been indexed, digested, and made searchable: we compress hundreds of pages of organisational knowledge, derived from millions of lines of code, into a form an agent can navigate in seconds.

The principle is the same at every scale: spend cheap, deterministic effort deciding what reaches the model, and in which format, and the expensive model will do better work. Digest the file for the agent; digest the estate for the agents that follow.

Written on June 9, 2026