The economics of code reuse in AI development: a three-point benchmark

Referencing code takes ~14× fewer output tokens than regenerating it at the largest size tested, and is ~1.8× cheaper in billed dollars: measured, with the methodology and the losses published.

Every AI development tool claims to save you money. Almost none of them show you a measurement. I build Stellify, an AI-native Laravel platform, and before marketing its economics I decided to benchmark them properly — against Claude Code working directly on plain files, in a fair environment, with the numbers published whichever way they came out.

The claim under test

I want to be precise about scope, because it's where most tool benchmarks go wrong. I have never claimed Stellify generates new code more cheaply than a file agent, or edits code more cheaply. The one efficiency claim I make is about reuse: once functionality exists in Stellify, the next build that needs it shouldn't pay to generate it again. Stellify stores code as structured data — files, methods and statements as linked records — so proven functionality can be pulled into new work without passing through the model at all.

If that's true, it should show up as a specific, falsifiable pattern: the cost of reusing a module should be roughly flat regardless of the module's size, while the cost of building the same module from scratch should scale with the code. That's the shape this benchmark went looking for.

Method

Both arms performed the identical task to completion: deliver a working module of a given size into a project. One arm was Claude Code working natively on files, using its Read and Edit tools with prompt caching enabled (I wanted its best game, not a strawman). The other arm reused the module already present in Stellify's corpus via MCP. Both arms used the same model, Claude Opus 4.8 (claude-opus-4-8), pinned in every run. Token counts were taken from Claude Code's own OpenTelemetry output, and billed dollars were recorded alongside raw tokens, because tokens alone flatter whichever arm caches better.

Every reuse run was forensically verified before I accepted its numbers. A per-message breakdown of output tokens proved that the reuse arm emitted no module code (its ~1,200–1,550 tokens are wiring and integration only). I also verified that the module the new project consumed is identical to the corpus original: assembled through the production pipeline, compared byte-for-byte, with the corpus confirmed unchanged after every consumer build. The reuse arm was run under two distinct mechanisms (more on this below), twice per module each, and the two mechanisms mutually confirm each other's numbers. The evidence pack, methodology notes, prompts and run outputs are published in the repository linked at the end of this article.
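As an illustration of the kind of identity check described above, here is a minimal sketch. It is not the published harness: the function names and the idea of comparing SHA-256 digests of in-memory byte strings are my assumptions for the example, and the real verification runs through the production assembly pipeline.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 hex digest, usable as a recordable fingerprint of a byte string."""
    return hashlib.sha256(data).hexdigest()


def verify_reuse(original: bytes, assembled: bytes,
                 corpus_before: bytes, corpus_after: bytes) -> bool:
    """Hypothetical check: the consumed module matches the corpus original,
    and the corpus snapshot is unchanged after the consumer build."""
    module_identical = digest(original) == digest(assembled)
    corpus_untouched = digest(corpus_before) == digest(corpus_after)
    return module_identical and corpus_untouched
```

Recording digests rather than only a true/false comparison means the fingerprints can be kept alongside the run outputs as evidence.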

Three module sizes, to establish a curve rather than a data point.

The result

| Module | Reuse: output tokens | From scratch: output tokens | Advantage (tokens) | Reuse: cost | From scratch: cost | Cost outcome |
|---|---|---|---|---|---|---|
| Report, ~130 lines | 1,536 | 3,067 | 2.0× | $0.38 | $0.22 | reuse loses |
| Payroll, ~641 lines | 1,196–1,316 | 13,209 | ~10× | $0.29–0.38 | $0.59 | reuse up to 2× cheaper |
| Ledger, ~1,174 lines | 1,249–1,545 | 22,292 | ~14× | $0.50–0.53 | $0.92 | reuse ~1.8× cheaper |
Figure: Output tokens against module size in lines of code (130, 641 and 1,174). The flat reuse band (~1,200–1,550 tokens) sits against the rising from-scratch generation line, which scales with the code; the advantage is 2.0×, ~10× and ~14× at the three sizes.

The story is the flat column. Reusing a module costs roughly 1,200–1,550 generation tokens whether the module is 130 lines or 1,200 — every one of nine reuse runs across the whole benchmark, under both mechanisms, landed inside that band. Building the same module from scratch costs what generation always costs: it scales with the code, at roughly 19–24 output tokens per line at every size we measured.

Which means the advantage isn't a percentage. It's a curve. 2× at 130 lines, ~10× at 641, ~14× at 1,174 — and nothing in the data suggests a ceiling; the ratio simply widens with the size of the functionality you're not regenerating.
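The ratios and the per-line figures can be recomputed from the published table. The sketch below does that; the numbers are copied from the table, while the dictionary layout and function names are mine. For the advantage it divides by the highest reuse figure in each range, which gives the least flattering ratio.

```python
# Output-token figures copied from the results table.
# Reuse values are (low, high) ranges; the Report module has a single value.
RESULTS = {
    "Report":  {"lines": 130,  "reuse": (1536, 1536), "scratch": 3067},
    "Payroll": {"lines": 641,  "reuse": (1196, 1316), "scratch": 13209},
    "Ledger":  {"lines": 1174, "reuse": (1249, 1545), "scratch": 22292},
}


def tokens_per_line(row):
    """From-scratch output tokens divided by module size in lines."""
    return row["scratch"] / row["lines"]


def conservative_advantage(row):
    """From-scratch tokens over the highest reuse figure in the range."""
    return row["scratch"] / max(row["reuse"])


for name, row in RESULTS.items():
    print(f"{name:8s} {tokens_per_line(row):5.1f} tokens/line, "
          f"advantage at least {conservative_advantage(row):4.1f}x")
```

On these inputs the per-line cost comes out at about 23.6, 20.6 and 19.0 tokens, and the conservative advantage at about 2.0×, 10.0× and 14.4×.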

The boundary, stated plainly

Look at the first row. At small module sizes, reuse loses on cost — the fixed overhead of the MCP round-trips outweighs the generation it avoids. The blended-cost crossover sits somewhere around 400–500 lines (interpolated between our measured points, so treat it as approximate). Below it, a plain file agent is the cheaper tool. Above it, reuse wins outright and keeps widening.

The same honesty applies beyond this benchmark's scope. In earlier tests this week, Stellify's MCP path was more expensive than Claude Code's native tools for greenfield generation of small builds, and meaningfully more expensive for repeated edits to existing code — the current edit tooling works at method granularity where Claude Code's Edit swaps a two-line snippet, and that gap is a known item on our roadmap (statement-level edit operations; the storage model is already statement-level, the tooling isn't yet). Those numbers are in the appendix. I'm publishing them for a simple reason: the cheapest way to reproduce this benchmark is a small one, and if you run a small case you will get the opposite result to the headline. You should know that before you start, not discover it and conclude the rest was cherry-picked.

Tokens, in other words, are not the pitch. The pitch is what the structure does — and the token curve is one measurable consequence of it.

Why the reuse line is flat

Because the module never passes through the model — and as of this week, because nothing is even copied.

When a build reuses existing functionality, Stellify references the canonical module: the consumer project holds lightweight reference edges that resolve to the original records at assembly time. In the benchmarked runs, reusing a ~1,174-line module wrote 60 reference edges and a few file shells — zero method, statement or clause rows duplicated. The model emits only the wiring that connects the module in, which is why the cost is flat regardless of what's inside it. The canonical original is untouched by consumer builds — we verify this byte-for-byte after every run.

And when a consumer changes a referenced module, copy-on-write kicks in at the point of divergence: edit one statement and only that statement forks locally; edit a method and the method materialises for you while its untouched siblings stay referenced; the canonical module is never mutated by a referrer's edit. Borrow freely, diverge precisely, original intact — a file agent structurally cannot do this, because with files every borrow is a copy and every divergence is a rewrite.
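To make the reference-plus-copy-on-write idea concrete, here is a toy model in Python. It is a sketch under heavy assumptions, not Stellify's schema or API: the class names, the statement-only granularity and the in-memory dictionaries are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Statement:
    """An immutable statement record; frozen so a referrer cannot mutate it."""
    id: str
    code: str


@dataclass
class CanonicalModule:
    """Hypothetical canonical module: an ordered list of statement records."""
    statements: list[Statement]


@dataclass
class ConsumerView:
    """A consumer's view: references to the canonical module plus local forks."""
    canonical: CanonicalModule
    forks: dict[str, Statement] = field(default_factory=dict)

    def edit(self, statement_id: str, new_code: str) -> None:
        """Copy-on-write: store a local fork of the one edited statement."""
        if not any(s.id == statement_id for s in self.canonical.statements):
            raise KeyError(statement_id)
        self.forks[statement_id] = Statement(statement_id, new_code)

    def assemble(self) -> str:
        """Resolve references at assembly time, preferring a local fork."""
        return "\n".join(
            self.forks.get(s.id, s).code for s in self.canonical.statements
        )


ledger = CanonicalModule([
    Statement("s1", "$total = 0;"),
    Statement("s2", "return $total;"),
])
view = ConsumerView(ledger)
other_view = ConsumerView(ledger)
view.edit("s2", "return round($total, 2);")
```

After the edit, `view` holds exactly one forked statement, `other_view` still assembles the original code, and the canonical `ledger` records are unchanged.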

Earlier runs in this benchmark used Stellify's previous mechanism — a record-level copy — and produced the same token band, which is itself worth noting: the economics come from the module bypassing the model, under either mechanism. Both sets of runs are in the evidence pack.

One boundary, stated as plainly as the cost one: references are not yet version-pinned. Locking, pinned versions, and controlled upgrade propagation (improvements to a canonical module flowing to its consumers deliberately, rather than live) are the next stage of this architecture — built on the same reference edges you're seeing here, but not yet something I'll claim.

What the curve buys

One more number from the benchmark, because it's the whole business model in a single contrast: seeding the ~1,174-line ledger module into the corpus cost $36.64 and 212 agent turns — once. Every consumer that references it thereafter pays about $0.50.
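One way to read that contrast is as a break-even count. The sketch below assumes each consumer would otherwise pay the measured from-scratch cost of $0.92 for the same module; that comparison is my assumption for the arithmetic, not a figure reported by the benchmark, and the function name is hypothetical.

```python
import math

SEED_COST = 36.64     # one-off cost of seeding the ledger module (from the article)
REUSE_COST = 0.50     # approximate cost per consumer that references it
SCRATCH_COST = 0.92   # measured from-scratch cost for the same module


def break_even_consumers(seed: float, reuse: float, scratch: float) -> int:
    """Smallest n for which seed + n * reuse <= n * scratch."""
    saving_per_consumer = scratch - reuse
    if saving_per_consumer <= 0:
        raise ValueError("reuse must be cheaper than regeneration to break even")
    return math.ceil(seed / saving_per_consumer)


print(break_even_consumers(SEED_COST, REUSE_COST, SCRATCH_COST))
```

Under that assumption the seeding cost is recovered after 88 consumers, since each one saves about $0.42.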

That's the point of storing code as data. The generation cost of solid functionality is paid once, by whoever builds it; everyone after that inherits working, verified code for the price of the wiring. It's why Stellify has an app store (Constellation) of production apps you can pull from rather than rebuild — and it's why the platform's economics improve with everything you've already built, rather than resetting to zero on every feature the way file-based generation does.

Anything can generate your app. The interesting question is what happens after — whether what you build accumulates or decays. This benchmark measures one consequence of building on structure. The rest of them are harder to put in a table.

Caveats, and an invitation

For the sceptics, as it should be: reuse arms are n=2 per module under each of two mechanisms (mutually confirming); the file-agent arms are single runs, which I accept because the token separations of 10–14× at the two larger sizes dwarf the measured rep noise of about ±5%. The crossover is interpolated. One module family per size, one model; I haven't yet tested whether the pattern holds for, say, frontend components. Identity was proven by full content comparison through the production assembly pipeline. During verification the harness itself surfaced a corpus-integrity issue caused by cross-project row sharing, which I then repaired and re-verified; the incident writeup is in the evidence pack, because a benchmark that hides its accidents isn't one you should trust. It's all in the methodology notes.

The harness is published with the evidence pack. Run it, break it, tell me what I should measure next — and if there's a module shape you'd want benchmarked before trusting this, say so and I'll run it.

Stellify is free to try — 20 AI messages a month, no card: stellisoft.com. The full methodology and evidence: github.com/Stellify-Software-Ltd/code-reuse-benchmark.


Appendix: the tests reuse didn't win

Single small greenfield build, blended cost: file agent $0.66 vs Stellify MCP $0.93. Ten sequential edits to an existing file: file agent $0.99 vs Stellify MCP $3.87. The native Edit tool swaps small snippets in fewer turns, while our current MCP edit tooling re-emits at method granularity and carries the MCP context overhead on every turn. Root cause understood; statement-level edit tooling is on the roadmap. Raw runs are in the evidence pack.