Devlog #7: Slowly But Surely
It’s been one of those stretches where I feel like I’m spinning plates — but when I take a step back, the threads line up pretty neatly. I’ve been stress-testing The Planner’s Assistant against a real major application, building up local AI capacity, and exploring how to ground models in planning theory. All of it points to the same idea: if planning AI is going to stick, it has to be done properly.
Stress-testing on Earl’s Court
The Earl’s Court application (~3.3GB of PDFs, 147 documents) has become my benchmark. It’s not an abstract dataset; it’s the kind of oversized, messy case officers know all too well. Running my codebase against it has been equal parts frustrating and illuminating.
- The first run only managed to process six files before stalling. Annoying, yes — but also a reality check. Paper designs look neat; live applications fight back.
- At this scale, processing takes hours. That's not just a technical bottleneck; it's a UX problem: no planner will wait in front of a locked-up screen. The challenge is to design workflows that make the outputs usable and explainable despite the scale.
- The exercise has already shaped the roadmap: The Planner’s Assistant won’t just parse documents — it will need to show its working as it goes.
Stress-testing like this is slow and occasionally demoralising, but it’s also the only way to build tools that survive contact with reality. Earl’s Court is exposing the cracks that actually matter.
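For what it's worth, the pipeline shape I'm converging on is unglamorous: walk the batch file by file, log progress as it goes, append results as they arrive, and never let one corrupt PDF take down a 147-file run. Here's a rough sketch of that skeleton, with placeholder file names and pypdf standing in for whatever extraction ends up doing the real work:

```python
# Minimal sketch of a resumable, progress-reporting batch run over an application's
# PDFs. File names and the per-file processing step are placeholders, not the real
# pipeline; the point is isolating failures and showing working as the run goes.
import json
import logging
from pathlib import Path

from pypdf import PdfReader  # assumption: pypdf for basic text extraction

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("pipeline")

def process_pdf(path: Path) -> dict:
    """Extract page text from one PDF; real downstream analysis would go here."""
    reader = PdfReader(path)
    pages = [page.extract_text() or "" for page in reader.pages]
    return {"file": path.name, "pages": len(pages), "chars": sum(map(len, pages))}

def run_batch(folder: Path, out_path: Path) -> None:
    pdfs = sorted(folder.glob("*.pdf"))
    done = {json.loads(line)["file"] for line in out_path.open()} if out_path.exists() else set()
    with out_path.open("a") as out:
        for i, pdf in enumerate(pdfs, 1):
            if pdf.name in done:
                continue  # resume: skip files already processed in an earlier run
            log.info("[%d/%d] %s", i, len(pdfs), pdf.name)
            try:
                record = process_pdf(pdf)
            except Exception as exc:  # one corrupt PDF shouldn't kill the whole run
                log.warning("skipped %s: %s", pdf.name, exc)
                continue
            out.write(json.dumps(record) + "\n")  # append as we go, not at the end

if __name__ == "__main__":
    run_batch(Path("earls_court_pdfs"), Path("progress.jsonl"))
```

The append-as-you-go JSONL is also what makes the run resumable: restart it after a stall and it skips whatever already went through.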
Local AI as Infrastructure
Those tests hammered home another point: local inference isn't optional. On my setup the NPU (48 TOPS) maxed out, the GPU strained at the edges, and the job still ground through. If I had to rely on API calls for jobs like this, costs would spiral and stability would suffer.
That’s why I’ve treated new hardware as an investment, not a luxury. It lets me:
- Run larger models locally without waiting for cloud quotas.
- Re-run experiments freely without watching token bills.
- Explore hybrid GPU/NPU scheduling in ways planners themselves could one day replicate in their offices.
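As a flavour of what those hybrid experiments look like, ONNX Runtime is one stack that exposes NPU and GPU backends as execution providers; the provider list and model path below are assumptions for illustration, not my exact setup:

```python
# Sketch of picking local execution providers for ONNX Runtime, preferring an NPU
# or GPU backend when one is installed. Which providers actually appear depends on
# the runtime build; "model.onnx" is a placeholder.
import onnxruntime as ort

PREFERRED = [
    "QNNExecutionProvider",       # Qualcomm NPU builds
    "OpenVINOExecutionProvider",  # Intel NPU/GPU builds
    "CUDAExecutionProvider",      # NVIDIA GPU builds
    "DmlExecutionProvider",       # DirectML GPU on Windows
    "CPUExecutionProvider",       # always available as a fallback
]

available = ort.get_available_providers()
providers = [p for p in PREFERRED if p in available]
print("using providers:", providers)

# The session tries providers in order, falling back down the list where needed.
session = ort.InferenceSession("model.onnx", providers=providers)
```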
Some people upgrade wardrobes; I upgrade hardware. Humour aside, the serious point is that local capacity is what makes planning AI practical and sustainable.
Grounding in Planning Theory
But hardware alone isn’t enough. The bigger intellectual thread is how to make models reason credibly. That’s where distillation comes in: feeding long-context models the NPPF, PINS precedents, and the core planning texts. Not site-specific policies — too narrow, too biased — but the backbone of UK planning itself.
This isn’t only about cutting down hallucinations. It’s about:
- Consistency: models grounded in theory won’t contradict themselves as often.
- Auditability: outputs can be traced back to sources planners already recognise.
- Credibility: if AI is going to support professional judgement, it has to be trained on the same intellectual spine planners are trained on.
Hardware and methodology meet here. The compute makes the distillation experiments possible. The distillation makes the compute meaningful.
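Mechanically, one common shape for this kind of distillation is: chunk the corpus, have a long-context teacher model draft question and answer pairs grounded in each passage, and keep the source alongside every pair so nothing loses its provenance. A toy sketch, with a hypothetical local endpoint, model name, and file paths (the chunking and prompt are deliberately simplistic):

```python
# Toy sketch of building a distillation set: chunk the planning-theory corpus,
# ask a long-context teacher model for grounded Q&A pairs, and save them as JSONL
# for later fine-tuning. Endpoint, model name, and paths are placeholders.
import json
from pathlib import Path

from openai import OpenAI  # any OpenAI-compatible local server works here

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

PROMPT = (
    "You are drafting training data for a planning assistant. From the passage "
    "below, write one question a planner might ask and a faithful answer that "
    "cites the passage. Reply as JSON: {\"question\": ..., \"answer\": ...}\n\n"
)

def chunk(text: str, size: int = 4000):
    for i in range(0, len(text), size):
        yield text[i:i + size]

with Path("distillation_pairs.jsonl").open("w") as out:
    for doc in Path("planning_theory_corpus").glob("*.txt"):
        for passage in chunk(doc.read_text()):
            reply = client.chat.completions.create(
                model="teacher-model",  # placeholder name
                messages=[{"role": "user", "content": PROMPT + passage}],
            )
            try:
                pair = json.loads(reply.choices[0].message.content)
            except json.JSONDecodeError:
                continue  # skip malformed replies rather than halting the run
            pair["source"] = doc.name  # keep provenance so outputs stay auditable
            out.write(json.dumps(pair) + "\n")
```

Keeping the source document on every pair is what later turns the auditability claim from a slogan into something you can actually check.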
Why Fine-Tuning Matters More Than Wrappers
It’s easy to think a Retrieval-Augmented Generation (RAG) system wrapped around a chatbot is “good enough.” Fetch some snippets, let the model summarise, job done. But that’s brittle, and it risks souring planning AI’s reputation before it even begins. Fine-tuning on planning theory is a slower but far more credible path:
Depth vs. Recall
- RAG + wrapper: pastes snippets and hopes the model stitches them together. Results hinge on retrieval quirks.
- Fine-tuning: bakes in an understanding of planning concepts and their structure, so the model reasons with them directly.
Consistency
- RAG: answers shift depending on chunks retrieved or prompt wording.
- Fine-tuning: locks the intellectual spine into the model itself, making reasoning steadier and less prompt-sensitive.
Professional Credibility
- RAG: can feel like a chatbot bolted onto someone else’s API — good for demos, easy to dismiss.
- Fine-tuning: shows deliberate methodology. Training on the NPPF, PINS precedents, and classic texts means outputs are auditable and rooted in the same sources planners trust.
Sustainability
- RAG: lives on repeated API calls and fragile wrappers.
- Fine-tuning: once trained, the model can run locally and cheaply, without locking into a single vendor. Costs are predictable, and the infrastructure stays under control.
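To pin down what "fine-tuning" means in practice here: on local hardware the realistic version is a parameter-efficient pass (LoRA) over the distilled pairs, producing a small adapter rather than a whole new model. A compressed sketch using Hugging Face transformers and peft, with placeholder model and file names, and not the project's actual training code:

```python
# Minimal LoRA fine-tuning sketch over a JSONL of question/answer pairs.
# Base model, hyperparameters, and paths are illustrative assumptions;
# quantisation and offloading are omitted for brevity.
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "mistralai/Mistral-7B-v0.1"  # assumption: any local-friendly base model
tok = AutoTokenizer.from_pretrained(BASE)
tok.pad_token = tok.eos_token

model = AutoModelForCausalLM.from_pretrained(BASE)
model = get_peft_model(model, LoraConfig(
    task_type=TaskType.CAUSAL_LM, r=16, lora_alpha=32,
    target_modules=["q_proj", "v_proj"], lora_dropout=0.05))

def to_text(example):
    # Fold each question/answer pair into one training string.
    return {"text": f"Question: {example['question']}\nAnswer: {example['answer']}"}

ds = load_dataset("json", data_files="distillation_pairs.jsonl")["train"]
ds = ds.map(to_text)
ds = ds.map(lambda e: tok(e["text"], truncation=True, max_length=1024),
            remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, num_train_epochs=1,
                           learning_rate=2e-4, logging_steps=10),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
model.save_pretrained("planning-lora-adapter")  # small adapter, runs locally
```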
Wrappers make quick demos. Fine-tuning makes credible tools. If the goal is professional adoption, methodology is the product.
One Thread
So while it might look like I’m juggling different things — pipelines, hardware tinkering, distillation — it’s really one continuous arc. Testing against live applications. Building the capacity to run it locally. Grounding models so they reason like planners rather than autocomplete engines.
The pace is slow by design. That’s how to make sure planning AI is robust, transparent, and worth taking seriously.
Next Steps
- Rerun the pipeline on a smaller controlled batch of PDFs and validate clean JSON output (a rough validation sketch is below).
- Push hardware tests further to see how hybrid GPU/NPU loads behave.
- Start drafting a “planning theory corpus” for distillation experiments.
- Keep logging progress so the invisible work doesn’t stay invisible.
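On that first item, "validate clean JSON output" just means checking every record the pipeline emits against a small schema before anything downstream trusts it. A rough sketch, with illustrative fields rather than the pipeline's final format:

```python
# Rough sketch of the JSON validation step: check each record against a schema
# and count anything malformed. The schema fields are illustrative assumptions.
import json
from pathlib import Path

from jsonschema import ValidationError, validate

RECORD_SCHEMA = {
    "type": "object",
    "required": ["file", "pages", "chars"],
    "properties": {
        "file": {"type": "string"},
        "pages": {"type": "integer", "minimum": 1},
        "chars": {"type": "integer", "minimum": 0},
    },
}

bad = 0
for line in Path("progress.jsonl").read_text().splitlines():
    try:
        validate(instance=json.loads(line), schema=RECORD_SCHEMA)
    except (json.JSONDecodeError, ValidationError) as exc:
        bad += 1
        print("invalid record:", exc)
print("invalid records:", bad)
```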