My Notes on the Progression from Context to Prompt to Harness engineering in making GPT LLMs Useful: (TUESDAY) MAMLMs
The Age of Harness Engineering: taming the stochastic parrots before they eat your seed-corn codebase. We’ve scaled the models and blown out the context windows—and discovered that reliability lives somewhere else entirely. Context and prompt tricks got us very impressive demos, but were clearly not enough; now we say “this time! for sure!” and bet on the new frontier of “harness engineering” to turn a stochastic parrot into a halfway dependable co‑worker…
As I see it, hype vendors promised million‑token context windows and “autonomous agents”; what we got was context rot, soft failures, and ten‑step workflows that succeed maybe a third of the time. Thus the shift from context and prompt cleverness to the unglamorous but essential work of building tests, tools, and symbolic scaffolding around LLMs so they can actually do work. Sometimes. In particular domains. For today’s LLMs talk like geniuses and reason like a distracted intern with no memory of yesterday’s mistakes. They are best treated not as budding minds but as components inside a larger harness that enforces truth conditions, cleans up messes, and decides what they’re even allowed to see.
We have, over in comments on <https://braddelong.substack.com/p/ai-is-eating-platform-monopolist/comment/272975403>, the following:
David Thomson: ‘On getting your clever hans to do useful things with code, not just tokenmax junk I’ve come to the conclusion that context management is everything and context rot is the core hard constraint you have to deal with in an llm system. No matter the models advertised context window my new rule of thumb is about 50% or 100k tokens then they get dumb. Even the 1m token models. 100k is probably a little conservative but context rot is usually a cliff and I can’t be bothered testing. The quality of manipulating and recalling the information collapses rapidly and if you need precision (unlike general chat) don’t compact, keep context as clean as you can in each conversation with as minimal info as you can possibly give it. Each time it processes the conversation, errors in the conversation crowds out the signal from the core model so it can’t manipulate and reconstitute info as accurately. Hence you get more confabulation and errors. The Mutica kanban solution is a big change in how effective I’ve found the things at coding. I suspect this fail mode is true in other domains too - it’s just more hidden with soft fails. So for example it’s pretty hopeless very quickly at looking at long amounts of prose like a novel or screenplay…
I think this is broadly right. Let me lay out things as I see them right now, starting with the unexpected and semi-catastrophic success of the ChatGPT3 technology demonstration:
Context Engineering
In the beginning was context engineering.
Not that we called it that. We called it “stuffing as much as we could into 4096 tokens and praying”, for 4096 tokens was the limit of what the LLM could eat. That constraint was so tight that the only way to get more than bare, if remarkable, sentence-by-sentence verbal fluency—the only way to get anything even semi-convincing as a simulacrum of a real sustained interaction with a Turing‑level entity—was to husband that context window as if it were gold. Every system prompt, every example, every prior turn was a line item in a very small budget. You had one page of paper on which to write down everything the machine needed to “remember” before you asked your next question.
In that world, the art was almost entirely a matter of two questions: what do I dare put in, and what must I leave out? You learned very quickly that a few extra paragraphs of irrelevant chatter would crowd out the one example that actually anchored the model’s behavior. You discovered, painfully, that if you tried to have the thing read a long contract or a long code file, its “understanding” degraded into word salad. What we now politely call “context overflow” was then just “it went dumb.”
So: primordial context engineering—before the name—was ruthless curation under a hard 4K token ceiling. You were playing Tetris with instructions, examples, and snippets of prior dialogue. The better you played, the more the model looked like an entity that could track a conversation, reason about what had gone before, and give you something that felt, at least from arm’s length, like coherence. You found a trusted database, searched it, and then gave the LLM the three paragraphs that your RAG—Retrieval-Augmented Generation—system thought were most relevant. You asked it to remix those. And you prayed.
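To make the Tetris game concrete, here is a minimal sketch of packing retrieved snippets under a 4096-token ceiling. Everything in it is an assumption for illustration: the `Snippet` class, the 512-token reserve for the reply, and above all `estimate_tokens`, which is a crude four-characters-per-token guess rather than a real tokenizer.

```python
from dataclasses import dataclass

CONTEXT_LIMIT = 4096   # the hard ceiling described above
REPLY_RESERVE = 512    # assumed: tokens held back for the model's answer


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: roughly 4 characters per token."""
    return max(1, len(text) // 4)


@dataclass
class Snippet:
    text: str
    relevance: float  # assumed to come from some earlier retrieval step


def pack_context(system_prompt, question, snippets,
                 limit=CONTEXT_LIMIT, reserve=REPLY_RESERVE):
    """Greedily keep the most relevant snippets that still fit the budget."""
    budget = (limit - reserve
              - estimate_tokens(system_prompt)
              - estimate_tokens(question))
    chosen = []
    for snippet in sorted(snippets, key=lambda s: s.relevance, reverse=True):
        cost = estimate_tokens(snippet.text)
        if cost <= budget:  # skip anything that would overflow the window
            chosen.append(snippet)
            budget -= cost
    return chosen
```

The greedy rule is the simplest one that works: a highly relevant but oversized snippet is skipped, and a smaller, less relevant one may still make the cut. A real system would count tokens with the model's own tokenizer.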
And, people thought, it is amazing what it can do; as the frontier labs scale it, it will get better and make its way across the finish line.
Indeed: in some ways this is state-of-the-art. This is what my SubTuringBradBot Telegram ‘bot <https://web.telegram.org/k/#@SubTuringBradBot> uses. And I regard it as successful. But that is because it is a RAG-engine limited to remixing catechism answers from the scrubbed-and-trusted delong_qa.db SQLite file. I am now devoting energy to hand-creating and trying to automate the creation of a /dbs/corrections/ datastore to improve that by doing a second pass. I am not looking for a smarter LLM than ollama/gemma4:26b, and would not want one. That would start hallucinating on me. AGAIN.
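A minimal sketch of that kind of closed-world RAG loop, under loudly stated assumptions: the `qa(question, answer)` table, the keyword-overlap scoring, and the two demo rows are all hypothetical stand-ins, not the actual schema or contents of delong_qa.db, and no model call is shown.

```python
import re
import sqlite3


def tokenize(text):
    """Lowercased word set; a deliberately naive relevance signal."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_answers(conn, query, k=3):
    """Return up to k (question, answer) rows sharing words with the query."""
    q_words = tokenize(query)
    scored = []
    for question, answer in conn.execute("SELECT question, answer FROM qa"):
        overlap = len(q_words & tokenize(question))
        if overlap:
            scored.append((overlap, question, answer))
    scored.sort(key=lambda row: row[0], reverse=True)
    return [(q, a) for _, q, a in scored[:k]]


def build_prompt(query, retrieved):
    """Restrict the model to remixing trusted answers; decline if none match."""
    if not retrieved:
        return None  # nothing trusted to remix: better to decline than guess
    sources = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in retrieved)
    return (
        "Answer using ONLY the trusted answers below. "
        "If they do not cover the question, say so.\n\n"
        f"{sources}\n\nQuestion: {query}"
    )


def demo_db():
    """In-memory database with made-up rows, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE qa (question TEXT, answer TEXT)")
    conn.executemany("INSERT INTO qa VALUES (?, ?)", [
        ("What is context rot?", "Quality collapses as the context fills up."),
        ("What is a harness?", "Tests and tools wrapped around the model."),
    ])
    return conn
```

The design choice that matters is the `None` branch: when retrieval finds nothing, the bot has no licence to improvise, which is the whole point of limiting it to a scrubbed-and-trusted store.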
And so we moved on.
Prompt Engineering
The wizards relaxed the context-window constraint. 4096 tokens became 8192, then 32768, then 131072, then 1048576. The marketeers told us, repeatedly, that “context is no longer a bottleneck”, in a triumph of wishful hype over software reality. Why a triumph of hype? Because, as David Thomson notes above, it appears that above roughly 131072 tokens of context, current LLMs at least become really dumb.
There is such a thing as context rot.
It happens at what is a small fraction of the now-advertised context windows. Above that, the model’s ability to manipulate and reconstitute information, to pull out what matters and ignore what does not, first degrades and then collapses. On a local model or a cheap cloud budget this is half-hidden as the thing slows to a crawl. Throw unlimited NVIDIA GPUs at it, and it becomes very clear that it has lost the thread: misreads, half‑hallucinated summaries, and small slips do not just accumulate but crowd out the genuine signal you had hoped the machine would pick up. It carries more and more cruft forward and the simulacrum of intelligence dies.
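David Thomson's rule of thumb can be written down as a guard. This is a sketch of one reading of it, assuming "about 50% or 100k tokens" means whichever is smaller; the function names and the default numbers are illustrative, not measured thresholds.

```python
def effective_budget(advertised_window: int,
                     fraction: float = 0.5,
                     hard_cap: int = 100_000) -> int:
    """Usable context under the rule of thumb: the smaller of a fraction
    of the advertised window and a fixed cap (both are assumptions)."""
    return min(int(advertised_window * fraction), hard_cap)


def should_start_fresh(tokens_used: int, advertised_window: int) -> bool:
    """True once a conversation has burned through its effective budget,
    i.e. when it is time to open a clean context rather than compact."""
    return tokens_used >= effective_budget(advertised_window)
```

On this reading, a model advertised at 1048576 tokens gets a working budget of 100,000, and one advertised at 131072 gets 65,536. The exact cliff surely varies by model and task; the point is to pick a number well below the advertised one and enforce it.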
And so people pivoted. You had followed the advice of OpenAI, Anthropic, Google, and company and given the machine everything rather than a narrow three paragraphs from RAG. Once you can no longer pretend that “just give it everything” works, you can try instead to control how you ask.
Thus phase two: prompt engineering, the discipline of “what you say.” The model is taken to be a black box with impressive latent capability. You are told that if you only phrase your request correctly—specify the role, break the task into steps, provide the right few‑shot examples, insist on JSON output, and add the obligatory “think step by step”—the network will reveal its inner wisdom.
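The recipe can be sketched in a few lines. The function names, the prompt wording, and the key list below are all hypothetical, and no model is actually called; the sketch only assembles the prompt and checks whether a reply honors the JSON demand.

```python
import json


def assemble_prompt(role, steps, examples, task, keys):
    """Combine role, numbered steps, few-shot examples, and a JSON demand."""
    lines = [f"You are {role}.", "Work through these steps:"]
    lines += [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    for example_input, example_output in examples:  # few-shot pairs
        lines += ["", f"Input: {example_input}",
                  f"Output: {json.dumps(example_output)}"]
    lines += ["", f"Input: {task}",
              "Think step by step, then reply with ONLY a JSON object "
              "with keys: " + ", ".join(keys) + "."]
    return "\n".join(lines)


def parse_reply(reply, keys):
    """Return the reply as a dict if it is a JSON object with exactly the
    requested keys; otherwise None, so the caller can retry or give up."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != set(keys):
        return None
    return data
```

Note that `parse_reply` is the only part that checks anything. Everything in `assemble_prompt` is a request, and the model is free to ignore it; "insist on JSON output" only means something once code outside the model verifies the result.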





