Reading harnesses made me care about Transformers again • Andrew Zhang

I used to keep agent harnesses and Transformer internals in separate drawers. Harnesses were the engineering drawer: loops, tools, approvals, checkpoints, streaming UI. Transformers were the study drawer: attention, feed-forward blocks, RoPE, normalization, KV cache. I knew both drawers existed. I just did not open them at the same time very often.

Reading harness code made that separation feel fake. I went in looking for the clever loop. The loop was there, sure. But the parts I kept thinking about afterward were less glamorous: where messages cross a boundary, where a tool result becomes state, where a human can say yes or no, and why a long context starts behaving like a messy room instead of a bigger notebook.

This was a slightly annoying realization because it made both subjects harder to ignore. I could no longer say, "I am only doing application-layer agent work, so I do not need to care how the model handles position." I also could not hide inside model details and pretend the harness was just plumbing. The plumbing is where the model's weaknesses become somebody's product bug.

Field note

The harness decides where the model is allowed to act. Transformer internals explain why the same model gets expensive, forgetful, brittle, or weirdly confident once the run stretches out.

agent loop · tool boundary · context budget · position signal · state drift · eval trace

The harness layer

I stopped looking for the loop and started looking for the boundaries

The first time I read through an agent harness, I wanted the satisfying part: model thinks, model calls tool, model observes result, model keeps going. It is a neat story. Too neat. The real education was in the surrounding machinery. How does the CLI become a runtime? Where do tools get registered? Which object is the message? What gets validated before a side effect? Where can a human interrupt the run without pretending the whole thing never happened?
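To make the neat story concrete, here is a minimal sketch of that loop with the boundaries written in: a step budget, an interrupt hook, and a named stop reason on every exit. It is an illustration under my own assumptions, not the design of any particular harness; `run_loop`, `model_step`, and `should_interrupt` are hypothetical names.

```python
from typing import Any, Callable, Dict, List, Tuple

# An action is (tool name, keyword arguments). ("finish", {}) ends the run.
Action = Tuple[str, Dict[str, Any]]


def run_loop(
    model_step: Callable[[List[dict]], Action],
    tools: Dict[str, Callable[..., Any]],
    max_steps: int = 10,
    should_interrupt: Callable[[List[dict]], bool] = lambda history: False,
) -> Tuple[List[dict], str]:
    """Run think -> call tool -> observe, and always return a stop reason."""
    history: List[dict] = []
    for _ in range(max_steps):
        # A human (or a policy) can stop the run between steps.
        if should_interrupt(history):
            return history, "interrupted"
        name, args = model_step(history)
        if name == "finish":
            return history, "finished"
        if name not in tools:
            # Unknown tools end the run instead of being silently skipped.
            history.append({"tool": name, "error": "unknown tool"})
            return history, "unknown_tool"
        result = tools[name](**args)
        history.append({"tool": name, "args": args, "result": result})
    return history, "max_steps"
```

Even in a toy like this, the interrupt path returns the history collected so far, so stopping the run does not pretend the earlier steps never happened.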

OpenHarness was a good trigger because it made the harness feel like an operating layer instead of a demo wrapper. I did not walk away thinking every project should copy its shape. I walked away thinking that serious agent systems always grow boundary code: adapters, schemas, runtime bundles, event streams, approval gates, trace surfaces. Boring names. Important jobs.

It also changed how I read recent agent benchmarks. TheAgentCompany, LongCLI-Bench, and the 2026 horizon-length work are easy to summarize as "agents still fail at long tasks." That is true, but it misses the part I care about. A long task puts pressure on every casual boundary in the harness. Tool outputs pile up. State starts to blur. Recovery stops being a nice extra. The model can keep moving after the system has lost the plot.

I used to think a good harness was mostly about exposing enough tools. Now I think that is only the first chapter. A tool surface without a state policy is just a larger place to get lost. A trace without a replay story is mostly decoration. An approval button without a clear side-effect boundary is theater with a nicer UI. These are not deep thoughts, but they are the kind of boring thoughts I wish I had taken seriously earlier.

The same thing shows up in my own projects. In a biomedical evidence workflow, the expensive mistake is rarely "the model forgot the word PubMed." The mistake is more ordinary and more dangerous: a retrieved paper becomes indistinguishable from a model draft, a warning gets buried in history, a reviewer cannot tell why a claim survived, or a later step treats scratch context as evidence. The harness either keeps those boundaries visible, or the model will happily blur them for you.

Tool: Can the model call it safely?

Typed arguments, preflight checks, narrow verbs, and useful error states.

State: Can the run explain itself?

Visible plans, durable checkpoints, stop reasons, and resumable artifacts.

H: Can someone take control?

Approval gates, interrupt points, review surfaces, and plain traces.

Eval: Can failure repeat on purpose?

Task-level regression cases, failure buckets, and replayable run logs.
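The Tool and Human questions above can be sketched in a few lines. This is a hedged example with one narrow, hypothetical verb (`delete_file`); the workspace rule, the status strings, and the callback names are assumptions made for illustration, not any real harness API. The point is the order: typed arguments, then preflight, then approval, and only then the side effect.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class DeleteFileArgs:
    """Typed arguments for one narrow verb (hypothetical tool)."""
    path: str


@dataclass
class ToolResult:
    status: str                       # "ok", "rejected", or "denied"
    detail: str
    stop_reason: Optional[str] = None


def preflight(args: DeleteFileArgs) -> Optional[str]:
    """Return an error message if the call must not run, else None."""
    if not args.path.startswith("workspace/"):
        return "path is outside the workspace"
    if ".." in args.path.split("/"):
        return "path traversal is not allowed"
    return None


def call_delete_file(
    args: DeleteFileArgs,
    approve: Callable[[DeleteFileArgs], bool],
    execute: Callable[[DeleteFileArgs], None],
) -> ToolResult:
    """Validate first, ask a human second, cause the side effect last."""
    error = preflight(args)
    if error is not None:
        return ToolResult("rejected", error, stop_reason="preflight_failed")
    if not approve(args):
        return ToolResult("denied", "human declined", stop_reason="approval_denied")
    execute(args)
    return ToolResult("ok", f"deleted {args.path}")
```

Because `execute` is passed in, the side-effect boundary is a single call site, which is also what makes the approval step more than a button.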

The model layer

Transformer basics became useful again once I cared about long runs

I do not think every agent engineer needs to rederive attention on a whiteboard. But I do think the basic mechanics are worth keeping close by. Attention is the reminder that long context has a cost: in the standard formulation every token is scored against every earlier token, so the work grows faster than the context does. KV cache is the reminder that a streaming agent already has a latency and memory profile before the product feature even exists, because the cache grows with every token the run keeps. Positional encoding is the reminder that "just add more context" is not a neutral instruction.
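The KV cache point is easy to put into numbers. Here is a rough sketch of the arithmetic, assuming a plain decoder that stores one key vector and one value vector per layer, per KV head, per token. The model shape in the comments is hypothetical and chosen only to make the numbers round; real serving stacks add paging, quantization, and sharing tricks that change the totals.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,   # 2 bytes assumes fp16 or bf16 storage
    batch_size: int = 1,
) -> int:
    """Estimate KV cache size for a decoder-only Transformer."""
    # The leading 2 covers one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch_size


# Hypothetical shape: 32 layers, 8 KV heads, head_dim 128, fp16.
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)        # 131072 bytes = 128 KiB
long_run = kv_cache_bytes(32, 8, 128, seq_len=32_768)    # 4 GiB for one sequence
```

The estimate is linear in `seq_len`, which is the practical lesson: every stale turn dragged through the run is paid for in memory on every step.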

The older Lost in the Middle result still bothers me in a useful way: models can use long context unevenly. The newer long-horizon agent papers make the same discomfort show up at the system level. The evidence might be somewhere in the prompt, the plan somewhere in history, and the tool result somewhere in an earlier event. That does not mean the next action will pick up the right piece.

RoPE and normalization are not topics I reach for because they make a post sound serious. I reach for them when I need to remember that the model has an internal geometry. Order, distance, scale, accumulated noise. None of that stays politely inside the paper. When an agent loses the plot after twenty actions, the blame might sit in the harness, the prompt, the tool result, or the model's handling of a crowded context. Usually, annoyingly, it is a mix.
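For readers who want the geometry spelled out, this is a small pure-Python sketch of rotary position embedding. It rotates consecutive pairs of a vector by an angle that depends on the position, which is one common pairing convention; real implementations differ in how they pair dimensions and in the base they use, so treat the details as assumptions. What it shows is the property that matters here: the score between a rotated query and a rotated key depends on their relative offset, not on their absolute positions.

```python
import math
from typing import List


def rope(vec: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even number of dimensions"
    out: List[float] = []
    for i in range(0, dim, 2):
        # Each pair gets its own frequency; later pairs rotate more slowly.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.5, -0.3, 0.8]
k = [0.2, -0.7, 0.9, 0.4]

near = dot(rope(q, 3), rope(k, 1))        # offset 2, early in the sequence
far = dot(rope(q, 103), rope(k, 101))     # offset 2, much later: same score
shifted = dot(rope(q, 3), rope(k, 2))     # offset 1: a different score
```

Distance is baked into the score itself, which is why I count position as a design constraint and not as metadata.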

Feed-forward blocks also became less abstract to me once I stopped treating the model as a pure retrieval surface. The attention story is easier to remember because it has a clean visual: tokens looking at other tokens. But the feed-forward block, which is applied to each position on its own, is where a lot of token-local transformation happens. For agent work, the practical lesson is simple enough: the model is not a database cursor moving through my prompt. It is repeatedly reshaping representations under pressure from the current context.

RMSNorm sounds even further away from application code, but I still find it useful as a mental reminder. Deep stacks need stabilizers. Agent systems do too. A long run without checkpoints is a deep stack of social promises: the prompt says remember this, the history says something changed, the tool result says maybe try again, the UI says keep going. I do not want stability to depend on the model politely holding all of that together.
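For reference, the stabilizer itself is tiny. A minimal sketch of RMSNorm in plain Python, assuming the usual form: divide by the root mean square of the vector, then apply a learned per-dimension gain. The `gain` default and the `eps` value here are illustrative choices, not taken from any specific model.

```python
import math
from typing import List, Optional


def rms_norm(x: List[float], gain: Optional[List[float]] = None, eps: float = 1e-6) -> List[float]:
    """Rescale `x` so its root mean square is about 1, then apply `gain`."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    if gain is None:
        gain = [1.0] * len(x)   # identity gain, standing in for a learned weight
    return [v * scale * g for v, g in zip(x, gain)]


small = rms_norm([3.0, 4.0])
large = rms_norm([300.0, 400.0])   # 100x larger input, nearly identical output
```

The layer does not care how large the incoming activations have grown, only about their direction. A checkpoint plays a similar role in a long run: it resets the scale of what the next step has to trust.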

Attention

Helps me think about what the model can actually attend to when the run gets long.

RoPE

Reminds me that position and distance are design constraints, not harmless metadata.

KV cache

Makes latency and memory cost visible in streaming, multi-turn workflows.

Normalization

Keeps me skeptical of deep stacks where small instability becomes user-facing drift.

The research layer

The recent papers mostly made me less impressed by smooth demos

I do not read agent papers as scoreboards anymore. I read them as failure catalogues. That sounds more pessimistic than I mean it. Failure catalogues are useful. They save me from believing that my local bug is always just my local bug.

TheAgentCompany is useful because the tasks look uncomfortably mundane: office-style work across roles and tools. That is exactly where agent demos usually stop being magical and start being software. The hard part is not one spectacular reasoning jump. It is staying oriented across email-like context, files, instructions, tool calls, and partial progress.

LongCLI-Bench hits a similar nerve from the command-line side. A short coding task can hide a lot. A long CLI task asks whether the agent can plan, run commands, read results, recover from bad assumptions, and keep the end goal alive after several unglamorous steps. That is much closer to the kind of work I actually care about.

The horizon-length study gave me a phrase for a feeling I already had: longer action chains are their own problem. They are not short tasks repeated more times. Each step adds another chance to damage state, drift from the goal, or make recovery harder. That is why I now distrust agent claims that only show the happy path.
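A back-of-the-envelope version of that intuition, as a sketch: if each step succeeds independently with the same probability, whole-chain success is that probability raised to the number of steps. The independence assumption and the 98% figure are mine, not results from the study, and real runs violate both, but the shape of the curve is the point.

```python
import math


def chain_success(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_success ** steps


def steps_until_below(per_step_success: float, target: float) -> int:
    """Smallest step count at which chain success has dropped below `target`."""
    return math.ceil(math.log(target) / math.log(per_step_success))


ten = chain_success(0.98, 10)          # roughly 0.82
fifty = chain_success(0.98, 50)        # roughly 0.36
coin_flip = steps_until_below(0.98, 0.5)   # 35 steps
```

A step that looks reliable in isolation turns into worse than a coin flip by step 35, unless something in the harness checks and repairs state along the way.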

And then there is AI Agents That Matter, which is less about one benchmark and more about the discipline around measuring agents. I like it because it makes the uncomfortable point explicit: if the evaluation is loose, the system can look better without becoming better. That is a very agent-specific trap, because a fluent failed run still feels like activity.

TheAgentCompany: Office work is boundary work

Multiple roles and tools make state discipline more important than a prettier loop.

LongCLI-Bench: The shell exposes drift quickly

Command-line tasks punish vague plans, weak recovery, and missing stop reasons.

Horizon length: Longer changes the task

More steps mean more accumulated damage unless the harness keeps checking itself.

AI Agents That Matter: Evaluation is part of the system

A run that cannot be replayed is hard to improve without fooling myself.

Where the layers meet

The practical question is what should stay in context

This is where the two drawers finally touch for me. Every agent run is a context policy, whether I admit it or not. The harness decides what enters the model, in what order, with what labels, and for how long. The model then makes its own imperfect choice about what to use. If I treat context like a bucket, I get exactly that kind of system: everything goes in, nothing has priority, and debugging turns into archaeology.

The better pattern is to make the harness opinionated. A tool result that matters becomes an artifact with an ID, not another paragraph in chat history. A plan has a current step, a success condition, and a stop reason. Evidence sits somewhere different from scratchpad. User memory does not quietly overrule the current task. The model can still make mistakes. The harness should at least make those mistakes loud enough to notice.
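Here is one way those opinions could look as data, sketched with dataclasses. Every name in it (`Artifact`, `Kind`, `Plan`, `citable`) is hypothetical; it is a minimal illustration of keeping evidence apart from scratch and giving a plan an explicit current step and stop reason, not a schema from any existing harness.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class Kind(Enum):
    EVIDENCE = "evidence"   # retrieved or tool-produced, with provenance
    SCRATCH = "scratch"     # model drafts and working notes


@dataclass(frozen=True)
class Artifact:
    kind: Kind
    content: str
    source: str                           # provenance: which tool or query produced it
    warnings: Tuple[str, ...] = ()
    id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])


@dataclass
class PlanStep:
    description: str
    success_condition: str
    done: bool = False


@dataclass
class Plan:
    steps: List[PlanStep]
    stop_reason: Optional[str] = None

    def current_step(self) -> Optional[PlanStep]:
        """First unfinished step, or None once the run has stopped or finished."""
        if self.stop_reason is not None:
            return None
        return next((s for s in self.steps if not s.done), None)


def citable(artifacts: List[Artifact]) -> List[Artifact]:
    """Only evidence may back a claim; scratch never qualifies."""
    return [a for a in artifacts if a.kind is Kind.EVIDENCE]
```

With this shape, "a later step treats scratch context as evidence" stops being a prompt-wording problem and becomes a filter that either ran or did not.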

This is also why I keep coming back to evaluations. AI Agents That Matter criticized agent evaluations for being too easy to game or too weakly controlled. I read that less as a benchmark complaint and more as a design warning. If the harness cannot replay a task, isolate a failure, and tell whether a change helped, then the system is still mostly theater. The model may be strong. The engineering loop is not.

The funny thing is that context policy sounds like a product decision until it breaks. Then it becomes a model behavior question, a retrieval question, a UI question, and sometimes a privacy question. Should the model see the whole previous conversation? Should it see the raw tool output or a normalized artifact? Should it see rejected evidence? Should it see the user's old preference if the current task contradicts it? None of those questions live cleanly in one layer.

This is where I think the Transformer basics stop being background knowledge and start being design pressure. If long context is uneven, I should not bury the critical warning in the middle of a giant prompt. If position matters, I should care about where the current objective sits. If serving cost matters, I should care about whether I am dragging stale history through every turn because I was too lazy to summarize or checkpoint. The model will not send me a polite invoice for bad context design. It will just get slower and weirder.
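A sketch of what that design pressure could look like in a prompt builder. It pins the objective and warnings at the top, repeats the objective at the end, and keeps only the most recent history that fits a budget. Placing critical items at the edges is a heuristic in the spirit of the Lost in the Middle result, not a guarantee; the function name, the labels, and the use of characters as a stand-in for tokens are all assumptions made for this example.

```python
from typing import List


def build_prompt(
    objective: str,
    pinned_warnings: List[str],
    history: List[str],
    max_history_chars: int = 2000,   # character budget as a crude token proxy
) -> str:
    """Lay out context so critical items sit at the edges, not the middle."""
    kept: List[str] = []
    used = 0
    # Walk backwards so the newest turns are kept first.
    for turn in reversed(history):
        if used + len(turn) > max_history_chars:
            break
        kept.append(turn)
        used += len(turn)
    kept.reverse()
    dropped = len(history) - len(kept)

    parts = ["OBJECTIVE: " + objective]
    parts += ["WARNING: " + w for w in pinned_warnings]
    if dropped:
        # Say that history was cut, so the omission is visible in traces.
        parts.append(f"[{dropped} older turns omitted]")
    parts += kept
    parts.append("CURRENT OBJECTIVE (repeated): " + objective)
    return "\n".join(parts)
```

A real version would summarize the dropped turns instead of only counting them, but even the count makes the context policy something I can read in a trace.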

Raw output: Tool output

Promote important results into typed artifacts with IDs, status, warnings, and provenance.

Chat history: Conversation history

Summarize, pin, or discard based on whether the next action needs it.

Memory: User preference

Separate durable collaboration rules from task-local assumptions.

Long run: Long workflow

Checkpoint state so recovery and handoff have something concrete to hold.
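The last item above, checkpointing, is the easiest to make literal. A minimal sketch, assuming run state is small enough to serialize as JSON; the `Checkpoint` fields are hypothetical and would differ per harness.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Checkpoint:
    run_id: str
    objective: str                        # stored as run state, not left to model memory
    next_step: int
    artifact_ids: List[str] = field(default_factory=list)
    stop_reason: Optional[str] = None


def save(cp: Checkpoint) -> str:
    """Serialize a checkpoint; a real harness would write this to durable storage."""
    return json.dumps(asdict(cp), sort_keys=True)


def load(raw: str) -> Checkpoint:
    """Rebuild the checkpoint so a resumed run or a human handoff starts from facts."""
    return Checkpoint(**json.loads(raw))
```

The useful property is the round trip: if `load(save(cp))` gives back the same state, recovery has something concrete to hold instead of a transcript to reread.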

What I would study now

I would learn the internals only up to the point where they change decisions

If I were turning this into a study plan, I would skip the full deep learning syllabus for now. I would learn the parts that change what I build. Attention and context behavior, because they affect retrieval, prompt layout, summarization, and drift. KV cache and serving cost, because agent UX depends on latency whether I want to think about it or not. RoPE and long-context limits, because a larger window still needs a policy. Benchmark design, because without eval discipline I am very easy to fool.

The harness side is even more concrete. Read a real runtime from process startup to tool execution. Trace one message as it moves from UI to model to tool to event stream. Find where approval lives. Find where state persists. Find where errors become recoverable, or where they quietly disappear. Then write a tiny eval that would catch one bug you actually saw. That last part is where the learning usually starts to hurt, which is why it is useful.
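As an example of how small "a tiny eval" can be, here is a sketch that replays a recorded trace and checks one invariant: no side-effecting tool call runs without a prior approval. The event format (`type`, `call_id`, `tool`, `side_effect`) is an assumption invented for this example, not the trace schema of any real harness.

```python
from typing import Dict, List


def unapproved_side_effects(trace: List[Dict]) -> List[str]:
    """Replay a recorded run and report side effects that ran without approval."""
    approved = set()
    failures: List[str] = []
    for i, event in enumerate(trace):
        if event["type"] == "approval":
            approved.add(event["call_id"])
        elif event["type"] == "tool_call" and event.get("side_effect"):
            # Approval must appear earlier in the trace than the call it covers.
            if event["call_id"] not in approved:
                failures.append(f"event {i}: {event['tool']} ran without approval")
    return failures


good_trace = [
    {"type": "tool_call", "call_id": "c1", "tool": "read_file", "side_effect": False},
    {"type": "approval", "call_id": "c2"},
    {"type": "tool_call", "call_id": "c2", "tool": "write_file", "side_effect": True},
]
bad_trace = [
    {"type": "tool_call", "call_id": "c3", "tool": "write_file", "side_effect": True},
    {"type": "approval", "call_id": "c3"},   # approval arrived after the write
]
```

Run against `bad_trace`, the check reports the write at event 0. One saved trace plus one check like this is already a regression case that can fail on purpose.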

If I had a free weekend, I would make the study path very literal. Morning: read one harness entry point and draw the runtime objects. Afternoon: follow one user message until it becomes a model request. Evening: read one paper or note about long context, then go back to the harness and ask what the system is assuming about context. The next day I would repeat the same thing for tools: schema, validation, execution, observation, retry, evaluation. Not glamorous. Much better than reading ten abstract posts about "agentic workflows."

I would also keep a small notebook of failures. Not a polished postmortem, just the ugly list: tool called with wrong argument, plan step skipped, stale memory used, long prompt ignored current goal, retry made things worse, eval missed it. That list tells me what to study next. If the failure is about context placement, I go back to attention and long-context behavior. If it is about cost, I go back to KV cache and serving. If it is about drift, I go back to checkpointing and eval design.

The part I keep forgetting

The model is not the only thing that needs architecture

There is a lazy version of this argument where "understand the model" becomes a way to sound serious. I do not want that version. I have seen enough systems where the author knew the vocabulary and still shipped a mushy harness. Knowing what RoPE is does not magically give you state discipline. Knowing attention complexity does not give you a recovery policy. It only gives you fewer excuses.

The useful version is more modest. Model internals help me ask better questions of the harness. Why is this much history still in the prompt? Why is this warning not pinned? Why do we rely on the model to remember the current objective instead of storing it as run state? Why does the UI show a smooth stream but no checkpoint? Why can the agent continue after a tool result that should have forced a stop?

Once I ask those questions, the architecture work becomes less vague. The harness is not a wrapper around intelligence. It is the place where I decide what kind of mistakes I am willing to see twice. That is a harsher definition, but it has been more useful to me.

The note I want to keep

I no longer separate "agent engineering" from "model internals" as cleanly as I used to. The harness gives the model a world to act in. The Transformer gives that action its cost, limits, and failure shape. I can ignore either side and still build a demo. I just cannot explain why it falls apart once the demo turns into a longer task.

So my personal rule is narrower than "learn the internals." Study Transformer basics when they change a system decision. Read harness code when it exposes a boundary. The useful knowledge is the knowledge that changes what I log, what I store, what I retry, what I hide from context, and when I ask the human to come back into the loop.