Andrew Zhang

May 25, 2026 / 10 min read

Learning agents again after the hype

A personal field note on why tools, evals, memory discipline, and long-horizon workflow design mattered more than the old promise of autonomous magic.

Anthropic note / TheAgentCompany / MemoryArena / MCP

If you had asked me last year how to learn agents, I probably would have answered with a reading list. Frameworks, papers, orchestration patterns, memory systems, evals. All useful, but also a very convenient way to avoid the awkward part: building something small enough that its failure is impossible to hide.

The first few things I built did not fail in an impressive way. They failed like normal software. A tool returned the wrong shape. A run kept going after it had already lost the plot. The model sounded confident about a state it had not actually preserved. I kept wanting the failure to say something deep about intelligence. Most of the time it said something boring about interfaces.

That was the humbling part. I wanted to feel that I was learning a new class of system, something grander than ordinary software. What helped was admitting that the important questions were plain: what did the tool return, why did the run continue, what state survived, and who gets to say "stop."

Field note

The version of agent engineering I trust now is less romantic than the one I started with. It is workflow design under model uncertainty: tool contracts, state boundaries, checkpoints, recovery, and evals.

workflows before autonomy / tools are interface / protocols matter / memory needs policy / long horizons still hurt

01 Autonomy is not the product

The useful unit is a bounded workflow I can replay and question.

02 Memory is not just retrieval

State helps only when the system knows what to keep and what to forget.

03 Reliability is still the bottleneck

Long tasks are where the demo story starts to meet the engineering story.

Wrong start

I started with the wrong mental model

My early picture was embarrassingly tidy. Take a strong model, wrap it in a loop, give it memory, let it call tools, and now you have an agent. I liked that picture because it made the whole field feel learnable in one sitting. It also hid almost every hard part.

I remember one run where nothing failed loudly. That made it worse. The model kept producing plausible next steps, the logs looked busy, and for a while I mistook motion for progress. Then I tried to explain, in one sentence, what the system believed at that point in the run. I could not do it. If I cannot name the state, I do not really understand the system.

The cleanest reset for me came from Anthropic's "Building effective agents" note. The useful distinction was not "agents are powerful." It was the much duller split between workflows and agents, plus the reminder that agents are usually just LLMs using tools based on environmental feedback in a loop. Once I accepted that, the work stopped feeling mystical. The hard part was interface design, checkpointing, and deciding where the loop was allowed to stop.

I also had to stop trusting repo labels too quickly. A project can have folders called planner, memory, and executor and still behave like a prompt with nicer packaging. The useful questions are less glamorous. What object crosses the boundary between steps? What can be retried safely? What tells the system that it is done, blocked, or confused?
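Those three questions can be made concrete with very little code. What follows is a minimal sketch, not a framework: `Status`, `StepResult`, `run_bounded`, and `fake_step` are hypothetical names invented for this illustration, and `fake_step` stands in for a real model call plus tool call.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class Status(Enum):
    CONTINUE = "continue"
    DONE = "done"
    BLOCKED = "blocked"
    CONFUSED = "confused"
    OUT_OF_BUDGET = "out_of_budget"


@dataclass
class StepResult:
    """The one object that crosses the boundary between steps."""
    status: Status
    output: dict = field(default_factory=dict)


def run_bounded(step: Callable[[int, dict], StepResult], max_steps: int = 5):
    """Call `step` until it reports a terminal status or the budget runs out."""
    state: dict = {}
    for i in range(max_steps):
        result = step(i, state)
        state.update(result.output)
        if result.status is not Status.CONTINUE:
            return result.status, state, i + 1
    return Status.OUT_OF_BUDGET, state, max_steps


def fake_step(i: int, state: dict) -> StepResult:
    # Hypothetical stand-in for a model call plus a tool call.
    if i < 2:
        return StepResult(Status.CONTINUE, {"rows_read": i + 1})
    return StepResult(Status.DONE, {"summary": "ok"})


status, state, steps = run_bounded(fake_step)
print(status.value, state, steps)
```

The point of the sketch is that the loop never ends silently. It ends with a named status, and the accumulated state is small enough to describe in one sentence.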

I believed: Autonomy was the feature

The more turns and freedom I gave the system, the more "agentic" it felt.

What held up: Boundaries were the feature

The runs I trusted had explicit tools, clear outputs, and obvious places to interrupt.

I believed: Memory meant more context

I treated memory like a bigger backpack for the same chat loop.

What held up: Memory meant better state

The useful question was what should survive between steps, not what could be stored.

Reality check

Recent benchmarks made the problem harder to romanticize

By early 2026, the papers were starting to line up with the failures I was seeing in small builds. TheAgentCompany places agents inside a simulated company and asks them to finish real office work across multiple roles and tools. LongCLI-Bench does something similar for long command-line programming tasks and reports pass rates below 20% even for strong agents. I did not read those papers as a takedown of agents. I read them as a reality check for people like me, people who wanted the clean demo to generalize faster than it actually does.

That mattered more than I expected. Before reading them, I kept assuming my own failed runs mostly came from inexperience. Some of them did. But not all of them. The messiness was not only in my code. The field itself was still learning how brittle these systems become once a task is long, tool-heavy, and slightly ambiguous.

The May 2026 horizon-length study pushed the point even further. It isolates horizon length and shows that longer action chains create their own training bottleneck. That clicked for me immediately. A long task asks for more reasoning, but it also magnifies every weak assumption in the workflow: fragile tool outputs, vague success criteria, poor recovery, and state that quietly drifts away from the original task.
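A back-of-envelope way to see the magnification, offered as an illustration and not as the study's model: if every step succeeded independently with the same probability, the whole chain would succeed with that probability raised to the number of steps. Both the independence assumption and the 0.98 per-step figure are hypothetical.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step in an independent chain succeeds."""
    return per_step ** steps


# Hypothetical per-step reliability of 98% across different chain lengths.
for n in (5, 10, 50):
    print(n, round(chain_success(0.98, n), 2))
```

Under those assumptions, a 50-step chain completes a bit more than a third of the time, even though each individual step looks reliable. Real runs are messier, since errors are correlated and some are recoverable, but the direction of the effect is the same.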

Since then, I watch demos differently. A polished browser run can still be impressive. I just do not stop there anymore. Can I replay it? Can it recover after a bad tool result? Does it know when to hand control back to a human? If it fails twice, does it fail for the same reason? Those are the questions I wish I had asked earlier.

TheAgentCompany

Useful for remembering that office-style agent work is multi-tool, stateful, and still brittle.

LongCLI-Bench

Useful for seeing how quickly coding agents stall when the command line task stops being short.

Horizon-length study

Useful for separating "the model is weak" from "the action chain itself is the problem."

Tooling trend

The ecosystem got more concrete once the tool layer did

The industry shift that mattered to me was not everyone saying "agents" more often. It was the way tool and context boundaries started to become less hand-wavy. The Model Context Protocol (MCP) was the clearest example for me. A protocol does not magically solve reliability, but it makes one thing harder to ignore: context is an interface.

That sounds small, but it changed how I think. For a while, "context" in my head was this soft substance that the model absorbed if I fed it enough notes and documents. Protocol thinking broke that illusion. What is being passed in? In what shape? With what trust level? What happens if that connection lies, stalls, or returns half the thing I expected? Those questions are not glamorous, but they are the questions that keep a system from becoming a pile of vibes.
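One way to keep those questions from staying rhetorical is to wrap everything that enters a run in a small envelope. This is a sketch under stated assumptions and is not the MCP API: `Trust`, `ContextItem`, `wrap`, `admit`, and the `REQUIRED_FIELDS` tuple are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Trust(Enum):
    USER = "user"
    INTERNAL = "internal"
    EXTERNAL = "external"


@dataclass(frozen=True)
class ContextItem:
    source: str      # where the content came from
    trust: Trust     # how much weight it is allowed to carry
    payload: dict    # the content itself, in a known shape
    complete: bool   # False when the source returned only part of it


# Hypothetical shape that every payload is expected to have.
REQUIRED_FIELDS = ("title", "body")


def wrap(source: str, trust: Trust, payload: dict) -> ContextItem:
    """Record provenance and check the shape at the boundary."""
    complete = all(key in payload for key in REQUIRED_FIELDS)
    return ContextItem(source, trust, payload, complete)


def admit(item: ContextItem) -> tuple:
    """Return (ok, reason) for whether the item may enter the run."""
    if not item.complete:
        return False, "incomplete payload"
    if item.trust is Trust.EXTERNAL:
        return True, "admitted as data, not as instructions"
    return True, "admitted"
```

A half-returned document is rejected at the boundary with a reason, which is easier to debug than a model quietly reasoning over a fragment.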

This was not obvious to me at first. I used to think the main upgrade path would come from better prompting and larger models. Now I think a lot of the practical progress comes from cleaner contracts: structured tool calls, typed outputs, stable context transport, and permission checkpoints where side effects begin.

Once I started seeing it that way, "agent engineering" stopped feeling like a brand-new discipline. It looked more like distributed systems work with an unusually slippery planner in the middle. That framing helped me. The problems became familiar enough to debug.

Tool schema

The model gets a narrower surface, and humans get something they can test and version.

Context transport

What enters the run is not magic. It arrives through a protocol, an adapter, or a retrieval path.

Human checkpoints

Approvals, confirmations, and side-effect boundaries become part of the design, not an afterthought.

Eval loop

Teams stop asking whether the run looked smart and start asking whether it behaved the same twice.

Pain points

Where current agent systems still hurt, and where I would still invest

When people ask whether agents "work," I never know how to answer. Work for what? A demo? A daily workflow? A tool that someone else has to debug next month? The pain points only become clear once you ask the system to survive repeated use.

Tool reliability is still first on my list. A model can be broadly capable and still be sloppy about the exact argument shape, the preconditions for a call, or the difference between "tool available" and "tool useful right now." This is why I still think tool grammar is worth real investment. Clear schemas, narrower actions, validation before side effects, and local verification make the whole system less theatrical and more real.
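A preflight check is one small form that "validation before side effects" can take. The sketch below uses only the standard library. `ToolSpec`, `preflight`, and the `send_email` tool are hypothetical, and a real system would more likely lean on a schema library than on `isinstance`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    name: str
    required: dict          # argument name -> expected Python type
    has_side_effects: bool


def preflight(spec: ToolSpec, args: dict, approved: bool = False) -> list:
    """Return a list of problems; an empty list means the call may proceed."""
    problems = []
    for key, expected in spec.required.items():
        if key not in args:
            problems.append(f"missing argument: {key}")
        elif not isinstance(args[key], expected):
            problems.append(f"{key} should be {expected.__name__}")
    for key in args:
        if key not in spec.required:
            problems.append(f"unexpected argument: {key}")
    # Side effects need an explicit approval, even when the arguments are valid.
    if spec.has_side_effects and not approved:
        problems.append("side effect requires approval")
    return problems


# Hypothetical tool used only for illustration.
send_email = ToolSpec("send_email", {"to": str, "body": str}, has_side_effects=True)
print(preflight(send_email, {"to": 42}))
```

Returning every problem at once, and not just the first, gives the model (or the human reading the trace) enough to correct the call in one retry.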

Right behind that is state drift. This one has become the most familiar failure mode for me. A run starts with a decent plan, makes a few correct moves, then slowly forgets what success was supposed to look like. More memory does not fix that by itself. The work worth doing is explicit run state, step checkpoints, resumability, and a memory policy that separates instruction, artifact, evidence, and scratchpad.
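Explicit run state can start as a single dataclass with a checkpoint function. This is a minimal sketch, assuming JSON-serializable contents. `RunState`, `checkpoint`, and `resume` are hypothetical names, and the choice to drop the scratchpad at checkpoint time is one possible policy, not the only reasonable one.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunState:
    instruction: str                                 # the original task, never rewritten
    step: int = 0
    artifacts: list = field(default_factory=list)    # outputs produced so far
    evidence: list = field(default_factory=list)     # observations that back decisions
    scratchpad: list = field(default_factory=list)   # working notes, safe to drop


def checkpoint(state: RunState) -> str:
    """Serialize everything except the scratchpad."""
    data = asdict(state)      # asdict copies, so the live state is untouched
    data["scratchpad"] = []
    return json.dumps(data, sort_keys=True)


def resume(blob: str) -> RunState:
    """Rebuild the run state from a checkpoint string."""
    return RunState(**json.loads(blob))
```

Keeping the instruction as its own field means a resumed run can always be compared against what it was originally asked to do, which is the comparison that drift quietly erodes.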

Recovery is the next weak spot. Current agents are still much better at continuing than at noticing they should stop, reassess, or ask for help. That is why I still think subgoal checks, stop reasons, confidence gates, and human handoff rules deserve more attention than clever prompt choreography. The systems I trust know how to lose confidence in a legible way. That sounds like a small thing until you watch an agent confidently dig itself deeper.
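A gate that runs between steps is one way to make "losing confidence legibly" concrete. The sketch below is hypothetical throughout: the `gate` function, the thresholds, and the idea that a usable `confidence` number exists (from a verifier, a self-check, or some other signal) are all assumptions.

```python
from enum import Enum


class Decision(Enum):
    CONTINUE = "continue"
    REASSESS = "reassess"
    HANDOFF = "handoff"


def gate(confidence: float, consecutive_failures: int, subgoal_met: bool,
         min_confidence: float = 0.6, max_failures: int = 2) -> tuple:
    """Return (decision, stop_reason); the reason is empty when continuing."""
    if consecutive_failures >= max_failures:
        return Decision.HANDOFF, f"{consecutive_failures} consecutive tool failures"
    if not subgoal_met:
        return Decision.REASSESS, "subgoal check failed"
    if confidence < min_confidence:
        return Decision.HANDOFF, f"confidence {confidence:.2f} below {min_confidence:.2f}"
    return Decision.CONTINUE, ""
```

The stop reason is a plain string on purpose. It ends up in the trace, and it is the first thing a human reads when control comes back to them.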

Then there is evaluation. This is still where too many projects become hand-wavy. A pleasant-looking run can hide a lot of instability. I would keep investing in replayable traces, failure taxonomies, task-level evals, and regression suites that measure whether the agent stayed on task, used the right tools, preserved the right state, and stopped for the right reason. This work is not flashy. It is also where I think the next real gains are hiding.
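A failure taxonomy does not need much machinery to be useful. The sketch below buckets repeated runs of the same task by what went wrong. `Trace`, `classify`, `evaluate`, and the bucket names are hypothetical, and real traces would carry far more detail than three fields.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    tools_used: tuple
    stop_reason: str
    succeeded: bool


def classify(trace: Trace, allowed_tools: set, expected_stop: str) -> str:
    """Put one run into exactly one bucket, checking behavior before outcome."""
    if set(trace.tools_used) - allowed_tools:
        return "wrong_tool"
    if trace.stop_reason != expected_stop:
        return "wrong_stop"
    if not trace.succeeded:
        return "task_failed"
    return "pass"


def evaluate(traces: list, allowed_tools: set, expected_stop: str) -> dict:
    """Summarize repeated runs of one task as a pass rate plus failure buckets."""
    if not traces:
        raise ValueError("need at least one trace")
    buckets = Counter(classify(t, allowed_tools, expected_stop) for t in traces)
    return {"pass_rate": buckets["pass"] / len(traces), "buckets": dict(buckets)}
```

Running the same task several times and reading the buckets answers the question a single clean demo cannot: when it fails, does it fail for the same reason?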

Pain: Tool misuse

The model often knows a tool exists without really knowing how to use it cleanly.

Worth investing: Tool grammar and validation

Tighter schemas, safer arguments, better preflight checks, and smaller action surfaces.

Pain: State drift over long runs

The run slowly stops solving the original task even though each local step looks plausible.

Worth investing: Checkpointed state

Explicit plan state, resumable runs, and memory policies that say what deserves to persist.

Pain: Bad recovery behavior

Agents keep going after weak evidence, partial failure, or obvious ambiguity.

Worth investing: Recovery and handoff

Stop reasons, confidence gates, approvals, and rules for when the human should re-enter.

Pain: Demo optimism

One clean run looks convincing even when the system is unstable across repeats.

Worth investing: Eval and observability

Replayable traces, failure buckets, and regression tests that watch behavior instead of vibes.

Memory

Memory stopped feeling abstract once I had to keep a run coherent

Memory was another place where my thinking had to get less poetic. I used to treat it like a nicer chat history, or a retrieval add-on that would make the agent feel more personal. Then I read MemoryArena and recognized the real problem immediately. Agents have to recall old information, but they also have to learn from earlier sessions, carry the right state forward, and use it later inside a dependent task. That is a different problem.

I ran into this in a very ordinary way. A system would do fine inside one run, then fall apart the moment I expected it to remember a previous decision without re-importing the whole transcript. Sometimes it remembered too much and dragged stale assumptions into a new task. Sometimes it remembered too little and acted like the earlier correction never happened. Both failures looked like "memory" bugs from the outside. Underneath, they were state management bugs.

That changed my temperament a bit. I used to get excited whenever I made the system remember more. Now I get suspicious first. Extra memory can make a demo look smoother while quietly making the next week of debugging worse. I had to learn that forgetting is part of the design too, at least if you want the system to stay usable.

Once I started building with that lens, memory looked less like a single subsystem and more like a set of contracts. What counts as scratch state for the current run? What counts as a durable user preference? Which artifacts are evidence, and which are only working notes? Which earlier decisions should constrain the next tool call, and which should be allowed to die quietly? Those questions mattered more than the choice of vector store.

Scratch state

Short-lived plan steps, intermediate outputs, and the next action.

User memory

Stable preferences, repeated corrections, and collaboration rules.

Artifact memory

Files, notes, retrieved documents, and outputs that can be inspected later.

Checkpoint memory

What lets a run pause, recover, and continue without pretending nothing happened.
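The four kinds above can be written down as a retention policy before any storage is chosen. This is a sketch under assumptions: `Kind`, `MemoryItem`, and `end_of_run` are hypothetical names, and the rule that checkpoints are dropped once a run finishes is one possible policy, picked here for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    SCRATCH = "scratch"
    USER = "user"
    ARTIFACT = "artifact"
    CHECKPOINT = "checkpoint"


@dataclass(frozen=True)
class MemoryItem:
    kind: Kind
    content: str


def end_of_run(items: list, finished: bool) -> list:
    """Decide what survives the current run; everything else is forgotten."""
    kept = []
    for item in items:
        if item.kind in (Kind.USER, Kind.ARTIFACT):
            kept.append(item)      # durable by policy
        elif item.kind is Kind.CHECKPOINT and not finished:
            kept.append(item)      # only needed while the run can still resume
        # Scratch state is never carried forward.
    return kept
```

Forgetting is the default here, and persistence has to be justified by kind. That inversion is the design choice the sketch is meant to show.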

Learning path

The path that finally worked for me was smaller

I made progress once I stopped trying to "learn agents" as one giant topic. That phrase is too big. It lets you read forever. What helped was learning one layer at a time and keeping the loop inspectable.

I also had to stop using reading as a substitute for building. I read a lot of repos and papers early on, and some of that was useful. But the real learning happened when I had a concrete task in front of me and could feel where the system became slippery. Reading gave me vocabulary. Building gave me judgment.

More specifically, I stopped starting from multi-agent orchestration. That was too far downstream. If one agent cannot use one tool cleanly, adding three more agents mostly gives you three more places to hide the bug. I also stopped treating memory as a prerequisite. First I needed a task, a tool surface, and a way to tell real success from a nice-looking run.

01 Start with one bounded task

Pick one job with one obvious success condition, then expose the tools plainly.

02 Log decisions and stop reasons

If a run fails, I want to know whether it was blocked, confused, or simply done.

03 Write evals before more autonomy

It is too easy to add loops and too hard to tell whether they improved anything.

04 Add memory last

State only becomes durable after I know which facts deserve to survive the current run.

This sequence is slower if the goal is to ship a flashy demo in a weekend. It is faster if the goal is to understand why the system broke on Tuesday and how to keep that exact breakage from returning on Friday.

Skills now

What I now look for in someone learning agents

I care much less about whether someone can talk fluently about agents. I care more about whether they can make one concrete workflow hold together. Can they define a tool well enough that the model uses it cleanly? Can they decide what state should persist and what should not? Can they create an eval that catches a fake success? Can they make the failure legible enough that another engineer can debug it without reading tea leaves?

I have also changed what impresses me. A year ago I was much more likely to be impressed by a clever orchestration diagram or a smooth demo clip. Now I pay attention to the boring artifacts: run traces, retry behavior, tool schemas, review checkpoints, and whether the team can explain a bad result without pretending it was mysterious. That is where my trust comes from now.

This is less exciting than the old autonomy story. I think it is the better bet. The agent engineers I learn from usually behave like careful systems people. They trim scope, keep interfaces plain, store evidence instead of vibes, and accept that human review still belongs inside the loop for anything that matters.

If I had to condense the skill set, I would make it less fancy: choose the right workflow, keep state and side effects under control, write evals that catch regressions, and know when the model should stop and ask for help. That is a better definition of "learning agents" than the one I started with.

The correction I needed

I still think agents matter. I just stopped treating them as a shortcut. The ones I trust feel less like a brilliant assistant and more like a careful operator with tools, logs, checkpoints, and permission to admit that the task is not done yet. That was the correction I needed.

If I were starting again, I would begin with one narrow workflow, one real tool, and one eval I trust. Then I would make the state visible, add recovery, and only after that ask whether the system deserves more freedom. Less dramatic, yes. Also much more useful.

That is also where I still think the field should spend energy. Not on making agents sound a little smoother. Not on adding another layer of autonomy language. I would spend that energy on tool use, state discipline, recovery, handoff, and evaluation. Those are still the places where the next real gains feel available.