The model is not the hard part anymore. The hard part is what happens when a tool call fails quietly, when retries stack up behind a cheerful UI, and when the "small" prompt edit you merged on Tuesday changes routing in a way your evals never touched. I keep seeing teams treat the agent as the product and the harness as plumbing. In production, that is backwards: the harness is what users actually feel.
The 2026 writeups on shipping agents for real skew the same direction: less hero demo, more guardrails, budgets, and operational honesty. See Viqus on lessons from deploying AI agents in production, Harness Engineering's notes from the field, and Kevin Tan's production AI agent playbook. I am not summarizing those posts here; they are representative of the failure cluster I mean.
Where it actually breaks
Tooling often fails without drama. A 500 from a dependency, a partial JSON payload, a search index that returns nothing useful, and the model still narrates forward. The user gets an answer that sounds finished. Support gets the opposite of a stack trace.
Orchestration you cannot bisect is orchestration you will not fix under pressure. If you cannot replay a run from logged intents, tool inputs, and outputs, you are debugging vibes. That is fine for a hackathon. It is expensive for a team on call.
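A minimal sketch of what "replay a run from logged intents, tool inputs, and outputs" can mean. Every name here (`RunLog`, the field names) is illustrative, not a real library; the point is that an empty retrieval becomes a logged fact instead of a vibe to reconstruct later.

```python
import json
from dataclasses import dataclass, field

@dataclass
class RunLog:
    """Append-only record of every tool call, so a run can be replayed offline."""
    steps: list = field(default_factory=list)

    def record(self, intent, tool, args, output=None, error=None):
        self.steps.append({
            "intent": intent, "tool": tool, "args": args,
            "output": output, "error": error,
        })

    def replay(self):
        """Re-drive the logged tool inputs against a stub or the real tool."""
        for step in self.steps:
            yield step["tool"], step["args"], step["output"]

log = RunLog()
log.record("find invoice", "search", {"q": "INV-1042"}, output=[], error=None)
serialized = json.dumps(log.steps)  # ship to wherever your run store lives
```

With a log like this, bisecting a bad run is diffing two JSON arrays, not reading a transcript and guessing.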
"We will notice if it goes weird" is not a validation layer. Structured checks on tool arguments, post-conditions, and refusal or escalation rules are. Golden sets and named eval scenarios belong in that same harness: versioned like any other code, not only in a notebook. Before you chase the next benchmark, ask whether the system can detect drift while nobody is watching.
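As a sketch of what structured checks look like in code: a pre-condition on tool arguments and a post-condition on the result, for a hypothetical refund tool. The field names and the auto-approve cap are invented for illustration.

```python
def check_refund_args(args):
    """Pre-condition: reject before the tool runs, not after."""
    errors = []
    amount = args.get("amount")
    if not isinstance(amount, (int, float)) or amount <= 0:
        errors.append("amount must be a positive number")
    elif amount > 500:  # illustrative cap: above this, escalate instead
        errors.append("amount above auto-approve cap; escalate to a human")
    if not str(args.get("order_id", "")).startswith("ord_"):
        errors.append("order_id has the wrong shape")
    return errors

def check_refund_result(args, result):
    """Post-condition: the tool's answer must match what was asked."""
    errors = []
    if result.get("refunded") != args.get("amount"):
        errors.append("refunded amount does not match the request")
    return errors
```

Checks like these are boring by design: they fail loudly at the exact step that went wrong, which is what "we will notice" never does.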
Stateless retries and ever-growing context windows show up on the invoice before they show up in accuracy charts. Caps, backoff, and explicit truncation policy belong in the harness, not in a note in Slack.
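The caps-and-backoff point fits in a few lines. Everything here is a sketch: `ToolTransientError`, the attempt cap, and the character budget are placeholder policy choices, not recommendations.

```python
import time

class ToolTransientError(Exception):
    """Illustrative: the class of errors worth retrying at all."""

def call_with_cap(fn, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Bounded retries with exponential backoff; the cap is policy, not a Slack note."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ToolTransientError:
            if attempt == max_attempts:
                raise  # surface the failure; do not narrate past it
            sleep(base_delay * (2 ** (attempt - 1)))

def truncate_context(messages, max_chars=8_000):
    """Explicit truncation: keep the most recent turns under a hard budget."""
    kept, total = [], 0
    for msg in reversed(messages):
        total += len(msg)
        if total > max_chars:
            break
        kept.append(msg)
    return list(reversed(kept))

# Demo: a tool that fails twice, then succeeds on the third attempt.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ToolTransientError("upstream 500")
    return "ok"

result = call_with_cap(flaky, sleep=lambda s: None)  # no real sleeping in the demo
```

Both functions make the invoice-shaped decisions (how many retries, how much context) explicit and testable instead of implicit and discovered at billing time.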
What I put in the harness
Timeouts, idempotency keys for side effects, and allowlisted argument shapes. The model proposes; the runtime enforces. That is the same spirit as least-privilege tool design, which matters as soon as the LLM sits on real APIs.
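A sketch of "the model proposes; the runtime enforces," with a hypothetical allowlist of argument shapes and an idempotency key derived from the intended side effect. The tool names and required keys are invented for the example.

```python
import hashlib
import json

ALLOWED_TOOLS = {
    # tool name -> required argument keys: an allowlisted shape, not a free-form dict
    "issue_refund": {"order_id", "amount", "reason"},
    "lookup_order": {"order_id"},
}

def enforce(tool, args):
    """Reject a proposed call the runtime does not recognize as well-formed."""
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"tool {tool!r} is not allowlisted")
    if set(args) != ALLOWED_TOOLS[tool]:
        raise ValueError(f"unexpected argument shape for {tool!r}: {sorted(args)}")
    return True

def idempotency_key(tool, args):
    """Same intended side effect -> same key, so a retried write cannot double-fire."""
    payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:16]
```

The key is deterministic over the canonicalized call, so a retry after a timeout reuses the same key and the downstream API can deduplicate.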
Telemetry for agent failure modes should be first-class: tool errors, retry storms, empty retrievals, judge abstentions. Pair product metrics with quality signals so you can tell a model regression from a downstream bad deploy.
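One way to make those failure modes first-class counters rather than log lines you grep for later. The counter names and the retry-storm threshold are assumptions for the sketch, not a standard.

```python
from collections import Counter

class AgentTelemetry:
    """Counters for the agent failure modes that product metrics hide."""

    def __init__(self):
        self.counts = Counter()

    def tool_error(self, tool):
        self.counts[f"tool_error.{tool}"] += 1

    def retry(self, tool):
        self.counts[f"retry.{tool}"] += 1

    def empty_retrieval(self, index):
        self.counts[f"empty_retrieval.{index}"] += 1

    def judge_abstained(self):
        self.counts["judge_abstained"] += 1

    def retry_storm(self, tool, threshold=10):
        """Crude but honest: a storm is a count crossing a line you chose in advance."""
        return self.counts[f"retry.{tool}"] >= threshold
```

In a real system these would feed whatever metrics backend you already run; the point is that "empty retrieval" gets a name and a number before anyone has to notice it by feel.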
Human gates stay where the blast radius is real: refunds, account closures, irreversible writes. The goal is not to remove humans. The goal is to keep automation inside a perimeter where the exceptions still deserve a person.
If a change touches routing or tools, I want named scenarios in CI failing in a pull request before they fail for a customer. An eval that lives only in a notebook gates nothing.
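What a named scenario looks like when it is just code in CI, with a stand-in router. The scenario names, intents, and the cap are illustrative; the real router is whatever your harness calls.

```python
# Named eval scenarios, versioned next to the routing code they protect.
SCENARIOS = [
    ("refund_under_cap", {"intent": "refund", "amount": 20}, "issue_refund"),
    ("refund_over_cap", {"intent": "refund", "amount": 900}, "escalate_to_human"),
    ("unknown_intent", {"intent": "haiku"}, "fallback"),
]

def route(request):
    """Stand-in router for the sketch; substitute your real routing entry point."""
    if request.get("intent") != "refund":
        return "fallback"
    return "issue_refund" if request.get("amount", 0) <= 500 else "escalate_to_human"

def test_routing_scenarios():
    for name, request, expected in SCENARIOS:
        got = route(request)
        assert got == expected, f"scenario {name!r}: expected {expected}, got {got}"
```

Because each scenario has a name, a red CI run tells you which behavior regressed, not just that something did.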
Trust, usage, and the verification gap
Surveys and commentary this year keep hitting an awkward split: people use AI assistants heavily and still do not trust the output end to end. Stack Overflow's team wrote about closing the developer AI trust gap in terms of workflow and verification, not bigger models. I read that as a hiring signal as much as a product signal. How you review agent changes, what you put in a PR description, and whether you can show a failing eval before and after say more than adjectives on a resume.
A note on portfolio prose (hybrid close)
Most of this post is not about writing style. Still, if you are shipping longform next to code, it helps to edit the way you would edit an agent pull request: look for generic uplift, fake depth, and tidy templates. Wikipedia's guide Signs of AI writing lists patterns editors see in bulk, from inflated "importance" language to vague attributions ("experts say") and over-clean listicle rhythm. In this repo, `humanizer-main/SKILL.md` turns that into a practical pass: scan for the tells, rewrite in plain clauses, then ask what still sounds obviously synthetic and cut it.
A short checklist I run on drafts:
Named source, date, or constraint instead of "many believe."
Cut promotional padding. If a sentence exists only to sound impressive, delete it.
Prefer "is" and "has" over "stands as" and "serves as."
One concrete example beats three parallel buzzwords.
Use commas or periods instead of piles of em dashes.
Read aloud. If every sentence lands the same length, break the rhythm on purpose.
Same idea in two domains: make failure modes visible, make boundaries explicit, and ship something a skeptical reader can verify.