GPT-5 failed the wrong test
This post isn’t really about GPT-5. Sure, it launched and people are somewhat disappointed. It’s the why that bugs me.
They expected AGI, the AI god, but instead got merely the best model in the world. v disappointing
A few days before the GPT-5 launch I read this paper, Agentic Web: Weaving the Next Web with AI Agents. It's not my usual kind of paper; it isn't very academic. There's no math in it, no architecture. It just paints a picture of the future.
And that’s the lens I saw GPT-5 through.
The paper describes three eras of the internet:
- PC Era — Wikipedia, Craigslist, etc.; users actively seek information
- Mobile/Social Era — TikTok, Instagram, etc.; content is pushed via recommendation algorithms
- Agentic Web — user merely expresses intent

When I weigh the strengths of GPT-5, it feels poised and ready for the agentic web.
How do I vibe test an LLM?
I use it. If it changes how I work or think, then it’s a good LLM.
o3 dramatically changed how I work. GPT-4 did as well. GPT-5 didn't, because it's the end of the line. You can't really make a compelling LLM anymore; they're all so good most people can't tell them apart. Even the tiny ones.
I talked to a marketing person this week. I showed them Claude Code. They don't even write code, but they insisted it was 10x better than any model they'd used before, even Claude. I'd echo the same thing: there's something about those subagents. They zoom.
Claude Code is software.
Sure, there's a solid model behind it. But there are a few features that make it really tick. Replicate those and you're well on your way.
GPT-5 is for the agentic web
The first time I heard "agentic web" I almost vomited in my mouth. It sounds like the kind of VC-induced buzzword cesspool that I keep my distance from.
But this paper...
I want AI to do all the boring work in life. Surfing sites, research, filling out forms, etc.
Models like GPT-5 and gpt-oss are highly agentic, and all the top models are going in that direction. Labs put them in a software harness, apply RL, and update their weights according to how well they used their tools. They're trained to be agents.
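To make that concrete, here's a toy sketch of that training loop. Everything in it (the Policy class, the tools, the binary reward) is invented for illustration; no lab's actual pipeline looks this simple:

```python
import random

TOOLS = {
    "search": lambda q: f"results for {q!r}",
    "calculator": lambda expr: str(eval(expr)),  # toy tool; never eval untrusted input
}

class Policy:
    """Stand-in for the LLM policy: picks a tool for a task."""
    def act(self, task):
        return random.choice(list(TOOLS)), task

    def update(self, trajectory, reward):
        # A real pipeline would apply a policy-gradient-style weight update here.
        pass

def run_episode(policy, task, expected_tool):
    tool, args = policy.act(task)
    observation = TOOLS[tool](args)
    reward = 1.0 if tool == expected_tool else 0.0  # reward good tool use
    policy.update([(task, tool, observation)], reward)
    return reward

policy = Policy()
mean = sum(run_episode(policy, "2+2", "calculator") for _ in range(100)) / 100
print(f"mean reward: {mean:.2f}")
```

The real versions swap the random policy for an LLM and the binary reward for graded judgments of whole multi-step trajectories, but the shape is the same: roll out in a harness, score the tool use, update the weights.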
I hear a lot of criticism of GPT-5, but none of it from people who recognize that it can go 2-4 hours between human contact while working on agentic tasks. Whoa.
GPT-5 is for the agentic web.
yeah but i hate ads
Well okay, me too. Not sure where that came from, but I don't think that's where this is going. Well, it's exactly where it's going, just not in the way you're thinking.
The paper talks about this. People need to sell stuff; that won't change. They want you to buy their stuff. All of that stays the same.
The difference is agents. In the agentic web, everything is mediated by agents.
You don't search for a carbon monoxide monitor; you ask your agent to buy you one. Actually, you don't even do that: your agent senses the old one is about to die and suggests a replacement before it starts chirping in the middle of the night (eh, yeah, sore topic for me).
You're a seller and you're trying to game the system? Ads manipulate consumers, but consumers aren't the ones buying anymore. So who do you manipulate? Agents. They're the ones making the decisions in the agentic web.
The paper calls this the Agent Attention Economy, and it operates under the same constraints: attention is still limited, even agent attention, and you still need agents to pick your thing.
The paper makes some predictions: the authors think there will be brokers (like ad brokers) that advertise agents and resources to be used. So I guess you'd game the system by making your product seem more useful or better than it is, so it looks appealing to agents and more agents use it.
I’m not sure what that kind of advertising would look like. Probably like today’s advertising, just more invisible.
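If I had to guess at the mechanics, it might look something like this toy broker; every name and number here is made up:

```python
from dataclasses import dataclass, field

@dataclass
class Listing:
    name: str
    advertised_utility: float  # what the seller claims
    observed_utility: float    # what agents actually experience

@dataclass
class Broker:
    listings: list = field(default_factory=list)

    def register(self, listing: Listing) -> None:
        self.listings.append(listing)

    def top(self, k: int = 1) -> list:
        # A naive broker ranks by claims, so gaming it is just inflating the claim.
        return sorted(self.listings, key=lambda l: l.advertised_utility, reverse=True)[:k]

broker = Broker()
broker.register(Listing("honest-monitor-api", advertised_utility=0.7, observed_utility=0.7))
broker.register(Listing("hyped-monitor-api", advertised_utility=0.99, observed_utility=0.4))
print([l.name for l in broker.top()])  # the hyped listing wins the agents' attention
```

Presumably brokers would fight back by ranking on observed utility instead, and the gaming would move somewhere subtler. That's the "more invisible" part.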
Benchmarks
The only benchmark that matters is how much it changes life.
At this point, I don’t think 10T parameters is really going to bump that benchmark any. I don’t think post-training on 100T tokens of math is going to change much.
I get excited about software. We're at a point where software is so far behind the LLMs that even slight improvements in agent harness design yield outsized rewards; look at how Claude Code is still better than OpenAI's codex-cli running GPT-5, the better coding model.
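When I say harness, I mean something shaped like this loop. The `llm` stub, the tool names, and the message format are all stand-ins, not any real API:

```python
TOOLS = {"echo": lambda text: text}  # a real harness exposes file edits, shell, search...

def llm(messages):
    """Placeholder for a chat-completions call; always 'finishes' in one step here."""
    return {"tool": "echo", "args": {"text": "done"}, "done": True}

def run(task, max_steps=8):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = llm(messages)                          # model proposes a tool call
        result = TOOLS[step["tool"]](**step["args"])  # harness executes it
        messages.append({"role": "tool", "content": result})  # result fed back
        if step["done"]:
            break
    return messages

print(run("fix the failing test"))
```

Everything interesting about Claude Code (subagents, context management, tool design) lives in that loop, not in the model weights.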
My suspicion is that none of the AI models are going to seem terribly appealing going forward without massive leaps in the software harness around the LLM. The only way to really perceive the difference is how it changes your life, and we’re long past where a pure model can do that.
Not just software, but also IT infrastructure. Even small questions like "when will AI get advertising?" matter. If an AI model literally got advertising baked straight into the heart of the model, that would make me sad. It would mean the creators aren't seeing the same vision.
We’ve talked a lot about the balance between pre-training and post-training, but nobody seems to be talking about the balance between LLMs and their harnesses.
Areas for growth
Before we see significant improvement in models, we're going to need a lot more progress in:
- Memory — stateful agents that don't forget you (see the sketch after this list)
- Harnesses — the software around the LLM inside the agent
- Networking & infra — getting agents to discover and leverage each other
And probably several other low-hanging areas.
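On the memory point, even something this dumb would beat what most agents do today; the file name and helpers are hypothetical:

```python
import json
import pathlib

MEMORY_PATH = pathlib.Path("agent_memory.json")  # hypothetical storage location

def load_memory() -> dict:
    return json.loads(MEMORY_PATH.read_text()) if MEMORY_PATH.exists() else {}

def remember(key: str, value: str) -> None:
    memory = load_memory()
    memory[key] = value
    MEMORY_PATH.write_text(json.dumps(memory))

def build_context(task: str) -> str:
    # Prepend stored facts so the model starts each session knowing the user.
    facts = "\n".join(f"- {k}: {v}" for k, v in load_memory().items())
    return f"Known about the user:\n{facts}\n\nTask: {task}"

remember("name", "Kyle")
remember("hates", "filling out forms")
print(build_context("renew my car registration"))
```

Load the facts at session start, write new ones as they come up, and the agent stops forgetting you between sessions.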