Of Benchmarks, Speed and Cost
The AI conversation is fixated on the wrong numbers
AI has a metric problem.
I want to discuss three metrics, and why focusing on them is distracting people from the real work that needs to be done.
Benchmarks
Let’s make this one short. Benchmarks were useful in the beginning because they gave people a simple way to 1) compare models to each other and 2) compare a newer version of a model to its previous versions. As the models have advanced, however, two things have happened. Either models have maxed out specific benchmarks, to the point where a 3% increase from one version to the next is not meaningful when both models score in the 90% range. Or the AI companies have started training their models to improve their results on specific benchmarks, yielding benchmark gains that do not reflect real-world improvement.
Either way, benchmark results alone no longer tell you what to expect from a specific model when you implement it for your specific use cases.
Speed
For speed, I want to ask a seemingly simple question:
“At what point does our ability to create outstrip our ability to be thoughtful about what we’re creating?”
I think we’re already here. You can write entire essays, create music and videos, and make software applications in hours. How thoughtful can you be about the subject of an essay in a single day? How thoughtful can you be about an end user experience for your software application in a single day?
Speed is almost always the wrong metric. Even in car racing, which is one of the human activities most singularly focused on speed, focusing on speed alone is a recipe for disaster. Race cars have brakes for a reason. “Fast is slow and slow is fast” - even race cars have to slow down in order to go fast.
When we take all the things we can do with AI and try to optimize for speed, we are like a race car driver trying to take a hairpin turn at 100 mph. We give ourselves the illusion of efficient speed, all the way up until we realize we can’t actually move the car in the direction it needs to go without slowing down first.
Let’s translate that into the Software Engineering world. Optimizing for speed would have me, as a Solution Architect, talk to my customer once, gather as much information as I can in a single session, and then start a vibe coding session with my AI to create a piece of software to solve my customer’s problems. In this scenario, my AI might ask questions I don’t have answers to. If I’m optimizing for speed, I will want to make assumptions to answer those questions rather than take the time to schedule another meeting with my client. The more assumptions I make, the faster I’m able to go, so I keep making assumptions. And before I know it, I have a fully functional piece of software built on tens if not hundreds of wrong assumptions about what my client really wants. I might have a working piece of software, but I certainly don’t have one that meets my client’s requirements, since I never bothered to ask after my initial conversation.
Cost
The first issue with cost as a GenAI metric is that cost per token is misleading. The Stanford AI Index will tell you that the cost per token of a GPT 3.5-level model has gone down from $20 per million tokens in November 2022 to $0.07 per million tokens in October 2024, representing a 280-fold reduction in cost.
But here is the thing: you wouldn’t use a GPT 3.5-level model for most of your GenAI use cases in April 2026. As the models have evolved, so have our use cases and our need for increasing levels of intelligence. In April 2026, Anthropic’s latest model costs $25 per million output tokens and $5 per million input tokens, putting it on roughly the same cost scale as those GPT 3.5-level models from November 2022.
What makes this worse is that your more sophisticated use cases in April 2026 actually consume more tokens than your use cases in November 2022 did. Not only has the per-token price not really gone down for the models you actually use, the cost of carrying out specific tasks has gone up, because the relative complexity of those tasks has grown and each task requires more tokens.
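The arithmetic is simple enough to sketch. The prices below come from the figures above; the token counts per task are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope: cost per task, not cost per token.
# Prices are per million tokens; token counts are illustrative assumptions.

def task_cost(input_tokens, output_tokens, input_price, output_price):
    """Cost in USD of one task, given per-million-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# November 2022: a simple task on a GPT 3.5-level model
# (flat $20 per million tokens, short prompt and short completion).
cost_2022 = task_cost(2_000, 1_000, 20.00, 20.00)

# April 2026: a more sophisticated task on a frontier model
# ($5/M input, $25/M output, long context and long output).
cost_2026 = task_cost(50_000, 10_000, 5.00, 25.00)

print(f"2022 task: ${cost_2022:.3f}")  # $0.060
print(f"2026 task: ${cost_2026:.3f}")  # $0.500
```

Under these assumptions, the per-task cost goes up almost tenfold even though the headline per-token price looks comparable, which is exactly why cost per use case is the more honest number.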
So you’ll at least have more reliable numbers if you measure your cost by use case instead of by tokens, but I would still argue that cost is the wrong metric to focus on.
We cannot talk about cost without talking about value.
$1,000/month would be very expensive for a telephone line because a) most of us don’t need a landline and b) the service is so commoditized that it would be easy to find much cheaper options. The service has little value, and there are plenty of people willing to provide that small value at a correspondingly small price, so $1,000/month is objectively a high cost.
What’s the equivalent reasoning for GenAI costs? In the US, $200/month buys you the most expensive subscription plan that either Anthropic or OpenAI has to offer. For the sake of this discussion, let’s assume you’re able to max out this subscription and need to buy four more to do all the things you’re going to do with AI. That’s $1,000/month.
Is that expensive?
Well, what are you using it for? At $100/hour, you’d need to save 10 hours/month in order to have those subscriptions pay for themselves. $100/hour in the US is a reasonable rate for several roles I rely on regularly: an editor like the one I might use for this blog, a website administrator to manage my website, an Excel guru for the complex pivot table and reporting charts I created earlier this week, or a data engineer to convert raw data for a dashboard. As an IT professional and a consultant, I go through these types of tasks multiple times every day, and I save hours’ worth of work multiple times every day. Even if that time was only worth $100/hour, I would only need 10 of those hours every month to make $1,000/month worth it. And I wouldn’t come near maxing out a single subscription, let alone five.
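The break-even math above fits in a few lines. The rate, subscription price, and subscription count are the assumptions from this discussion, not universal figures:

```python
# Break-even on AI subscriptions, measured in hours of human work replaced.
# All three inputs are the assumptions from the text above.

hourly_rate = 100          # USD/hour for the roles described above
subscription_price = 200   # USD/month for a top-tier plan
subscriptions = 5          # one maxed-out plan plus four more

monthly_cost = subscription_price * subscriptions
breakeven_hours = monthly_cost / hourly_rate

print(f"Monthly cost: ${monthly_cost}")              # $1000
print(f"Break-even: {breakeven_hours} hours/month")  # 10.0 hours/month
```

Change any input and the break-even shifts proportionally: at $200/hour, five hours saved per month already covers all five subscriptions.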
The real work
So if benchmarks, speed and cost are not the right things to focus on, what is?
First we have to recognize AI is not a toy, and it’s not a hobby. If you are using AI to solve problems, then the real work is to identify the value being created by solving those problems. That work forces you to answer the critical questions. What problems are you trying to solve? What are the leading indicators that will tell you that you have solved those problems successfully? What objective metrics can you test to validate your solutions? What subjective qualities do you need to consider when evaluating your solutions?
This assumes you already know what you are using AI for. If you don’t, that is the first problem to solve, and it is a harder one than any of these metrics can help you with.
Once you have those answers, you can ignore the benchmarks. When a new model comes out, you test it on the problems you are solving. If it works better, you use it. If it doesn’t, you don’t. If you can’t tell, then you have a new problem at the top of your list, which is to build an eval framework that works in your domain.
Speed and cost should not drive your decisions either.
Speed is already at a point where you cannot max out your speed and stay thoughtful about what you’re actually creating. If you create/build/write as fast as you already can, then you’re not creating/building/writing anything worth consuming. You’re just making a slop factory.
At current inference costs, our ability to create value should outstrip the cost of creating that value by several orders of magnitude. There are not many white-collar roles billed at less than $100/hour. At that rate, you need to replace just 2 hours of human work per month to justify the most expensive subscription OpenAI or Anthropic will sell you. If your people are unable to generate 2 hours of billable value from a full month of tokens, the problem is not the tools. It’s the deployment, the workflow, the training, or the use case itself. And those are the problems worth your attention.
AI’s metric problem is pulling attention away from work that actually matters. Forget benchmarks, speed and cost. The real question is whether your people have the time, the problems, and the tools to do work worth doing.


