The one-shot fallacy
Why better models will not fix the one-shot problem
Precision and accuracy are not the same
Most of us know two things about pi: its decimal expansion never ends, and we use it to calculate circles. Its well-known approximation, 3.14, is a useful place to start discussing the difference between precision and accuracy, and how that distinction should shape the way we think about GenAI.
3.14 and 1.34 are equally precise numbers. Commenting on their precision requires only that we compare the number of decimal places they have; it does not require any reference to the value they’re trying to represent.
1.34, however, is a very inaccurate representation of pi, since it doesn’t even match the first digit.
In other words, a number’s precision, on its own, does nothing to inform its usefulness for a particular task. To determine how useful it is, we have to understand its accuracy as well as its precision. Making the distinction between the two is not always as easy as comparing 1.34 and 3.14.
The difference between 3.14 and 1.34 is 57.3%. It’s very easy to spot and will show up in your calculations right away. Take a circular pool with a 10-meter radius and apply 2 x pi x radius to calculate its circumference: the difference between 2 x 3.14 x 10 (62.8 meters) and 2 x 1.34 x 10 (26.8 meters) is stark. You could walk around that pool and sense intuitively that its circumference is more than twice what the wrong value predicts.
The difference between 3.14159 (the correct value of pi to five decimal places) and 3.14278 (an equally precise, but incorrect, value) is 0.038%. Your circumference calculation now yields 62.8318 meters versus 62.8556 meters, a discrepancy far harder to detect intuitively as the wrong result.
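To make the comparison concrete, here is a short Python sketch; the function and variable names are my own, chosen only for illustration:

```python
def circumference(pi_value: float, radius: float) -> float:
    """Circumference of a circle: 2 * pi * radius."""
    return 2 * pi_value * radius

RADIUS = 10  # pool radius, in meters

# The crude error (1.34) jumps out; the precise error (3.14278) hides.
for pi_value in (3.14, 1.34, 3.14159, 3.14278):
    print(f"pi = {pi_value}: circumference = {circumference(pi_value, RADIUS):.4f} m")

# Output:
# pi = 3.14: circumference = 62.8000 m
# pi = 1.34: circumference = 26.8000 m
# pi = 3.14159: circumference = 62.8318 m
# pi = 3.14278: circumference = 62.8556 m
```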
The more precision an inaccurate value carries, the more difficult the inaccuracy is to spot.
Make your numbers bigger, however, and the error from the wrong value of pi becomes a lot more noticeable. Using 6,371 km as the radius of planet Earth, and assuming for our purposes that it’s a perfect sphere, the calculation with the high-precision wrong value of pi yields a circumference that is off by 15.2 km.
The larger the scope, the broader the consequences of a lack of accuracy, even with high precision.
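The same sketch, scaled up, shows the consequences growing with the radius; only the 6,371 km figure comes from the text above, the rest of the names are again mine:

```python
def circumference(pi_value: float, radius: float) -> float:
    return 2 * pi_value * radius

EARTH_RADIUS_KM = 6_371  # mean radius, treating Earth as a perfect sphere

correct = circumference(3.14159, EARTH_RADIUS_KM)
wrong = circumference(3.14278, EARTH_RADIUS_KM)

# The same 0.038% relative error that hid at pool scale
# now amounts to roughly 15 kilometers.
print(f"discrepancy: {wrong - correct:.1f} km")  # discrepancy: 15.2 km
```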
Fluency is not fidelity
When ChatGPT first burst onto the scene, its most impressive trick was its ability to generate grammatically correct sentences and paragraphs. Whether or not you agreed with its interpretation of the history of the French Revolution, you couldn’t dispute that it wrote about it with proper grammar and proper structure.
GenAI has been fluent from the start. As it evolved into a tool that could generate programming language code, it kept its fluency in that domain as well.
But the challenge with writing code was never its syntax. It was never its adherence to the rules of a specific programming language. Programming languages, after all, were created specifically to have a limited vocabulary and rules that are easy to follow and enforce.
The challenge has always been in making sure that the code actually implements the right solution to the right problem. Returning to our circumference example, I could write a program that uses 10 x pi x radius as the formula for circumference, instead of the correct 2 x pi x radius formula we used earlier. That code would compile and run cleanly. It would also produce the wrong answer, no matter how accurate our value of pi.
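A minimal Python illustration of the point, with both functions invented here for the example:

```python
import math

def circumference_wrong(radius: float) -> float:
    # Compiles, runs, raises no errors, and encodes the wrong formula.
    return 10 * math.pi * radius

def circumference_right(radius: float) -> float:
    return 2 * math.pi * radius

# Both are equally "fluent" Python. Only one is faithful to the geometry,
# and no amount of precision in math.pi rescues the other.
print(circumference_wrong(10))  # 314.159... wrong, but raises no error
print(circumference_right(10))  # 62.831... right
```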
GenAI’s precision does not guarantee its accuracy. The interface we use to close that gap is natural language, and that is where the next problem starts.
Why natural language cannot close the gap
No matter how we interact with computers, every layer of abstraction ultimately translates down to the zeros and ones that computer chips can understand and execute natively.
Historically, the translation layers have been “deterministic”, which means that a specific set of instructions in a programming language will always translate to a predictable, fixed sequence of zeros and ones for the computer to execute.
LLMs are different from traditional software in two ways: their input is natural language rather than a programming language, and their output is probabilistic rather than deterministic. Both matter. This essay is about the first.
In 1978, several decades before LLMs were invented, Edsger Dijkstra wrote a short essay called “On the Foolishness of Natural Language Programming”. His main claim was that natural languages are inherently imprecise. The argument has been recycled every few years since, but the precision of LLM output now makes it harder to see why it matters. Here is an example.
Consider the simple question “Is James angry?” It’s a very reasonable question to ask before you try to have a delicate conversation with James; if he’s angry, you might wait for a better time. But in programming terms, it’s a question that cannot be answered deterministically without a lot of additional context. Who is James? What usually makes him angry? How does that manifest in his behavior?
If I had to implement this question in a programming language, I’d need to define the entity “James” and the state of “anger”. In doing so, I would map very specific attributes to both, and my program would manage those attributes over time, so that at any point in the program’s execution, the entity James would be fully known and its state of “anger” could be calculated precisely.
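Here is a deliberately simplistic sketch of that discipline. Every attribute and threshold below is an arbitrary choice of mine; the point is that the language forces each choice to be made explicitly:

```python
from dataclasses import dataclass

@dataclass
class Person:
    """A fully specified entity: everything the program knows about James."""
    name: str
    hours_slept: float
    unresolved_frustrations: int

def is_angry(person: Person) -> bool:
    # "Angry" is now a precise, inspectable definition, not an assumption.
    return person.hours_slept < 5 or person.unresolved_frustrations > 2

james = Person(name="James", hours_slept=4.5, unresolved_frustrations=1)
print(is_angry(james))  # True, and we can point to exactly why
```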
A natural language does not impose such discipline. I can simply tell an AI agent to perform a task on the condition that “James is angry”, and the LLM will assume what “angry” means in this context. The inherent subjectivity of natural languages makes it difficult for me to be precise in my description of what makes James angry. He may be angry when he’s tired, when he’s frustrated, when he’s impatient. Natural languages have extensive vocabularies, and each adjective I use to try to explain James’s state of anger invites further context and definition.
For programming, having a limited vocabulary is a feature, not a bug. It forces the programmer to either simplify their instructions, or to build complex instructions carefully from simple building blocks.
The one-shot fallacy
In numbers, the distinction is between precision and accuracy.
In writing, it is between fluency and fidelity. An essay about the French Revolution can be fluent and wrong. Confident sentences. Clean structure. The right vocabulary. None of it tells you whether the essay is faithful to what actually happened in 1789. Fluency is a property of the surface. Fidelity is a property of the relationship between the surface and the thing it claims to describe.
In programming, the distinction is between code that runs and code that’s right. Code can run cleanly and solve the wrong problem. Cleanliness is a property of the syntax. Rightness is a property of the relationship between the code and the problem it claims to address.
LLMs are fluent in the same natural languages we are. That fluency makes their output feel correct, but sounding correct is not the same as being useful, and the gap between the two has to be closed deliberately.
Pi has a true value. Approximations can be judged against it. The assumption with LLMs is the same: better models get closer to the right answer. But that assumption breaks, because the problems we try to solve with programming have no absolute true value. The right answer has to be constructed. That construction is the disambiguation process. It asks questions, surfaces assumptions, and builds context until the problem is defined well enough to solve.
A one-shot prompt skips the construction step. It doesn’t approximate a known target. There is no known target. It generates a target and an output simultaneously, fluently, and with no mechanism to check whether either is faithful to anything.
What’s worse, the fluency masks the absence of the process. The better the model, the better the fluency and the better the illusion.