A clean question is a luxury
Most benchmark prompts are polite.
They hand the model the relevant information. They ask a bounded question. They usually do not include an old export, a renamed duplicate, a revised attachment, a protected source folder, and a final JSON file that must be written in exactly one place.
Real work is less polite.
In a document workflow, the model has to decide what counts. The old invoice may be a trap. The new one may use the same invoice number with a revision suffix. A bank payment may reference the old number but belong to the revised invoice. A quote may look like an invoice. A note from accounting may override the obvious-looking file.
That is not hard reasoning in the glamorous sense. It is worse: small, local, stateful, easy to get almost right.
Almost right is expensive
This is where chat demos flatter models.
If a model writes a plausible explanation, the user may feel progress. If it names the right vendor and quotes the right total, it looks as if the job is mostly done.
But in a workflow, mostly done is often still unpaid work.
If the JSON is in the wrong directory, the job failed. If the evidence path points to a file that does not exist, the job failed. If the model used the old attachment after the thread said to ignore it, the job failed. If it preserved the story but skipped the artifact, the job failed.
That sounds harsh only if the benchmark is treated like a conversation. It is not. It is closer to a small office task with a receipt at the end.
Public benchmarks become training targets
There is also the uncomfortable part.
Public benchmarks do not just measure the field. They shape it.
Once a score matters, teams have every reason to improve against that score. That can be honest and useful. Better data. Better post-training. Better synthetic tasks. Better eval discipline.
It can also make models very good at the kind of question everyone keeps asking.
That does not require some dramatic cheating story. The incentives are enough. If the same task style becomes famous, the ecosystem will learn toward it.
So a mainstream benchmark score is not a neutral weather report. It is partly weather, partly training signal, partly product marketing, and partly collective curriculum.
Useful, yes. Sufficient, no.
The local-model promise raises the bar
Local models are not just sold as chat toys.
They are sold as private assistants: run it on your machine, keep your documents local, automate your own boring work.
That is exactly the right direction. It is also exactly why the evaluation has to be meaner.
If a local model is supposed to help with private paperwork, then the test should include private-work-shaped problems: messy inputs, file choice, stale context, exact outputs, and source files that must not be modified.
The model should not get credit for sounding like it did the work. It should get credit when the files are there.
What practical benchmarks should punish
I want benchmarks that punish the stuff users actually suffer from: choosing the stale file, inventing an evidence path, writing prose when JSON was requested, changing protected inputs, missing the final checksum, and getting the main idea right while the artifact is wrong.
That last one matters most.
Modern models are often good at the main idea. The weak point is closure.
The boring end of the task is where usefulness lives.
Practical readout
Mainstream benchmarks are not useless.
They are just not the same thing as work.
They can show that a model is benchmark-smart. They cannot, by themselves, show that it is work-ready.
For that, the model has to survive the folder.