How to benchmark local LLMs for private documents

A useful local LLM benchmark should test more than chat quality. For private document work, the hard parts are messy inputs, source selection, structured artifacts, and exact workflow closure, all checked against hidden oracles.

[Infographic: Local LLM benchmarking guide. Method: practical score. Primary: resolved. Near miss: core pass. Separate: visual sample.]

Most local LLM benchmarks answer a narrow question: how fast does a model run, or how well does it answer a clean prompt? Those are useful questions, but they miss the reason many people want local models in the first place.

The practical local use case is private work: invoices, folders, screenshots, exported CSVs, support notes, draft files, and documents you do not want to send to a cloud API. A benchmark for that world has to look more like work and less like a trivia quiz.

Start with the job, not the model

A local benchmark should begin with a task a person would actually delegate. For Local Model Bench, that task is intentionally boring: inspect synthetic paperwork, select the right source files, reconcile invoice details against bank and vendor data, and produce a final audit JSON.

That job is narrower than general intelligence and broader than one prompt. It includes reading documents, following instructions, preserving input files, creating intermediate artifacts, and finishing the exact output contract.

Use synthetic data

Private-document benchmarks should not use real private documents. The clean solution is synthetic data with realistic structure: invoice scans, payment exports, vendor records, purchase orders, email notes, duplicate files, and old drafts. The data can be public, repeatable, and inspectable without leaking anyone's actual finances or workplace material.

Synthetic does not have to mean easy. A case can include withdrawn invoices, revised attachments, stale bank exports, inactive vendors, short payments, or a note that says which file is final. The important part is that the expected answer is knowable and objective.
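As a minimal sketch of what "knowable and objective" can mean in practice, the generator below writes a tiny case with a stale draft, a final invoice, and a bank export, and returns the expected answer alongside it. The file names, field names, and folder layout are hypothetical illustrations, not the actual Local Model Bench case format.

```python
import csv
import random
from pathlib import Path


def make_case(root: Path, seed: int) -> dict:
    """Write one synthetic case under `root` and return its expected answer."""
    rng = random.Random(seed)  # seeded so the same case can be regenerated
    incoming = root / "incoming"
    incoming.mkdir(parents=True)

    invoice_id = f"INV-{rng.randrange(1000, 10000)}"
    final_cents = rng.randrange(10_000, 500_000)  # integer cents avoid float rounding
    draft_cents = final_cents + rng.randrange(100, 5_000)  # draft total is always wrong

    # A stale draft the model is expected to ignore.
    (incoming / f"{invoice_id}_draft.txt").write_text(
        f"Invoice {invoice_id}\nStatus: DRAFT\nTotal cents: {draft_cents}\n",
        encoding="utf-8",
    )
    # The final invoice the model is expected to select.
    (incoming / f"{invoice_id}_final.txt").write_text(
        f"Invoice {invoice_id}\nStatus: FINAL\nTotal cents: {final_cents}\n",
        encoding="utf-8",
    )
    # A bank export that agrees with the final invoice only.
    with open(incoming / "bank_export.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["reference", "amount_cents"])
        writer.writerow([invoice_id, final_cents])

    # The expected answer is kept out of the folder the model can read.
    return {
        "invoice_id": invoice_id,
        "total_cents": final_cents,
        "source_file": f"{invoice_id}_final.txt",
    }
```

Because the generator is seeded, calling `make_case` twice with the same seed yields the same files and the same expected answer, which is what makes the case repeatable and inspectable.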

Separate visible checks from hidden oracles

Visible checks catch the obvious problems: did the model create `audit_result.json`, is the JSON parseable, are required keys present, did it create `proof.txt`, did it leave the incoming folder unchanged? These checks are useful during development because models and runners fail in very ordinary ways.
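A rough sketch of such visible checks follows. The file names `audit_result.json` and `proof.txt` come from the text above; the required key names are assumptions made for illustration, since the real output contract is not spelled out here. The protected-folder check is left out of this sketch.

```python
import json
from pathlib import Path

# Hypothetical contract: the real required keys are not listed in this guide.
REQUIRED_KEYS = {"invoices", "total_cents", "warnings"}


def visible_checks(workdir: Path) -> dict[str, bool]:
    """Run the cheap structural checks that need no knowledge of the answer."""
    audit_path = workdir / "audit_result.json"
    results = {
        "audit_exists": audit_path.is_file(),
        "json_parses": False,
        "required_keys": False,
        "proof_exists": (workdir / "proof.txt").is_file(),
    }
    if not results["audit_exists"]:
        return results
    try:
        data = json.loads(audit_path.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError):
        return results  # file exists but is not usable JSON
    results["json_parses"] = True
    # Only a JSON object can carry the required keys.
    results["required_keys"] = isinstance(data, dict) and REQUIRED_KEYS.issubset(data)
    return results
```

Returning one boolean per check, rather than a single pass/fail, keeps ordinary runner failures easy to tell apart during development.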

Hidden oracles catch the actual task logic. They decide whether the model selected the corrected invoice, ignored the draft, mapped the payment to the revised document, computed the right total, and cited acceptable evidence. Without hidden oracles, a benchmark rewards neat-looking answers rather than correct outcomes.

  • Visible checks: file exists, JSON parses, required fields present
  • Hidden oracles: correct invoice set, warnings, totals, evidence, proof code
  • Failure types: diagnostic labels only; they do not add to the score
  • Run outputs: public enough for inspection, without exposing ground truth
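The hidden-oracle side of the list above could be sketched as a comparison between the model's parsed answer and a ground-truth record that never ships with the prompt. Every field name and failure label here is a hypothetical stand-in, not the project's real schema.

```python
def _as_set(value) -> set:
    """Treat anything that is not a list of strings as an empty set."""
    if isinstance(value, list) and all(isinstance(v, str) for v in value):
        return set(value)
    return set()


def oracle_findings(answer: dict, expected: dict) -> list[str]:
    """Return failure labels; an empty list means the oracle is satisfied."""
    findings = []
    # The selected invoices must match exactly: no drafts, no omissions.
    if _as_set(answer.get("invoices")) != set(expected["invoices"]):
        findings.append("wrong_invoice_set")
    if answer.get("total_cents") != expected["total_cents"]:
        findings.append("wrong_total")
    # Extra warnings are tolerated here; missing ones are not.
    if not set(expected["warnings"]).issubset(_as_set(answer.get("warnings"))):
        findings.append("missing_warning")
    # At least one cited file must be on the acceptable-evidence list.
    if not _as_set(answer.get("evidence")) & set(expected["acceptable_evidence"]):
        findings.append("no_acceptable_evidence")
    if answer.get("proof_code") != expected["proof_code"]:
        findings.append("wrong_proof_code")
    return findings
```

The labels serve as diagnostics. Whether a case counts as passed depends only on whether the list is empty, so adding more labels does not change any score.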

Score resolved and core separately

The most important scoring split is resolved versus core. Resolved means the whole case passed: final artifact, hidden oracle, proof code, evidence, protected-folder rules, and required intermediate files. Core pass means the central audit facts were right even if exact closure failed.

That split prevents two bad interpretations. It avoids calling every near miss a total failure. It also avoids giving a model a clean win when it understood the case but produced an unusable final artifact.
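One way to encode the split, as a sketch under assumptions: each case result carries one flag for the central audit facts and one flag per closure requirement. The field names and the three labels are illustrative, and `core_pass` is treated here as the near-miss bucket (core facts right, closure failed) rather than a superset that also contains resolved cases.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    core_facts_ok: bool        # central audit facts matched the hidden oracle
    artifact_ok: bool          # final artifact meets the exact output contract
    proof_ok: bool             # proof code is correct
    evidence_ok: bool          # cited evidence is acceptable
    protected_unchanged: bool  # protected input folders were left untouched
    intermediates_ok: bool     # required intermediate files were produced


def classify(result: CaseResult) -> str:
    """Label one case as 'resolved', 'core_pass', or 'fail'."""
    closure = all([
        result.artifact_ok,
        result.proof_ok,
        result.evidence_ok,
        result.protected_unchanged,
        result.intermediates_ok,
    ])
    if result.core_facts_ok and closure:
        return "resolved"
    if result.core_facts_ok:
        return "core_pass"  # understood the case, did not close it exactly
    return "fail"


def summarize(results: list[CaseResult]) -> dict[str, int]:
    """Count cases per label so both numbers can be reported side by side."""
    counts = Counter(classify(r) for r in results)
    return {label: counts[label] for label in ("resolved", "core_pass", "fail")}
```

Reporting the two counts side by side keeps a near miss visible without letting it pass as a clean win.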

Make the model handle files

A local work benchmark should not only ask for a chat answer. Real private work involves files. The model should inspect a folder, identify active sources, ignore drafts, write normalized manifests, and keep protected input folders unchanged.
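The "keep protected input folders unchanged" rule can be checked mechanically by hashing the folder before and after a run. This is a minimal sketch, assuming the protected folder is small enough to read fully into memory; the function names are made up for illustration.

```python
import hashlib
from pathlib import Path


def snapshot(folder: Path) -> dict[str, str]:
    """Map each file's relative path to the SHA-256 digest of its bytes."""
    return {
        p.relative_to(folder).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def changed_paths(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """List files that were added, removed, or modified between two snapshots."""
    return sorted(
        path
        for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )
```

A runner would take one snapshot before handing the folder to the model and another after the run ends; an empty `changed_paths` result means the protected-folder rule held. Hashing content rather than comparing timestamps means a file that was rewritten with identical bytes does not count as a change.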

This is where agentic tasks matter. A model can be good at one-shot extraction and still weak at multi-step workflow closure. The folder task tests whether the model can move from reading to organizing to final output.

Keep text-only as a diagnostic mode

Image-based document benchmarks are realistic, but they mix two problems: vision/OCR and reasoning. A text-only companion benchmark removes raw image input and gives the model normalized extracts. If a model improves sharply on the text-only version, the issue was probably document reading. If it still fails, the issue is workflow logic or output discipline.
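The reasoning above amounts to a small per-case decision rule. The sketch below assumes each case was run once with images and once text-only, and that each run reduces to a single pass/fail flag; the labels are illustrative and the rule is a heuristic, not a proof of cause.

```python
from collections import Counter


def diagnose(image_passed: bool, text_passed: bool) -> str:
    """Guess where a failure came from by comparing the two modes."""
    if image_passed:
        return "passed_with_images"  # nothing to diagnose for this case
    if text_passed:
        return "likely_document_reading"  # clean extracts fixed it
    return "likely_workflow_or_output"  # fails even with clean extracts


def tally(results: dict[str, tuple[bool, bool]]) -> Counter:
    """Count diagnoses over cases keyed by case id: (image_passed, text_passed)."""
    return Counter(diagnose(image, text) for image, text in results.values())
```

A model whose failures land mostly in `likely_document_reading` is probably limited by vision or OCR, while one dominated by `likely_workflow_or_output` would likely fail the same cases however well it read the scans.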

That diagnostic split also gives text-only local models a fair place in the project. They should not be scored as image readers, but they can still be scored as bookkeeping and structured-output workers.

What to publish

A useful public benchmark should show more than a score. It should show the prompt, the model answer, generated artifacts, visible checks, failure labels, model/runtime settings, hardware, and enough case material for readers to understand the task. It should not publish hidden solutions directly beside the prompt.

This is the balance Local Model Bench is aiming for: transparent enough to inspect, strict enough to be meaningful, and boring enough that passing the benchmark feels like real work rather than benchmark theater.

Model Context

  • Topic: Benchmark design
  • Use case: Private documents and local workflows
  • Data: Synthetic only
  • Primary metric: Resolved practical cases
  • Diagnostic modes: Text-only, SVG sample, failure taxonomy

Positioned As

  • This is a practical benchmark design guide, not a universal claim about model intelligence.
  • The goal is to test whether local models can produce useful artifacts from messy private-style inputs.
  • The method favors source selection, exact output contracts, hidden oracles, and visible run artifacts over vague model impressions.

What We Actually Tested

  • Local Model Bench currently uses generated invoice cases, agentic paperwork workflows, text-only diagnostics, and a separate City Plan SVG sample.
  • Local LM Studio runs are marked separately from API/reference rows.
  • The current local hardware baseline is a Mac mini M4 with 64 GB unified memory.
  • Scores should be read as practical capability signals, not universal rankings.

What Worked

  • Explains why the benchmark uses boring private-work tasks.
  • Separates image reading, reasoning, and final workflow closure.
  • Makes hidden oracles and public artifacts part of the methodology.

Where It Broke

  • A small synthetic suite is not a universal model ranking.
  • Manual review still matters for interpreting near misses.
  • More case families will be needed before making broad claims about local agents.

Readout

A good local LLM benchmark should ask whether the model can finish a real-ish job, not whether it can sound smart. For private documents, the job is files, evidence, warnings, exact JSON, and proof. That is the bar.