About

Small benchmark. Annoying work.

Local Model Bench tests whether models can handle the kind of private desktop work people actually want to keep local: synthetic invoices, messy folders, revised attachments, protected source files, and exact final artifacts.

Why local?

Paperwork is a natural use case for local models because the input often contains private data. The benchmark uses fully synthetic documents, but the workflow is modeled after real tasks: match records, pick the right file version, produce evidence, and avoid touching protected folders.
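As one illustration, here is a minimal sketch of what the "avoid touching protected folders" check could look like. The paths, snapshot format, and function names are hypothetical, not the benchmark's actual harness: the idea is simply to fingerprint a protected folder before the run and verify that nothing was added, removed, or modified.

```python
import hashlib
from pathlib import Path

# Hypothetical sketch; the benchmark's real harness is not shown on this page.
# Idea: snapshot a protected folder before the run, compare after the run.

def snapshot(folder: Path) -> dict[str, str]:
    """Map each file's relative path to a SHA-256 digest of its contents."""
    return {
        str(p.relative_to(folder)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }

def protected_untouched(before: dict[str, str], after: dict[str, str]) -> bool:
    """True only if no file was added, removed, or modified."""
    return before == after

# Usage (hypothetical case layout):
# before = snapshot(Path("case_01/protected"))
# ... run the model's workflow ...
# after = snapshot(Path("case_01/protected"))
# assert protected_untouched(before, after)
```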

What is scored?

The main score combines resolved cases and core passes. A case counts as resolved when the final result, the hidden oracle check, the proof code, and the workflow artifacts all pass. A core pass means the central audit facts are correct even when the closure is imperfect.
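For concreteness, a sketch of how such a score might be computed. The field names and the weights are assumptions for illustration; this page only says the score "combines" the two rates, not how.

```python
from dataclasses import dataclass

@dataclass
class CaseResult:
    # Illustrative fields, not the benchmark's actual schema.
    final_result_ok: bool   # exact final artifact matches
    oracle_ok: bool         # hidden answer key check passes
    proof_ok: bool          # proof code passes
    artifacts_ok: bool      # workflow artifacts pass
    core_facts_ok: bool     # central audit facts are right

    @property
    def resolved(self) -> bool:
        # Resolved: every closure check passes.
        return all([self.final_result_ok, self.oracle_ok,
                    self.proof_ok, self.artifacts_ok])

    @property
    def core_pass(self) -> bool:
        # Core pass: audit facts are right even if closure is imperfect.
        return self.core_facts_ok

def main_score(cases: list[CaseResult],
               w_resolved: float = 0.7, w_core: float = 0.3) -> float:
    # Hypothetical weighted blend of the two rates.
    n = len(cases)
    resolved_rate = sum(c.resolved for c in cases) / n
    core_rate = sum(c.core_pass for c in cases) / n
    return w_resolved * resolved_rate + w_core * core_rate
```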

What this is not

This is not a general intelligence ranking, a tax or financial tool, or a claim that one model is universally better. It is a practical signal for local-document workflows under reproducible prompts and visible outputs.

Current setup

Local LM Studio runs are executed on a Mac mini M4 with 64 GB unified memory. Reference/API rows are marked separately. The public site shows complete comparison runs and keeps incomplete experiments out of the main table.

How to read the site

Start with the leaderboard, then open a model note when a score needs context. The case pages show the synthetic input files without revealing the hidden answer key. Run pages show model outputs, checks, and failure types so the result can be inspected instead of just trusted.