gpt-oss-20b

2026-05-19T18-20-15Z_openai_gpt-oss-20b_paperwork_v3_text_extracts

30% Practical score
0/5 Resolved tasks
3/5 Core tasks

Verdict

practical run result
near miss

Core 3/5 · Visible 4/5 · Hidden 0/5

duplicate_risk_missed, invoice_classification_error, invoice_id_format_error, missing_or_wrong_evidence, proof_code_error
Prompt set v3_paperwork_text_extracts.json
Run type local
Temperature 0
Top P 1
Missed tasks 5/5

Task Results

Task · Category · Score · Failed Checks · Failure Types
generated_invoice_case_01_text · paperwork_text · 3/4 · hidden_oracle_pass · missing_or_wrong_evidence, proof_code_error

Prompt

TEXT-ONLY PAPERWORK VARIANT
Use `document_extracts.csv` as the normalized OCR/document reading.
No image files are attached in this run.
For the `evidence` field, cite the original `source_path` values from `document_extracts.csv` plus the CSV context files you used.
Do not cite `document_extracts.csv` itself as evidence.
Return final JSON only. No Markdown. No explanation.


## README_TASK.md
# Paperwork Trial v3 Calibration Case 01

This is synthetic benchmark bookkeeping. It is not financial, tax, or legal advice.

Audit the scanned document folder and write `audit_result.json` only.

Case ID: `P3-GEN-01`

Output keys exactly:

`case_id, approved_invoice_ids, review_invoice_ids, reject_invoice_ids, ignored_document_ids, total_approved_gross_cents, warnings_by_invoice, evidence, proof_code`

Rules:

- Treat the files in `scans/` as the source documents.
- Ignore documents that are explicitly not invoices.
- Match vendors by tax ID first, then by exact vendor name.
- `payment_match` requires paid bank rows for the invoice whose summed amount equals the invoice gross.
- `payment_short` applies when the paid bank amount is lower than invoice gross.
- `under_review_stamp` applies when the scanned invoice visibly has an under-review stamp.
- Approved invoices have no warnings and paid amount equals gross.
- Review invoices have warnings but are not reject-level. `payment_short` and `under_review_stamp` are review-level warnings, not reject-level warnings.
- Reject only if the invoice is impossible to process, uses an inactive vendor, or lacks a required PO.
- Sort all invoice-id arrays ascending.
- Allowed warning codes are exactly: `payment_short`, `under_review_stamp`.
- `warnings_by_invoice` must include every real invoice ID and sorted warning arrays.
- Warning arrays must be flat arrays of lowercase strings, never nested arrays and never prose labels.
- `ignored_document_ids` must include visible document IDs from ignored non-invoice documents, not filenames. Example: use `QT-6400`, not `quote_6400_not_invoice.png`.
- `total_approved_gross_cents` is the sum of approved invoice gross totals only.
- `evidence` must list the relative source file paths used in stable alphabetical order, including folder prefixes such as `scans/`.
- Include the relevant CSV files and every scanned document inspected in `evidence`, including ignored non-invoice scans.
- `proof_code = total_approved_gross_cents + sum(numeric parts of all real invoice IDs) + 97 * total_warning_count`.

Important: `Northwind Office Supply` is the customer, not the vendor.



## document_extracts.csv
source_path,document_id,document_type,vendor_name,tax_id,po_id,net_cents,tax_cents,gross_total_cents,visible_marks,notes
scans/inv_7801_paid.png,INV-7801,invoice,BluePeak Distributors,BP-4471,PO-4510,,,18737,paid stamp,Customer shown as Northwind Office Supply
scans/inv_7802_short_paid.png,INV-7802,invoice,StationHub LLC,SH-9982,PO-4510-B,,,5230,"received stamp; handwritten: Short paid $2.30",Customer shown as Northwind Office Supply
scans/inv_8422_under_review.png,INV-8422,invoice,Orion Field Services,OR-1109,PO-8422,,,42245,UNDER REVIEW stamp,Customer shown as Northwind Office Supply
scans/quote_6400_not_invoice.png,QT-6400,quote,Delta Repair Services,,,"",,0,NOT AN INVOICE DO NOT PAY,Estimate only; not an invoice payment request



## bank_export.csv
date,description,invoice_id,amount_cents,status
2026-04-23,BluePeak Distributors,INV-7801,18737,paid
2026-04-24,StationHub LLC,INV-7802,5000,paid
2026-04-27,Orion Field Services,INV-8422,42245,paid



## vendor_master.csv
vendor_id,name,tax_id,status
V-BP4471,BluePeak Distributors,BP-4471,active
V-SH9982,StationHub LLC,SH-9982,active
V-OR1109,Orion Field Services,OR-1109,active
V-NW001,Northwind Office Supply,NW-CUSTOMER,customer



## purchase_orders.csv
po_id,vendor_id,limit_cents,status
PO-4510,V-BP4471,20000,open
PO-4510-B,V-SH9982,6000,open
PO-8422,V-OR1109,50000,open



Return `audit_result.json` as a single JSON object only. /no_think

Workspace files available to the model

README_TASK.md
document_extracts.csv
bank_export.csv
vendor_master.csv
purchase_orders.csv

(File contents are identical to the sections inlined in the prompt above.)

Agent actions

  

Model answer

{"case_id":"P3-GEN-01","approved_invoice_ids":["INV-7801"],"review_invoice_ids":["INV-7802","INV-8422"],"reject_invoice_ids":[],"ignored_document_ids":["QT-6400"],"total_approved_gross_cents":18737,"warnings_by_invoice":{"INV-7801":[],"INV-7802":["payment_short"],"INV-8422":["under_review_stamp"]},"evidence":["scans/quote_6400_not_invoice.png","scans/inv_7801_paid.png","scans/inv_7802_short_paid.png","scans/inv_8422_under_review.png","bank_export.csv","document_extracts.csv","purchase_orders.csv","vendor_master.csv"],"proof_code":42996}
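
The `proof_code_error` failure type listed for this task can be checked by hand against the formula in README_TASK.md. The sketch below is not the benchmark's own checker: `proof_code` is a hypothetical helper, and it assumes the "numeric part" of an ID such as `INV-7801` is its first run of digits.

```python
import re

def proof_code(total_approved_gross_cents, warnings_by_invoice):
    """Recompute proof_code from the README formula (hypothetical helper).

    Assumption: warnings_by_invoice is keyed by every real invoice ID,
    as the README requires, and the numeric part of an ID is its first
    run of digits.
    """
    id_sum = sum(int(re.search(r"\d+", inv).group()) for inv in warnings_by_invoice)
    warning_count = sum(len(codes) for codes in warnings_by_invoice.values())
    return total_approved_gross_cents + id_sum + 97 * warning_count

# Values copied from the model answer for P3-GEN-01.
answer = {
    "total_approved_gross_cents": 18737,
    "warnings_by_invoice": {
        "INV-7801": [],
        "INV-7802": ["payment_short"],
        "INV-8422": ["under_review_stamp"],
    },
    "proof_code": 42996,
}

expected = proof_code(answer["total_approved_gross_cents"], answer["warnings_by_invoice"])
print(expected, answer["proof_code"])  # 42956 42996
```

Under those assumptions the recomputed value is 18737 + 24025 + 2 × 97 = 42956, which is 40 less than the 42996 in the answer, even though the classifications and warnings feeding the formula match the task data.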
generated_invoice_case_02_text · paperwork_text · 3/4 · hidden_oracle_pass · missing_or_wrong_evidence

Prompt

TEXT-ONLY PAPERWORK VARIANT
Use `document_extracts.csv` as the normalized OCR/document reading.
No image files are attached in this run.
For the `evidence` field, cite the original `source_path` values from `document_extracts.csv` plus the CSV context files you used.
Do not cite `document_extracts.csv` itself as evidence.
Return final JSON only. No Markdown. No explanation.


## README_TASK.md
# Paperwork Trial v3 Calibration Case 02

This is synthetic benchmark bookkeeping. It is not financial, tax, or legal advice.

Audit the scanned document folder and write `audit_result.json` only.

Case ID: `P3-GEN-02`

Output keys exactly:

`case_id, approved_invoice_ids, review_invoice_ids, reject_invoice_ids, ignored_document_ids, total_approved_gross_cents, warnings_by_invoice, evidence, proof_code`

Rules:

- Treat the files in `scans/` as the source documents.
- Ignore documents that are explicitly credit notes or not invoice payment requests.
- Match vendors by visible vendor name and tax ID when available.
- `payment_match` requires paid bank rows for the invoice whose summed amount equals the invoice gross.
- `payment_short` applies when the paid bank amount is lower than invoice gross.
- `missing_po` applies when the scanned invoice visibly has no valid PO number or says `MISSING PO`.
- `inactive_vendor` applies when the scanned invoice visibly has a vendor-hold/inactive-vendor stamp or vendor records mark the vendor inactive.
- Approved invoices have no warnings and paid amount equals gross.
- Review invoices have warnings but are not reject-level. `payment_short` is review-level.
- Reject invoices with `missing_po` or `inactive_vendor`.
- Sort all invoice-id arrays ascending.
- Allowed warning codes are exactly: `inactive_vendor`, `missing_po`, `payment_short`.
- `warnings_by_invoice` must include every real invoice ID and sorted warning arrays.
- Warning arrays must be flat arrays of lowercase strings, never nested arrays and never prose labels.
- `ignored_document_ids` must include visible document IDs from ignored non-invoice documents, not filenames. Example: use `CN-10032`, not `credit_note_10032_credit_applied.png`.
- `total_approved_gross_cents` is the sum of approved invoice gross totals only.
- `evidence` must list the relative source file paths used in stable alphabetical order, including folder prefixes such as `scans/`.
- Include the relevant CSV files and every scanned document inspected in `evidence`, including ignored credit-note scans.
- `proof_code = total_approved_gross_cents + sum(numeric parts of all real invoice IDs) + 97 * total_warning_count`.

Important: `Northwind Office Supply` is the customer, not the vendor.



## document_extracts.csv
source_path,document_id,document_type,vendor_name,tax_id,po_id,net_cents,tax_cents,gross_total_cents,visible_marks,notes
scans/credit_note_10032_credit_applied.png,CN-10032,credit_note,BrightPath Office Solutions,BP-9200,PO-4488,,,-3725,CREDIT APPLIED,Original invoice INV-82210; credit note is not an invoice payment request
scans/inv_82415_paid.png,INV-82415,invoice,BrightPath Office Solutions,BP-9200,PO-4510,,,18737,paid stamp,Customer shown as Northwind Office Supply
scans/inv_82478_partial_payment.png,INV-82478,invoice,BrightPath Office Solutions,BP-9200,PO-4577,,,14144,"received stamp; handwritten: Partial payment of $100.00 received 5/02. Balance due: $41.44",Customer shown as Northwind Office Supply
scans/inv_82533_vendor_hold.png,INV-82533,invoice,BrightPath Office Solutions,BP-9200,MISSING PO,,,23794,"VENDOR HOLD; INACTIVE VENDOR; handwritten: Please provide PO or approval before processing",Customer shown as Northwind Office Supply



## bank_export.csv
date,description,invoice_id,amount_cents,status
2026-04-23,BrightPath Office Solutions,INV-82415,18737,paid
2026-05-02,BrightPath Office Solutions,INV-82478,10000,paid
2026-05-05,BrightPath Office Solutions,INV-82533,23794,pending



## vendor_master.csv
vendor_id,name,tax_id,status
V-BP9200,BrightPath Office Solutions,BP-9200,active
V-NW001,Northwind Office Supply,NW-CUSTOMER,customer



## purchase_orders.csv
po_id,vendor_id,limit_cents,status
PO-4510,V-BP9200,20000,open
PO-4577,V-BP9200,16000,open
PO-4488,V-BP9200,5000,closed



Return `audit_result.json` as a single JSON object only. /no_think

Workspace files available to the model

README_TASK.md
document_extracts.csv
bank_export.csv
vendor_master.csv
purchase_orders.csv

(File contents are identical to the sections inlined in the prompt above.)

Agent actions

  

Model answer

{"case_id":"P3-GEN-02","approved_invoice_ids":["INV-82415"],"review_invoice_ids":["INV-82478"],"reject_invoice_ids":["INV-82533"],"ignored_document_ids":["CN-10032"],"total_approved_gross_cents":18737,"warnings_by_invoice":{"INV-82415":[],"INV-82478":["payment_short"],"INV-82533":["inactive_vendor","missing_po"]},"evidence":["bank_export.csv","document_extracts.csv","purchase_orders.csv","scans/credit_note_10032_credit_applied.png","scans/inv_82415_paid.png","scans/inv_82478_partial_payment.png","scans/inv_82533_vendor_hold.png","vendor_master.csv"],"proof_code":266454}
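
The only failure type on this task is `missing_or_wrong_evidence`. The prompt says not to cite `document_extracts.csv` and to list evidence in alphabetical order, so a minimal check along those lines is sketched below. `evidence_problems` is a hypothetical helper, "alphabetical" is assumed to mean Python's default string ordering, and the real checker may apply other rules.

```python
def evidence_problems(evidence, forbidden=("document_extracts.csv",)):
    """Return rule violations found in an `evidence` array (hypothetical helper)."""
    problems = [f"cites {path}" for path in forbidden if path in evidence]
    # Assumption: "stable alphabetical order" means plain sorted() order.
    if list(evidence) != sorted(evidence):
        problems.append("not in alphabetical order")
    return problems

# Evidence arrays copied from the model answers for P3-GEN-01 and P3-GEN-02.
case_01 = [
    "scans/quote_6400_not_invoice.png", "scans/inv_7801_paid.png",
    "scans/inv_7802_short_paid.png", "scans/inv_8422_under_review.png",
    "bank_export.csv", "document_extracts.csv",
    "purchase_orders.csv", "vendor_master.csv",
]
case_02 = [
    "bank_export.csv", "document_extracts.csv", "purchase_orders.csv",
    "scans/credit_note_10032_credit_applied.png", "scans/inv_82415_paid.png",
    "scans/inv_82478_partial_payment.png", "scans/inv_82533_vendor_hold.png",
    "vendor_master.csv",
]

print(evidence_problems(case_01))
print(evidence_problems(case_02))
```

Both answers cite `document_extracts.csv` despite the instruction. The case 02 list is otherwise in sorted order, while the case 01 list is also out of order.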
generated_invoice_case_03_text · paperwork_text · 1/4 · visible_checks_pass, core_oracle_pass, hidden_oracle_pass · duplicate_risk_missed, invoice_classification_error, invoice_id_format_error, missing_or_wrong_evidence, warning_code_error

Prompt

TEXT-ONLY PAPERWORK VARIANT
Use `document_extracts.csv` as the normalized OCR/document reading.
No image files are attached in this run.
For the `evidence` field, cite the original `source_path` values from `document_extracts.csv` plus the CSV context files you used.
Do not cite `document_extracts.csv` itself as evidence.
Return final JSON only. No Markdown. No explanation.


## README_TASK.md
# Paperwork Trial v3 Calibration Case 03

This is synthetic benchmark bookkeeping. It is not financial, tax, or legal advice.

Audit the scanned document folder and write `audit_result.json` only.

Case ID: `P3-GEN-03`

Output keys exactly:

`case_id, approved_invoice_ids, review_invoice_ids, reject_invoice_ids, ignored_document_ids, total_approved_gross_cents, warnings_by_invoice, evidence, proof_code`

Rules:

- Treat the files in `scans/` as the source documents.
- Ignore documents that are explicitly quotes, credit notes, or not invoice payment requests.
- Match vendors by visible vendor name and tax ID when available.
- `payment_match` requires paid bank rows for the invoice whose summed amount equals the invoice gross.
- `duplicate_risk` applies when `previous_invoices.csv` has the same vendor_id and same gross amount as the current invoice.
- `payment_short` applies when the paid bank amount is lower than invoice gross.
- `under_review_stamp` applies when the scanned invoice visibly has an under-review stamp.
- `missing_po` applies when the scanned invoice visibly has no valid PO number or says `MISSING PO`.
- `inactive_vendor` applies when the scanned invoice visibly has a vendor-hold/inactive-vendor stamp or vendor records mark the vendor inactive.
- Approved invoices have no warnings and paid amount equals gross.
- Review invoices have warnings but are not reject-level. `duplicate_risk`, `payment_short`, and `under_review_stamp` are review-level.
- Reject invoices with `missing_po` or `inactive_vendor`.
- Sort all invoice-id arrays ascending.
- Allowed warning codes are exactly: `duplicate_risk`, `inactive_vendor`, `missing_po`, `payment_short`, `under_review_stamp`.
- `warnings_by_invoice` must include every real invoice ID and sorted warning arrays.
- Warning arrays must be flat arrays of lowercase strings, never nested arrays and never prose labels.
- `ignored_document_ids` must include visible document IDs from ignored non-invoice documents, not filenames.
- `total_approved_gross_cents` is the sum of approved invoice gross totals only.
- `evidence` must list the relative source file paths used in stable alphabetical order, including folder prefixes such as `scans/`.
- Include the relevant CSV files and every scanned document inspected in `evidence`, including `previous_invoices.csv` and ignored quote and credit-note scans.
- `proof_code = total_approved_gross_cents + sum(numeric parts of all real invoice IDs) + 97 * total_warning_count`.

Important: `Northwind Office Supply` is the customer, not the vendor.



## document_extracts.csv
source_path,document_id,document_type,vendor_name,tax_id,po_id,net_cents,tax_cents,gross_total_cents,visible_marks,notes
scans/case01_inv_7801_paid.png,INV-7801,invoice,BluePeak Distributors,BP-4471,PO-4510,,,18737,paid stamp,Customer shown as Northwind Office Supply
scans/case01_inv_7802_short_paid.png,INV-7802,invoice,StationHub LLC,SH-9982,PO-4510-B,,,5230,"received stamp; handwritten short-paid note",Customer shown as Northwind Office Supply
scans/case01_inv_8422_under_review.png,INV-8422,invoice,Orion Field Services,OR-1109,PO-8422,,,42245,UNDER REVIEW stamp,Customer shown as Northwind Office Supply
scans/case01_quote_6400_not_invoice.png,QT-6400,quote,Delta Repair Services,,,"",,0,NOT AN INVOICE DO NOT PAY,Estimate only; not an invoice payment request
scans/case02_credit_note_10032_credit_applied.png,CN-10032,credit_note,BrightPath Office Solutions,BP-9200,PO-4488,,,-3725,CREDIT APPLIED,Original invoice INV-82210; credit note is not an invoice payment request
scans/case02_inv_82415_paid.png,INV-82415,invoice,BrightPath Office Solutions,BP-9200,PO-4510,,,18737,paid stamp,Customer shown as Northwind Office Supply
scans/case02_inv_82478_partial_payment.png,INV-82478,invoice,BrightPath Office Solutions,BP-9200,PO-4577,,,14144,"partial payment of $100.00; balance due $41.44",Customer shown as Northwind Office Supply
scans/case02_inv_82533_vendor_hold.png,INV-82533,invoice,BrightPath Office Solutions,BP-9200,MISSING PO,,,23794,"VENDOR HOLD; INACTIVE VENDOR",Customer shown as Northwind Office Supply



## bank_export.csv
date,description,invoice_id,amount_cents,status
2026-04-23,BluePeak Distributors,INV-7801,18737,paid
2026-04-24,StationHub LLC,INV-7802,5000,paid
2026-04-27,Orion Field Services,INV-8422,42245,paid
2026-04-23,BrightPath Office Solutions,INV-82415,18737,paid
2026-05-02,BrightPath Office Solutions,INV-82478,10000,paid
2026-05-05,BrightPath Office Solutions,INV-82533,23794,pending



## vendor_master.csv
vendor_id,name,tax_id,status
V-BP4471,BluePeak Distributors,BP-4471,active
V-SH9982,StationHub LLC,SH-9982,active
V-OR1109,Orion Field Services,OR-1109,active
V-BP9200,BrightPath Office Solutions,BP-9200,active
V-NW001,Northwind Office Supply,NW-CUSTOMER,customer



## purchase_orders.csv
po_id,vendor_id,limit_cents,status
PO-4510,V-BP4471,20000,open
PO-4510,V-BP9200,20000,open
PO-4510-B,V-SH9982,6000,open
PO-4577,V-BP9200,16000,open
PO-8422,V-OR1109,50000,open
PO-4488,V-BP9200,5000,closed



## previous_invoices.csv
invoice_id,vendor_id,gross_total_cents,paid_date
INV-7600,V-BP4471,18737,2026-03-11
INV-82210,V-BP9200,3725,2026-04-18



Return `audit_result.json` as a single JSON object only. /no_think

Workspace files available to the model

README_TASK.md
document_extracts.csv
bank_export.csv
vendor_master.csv
purchase_orders.csv
previous_invoices.csv

(File contents are identical to the sections inlined in the prompt above.)

Agent actions

  

Model answer

{"case_id":"P3-GEN-03","approved_invoice_ids":[82415],"review_invoice_ids":[7801,7802,8422,82478],"reject_invoice_ids":[82533],"ignored_document_ids":["QT-6400","CN-10032"],"total_approved_gross_cents":18737,"warnings_by_invoice":{"7801":["duplicate_risk"],"7802":["payment_short"],"8422":["under_review_stamp"],"82478":["payment_short"],"82533":["missing_po","inactive_vendor"]},"evidence":["bank_export.csv","document_extracts.csv","previous_invoices.csv","scans/case01_inv_7801_paid.png","scans/case01_inv_7802_short_paid.png","scans/case01_inv_8422_under_review.png","scans/case01_quote_6400_not_invoice.png","scans/case02_credit_note_10032_credit_applied.png","scans/case02_inv_82415_paid.png","scans/case02_inv_82478_partial_payment.png","scans/case02_inv_82533_vendor_hold.png"],"proof_code":290770}
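
In this answer the invoice IDs appear as bare integers (`82415`) and as digit-only keys (`"7801"`), not in the `INV-…` form used in the `document_id` column. The sketch below flags that and also checks that each warning array is sorted, as the README requires. `id_format_problems` and the `INV-\d+` pattern are assumptions made for illustration, not the benchmark's checker.

```python
import re

# Assumed ID format, inferred from the document_id column of document_extracts.csv.
INVOICE_ID = re.compile(r"INV-\d+")

def id_format_problems(answer):
    """List ID-format and warning-order problems in an answer (hypothetical helper)."""
    problems = []
    for key in ("approved_invoice_ids", "review_invoice_ids", "reject_invoice_ids"):
        for inv in answer[key]:
            if not (isinstance(inv, str) and INVOICE_ID.fullmatch(inv)):
                problems.append(f"{key}: {inv!r}")
    for inv, codes in answer["warnings_by_invoice"].items():
        if not INVOICE_ID.fullmatch(inv):
            problems.append(f"warnings_by_invoice key: {inv!r}")
        if codes != sorted(codes):
            problems.append(f"unsorted warnings for {inv!r}")
    return problems

# Fields copied from the model answer for P3-GEN-03.
answer = {
    "approved_invoice_ids": [82415],
    "review_invoice_ids": [7801, 7802, 8422, 82478],
    "reject_invoice_ids": [82533],
    "warnings_by_invoice": {
        "7801": ["duplicate_risk"],
        "7802": ["payment_short"],
        "8422": ["under_review_stamp"],
        "82478": ["payment_short"],
        "82533": ["missing_po", "inactive_vendor"],
    },
}

problems = id_format_problems(answer)
for line in problems:
    print(line)
```

Every listed ID and every `warnings_by_invoice` key fails the assumed format, and the warnings for `82533` are not in sorted order. The answer also has no `warnings_by_invoice` entry for the approved invoice, although the README requires one for every real invoice ID.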
generated_invoice_case_04_text · paperwork_text · 3/4 · hidden_oracle_pass · missing_or_wrong_evidence

Prompt

TEXT-ONLY PAPERWORK VARIANT
Use `document_extracts.csv` as the normalized OCR/document reading.
No image files are attached in this run.
For the `evidence` field, cite the original `source_path` values from `document_extracts.csv` plus the CSV context files you used.
Do not cite `document_extracts.csv` itself as evidence.
Return final JSON only. No Markdown. No explanation.


## README_TASK.md
# Paperwork Trial v3 Generated Image Case 04

This is synthetic benchmark bookkeeping. It is not financial, tax, or legal advice.

Audit the scanned document folder and write `audit_result.json` only.

Case ID: `P3-GEN-04`

Output keys exactly:

`case_id, approved_invoice_ids, review_invoice_ids, reject_invoice_ids, ignored_document_ids, total_approved_gross_cents, warnings_by_invoice, evidence, proof_code`

Rules:

- Treat the files in `scans/` as the source documents.
- Ignore documents that are explicitly statements or not invoice payment requests.
- Match vendors by visible vendor name and tax ID when available.
- `payment_match` requires paid bank rows for the invoice whose summed amount equals the invoice gross.
- `vendor_tax_id_conflict` applies when the visible vendor name matches one vendor but the visible tax ID belongs to a different vendor in `vendor_master.csv`.
- `tax_rounding_mismatch` applies when invoice tax differs from the expected tax by more than 1 cent.
- For this case, expected tax is 19% of net, rounded to the nearest cent.
- Approved invoices have no warnings and paid amount equals gross.
- Review invoices have warnings but are not reject-level. `vendor_tax_id_conflict` and `tax_rounding_mismatch` are review-level warnings.
- Reject only if the invoice is impossible to process, uses an inactive vendor, or lacks a required PO.
- Sort all invoice-id arrays ascending.
- Allowed warning codes are exactly: `tax_rounding_mismatch`, `vendor_tax_id_conflict`.
- `warnings_by_invoice` must include every real invoice ID and sorted warning arrays.
- Warning arrays must be flat arrays of lowercase strings, never nested arrays and never prose labels.
- `ignored_document_ids` must include visible document IDs from ignored non-invoice documents, not filenames.
- `total_approved_gross_cents` is the sum of approved invoice gross totals only.
- `evidence` must list the relative source file paths used in stable alphabetical order, including folder prefixes such as `scans/`.
- Include the relevant CSV files and every scanned document inspected in `evidence`, including ignored statement scans.
- `proof_code = total_approved_gross_cents + sum(numeric parts of all real invoice IDs) + 97 * total_warning_count`.




## document_extracts.csv
source_path,document_id,document_type,vendor_name,tax_id,po_id,net_cents,tax_cents,gross_total_cents,visible_marks,notes
scans/orion_tax_collision_contact_sheet.png,INV-4170,invoice,Orion Field Services,OS-913,PO-4170,,,14500,,Vendor name and tax ID are both visible on the document
scans/orion_tax_collision_contact_sheet.png,INV-4171,invoice,Orion Field Services,OF-912,PO-4171,9999,2001,12000,,Invoice shows net 99.99 and tax 20.01
scans/orion_tax_collision_contact_sheet.png,ST-4170,statement,Orion Field Services,OF-912,,,,0,STATEMENT,Statement only; not an invoice payment request



## bank_export.csv
date,description,invoice_id,amount_cents,status
2026-06-12,Orion Field Services,INV-4170,14500,paid
2026-06-12,Orion Field Services,INV-4171,12000,paid




## vendor_master.csv
vendor_id,name,tax_id,status
V-912,Orion Field Services,OF-912,active
V-913,Orion Systems,OS-913,active




## purchase_orders.csv
po_id,vendor_id,limit_cents,status
PO-4170,V-912,16000,open
PO-4171,V-912,13000,open




Return `audit_result.json` as a single JSON object only. /no_think

Workspace files available to the model

README_TASK.md, document_extracts.csv, bank_export.csv, vendor_master.csv, purchase_orders.csv (contents identical to the copies embedded in the prompt above)

Model answer

{"case_id":"P3-GEN-04","approved_invoice_ids":[],"review_invoice_ids":["INV-4170","INV-4171"],"reject_invoice_ids":[],"ignored_document_ids":["ST-4170"],"total_approved_gross_cents":0,"warnings_by_invoice":{"INV-4170":["vendor_tax_id_conflict"],"INV-4171":["tax_rounding_mismatch"]},"evidence":["bank_export.csv","document_extracts.csv","purchase_orders.csv","scans/orion_tax_collision_contact_sheet.png","vendor_master.csv"],"proof_code":8535}
generated_invoice_case_05_text paperwork_text 2/4 core_oracle_pass, hidden_oracle_pass warning_code_error

Prompt

TEXT-ONLY PAPERWORK VARIANT
Use `document_extracts.csv` as the normalized OCR/document reading.
No image files are attached in this run.
For the `evidence` field, cite the original `source_path` values from `document_extracts.csv` plus the CSV context files you used.
Do not cite `document_extracts.csv` itself as evidence.
Return final JSON only. No Markdown. No explanation.


## README_TASK.md
# Paperwork Trial v3 Generated Image Case 05

This is synthetic benchmark bookkeeping. It is not financial, tax, or legal advice.

Audit the scanned document folder and write `audit_result.json` only.

Case ID: `P3-GEN-05`

Output keys exactly:

`case_id, approved_invoice_ids, review_invoice_ids, reject_invoice_ids, ignored_document_ids, total_approved_gross_cents, warnings_by_invoice, evidence, proof_code`

Rules:

- Treat the files in `scans/` as the source documents.
- Ignore documents that are explicitly quotes or not invoice payment requests.
- Match vendors by visible vendor name and tax ID when available.
- `payment_match` requires paid bank rows for the invoice whose summed amount equals the invoice gross.
- Split bank payments count as `payment_match` when paid rows with the same invoice ID sum to the invoice gross.
- Use the latest purchase order revision visible in the scanned documents and reflected in `purchase_orders.csv`.
- `po_cancelled` applies when the invoice uses a purchase order that the latest revision marks cancelled.
- Approved invoices have no warnings and paid amount equals gross.
- Review invoices have warnings but are not reject-level. `po_cancelled` is review-level.
- Reject only if the invoice is impossible to process, uses an inactive vendor, or lacks a required PO.
- Sort all invoice-id arrays ascending.
- Allowed warning codes are exactly: `po_cancelled`.
- `warnings_by_invoice` must include every real invoice ID and sorted warning arrays.
- Warning arrays must be flat arrays of lowercase strings, never nested arrays and never prose labels.
- `ignored_document_ids` must include visible document IDs from ignored non-invoice documents, not filenames.
- `total_approved_gross_cents` is the sum of approved invoice gross totals only.
- `evidence` must list the relative source file paths used in stable alphabetical order, including folder prefixes such as `scans/`.
- Include the relevant CSV files and every scanned document inspected in `evidence`, including ignored quote scans.
- `proof_code = total_approved_gross_cents + sum(numeric parts of all real invoice IDs) + 97 * total_warning_count`.




## document_extracts.csv
source_path,document_id,document_type,vendor_name,tax_id,po_id,net_cents,tax_cents,gross_total_cents,visible_marks,notes
scans/delta_po_revision_contact_sheet.png,INV-5600,invoice,Delta Repair Group,DR-640,PO-5600-A,,,30000,,Invoice uses PO-5600-A
scans/delta_po_revision_contact_sheet.png,INV-5601,invoice,Delta Repair Group,DR-640,PO-5600-B,,,18000,,Invoice uses PO-5600-B
scans/delta_po_revision_contact_sheet.png,QT-5601,quote,Delta Repair Group,DR-640,PO-5600-B,,,18000,QUOTE,Quote only; not an invoice payment request



## bank_export.csv
date,description,invoice_id,amount_cents,status
2026-07-02,Delta Repair Group partial payment,INV-5600,20000,paid
2026-07-03,Delta Repair Group final payment,INV-5600,10000,paid
2026-07-04,Delta Repair Group parts invoice,INV-5601,18000,paid




## vendor_master.csv
vendor_id,name,tax_id,status
V-640,Delta Repair Group,DR-640,active
V-641,Delta Repair North,DR-641,active




## purchase_orders.csv
po_id,vendor_id,limit_cents,status
PO-5600-A,V-640,35000,open
PO-5600-B,V-640,22000,cancelled




Return `audit_result.json` as a single JSON object only. /no_think

Workspace files available to the model

README_TASK.md, document_extracts.csv, bank_export.csv, vendor_master.csv, purchase_orders.csv (contents identical to the copies embedded in the prompt above)

Model answer

{"case_id":"P3-GEN-05","approved_invoice_ids":["INV-5600"],"review_invoice_ids":["INV-5601"],"reject_invoice_ids":[],"ignored_document_ids":["QT-5601"],"total_approved_gross_cents":30000,"warnings_by_invoice":{"INV-5601":["po_cancelled"]},"evidence":["bank_export.csv","purchase_orders.csv","scans/delta_po_revision_contact_sheet.png","vendor_master.csv"],"proof_code":41298}