codex-default
2026-05-15T11-50-30Z_codex-default_paperwork_v3_generated_images
Reference run: Codex CLI, OpenCode, or OpenRouter baseline, included as a comparison target rather than a local-model result.
Verdict
Practical run result: Core 4/5 · Visible 5/5 · Hidden 3/5
Task Results
Compact view; open a row for prompt and raw answer.

| Task | Category | Score | Failed Checks | Failure Types | Prompt / Answer |
|---|---|---|---|---|---|
| generated_invoice_case_01 | paperwork | 4/4 | none | none | case files OpenPrompt You are auditing a synthetic scanned paperwork folder. Read `README_TASK.md`, inspect the files in `scans/`, and use `bank_export.csv`, `vendor_master.csv`, and `purchase_orders.csv`. Write `audit_result.json` only. This is benchmark bookkeeping, not financial, tax, or legal advice. Important: - The scanned images are the source documents. - `Northwind Office Supply` is the customer, not the vendor. - Ignore documents that clearly say they are not invoices. - Do not invent fields that are not supported by the files. - Use visible document IDs for `ignored_document_ids`, not filenames. - Use relative paths with folder prefixes in `evidence`, for example `scans/example.png`. - Use only allowed lowercase warning codes from `README_TASK.md`. - Warning arrays must be flat arrays of strings. Workspace files available to the model README_TASK.md# Paperwork Trial v3 Calibration Case 01 This is synthetic benchmark bookkeeping. It is not financial, tax, or legal advice. Audit the scanned document folder and write `audit_result.json` only. Case ID: `P3-GEN-01` Output keys exactly: `case_id, approved_invoice_ids, review_invoice_ids, reject_invoice_ids, ignored_document_ids, total_approved_gross_cents, warnings_by_invoice, evidence, proof_code` Rules: - Treat the files in `scans/` as the source documents. - Ignore documents that are explicitly not invoices. - Match vendors by tax ID first, then by exact vendor name. - `payment_match` requires paid bank rows for the invoice whose summed amount equals the invoice gross. - `payment_short` applies when the paid bank amount is lower than invoice gross. - `under_review_stamp` applies when the scanned invoice visibly has an under-review stamp. - Approved invoices have no warnings and paid amount equals gross. - Review invoices have warnings but are not reject-level. 
`payment_short` and `under_review_stamp` are review-level warnings, not reject-level warnings. - Reject only if the invoice is impossible to process, uses an inactive vendor, or lacks a required PO. - Sort all invoice-id arrays ascending. - Allowed warning codes are exactly: `payment_short`, `under_review_stamp`. - `warnings_by_invoice` must include every real invoice ID and sorted warning arrays. - Warning arrays must be flat arrays of lowercase strings, never nested arrays and never prose labels. - `ignored_document_ids` must include visible document IDs from ignored non-invoice documents, not filenames. Example: use `QT-6400`, not `quote_6400_not_invoice.png`. - `total_approved_gross_cents` is the sum of approved invoice gross totals only. - `evidence` must list the relative source file paths used in stable alphabetical order, including folder prefixes such as `scans/`. - Include the relevant CSV files and every scanned document inspected in `evidence`, including ignored non-invoice scans. - `proof_code = total_approved_gross_cents + sum(numeric parts of all real invoice IDs) + 97 * total_warning_count`. Important: `Northwind Office Supply` is the customer, not the vendor. bank_export.csvdate,description,invoice_id,amount_cents,status 2026-04-23,BluePeak Distributors,INV-7801,18737,paid 2026-04-24,StationHub LLC,INV-7802,5000,paid 2026-04-27,Orion Field Services,INV-8422,42245,paid vendor_master.csvvendor_id,name,tax_id,status V-BP4471,BluePeak Distributors,BP-4471,active V-SH9982,StationHub LLC,SH-9982,active V-OR1109,Orion Field Services,OR-1109,active V-NW001,Northwind Office Supply,NW-CUSTOMER,customer purchase_orders.csvpo_id,vendor_id,limit_cents,status PO-4510,V-BP4471,20000,open PO-4510-B,V-SH9982,6000,open PO-8422,V-OR1109,50000,open Input images
Model answer {
"case_id": "P3-GEN-01",
"approved_invoice_ids": [
"INV-7801"
],
"review_invoice_ids": [
"INV-7802",
"INV-8422"
],
"reject_invoice_ids": [],
"ignored_document_ids": [
"QT-6400"
],
"total_approved_gross_cents": 18737,
"warnings_by_invoice": {
"INV-7801": [],
"INV-7802": [
"payment_short"
],
"INV-8422": [
"under_review_stamp"
]
},
"evidence": [
"bank_export.csv",
"purchase_orders.csv",
"scans/inv_7801_paid.png",
"scans/inv_7802_short_paid.png",
"scans/inv_8422_under_review.png",
"scans/quote_6400_not_invoice.png",
"vendor_master.csv"
],
"proof_code": 42956
}
|
| generated_invoice_case_02 | paperwork | 3/4 | hidden_oracle_pass | proof_code_error | case files InspectPrompt You are auditing a synthetic scanned paperwork folder. Read `README_TASK.md`, inspect the files in `scans/`, and use `bank_export.csv`, `vendor_master.csv`, and `purchase_orders.csv`. Write `audit_result.json` only. This is benchmark bookkeeping, not financial, tax, or legal advice. Important: - The scanned images are the source documents. - `Northwind Office Supply` is the customer, not the vendor. - Ignore credit notes and documents that are not invoice payment requests. - Do not invent fields that are not supported by the files. - Use visible document IDs for `ignored_document_ids`, not filenames. - Use relative paths with folder prefixes in `evidence`, for example `scans/example.png`. - Use only allowed lowercase warning codes from `README_TASK.md`. - Warning arrays must be flat arrays of strings. Workspace files available to the model README_TASK.md# Paperwork Trial v3 Calibration Case 02 This is synthetic benchmark bookkeeping. It is not financial, tax, or legal advice. Audit the scanned document folder and write `audit_result.json` only. Case ID: `P3-GEN-02` Output keys exactly: `case_id, approved_invoice_ids, review_invoice_ids, reject_invoice_ids, ignored_document_ids, total_approved_gross_cents, warnings_by_invoice, evidence, proof_code` Rules: - Treat the files in `scans/` as the source documents. - Ignore documents that are explicitly credit notes or not invoice payment requests. - Match vendors by visible vendor name and tax ID when available. - `payment_match` requires paid bank rows for the invoice whose summed amount equals the invoice gross. - `payment_short` applies when the paid bank amount is lower than invoice gross. - `missing_po` applies when the scanned invoice visibly has no valid PO number or says `MISSING PO`. 
- `inactive_vendor` applies when the scanned invoice visibly has a vendor-hold/inactive-vendor stamp or vendor records mark the vendor inactive. - Approved invoices have no warnings and paid amount equals gross. - Review invoices have warnings but are not reject-level. `payment_short` is review-level. - Reject invoices with `missing_po` or `inactive_vendor`. - Sort all invoice-id arrays ascending. - Allowed warning codes are exactly: `inactive_vendor`, `missing_po`, `payment_short`. - `warnings_by_invoice` must include every real invoice ID and sorted warning arrays. - Warning arrays must be flat arrays of lowercase strings, never nested arrays and never prose labels. - `ignored_document_ids` must include visible document IDs from ignored non-invoice documents, not filenames. Example: use `CN-10032`, not `credit_note_10032_credit_applied.png`. - `total_approved_gross_cents` is the sum of approved invoice gross totals only. - `evidence` must list the relative source file paths used in stable alphabetical order, including folder prefixes such as `scans/`. - Include the relevant CSV files and every scanned document inspected in `evidence`, including ignored credit-note scans. - `proof_code = total_approved_gross_cents + sum(numeric parts of all real invoice IDs) + 97 * total_warning_count`. Important: `Northwind Office Supply` is the customer, not the vendor. bank_export.csvdate,description,invoice_id,amount_cents,status 2026-04-23,BrightPath Office Solutions,INV-82415,18737,paid 2026-05-02,BrightPath Office Solutions,INV-82478,10000,paid 2026-05-05,BrightPath Office Solutions,INV-82533,23794,pending vendor_master.csvvendor_id,name,tax_id,status V-BP9200,BrightPath Office Solutions,BP-9200,active V-NW001,Northwind Office Supply,NW-CUSTOMER,customer purchase_orders.csvpo_id,vendor_id,limit_cents,status PO-4510,V-BP9200,20000,open PO-4577,V-BP9200,16000,open PO-4488,V-BP9200,5000,closed Input images
Model answer {
"case_id": "P3-GEN-02",
"approved_invoice_ids": [
"INV-82415"
],
"review_invoice_ids": [
"INV-82478"
],
"reject_invoice_ids": [
"INV-82533"
],
"ignored_document_ids": [
"CN-10032"
],
"total_approved_gross_cents": 18737,
"warnings_by_invoice": {
"INV-82415": [],
"INV-82478": [
"payment_short"
],
"INV-82533": [
"inactive_vendor",
"missing_po"
]
},
"evidence": [
"bank_export.csv",
"purchase_orders.csv",
"scans/credit_note_10032_credit_applied.png",
"scans/inv_82415_paid.png",
"scans/inv_82478_partial_payment.png",
"scans/inv_82533_vendor_hold.png",
"vendor_master.csv"
],
"proof_code": 267923
}
|
| generated_invoice_case_03 | paperwork | 4/4 | none | none | case files OpenPrompt You are auditing a synthetic scanned paperwork folder. Read `README_TASK.md`, inspect the files in `scans/`, and use `bank_export.csv`, `vendor_master.csv`, `purchase_orders.csv`, and `previous_invoices.csv`. Write `audit_result.json` only. This is benchmark bookkeeping, not financial, tax, or legal advice. Important: - The scanned images are the source documents. - `Northwind Office Supply` is the customer, not the vendor. - Ignore quotes, credit notes, and documents that are not invoice payment requests. - Do not invent fields that are not supported by the files. - Use visible document IDs for `ignored_document_ids`, not filenames. - Use relative paths with folder prefixes in `evidence`, for example `scans/example.png`. - Use only allowed lowercase warning codes from `README_TASK.md`. - Warning arrays must be flat arrays of strings. Workspace files available to the model README_TASK.md# Paperwork Trial v3 Calibration Case 03 This is synthetic benchmark bookkeeping. It is not financial, tax, or legal advice. Audit the scanned document folder and write `audit_result.json` only. Case ID: `P3-GEN-03` Output keys exactly: `case_id, approved_invoice_ids, review_invoice_ids, reject_invoice_ids, ignored_document_ids, total_approved_gross_cents, warnings_by_invoice, evidence, proof_code` Rules: - Treat the files in `scans/` as the source documents. - Ignore documents that are explicitly quotes, credit notes, or not invoice payment requests. - Match vendors by visible vendor name and tax ID when available. - `payment_match` requires paid bank rows for the invoice whose summed amount equals the invoice gross. - `duplicate_risk` applies when `previous_invoices.csv` has the same vendor_id and same gross amount as the current invoice. - `payment_short` applies when the paid bank amount is lower than invoice gross. 
- `under_review_stamp` applies when the scanned invoice visibly has an under-review stamp. - `missing_po` applies when the scanned invoice visibly has no valid PO number or says `MISSING PO`. - `inactive_vendor` applies when the scanned invoice visibly has a vendor-hold/inactive-vendor stamp or vendor records mark the vendor inactive. - Approved invoices have no warnings and paid amount equals gross. - Review invoices have warnings but are not reject-level. `duplicate_risk`, `payment_short`, and `under_review_stamp` are review-level. - Reject invoices with `missing_po` or `inactive_vendor`. - Sort all invoice-id arrays ascending. - Allowed warning codes are exactly: `duplicate_risk`, `inactive_vendor`, `missing_po`, `payment_short`, `under_review_stamp`. - `warnings_by_invoice` must include every real invoice ID and sorted warning arrays. - Warning arrays must be flat arrays of lowercase strings, never nested arrays and never prose labels. - `ignored_document_ids` must include visible document IDs from ignored non-invoice documents, not filenames. - `total_approved_gross_cents` is the sum of approved invoice gross totals only. - `evidence` must list the relative source file paths used in stable alphabetical order, including folder prefixes such as `scans/`. - Include the relevant CSV files and every scanned document inspected in `evidence`, including `previous_invoices.csv` and ignored quote and credit-note scans. - `proof_code = total_approved_gross_cents + sum(numeric parts of all real invoice IDs) + 97 * total_warning_count`. Important: `Northwind Office Supply` is the customer, not the vendor. 
bank_export.csvdate,description,invoice_id,amount_cents,status 2026-04-23,BluePeak Distributors,INV-7801,18737,paid 2026-04-24,StationHub LLC,INV-7802,5000,paid 2026-04-27,Orion Field Services,INV-8422,42245,paid 2026-04-23,BrightPath Office Solutions,INV-82415,18737,paid 2026-05-02,BrightPath Office Solutions,INV-82478,10000,paid 2026-05-05,BrightPath Office Solutions,INV-82533,23794,pending vendor_master.csvvendor_id,name,tax_id,status V-BP4471,BluePeak Distributors,BP-4471,active V-SH9982,StationHub LLC,SH-9982,active V-OR1109,Orion Field Services,OR-1109,active V-BP9200,BrightPath Office Solutions,BP-9200,active V-NW001,Northwind Office Supply,NW-CUSTOMER,customer purchase_orders.csvpo_id,vendor_id,limit_cents,status PO-4510,V-BP4471,20000,open PO-4510,V-BP9200,20000,open PO-4510-B,V-SH9982,6000,open PO-4577,V-BP9200,16000,open PO-8422,V-OR1109,50000,open PO-4488,V-BP9200,5000,closed previous_invoices.csvinvoice_id,vendor_id,gross_total_cents,paid_date INV-7600,V-BP4471,18737,2026-03-11 INV-82210,V-BP9200,3725,2026-04-18 Input images
Model answer {
"case_id": "P3-GEN-03",
"approved_invoice_ids": [
"INV-82415"
],
"review_invoice_ids": [
"INV-7801",
"INV-7802",
"INV-8422",
"INV-82478"
],
"reject_invoice_ids": [
"INV-82533"
],
"ignored_document_ids": [
"CN-10032",
"QT-6400"
],
"total_approved_gross_cents": 18737,
"warnings_by_invoice": {
"INV-7801": [
"duplicate_risk"
],
"INV-7802": [
"payment_short"
],
"INV-8422": [
"under_review_stamp"
],
"INV-82415": [],
"INV-82478": [
"payment_short"
],
"INV-82533": [
"inactive_vendor",
"missing_po"
]
},
"evidence": [
"bank_export.csv",
"previous_invoices.csv",
"purchase_orders.csv",
"scans/case01_inv_7801_paid.png",
"scans/case01_inv_7802_short_paid.png",
"scans/case01_inv_8422_under_review.png",
"scans/case01_quote_6400_not_invoice.png",
"scans/case02_credit_note_10032_credit_applied.png",
"scans/case02_inv_82415_paid.png",
"scans/case02_inv_82478_partial_payment.png",
"scans/case02_inv_82533_vendor_hold.png",
"vendor_master.csv"
],
"proof_code": 290770
}
|
| generated_invoice_case_04 | paperwork | 4/4 | none | none | case files OpenPrompt You are auditing a synthetic scanned paperwork folder. Read `README_TASK.md`, inspect the files in `scans/`, and use `bank_export.csv`, `vendor_master.csv`, and `purchase_orders.csv`. Write `audit_result.json` only. This is benchmark bookkeeping, not financial, tax, or legal advice. Important: - The scanned images are the source documents. - Ignore statements and documents that are not invoice payment requests. - Do not invent fields that are not supported by the files. - Use visible document IDs for `ignored_document_ids`, not filenames. - Use relative paths with folder prefixes in `evidence`, for example `scans/example.png`. - Use only allowed lowercase warning codes from `README_TASK.md`. - Warning arrays must be flat arrays of strings. Workspace files available to the model README_TASK.md# Paperwork Trial v3 Generated Image Case 04 This is synthetic benchmark bookkeeping. It is not financial, tax, or legal advice. Audit the scanned document folder and write `audit_result.json` only. Case ID: `P3-GEN-04` Output keys exactly: `case_id, approved_invoice_ids, review_invoice_ids, reject_invoice_ids, ignored_document_ids, total_approved_gross_cents, warnings_by_invoice, evidence, proof_code` Rules: - Treat the files in `scans/` as the source documents. - Ignore documents that are explicitly statements or not invoice payment requests. - Match vendors by visible vendor name and tax ID when available. - `payment_match` requires paid bank rows for the invoice whose summed amount equals the invoice gross. - `vendor_tax_id_conflict` applies when the visible vendor name matches one vendor but the visible tax ID belongs to a different vendor in `vendor_master.csv`. - `tax_rounding_mismatch` applies when invoice tax differs from the expected tax by more than 1 cent. - For this case, expected tax is 19% of net, rounded to the nearest cent. 
- Approved invoices have no warnings and paid amount equals gross. - Review invoices have warnings but are not reject-level. `vendor_tax_id_conflict` and `tax_rounding_mismatch` are review-level warnings. - Reject only if the invoice is impossible to process, uses an inactive vendor, or lacks a required PO. - Sort all invoice-id arrays ascending. - Allowed warning codes are exactly: `tax_rounding_mismatch`, `vendor_tax_id_conflict`. - `warnings_by_invoice` must include every real invoice ID and sorted warning arrays. - Warning arrays must be flat arrays of lowercase strings, never nested arrays and never prose labels. - `ignored_document_ids` must include visible document IDs from ignored non-invoice documents, not filenames. - `total_approved_gross_cents` is the sum of approved invoice gross totals only. - `evidence` must list the relative source file paths used in stable alphabetical order, including folder prefixes such as `scans/`. - Include the relevant CSV files and every scanned document inspected in `evidence`, including ignored statement scans. - `proof_code = total_approved_gross_cents + sum(numeric parts of all real invoice IDs) + 97 * total_warning_count`. bank_export.csvdate,description,invoice_id,amount_cents,status 2026-06-12,Orion Field Services,INV-4170,14500,paid 2026-06-12,Orion Field Services,INV-4171,12000,paid vendor_master.csvvendor_id,name,tax_id,status V-912,Orion Field Services,OF-912,active V-913,Orion Systems,OS-913,active purchase_orders.csvpo_id,vendor_id,limit_cents,status PO-4170,V-912,16000,open PO-4171,V-912,13000,open Input images
Model answer {
"case_id": "P3-GEN-04",
"approved_invoice_ids": [],
"review_invoice_ids": [
"INV-4170",
"INV-4171"
],
"reject_invoice_ids": [],
"ignored_document_ids": [
"ST-4170"
],
"total_approved_gross_cents": 0,
"warnings_by_invoice": {
"INV-4170": [
"vendor_tax_id_conflict"
],
"INV-4171": [
"tax_rounding_mismatch"
]
},
"evidence": [
"bank_export.csv",
"purchase_orders.csv",
"scans/orion_tax_collision_contact_sheet.png",
"vendor_master.csv"
],
"proof_code": 8535
}
|
| generated_invoice_case_05 | paperwork | 2/4 | core_oracle_pass, hidden_oracle_pass | ignored_document_id_error | case files InspectPrompt You are auditing a synthetic scanned paperwork folder. Read `README_TASK.md`, inspect the files in `scans/`, and use `bank_export.csv`, `vendor_master.csv`, and `purchase_orders.csv`. Write `audit_result.json` only. This is benchmark bookkeeping, not financial, tax, or legal advice. Important: - The scanned images are the source documents. - Ignore quotes and documents that are not invoice payment requests. - Split payments can sum to one invoice. - Use the latest purchase order revision visible in the scanned documents and in `purchase_orders.csv`. - Do not invent fields that are not supported by the files. - Use visible document IDs for `ignored_document_ids`, not filenames. - Use relative paths with folder prefixes in `evidence`, for example `scans/example.png`. - Use only allowed lowercase warning codes from `README_TASK.md`. - Warning arrays must be flat arrays of strings. Workspace files available to the model README_TASK.md# Paperwork Trial v3 Generated Image Case 05 This is synthetic benchmark bookkeeping. It is not financial, tax, or legal advice. Audit the scanned document folder and write `audit_result.json` only. Case ID: `P3-GEN-05` Output keys exactly: `case_id, approved_invoice_ids, review_invoice_ids, reject_invoice_ids, ignored_document_ids, total_approved_gross_cents, warnings_by_invoice, evidence, proof_code` Rules: - Treat the files in `scans/` as the source documents. - Ignore documents that are explicitly quotes or not invoice payment requests. - Match vendors by visible vendor name and tax ID when available. - `payment_match` requires paid bank rows for the invoice whose summed amount equals the invoice gross. - Split bank payments count as `payment_match` when paid rows with the same invoice ID sum to the invoice gross. 
- Use the latest purchase order revision visible in the scanned documents and reflected in `purchase_orders.csv`. - `po_cancelled` applies when the invoice uses a purchase order that the latest revision marks cancelled. - Approved invoices have no warnings and paid amount equals gross. - Review invoices have warnings but are not reject-level. `po_cancelled` is review-level. - Reject only if the invoice is impossible to process, uses an inactive vendor, or lacks a required PO. - Sort all invoice-id arrays ascending. - Allowed warning codes are exactly: `po_cancelled`. - `warnings_by_invoice` must include every real invoice ID and sorted warning arrays. - Warning arrays must be flat arrays of lowercase strings, never nested arrays and never prose labels. - `ignored_document_ids` must include visible document IDs from ignored non-invoice documents, not filenames. - `total_approved_gross_cents` is the sum of approved invoice gross totals only. - `evidence` must list the relative source file paths used in stable alphabetical order, including folder prefixes such as `scans/`. - Include the relevant CSV files and every scanned document inspected in `evidence`, including ignored quote scans. - `proof_code = total_approved_gross_cents + sum(numeric parts of all real invoice IDs) + 97 * total_warning_count`. bank_export.csvdate,description,invoice_id,amount_cents,status 2026-07-02,Delta Repair Group partial payment,INV-5600,20000,paid 2026-07-03,Delta Repair Group final payment,INV-5600,10000,paid 2026-07-04,Delta Repair Group parts invoice,INV-5601,18000,paid vendor_master.csvvendor_id,name,tax_id,status V-640,Delta Repair Group,DR-640,active V-641,Delta Repair North,DR-641,active purchase_orders.csvpo_id,vendor_id,limit_cents,status PO-5600-A,V-640,35000,open PO-5600-B,V-640,22000,cancelled Input images
Model answer {
"case_id": "P3-GEN-05",
"approved_invoice_ids": [
"INV-5600"
],
"review_invoice_ids": [
"INV-5601"
],
"reject_invoice_ids": [],
"ignored_document_ids": [
"PO-5600-A",
"PO-5600-B",
"QT-5601"
],
"total_approved_gross_cents": 30000,
"warnings_by_invoice": {
"INV-5600": [],
"INV-5601": [
"po_cancelled"
]
},
"evidence": [
"bank_export.csv",
"purchase_orders.csv",
"scans/delta_po_revision_contact_sheet.png",
"vendor_master.csv"
],
"proof_code": 41298
}
|
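
Every README in this run defines `proof_code = total_approved_gross_cents + sum(numeric parts of all real invoice IDs) + 97 * total_warning_count`. The sketch below recomputes that value from each model answer's own fields and compares it with the `proof_code` the answer reports. It is a consistency check on the raw answers only, not a reproduction of the benchmark's oracle. The function name `recompute_proof_code` and the `answers` dictionary are illustrative. It assumes that the keys of `warnings_by_invoice` are the real invoice IDs, and that the "numeric part" of an ID is its run of digits.

```python
import re

WARNING_WEIGHT = 97  # multiplier stated in the README formula


def recompute_proof_code(answer: dict) -> int:
    """Recompute proof_code from the other fields of one audit_result.json answer."""
    warnings = answer["warnings_by_invoice"]
    # Assumption: the keys of warnings_by_invoice are exactly the real invoice IDs,
    # and the numeric part of an ID such as "INV-7801" is its digit run (7801).
    id_sum = sum(int(re.search(r"\d+", inv_id).group()) for inv_id in warnings)
    warning_count = sum(len(codes) for codes in warnings.values())
    return answer["total_approved_gross_cents"] + id_sum + WARNING_WEIGHT * warning_count


# Only the fields needed for the formula, copied from the five model answers above.
answers = {
    "P3-GEN-01": {
        "total_approved_gross_cents": 18737,
        "warnings_by_invoice": {
            "INV-7801": [],
            "INV-7802": ["payment_short"],
            "INV-8422": ["under_review_stamp"],
        },
        "proof_code": 42956,
    },
    "P3-GEN-02": {
        "total_approved_gross_cents": 18737,
        "warnings_by_invoice": {
            "INV-82415": [],
            "INV-82478": ["payment_short"],
            "INV-82533": ["inactive_vendor", "missing_po"],
        },
        "proof_code": 267923,
    },
    "P3-GEN-03": {
        "total_approved_gross_cents": 18737,
        "warnings_by_invoice": {
            "INV-7801": ["duplicate_risk"],
            "INV-7802": ["payment_short"],
            "INV-8422": ["under_review_stamp"],
            "INV-82415": [],
            "INV-82478": ["payment_short"],
            "INV-82533": ["inactive_vendor", "missing_po"],
        },
        "proof_code": 290770,
    },
    "P3-GEN-04": {
        "total_approved_gross_cents": 0,
        "warnings_by_invoice": {
            "INV-4170": ["vendor_tax_id_conflict"],
            "INV-4171": ["tax_rounding_mismatch"],
        },
        "proof_code": 8535,
    },
    "P3-GEN-05": {
        "total_approved_gross_cents": 30000,
        "warnings_by_invoice": {
            "INV-5600": [],
            "INV-5601": ["po_cancelled"],
        },
        "proof_code": 41298,
    },
}

for case_id, answer in answers.items():
    recomputed = recompute_proof_code(answer)
    if recomputed == answer["proof_code"]:
        status = "consistent"
    else:
        status = f"mismatch (recomputed {recomputed})"
    print(case_id, answer["proof_code"], status)
```

Under these assumptions the reported `proof_code` matches the recomputed value for cases 01, 03, 04, and 05. For case 02 the answer's own fields give 266454 rather than the reported 267923, which lines up with the `proof_code_error` failure type in the table. The report does not show the oracle's expected values, so this check says nothing about whether the other fields in that answer are what the oracle expects.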