qwen3.6-35b-a3b
2026-05-14T19-34-53Z_qwen_qwen3.6-35b-a3b_paperwork_workflow_case05
50% Practical score
0/1 Resolved tasks
1/1 Core tasks
Verdict
Practical run result: near miss
Core 1/1 · Visible 1/1 · Hidden 0/1
missing_or_wrong_evidence
Prompt set v3_paperwork_workflow_case05.json
Run type local
Temperature 0
Top P 1
Missed tasks 1/1
Task Results
| Task | Category | Score | Failed Checks | Failure Types | Prompt / Answer |
|---|---|---|---|---|---|
| email_attachment_intake_case_05 | paperwork | 3/4 | hidden_oracle_pass | missing_or_wrong_evidence | Prompt
Audit the synthetic email attachment intake folder in `incoming/`.
Write these files:
- `work/attachment_index.json`
- `work/final_document_set.json`
- `work/normalized/INV-2204-R1.txt`
- `audit_result.json`
- `proof.txt`
Use `incoming/README_TASK.md` as the full task specification. Use the generated image attachments, the email thread, `payment_export.csv`, `vendor_master.csv`, and `purchase_orders.csv`.
Important:
- Identify the final revised invoice, not just the first invoice-like image.
- The payment may use an old reference if the amount matches the revised invoice.
- Do not modify anything under `incoming/`.
- This is benchmark bookkeeping, not financial, tax, or legal advice.
- `proof.txt` must contain only the numeric proof code.
Workspace files available to the model
README_TASK.md
# Paperwork Workflow Case 05: Email Attachment Intake
Case ID: P3-WORK-05
You are auditing a synthetic email attachment intake folder. The folder contains generated document images, a thread note, payment data, vendor data, and a purchase order list.
Create these final artifacts:
- `work/attachment_index.json`
- `work/final_document_set.json`
- `work/normalized/INV-2204-R1.txt`
- `audit_result.json`
- `proof.txt`
Rules:
- Identify the active final document set.
- Use the email thread and the chat screenshot as evidence.
- Ignore superseded invoices, proforma documents, and hint screenshots as payment documents.
- Do not modify anything in `incoming/`.
- Return no advice. This is benchmark bookkeeping, not financial, tax, or legal advice.
- `proof.txt` must contain only the final numeric `proof_code`.
Final document logic:
- The first `INV-2204` attachment is superseded because the thread and chat hint say it had the wrong VAT.
- The revised attachment from May 8 is the final invoice: `INV-2204-R1`.
- `PF-2205` is a proforma invoice and is not a payment invoice.
- The bank payment may still use the old reference `INV-2204`; map it to `INV-2204-R1` only when the revised gross amount matches.
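The payment mapping rule above can be sketched as follows. This is a minimal illustration, not the benchmark's own logic: the helper name `map_payment` is hypothetical, the CSV text is copied from `payment_export.csv` in this case, and the revised gross of 24990 cents is taken from the normalized invoice file produced in this run. It also assumes that outgoing payments are stored as negative amounts, so the absolute value is compared.

```python
import csv
import io

# Contents of incoming/payment_export.csv for this case.
PAYMENT_EXPORT = """date,description,amount_cents,reference
2026-05-09,HARBOR OFFICE SUPPLY PAYMENT,-24990,INV-2204
2026-05-09,UNRELATED COFFEE SUPPLIES,-1840,RCPT-778
"""


def map_payment(rows, old_id, final_id, revised_gross_cents):
    """Return (mapped_from, mapped_to), or None when no payment qualifies.

    Hypothetical helper: a payment qualifies when its reference is the old
    or the final invoice ID and its amount equals the revised gross total.
    """
    for row in rows:
        # Assumption: the export records outgoing payments as negatives.
        amount = abs(int(row["amount_cents"]))
        if row["reference"] in (old_id, final_id) and amount == revised_gross_cents:
            return row["reference"], final_id
    return None


rows = list(csv.DictReader(io.StringIO(PAYMENT_EXPORT)))
mapping = map_payment(rows, "INV-2204", "INV-2204-R1", 24990)
print(mapping)  # ('INV-2204', 'INV-2204-R1')
```

If the revised gross did not equal the payment amount, the helper would return `None`, and the old reference would not be mapped.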
Use these document IDs for the attachment index and ignored-document lists:
- old invoice image: `INV-2204`
- revised invoice image: `INV-2204-R1`
- proforma image: `PF-2205`
- chat screenshot: `CHAT-MAY-08`
`work/attachment_index.json` schema:
```json
{
  "case_id": "P3-WORK-05",
  "attachments": [
    {
      "attachment_path": "",
      "document_id": "",
      "document_type": "",
      "decision": ""
    }
  ]
}
```
Allowed `document_type` values:
- `invoice`
- `proforma`
- `chat_hint`
Allowed `decision` values:
- `superseded`
- `final`
- `ignored`
- `evidence_only`
`work/final_document_set.json` schema:
```json
{
  "case_id": "P3-WORK-05",
  "final_invoice_ids": [],
  "superseded_invoice_ids": [],
  "ignored_document_ids": [],
  "payment_mapped_from": "",
  "payment_mapped_to": ""
}
```
The normalized invoice file must use exactly these eight lines:
```text
invoice_id=...
replaces_invoice_id=...
vendor_id=...
vendor_name=...
po_id=...
gross_total_cents=...
payment_reference=...
payment_match=...
```
`payment_match` must be exactly `true` or `false`.
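The eight-line format can be checked mechanically. The sketch below is one possible validator, not the benchmark's verifier: `check_normalized` and `EXPECTED_KEYS` are hypothetical names, and the sketch assumes that the keys must appear in the order listed above.

```python
# Keys in the order given by the task specification.
EXPECTED_KEYS = [
    "invoice_id", "replaces_invoice_id", "vendor_id", "vendor_name",
    "po_id", "gross_total_cents", "payment_reference", "payment_match",
]


def check_normalized(text):
    """Return a list of problems; an empty list means the file looks well-formed."""
    problems = []
    lines = text.strip("\n").split("\n")
    if len(lines) != len(EXPECTED_KEYS):
        problems.append(f"expected {len(EXPECTED_KEYS)} lines, found {len(lines)}")
    for line, key in zip(lines, EXPECTED_KEYS):
        name, sep, value = line.partition("=")
        if not sep or name != key:
            problems.append(f"expected key {key!r}, found {line!r}")
        elif key == "payment_match" and value not in ("true", "false"):
            problems.append(f"payment_match must be true or false, found {value!r}")
    return problems


# The normalized file written by the model in this run.
sample = """invoice_id=INV-2204-R1
replaces_invoice_id=INV-2204
vendor_id=V-410
vendor_name=Harbor Office Supply
po_id=PO-8801
gross_total_cents=24990
payment_reference=INV-2204
payment_match=true
"""
print(check_normalized(sample))  # []
```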
`audit_result.json` must contain exactly these keys:
- case_id
- approved_invoice_ids
- review_invoice_ids
- reject_invoice_ids
- ignored_document_ids
- total_approved_gross_cents
- warnings_by_invoice
- evidence
- proof_code
`warnings_by_invoice` must include every final real invoice ID. Use an empty array when an invoice has no warnings.
Allowed warning codes:
- payment_missing
- payment_amount_mismatch
- inactive_vendor
- missing_po
- superseded_invoice
- non_payment_document
Approval rules:
- Approve only final invoices from active vendors with an open matching PO and an exact payment match.
- Put final invoices with missing payment, amount mismatch, missing PO, or inactive vendor into review.
- Do not approve superseded invoices or proforma documents.
- `ignored_document_ids` must include superseded invoices, proforma documents, and evidence-only screenshots.
- `total_approved_gross_cents` is the sum of approved final invoice gross totals only.
- `evidence` must list the files used to decide the final document set and payment mapping, with paths relative to the workspace.
- In `audit_result.json`, `evidence` should list only files that support the approved final invoice and payment mapping. Do not list `README_TASK.md` as evidence, and do not list every attachment just because it exists.
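The approval rules can be read as a small decision function. The following is a sketch under stated assumptions, not the oracle's implementation: `review_warnings` and `decide` are hypothetical names, a PO whose status is anything other than `open` is treated as `missing_po`, and payment amounts are compared by absolute value because the export stores them as negatives.

```python
def review_warnings(vendor_status, po_status, payment_cents, gross_cents):
    """Collect warning codes for one final invoice (hypothetical helper)."""
    warnings = []
    if vendor_status != "active":
        warnings.append("inactive_vendor")
    # Assumption: a draft or absent PO counts as a missing PO.
    if po_status != "open":
        warnings.append("missing_po")
    if payment_cents is None:
        warnings.append("payment_missing")
    elif abs(payment_cents) != gross_cents:
        warnings.append("payment_amount_mismatch")
    return warnings


def decide(warnings):
    """Approve only when no warning applies; otherwise send to review."""
    return "review" if warnings else "approved"


# Values for INV-2204-R1 in this case: vendor V-410 is active,
# PO-8801 is open, and the payment of -24990 matches the gross total.
w = review_warnings("active", "open", -24990, 24990)
print(w, decide(w))  # [] approved
```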
Proof code formula:
`proof_code = total_approved_gross_cents + numeric_token_for_final_invoice_ids + 97 * ignored_document_count + 503 * payment_revision_mapping_count`
For this case, the numeric token for `INV-2204-R1` is `22041`.
`payment_revision_mapping_count` is `1` when a payment with old reference `INV-2204` is correctly mapped to final invoice `INV-2204-R1`; otherwise it is `0`.
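As a worked check of the formula, the sketch below plugs in the values for this case: an approved gross total of 24990 cents, the token 22041, three ignored documents (`INV-2204`, `PF-2205`, `CHAT-MAY-08`), and one payment revision mapping. The function name `proof_code` is only illustrative.

```python
def proof_code(total_approved_gross_cents: int,
               final_invoice_token: int,
               ignored_document_count: int,
               payment_revision_mapping_count: int) -> int:
    """Apply the proof code formula from the task specification."""
    return (total_approved_gross_cents
            + final_invoice_token
            + 97 * ignored_document_count
            + 503 * payment_revision_mapping_count)


# 24990 + 22041 + 97 * 3 + 503 * 1
code = proof_code(24990, 22041, 3, 1)
print(code)  # 47825
```

The result matches the `proof_code` of 47825 that appears in both the model's answer and the expected output for this run.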
email_thread.txt
Subject: Harbor Office Supply attachment cleanup
2026-05-06 10:12 Mira: The first Harbor Office Supply invoice was attached as INV-2204. Please hold it for now. The VAT looks wrong.
2026-05-08 09:07 Mira: Please ignore the first INV-2204 attachment. Wrong VAT on that attachment.
2026-05-08 09:11 Jon: Use the revised attachment from May 8. Revised one is INV-2204-R1.
2026-05-08 09:18 Jon: The bank reference may still show the old invoice number, but the amount should match the revised total.
2026-05-08 09:30 AP intake: The proforma document is for quote tracking only. It is not a payment invoice.
payment_export.csv
date,description,amount_cents,reference
2026-05-09,HARBOR OFFICE SUPPLY PAYMENT,-24990,INV-2204
2026-05-09,UNRELATED COFFEE SUPPLIES,-1840,RCPT-778
vendor_master.csv
vendor_id,vendor_name,tax_id,status
V-410,Harbor Office Supply,TX-410,active
V-411,Harbor Office Supply Old Record,TX-OLD,inactive
purchase_orders.csv
po_id,vendor_id,gross_limit_cents,status
PO-8801,V-410,26000,open
PO-8802,V-410,12000,draft
Input image
Agent actions
list_files -> read_file -> read_file -> read_file -> read_file -> read_file -> mkdir -> write_file -> write_file -> write_file -> write_file -> write_file -> finish
Model answer
## audit_result.json
{
"case_id": "P3-WORK-05",
"approved_invoice_ids": ["INV-2204-R1"],
"review_invoice_ids": [],
"reject_invoice_ids": [],
"ignored_document_ids": ["INV-2204", "PF-2205", "CHAT-MAY-08"],
"total_approved_gross_cents": 24990,
"warnings_by_invoice": {
"INV-2204-R1": []
},
"evidence": [
"incoming/email_thread.txt",
"incoming/attachments/chat_hint.png",
"incoming/payment_export.csv"
],
"proof_code": 47825
}
## proof.txt
47825
## work/attachment_index.json
{
"case_id": "P3-WORK-05",
"attachments": [
{
"attachment_path": "incoming/attachments/IMG_2048.png",
"document_id": "INV-2204",
"document_type": "invoice",
"decision": "superseded"
},
{
"attachment_path": "incoming/attachments/chat_hint.png",
"document_id": "CHAT-MAY-08",
"document_type": "chat_hint",
"decision": "evidence_only"
},
{
"attachment_path": "incoming/attachments/doc-7.png",
"document_id": "PF-2205",
"document_type": "proforma",
"decision": "ignored"
},
{
"attachment_path": "incoming/attachments/invoice_new.png",
"document_id": "INV-2204-R1",
"document_type": "invoice",
"decision": "final"
}
]
}
## work/final_document_set.json
{
"case_id": "P3-WORK-05",
"final_invoice_ids": ["INV-2204-R1"],
"superseded_invoice_ids": ["INV-2204"],
"ignored_document_ids": ["INV-2204", "PF-2205", "CHAT-MAY-08"],
"payment_mapped_from": "INV-2204",
"payment_mapped_to": "INV-2204-R1"
}
## work/normalized/INV-2204-R1.txt
invoice_id=INV-2204-R1
replaces_invoice_id=INV-2204
vendor_id=V-410
vendor_name=Harbor Office Supply
po_id=PO-8801
gross_total_cents=24990
payment_reference=INV-2204
payment_match=true
## verify_output.json
{
"ok": false,
"core_ok": true,
"visible_ok": true,
"failure_types": [
"missing_or_wrong_evidence"
],
"checks": [
{
"check": "audit_result_exists",
"passed": true,
"message": "Expected final audit_result.json with exact keys."
},
{
"check": "visible_checks_pass",
"passed": true,
"message": "Expected required artifacts and unchanged incoming/ sources."
},
{
"check": "core_oracle_pass",
"passed": true,
"message": "Expected core audit fields to match the oracle."
},
{
"check": "hidden_oracle_pass",
"passed": false,
"message": "Expected exact audit, artifacts, proof.txt, and unchanged sources."
}
],
"audit_error": "",
"actual": {
"case_id": "P3-WORK-05",
"approved_invoice_ids": [
"INV-2204-R1"
],
"review_invoice_ids": [],
"reject_invoice_ids": [],
"ignored_document_ids": [
"INV-2204",
"PF-2205",
"CHAT-MAY-08"
],
"total_approved_gross_cents": 24990,
"warnings_by_invoice": {
"INV-2204-R1": []
},
"evidence": [
"incoming/email_thread.txt",
"incoming/attachments/chat_hint.png",
"incoming/payment_export.csv"
],
"proof_code": 47825
},
"expected": {
"case_id": "P3-WORK-05",
"approved_invoice_ids": [
"INV-2204-R1"
],
"review_invoice_ids": [],
"reject_invoice_ids": [],
"ignored_document_ids": [
"CHAT-MAY-08",
"INV-2204",
"PF-2205"
],
"total_approved_gross_cents": 24990,
"warnings_by_invoice": {
"INV-2204-R1": []
},
"evidence": [
"incoming/email_thread.txt",
"incoming/attachments/chat_hint.png",
"incoming/attachments/invoice_new.png",
"incoming/payment_export.csv",
"incoming/purchase_orders.csv",
"incoming/vendor_master.csv"
],
"proof_code": 47825
}
}
|
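
The `missing_or_wrong_evidence` failure can be made concrete by comparing the `evidence` lists in the `actual` and `expected` blocks of `verify_output.json`. The sketch below is only a reading aid: the verifier's real comparison logic is not shown in this report, and `diff_evidence` is a hypothetical helper.

```python
def diff_evidence(actual, expected):
    """Return (missing, extra) evidence paths, preserving list order."""
    missing = [path for path in expected if path not in actual]
    extra = [path for path in actual if path not in expected]
    return missing, extra


# Evidence lists copied from verify_output.json above.
actual = [
    "incoming/email_thread.txt",
    "incoming/attachments/chat_hint.png",
    "incoming/payment_export.csv",
]
expected = [
    "incoming/email_thread.txt",
    "incoming/attachments/chat_hint.png",
    "incoming/attachments/invoice_new.png",
    "incoming/payment_export.csv",
    "incoming/purchase_orders.csv",
    "incoming/vendor_master.csv",
]

missing, extra = diff_evidence(actual, expected)
print(missing)
print(extra)
```

Under this comparison the model listed no unexpected files but omitted three expected ones: the revised invoice image, `purchase_orders.csv`, and `vendor_master.csv`. The `ignored_document_ids` lists also differ in order between `actual` and `expected`; whether the hidden check is order-sensitive is not stated in the report.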