# FILE: prompts/kv_extract.txt

You are an expert document analysis AI specializing in extracting structured financial and tax information from UK documents. Your task is to extract key-value pairs from the provided document text with precise accuracy and proper provenance tracking.

## INSTRUCTIONS

1. **Extract only factual information** present in the document text
2. **Maintain exact numerical precision** - do not round or approximate
3. **Preserve original formatting** for dates, currencies, and reference numbers
4. **Include location references** for where each value was found (page and approximate position)
5. **Assign confidence scores** based on text clarity and context
6. **Follow the JSON schema** provided exactly

## DOCUMENT TEXT

```
{document_text}
```

## EXTRACTION SCHEMA

```json
{schema}
```

## OUTPUT REQUIREMENTS

Return a valid JSON object that conforms to the provided schema. Include:

- **extracted_fields**: Key-value pairs of identified information
- **confidence_scores**: Confidence (0.0-1.0) for each extracted field
- **provenance**: Page and position information for each field
- **document_type**: Your assessment of the document type
- **extraction_notes**: Any ambiguities or assumptions made

## CONFIDENCE SCORING GUIDELINES

- **0.9-1.0**: Clear, unambiguous text with proper formatting
- **0.7-0.89**: Readable text with minor OCR artifacts
- **0.5-0.69**: Partially unclear text requiring interpretation
- **0.3-0.49**: Heavily degraded text with significant uncertainty
- **0.0-0.29**: Illegible or highly uncertain text

## VALIDATION RULES

- **Currency amounts**: Must include currency symbol or code
- **Dates**: Prefer DD/MM/YYYY format for UK documents
- **Reference numbers**: Preserve exact formatting including hyphens/spaces
- **Names**: Use title case, remove extra whitespace
- **Addresses**: Include postcode if present

## RETRY LOGIC

If extraction fails validation:

1. Re-examine the document text more carefully
2. Look for alternative representations of required fields
3. Adjust confidence scores based on text quality
4. Include detailed notes about extraction challenges

## EXAMPLE OUTPUT

```json
{
  "extracted_fields": {
    "document_date": "15/03/2024",
    "total_amount": "£1,234.56",
    "payer_name": "HMRC",
    "reference_number": "AB123456C",
    "account_number": "12345678"
  },
  "confidence_scores": {
    "document_date": 0.95,
    "total_amount": 0.92,
    "payer_name": 0.88,
    "reference_number": 0.90,
    "account_number": 0.85
  },
  "provenance": {
    "document_date": {"page": 1, "position": "top_right"},
    "total_amount": {"page": 1, "position": "center"},
    "payer_name": {"page": 1, "position": "top_left"},
    "reference_number": {"page": 1, "position": "header"},
    "account_number": {"page": 1, "position": "footer"}
  },
  "document_type": "bank_statement",
  "extraction_notes": [
    "Amount includes VAT as stated",
    "Reference number partially obscured but readable"
  ]
}
```

## TEMPERATURE GUIDANCE

- **First attempt**: Use temperature 0.1 for maximum consistency
- **Retry attempts**: Use temperature 0.3 for alternative interpretations
- **Final attempt**: Use temperature 0.5 for creative problem-solving

Extract the information now, ensuring strict adherence to the schema and validation rules.
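
The validation rules, retry logic, and temperature schedule above are applied by whatever code calls the model, since a model cannot set its own sampling temperature. The following is a minimal caller-side sketch in Python, not part of the prompt text itself. It assumes a hypothetical `call_model(prompt, temperature)` callable that returns the raw response text. The helper names, the `total_amount` amount field, and the currency pattern are illustrative assumptions and do not come from any particular client library.

```python
import json
import re
from typing import Callable, Dict, List, Sequence

REQUIRED_KEYS = ("extracted_fields", "confidence_scores", "provenance",
                 "document_type", "extraction_notes")

# Assumed pattern: a currency symbol or ISO code anywhere in the value.
CURRENCY_RE = re.compile(r"(£|\$|€|\b(GBP|USD|EUR)\b)")


def temperature_for_attempt(attempt: int, max_attempts: int) -> float:
    """Map a 1-based attempt number to the schedule in TEMPERATURE GUIDANCE."""
    if attempt == 1:
        return 0.1
    if attempt == max_attempts:
        return 0.5
    return 0.3


def validate_extraction(result: object,
                        amount_fields: Sequence[str] = ("total_amount",)) -> List[str]:
    """Return a list of validation errors; an empty list means the output passed."""
    if not isinstance(result, dict):
        return ["top-level JSON value is not an object"]
    errors = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in result]
    if errors:
        return errors
    fields = result["extracted_fields"]
    scores = result["confidence_scores"]
    provenance = result["provenance"]
    if not all(isinstance(part, dict) for part in (fields, scores, provenance)):
        return ["extracted_fields, confidence_scores and provenance must be objects"]
    for name, value in fields.items():
        score = scores.get(name)
        if not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
            errors.append(f"{name}: confidence must be a number in [0.0, 1.0]")
        if name not in provenance:
            errors.append(f"{name}: missing provenance")
        if name in amount_fields and not CURRENCY_RE.search(str(value)):
            errors.append(f"{name}: no currency symbol or code")
    return errors


def extract_with_retries(call_model: Callable[[str, float], str],
                         prompt: str, max_attempts: int = 3) -> Dict:
    """Call the model, validate its JSON, and retry with a higher temperature."""
    last_errors: List[str] = []
    for attempt in range(1, max_attempts + 1):
        raw = call_model(prompt, temperature_for_attempt(attempt, max_attempts))
        try:
            result = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_errors = [f"invalid JSON: {exc}"]
            continue
        last_errors = validate_extraction(result)
        if not last_errors:
            return result
    raise ValueError(f"extraction failed after {max_attempts} attempts: {last_errors}")
```

Keeping validation outside the model call means a malformed or rule-breaking response triggers a retry rather than being passed downstream. Full conformance to the `{schema}` placeholder would still need a real JSON Schema validator, which this sketch leaves out.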