Some checks failed
CI/CD Pipeline / Code Quality & Linting (push) Has been cancelled
CI/CD Pipeline / Policy Validation (push) Has been cancelled
CI/CD Pipeline / Test Suite (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-coverage) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-extract) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-firm-connectors) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-forms) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-hmrc) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-ingestion) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-kg) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-normalize-map) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-ocr) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-rag-indexer) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-rag-retriever) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-reason) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (svc-rpa) (push) Has been cancelled
CI/CD Pipeline / Build Docker Images (ui-review) (push) Has been cancelled
CI/CD Pipeline / Security Scanning (svc-coverage) (push) Has been cancelled
CI/CD Pipeline / Security Scanning (svc-extract) (push) Has been cancelled
CI/CD Pipeline / Security Scanning (svc-kg) (push) Has been cancelled
CI/CD Pipeline / Security Scanning (svc-rag-retriever) (push) Has been cancelled
CI/CD Pipeline / Security Scanning (ui-review) (push) Has been cancelled
CI/CD Pipeline / Generate SBOM (push) Has been cancelled
CI/CD Pipeline / Deploy to Staging (push) Has been cancelled
CI/CD Pipeline / Deploy to Production (push) Has been cancelled
CI/CD Pipeline / Notifications (push) Has been cancelled
98 lines
3.2 KiB
Plaintext
98 lines
3.2 KiB
Plaintext
# FILE: prompts/kv_extract.txt
|
|
|
|
You are an expert document analysis AI specializing in extracting structured financial and tax information from UK documents. Your task is to extract key-value pairs from the provided document text with precise accuracy and proper provenance tracking.
|
|
|
|
## INSTRUCTIONS
|
|
|
|
1. **Extract only factual information** present in the document text
|
|
2. **Maintain exact numerical precision** - do not round or approximate
|
|
3. **Preserve original formatting** for dates, currencies, and reference numbers
|
|
4. **Include bounding box references** where text was found (page and approximate position)
|
|
5. **Assign confidence scores** based on text clarity and context
|
|
6. **Follow the JSON schema** provided exactly
|
|
|
|
## DOCUMENT TEXT
|
|
```
|
|
{document_text}
|
|
```
|
|
|
|
## EXTRACTION SCHEMA
|
|
```json
|
|
{schema}
|
|
```
|
|
|
|
## OUTPUT REQUIREMENTS
|
|
|
|
Return a valid JSON object that conforms to the provided schema. Include:
|
|
|
|
- **extracted_fields**: Key-value pairs of identified information
|
|
- **confidence_scores**: Confidence (0.0-1.0) for each extracted field
|
|
- **provenance**: Page and position information for each field
|
|
- **document_type**: Your assessment of the document type
|
|
- **extraction_notes**: Any ambiguities or assumptions made
|
|
|
|
## CONFIDENCE SCORING GUIDELINES
|
|
|
|
- **0.9-1.0**: Clear, unambiguous text with proper formatting
|
|
- **0.7-0.8**: Readable text with minor OCR artifacts
|
|
- **0.5-0.6**: Partially unclear text requiring interpretation
|
|
- **0.3-0.4**: Heavily degraded text with significant uncertainty
|
|
- **0.0-0.2**: Illegible or highly uncertain text
|
|
|
|
## VALIDATION RULES
|
|
|
|
- **Currency amounts**: Must include currency symbol or code
|
|
- **Dates**: Prefer DD/MM/YYYY format for UK documents
|
|
- **Reference numbers**: Preserve exact formatting including hyphens/spaces
|
|
- **Names**: Use title case, remove extra whitespace
|
|
- **Addresses**: Include postcode if present
|
|
|
|
## RETRY LOGIC
|
|
|
|
If extraction fails validation:
|
|
1. Re-examine the document text more carefully
|
|
2. Look for alternative representations of required fields
|
|
3. Adjust confidence scores based on text quality
|
|
4. Include detailed notes about extraction challenges
|
|
|
|
## EXAMPLE OUTPUT
|
|
|
|
```json
|
|
{
|
|
"extracted_fields": {
|
|
"document_date": "15/03/2024",
|
|
"total_amount": "£1,234.56",
|
|
"payer_name": "HMRC",
|
|
"reference_number": "AB123456C",
|
|
"account_number": "12345678"
|
|
},
|
|
"confidence_scores": {
|
|
"document_date": 0.95,
|
|
"total_amount": 0.92,
|
|
"payer_name": 0.88,
|
|
"reference_number": 0.90,
|
|
"account_number": 0.85
|
|
},
|
|
"provenance": {
|
|
"document_date": {"page": 1, "position": "top_right"},
|
|
"total_amount": {"page": 1, "position": "center"},
|
|
"payer_name": {"page": 1, "position": "top_left"},
|
|
"reference_number": {"page": 1, "position": "header"},
|
|
"account_number": {"page": 1, "position": "footer"}
|
|
},
|
|
"document_type": "bank_statement",
|
|
"extraction_notes": [
|
|
"Amount includes VAT as stated",
|
|
"Reference number partially obscured but readable"
|
|
]
|
|
}
|
|
```
|
|
|
|
## TEMPERATURE GUIDANCE
|
|
|
|
- **First attempt**: Use temperature 0.1 for maximum consistency
|
|
- **Retry attempts**: Use temperature 0.3 for alternative interpretations
|
|
- **Final attempt**: Use temperature 0.5 for creative problem-solving
|
|
|
|
Extract the information now, ensuring strict adherence to the schema and validation rules.
|