Document Extraction
Document extraction transforms raw uploaded files into typed, structured fields that become dynamic columns on the row's parent table. It's the bridge between unstructured uploads (passports, contracts, certificates) and the structured world of tables, registers, and matter properties.
Pipeline
File uploaded to a row / matter / submission
│
├─ FileRecord created
├─ FileLink created (entityType + role)
├─ ExtractionJob enqueued (PENDING)
│
▼
ExtractionWorker
├─ Loads FileRecord + parent context
├─ Picks extraction class (PASSPORT, CERTIFICATE_OF_INCORPORATION, ...)
├─ Runs the LangExtract bridge (Python subprocess via stdio)
│ ├─ Vision/OCR pass on the file bytes
│ ├─ LLM pass with class-specific prompt
│ └─ Returns typed key-value pairs + confidence scores
├─ Computes structure score (0-1)
├─ Decides: usable / overflow / failed
│
▼
On usable:
├─ Dynamic columns created on the row's table (capped at 50/table)
├─ Extracted values written to the row
├─ ExtractionSnapshot stored (plain text for retention period)
└─ ExtractionResult marked DONE
ExtractionJob States
| State | Meaning |
|---|---|
| PENDING | Enqueued, awaiting a worker. |
| PROCESSING | A worker is running the extraction. |
| DONE | Successfully completed, dynamic columns populated. |
| FAILED | Extraction errored. |
| TIMED_OUT | Exceeded EXTRACTION_JOB_TIMEOUT_MS. |
| OVERFLOW | Too many extraction columns on the target table; queued for admin review. |
Admins can retry failed jobs through the extraction queue under /api/extraction/pending.
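The states above, together with the admin endpoints listed under API Surface, can be modeled as a small union type. This is a minimal sketch: `adminActionsFor` is a hypothetical helper, and treating TIMED_OUT jobs as retryable is an assumption, since this page only states that failed extractions can be retried and that overflow or failed extractions can be resolved.

```typescript
// The six job states from the table above.
type ExtractionJobState =
  | "PENDING"
  | "PROCESSING"
  | "DONE"
  | "FAILED"
  | "TIMED_OUT"
  | "OVERFLOW";

type AdminAction = "retry" | "resolve";

// Hypothetical helper: which admin actions apply to a job in a given state.
function adminActionsFor(state: ExtractionJobState): AdminAction[] {
  switch (state) {
    case "FAILED":
      return ["retry", "resolve"];
    case "TIMED_OUT":
      // Assumption: timed-out jobs can be retried like failed ones.
      return ["retry"];
    case "OVERFLOW":
      return ["resolve"];
    case "PENDING":
    case "PROCESSING":
    case "DONE":
      return [];
  }
}
```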
Extraction Classes
Each upload is classified before extraction. The class determines the prompt template and the expected field shape.
| Class | Typical fields |
|---|---|
| PASSPORT | fullName, nationality, dateOfBirth, documentNumber, expiryDate, sex, mrzLine1, mrzLine2 |
| EMIRATES_ID | idNumber, nameAr, nameEn, nationality, dateOfBirth, expiryDate |
| CERTIFICATE_OF_INCORPORATION | companyName, companyNumber, incorporationDate, jurisdiction, registeredAddress |
| MEMORANDUM_OF_ASSOCIATION | companyName, subscribers, shareCapital, objectsClause |
| BOARD_RESOLUTION | companyName, meetingDate, attendees, resolutionsPassed |
| BANK_STATEMENT | accountHolder, accountNumber, bankName, statementPeriod, closingBalance |
| LEASE_AGREEMENT | lessor, lessee, propertyAddress, leaseStart, leaseEnd, monthlyRent |
| UTILITY_BILL | accountHolder, address, provider, billDate, amount |
Plus generic classes for free-form documents (GENERIC_LEGAL, GENERIC_KYC).
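One way to picture the "expected field shape" is a lookup from class to field names. The sketch below copies the field lists from the table; the constant `EXPECTED_FIELDS` and the helper `missingFields` are hypothetical names used for illustration, and the generic classes are left out because they have no fixed shape.

```typescript
// Field lists copied from the table above. GENERIC_LEGAL and GENERIC_KYC are
// omitted because they have no fixed field shape.
const EXPECTED_FIELDS = {
  PASSPORT: ["fullName", "nationality", "dateOfBirth", "documentNumber", "expiryDate", "sex", "mrzLine1", "mrzLine2"],
  EMIRATES_ID: ["idNumber", "nameAr", "nameEn", "nationality", "dateOfBirth", "expiryDate"],
  CERTIFICATE_OF_INCORPORATION: ["companyName", "companyNumber", "incorporationDate", "jurisdiction", "registeredAddress"],
  MEMORANDUM_OF_ASSOCIATION: ["companyName", "subscribers", "shareCapital", "objectsClause"],
  BOARD_RESOLUTION: ["companyName", "meetingDate", "attendees", "resolutionsPassed"],
  BANK_STATEMENT: ["accountHolder", "accountNumber", "bankName", "statementPeriod", "closingBalance"],
  LEASE_AGREEMENT: ["lessor", "lessee", "propertyAddress", "leaseStart", "leaseEnd", "monthlyRent"],
  UTILITY_BILL: ["accountHolder", "address", "provider", "billDate", "amount"],
} as const;

type ExtractionClass = keyof typeof EXPECTED_FIELDS;

// Hypothetical helper: expected fields that are absent or empty in a result.
function missingFields(cls: ExtractionClass, extracted: Record<string, unknown>): string[] {
  const fields: readonly string[] = EXPECTED_FIELDS[cls];
  return fields.filter((field) => {
    const value = extracted[field];
    return value === undefined || value === null || value === "";
  });
}
```

For example, a UTILITY_BILL result with an empty provider and no amount would report those two fields as missing.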
Quality Scoring
Each extraction returns a structure score between 0 and 1 that measures how well the document parsed. The score is computed from:
- Field-level confidence (LLM-reported certainty for each field).
- Coverage (proportion of expected fields present).
- Consistency checks (e.g. MRZ vs visible fields on a passport).
Threshold for usability: structure score >= 0.7. Below threshold, the result is flagged for review and not auto-promoted to dynamic columns.
get_extraction_quality (AI tool) and GET /api/files/[fileId]/extraction-quality surface the score per file.
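The exact formula is not documented here. As a rough sketch of how the three inputs could combine, the example below uses a weighted average. The weights, the `ScoreInputs` shape, and the function names are all assumptions made for illustration; only the 0.7 threshold comes from this page.

```typescript
interface ScoreInputs {
  fieldConfidences: number[]; // per-field confidence, each in [0, 1]
  expectedFieldCount: number; // fields the class expects
  presentFieldCount: number;  // expected fields actually extracted
  checksPassed: number;       // consistency checks that passed
  checksTotal: number;        // consistency checks that ran
}

// Illustrative weights only (they sum to 1); the real weighting is an unknown.
const ILLUSTRATIVE_WEIGHTS = { confidence: 0.5, coverage: 0.3, consistency: 0.2 };

// Usability threshold stated on this page.
const USABLE_THRESHOLD = 0.7;

// Clamped ratio with a fallback for an empty denominator.
function ratio(part: number, whole: number, whenEmpty: number): number {
  if (whole <= 0) return whenEmpty;
  return Math.min(1, Math.max(0, part / whole));
}

function structureScore(input: ScoreInputs): number {
  let total = 0;
  for (const c of input.fieldConfidences) total += c;
  const meanConfidence = ratio(total, input.fieldConfidences.length, 0);
  const coverage = ratio(input.presentFieldCount, input.expectedFieldCount, 1);
  const consistency = ratio(input.checksPassed, input.checksTotal, 1);
  const w = ILLUSTRATIVE_WEIGHTS;
  return w.confidence * meanConfidence + w.coverage * coverage + w.consistency * consistency;
}

function isUsable(score: number): boolean {
  return score >= USABLE_THRESHOLD;
}
```

Under these assumed weights, a passport with all expected fields present, consistent MRZ data, and a mean confidence of 0.9 clears the threshold, while a half-covered result with failed consistency checks does not.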
Dynamic Columns
When extraction succeeds and the score passes the threshold:
- Each typed field becomes a column on the row's parent table.
- Column names are derived from the field name (e.g. passport_number, nationality).
- Type inferred from sample values: DATE, NUMBER, TEXT, EMAIL.
- Automatic deduplication prevents duplicate columns on subsequent extractions.
- The extracted value is written to the row's column.
Tables are capped at 50 dynamic extraction columns. Beyond that, new extractions queue as OVERFLOW for admin review.
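The promotion rules above (name derivation, type inference, deduplication, and the 50-column cap) can be sketched as a planning step that runs before anything is written. All names here are hypothetical, and the inference rules are assumptions: the sketch recognizes only ISO-style dates and a simple email pattern, and derives snake_case names from camelCase fields.

```typescript
type ColumnType = "DATE" | "NUMBER" | "TEXT" | "EMAIL";

const MAX_EXTRACTION_COLUMNS = 50; // cap stated on this page

// Assumed inference rules; the real ones may recognize more formats.
function inferColumnType(sample: string): ColumnType {
  const value = sample.trim();
  if (/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(value)) return "EMAIL";
  if (/^\d{4}-\d{2}-\d{2}$/.test(value)) return "DATE"; // ISO dates only
  if (value !== "" && isFinite(Number(value))) return "NUMBER";
  return "TEXT";
}

// Assumption: camelCase field names map to snake_case column names.
function toColumnName(field: string): string {
  return field.replace(/([a-z0-9])([A-Z])/g, "$1_$2").toLowerCase();
}

interface NewColumn { name: string; type: ColumnType }
type ColumnPlan =
  | { outcome: "usable"; newColumns: NewColumn[] }
  | { outcome: "overflow" };

function planColumns(existing: string[], fields: Record<string, string>): ColumnPlan {
  const known: Record<string, boolean> = {};
  for (const name of existing) known[name] = true;
  const newColumns: NewColumn[] = [];
  for (const field of Object.keys(fields)) {
    const name = toColumnName(field);
    if (known[name]) continue; // deduplicate against existing and earlier fields
    known[name] = true;
    newColumns.push({ name, type: inferColumnType(fields[field]) });
  }
  if (existing.length + newColumns.length > MAX_EXTRACTION_COLUMNS) {
    return { outcome: "overflow" };
  }
  return { outcome: "usable", newColumns };
}
```

Planning first means an extraction that would push the table past the cap creates no columns at all, which matches the idea of queuing the whole result as OVERFLOW.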
Cell Locking
Cells can be locked to prevent further AI/extraction overwrites. Lock state lives in the row's JSON as __lock__<columnId>: true. Locks are enforced at:
- The row PATCH API.
- The AI patch_row and bulk_update_rows tools.
- The extraction promotion path.
Locks are not enforced on direct Prisma writes or CSV import operations - those are admin-driven and assumed deliberate.
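A lock check on one of the enforced paths might look like the sketch below. The `__lock__<columnId>` key format comes from this page; the function names are hypothetical, and silently skipping locked cells (rather than rejecting the whole request) is an assumption about behavior.

```typescript
type RowData = Record<string, unknown>;

const LOCK_PREFIX = "__lock__"; // key format described on this page

function isLocked(row: RowData, columnId: string): boolean {
  return row[LOCK_PREFIX + columnId] === true;
}

// Hypothetical helper: apply a patch without touching locked cells.
// Returns the updated copy plus the column IDs that were skipped.
function applyPatchRespectingLocks(
  row: RowData,
  patch: RowData
): { row: RowData; skipped: string[] } {
  const next: RowData = { ...row };
  const skipped: string[] = [];
  for (const columnId of Object.keys(patch)) {
    if (isLocked(row, columnId)) {
      skipped.push(columnId);
      continue;
    }
    next[columnId] = patch[columnId];
  }
  return { row: next, skipped };
}
```

The sketch does not decide whether a patch may itself set or clear `__lock__` keys; that policy is left open here.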
CSP Extraction Bridge
When an extraction completes for a row in the CSP system tables (Individuals, Companies, Stakeholders), the CSP extraction bridge propagates the typed fields into:
- CSP register entries (where applicable - new director appointment, share allotment, etc.).
- Cross-workspace identity resolution (extracts identifier fields like passport number, company registration number).
- Document presence advertisements.
This is how an uploaded passport on an Individual row becomes a PASSPORT document presence record visible to the overseer.
Snapshots & Retention
Each extraction writes an ExtractionSnapshot capturing the plain-text extracted content (not the file bytes). The snapshot is retained for EXTRACTION_SNAPSHOT_RETENTION_DAYS (default 90 days), then pruned by a daily cron.
Purpose: post-hoc audits ("did the AI mis-extract this name?") and re-running extractions against the original parse without re-running OCR.
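The pruning rule reduces to a cutoff date computed from the retention setting. A minimal sketch, assuming the daily cron compares each snapshot's creation time with that cutoff (the function names and the `createdAt` input are hypothetical):

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;
const DEFAULT_RETENTION_DAYS = 90; // default stated on this page

// Snapshots created before this instant are eligible for pruning.
function snapshotCutoff(now: Date, retentionDays: number = DEFAULT_RETENTION_DAYS): Date {
  return new Date(now.getTime() - retentionDays * MS_PER_DAY);
}

function isSnapshotExpired(
  createdAt: Date,
  now: Date,
  retentionDays: number = DEFAULT_RETENTION_DAYS
): boolean {
  return createdAt.getTime() < snapshotCutoff(now, retentionDays).getTime();
}
```

With the default of 90 days, a cron run at midnight UTC on 1 April 2024 would use a cutoff of midnight UTC on 2 January 2024.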
Configuration
| Env var | Purpose | Default |
|---|---|---|
| EXTRACTION_POLL_INTERVAL_MS | Worker poll interval | 5000 |
| EXTRACTION_MAX_CONCURRENT | Max concurrent jobs per process | 3 |
| EXTRACTION_JOB_TIMEOUT_MS | Per-job timeout | 180000 |
| EXTRACTION_MAX_ATTEMPTS | Retries before TIMED_OUT | 3 |
| EXTRACTION_SNAPSHOT_RETENTION_DAYS | Snapshot retention | 90 |
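Reading these variables with their documented defaults could look like the following sketch. The variable names and defaults come from the table; the `ExtractionConfig` shape, the helper names, and the choice to throw on invalid values are assumptions.

```typescript
interface ExtractionConfig {
  pollIntervalMs: number;
  maxConcurrent: number;
  jobTimeoutMs: number;
  maxAttempts: number;
  snapshotRetentionDays: number;
}

type Env = Record<string, string | undefined>;

// Parse a positive integer, falling back to the default when unset or blank.
function readPositiveInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  if (!isFinite(parsed) || parsed <= 0 || Math.floor(parsed) !== parsed) {
    // Assumption: fail fast on bad values rather than silently using defaults.
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return parsed;
}

function loadExtractionConfig(env: Env): ExtractionConfig {
  return {
    pollIntervalMs: readPositiveInt(env, "EXTRACTION_POLL_INTERVAL_MS", 5000),
    maxConcurrent: readPositiveInt(env, "EXTRACTION_MAX_CONCURRENT", 3),
    jobTimeoutMs: readPositiveInt(env, "EXTRACTION_JOB_TIMEOUT_MS", 180000),
    maxAttempts: readPositiveInt(env, "EXTRACTION_MAX_ATTEMPTS", 3),
    snapshotRetentionDays: readPositiveInt(env, "EXTRACTION_SNAPSHOT_RETENTION_DAYS", 90),
  };
}
```

In a Node.js process this would typically be called once at startup as `loadExtractionConfig(process.env)`.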
API Surface
| Endpoint | Purpose |
|---|---|
| GET /api/extraction/pending | List pending/overflow extractions (ADMIN). |
| POST /api/extraction/pending/[id]/resolve | Manually resolve an overflow / failed extraction. |
| POST /api/extraction/pending/[id]/retry | Retry a failed extraction. |
AI-tool surface: get_extraction_quality, bridge_kyc_documents.
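For client code, the three endpoints can be wrapped in a small typed route builder. This sketch only constructs the method and path from the table above; the `extractionApi` object and `RequestSpec` type are hypothetical, and request bodies, response shapes, and authentication are not documented here, so they are left out.

```typescript
type HttpMethod = "GET" | "POST";

interface RequestSpec {
  method: HttpMethod;
  path: string;
}

const PENDING_BASE = "/api/extraction/pending";

// Hypothetical route builder mirroring the endpoint table.
const extractionApi = {
  listPending: (): RequestSpec => ({ method: "GET", path: PENDING_BASE }),
  resolve: (id: string): RequestSpec => ({
    method: "POST",
    path: `${PENDING_BASE}/${encodeURIComponent(id)}/resolve`,
  }),
  retry: (id: string): RequestSpec => ({
    method: "POST",
    path: `${PENDING_BASE}/${encodeURIComponent(id)}/retry`,
  }),
};
```

A caller could pass the resulting spec to `fetch`, for example `fetch(spec.path, { method: spec.method })`, adding whatever credentials the deployment requires.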
Known Risks
The extraction pipeline has several documented risks:
- LLM hallucination on field values (mitigated by confidence scoring + cell locking).
- Python subprocess isolation (LangExtract runs as a child process; sandboxed via OS-level limits).
- Overflow queue UX (50-column cap can be hit on heavily-extracted tables).
- Snapshot retention (default 90 days; longer retention requires explicit configuration).
See Also
- Intelligence Pipeline - the broader pipeline this slots into.
- Identity Resolution - how extracted identifiers become global identities.
- RAG & Search - how extracted text becomes searchable.