Document Extraction

Document extraction transforms raw uploaded files into typed, structured fields that become dynamic columns on the row's parent table. It's the bridge between unstructured uploads (passports, contracts, certificates) and the structured world of tables, registers, and matter properties.

Pipeline

File uploaded to a row / matter / submission
   │
   ├─ FileRecord created
   ├─ FileLink created (entityType + role)
   ├─ ExtractionJob enqueued (PENDING)
   │
   ▼
ExtractionWorker
   ├─ Loads FileRecord + parent context
   ├─ Picks extraction class (PASSPORT, CERTIFICATE_OF_INCORPORATION, ...)
   ├─ Runs the LangExtract bridge (Python subprocess via stdio)
   │   ├─ Vision/OCR pass on the file bytes
   │   ├─ LLM pass with class-specific prompt
   │   └─ Returns typed key-value pairs + confidence scores
   ├─ Computes structure score (0-1)
   ├─ Decides: usable / overflow / failed
   │
   ▼
On usable:
   ├─ Dynamic columns created on the row's table (capped at 50/table)
   ├─ Extracted values written to the row
   ├─ ExtractionSnapshot stored (plain text for retention period)
   └─ ExtractionJob marked DONE

ExtractionJob States

State	Meaning
PENDING	Enqueued, awaiting a worker.
PROCESSING	A worker is running the extraction.
DONE	Successfully completed, dynamic columns populated.
FAILED	Extraction errored.
TIMED_OUT	Exceeded `EXTRACTION_JOB_TIMEOUT_MS`.
OVERFLOW	Too many extraction columns on the target table; queued for admin review.

Admins can retry failed jobs from the extraction queue under /api/extraction/pending.
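
The states in the table can be modelled as a string union with a small transition check. This is a minimal sketch: the state names come from the table, but the transition map (including which states an admin retry or resolve may leave) is an assumption for illustration, not the project's actual state machine.

```typescript
// The six job states listed in the table above.
type ExtractionJobState =
  | "PENDING"
  | "PROCESSING"
  | "DONE"
  | "FAILED"
  | "TIMED_OUT"
  | "OVERFLOW";

// Hypothetical transition map (an assumption, not taken from the codebase).
const ALLOWED_TRANSITIONS: Record<ExtractionJobState, ExtractionJobState[]> = {
  PENDING: ["PROCESSING"],
  PROCESSING: ["DONE", "FAILED", "TIMED_OUT", "OVERFLOW"],
  DONE: [],
  FAILED: ["PENDING"], // assumed: admin retry re-enqueues the job
  TIMED_OUT: ["PENDING"], // assumed: admin retry re-enqueues the job
  OVERFLOW: ["DONE", "FAILED"], // assumed: admin resolves the overflow
};

// Returns true when moving from `from` to `to` is allowed by the map above.
function canTransition(from: ExtractionJobState, to: ExtractionJobState): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}
```

Keeping the map in one place means a worker or API handler can reject an unexpected state change before writing it.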

Extraction Classes

Each upload is classified before extraction. The class determines the prompt template and the expected field shape.

Class	Typical fields
`PASSPORT`	`fullName`, `nationality`, `dateOfBirth`, `documentNumber`, `expiryDate`, `sex`, `mrzLine1`, `mrzLine2`
`EMIRATES_ID`	`idNumber`, `nameAr`, `nameEn`, `nationality`, `dateOfBirth`, `expiryDate`
`CERTIFICATE_OF_INCORPORATION`	`companyName`, `companyNumber`, `incorporationDate`, `jurisdiction`, `registeredAddress`
`MEMORANDUM_OF_ASSOCIATION`	`companyName`, `subscribers`, `shareCapital`, `objectsClause`
`BOARD_RESOLUTION`	`companyName`, `meetingDate`, `attendees`, `resolutionsPassed`
`BANK_STATEMENT`	`accountHolder`, `accountNumber`, `bankName`, `statementPeriod`, `closingBalance`
`LEASE_AGREEMENT`	`lessor`, `lessee`, `propertyAddress`, `leaseStart`, `leaseEnd`, `monthlyRent`
`UTILITY_BILL`	`accountHolder`, `address`, `provider`, `billDate`, `amount`

There are also generic classes for free-form documents: GENERIC_LEGAL and GENERIC_KYC.
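
One way to picture "expected field shape" is a lookup from class to field names. The sketch below copies three rows from the table; the constant `EXPECTED_FIELDS` and the helper `missingFields` are hypothetical names introduced for illustration only.

```typescript
// Expected fields per class, copied from the table above (subset shown).
const EXPECTED_FIELDS = {
  PASSPORT: [
    "fullName", "nationality", "dateOfBirth", "documentNumber",
    "expiryDate", "sex", "mrzLine1", "mrzLine2",
  ],
  EMIRATES_ID: ["idNumber", "nameAr", "nameEn", "nationality", "dateOfBirth", "expiryDate"],
  UTILITY_BILL: ["accountHolder", "address", "provider", "billDate", "amount"],
} as const;

type ExtractionClass = keyof typeof EXPECTED_FIELDS;

// Hypothetical helper: lists expected fields that the extraction did not return.
function missingFields(cls: ExtractionClass, extracted: Record<string, unknown>): string[] {
  const expected: readonly string[] = EXPECTED_FIELDS[cls];
  return expected.filter(
    (name) => extracted[name] === undefined || extracted[name] === null
  );
}
```

For example, a utility bill extraction that returns everything except `amount` would yield `["amount"]`.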

Quality Scoring

Each extraction returns a structure score between 0 and 1 that measures how well the document parsed. The score is computed from:

Field-level confidence (LLM-reported certainty for each field).
Coverage (proportion of expected fields present).
Consistency checks (e.g. MRZ vs visible fields on a passport).

Threshold for usability: structure score >= 0.7. Below threshold, the result is flagged for review and not auto-promoted to dynamic columns.
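
The three inputs and the 0.7 threshold can be combined as a weighted average. This is only a sketch: the weights, the `ExtractedField` shape, and the way consistency checks are counted are assumptions, since the actual formula is not given here.

```typescript
interface ExtractedField {
  name: string;
  value: string;
  confidence: number; // LLM-reported certainty, 0..1
}

const USABLE_THRESHOLD = 0.7; // from the text above

// Hypothetical weights; the real weighting is not documented in this section.
const WEIGHTS = { confidence: 0.4, coverage: 0.4, consistency: 0.2 };

function structureScore(
  fields: ExtractedField[],
  expectedFields: string[],
  checksPassed: number,
  checksTotal: number
): number {
  // Coverage: proportion of expected fields that are present.
  const present = expectedFields.filter((name) => fields.some((f) => f.name === name));
  const coverage = expectedFields.length === 0 ? 1 : present.length / expectedFields.length;

  // Mean field-level confidence (0 when nothing was extracted).
  const meanConfidence =
    fields.length === 0 ? 0 : fields.reduce((sum, f) => sum + f.confidence, 0) / fields.length;

  // Consistency: share of checks that passed; assumed 1 when no checks apply.
  const consistency = checksTotal === 0 ? 1 : checksPassed / checksTotal;

  const score =
    WEIGHTS.confidence * meanConfidence +
    WEIGHTS.coverage * coverage +
    WEIGHTS.consistency * consistency;
  return Math.min(1, Math.max(0, score)); // clamp to [0, 1]
}

function isUsable(score: number): boolean {
  return score >= USABLE_THRESHOLD;
}
```

With these assumed weights, a document with full coverage and passing checks still needs a mean field confidence of at least 0.25 to reach the threshold, so coverage and consistency dominate the result.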

The score is exposed per file through the get_extraction_quality AI tool and GET /api/files/[fileId]/extraction-quality.

Dynamic Columns

When extraction succeeds and the score passes the threshold:

Each typed field becomes a column on the row's parent table.
Column names are derived from the field name (e.g. document_number, nationality).
The column type is inferred from sample values: DATE, NUMBER, TEXT, or EMAIL.
Automatic deduplication prevents duplicate columns on subsequent extractions.
The extracted value is written to the row's column.

Tables are capped at 50 dynamic extraction columns. Beyond that, new extractions queue as OVERFLOW for admin review.
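
The promotion rules above (name derivation, type inference, deduplication, and the 50-column cap) can be sketched as follows. The camelCase to snake_case rule, the regular expressions, and all function names are assumptions made for illustration; the real inference logic may differ.

```typescript
type ColumnType = "DATE" | "NUMBER" | "TEXT" | "EMAIL";

const MAX_DYNAMIC_COLUMNS = 50; // cap stated in the text

// Assumed naming rule: camelCase field name -> snake_case column name.
function toColumnName(fieldName: string): string {
  return fieldName.replace(/([a-z0-9])([A-Z])/g, "$1_$2").toLowerCase();
}

// Assumed inference rules; only ISO-style dates (YYYY-MM-DD) are recognised here.
function inferColumnType(sample: string): ColumnType {
  const v = sample.trim();
  if (/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(v)) return "EMAIL";
  if (/^\d{4}-\d{2}-\d{2}$/.test(v)) return "DATE";
  if (v !== "" && isFinite(Number(v))) return "NUMBER";
  return "TEXT";
}

// Returns the new columns to create, or "OVERFLOW" if the cap would be exceeded.
// Names that already exist are skipped, so repeat extractions add nothing.
function planColumns(existing: string[], fieldNames: string[]): string[] | "OVERFLOW" {
  const toCreate: string[] = [];
  for (const field of fieldNames) {
    const name = toColumnName(field);
    if (existing.indexOf(name) === -1 && toCreate.indexOf(name) === -1) {
      toCreate.push(name);
    }
  }
  return existing.length + toCreate.length > MAX_DYNAMIC_COLUMNS ? "OVERFLOW" : toCreate;
}
```

Because deduplication runs before the cap check, re-extracting a document whose columns already exist does not push a full table into OVERFLOW in this sketch.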

Cell Locking

Cells can be locked to prevent further AI/extraction overwrites. Lock state lives in the row's JSON as __lock__<columnId>: true. Locks are enforced at:

The row PATCH API.
The AI patch_row and bulk_update_rows tools.
The extraction promotion path.

Locks are not enforced on direct Prisma writes or CSV import operations - those are admin-driven and assumed deliberate.
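
A lock check at any of the enforcement points could look like the sketch below. The `__lock__<columnId>` key format comes from the text; the `RowData` type and both function names are hypothetical.

```typescript
type RowData = Record<string, unknown>;

const LOCK_PREFIX = "__lock__"; // key format described above

// A cell is locked when the row JSON carries __lock__<columnId>: true.
function isLocked(row: RowData, columnId: string): boolean {
  return row[LOCK_PREFIX + columnId] === true;
}

// Applies a patch but leaves locked cells untouched, reporting what was skipped.
function applyPatchRespectingLocks(
  row: RowData,
  patch: RowData
): { row: RowData; skipped: string[] } {
  const next: RowData = { ...row };
  const skipped: string[] = [];
  for (const columnId of Object.keys(patch)) {
    if (isLocked(row, columnId)) {
      skipped.push(columnId);
      continue;
    }
    next[columnId] = patch[columnId];
  }
  return { row: next, skipped };
}
```

Returning the skipped column IDs lets the caller (API handler, AI tool, or extraction promotion) report which values were not written instead of failing silently.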

CSP Extraction Bridge

When an extraction completes for a row in the CSP system tables (Individuals, Companies, Stakeholders), the CSP extraction bridge propagates the typed fields into:

CSP register entries (where applicable - new director appointment, share allotment, etc.).
Cross-workspace identity resolution (extracts identifier fields like passport number, company registration number).
Document presence advertisements.

This is how an uploaded passport on an Individual row becomes a PASSPORT document presence record visible to the overseer.

Snapshots & Retention

Each extraction writes an ExtractionSnapshot capturing the plain-text extracted content (not the file bytes). The snapshot is retained for EXTRACTION_SNAPSHOT_RETENTION_DAYS (default 90 days), then pruned by a daily cron.

Purpose: post-hoc audits ("did the AI mis-extract this name?") and re-running extractions against the original parse without re-running OCR.
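
The pruning rule amounts to a date comparison against a cutoff. A minimal sketch, assuming each snapshot carries a `createdAt` timestamp (the `Snapshot` shape and function names are hypothetical):

```typescript
interface Snapshot {
  id: string;
  createdAt: Date; // assumed field name
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Snapshots created before this instant are past their retention period.
function retentionCutoff(now: Date, retentionDays: number): Date {
  return new Date(now.getTime() - retentionDays * MS_PER_DAY);
}

// Selects the snapshots a daily prune would delete (default retention: 90 days).
function selectExpired(snapshots: Snapshot[], now: Date, retentionDays = 90): Snapshot[] {
  const cutoff = retentionCutoff(now, retentionDays).getTime();
  return snapshots.filter((s) => s.createdAt.getTime() < cutoff);
}
```

Passing `now` in explicitly keeps the function deterministic, which makes the prune logic easy to test without mocking the clock.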

Configuration

Env var	Purpose	Default
`EXTRACTION_POLL_INTERVAL_MS`	Worker poll interval	5000
`EXTRACTION_MAX_CONCURRENT`	Max concurrent jobs per process	3
`EXTRACTION_JOB_TIMEOUT_MS`	Per-job timeout	180000
`EXTRACTION_MAX_ATTEMPTS`	Retries before TIMED_OUT	3
`EXTRACTION_SNAPSHOT_RETENTION_DAYS`	Snapshot retention	90
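
The variables and defaults in the table could be loaded as shown below. The variable names and default values come from the table; the `ExtractionConfig` shape, the helper names, and the choice to throw on invalid values are assumptions.

```typescript
interface ExtractionConfig {
  pollIntervalMs: number;
  maxConcurrent: number;
  jobTimeoutMs: number;
  maxAttempts: number;
  snapshotRetentionDays: number;
}

type Env = Record<string, string | undefined>;

// Reads a positive integer from env, falling back to the documented default.
function readPositiveInt(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  if (!isFinite(parsed) || parsed <= 0 || Math.floor(parsed) !== parsed) {
    throw new Error(`${key} must be a positive integer, got "${raw}"`);
  }
  return parsed;
}

function loadExtractionConfig(env: Env): ExtractionConfig {
  return {
    pollIntervalMs: readPositiveInt(env, "EXTRACTION_POLL_INTERVAL_MS", 5000),
    maxConcurrent: readPositiveInt(env, "EXTRACTION_MAX_CONCURRENT", 3),
    jobTimeoutMs: readPositiveInt(env, "EXTRACTION_JOB_TIMEOUT_MS", 180000),
    maxAttempts: readPositiveInt(env, "EXTRACTION_MAX_ATTEMPTS", 3),
    snapshotRetentionDays: readPositiveInt(env, "EXTRACTION_SNAPSHOT_RETENTION_DAYS", 90),
  };
}
```

In a Node.js process this would typically be called once at startup as `loadExtractionConfig(process.env)`, so a malformed value fails fast rather than surfacing mid-job.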

API Surface

Endpoint	Purpose
`GET /api/extraction/pending`	List pending/overflow extractions (ADMIN).
`POST /api/extraction/pending/[id]/resolve`	Manually resolve an overflow / failed extraction.
`POST /api/extraction/pending/[id]/retry`	Retry a failed extraction.

AI-tool surface: get_extraction_quality, bridge_kyc_documents.
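
For client code, the three endpoints can be wrapped in small request builders. This sketch covers only the method and path, which are taken from the table; request bodies, response shapes, and authentication are not documented here, so none are assumed. The type and function names are hypothetical.

```typescript
type ExtractionAdminAction = "resolve" | "retry";

interface RequestSpec {
  method: "GET" | "POST";
  path: string;
}

const PENDING_BASE = "/api/extraction/pending";

// GET /api/extraction/pending
function listPendingRequest(): RequestSpec {
  return { method: "GET", path: PENDING_BASE };
}

// POST /api/extraction/pending/[id]/resolve or /retry
function actionRequest(id: string, action: ExtractionAdminAction): RequestSpec {
  return {
    method: "POST",
    path: `${PENDING_BASE}/${encodeURIComponent(id)}/${action}`,
  };
}
```

A caller could pass the result to `fetch(spec.path, { method: spec.method })`. Encoding the ID guards against path characters in identifiers.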

Known Risks

The extraction pipeline has several documented risks:

LLM hallucination on field values (mitigated by confidence scoring + cell locking).
Python subprocess isolation (LangExtract runs as a child process; sandboxed via OS-level limits).
Overflow queue UX (50-column cap can be hit on heavily-extracted tables).
Snapshot retention (default 90 days; longer retention requires explicit configuration).