Document Extraction

Document extraction transforms raw uploaded files into typed, structured fields that become dynamic columns on the row's parent table. It's the bridge between unstructured uploads (passports, contracts, certificates) and the structured world of tables, registers, and matter properties.

Pipeline

File uploaded to a row / matter / submission
   │
   ├─ FileRecord created
   ├─ FileLink created (entityType + role)
   ├─ ExtractionJob enqueued (PENDING)
   │
   ▼
ExtractionWorker
   ├─ Loads FileRecord + parent context
   ├─ Picks extraction class (PASSPORT, CERTIFICATE_OF_INCORPORATION, ...)
   ├─ Runs the LangExtract bridge (Python subprocess via stdio)
   │   ├─ Vision/OCR pass on the file bytes
   │   ├─ LLM pass with class-specific prompt
   │   └─ Returns typed key-value pairs + confidence scores
   ├─ Computes structure score (0-1)
   ├─ Decides: usable / overflow / failed
   │
   ▼
On usable:
   ├─ Dynamic columns created on the row's table (capped at 50/table)
   ├─ Extracted values written to the row
   ├─ ExtractionSnapshot stored (plain text for retention period)
   └─ ExtractionJob marked DONE

ExtractionJob States

| State | Meaning |
| --- | --- |
| PENDING | Enqueued, awaiting a worker. |
| PROCESSING | A worker is running the extraction. |
| DONE | Successfully completed, dynamic columns populated. |
| FAILED | Extraction errored. |
| TIMED_OUT | Exceeded EXTRACTION_JOB_TIMEOUT_MS. |
| OVERFLOW | Too many extraction columns on the target table; queued for admin review. |

Admins can retry jobs via the extraction queue under /api/extraction/pending.
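The states above can be modelled as a small state machine. The sketch below is illustrative only: the state names come from the table, but the transition map is an assumption, since this page lists the states without specifying which moves between them are allowed.

```typescript
// Job states as listed in the table above.
type ExtractionJobState =
  | "PENDING"
  | "PROCESSING"
  | "DONE"
  | "FAILED"
  | "TIMED_OUT"
  | "OVERFLOW";

// Hypothetical transition map (an assumption, not taken from the product).
const ALLOWED_TRANSITIONS: Record<ExtractionJobState, ExtractionJobState[]> = {
  PENDING: ["PROCESSING"],
  PROCESSING: ["DONE", "FAILED", "TIMED_OUT", "OVERFLOW"],
  DONE: [],
  FAILED: ["PENDING"], // assumed: an admin retry re-enqueues the job
  TIMED_OUT: ["PENDING"], // assumed: timed-out jobs can be retried too
  OVERFLOW: ["DONE", "FAILED"], // assumed: an admin resolves the overflow
};

// Returns true when moving from `from` to `to` is allowed by the map above.
function canTransition(from: ExtractionJobState, to: ExtractionJobState): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}
```

Keeping the map in one place makes it easy to reject an invalid update, such as moving a DONE job back to PROCESSING, before it reaches the database.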

Extraction Classes

Each upload is classified before extraction. The class determines the prompt template and the expected field shape.

| Class | Typical fields |
| --- | --- |
| PASSPORT | fullName, nationality, dateOfBirth, documentNumber, expiryDate, sex, mrzLine1, mrzLine2 |
| EMIRATES_ID | idNumber, nameAr, nameEn, nationality, dateOfBirth, expiryDate |
| CERTIFICATE_OF_INCORPORATION | companyName, companyNumber, incorporationDate, jurisdiction, registeredAddress |
| MEMORANDUM_OF_ASSOCIATION | companyName, subscribers, shareCapital, objectsClause |
| BOARD_RESOLUTION | companyName, meetingDate, attendees, resolutionsPassed |
| BANK_STATEMENT | accountHolder, accountNumber, bankName, statementPeriod, closingBalance |
| LEASE_AGREEMENT | lessor, lessee, propertyAddress, leaseStart, leaseEnd, monthlyRent |
| UTILITY_BILL | accountHolder, address, provider, billDate, amount |

Plus generic classes for free-form documents (GENERIC_LEGAL, GENERIC_KYC).
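One way to picture the "expected field shape" is a lookup from class to field names. The sketch below copies two rows from the table. The `EXPECTED_FIELDS` map and the `coverage` helper are hypothetical names used for illustration, not the product's internal API.

```typescript
// Expected fields per class, copied from the table above (two classes shown).
const EXPECTED_FIELDS: Record<string, string[]> = {
  PASSPORT: [
    "fullName", "nationality", "dateOfBirth", "documentNumber",
    "expiryDate", "sex", "mrzLine1", "mrzLine2",
  ],
  UTILITY_BILL: ["accountHolder", "address", "provider", "billDate", "amount"],
};

// Proportion of expected fields that came back with a non-empty value.
// Unknown classes return 0 here; how generic classes behave is not specified.
function coverage(docClass: string, extracted: Record<string, unknown>): number {
  const expected = EXPECTED_FIELDS[docClass];
  if (expected === undefined || expected.length === 0) return 0;
  const present = expected.filter((field) => {
    const value = extracted[field];
    return value !== undefined && value !== null && value !== "";
  });
  return present.length / expected.length;
}
```

For example, a passport extraction that returns only `fullName`, `nationality`, `dateOfBirth`, and `documentNumber` covers four of the eight expected fields.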

Quality Scoring

Each extraction returns a structure score between 0 and 1 that measures how well the document parsed. The score is computed from:

  • Field-level confidence (LLM-reported certainty for each field).
  • Coverage (proportion of expected fields present).
  • Consistency checks (e.g. MRZ vs visible fields on a passport).

Threshold for usability: structure score >= 0.7. Below the threshold, the result is flagged for review and is not auto-promoted to dynamic columns.
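As a rough illustration of how the three inputs could combine into one score, the sketch below uses a weighted average. The 0.7 threshold is from the text. The weights, the `ScoreInputs` shape, and the function names are assumptions, since the actual formula is not documented here.

```typescript
interface ScoreInputs {
  fieldConfidences: number[]; // LLM-reported certainty per field, each 0-1
  coverage: number; // proportion of expected fields present, 0-1
  consistencyChecksPassed: number; // e.g. MRZ matches the visible fields
  consistencyChecksTotal: number;
}

const USABLE_THRESHOLD = 0.7; // from the documentation

// Hypothetical weights; the real weighting is not documented.
const WEIGHTS = { confidence: 0.5, coverage: 0.3, consistency: 0.2 };

function structureScore(input: ScoreInputs): number {
  const n = input.fieldConfidences.length;
  const meanConfidence =
    n === 0 ? 0 : input.fieldConfidences.reduce((sum, c) => sum + c, 0) / n;
  // With no consistency checks defined, treat consistency as neutral (1).
  const consistency =
    input.consistencyChecksTotal === 0
      ? 1
      : input.consistencyChecksPassed / input.consistencyChecksTotal;
  const raw =
    WEIGHTS.confidence * meanConfidence +
    WEIGHTS.coverage * input.coverage +
    WEIGHTS.consistency * consistency;
  return Math.min(1, Math.max(0, raw)); // clamp to [0, 1]
}

// A result at or above the threshold may be promoted to dynamic columns.
function isUsable(score: number): boolean {
  return score >= USABLE_THRESHOLD;
}
```

A weighted average keeps the score in the 0 to 1 range as long as each input is, and lets a single weak signal lower the result without zeroing it.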

get_extraction_quality (AI tool) and GET /api/files/[fileId]/extraction-quality surface the score per file.

Dynamic Columns

When extraction succeeds and the score passes the threshold:

  1. Each typed field becomes a column on the row's parent table.
  2. Column names are derived from the field name (e.g. passport_number, nationality).
  3. The column type is inferred from sample values: DATE, NUMBER, TEXT, or EMAIL.
  4. Automatic deduplication prevents duplicate columns on subsequent extractions.
  5. The extracted value is written to the row's column.

Tables are capped at 50 dynamic extraction columns. Beyond that, new extractions queue as OVERFLOW for admin review.
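The promotion steps above can be sketched as three small helpers. The 50-column cap and the four type names are from the text. The camelCase to snake_case rule, the regular expressions, and the helper names are assumptions made for illustration.

```typescript
type ColumnType = "DATE" | "NUMBER" | "TEXT" | "EMAIL";

const MAX_DYNAMIC_COLUMNS = 50; // cap stated in the documentation

// Assumed naming rule: camelCase field name -> snake_case column name.
function toColumnName(field: string): string {
  return field.replace(/([a-z0-9])([A-Z])/g, "$1_$2").toLowerCase();
}

const EMAIL_RE = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
const ISO_DATE_RE = /^\d{4}-\d{2}-\d{2}$/; // assumes ISO-formatted dates

// Picks a type only when every non-empty sample agrees; otherwise TEXT.
function inferType(samples: string[]): ColumnType {
  const values = samples.map((s) => s.trim()).filter((s) => s !== "");
  if (values.length === 0) return "TEXT";
  if (values.every((v) => EMAIL_RE.test(v))) return "EMAIL";
  if (values.every((v) => ISO_DATE_RE.test(v))) return "DATE";
  if (values.every((v) => Number.isFinite(Number(v)))) return "NUMBER";
  return "TEXT";
}

interface ColumnPlan {
  toCreate: string[]; // new column names, deduplicated
  overflow: boolean; // true when the cap would be exceeded
}

// `existing` is assumed to hold only the table's dynamic extraction columns.
function planColumns(existing: string[], fields: string[]): ColumnPlan {
  const names = fields.map(toColumnName);
  const toCreate = names.filter(
    (name, i) => names.indexOf(name) === i && existing.indexOf(name) === -1,
  );
  if (existing.length + toCreate.length > MAX_DYNAMIC_COLUMNS) {
    return { toCreate: [], overflow: true };
  }
  return { toCreate, overflow: false };
}
```

In this sketch an extraction that would push the table past the cap creates no columns at all, which mirrors the all-or-nothing OVERFLOW behaviour described above.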

Cell Locking

Cells can be locked to prevent further AI/extraction overwrites. Lock state lives in the row's JSON as __lock__<columnId>: true. Locks are enforced at:

  • The row PATCH API.
  • The AI patch_row and bulk_update_rows tools.
  • The extraction promotion path.

Locks are not enforced on direct Prisma writes or CSV import operations - those are admin-driven and assumed deliberate.
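A lock check against the row JSON might look like the sketch below. The `__lock__<columnId>` key format is from the text. The `RowData` type and both helper names are hypothetical, and the real enforcement code may differ.

```typescript
type RowData = Record<string, unknown>;

const LOCK_PREFIX = "__lock__"; // key format from the documentation

// A cell is locked when the row JSON holds `__lock__<columnId>: true`.
function isLocked(row: RowData, columnId: string): boolean {
  return row[LOCK_PREFIX + columnId] === true;
}

// Applies a patch but leaves locked cells untouched.
// Returns the new row plus the column ids that were skipped.
function applyPatchRespectingLocks(
  row: RowData,
  patch: RowData,
): { next: RowData; skipped: string[] } {
  const next: RowData = { ...row };
  const skipped: string[] = [];
  for (const columnId of Object.keys(patch)) {
    if (isLocked(row, columnId)) {
      skipped.push(columnId);
      continue;
    }
    next[columnId] = patch[columnId];
  }
  return { next, skipped };
}
```

Returning the skipped column ids lets the caller (the PATCH API, an AI tool, or the promotion path) report which writes were blocked instead of dropping them silently.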

CSP Extraction Bridge

When an extraction completes for a row in the CSP system tables (Individuals, Companies, Stakeholders), the CSP extraction bridge propagates the typed fields into:

  • CSP register entries (where applicable - new director appointment, share allotment, etc.).
  • Cross-workspace identity resolution (extracts identifier fields like passport number, company registration number).
  • Document presence advertisements.

This is how an uploaded passport on an Individual row becomes a PASSPORT document presence record visible to the overseer.

Snapshots & Retention

Each extraction writes an ExtractionSnapshot capturing the plain-text extracted content (not the file bytes). The snapshot is retained for EXTRACTION_SNAPSHOT_RETENTION_DAYS (default 90 days), then pruned by a daily cron.

Purpose: post-hoc audits ("did the AI mis-extract this name?") and re-running extractions against the original parse without re-running OCR.
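The daily pruning step comes down to a cutoff-date comparison. The sketch below assumes each snapshot carries a `createdAt` timestamp. That field name, the `SnapshotMeta` shape, and the helper names are illustrative, not taken from the actual schema.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Snapshots created before this instant are past their retention period.
function retentionCutoff(now: Date, retentionDays: number): Date {
  return new Date(now.getTime() - retentionDays * MS_PER_DAY);
}

// Hypothetical snapshot metadata; `createdAt` is an assumed field name.
interface SnapshotMeta {
  id: string;
  createdAt: Date;
}

// Returns the ids a pruning job would delete (default retention: 90 days).
function selectExpired(
  snapshots: SnapshotMeta[],
  now: Date,
  retentionDays: number = 90,
): string[] {
  const cutoff = retentionCutoff(now, retentionDays).getTime();
  return snapshots
    .filter((s) => s.createdAt.getTime() < cutoff)
    .map((s) => s.id);
}
```

Passing `now` as a parameter rather than reading the clock inside the function keeps the pruning logic deterministic and easy to test.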

Configuration

| Env var | Purpose | Default |
| --- | --- | --- |
| EXTRACTION_POLL_INTERVAL_MS | Worker poll interval (ms) | 5000 |
| EXTRACTION_MAX_CONCURRENT | Max concurrent jobs per process | 3 |
| EXTRACTION_JOB_TIMEOUT_MS | Per-job timeout (ms) | 180000 |
| EXTRACTION_MAX_ATTEMPTS | Retries before TIMED_OUT | 3 |
| EXTRACTION_SNAPSHOT_RETENTION_DAYS | Snapshot retention (days) | 90 |
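A minimal sketch of reading these variables with their documented defaults follows. The variable names and default values are from the table. The `ExtractionConfig` shape, the loader, and the choice to fall back to the default on invalid input are assumptions.

```typescript
interface ExtractionConfig {
  pollIntervalMs: number;
  maxConcurrent: number;
  jobTimeoutMs: number;
  maxAttempts: number;
  snapshotRetentionDays: number;
}

type Env = Record<string, string | undefined>;

// Parses a positive integer, falling back to the default when the variable
// is unset or invalid (the fallback behaviour is an assumption).
function readInt(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

// Defaults are the values listed in the configuration table.
function loadExtractionConfig(env: Env): ExtractionConfig {
  return {
    pollIntervalMs: readInt(env, "EXTRACTION_POLL_INTERVAL_MS", 5000),
    maxConcurrent: readInt(env, "EXTRACTION_MAX_CONCURRENT", 3),
    jobTimeoutMs: readInt(env, "EXTRACTION_JOB_TIMEOUT_MS", 180000),
    maxAttempts: readInt(env, "EXTRACTION_MAX_ATTEMPTS", 3),
    snapshotRetentionDays: readInt(env, "EXTRACTION_SNAPSHOT_RETENTION_DAYS", 90),
  };
}
```

In a Node.js process the loader would typically be called as `loadExtractionConfig(process.env)`.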

API Surface

| Endpoint | Purpose |
| --- | --- |
| GET /api/extraction/pending | List pending/overflow extractions (ADMIN). |
| POST /api/extraction/pending/[id]/resolve | Manually resolve an overflow / failed extraction. |
| POST /api/extraction/pending/[id]/retry | Retry a failed extraction. |
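For client code, the three routes can be described with a small request builder. The methods and paths are from the table. The `RequestSpec` type and function names are hypothetical, and request or response bodies are omitted because this page does not document them.

```typescript
type PendingAction = "resolve" | "retry";

// Minimal description of a request; pass it to any HTTP client.
interface RequestSpec {
  method: "GET" | "POST";
  path: string;
}

// GET /api/extraction/pending (admin only, per the table).
function listPendingRequest(): RequestSpec {
  return { method: "GET", path: "/api/extraction/pending" };
}

// POST /api/extraction/pending/[id]/resolve or .../retry.
function pendingActionRequest(id: string, action: PendingAction): RequestSpec {
  return {
    method: "POST",
    path: "/api/extraction/pending/" + encodeURIComponent(id) + "/" + action,
  };
}
```

Encoding the id guards against path injection if ids ever contain reserved characters.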

AI-tool surface: get_extraction_quality, bridge_kyc_documents.

Known Risks

The extraction pipeline has several documented risks:

  • LLM hallucination on field values (mitigated by confidence scoring + cell locking).
  • Python subprocess isolation (LangExtract runs as a child process; sandboxed via OS-level limits).
  • Overflow queue UX (50-column cap can be hit on heavily-extracted tables).
  • Snapshot retention (default 90 days; longer retention requires explicit configuration).
