RAG & Search

Opbox's RAG pipeline is hybrid: dense vector embeddings (semantic similarity) + Postgres full-text search (lexical match), combined via Reciprocal Rank Fusion. It powers the AI assistant's knowledge_search tool, the spotlight search bar's relevance ranking, and the system's self-knowledge.

Pipeline

Source content (KB doc / comment / note / transcript / extraction)
   │
   ├─ Chunk into 512-token windows (64-token overlap)
   ├─ SHA-256 dedup against existing chunks
   ├─ Embed via OpenAI text-embedding-3-small
   ├─ Write Embedding row (sourceType, sourceId, vector, text)
   └─ Index Postgres tsvector column for full-text
       │
       ▼
Query time:
   ├─ Embed the query (single vector)
   ├─ Run pgvector similarity search (top-K dense)
   ├─ Run tsvector full-text search (top-K lexical)
   ├─ Reciprocal Rank Fusion combines the two rankings
   └─ Return ranked chunks with source metadata

Source Types

Source type	What's embedded
`DOCUMENT`	Tiptap document content (KB pages and articles)
`MATTER_COMMENT`	Comments on matter steps
`TRANSCRIPT`	Voice transcripts (Whisper / WhisperX output)
`SUBMISSION_NOTE`	Notes on form submissions
`REGULATOR_COMMENT`	Comments on regulator filings
`TABLE_ROW_NOTE`	Free-text notes on table rows
`COMPLIANCE_EXCEPTION`	Compliance exception narratives
`FORM`	Form descriptions and field labels
`SYSTEM_DOC`	Opbox's own product documentation (self-knowledge)
`TABLE_DATA`	Specific row content (capped per table)

Chunking

Setting	Value
Chunk size	512 tokens
Overlap	64 tokens
Dedup	SHA-256 of chunk text
Embedding model	`text-embedding-3-small` (configurable)

Overlap means a sentence near a chunk boundary appears in both adjacent chunks - so retrieval doesn't miss it because it straddled the cut.

Hybrid Retrieval

Method	Strength
Dense (pgvector)	Semantic similarity. "company formation" finds "incorporation" even with no shared words.
Lexical (tsvector)	Exact-match precision. "MAT-0042" finds rows with that ID.
RRF combine	Each method ranks top K; fused via Reciprocal Rank Fusion to balance precision and recall.

The fusion step is what makes hybrid better than either alone for production search.

AI-Tool Surface

Tool	Description
`knowledge_search`	Hybrid search across all org content. Returns ranked chunks with source metadata + relevance scores.
`get_document_content`	Retrieve full document content by ID (after a `knowledge_search` match) for deeper context.
`get_document_extractions`	Get AI-extracted structured data from a document.
`search_system_docs`	Search Opbox's product docs by keyword. Returns relevant sections with relevance scores.
`embed_system_docs`	Index product documentation for semantic search (admin).

REST equivalents:

Endpoint	Purpose
`GET /api/docs/search?query=...`	Search system docs (powers `search_system_docs`).
`POST /api/docs/embed`	Trigger embedding of system docs.

Admin Surface

Monitor and manage the embedding pipeline at Settings > AI > RAG (ADMIN/OWNER):

GET /api/admin/rag?sourceType=DOCUMENT

{
  "ragEnabled": true,
  "embeddingModel": "text-embedding-3-small",
  "statusBySourceType": {
    "DOCUMENT": { "COMPLETED": 42, "PENDING": 2 },
    "MATTER_COMMENT": { "COMPLETED": 15 }
  },
  "chunksBySourceType": {
    "DOCUMENT": { "chunks": 210, "tokens": 48500 }
  },
  "recentFailures": []
}

Trigger Backfill (OWNER)

POST /api/admin/rag
{ "sourceType": "DOCUMENT" }   // omit to backfill all types

Backfill runs asynchronously. Useful after enabling RAG on a workspace that previously had it off.

Configuration

Env var	Purpose	Default
`RAG_ENABLED`	Master switch for the pipeline	`true`

When disabled, embedding writes stop, query-time retrieval falls back to lexical-only via tsvector. Useful for development or for workspaces where the embedding cost isn't worth it.

Cost Attribution

Embedding writes are billed against the calling workspace via the same BYOK Credentials chain as chat. The OpenAI key for embeddings is resolved using a provider-specific lookup that bypasses the chat-provider preference. The cost ledger row carries operationType: 'embedding' with the appropriate keySource.

If a workspace has no OpenAI key configured at any tier and the env fallback is also unset, embedding writes silently no-op. The retrieval falls back to lexical-only at query time.

Spotlight Search Integration

The spotlight search bar (Cmd+K) ranks results using the same hybrid pipeline. When you type a query, the result list draws from:

Entity index (matters, forms, tables, documents) - lexical match on title/metadata.
RAG content match - semantic match against chunked content.
Fused ranking - RRF combines them.

This is why spotlight finds "the matter about company formation in Dubai" even when you typed "incorporation UAE".

Self-Knowledge Loop

search_system_docs lets the AI answer questions about Opbox itself - "what does the autonomy level do?", "where does the cost ledger live?". The product documentation is chunked and embedded via the same pipeline, just under the SYSTEM_DOC source type.

Effect: the AI is documentation-aware out of the box. New users can ask the assistant "how do I set up an MCP key?" and get an answer pulled from the live docs, not the model's training data.