RAG & Search
Opbox's RAG pipeline is hybrid: dense vector embeddings (semantic similarity) + Postgres full-text search (lexical match), combined via Reciprocal Rank Fusion. It powers the AI assistant's knowledge_search tool, the spotlight search bar's relevance ranking, and the system's self-knowledge.
Pipeline
Source content (KB doc / comment / note / transcript / extraction)
│
├─ Chunk into 512-token windows (64-token overlap)
├─ SHA-256 dedup against existing chunks
├─ Embed via OpenAI text-embedding-3-small
├─ Write Embedding row (sourceType, sourceId, vector, text)
└─ Index Postgres tsvector column for full-text
│
▼
Query time:
├─ Embed the query (single vector)
├─ Run pgvector similarity search (top-K dense)
├─ Run tsvector full-text search (top-K lexical)
├─ Reciprocal Rank Fusion combines the two rankings
└─ Return ranked chunks with source metadata
Source Types
| Source type | What's embedded |
|---|---|
DOCUMENT | Tiptap document content (KB pages and articles) |
MATTER_COMMENT | Comments on matter steps |
TRANSCRIPT | Voice transcripts (Whisper / WhisperX output) |
SUBMISSION_NOTE | Notes on form submissions |
REGULATOR_COMMENT | Comments on regulator filings |
TABLE_ROW_NOTE | Free-text notes on table rows |
COMPLIANCE_EXCEPTION | Compliance exception narratives |
FORM | Form descriptions and field labels |
SYSTEM_DOC | Opbox's own product documentation (self-knowledge) |
TABLE_DATA | Specific row content (capped per table) |
Chunking
| Setting | Value |
|---|---|
| Chunk size | 512 tokens |
| Overlap | 64 tokens |
| Dedup | SHA-256 of chunk text |
| Embedding model | text-embedding-3-small (configurable) |
Overlap means a sentence near a chunk boundary appears in both adjacent chunks - so retrieval doesn't miss it because it straddled the cut.
Hybrid Retrieval
| Method | Strength |
|---|---|
| Dense (pgvector) | Semantic similarity. "company formation" finds "incorporation" even with no shared words. |
| Lexical (tsvector) | Exact-match precision. "MAT-0042" finds rows with that ID. |
| RRF combine | Each method ranks top K; fused via Reciprocal Rank Fusion to balance precision and recall. |
The fusion step is what makes hybrid better than either alone for production search.
AI-Tool Surface
| Tool | Description |
|---|---|
knowledge_search | Hybrid search across all org content. Returns ranked chunks with source metadata + relevance scores. |
get_document_content | Retrieve full document content by ID (after a knowledge_search match) for deeper context. |
get_document_extractions | Get AI-extracted structured data from a document. |
search_system_docs | Search Opbox's product docs by keyword. Returns relevant sections with relevance scores. |
embed_system_docs | Index product documentation for semantic search (admin). |
REST equivalents:
| Endpoint | Purpose |
|---|---|
GET /api/docs/search?query=... | Search system docs (powers search_system_docs). |
POST /api/docs/embed | Trigger embedding of system docs. |
Admin Surface
Monitor and manage the embedding pipeline at Settings > AI > RAG (ADMIN/OWNER):
GET /api/admin/rag?sourceType=DOCUMENT
{
"ragEnabled": true,
"embeddingModel": "text-embedding-3-small",
"statusBySourceType": {
"DOCUMENT": { "COMPLETED": 42, "PENDING": 2 },
"MATTER_COMMENT": { "COMPLETED": 15 }
},
"chunksBySourceType": {
"DOCUMENT": { "chunks": 210, "tokens": 48500 }
},
"recentFailures": []
}
Trigger Backfill (OWNER)
POST /api/admin/rag
{ "sourceType": "DOCUMENT" } // omit to backfill all types
Backfill runs asynchronously. Useful after enabling RAG on a workspace that previously had it off.
Configuration
| Env var | Purpose | Default |
|---|---|---|
RAG_ENABLED | Master switch for the pipeline | true |
When disabled, embedding writes stop, query-time retrieval falls back to lexical-only via tsvector. Useful for development or for workspaces where the embedding cost isn't worth it.
Cost Attribution
Embedding writes are billed against the calling workspace via the same BYOK Credentials chain as chat. The OpenAI key for embeddings is resolved using a provider-specific lookup that bypasses the chat-provider preference. The cost ledger row carries operationType: 'embedding' with the appropriate keySource.
If a workspace has no OpenAI key configured at any tier and the env fallback is also unset, embedding writes silently no-op. The retrieval falls back to lexical-only at query time.
Spotlight Search Integration
The spotlight search bar (Cmd+K) ranks results using the same hybrid pipeline. When you type a query, the result list draws from:
- Entity index (matters, forms, tables, documents) - lexical match on title/metadata.
- RAG content match - semantic match against chunked content.
- Fused ranking - RRF combines them.
This is why spotlight finds "the matter about company formation in Dubai" even when you typed "incorporation UAE".
Self-Knowledge Loop
search_system_docs lets the AI answer questions about Opbox itself - "what does the autonomy level do?", "where does the cost ledger live?". The product documentation is chunked and embedded via the same pipeline, just under the SYSTEM_DOC source type.
Effect: the AI is documentation-aware out of the box. New users can ask the assistant "how do I set up an MCP key?" and get an answer pulled from the live docs, not the model's training data.
See Also
- Document Extraction - the source of structured-data content for embedding.
- Identity Resolution - how cross-workspace entities link without exposing cleartext.
- Intelligence Pipeline - the broader pipeline.