opboxDocs
Sign inBook a demo
DocsRAG & SearchAI - Intelligence

RAG & Search

Opbox's RAG pipeline is hybrid: dense vector embeddings (semantic similarity) + Postgres full-text search (lexical match), combined via Reciprocal Rank Fusion. It powers the AI assistant's knowledge_search tool, the spotlight search bar's relevance ranking, and the system's self-knowledge.

Pipeline

Source content (KB doc / comment / note / transcript / extraction)
   │
   ├─ Chunk into 512-token windows (64-token overlap)
   ├─ SHA-256 dedup against existing chunks
   ├─ Embed via OpenAI text-embedding-3-small
   ├─ Write Embedding row (sourceType, sourceId, vector, text)
   └─ Index Postgres tsvector column for full-text
       │
       ▼
Query time:
   ├─ Embed the query (single vector)
   ├─ Run pgvector similarity search (top-K dense)
   ├─ Run tsvector full-text search (top-K lexical)
   ├─ Reciprocal Rank Fusion combines the two rankings
   └─ Return ranked chunks with source metadata

Source Types

Source typeWhat's embedded
DOCUMENTTiptap document content (KB pages and articles)
MATTER_COMMENTComments on matter steps
TRANSCRIPTVoice transcripts (Whisper / WhisperX output)
SUBMISSION_NOTENotes on form submissions
REGULATOR_COMMENTComments on regulator filings
TABLE_ROW_NOTEFree-text notes on table rows
COMPLIANCE_EXCEPTIONCompliance exception narratives
FORMForm descriptions and field labels
SYSTEM_DOCOpbox's own product documentation (self-knowledge)
TABLE_DATASpecific row content (capped per table)

Chunking

SettingValue
Chunk size512 tokens
Overlap64 tokens
DedupSHA-256 of chunk text
Embedding modeltext-embedding-3-small (configurable)

Overlap means a sentence near a chunk boundary appears in both adjacent chunks - so retrieval doesn't miss it because it straddled the cut.

Hybrid Retrieval

MethodStrength
Dense (pgvector)Semantic similarity. "company formation" finds "incorporation" even with no shared words.
Lexical (tsvector)Exact-match precision. "MAT-0042" finds rows with that ID.
RRF combineEach method ranks top K; fused via Reciprocal Rank Fusion to balance precision and recall.

The fusion step is what makes hybrid better than either alone for production search.

AI-Tool Surface

ToolDescription
knowledge_searchHybrid search across all org content. Returns ranked chunks with source metadata + relevance scores.
get_document_contentRetrieve full document content by ID (after a knowledge_search match) for deeper context.
get_document_extractionsGet AI-extracted structured data from a document.
search_system_docsSearch Opbox's product docs by keyword. Returns relevant sections with relevance scores.
embed_system_docsIndex product documentation for semantic search (admin).

REST equivalents:

EndpointPurpose
GET /api/docs/search?query=...Search system docs (powers search_system_docs).
POST /api/docs/embedTrigger embedding of system docs.

Admin Surface

Monitor and manage the embedding pipeline at Settings > AI > RAG (ADMIN/OWNER):

GET /api/admin/rag?sourceType=DOCUMENT
{
  "ragEnabled": true,
  "embeddingModel": "text-embedding-3-small",
  "statusBySourceType": {
    "DOCUMENT": { "COMPLETED": 42, "PENDING": 2 },
    "MATTER_COMMENT": { "COMPLETED": 15 }
  },
  "chunksBySourceType": {
    "DOCUMENT": { "chunks": 210, "tokens": 48500 }
  },
  "recentFailures": []
}

Trigger Backfill (OWNER)

POST /api/admin/rag
{ "sourceType": "DOCUMENT" }   // omit to backfill all types

Backfill runs asynchronously. Useful after enabling RAG on a workspace that previously had it off.

Configuration

Env varPurposeDefault
RAG_ENABLEDMaster switch for the pipelinetrue

When disabled, embedding writes stop, query-time retrieval falls back to lexical-only via tsvector. Useful for development or for workspaces where the embedding cost isn't worth it.

Cost Attribution

Embedding writes are billed against the calling workspace via the same BYOK Credentials chain as chat. The OpenAI key for embeddings is resolved using a provider-specific lookup that bypasses the chat-provider preference. The cost ledger row carries operationType: 'embedding' with the appropriate keySource.

If a workspace has no OpenAI key configured at any tier and the env fallback is also unset, embedding writes silently no-op. The retrieval falls back to lexical-only at query time.

Spotlight Search Integration

The spotlight search bar (Cmd+K) ranks results using the same hybrid pipeline. When you type a query, the result list draws from:

  • Entity index (matters, forms, tables, documents) - lexical match on title/metadata.
  • RAG content match - semantic match against chunked content.
  • Fused ranking - RRF combines them.

This is why spotlight finds "the matter about company formation in Dubai" even when you typed "incorporation UAE".

Self-Knowledge Loop

search_system_docs lets the AI answer questions about Opbox itself - "what does the autonomy level do?", "where does the cost ledger live?". The product documentation is chunked and embedded via the same pipeline, just under the SYSTEM_DOC source type.

Effect: the AI is documentation-aware out of the box. New users can ask the assistant "how do I set up an MCP key?" and get an answer pulled from the live docs, not the model's training data.

See Also

We use cookies

Strictly necessary cookies keep you signed in and protect requests. We also use optional cookies for preferences and (when enabled) analytics. Learn more.