Voice Transcripts

Upload audio files (voice memos, meeting recordings, dictation) and Opbox transcribes them to text. Two providers are available: cloud Whisper (OpenAI) for plain text, and local WhisperX for speaker diarisation with timestamped segments.

Transcripts are private per user - only the uploader can see and manage their own transcripts. The AI assistant respects this through dedicated list_my_transcripts / get_my_transcript tools.

Providers

Provider	Speaker Diarisation	Requirements	Output
Whisper (cloud)	No - plain text only	OpenAI API key	Single text blob
WhisperX (local)	Yes - timestamped segments with speaker labels	Python 3, `pip install whisperx`, HuggingFace token (`HF_TOKEN`), `WHISPERX_ENABLED=true`	Text + segments array

Select the provider in Settings > AI > Transcription:

Auto - WhisperX if installed and enabled, else Whisper.
WhisperX - force local diarised transcription. Returns an error if WhisperX is not configured.
Whisper - force cloud Whisper.
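
The selection rules above can be sketched as a small function. This is only an illustration of the documented behaviour, not Opbox's actual code: `resolve_provider` and the `whisperx_ready` flag are hypothetical names, and the flag stands in for "installed and enabled", however Opbox checks that.

```python
from typing import Literal

Preference = Literal["auto", "whisperx", "whisper"]
Provider = Literal["whisperx", "openai"]


def resolve_provider(preference: Preference, whisperx_ready: bool) -> Provider:
    """Map a user preference to the provider value reported in responses.

    `whisperx_ready` is an assumed flag meaning "WhisperX is installed
    and enabled"; the real check is not described in this document.
    """
    if preference == "whisper":
        # Forced cloud Whisper, reported as provider "openai".
        return "openai"
    if preference == "whisperx":
        # Forced local transcription: fail rather than silently fall back.
        if not whisperx_ready:
            raise RuntimeError("WhisperX is selected but not configured")
        return "whisperx"
    if preference == "auto":
        return "whisperx" if whisperx_ready else "openai"
    raise ValueError(f"unknown preference: {preference!r}")
```

Only the Auto setting falls back to cloud Whisper; the forced WhisperX setting surfaces the configuration problem instead.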

Upload & Transcribe

POST /api/ai/transcribe
Content-Type: multipart/form-data

file: <audio file (max 1GB)>
title: "Lee & Will"        # optional
provider: "whisperx"       # optional - overrides the user's provider setting

Supported formats: .m4a, .mp3, .wav, .webm, .ogg, .flac, .mp4, .qta (Apple Voice Memos - auto-converted to m4a).
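
A client may want to check a file against these limits before uploading. The following is a minimal sketch under stated assumptions: `check_upload` is a hypothetical helper, the extension match is assumed to be case-insensitive, and "1GB" is read as 10^9 bytes. The server remains the authority on what it accepts.

```python
from pathlib import PurePath

# Extensions listed in the documentation above.
SUPPORTED_EXTENSIONS = {
    ".m4a", ".mp3", ".wav", ".webm", ".ogg", ".flac", ".mp4", ".qta",
}
# Assumption: "1GB" interpreted as 10**9 bytes; the server may use 2**30.
MAX_UPLOAD_BYTES = 1_000_000_000


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    ext = PurePath(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append("file exceeds the 1GB limit")
    return problems
```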

The UI shows a naming dialog on file selection. The title defaults to the filename (minus extension) and is used as the Knowledge Base document name with a date prefix (e.g. "18 February 2026 - Lee & Will").
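
The naming rule can be illustrated with a short helper. This is a sketch, not the product's implementation: `kb_document_name` is a hypothetical name, the day is assumed to be zero-padded (following the "DD" in the documented format), and month names are hard-coded in English to avoid locale differences.

```python
from datetime import date
from pathlib import PurePath
from typing import Optional

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def kb_document_name(title: Optional[str], filename: str, uploaded: date) -> str:
    """Build a "DD Month YYYY - Title" name, defaulting the title to the file stem."""
    display = title.strip() if title and title.strip() else PurePath(filename).stem
    # Assumption: two-digit day, as suggested by "DD" in the documented format.
    return f"{uploaded.day:02d} {MONTHS[uploaded.month - 1]} {uploaded.year} - {display}"
```

For example, `kb_document_name("Lee & Will", "meeting.m4a", date(2026, 2, 18))` returns `"18 February 2026 - Lee & Will"`.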

WhisperX Response

{
  "id": "cm...",
  "filename": "meeting.m4a",
  "title": "Lee & Will",
  "text": "Hello everyone, let's get started...",
  "language": "en",
  "durationSecs": 120.5,
  "provider": "whisperx",
  "speakerCount": 3,
  "segments": [
    { "start": 0.5, "end": 3.2, "text": "Hello everyone, let's get started.", "speaker": "SPEAKER_00" },
    { "start": 3.8, "end": 7.1, "text": "Thanks for joining.", "speaker": "SPEAKER_01" }
  ],
  "createdAt": "2026-02-18T10:30:00.000Z",
  "documentId": "cm..."
}

Whisper Response

{
  "id": "cm...",
  "filename": "recording.m4a",
  "title": null,
  "text": "Hello, this is a transcription of...",
  "language": "english",
  "durationSecs": 42.5,
  "provider": "openai",
  "speakerCount": null,
  "segments": null,
  "createdAt": "2026-02-18T10:30:00.000Z",
  "documentId": "cm..."
}
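
Because `segments` is an array for WhisperX and null for Whisper, client code generally needs to handle both shapes. The sketch below renders either response as readable text; `render_transcript` and `format_timestamp` are hypothetical helpers, and the `[MM:SS] SPEAKER: text` layout is just one possible presentation.

```python
from typing import Any


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS, dropping the fractional part."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def render_transcript(response: dict[str, Any]) -> str:
    """Return speaker-labelled lines when segments exist, else the plain text."""
    segments = response.get("segments")
    if not segments:
        # Whisper (cloud) responses carry "segments": null.
        return response["text"]
    lines = []
    for seg in segments:
        # Assumption: fall back to "UNKNOWN" if a segment lacks a speaker label.
        speaker = seg.get("speaker", "UNKNOWN")
        lines.append(f"[{format_timestamp(seg['start'])}] {speaker}: {seg['text']}")
    return "\n".join(lines)
```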

List, Get, Delete

GET /api/ai/transcripts?search=meeting&page=1&limit=20
GET /api/ai/transcripts/:id
DELETE /api/ai/transcripts/:id

The list endpoint is scoped to the requesting user - you only ever see your own transcripts.
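
Search terms need URL-encoding when the list request is built programmatically. A minimal sketch follows: `transcripts_list_url` and the host `opbox.example` are hypothetical, and the `page=1` / `limit=20` defaults simply mirror the example request above rather than documented server defaults.

```python
from typing import Optional
from urllib.parse import urlencode


def transcripts_list_url(
    base_url: str,
    search: Optional[str] = None,
    page: int = 1,     # assumption: mirrors the example request
    limit: int = 20,   # assumption: mirrors the example request
) -> str:
    """Build the GET /api/ai/transcripts URL with encoded query parameters."""
    params: dict[str, object] = {}
    if search:
        params["search"] = search
    params["page"] = page
    params["limit"] = limit
    return f"{base_url.rstrip('/')}/api/ai/transcripts?{urlencode(params)}"
```

Authentication is not shown here because this section does not describe how requests are authenticated.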

Knowledge Base Sync

Each transcript creates a paired KB document under the Transcripts system folder. The KB document carries:

The transcript text (or rich segments for WhisperX, with speaker labels).
A title formatted as "DD Month YYYY - Title" (e.g. "18 February 2026 - Lee & Will").
A transcript label so you can filter the KB by it.

This means transcripts are immediately searchable via knowledge_search and the standard KB tools - useful when the AI is reasoning across notes, documents, and meeting transcripts together.

Response Field Reference

Field	Type	Description
`title`	string \| null	User-chosen display name. Null if not provided, in which case the display name falls back to the filename.
`provider`	string \| null	`"openai"` or `"whisperx"`. Null for legacy transcripts.
`speakerCount`	number \| null	Distinct speakers detected. WhisperX only.
`segments`	array \| null	Each segment carries `start`, `end`, `text`, and `speaker`; `start` and `end` are timestamps in seconds. WhisperX only.
`documentId`	string	Paired KB document for searchability.

Privacy Model

Per-user scoping - the list/detail/delete endpoints all filter by the requesting user.
AI access - the AI has no generic list_transcripts / get_transcript tools. It has only list_my_transcripts / get_my_transcript, which enforce the same per-user scoping when the agent runs as a specific user (rare for transcripts).
Workspace owners - cannot read your transcripts via the UI or API. They can only see metadata in the audit log (filename, duration, timestamp) - never content.
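
The per-user scoping described above amounts to filtering every read and delete by the requesting user's ID. The following in-memory sketch illustrates the idea only; `Transcript`, `TranscriptStore`, and the `owner_id` field are hypothetical and say nothing about how Opbox stores data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Transcript:
    id: str
    owner_id: str  # hypothetical field: the uploading user's ID
    filename: str
    text: str


class TranscriptStore:
    """Toy store in which every operation is scoped to the requesting user."""

    def __init__(self, transcripts: list[Transcript]) -> None:
        self._by_id = {t.id: t for t in transcripts}

    def list_for_user(self, user_id: str) -> list[Transcript]:
        return [t for t in self._by_id.values() if t.owner_id == user_id]

    def get_for_user(self, user_id: str, transcript_id: str) -> Optional[Transcript]:
        # A transcript owned by someone else is treated as if it did not exist.
        t = self._by_id.get(transcript_id)
        return t if t is not None and t.owner_id == user_id else None

    def delete_for_user(self, user_id: str, transcript_id: str) -> bool:
        if self.get_for_user(user_id, transcript_id) is None:
            return False
        del self._by_id[transcript_id]
        return True
```

Returning "not found" for another user's transcript, rather than "forbidden", avoids revealing that the ID exists; whether Opbox does this is an assumption of the sketch.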