Cost Control

Every LLM call in Opbox writes one ledger row. Every workspace has a hard cap. Every spend is attributed to the credential tier that paid for it. The full system is built on three primitives: the AiCostLedger, the AiBudgetSnapshot, and the checkBudget() gate.

The Ledger

AiCostLedger is an append-only table with one row per LLM call.

Field	Purpose
`workspaceId`	Spend bucket. Every row has a workspace.
`operationType`	`chat` / `agent` / `extraction` / `embedding` / `other`
`model`	Specific model that ran (e.g. `claude-sonnet-4-5-20250929`)
`inputTokens` / `outputTokens`	Raw token counts from the provider response
`actualCostUsd`	Computed from per-model rates in `AI_COST_RATES`
`keySource`	Which credential tier paid: `USER_KEY` / `WORKSPACE_KEY` / `ORG_KEY` / `SERVER_KEY`
`userId`	The user whose action triggered the call (for chat); null for autonomous flows
`agentTaskId`	Set when the call was made by the agent worker
`createdAt`	Timestamp

Writes are transactional: the ledger entry and the monthly AiBudgetSnapshot upsert run in one atomic database transaction. Either both succeed or neither does.
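A minimal sketch of the "both or neither" behaviour, using an in-memory stand-in for the database. The names `State`, `transaction`, `recordCall`, and `monthStartOf` are hypothetical, and the UTC month boundary is an assumption; the real implementation presumably relies on the database's own transaction support.

```typescript
// All names below are hypothetical stand-ins, not Opbox's real schema or ORM.
type LedgerRow = {
  workspaceId: string;
  inputTokens: number;
  outputTokens: number;
  actualCostUsd: number;
  createdAt: Date;
};

type Snapshot = {
  totalInputTokens: number;
  totalOutputTokens: number;
  totalCostUsd: number;
};

type State = { ledger: LedgerRow[]; snapshots: Record<string, Snapshot> };

// Toy transaction: mutate a copy, and commit it only if fn returns normally.
function transaction(state: State, fn: (draft: State) => void): State {
  const snapshots: Record<string, Snapshot> = {};
  for (const [key, snap] of Object.entries(state.snapshots)) {
    snapshots[key] = { ...snap };
  }
  const draft: State = { ledger: [...state.ledger], snapshots };
  fn(draft); // a throw here leaves `state` untouched (rollback)
  return draft;
}

// First day of the row's month (UTC, an assumption), formatted YYYY-MM-01.
function monthStartOf(date: Date): string {
  return date.toISOString().slice(0, 7) + "-01";
}

function recordCall(state: State, row: LedgerRow): State {
  return transaction(state, (draft) => {
    draft.ledger.push(row); // 1. append the ledger row
    if (!Number.isFinite(row.actualCostUsd)) {
      throw new Error("invalid cost"); // simulated mid-transaction failure
    }
    const key = `${row.workspaceId}:${monthStartOf(row.createdAt)}`;
    const prev = draft.snapshots[key] ?? {
      totalInputTokens: 0,
      totalOutputTokens: 0,
      totalCostUsd: 0,
    };
    // 2. upsert the monthly running total
    draft.snapshots[key] = {
      totalInputTokens: prev.totalInputTokens + row.inputTokens,
      totalOutputTokens: prev.totalOutputTokens + row.outputTokens,
      totalCostUsd: prev.totalCostUsd + row.actualCostUsd,
    };
  });
}
```

If the callback throws after the ledger append, the caller keeps the previous state, so the ledger and the snapshot never disagree.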

The Snapshot

AiBudgetSnapshot is a denormalised running total per (workspaceId, month). Fields:

totalInputTokens, totalOutputTokens, totalCostUsd
monthStart (YYYY-MM-01)

The snapshot exists so that checkBudget() can answer "are we over the cap?" with one row read instead of an aggregate over the ledger. The transactional write keeps the snapshot in sync with the ledger automatically.

Drift Reconciliation

The snapshot can drift from the ledger if:

A failed transaction left a partial state (rare; the transactional write makes this near-impossible).
A historical migration adjusted ledger rows.
An admin manually edited rows.

reconcileBudgetSnapshot() walks the ledger for the period and rewrites the snapshot from it. Run it nightly via cron, or on demand from Settings > AI > Cost Control.
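The reconciliation step can be sketched as a pure function over ledger rows. `rebuildSnapshot` and `hasDrifted` are hypothetical names, the row shape is trimmed to the fields that matter here, and bucketing months by UTC is an assumption.

```typescript
// Hypothetical shapes; field names follow the ledger and snapshot descriptions.
type LedgerEntry = {
  workspaceId: string;
  inputTokens: number;
  outputTokens: number;
  actualCostUsd: number;
  createdAt: Date;
};

type SnapshotTotals = {
  monthStart: string; // YYYY-MM-01
  totalInputTokens: number;
  totalOutputTokens: number;
  totalCostUsd: number;
};

// Recompute the totals for one (workspaceId, month) pair from the ledger alone.
function rebuildSnapshot(
  ledger: readonly LedgerEntry[],
  workspaceId: string,
  monthStart: string,
): SnapshotTotals {
  const monthPrefix = monthStart.slice(0, 7); // "YYYY-MM"
  const totals: SnapshotTotals = {
    monthStart,
    totalInputTokens: 0,
    totalOutputTokens: 0,
    totalCostUsd: 0,
  };
  for (const row of ledger) {
    if (row.workspaceId !== workspaceId) continue;
    if (row.createdAt.toISOString().slice(0, 7) !== monthPrefix) continue;
    totals.totalInputTokens += row.inputTokens;
    totals.totalOutputTokens += row.outputTokens;
    totals.totalCostUsd += row.actualCostUsd;
  }
  return totals;
}

// True when the stored snapshot disagrees with the rebuilt one.
function hasDrifted(
  stored: SnapshotTotals,
  rebuilt: SnapshotTotals,
  epsilonUsd: number = 1e-9,
): boolean {
  return (
    stored.totalInputTokens !== rebuilt.totalInputTokens ||
    stored.totalOutputTokens !== rebuilt.totalOutputTokens ||
    Math.abs(stored.totalCostUsd - rebuilt.totalCostUsd) > epsilonUsd
  );
}
```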

Budget Gate

checkBudget(workspaceId, estimatedCostUsd) runs before every LLM call. It reads the current month's snapshot, adds the estimate, and rejects if the total would exceed the workspace's hard cap.

Check	Behaviour
Hard cap	Fail-closed - call is rejected with `BUDGET_EXCEEDED`.
Soft limit (configurable %, default 80%)	Logs but doesn't block - emits a warning event.
Cost recording itself	Fail-open - if writing the ledger row errors, the call still proceeds. Availability over consistency for recording; consistency for enforcement.

The asymmetry is deliberate: a transient DB hiccup shouldn't block a customer's chat session, but a clear over-cap state should.
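The gate's decision logic, and the fail-open wrapper around recording, might look like the sketch below. `evaluateBudget` and `recordFailOpen` are hypothetical names. Two points are assumptions rather than documented behaviour: that the soft limit is a percentage of the hard cap, and that it is compared against the projected total (current spend plus estimate).

```typescript
type BudgetDecision =
  | { allowed: true; softLimitWarning: boolean }
  | { allowed: false; code: "BUDGET_EXCEEDED" };

// Pure decision step: the caller supplies the snapshot total for the month.
function evaluateBudget(
  currentMonthSpendUsd: number,
  estimatedCostUsd: number,
  hardCapUsd: number | null, // null = cap unset, so nothing is blocked
  softLimitPct: number = 80,
): BudgetDecision {
  if (hardCapUsd === null) return { allowed: true, softLimitWarning: false };

  const projected = currentMonthSpendUsd + estimatedCostUsd;
  if (projected > hardCapUsd) {
    return { allowed: false, code: "BUDGET_EXCEEDED" }; // fail-closed
  }

  // Assumption: the soft limit is a percentage of the hard cap.
  const softThreshold = hardCapUsd * (softLimitPct / 100);
  return { allowed: true, softLimitWarning: projected >= softThreshold };
}

// Fail-open recording: a failed write is logged and swallowed, never rethrown.
function recordFailOpen(
  write: () => void,
  logError: (err: unknown) => void,
): boolean {
  try {
    write();
    return true;
  } catch (err) {
    logError(err);
    return false;
  }
}
```

Keeping enforcement and recording in separate functions makes the asymmetry explicit: one path can reject the call, the other can only log.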

Key Source Attribution

The keySource column is the heart of attribution. The BYOK resolver returns it; the ledger writes it; the breakdown reads it.

Tier	KeySource	What it means for billing
User (Personal BYOK)	`USER_KEY`	The user's own provider key paid. Bypasses budget gate by default - personal spend isn't on the org's cap.
Workspace	`WORKSPACE_KEY`	The workspace's override key paid. Counts against the workspace cap.
Org	`ORG_KEY`	The org's primary key paid. Counts against the workspace cap.
Server	`SERVER_KEY`	A server-level env-var key paid. Self-hosted / dev.

The "USER_KEY bypasses the gate" rule means that a user with a personal key can keep making calls even when the workspace is over its cap. The org's cap covers spend the org pays for; when a user pays their own way, the cap doesn't apply.

This rule is also why the allowPersonalKeys toggle matters. An org that needs hard guarantees on the cap will turn personal keys off so every call funnels through tiers that do hit the cap.
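A sketch of how the toggle and the bypass rule interact. `resolveKeySource` and `isSubjectToBudgetGate` are hypothetical names. The precedence order (user, then workspace, then org, then server) is an assumption inferred from the order of the tier table, and treating `SERVER_KEY` as gated is also an assumption, since the table does not say either way.

```typescript
type KeySource = "USER_KEY" | "WORKSPACE_KEY" | "ORG_KEY" | "SERVER_KEY";

// Keys configured at each tier; a missing entry means no key at that tier.
type ConfiguredKeys = Partial<Record<KeySource, string>>;

// Assumed precedence: most specific tier first.
const TIER_ORDER: readonly KeySource[] = [
  "USER_KEY",
  "WORKSPACE_KEY",
  "ORG_KEY",
  "SERVER_KEY",
];

function resolveKeySource(
  keys: ConfiguredKeys,
  allowPersonalKeys: boolean,
): KeySource | null {
  for (const source of TIER_ORDER) {
    // With personal keys disabled, the user tier is skipped entirely.
    if (source === "USER_KEY" && !allowPersonalKeys) continue;
    if (keys[source]) return source;
  }
  return null;
}

// Only personal-key spend bypasses the gate (SERVER_KEY gating is assumed).
function isSubjectToBudgetGate(source: KeySource): boolean {
  return source !== "USER_KEY";
}
```

With `allowPersonalKeys` off, the same user falls through to a tier whose spend is gated, which is what gives the org its hard guarantee.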

Spend Breakdown

The spend breakdown endpoint pivots the ledger two ways for a given period - by operation (chat, agent, extraction, embedding, other) and by key source (USER_KEY, WORKSPACE_KEY, ORG_KEY, SERVER_KEY):

{
  "byOperation": {
    "chat":       { "tokens": 42000, "costUsd": 0.45 },
    "agent":      { "tokens": 12000, "costUsd": 0.18 },
    "extraction": { "tokens":  3200, "costUsd": 0.04 },
    "embedding":  { "tokens":   800, "costUsd": 0.01 },
    "other":      { "tokens":     0, "costUsd": 0.00 }
  },
  "byKeySource": {
    "USER_KEY":      { "tokens":  4000, "costUsd": 0.04 },
    "WORKSPACE_KEY": { "tokens": 36000, "costUsd": 0.42 },
    "ORG_KEY":       { "tokens": 18000, "costUsd": 0.22 },
    "SERVER_KEY":    { "tokens":     0, "costUsd": 0.00 }
  }
}

Surfaced via:

GET /api/settings/ai-usage?view=summary
Settings > AI > Cost Control page - rendered as two breakdown rows ("By feature" + "Attributed to").
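The two pivots can be sketched as one generic grouping function applied twice. `pivotBy` and `spendBreakdown` are hypothetical names, and counting `tokens` as input plus output is an assumption about how the endpoint aggregates.

```typescript
type OperationType = "chat" | "agent" | "extraction" | "embedding" | "other";
type KeySource = "USER_KEY" | "WORKSPACE_KEY" | "ORG_KEY" | "SERVER_KEY";
type Bucket = { tokens: number; costUsd: number };

type Row = {
  operationType: OperationType;
  keySource: KeySource;
  inputTokens: number;
  outputTokens: number;
  actualCostUsd: number;
};

const OPERATIONS: readonly OperationType[] = [
  "chat", "agent", "extraction", "embedding", "other",
];
const KEY_SOURCES: readonly KeySource[] = [
  "USER_KEY", "WORKSPACE_KEY", "ORG_KEY", "SERVER_KEY",
];

// Group rows by one dimension; every key is present even when it has no spend.
function pivotBy<K extends string>(
  rows: readonly Row[],
  keys: readonly K[],
  pick: (row: Row) => K,
): Record<K, Bucket> {
  const out = {} as Record<K, Bucket>;
  for (const key of keys) out[key] = { tokens: 0, costUsd: 0 };
  for (const row of rows) {
    const bucket = out[pick(row)];
    bucket.tokens += row.inputTokens + row.outputTokens; // assumed definition
    bucket.costUsd += row.actualCostUsd;
  }
  return out;
}

function spendBreakdown(rows: readonly Row[]) {
  return {
    byOperation: pivotBy(rows, OPERATIONS, (r) => r.operationType),
    byKeySource: pivotBy(rows, KEY_SOURCES, (r) => r.keySource),
  };
}
```

Because both pivots run over the same rows, their token and cost totals always agree, as they do in the sample response.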

Daily Usage

A daily-usage endpoint returns a per-day token and cost series for charting, and a top-users-by-spend report lists the heaviest spenders for admins.

Per-Model Rates

A per-model rate table maps each supported model to its per-token pricing. For example, claude-sonnet-4-5-20250929 is billed at $3.00 per million input tokens and $15.00 per million output tokens. Cost-per-call is computed from these rates plus the input/output token counts the provider returns.

A regression test guards against silent under-billing: the resolver's default models must always be present in the rate table. Without this guard, a workspace with no model override would fall back to a default whose cost was $0 - a silent free ride.
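As a worked illustration, the sketch below computes cost-per-call from the quoted rates and shows the shape of the guard. The table layout, the helper names `computeCostUsd` and `missingRates`, and the choice to throw on an unknown model are all assumptions; only the `AI_COST_RATES` name and the Sonnet rates come from this page.

```typescript
// Assumed shape: USD per million tokens, keyed by model id.
type ModelRate = { inputUsdPerMTok: number; outputUsdPerMTok: number };

const AI_COST_RATES: Record<string, ModelRate> = {
  "claude-sonnet-4-5-20250929": { inputUsdPerMTok: 3.0, outputUsdPerMTok: 15.0 },
};

function computeCostUsd(
  model: string,
  inputTokens: number,
  outputTokens: number,
): number {
  const rate: ModelRate | undefined = AI_COST_RATES[model];
  if (!rate) {
    // Assumption: fail loudly rather than silently bill $0.
    throw new Error(`No rate configured for model: ${model}`);
  }
  return (
    (inputTokens * rate.inputUsdPerMTok + outputTokens * rate.outputUsdPerMTok) /
    1e6
  );
}

// Guard used by a regression test: every default model must have a rate.
function missingRates(defaultModels: readonly string[]): string[] {
  return defaultModels.filter((model) => !(model in AI_COST_RATES));
}
```

For a call with 1,000 input tokens and 500 output tokens on `claude-sonnet-4-5-20250929`, this gives (1,000 x 3.00 + 500 x 15.00) / 1,000,000 = $0.0105. A regression test would assert that `missingRates(...)` returns an empty list for the resolver's defaults.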

Configuration

Hard cap and soft limit are configured per workspace at Settings > AI > Cost Control:

Setting	Purpose
Hard cap (USD/month)	Fail-closed budget gate. Default: unset (no cap).
Soft limit %	Warning threshold (default 80%). Logs but doesn't block.
Reconcile drift	Manual button to run `reconcileBudgetSnapshot()` for the current month.

The informational monthlyTokenCap on the BYOK config tiers is not the same thing - it's purely cosmetic and not enforced. The hard cap above is what actually gates calls.

Anthropic Prompt Caching

Opbox sends the AI tool catalog and the stable portion of the system prompt with cache_control: { type: 'ephemeral' }, which tells Anthropic to cache that prefix for 5 minutes. Subsequent requests within the window pay 0.10x the normal input-token rate for the cached portion.

In practice, ~95% of input tokens are cache reads on a workspace with active chat use. The savings show up automatically in actualCostUsd because the per-call price already accounts for cache reads vs. cache creation.

The ledger captures the breakdown in metadata:

Metadata field	Meaning
`cacheCreationInputTokens`	Tokens billed at the cache-creation rate (1.25x normal input). Paid once per cache write.
`cacheReadInputTokens`	Tokens billed at the cache-read rate (0.10x normal input). Paid on every cache hit.
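To make the multipliers concrete, the sketch below prices the input side of a single call. `inputCostUsd` is a hypothetical helper, and passing uncached tokens as a separate count (rather than deriving them from `inputTokens`) is an assumption about how the counts are stored.

```typescript
const CACHE_CREATION_MULTIPLIER = 1.25; // cache write, paid once per write
const CACHE_READ_MULTIPLIER = 0.1; // cache hit, paid on every read

type InputUsage = {
  uncachedInputTokens: number; // billed at the normal input rate
  cacheCreationInputTokens: number;
  cacheReadInputTokens: number;
};

// Input-side cost in USD for one call, given the model's normal input rate.
function inputCostUsd(usage: InputUsage, inputUsdPerMTok: number): number {
  const weightedTokens =
    usage.uncachedInputTokens +
    usage.cacheCreationInputTokens * CACHE_CREATION_MULTIPLIER +
    usage.cacheReadInputTokens * CACHE_READ_MULTIPLIER;
  return (weightedTokens * inputUsdPerMTok) / 1e6;
}
```

At $3.00 per million input tokens, a 100,000-token prompt with 95,000 cache reads and 5,000 uncached tokens costs about $0.0435 on the input side, versus $0.30 for the same prompt with no caching.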

Cache hit rate (target: >60%) can be computed over any window:

SELECT
  SUM((metadata->>'cacheReadInputTokens')::bigint) AS cache_reads,
  COALESCE(SUM((metadata->>'cacheCreationInputTokens')::bigint), 0) AS cache_creations,
  -- hit rate = reads / (reads + creations); COALESCE keeps the sum non-NULL
  -- when no row in the window recorded a cache write
  SUM((metadata->>'cacheReadInputTokens')::bigint)::float
    / NULLIF(SUM((metadata->>'cacheReadInputTokens')::bigint)
             + COALESCE(SUM((metadata->>'cacheCreationInputTokens')::bigint), 0), 0) AS hit_rate
FROM ai_cost_ledger
WHERE created_at >= NOW() - INTERVAL '7 days'
  AND metadata->>'cacheReadInputTokens' IS NOT NULL;

The volatile portion of the system prompt (per-request page context, current matter, current document) is intentionally not cached: it changes on every request, so it is kept out of the cached prefix.
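One way to express the stable/volatile split is with separate system blocks, as in the sketch below. The block shape (`type: "text"` with an optional `cache_control`) follows Anthropic's Messages API; `buildSystemBlocks` and its parameters are hypothetical, and how Opbox actually assembles its prompt is not shown here.

```typescript
type SystemBlock = {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
};

// Stable text first with a cache breakpoint, volatile text after it without one.
function buildSystemBlocks(
  stablePrompt: string,
  volatileContext: string,
): SystemBlock[] {
  return [
    // Everything up to and including this block forms the cacheable prefix.
    // Tools precede the system prompt in the prefix order, so they are covered too.
    { type: "text", text: stablePrompt, cache_control: { type: "ephemeral" } },
    // Per-request context comes after the breakpoint and is never cached.
    { type: "text", text: volatileContext },
  ];
}
```

The returned array would be passed as the `system` field of the request. Because the volatile block sits after the breakpoint, a change in page context does not alter the prefix that earlier requests cached.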