Voice transcription
Transcription is a real, production feature backed by an external
argus-transcription service using Whisper with VAD (voice-activity
detection). The browser-side SpeechTranscriptionService receives streaming
partials and finalised utterances over the LiveKit data channel, surfaces
them in the PTT transcription bubble and the timeline, and feeds them to
the voice-command detector.
Architecture
- argus-transcription — server-side Whisper + VAD pipeline. Joins the
  mission’s LiveKit room as a bot participant. Subscribes to PTT audio
  tracks (AUDIO_PTT_TRACK_ID). Produces transcription text.
- LiveKit data channel — transcription messages flow on a
  ptt_transcription topic.
- SpeechTranscriptionService — client-side consumer. Tracks per-peer
  state, exposes streaming + completed utterances.
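The flow above can be sketched as a small decode step on the client. This is a minimal sketch: the payload is assumed to be JSON, and every field name except `isFinal` is hypothetical, not the service's confirmed wire format.

```typescript
// Assumed shape of a message arriving on the ptt_transcription topic.
// Only isFinal is documented; the other field names are illustrative.
interface TranscriptionMessage {
  peerId: string;    // LiveKit participant identity of the speaker
  text: string;      // accumulated utterance text so far
  isFinal: boolean;  // false = streaming partial, true = committed
  missionId: string; // mission the room belongs to
}

// Decode a raw data-channel payload (Uint8Array) into a message.
function decodeTranscription(payload: Uint8Array): TranscriptionMessage {
  return JSON.parse(new TextDecoder().decode(payload)) as TranscriptionMessage;
}
```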
Live partial vs finalised
Each transcription message includes isFinal: boolean:
- isFinal: false — partial. The text updates per VAD segment as the
  speaker continues. The transcription bubble shows it in real time.
- isFinal: true — committed. Emitted on the transcriptionComplete$
  Subject, recorded to blackbox, added to the operation timeline as a
  voice.utterance event, and passed to the voice-command detector.
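The partial/final split amounts to a two-way route on the client. A minimal sketch, assuming hypothetical handler names; only the `isFinal` contract comes from the source:

```typescript
interface Utterance { peerId: string; text: string; isFinal: boolean; }

// Route a message: partials update the bubble in place, finals commit.
function route(
  msg: Utterance,
  onPartial: (u: Utterance) => void, // live transcription bubble
  onFinal: (u: Utterance) => void,   // timeline, blackbox, command detector
): void {
  if (msg.isFinal) {
    onFinal(msg);
  } else {
    onPartial(msg);
  }
}
```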
Service state model
SpeechTranscriptionService exposes:
- liveTranscript — Observable of the currently-streaming partial.
- lastTranscription — the most-recent finalised utterance.
- transcriptionComplete$ — Subject fired on every finalised utterance.
Per peer the service tracks:
- peerId — who’s speaking.
- text — current accumulated text.
- startedAt — when the burst began.
- missionId — for audit.
- status — idle / streaming / done.
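The per-peer state above can be sketched as a map keyed by peer, with a small reducer applying each incoming message. Field names follow the list above; the reducer logic itself is an assumption, not the service's actual implementation:

```typescript
type Status = "idle" | "streaming" | "done";

interface PeerTranscription {
  peerId: string;
  text: string;
  startedAt: number;  // epoch ms when the burst began
  missionId: string;
  status: Status;
}

// Apply one message: a partial starts or continues a burst, a final ends it.
function applyMessage(
  state: Map<string, PeerTranscription>,
  msg: { peerId: string; text: string; isFinal: boolean; missionId: string },
  now: number = Date.now(),
): void {
  const existing = state.get(msg.peerId);
  state.set(msg.peerId, {
    peerId: msg.peerId,
    text: msg.text,
    // Keep the original start time while the burst is still streaming.
    startedAt: existing?.status === "streaming" ? existing.startedAt : now,
    missionId: msg.missionId,
    status: msg.isFinal ? "done" : "streaming",
  });
}
```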
Languages
Whisper auto-detects language. Supported out-of-the-box:
- English.
- Spanish.
- Arabic (including transliterated).
- Other major languages Whisper handles — French, German, Portuguese, etc.
There’s no per-user language-override UI — Whisper’s auto-detection is strong enough that operators don’t typically need to configure it.
Redaction
The webapp does not currently redact transcription output client-side.
Your org admin can enable server-side PII redaction under
Admin → Organisation → Integrations → Transcription; when enabled,
patterns matching phone numbers, license plates, credit cards, and
operator-configured regex replace with [REDACTED] before the text arrives
at the browser.
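The kind of server-side substitution described above can be sketched as a pass over a pattern list. The patterns here are illustrative examples only, not the product's actual rules, and real PII redaction needs far more careful patterns than this:

```typescript
// Example patterns standing in for the configurable server-side rules.
const REDACTION_PATTERNS: RegExp[] = [
  /\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b/g, // US-style phone numbers
  /\b(?:\d[ -]?){13,16}\b/g,            // credit-card-like digit runs
];

// Replace every match with [REDACTED]; extra operator-configured regexes
// can be appended.
function redact(text: string, extra: RegExp[] = []): string {
  return [...REDACTION_PATTERNS, ...extra].reduce(
    (acc, re) => acc.replace(re, "[REDACTED]"),
    text,
  );
}
```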
Voice commands feed off this
The voice-command detector subscribes to transcriptionComplete$ and matches against its 17-command pattern list. See that page for the full command list. Voice commands only fire on finalised utterances — never on partial text — so the system doesn’t act on half-formed speech.
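A minimal sketch of that subscription, with a tiny stand-in Subject in place of the real RxJS one. The two commands shown are made-up examples, not entries from the actual 17-command list:

```typescript
// Minimal Subject stand-in for transcriptionComplete$.
class Subject<T> {
  private listeners: Array<(v: T) => void> = [];
  subscribe(fn: (v: T) => void): void { this.listeners.push(fn); }
  next(v: T): void { this.listeners.forEach((fn) => fn(v)); }
}

const transcriptionComplete$ = new Subject<{ text: string; isFinal: boolean }>();

// Illustrative commands only; see the voice-command page for the real list.
const COMMANDS: Array<{ name: string; pattern: RegExp }> = [
  { name: "mark-target", pattern: /\bmark target\b/i },
  { name: "end-mission", pattern: /\bend mission\b/i },
];

const fired: string[] = [];
transcriptionComplete$.subscribe((u) => {
  if (!u.isFinal) return; // never act on partial text
  for (const cmd of COMMANDS) {
    if (cmd.pattern.test(u.text)) fired.push(cmd.name);
  }
});
```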
Audit + replay
- Timeline tile logs every finalised utterance as a
voice.utteranceevent with the full text and speaker. - Blackbox recorder stores the raw utterance plus the audio segment URL for replay.
- Comms tile shows each PTT recording with its transcription; the comms-ptt tile has inline playback with transcription preview.
Cost
Whisper streaming is the dominant cost: it is billed per second of active speech, metered per operation, and visible under Admin → Org → Usage. When nobody’s talking, cost is zero.
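Because only active speech is billed, metering reduces to summing VAD segment durations. A back-of-envelope sketch; the segment shape and rounding rule are assumptions:

```typescript
interface SpeechSegment { startMs: number; endMs: number; }

// Sum active-speech time across VAD segments; silence between them is free.
function billedSeconds(segments: SpeechSegment[]): number {
  const ms = segments.reduce((sum, s) => sum + (s.endMs - s.startMs), 0);
  return Math.ceil(ms / 1000);
}
```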
Known limitations
- No client-side redaction — relies on server-side config.
- No speaker diarisation — one PTT = one speaker (by design, since PTT is push-to-talk). In always-on channels there’s no per-speaker labelling.
- No real-time translation — language detection only; transcripts are not automatically translated into a different output language.