
Voice transcription

Transcription is a real, production feature backed by an external argus-transcription service using Whisper with VAD (voice-activity detection). The browser-side SpeechTranscriptionService receives streaming partials + finalised utterances over the LiveKit data channel and surfaces them in the PTT transcription bubble, the timeline, and the voice-command detector.

Architecture

  • argus-transcription — server-side Whisper + VAD pipeline. Joins the mission’s LiveKit room as a bot participant. Subscribes to PTT audio tracks (AUDIO_PTT_TRACK_ID). Produces transcription text.
  • LiveKit data channel — transcription messages flow on a ptt_transcription topic.
  • SpeechTranscriptionService — client-side consumer. Tracks per-peer state, exposes streaming + completed utterances.
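
The data-channel hop can be sketched as a decode step on the consumer side. This is a minimal sketch, not the real client code: the `TranscriptionMessage` field names are assumptions about the wire format, and only the topic name comes from this page.

```typescript
// Hypothetical wire shape for the "ptt_transcription" topic; field names are
// assumptions, not a confirmed schema.
interface TranscriptionMessage {
  peerId: string;    // speaking participant
  missionId: string; // mission the room belongs to
  text: string;      // partial-so-far or final text
  isFinal: boolean;  // false = streaming partial, true = committed
}

// Decode a LiveKit data-channel payload, keeping only transcription traffic.
function decodeTranscription(
  topic: string,
  payload: Uint8Array,
): TranscriptionMessage | null {
  if (topic !== "ptt_transcription") return null;
  return JSON.parse(new TextDecoder().decode(payload)) as TranscriptionMessage;
}
```

Filtering on the topic first keeps unrelated data-channel traffic (telemetry, chat, etc.) out of the transcription path.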

Live partial vs finalised

Each transcription message includes isFinal: boolean:

  • isFinal: false — partial. The text updates per VAD segment as the speaker continues. The transcription bubble shows it in real-time.
  • isFinal: true — committed. Emitted on transcriptionComplete$ Subject, recorded to blackbox, added to the operation timeline as a voice.utterance event, and passed to the voice-command detector.

Service state model

SpeechTranscriptionService exposes:

  • liveTranscript — Observable of the currently-streaming partial.
  • lastTranscription — the most-recent finalised utterance.
  • transcriptionComplete$ — Subject fired on every finalised utterance.
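
A stand-in for that surface, with the RxJS Subject reduced to a bare subscribe/next pair so the sketch is self-contained. The member names mirror the list above; the utterance shape and the `commitUtterance` helper are assumptions for illustration.

```typescript
// Minimal RxJS-style Subject: just enough for subscribe + emit.
class Subject<T> {
  private subs: Array<(v: T) => void> = [];
  subscribe(fn: (v: T) => void): void { this.subs.push(fn); }
  next(v: T): void { for (const s of this.subs) s(v); }
}

interface FinalUtterance { peerId: string; text: string; }

class SpeechTranscriptionServiceSketch {
  liveTranscript = new Subject<string>();          // currently-streaming partial
  lastTranscription: FinalUtterance | null = null; // most-recent finalised utterance
  transcriptionComplete$ = new Subject<FinalUtterance>();

  // Hypothetical entry point, called when a message with isFinal: true arrives.
  commitUtterance(u: FinalUtterance): void {
    this.lastTranscription = u;
    this.transcriptionComplete$.next(u);
  }
}
```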

Per peer the service tracks:

  • peerId — who’s speaking.
  • text — current accumulated text.
  • startedAt — when the burst began.
  • missionId — for audit.
  • status — idle / streaming / done.
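
The per-peer record can be written down as a small interface. The field names come from the list above; the exact types (epoch-ms timestamp, string union for status) are assumptions, and `beginBurst` is a hypothetical helper showing the idle → streaming transition.

```typescript
type TranscriptionStatus = "idle" | "streaming" | "done";

// Per-peer tracking record mirroring the fields above.
interface PeerTranscriptionState {
  peerId: string;    // who's speaking
  text: string;      // current accumulated text
  startedAt: number; // epoch ms when the burst began
  missionId: string; // for audit
  status: TranscriptionStatus;
}

// Hypothetical constructor for the moment a PTT burst starts.
function beginBurst(peerId: string, missionId: string, now: number): PeerTranscriptionState {
  return { peerId, text: "", startedAt: now, missionId, status: "streaming" };
}
```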

Languages

Whisper auto-detects language. Supported out-of-the-box:

  • English.
  • Spanish.
  • Arabic (including transliterated).
  • Other major languages Whisper handles — French, German, Portuguese, etc.

There’s no per-user language-override UI — Whisper’s auto-detection is strong enough that operators don’t typically need to configure it.

Redaction

The webapp does not currently redact transcription output client-side. Your org admin can enable server-side PII redaction under Admin → Organisation → Integrations → Transcription; when enabled, text matching phone numbers, license plates, credit cards, or operator-configured regexes is replaced with [REDACTED] before it arrives at the browser.
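
In spirit, the server-side pass is a sequence of pattern substitutions. The patterns below are examples only, not the shipped rules; the operator-configured regexes would be appended to the built-in list.

```typescript
// Example built-in patterns (illustrative, not the real rule set).
const REDACTION_PATTERNS: RegExp[] = [
  /\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b/g, // US-style phone numbers
  /\b(?:\d[ -]?){13,16}\b/g,            // credit-card-like digit runs
];

// Apply built-in plus operator-configured patterns before the text
// leaves the server.
function redact(text: string, operatorPatterns: RegExp[] = []): string {
  return [...REDACTION_PATTERNS, ...operatorPatterns].reduce(
    (t, re) => t.replace(re, "[REDACTED]"),
    text,
  );
}
```

Ordering matters only in that earlier substitutions can mask later matches; a real implementation would also need locale-aware patterns for phone numbers and plates.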

Voice commands feed off this

The voice-command detector subscribes to transcriptionComplete$ and matches against its 17-command pattern list. See that page for the full command list. Voice commands only fire on finalised utterances — never on partial text — so the system doesn’t act on half-formed speech.
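
The finals-only gate can be sketched as below. The two commands shown are placeholders, not entries from the real 17-command list; only the "never fire on partials" rule comes from this page.

```typescript
// Placeholder command patterns (NOT the real 17-command list).
const COMMANDS: Array<{ name: string; pattern: RegExp }> = [
  { name: "mark-target", pattern: /\bmark target\b/i },
  { name: "open-channel", pattern: /\bopen channel\b/i },
];

// Return the first matching command name, or null. Partial text never matches.
function detectCommand(text: string, isFinal: boolean): string | null {
  if (!isFinal) return null; // never act on half-formed speech
  const hit = COMMANDS.find(c => c.pattern.test(text));
  return hit ? hit.name : null;
}
```

In the real service this function would be the subscriber on `transcriptionComplete$`, which only ever carries finalised utterances, so the `isFinal` guard is enforced upstream as well.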

Audit + replay

  • Timeline tile logs every finalised utterance as a voice.utterance event with the full text and speaker.
  • Blackbox recorder stores the raw utterance plus the audio segment URL for replay.
  • Comms tile shows each PTT recording with its transcription; the comms-ptt tile has inline playback with transcription preview.

Cost

Whisper streaming is the dominant cost, billed per second of active speech. Usage is metered per operation and visible under Admin → Org → Usage. When nobody’s talking, the cost is zero.
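
The metering model reduces to simple arithmetic. The rate below is a placeholder, not a published price; the only fact taken from this page is that billing scales with active speech and is zero when nobody talks.

```typescript
// Back-of-envelope cost estimate for one operation.
// ratePerSecond is a placeholder, not a real price.
function speechCost(activeSpeechSeconds: number, ratePerSecond: number): number {
  return activeSpeechSeconds * ratePerSecond; // zero when nobody is talking
}
```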

Known limitations

  • No client-side redaction — relies on server-side config.
  • No speaker diarisation — one PTT = one speaker (by design, since PTT is push-to-talk). In always-on channels there’s no per-speaker labelling.
  • No real-time translation — language detection only; there’s no automatic translation into a different output language.