DJI on-aircraft AI detections

DJI Dock 3 aircraft include an on-board AI target-recognition feature (persons, cars, boats). When enabled, the aircraft embeds detection bounding boxes into the H.264 video bitstream as SEI messages (Supplemental Enhancement Information). ARGUS parses those SEI messages in the browser and renders live overlay boxes on the drone-stream tile.

Enable AI on the aircraft

  1. Open the drone-stream tile for a Dock 3 aircraft in the mission ops view.
  2. Open the Drone settings drawer (gear icon, top-right of the tile).
  3. Scroll to Perception → DJI on-aircraft AI and toggle it on.
  4. ARGUS calls the DRC service drc_ai_identify with on=1, routed through argus-dji as an MQTT publish to the aircraft. SEI messages begin appearing in the bitstream within ~1–2 seconds.

Optional: on Dock 3 with multiple recognition models loaded, pick the active model from the dropdown below the toggle. ARGUS publishes drc_ai_model_select with the chosen index. Default is model 0 (the general-purpose person + vehicle + boat detector).
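
The two publishes above reduce to small JSON envelopes. A sketch only: the method strings come from this page, but the { method, data } shape and field names are assumptions modelled on DJI's other DRC services, not a confirmed schema.

```typescript
// Hypothetical builders for the DRC messages named above. The `method` strings
// are from this doc; the envelope shape and field names are assumptions.
function buildAiIdentify(on: boolean): { method: string; data: { on: number } } {
  return { method: "drc_ai_identify", data: { on: on ? 1 : 0 } };
}

function buildAiModelSelect(index: number): { method: string; data: { index: number } } {
  return { method: "drc_ai_model_select", data: { index } };
}
```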

How the SEI pipeline works

DJI’s AI detections are not a separate data channel — they ride inside the H.264 bitstream itself. The decoder sees a normal video frame; ARGUS extracts the detection payload at the byte level:

  1. Annex-B scan. The incoming buffer is an H.264 access unit — one or more NAL units separated by 00 00 01 or 00 00 00 01 start codes. iterateNalUnits yields each NAL body.
  2. NAL type 6 (SEI). Only NAL units where (byte[0] & 0x1f) === 6 carry SEI. Others are skipped.
  3. Emulation-prevention strip. Per the H.264 spec (§7.3.1, emulation_prevention_three_byte), the encoder inserts a 0x03 byte into any would-be 00 00 0x sequence so the payload can never resemble a start code, which is why 00 00 03 runs appear on the wire. stripEmulationPrevention removes those inserted 03 bytes before parsing — reading the raw NAL bytes directly would otherwise give wrong offsets.
  4. Multi-byte length decode. H.264 §7.3.2.3.1 encodes both payload_type and payload_size as a run of 0xFF bytes (each contributing 255) terminated by a final byte < 0xFF; the value is their sum. readMultiByteLength decodes both.
  5. DJI payload type. DJI uses payload_type = 5 (user_data_unregistered); some dev builds also emit type 4.
  6. Transport marker. The first byte of the DJI payload distinguishes the transport:
    • 0xF5 → WebRTC (post-Dock 3 firmware, the modern path).
    • 0x65 → Agora (legacy transport, kept for old firmware).
    • anything else → raw fallback, the caller can inspect the unparsed bytes.
  7. DJI struct decode. Little-endian layout — optional DJIF magic, version byte, target count, 64-bit timestamp, then N × (target index, object type, confidence, x, y, w, h, state, reserved). Coordinates are 12-bit fixed-point 0–4095, frame-relative — not pixel coords. The parser divides by 4095 to return normalised 0–1 rectangles so the overlay renderer doesn’t need to know the video resolution.
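
The byte-level steps above can be sketched in TypeScript. The three helper names come from the text; the bodies and the extractSeiPayloads wrapper are illustrative, not the shipped implementation (for instance, the 0x80 stop check is a simplification of rbsp_trailing_bits handling).

```typescript
// 1. Annex-B scan: yield each NAL body between 00 00 01 / 00 00 00 01 start codes.
function* iterateNalUnits(buf: Uint8Array): Generator<Uint8Array> {
  const starts: number[] = [];
  for (let i = 0; i + 2 < buf.length; ) {
    if (buf[i] === 0 && buf[i + 1] === 0 && buf[i + 2] === 1) { starts.push(i + 3); i += 3; }
    else i++;
  }
  for (let s = 0; s < starts.length; s++) {
    let end = s + 1 < starts.length ? starts[s + 1] - 3 : buf.length;
    if (end > starts[s] && buf[end - 1] === 0) end--; // trailing 0 of a 4-byte start code
    yield buf.subarray(starts[s], end);
  }
}

// 3. Strip emulation-prevention bytes: 00 00 03 becomes 00 00.
function stripEmulationPrevention(nal: Uint8Array): Uint8Array {
  const out: number[] = [];
  for (let i = 0; i < nal.length; i++) {
    if (i >= 2 && nal[i] === 3 && nal[i - 1] === 0 && nal[i - 2] === 0) continue;
    out.push(nal[i]);
  }
  return Uint8Array.from(out);
}

// 4. Decode payload_type / payload_size: 255 per leading 0xFF byte plus the final byte.
function readMultiByteLength(buf: Uint8Array, offset: number): { value: number; next: number } {
  let value = 0;
  while (buf[offset] === 0xff) { value += 255; offset++; }
  return { value: value + buf[offset], next: offset + 1 };
}

// 2 + 5. Pull user_data_unregistered (type 5) SEI payloads out of one access unit.
function extractSeiPayloads(accessUnit: Uint8Array): Uint8Array[] {
  const found: Uint8Array[] = [];
  for (const nal of iterateNalUnits(accessUnit)) {
    if ((nal[0] & 0x1f) !== 6) continue; // only NAL type 6 (SEI)
    const rbsp = stripEmulationPrevention(nal.subarray(1));
    let off = 0;
    while (off < rbsp.length && rbsp[off] !== 0x80) { // 0x80 = rbsp trailing bits
      const t = readMultiByteLength(rbsp, off);
      const s = readMultiByteLength(rbsp, t.next);
      if (t.value === 5) found.push(rbsp.subarray(s.next, s.next + s.value));
      off = s.next + s.value;
    }
  }
  return found;
}
```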

All of this runs in pure TypeScript — no native decode dependency.
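
The step-7 coordinate normalisation reduces to a division per field (names here are illustrative):

```typescript
// Convert DJI's 12-bit fixed-point frame coordinates (0-4095) into the
// normalised 0-1 rectangle the overlay renderer consumes.
interface NormalisedBox { x: number; y: number; w: number; h: number }

function normaliseBox(x: number, y: number, w: number, h: number): NormalisedBox {
  return { x: x / 4095, y: y / 4095, w: w / 4095, h: h / 4095 };
}
```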

Object types

Value  Type
1      UNKNOWN
2      PERSON
3      CAR
4      BOAT

Object type drives the overlay box colour — person green, car cyan, boat blue, unknown grey.
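
As a sketch, the mapping might look like this (enum values are from the table above; the hex colours are illustrative stand-ins, not the shipped palette):

```typescript
// Object-type codes carried in the DJI payload (values per the table above).
enum DjiObjectType { UNKNOWN = 1, PERSON = 2, CAR = 3, BOAT = 4 }

// Overlay colour per type: person green, car cyan, boat blue, unknown grey.
// The hex values are illustrative, not the shipped palette.
const OVERLAY_COLOUR: Record<DjiObjectType, string> = {
  [DjiObjectType.UNKNOWN]: "#9e9e9e",
  [DjiObjectType.PERSON]: "#00c853",
  [DjiObjectType.CAR]: "#00bcd4",
  [DjiObjectType.BOAT]: "#2962ff",
};
```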

Target states

When the aircraft’s tracking manager is active, each detection carries a per-target state:

Value  State
0      TRACKED — stable lock frame-to-frame.
1      LOST — tracker lost the target; the box will vanish next frame.
2      NEW — first appearance, not yet stable.
3      OBSCURED — partially hidden; tracker is maintaining the last-known position.

State drives the box border style — solid for TRACKED, dashed for NEW, amber outline for OBSCURED, faded for LOST.
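
A minimal sketch of that mapping (state values from the table above; the style strings are hypothetical, not the shipped renderer's):

```typescript
// Per-target tracker states (values per the table above).
enum TargetState { TRACKED = 0, LOST = 1, NEW = 2, OBSCURED = 3 }

// Border style per state; a hypothetical lookup, not the shipped renderer.
const BORDER_STYLE: Record<TargetState, string> = {
  [TargetState.TRACKED]: "solid",
  [TargetState.NEW]: "dashed",
  [TargetState.OBSCURED]: "amber-outline",
  [TargetState.LOST]: "faded",
};
```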

Confidence normalisation

DJI emits confidence in one of two conventions depending on firmware: 0–100 (percent) or 0–10000 (basis points). The parser auto-detects — if the raw value is greater than 100, it is treated as basis points and divided by 10000; otherwise as percent and divided by 100. Output is clamped to [0, 1].
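
The heuristic reduces to a few lines (the function name is illustrative):

```typescript
// Auto-detect DJI's confidence convention and normalise to [0, 1]:
// values <= 100 are read as percent, larger values as basis points.
function normaliseConfidence(raw: number): number {
  const v = raw > 100 ? raw / 10000 : raw / 100;
  return Math.min(1, Math.max(0, v));
}
```

Note the inherent ambiguity of the heuristic: a basis-point value of 100 or less is indistinguishable from a percent value, which the auto-detect accepts in exchange for not needing a firmware version check.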

Overlay rendering

Each detection renders as a coloured box plus a corner label PERSON 87% (or similar). Clicking a box selects the target — its targetIndex becomes the active tracker. Subsequent actions:

  • Optional spotlight-zoom (Dock 3 with spotlight payload) — ARGUS calls drc_ai_spotlight_zoom_track with the selected target index. The aircraft’s gimbal auto-tracks and the spotlight follows.
  • Double-click a box — drop a map flag at the detection’s world-space centroid (ray-cast from camera pose through the box centre onto the Cesium terrain). Useful for marking a person of interest or a suspect vehicle.
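
Because detections arrive as normalised rectangles, click selection is a simple hit-test against the box scaled to the video element's size. A sketch, with illustrative names:

```typescript
interface NormBox { x: number; y: number; w: number; h: number }

// Hit-test a click (in video-element pixels) against a normalised detection box.
function hitTest(box: NormBox, px: number, py: number, videoW: number, videoH: number): boolean {
  const x = box.x * videoW, y = box.y * videoH;
  const w = box.w * videoW, h = box.h * videoH;
  return px >= x && px <= x + w && py >= y && py <= y + h;
}
```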

Upstream wiring — partial

The downstream half of this pipeline (SEI parse → detection stream → overlay render) is wired and ships. The upstream half — extracting the H.264 access-unit bytes out of a live WebRTC receiver so the parser can ingest them — is partial. Two per-browser paths exist, and the correct one is picked by the view component that owns the stream:

  • RTCRtpScriptTransform (Chrome, Edge, Firefox-nightly) — attach a worker transform to the incoming RTCRtpReceiver. The worker receives RTCEncodedVideoFrames whose .data is the raw H.264 NAL bytes. Feed them into DjiAiDetectionsService.ingestAnnexB(buf, meta).
  • VideoDecoder + MediaStreamTrackProcessor (Chrome + Edge today, Safari 17+) — pull encoded chunks on the main thread, parse SEI, then re-decode normally for display. More portable but costs one main-thread decode hop.
  • Server-side parse (fallback) — argus-dji subscribes to the WHIP ingress, extracts SEI from RTP, and publishes parsed detections on the TACLINK data channel. Works in every browser but needs a C/GStreamer H.264 parser; scheduled separately if the browser paths prove fragile.

Aspirational until upstream is finalised. Today, detections render on browsers where the RTCRtpScriptTransform path is wired for the specific dock firmware family (Dock 3 post-04.00). On Safari, the VideoDecoder path is the planned fallback. Expect gaps on older browsers until the server-side path lands.
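
On the RTCRtpScriptTransform path, the worker-side tap is essentially a pass-through transform that copies bytes out before re-enqueueing the frame. A minimal sketch: ingestAnnexB is the service method named above, the mock-friendly types are assumptions, and the commented wiring follows the standard onrtctransform pattern.

```typescript
// Pass-through tap for encoded video frames: hand the Annex-B bytes to the SEI
// parser, then enqueue the frame unchanged so the decode path is unaffected.
// Structural types keep the tap testable away from the real WebRTC objects.
type EncodedFrameLike = { data: ArrayBuffer };
type ControllerLike = { enqueue(frame: EncodedFrameLike): void };

function createSeiTap(ingest: (buf: ArrayBuffer) => void) {
  return (frame: EncodedFrameLike, controller: ControllerLike): void => {
    ingest(frame.data);        // e.g. DjiAiDetectionsService.ingestAnnexB(frame.data, meta)
    controller.enqueue(frame); // decoding sees the frame untouched
  };
}

// Worker wiring (Chrome/Edge), per the standard onrtctransform pattern:
//
//   self.onrtctransform = (event) => {
//     const tap = createSeiTap((buf) => { /* forward bytes to the parser */ });
//     event.transformer.readable
//       .pipeThrough(new TransformStream({ transform: tap }))
//       .pipeTo(event.transformer.writable);
//   };
```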

Reset between streams

When the operator switches cameras (e.g. wide → zoom) or the stream is torn down and re-provisioned, DjiAiDetectionsService.clear() empties the detection history and resets the latest frame — so stale boxes from the previous stream don’t linger on the new video.

Related

  • DRC — the command pipeline used to toggle the on-aircraft AI and select the recognition model.
  • Dock tile — the parent tile that hosts the drone-stream overlay.
  • Auto-stream — the ingress layer the SEI messages ride.