DJI SEI detections

DJI Dock 3 aircraft run their own on-board object-detection model and embed the results in the H.264 video stream as SEI (Supplemental Enhancement Information) messages. ARGUS parses these in-browser and renders them in the drone-stream overlay alongside any YOLO-agent detections.

Why embed detections in SEI?

SEI embedding has a big advantage: detections arrive bound to the frame that produced them. There’s no clock-skew problem — a box and the pixels it describes are in the same H.264 NAL unit, so rendering is pixel-perfect even over jittery links.

The parser

dji-sei-parser.ts handles the full decode pipeline:

  1. Annex-B scan — walks the access unit looking for NAL units with start codes 00 00 00 01 or 00 00 01. SEI NAL type is 0x06.
  2. Emulation-prevention strip — removes 0x03 escape bytes inserted after any 00 00 sequence in the original payload.
  3. Multi-byte length decode — SEI payload_type and payload_size can be multi-byte; the parser reads runs of 0xFF + a final <0xFF byte.
  4. Marker detection — first payload byte identifies transport:
    • 0xF5 — WebRTC (post-Dock 3 firmware).
    • 0x65 — Agora (legacy).
    • Anything else — raw payload returned, caller can inspect.
  5. Payload struct decode — little-endian:
    • Optional DJIF magic (0x44 0x4A 0x49 0x46).
    • Version byte.
    • Target count.
    • Timestamp ms (BigUint64).
    • Per target (16 bytes): target index, object type, confidence, normalised x/y/w/h (12-bit fixed-point, divided by 4095).
    • Optional state + reserved byte.
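Steps 1–3 of the pipeline can be sketched as follows. This is an illustrative sketch, not the real exports of dji-sei-parser.ts — the function names and return shapes here are assumptions.

```typescript
// Illustrative sketch of steps 1–3; these names are not the real
// exports of dji-sei-parser.ts.

/** Find NAL payload offsets after 00 00 01 / 00 00 00 01 start codes. */
function findNalUnits(au: Uint8Array): number[] {
  const starts: number[] = [];
  for (let i = 0; i + 3 <= au.length; i++) {
    if (au[i] === 0x00 && au[i + 1] === 0x00 && au[i + 2] === 0x01) {
      starts.push(i + 3); // offset of the NAL header byte (0x06 for SEI)
      i += 2;             // skip past the rest of the start code
    }
  }
  return starts;
}

/** Strip emulation-prevention bytes: 00 00 03 becomes 00 00 (simplified). */
function stripEmulationPrevention(data: Uint8Array): Uint8Array {
  const out: number[] = [];
  for (let i = 0; i < data.length; i++) {
    if (i >= 2 && data[i] === 0x03 && data[i - 1] === 0x00 && data[i - 2] === 0x00) {
      continue; // drop the 0x03 escape byte
    }
    out.push(data[i]);
  }
  return Uint8Array.from(out);
}

/** Read an ff-coded SEI value: a run of 0xFF bytes plus one final byte < 0xFF. */
function readSeiValue(data: Uint8Array, pos: number): { value: number; next: number } {
  let value = 0;
  while (pos < data.length && data[pos] === 0xff) {
    value += 255;
    pos++;
  }
  return { value: value + data[pos], next: pos + 1 };
}
```

Note the four-byte start code 00 00 00 01 contains 00 00 01 at its tail, so a single three-byte scan catches both forms.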

Object types

Value   Meaning
1       UNKNOWN
2       PERSON
3       CAR
4       BOAT

Target state

Value   Meaning
0       TRACKED
1       LOST
2       NEW
3       OBSCURED
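The two value tables map naturally onto enums. These mirror the documented values, but the identifiers are illustrative — the real names in dji-sei-parser.ts may differ.

```typescript
// Illustrative enums mirroring the value tables above; the identifiers
// in the real dji-sei-parser.ts may differ.
enum DjiObjectType {
  Unknown = 1,
  Person = 2,
  Car = 3,
  Boat = 4,
}

enum DjiTargetState {
  Tracked = 0,
  Lost = 1,
  New = 2,
  Obscured = 3,
}
```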

The overlay badges new / lost / obscured state for operators who want to track object continuity.

Confidence normalisation

Different DJI firmware families encode confidence differently:

  • Some use 0-100 (percent).
  • Some use 0-10000 (basis points).

The parser auto-detects: confRaw > 100 ? confRaw / 10000 : confRaw / 100 and clamps to [0, 1].
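That rule, written out as a function (the name normaliseConfidence is illustrative, not the parser's actual export):

```typescript
// Sketch of the documented auto-detection rule; the function name
// is illustrative, not the parser's actual export.
function normaliseConfidence(confRaw: number): number {
  // Values above 100 are assumed to be basis points (0–10000),
  // otherwise percent (0–100).
  const conf = confRaw > 100 ? confRaw / 10000 : confRaw / 100;
  return Math.min(1, Math.max(0, conf)); // clamp to [0, 1]
}
```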

The downstream consumer

DjiAiDetectionsService takes parser output and publishes:

  • detections$ — Observable stream of per-frame detections.
  • latestFrame() — Angular signal for reactive bindings.
  • history[] — rolling 500-frame cache for analytics / scrubbing.
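The rolling cache behind history[] amounts to a capped buffer that evicts the oldest frame. A minimal sketch — the DjiFrame shape here is an assumption, not the service's real type:

```typescript
// Minimal sketch of a 500-frame rolling cache like history[];
// the DjiFrame shape is an assumption, not the service's real type.
interface DjiFrame {
  timestampMs: number;
  targets: { type: number; confidence: number }[];
}

const HISTORY_LIMIT = 500;
const history: DjiFrame[] = [];

function pushFrame(frame: DjiFrame): void {
  history.push(frame);
  if (history.length > HISTORY_LIMIT) {
    history.shift(); // evict the oldest frame
  }
}
```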

The drone-stream overlay subscribes and renders. Boxes appear in green by default; the colour can be changed via the AI tab.

Upstream wiring (the honest story)

The parser + consumer service are real and complete. What’s still partial is the upstream frame-transform hook — the browser-side code that extracts SEI from the live video track. It needs an RTCRtpScriptTransform (Chrome / Edge / Firefox Nightly) or a VideoDecoder + MediaStreamTrackProcessor fallback (Safari 17+). Both approaches are fragile across dock firmware and browser versions, so the hook is being rolled out per-browser after field telemetry from the auto-stream deployment. Until it’s fully live, DJI SEI detections may only render in a subset of browsers or after firmware-specific tweaks — flag any mismatch to support.

Comparison to YOLO11

Aspect                 DJI SEI              YOLO11
Where inference runs   On aircraft (free)   Cloud GPU (metered)
Latency                ~50 ms               ~150 ms
Classes                person, car, boat    80 COCO + SAR fine-tune
Network required       No                   Yes
Persistent re-ID       No                   Yes (p_xxxx)
Runs with DJI only     Yes                  Any drone

Most operations run both: DJI SEI for real-time low-latency on dock aircraft, YOLO11 for cross-fleet consistency + persistent IDs.