# DJI SEI detections
DJI Dock 3 aircraft run their own on-board object-detection model and embed the results in the H.264 video stream as SEI (Supplemental Enhancement Information) messages. ARGUS parses these in-browser and renders them in the drone-stream overlay alongside any YOLO-agent detections.
## Why embed detections in SEI?
SEI embedding has a big advantage: detections arrive bound to the frame that produced them. There’s no clock-skew problem — a box and the pixels it describes are in the same H.264 NAL unit, so rendering is pixel-perfect even over jittery links.
## The parser

`dji-sei-parser.ts` handles the full decode pipeline:

- Annex-B scan — walks the access unit looking for NAL units with start codes `00 00 00 01` or `00 00 01`. The SEI NAL type is `0x06`.
- Emulation-prevention strip — removes the `0x03` escape bytes inserted after any `00 00` sequence in the original payload.
- Multi-byte length decode — SEI `payload_type` and `payload_size` can be multi-byte; the parser reads runs of `0xFF` plus a final byte `< 0xFF`.
- Marker detection — the first payload byte identifies the transport:
  - `0xF5` — WebRTC (post-Dock 3 firmware).
  - `0x65` — Agora (legacy).
  - Anything else — the raw payload is returned for the caller to inspect.
- Payload struct decode — little-endian:
  - Optional `DJIF` magic (`0x44 0x4A 0x49 0x46`).
  - Version byte.
  - Target count.
  - Timestamp in ms (BigUint64).
  - Per target (16 bytes): target index, object type, confidence, normalised x/y/w/h (12-bit fixed-point, divided by 4095).
  - Optional state + reserved byte.
  - Optional
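The first three stages can be sketched as standalone helpers. This is a minimal sketch: the function names are illustrative, not the actual `dji-sei-parser.ts` exports.

```typescript
// Illustrative helpers for the Annex-B scan, emulation-prevention strip,
// and multi-byte SEI length decode. Names are assumptions for this sketch.

/** Locate NAL units after `00 00 01` / `00 00 00 01` start codes. */
function findNalUnits(buf: Uint8Array): { type: number; start: number }[] {
  const nals: { type: number; start: number }[] = [];
  for (let i = 0; i + 3 < buf.length; i++) {
    // A 4-byte start code (00 00 00 01) also ends in 00 00 01, so this
    // single pattern catches both forms.
    if (buf[i] === 0 && buf[i + 1] === 0 && buf[i + 2] === 1) {
      const header = buf[i + 3];
      nals.push({ type: header & 0x1f, start: i + 4 }); // H.264: type = low 5 bits
      i += 3;
    }
  }
  return nals; // filter for type === 0x06 to keep only SEI
}

/** Drop the 0x03 escape byte that follows every 00 00 pair in the payload. */
function stripEmulationPrevention(payload: Uint8Array): Uint8Array {
  const out: number[] = [];
  let zeros = 0;
  for (const b of payload) {
    if (zeros >= 2 && b === 0x03) { zeros = 0; continue; } // escape byte: skip
    out.push(b);
    zeros = b === 0 ? zeros + 1 : 0;
  }
  return Uint8Array.from(out);
}

/** Read a variable-length SEI payload_type / payload_size value. */
function readSeiValue(buf: Uint8Array, pos: number): { value: number; next: number } {
  let value = 0;
  while (buf[pos] === 0xff) { value += 255; pos++; } // runs of 0xFF accumulate
  return { value: value + buf[pos], next: pos + 1 }; // final byte < 0xFF
}
```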
## Object types

| Value | Meaning |
|---|---|
| 1 | UNKNOWN |
| 2 | PERSON |
| 3 | CAR |
| 4 | BOAT |
## Target state

| Value | Meaning |
|---|---|
| 0 | TRACKED |
| 1 | LOST |
| 2 | NEW |
| 3 | OBSCURED |
The overlay badges new / lost / obscured state for operators who want to track object continuity.
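In TypeScript the two tables map naturally onto numeric enums, and a small helper can derive the overlay badge from the state. The enum and function names here are assumptions for this sketch, not the real parser exports.

```typescript
// Enum values mirror the two tables above; names are illustrative.
enum DjiObjectType { UNKNOWN = 1, PERSON = 2, CAR = 3, BOAT = 4 }
enum DjiTargetState { TRACKED = 0, LOST = 1, NEW = 2, OBSCURED = 3 }

/** Badge text for the overlay; TRACKED is the steady state, so no badge. */
function badgeFor(state: DjiTargetState): string | null {
  return state === DjiTargetState.TRACKED
    ? null
    : DjiTargetState[state].toLowerCase(); // numeric enums reverse-map to names
}
```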
## Confidence normalisation
Different DJI firmware families encode confidence differently:
- Some use 0-100 (percent).
- Some use 0-10000 (basis points).
The parser auto-detects the scale: `confRaw > 100 ? confRaw / 10000 : confRaw / 100`, and clamps the result to `[0, 1]`.
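As a standalone function, the heuristic plus clamp looks like this (the function name is an assumption; the logic is the one described above):

```typescript
// Auto-detect whether confidence is percent (0-100) or basis points
// (0-10000), then normalise to [0, 1]. The name is illustrative.
function normalizeConfidence(confRaw: number): number {
  const conf = confRaw > 100 ? confRaw / 10000 : confRaw / 100;
  return Math.min(1, Math.max(0, conf)); // clamp to [0, 1]
}
```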
## The downstream consumer

`DjiAiDetectionsService` takes parser output and publishes:

- `detections$` — Observable stream of per-frame detections.
- `latestFrame()` — Angular signal for reactive bindings.
- `history[]` — rolling 500-frame cache for analytics / scrubbing.
The drone-stream overlay subscribes and renders. Boxes are green by default; the colour can be changed from the AI tab.
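The rolling cache is straightforward; here is a minimal sketch. The class and field names are assumptions (not the real `DjiAiDetectionsService` internals); only the 500-frame cap comes from this page.

```typescript
// Sketch of a rolling 500-frame history cache; names are illustrative.
interface SeiFrame {
  timestampMs: bigint;
  targets: { objectType: number; confidence: number }[];
}

class DetectionHistory {
  private readonly frames: SeiFrame[] = [];
  constructor(private readonly cap: number = 500) {}

  push(frame: SeiFrame): void {
    this.frames.push(frame);
    if (this.frames.length > this.cap) this.frames.shift(); // evict oldest
  }

  latestFrame(): SeiFrame | undefined {
    return this.frames[this.frames.length - 1];
  }

  get size(): number {
    return this.frames.length;
  }
}
```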
## Upstream wiring (the honest story)

The parser + consumer service are real and complete. What’s still partial is the upstream frame-transform hook — the browser-side code that extracts SEI from the live video track. It needs an `RTCRtpScriptTransform` (Chrome / Edge / Firefox Nightly) or a `VideoDecoder` + `MediaStreamTrackProcessor` fallback (Safari 17+). Both approaches are fragile across dock firmware and browser versions, so the hook is being rolled out per-browser after field telemetry from the auto-stream deployment. Until it’s fully live, DJI SEI detections may only render in a subset of browsers or after firmware-specific tweaks; flag any mismatch to support.
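For reference, the worker side of the `RTCRtpScriptTransform` path can be sketched as a pass-through `TransformStream` that sniffs each encoded frame for SEI NALs. The helper names and the `onSei` callback are assumptions, and the real hook must also cover the Safari fallback path.

```typescript
// Hedged sketch of the worker-side transform: forward every encoded video
// frame untouched while sniffing its Annex-B bytes for SEI NAL units (0x06).
// makeSeiSniffer / hasSeiNal are illustrative names, not the real hook.

function hasSeiNal(data: Uint8Array): boolean {
  for (let i = 0; i + 3 < data.length; i++) {
    if (data[i] === 0 && data[i + 1] === 0 && data[i + 2] === 1) {
      if ((data[i + 3] & 0x1f) === 0x06) return true; // H.264 SEI type
      i += 3;
    }
  }
  return false;
}

/** Pass-through transform: frames are never modified or dropped. */
function makeSeiSniffer(onSei: (bytes: Uint8Array) => void) {
  return new TransformStream<{ data: ArrayBuffer }, { data: ArrayBuffer }>({
    transform(frame, controller) {
      const bytes = new Uint8Array(frame.data);
      if (hasSeiNal(bytes)) onSei(bytes); // hand SEI bytes to the parser
      controller.enqueue(frame);
    },
  });
}

// Browser-only wiring (inside the transform worker):
// self.onrtctransform = (e) =>
//   e.transformer.readable
//     .pipeThrough(makeSeiSniffer(postSeiToParser))
//     .pipeTo(e.transformer.writable);
```

Keeping the sniffer side-effect-only means a parser bug can never corrupt or stall the video path itself.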
## Comparison to YOLO11
| Aspect | DJI SEI | YOLO11 |
|---|---|---|
| Where inference runs | On aircraft (free) | Cloud GPU (metered) |
| Latency | ~50 ms | ~150 ms |
| Classes | person, car, boat | 80 COCO + SAR fine-tune |
| Network required | No | Yes |
| Persistent re-ID | No | Yes (p_xxxx) |
| Drone support | DJI aircraft only | Any drone |
Most operations run both: DJI SEI for real-time, low-latency detections on dock aircraft, and YOLO11 for cross-fleet consistency plus persistent IDs.
## Related
- DJI auto-stream — the transport layer that delivers the H.264 stream we parse.
- YOLO11
- Detection overview