DJI SEI detections

DJI Dock 3 aircraft run their own on-board object-detection model and embed the results in the H.264 video stream as SEI (Supplemental Enhancement Information) messages. ARGUS parses these in-browser and renders them in the drone-stream overlay alongside any YOLO-agent detections.

Why embed detections in SEI?

SEI embedding has a big advantage: detections arrive bound to the frame that produced them. There’s no clock-skew problem — a box and the pixels it describes are in the same H.264 NAL unit, so rendering is pixel-perfect even over jittery links.

The parser

dji-sei-parser.ts handles the full decode pipeline:

  1. Annex-B scan — walks the access unit looking for NAL units with start codes 00 00 00 01 or 00 00 01. SEI NAL type is 0x06.
  2. Emulation-prevention strip — removes 0x03 escape bytes inserted after any 00 00 sequence in the original payload.
  3. Multi-byte length decode — SEI payload_type and payload_size can be multi-byte; the parser reads runs of 0xFF + a final <0xFF byte.
  4. Marker detection — first payload byte identifies transport:
    • 0xF5 — WebRTC (post-Dock 3 firmware).
    • 0x65 — Agora (legacy).
    • Anything else — raw payload returned, caller can inspect.
  5. Payload struct decode — little-endian:
    • Optional DJIF magic (0x44 0x4A 0x49 0x46).
    • Version byte.
    • Target count.
    • Timestamp ms (BigUint64).
    • Per target (16 bytes): target index, object type, confidence, normalised x/y/w/h (12-bit fixed-point, divided by 4095).
    • Optional state + reserved byte.
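Steps 1–3 of the pipeline can be sketched as follows. This is an illustrative sketch, not the real exports of dji-sei-parser.ts — the function names and return shapes here are assumptions.

```typescript
// Illustrative sketch of steps 1–3; these names are not the real
// exports of dji-sei-parser.ts.

/** Find NAL payload offsets after 00 00 01 / 00 00 00 01 start codes. */
function findNalUnits(au: Uint8Array): number[] {
  const starts: number[] = [];
  for (let i = 0; i + 3 <= au.length; i++) {
    if (au[i] === 0x00 && au[i + 1] === 0x00 && au[i + 2] === 0x01) {
      starts.push(i + 3); // offset of the NAL header byte (0x06 for SEI)
      i += 2;             // skip past the rest of the start code
    }
  }
  return starts;
}

/** Strip emulation-prevention bytes: 00 00 03 becomes 00 00 (simplified). */
function stripEmulationPrevention(data: Uint8Array): Uint8Array {
  const out: number[] = [];
  for (let i = 0; i < data.length; i++) {
    if (i >= 2 && data[i] === 0x03 && data[i - 1] === 0x00 && data[i - 2] === 0x00) {
      continue; // drop the 0x03 escape byte
    }
    out.push(data[i]);
  }
  return Uint8Array.from(out);
}

/** Read an ff-coded SEI value: a run of 0xFF bytes plus one final byte < 0xFF. */
function readSeiValue(data: Uint8Array, pos: number): { value: number; next: number } {
  let value = 0;
  while (pos < data.length && data[pos] === 0xff) {
    value += 255;
    pos++;
  }
  return { value: value + data[pos], next: pos + 1 };
}
```

Note the four-byte start code 00 00 00 01 contains 00 00 01 at its tail, so a single three-byte scan catches both forms.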

Object types

Value   Meaning
1       UNKNOWN
2       PERSON
3       CAR
4       BOAT

Target state

Value   Meaning
0       TRACKED
1       LOST
2       NEW
3       OBSCURED
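The two value tables map naturally onto enums. These mirror the documented values, but the identifiers are illustrative — the real names in dji-sei-parser.ts may differ.

```typescript
// Illustrative enums mirroring the value tables above; the identifiers
// in the real dji-sei-parser.ts may differ.
enum DjiObjectType {
  Unknown = 1,
  Person = 2,
  Car = 3,
  Boat = 4,
}

enum DjiTargetState {
  Tracked = 0,
  Lost = 1,
  New = 2,
  Obscured = 3,
}
```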

The overlay badges new / lost / obscured state for operators who want to track object continuity.

Confidence normalisation

Different DJI firmware families encode confidence differently:

  • Some use 0-100 (percent).
  • Some use 0-10000 (basis points).

The parser auto-detects: confRaw > 100 ? confRaw / 10000 : confRaw / 100 and clamps to [0, 1].
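That rule, written out as a function (the name normaliseConfidence is illustrative, not the parser's actual export):

```typescript
// Sketch of the documented auto-detection rule; the function name
// is illustrative, not the parser's actual export.
function normaliseConfidence(confRaw: number): number {
  // Values above 100 are assumed to be basis points (0–10000),
  // otherwise percent (0–100).
  const conf = confRaw > 100 ? confRaw / 10000 : confRaw / 100;
  return Math.min(1, Math.max(0, conf)); // clamp to [0, 1]
}
```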

The downstream consumer

DjiAiDetectionsService takes parser output and publishes:

  • detections$ — Observable stream of per-frame detections.
  • latestFrame() — Angular signal for reactive bindings.
  • history[] — rolling 500-frame cache for analytics / scrubbing.
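The rolling cache behind history[] amounts to a capped buffer that evicts the oldest frame. A minimal sketch — the DjiFrame shape here is an assumption, not the service's real type:

```typescript
// Minimal sketch of a 500-frame rolling cache like history[];
// the DjiFrame shape is an assumption, not the service's real type.
interface DjiFrame {
  timestampMs: number;
  targets: { type: number; confidence: number }[];
}

const HISTORY_LIMIT = 500;
const history: DjiFrame[] = [];

function pushFrame(frame: DjiFrame): void {
  history.push(frame);
  if (history.length > HISTORY_LIMIT) {
    history.shift(); // evict the oldest frame
  }
}
```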

The drone-stream overlay subscribes and renders. Boxes appear in green by default; the colour can be changed via the AI tab.

Upstream wiring (the honest story)

The parser + consumer service are real and complete. What’s still partial is the upstream frame-transform hook — the browser-side code that extracts SEI from the live video track. It needs an RTCRtpScriptTransform (Chrome / Edge / Firefox Nightly) or a VideoDecoder + MediaStreamTrackProcessor fallback (Safari 17+). Both approaches are fragile across dock firmware and browser versions, so the hook is being rolled out per-browser after field telemetry from the auto-stream deployment. Until it’s fully live, DJI SEI detections may only render in a subset of browsers or after firmware-specific tweaks — flag any mismatch to support.

Comparison to YOLO11

Aspect                 DJI SEI              YOLO11
Where inference runs   On aircraft (free)   Cloud GPU (metered)
Latency                ~50 ms               ~150 ms
Classes                person, car, boat    80 COCO + SAR fine-tune
Network required       No                   Yes
Persistent re-ID       No                   Yes (p_xxxx)
Runs with DJI only     Yes                  Any drone

Most operations run both: DJI SEI for real-time low-latency on dock aircraft, YOLO11 for cross-fleet consistency + persistent IDs.