When you fuse iPhone LiDAR depth with a YOLO bounding box, the obvious thing to do is intersect them: project the box into the depth map, sample the median depth, and trust that.
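A sketch of that naive projection, under the simplifying assumption that the YOLO box is normalized to the same landscape orientation as the depth map (boxToDepthPixels is a name I'm using here for illustration, not the app's actual helper):

import ARKit

// Hypothetical helper: collect the raw depth values under a normalized bounding box.
// Assumes the box and sceneDepth share the same landscape orientation; error handling
// and Vision-vs-ARKit y-axis conversion are deliberately glossed over.
func boxToDepthPixels(_ box: CGRect, depthMap: CVPixelBuffer) -> [Float32] {
    let w = CVPixelBufferGetWidth(depthMap)     // 256 on current iPhone LiDAR
    let h = CVPixelBufferGetHeight(depthMap)    // 192
    CVPixelBufferLockBaseAddress(depthMap, .readOnly)
    defer { CVPixelBufferUnlockBaseAddress(depthMap, .readOnly) }
    guard let base = CVPixelBufferGetBaseAddress(depthMap) else { return [] }
    let rowBytes = CVPixelBufferGetBytesPerRow(depthMap)

    let x0 = max(0, Int(box.minX * CGFloat(w))), x1 = min(w - 1, Int(box.maxX * CGFloat(w)))
    let y0 = max(0, Int(box.minY * CGFloat(h))), y1 = min(h - 1, Int(box.maxY * CGFloat(h)))
    guard x0 <= x1, y0 <= y1 else { return [] }

    var depths: [Float32] = []
    for y in y0...y1 {
        let row = base.advanced(by: y * rowBytes).assumingMemoryBound(to: Float32.self)
        for x in x0...x1 { depths.append(row[x]) }   // depth in meters
    }
    return depths
}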
I shipped that. It worked beautifully when the camera was still, and produced ghost objects everywhere when the camera was panning.
Recognition rate on a real warehouse pilot dropped from 85% steady-state to roughly 60% during motion. This is the story of the three fixes that brought it back to 88%.
This post is iOS / CoreML / ARKit / LiDAR-specific. I maintain CoreML-Models (1,749 stars), the de facto iOS Core ML model zoo. The retail-vision app described here is the production iOS app I've been solo-leading for 6 months.
The depth source is ARKit's sceneDepth, which arrives at 256×192 resolution (landscape). Bounding boxes near the edges of the frame produced detections that didn't exist in the real world, most often during camera motion. The downstream "match this SKU to a bin" code happily assigned these ghost detections to real bins, displacing the genuine SKU.
Symptom: SKU recognition rate visibly degraded during pans, returning to baseline when the camera was still.
Hypothesis 1: YOLO is firing on the wrong things during motion blur.
I bumped the YOLO confidence threshold from 0.40 to 0.55.
Result: recognition rate dropped further, to ~55%. The high threshold cut too many real-but-faint detections without solving the ghost problem. Reverted.
Hypothesis 2: Just debounce โ require N consecutive frames to confirm a detection.
I added N=3 frame debounce.
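Roughly what the debounce looked like (a sketch with hypothetical names, not the shipped code): a detection is only surfaced once the same SKU has been seen in N consecutive frames.

struct Debouncer {
    var requiredFrames = 3
    private var consecutiveHits: [String: Int] = [:]   // SKU -> consecutive-frame count

    // Returns only the SKUs seen in `requiredFrames` consecutive frames.
    mutating func confirm(_ detectedSKUs: Set<String>) -> Set<String> {
        for sku in Array(consecutiveHits.keys) where !detectedSKUs.contains(sku) {
            consecutiveHits[sku] = 0                    // streak broken
        }
        for sku in detectedSKUs {
            consecutiveHits[sku, default: 0] += 1
        }
        return Set(consecutiveHits.filter { $0.value >= requiredFrames }.keys)
    }
}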
Result: ghost detections did go away, but real detections now appeared with a visible 100-150ms lag during scans. Worker complaint: "too slow". Reverted.
Hypothesis 3: Tighten the per-anchor IoU threshold for the depth-to-detection match.
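For context, the check is a standard IoU between the detection box and the projected footprint of an existing anchor; a minimal sketch, assuming both rects live in the same normalized image space:

import CoreGraphics

// Intersection-over-union of two rects; 0 when they don't overlap.
func iou(_ a: CGRect, _ b: CGRect) -> CGFloat {
    let inter = a.intersection(b)
    guard !inter.isNull else { return 0 }
    let interArea = inter.width * inter.height
    let unionArea = a.width * a.height + b.width * b.height - interArea
    return unionArea > 0 ? interArea / unionArea : 0
}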
Result: marginal improvement, but most ghost detections were already passing the IoU check because they happened to be in plausible spatial locations. Marginal fix, kept.
None of these hypotheses was wrong, exactly. But none of them addressed why the camera-motion case produced bad depth.
When I dumped the actual depth data per-pixel during a pan, I noticed the depth values weren't completely wrong; they were just unconfident. ARKit's sceneDepth ships with a per-pixel confidence channel (ARConfidenceLevel: low / medium / high) that most fusion code ignores entirely.
At the edges of the frame, during motion, low-confidence pixels dominated. The median depth I was computing was therefore being driven by garbage.
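For reference, reading that channel looks roughly like this. sceneDepth requires opting in via ARWorldTrackingConfiguration.frameSemantics = .sceneDepth; confidenceLevel(at:in:) is just an illustration name, not the app's actual helper.

import ARKit

// Minimal sketch: look up the ARConfidenceLevel for one pixel of the 256×192 depth map.
// confidenceMap is a CVPixelBuffer of UInt8 values matching ARConfidenceLevel's raw values.
func confidenceLevel(at x: Int, _ y: Int, in frame: ARFrame) -> ARConfidenceLevel? {
    guard let confidenceMap = frame.sceneDepth?.confidenceMap else { return nil }
    CVPixelBufferLockBaseAddress(confidenceMap, .readOnly)
    defer { CVPixelBufferUnlockBaseAddress(confidenceMap, .readOnly) }
    guard let base = CVPixelBufferGetBaseAddress(confidenceMap) else { return nil }
    let rowBytes = CVPixelBufferGetBytesPerRow(confidenceMap)
    let raw = base.advanced(by: y * rowBytes).assumingMemoryBound(to: UInt8.self)[x]
    return ARConfidenceLevel(rawValue: Int(raw))   // .low / .medium / .high
}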
Fix 1: Mask out low-confidence pixels before computing per-detection median depth.
// Before
let depths = pixelsInBox.map { depthMap[$0] }
let medianDepth = depths.median()

// After
let depths = pixelsInBox
    .filter { confidenceMap[$0] != .low }   // drop low-confidence pixels, keep medium + high
    .map { depthMap[$0] }

// If a third or fewer of the box's pixels survive the confidence mask, abstain from depth fusion
guard depths.count > pixelsInBox.count / 3 else {
    detection.depth = nil   // surface to the UI as "uncertain"
    continue
}
let medianDepth = depths.median()
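One caveat on the snippet: median() isn't a Swift standard-library method. The minimal extension I mean, assuming a non-empty array:

extension Array where Element == Float32 {
    func median() -> Float32 {
        let sorted = self.sorted()
        let mid = sorted.count / 2
        // Even count: average the two middle values; odd count: take the middle value.
        return sorted.count % 2 == 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid]
    }
}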
The "abstain" branch turned out to be more important than the masking. Telling the system "I don't know" let the upstream tracking handle the gap gracefully rather than feeding garbage downstream.
This alone bumped during-motion recognition from 60% to ~75%.
Once a SKU is recognized once and depth-fused once, you have a 3D anchor for it. Real detections should reinforce that anchor; ghosts should die out fast.
I keep a per-anchor confidence that decays each frame and adds on reinforcement:
struct TrackedAnchor {
    var lastSeenFrame: Int
    var confidence: Float          // 0.0 to 1.0
    var sku: String
    var worldPosition: simd_float3
}

// Every frame:
for i in trackedAnchors.indices {
    trackedAnchors[i].confidence *= 0.85   // decay factor
}
for detection in currentFrameDetections {
    let matchIndex = trackedAnchors.firstIndex {
        simd_distance($0.worldPosition, detection.world) < 0.05   // within 5 cm
    }
    if let i = matchIndex {
        trackedAnchors[i].confidence = min(1.0, trackedAnchors[i].confidence + 0.3)   // reinforce
        trackedAnchors[i].lastSeenFrame = currentFrame
    } else {
        // new anchor candidate; needs to reach 0.5 confidence before "real"
        let newAnchor = TrackedAnchor(confidence: 0.3, ...)
        trackedAnchors.append(newAnchor)
    }
}

// Prune
trackedAnchors.removeAll { $0.confidence < 0.15 }
Ghost detections, typically single-frame events, never accumulate enough confidence to be promoted to "real". Real detections, even ones that flicker, persist long enough to dominate.
Key insight: decay factor 0.85 with reinforcement 0.3 means an anchor needs 2-3 reinforcing frames before it becomes "real", but only loses real-status after ~5 frames of silence. Asymmetric; intentional.
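A quick sanity check of those numbers (a throwaway snippet, just plugging in the constants above):

// Promote: starting at 0.3, how many reinforced frames until confidence crosses 0.5?
var c: Float = 0.3                 // confidence at creation
var reinforcedFrames = 0
while c < 0.5 {
    c = min(1.0, c * 0.85 + 0.3)   // decay, then reinforce
    reinforcedFrames += 1
}
print(reinforcedFrames)            // 1 -> the second consecutive detection crosses 0.5

// Demote: starting at 1.0, how many silent frames until confidence drops below 0.5?
c = 1.0
var silentFrames = 0
while c >= 0.5 {
    c *= 0.85                      // decay only, no detection
    silentFrames += 1
}
print(silentFrames)                // 5 -> loses "real" status after ~5 frames of silence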
This brought during-motion recognition to ~85%.
The remaining ghost detections clustered in specific image regions: corners and edges where depth quality is fundamentally worse.
I scaled the YOLO confidence threshold with distance from the frame center (conceptually a 3×3 grid: lenient in the center, stricter at edges and corners):
// boundingBox is in normalized image coordinates (0...1 on both axes)
func confidenceThreshold(for boundingBox: CGRect) -> Float {
    let centerX = boundingBox.midX
    let centerY = boundingBox.midY
    // Distance from frame center, 0.0 (center) to ~1.4 (corner)
    let dx = abs(centerX - 0.5) * 2
    let dy = abs(centerY - 0.5) * 2
    let distFromCenter = sqrt(dx * dx + dy * dy)
    // Linear ramp: center = 0.40, edge midpoints = 0.55, corners ≈ 0.61
    let baseThreshold: Float = 0.40
    let edgePenalty: Float = 0.15 * Float(distFromCenter)
    return baseThreshold + edgePenalty
}
This isn't novel; what mattered was tying it to depth quality (which falls off at frame edges on iPhone LiDAR), not to YOLO accuracy (which is actually pretty uniform across the frame).
This brought final during-motion recognition to 88%, within margin of error of the steady-state rate.
Treat sceneDepth carefully: the confidence channel is the single most underused signal in iOS depth work.
With ghost detections suppressed, downstream code can trust the detections, and SKU-to-bin assignment runs with no skepticism. The user-visible behavior is "you scan, it just works". 70+ TestFlight builds in 6 months, 5 daily beta users, sub-200ms end-to-end on iPhone 15 Pro.
I'm a Senior iOS / On-Device ML Engineer based in Osaka, Japan. I'm currently solo-lead on the 94k-line production retail-vision iOS app described in this post.
If you're building products with on-device ML (camera, AR, LiDAR, or on-device LLM), I'm open to Staff / Senior iOS or Mobile ML roles. Remote (JST), Tokyo/Osaka hybrid, or relocate with visa.