When you fuse iPhone LiDAR depth with a YOLO bounding box, the obvious thing to do is intersect them: project the box into the depth map, sample the median depth, and trust that.
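A sketch of that naive projection, under the simplifying assumption that the YOLO box is normalized to the same landscape orientation as the depth map (boxToDepthPixels is a name I'm using here for illustration, not the app's actual helper):

import ARKit

// Hypothetical helper: collect the raw depth values under a normalized bounding box.
// Assumes the box and sceneDepth share the same landscape orientation; error handling
// and Vision-vs-ARKit y-axis conversion are deliberately glossed over.
func boxToDepthPixels(_ box: CGRect, depthMap: CVPixelBuffer) -> [Float32] {
    let w = CVPixelBufferGetWidth(depthMap)     // 256 on current iPhone LiDAR
    let h = CVPixelBufferGetHeight(depthMap)    // 192
    CVPixelBufferLockBaseAddress(depthMap, .readOnly)
    defer { CVPixelBufferUnlockBaseAddress(depthMap, .readOnly) }
    guard let base = CVPixelBufferGetBaseAddress(depthMap) else { return [] }
    let rowBytes = CVPixelBufferGetBytesPerRow(depthMap)

    let x0 = max(0, Int(box.minX * CGFloat(w))), x1 = min(w - 1, Int(box.maxX * CGFloat(w)))
    let y0 = max(0, Int(box.minY * CGFloat(h))), y1 = min(h - 1, Int(box.maxY * CGFloat(h)))
    guard x0 <= x1, y0 <= y1 else { return [] }

    var depths: [Float32] = []
    for y in y0...y1 {
        let row = base.advanced(by: y * rowBytes).assumingMemoryBound(to: Float32.self)
        for x in x0...x1 { depths.append(row[x]) }   // depth in meters
    }
    return depths
}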
I shipped that. It worked beautifully when the camera was still, and produced ghost objects everywhere when the camera was panning.
Recognition rate on a real warehouse pilot dropped from 85% steady-state to roughly 60% during motion. This is the story of the three fixes that brought it back to 88%.
This post is iOS / CoreML / ARKit / LiDAR-specific. I maintain CoreML-Models (1,749 stars), the de facto iOS Core ML model zoo. The retail-vision app described here is the production iOS app I've been solo-leading for 6 months.
The depth source is ARKit's sceneDepth, which arrives at 256×192 resolution (landscape). Bounding boxes near the edges of the frame produced detections that didn't exist in the real world, most often during camera motion. The downstream "match this SKU to a bin" code happily assigned these ghost detections to real bins, displacing the genuine SKU.
Symptom: SKU recognition rate visibly degraded during pans, returning to baseline when the camera was still.
Hypothesis 1: YOLO is firing on the wrong things during motion blur.
I bumped the YOLO confidence threshold from 0.40 to 0.55.
Result: recognition rate dropped further, to ~55%. The high threshold cut too many real-but-faint detections without solving the ghost problem. Reverted.
Hypothesis 2: Just debounce โ require N consecutive frames to confirm a detection.
I added N=3 frame debounce.
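Roughly what the debounce looked like (a sketch with hypothetical names, not the shipped code): a detection is only surfaced once the same SKU has been seen in N consecutive frames.

struct Debouncer {
    var requiredFrames = 3
    private var consecutiveHits: [String: Int] = [:]   // SKU -> consecutive-frame count

    // Returns only the SKUs seen in `requiredFrames` consecutive frames.
    mutating func confirm(_ detectedSKUs: Set<String>) -> Set<String> {
        for sku in Array(consecutiveHits.keys) where !detectedSKUs.contains(sku) {
            consecutiveHits[sku] = 0                    // streak broken
        }
        for sku in detectedSKUs {
            consecutiveHits[sku, default: 0] += 1
        }
        return Set(consecutiveHits.filter { $0.value >= requiredFrames }.keys)
    }
}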
Result: ghost detections did go away, but real detections now appeared with a visible 100-150ms lag during scans. Worker complaint: "too slow". Reverted.
Hypothesis 3: Tighten the per-anchor IoU threshold for the depth-to-detection match.
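For context, the check is a standard IoU between the detection box and the projected footprint of an existing anchor; a minimal sketch, assuming both rects live in the same normalized image space:

import CoreGraphics

// Intersection-over-union of two rects; 0 when they don't overlap.
func iou(_ a: CGRect, _ b: CGRect) -> CGFloat {
    let inter = a.intersection(b)
    guard !inter.isNull else { return 0 }
    let interArea = inter.width * inter.height
    let unionArea = a.width * a.height + b.width * b.height - interArea
    return unionArea > 0 ? interArea / unionArea : 0
}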
Result: marginal improvement, but most ghost detections were already passing the IoU check because they happened to be in plausible spatial locations. Marginal fix, kept.
None of these hypotheses was wrong, exactly. But none of them addressed why the camera-motion case produced bad depth.
When I dumped the actual depth data per-pixel during a pan, I noticed the depth values weren't completely wrong; they were just unconfident. ARKit's sceneDepth ships with a per-pixel confidence channel (ARConfidenceLevel: low / medium / high) that most fusion code ignores entirely.
At the edges of the frame, during motion, low-confidence pixels dominated. The median depth I was computing was therefore being driven by garbage.
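For reference, reading that channel looks roughly like this. sceneDepth requires opting in via ARWorldTrackingConfiguration.frameSemantics = .sceneDepth; confidenceLevel(at:in:) is just an illustration name, not the app's actual helper.

import ARKit

// Minimal sketch: look up the ARConfidenceLevel for one pixel of the 256×192 depth map.
// confidenceMap is a CVPixelBuffer of UInt8 values matching ARConfidenceLevel's raw values.
func confidenceLevel(at x: Int, _ y: Int, in frame: ARFrame) -> ARConfidenceLevel? {
    guard let confidenceMap = frame.sceneDepth?.confidenceMap else { return nil }
    CVPixelBufferLockBaseAddress(confidenceMap, .readOnly)
    defer { CVPixelBufferUnlockBaseAddress(confidenceMap, .readOnly) }
    guard let base = CVPixelBufferGetBaseAddress(confidenceMap) else { return nil }
    let rowBytes = CVPixelBufferGetBytesPerRow(confidenceMap)
    let raw = base.advanced(by: y * rowBytes).assumingMemoryBound(to: UInt8.self)[x]
    return ARConfidenceLevel(rawValue: Int(raw))   // .low / .medium / .high
}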
Fix 1: Mask out low-confidence pixels before computing per-detection median depth.
// Before
let depths = pixelsInBox.map { depthMap[$0] }
let medianDepth = depths.median()

// After
let depths = pixelsInBox
    .filter { confidenceMap[$0] != .low }   // drop low-confidence pixels, keep medium + high
    .map { depthMap[$0] }

// If a third or fewer of the box's pixels survive the confidence mask, abstain from depth fusion
guard depths.count > pixelsInBox.count / 3 else {
    detection.depth = nil   // surface to the UI as "uncertain"
    continue
}
let medianDepth = depths.median()
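One caveat on the snippet: median() isn't a Swift standard-library method. The minimal extension I mean, assuming a non-empty array:

extension Array where Element == Float32 {
    func median() -> Float32 {
        let sorted = self.sorted()
        let mid = sorted.count / 2
        // Even count: average the two middle values; odd count: take the middle value.
        return sorted.count % 2 == 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid]
    }
}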
The "abstain" branch turned out to be more important than the masking. Telling the system "I don't know" let the upstream tracking handle the gap gracefully rather than feeding garbage downstream.
This alone bumped during-motion recognition from 60% to ~75%.
Once a SKU is recognized once and depth-fused once, you have a 3D anchor for it. Real detections should reinforce that anchor; ghosts should die out fast.
I keep a per-anchor confidence that decays each frame and adds on reinforcement:
struct TrackedAnchor {
    var lastSeenFrame: Int
    var confidence: Float          // 0.0 to 1.0
    var sku: String
    var worldPosition: simd_float3
}

// Every frame:
for i in trackedAnchors.indices {
    trackedAnchors[i].confidence *= 0.85   // decay factor
}
for detection in currentFrameDetections {
    let matchIndex = trackedAnchors.firstIndex {
        simd_distance($0.worldPosition, detection.world) < 0.05   // within 5 cm
    }
    if let i = matchIndex {
        trackedAnchors[i].confidence = min(1.0, trackedAnchors[i].confidence + 0.3)   // reinforce
        trackedAnchors[i].lastSeenFrame = currentFrame
    } else {
        // new anchor candidate; needs to reach 0.5 confidence before "real"
        let newAnchor = TrackedAnchor(confidence: 0.3, ...)
        trackedAnchors.append(newAnchor)
    }
}

// Prune
trackedAnchors.removeAll { $0.confidence < 0.15 }
Ghost detections, typically single-frame events, never accumulate enough confidence to be promoted to "real". Real detections, even ones that flicker, persist long enough to dominate.
Key insight: decay factor 0.85 with reinforcement 0.3 means an anchor needs 2-3 reinforcing frames before it becomes "real", but only loses real-status after ~5 frames of silence. Asymmetric; intentional.
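A quick sanity check of those numbers (a throwaway snippet, just plugging in the constants above):

// Promote: starting at 0.3, how many reinforced frames until confidence crosses 0.5?
var c: Float = 0.3                 // confidence at creation
var reinforcedFrames = 0
while c < 0.5 {
    c = min(1.0, c * 0.85 + 0.3)   // decay, then reinforce
    reinforcedFrames += 1
}
print(reinforcedFrames)            // 1 -> the second consecutive detection crosses 0.5

// Demote: starting at 1.0, how many silent frames until confidence drops below 0.5?
c = 1.0
var silentFrames = 0
while c >= 0.5 {
    c *= 0.85                      // decay only, no detection
    silentFrames += 1
}
print(silentFrames)                // 5 -> loses "real" status after ~5 frames of silence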
This brought during-motion recognition to ~85%.
The remaining ghost detections clustered in specific image regions: corners and edges where depth quality is fundamentally worse.
I scaled the YOLO confidence threshold with distance from the frame center (conceptually a 3×3 grid: lenient in the center, stricter at edges and corners):
// boundingBox is in normalized image coordinates (0...1 on both axes)
func confidenceThreshold(for boundingBox: CGRect) -> Float {
    let centerX = boundingBox.midX
    let centerY = boundingBox.midY
    // Distance from frame center, 0.0 (center) to ~1.4 (corner)
    let dx = abs(centerX - 0.5) * 2
    let dy = abs(centerY - 0.5) * 2
    let distFromCenter = sqrt(dx * dx + dy * dy)
    // Linear ramp: center = 0.40, edge midpoints = 0.55, corners ≈ 0.61
    let baseThreshold: Float = 0.40
    let edgePenalty: Float = 0.15 * Float(distFromCenter)
    return baseThreshold + edgePenalty
}
This isn't novel; what mattered was tying it to depth quality (which falls off at frame edges on iPhone LiDAR), not to YOLO accuracy (which is actually pretty uniform across the frame).
This brought final during-motion recognition to 88%, within margin of error of the steady-state rate.
Treat sceneDepth carefully: the confidence channel is the single most underused signal in iOS depth work.
With ghost detections suppressed, downstream code can trust the detections, and SKU-to-bin assignment runs with no skepticism. The user-visible behavior is "you scan, it just works". 70+ TestFlight builds in 6 months, 5 daily beta users, sub-200ms end-to-end on iPhone 15 Pro.
I'm a Senior iOS / On-Device ML Engineer based in Osaka, Japan. I'm currently solo-lead on the 94k-line production retail-vision iOS app described in this post.
If you're building products with on-device ML (camera, AR, LiDAR, or on-device LLM), I'm open to Staff / Senior iOS or Mobile ML roles. Remote (JST), Tokyo/Osaka hybrid, or relocate with visa.