Three Investigation Tracks

A scoping plan that splits the system into three parallel workstreams. They map to three layers — capture → perception → grounded reasoning — and can run largely in parallel if the hand-off contracts (§5) are frozen early.

  TRACK 1                 TRACK 2                      TRACK 3
  Capture & upload   →    Model perception R&D    →    Grounded reasoning
  (on device)             (what can we extract?)       (video + data pile)

  produces:               produces:                    consumes 1 + 2:
  capture bundle          scene_facts + config         data pile + facts
        └──────── contracts (§5) keep them aligned ────────┘

The dependency runs left to right, but the work doesn't have to: Tracks 2 and 3 can start today on sample/stock room video and a hand-authored data pile, long before Track 1 is production-ready. The contracts are what stop the tracks from diverging.

Track 1 — On-device reconstruction & capture

Objective. A native iOS app that captures a complete, registered, high-fidelity bundle on device and uploads it reliably. No server dependency during capture.

In scope.

Native Swift owns one ARWorldTrackingConfiguration ARSession (sceneReconstruction = .mesh, frameSemantics = .sceneDepth, planeDetection), with RoomCaptureSession(arSession:) driven from it.
Local recording of the full bundle: HEVC RGB, 16-bit packed depth, per-frame intrinsics + pose track, sparse points, ARPlaneAnchor/ARMeshAnchor, CapturedRoom/CapturedStructure, governor-selected keyframes, and a manifest.json pinning the world-frame anchor.
On-device live cues (coverage / "missed corner") from RoomPlan completedEdges + ARMesh coverage — no server needed.
Resumable background upload to S3 (URLSession background multipart, wifi-preferred, checksummed).
Instant on-device parametric model shown the moment scanning stops.

Out of scope (for now). Live streaming, Unity (can be added later as a render/UX layer over the same native session), any server round-trip.

Key questions to answer.

Depth fidelity: is 16-bit-packed-into-video good enough for fusion, or do we need raw depth frames?
Bundle format + size; mesh deltas vs final mesh; USDZ vs custom serialization.
Capture duration limits — thermal/throttling envelope on target devices.
Coordinate-frame stability: relocalization, drift, multi-pass merge via CapturedStructure.

Deliverables. Capture app that records + uploads; the capture-bundle schema (a shared contract); a handful of real captured datasets to feed Tracks 2 and 3.

Done when. A bundle reconstructs coherently offline (poses register depth into one cloud) and an upload survives a mid-transfer network drop.

Track 2 — Model perception R&D (Cosmos / SAM2 / Sapiens2)

Objective. Empirically characterize what each model can actually extract from room video and frames. This is exploratory R&D to de-risk capability, not a production pipeline.

"Video out → video in," decoupled. Per the upload-first decision, this is file-based: recorded video/frames from S3 (or stock room videos initially) fed to the models — not a live stream. That decoupling is what lets Track 2 start before Track 1 exists.

In scope.

A harness: video/frame-set in → run each model → log outputs for human eval.
- Cosmos 3 — captions, scene Q&A, surface/object labeling, coverage-gap reasoning. Measure hallucination rate on spatial/metric claims.
- SAM2 — masks + tracks on handheld room video; auto vs box vs point prompting; keyframe-reseed + memory propagation.
- Sapiens2 — only when occupants are in frame; pose/parts/pointmap, masked + scale-normalized. Decide if it earns a place.
- (Optional) metric-depth model to fill LiDAR gaps.
Sweep the variables: keyframe cadence, resolution, EVS on/off, Cosmos prompt styles, model sizes (Nano vs Super).
Define how outputs merge into scene_facts.json.

Key questions. Can Cosmos reliably describe/answer about room content without hallucinating geometry? How good are SAM2 masks/tracks on shaky handheld footage? Does Sapiens2 add value here at all? Right keyframe cadence? Cost + latency per minute of video, per model?

Deliverables. Experiment harness; a capability × quality × latency × cost matrix; a recommended model config; the frozen scene_facts schema (the contract Track 3 consumes).

Done when. We have an evidence-based answer to "what can we actually get," and a stable scene-facts format.

Track 3 — Grounded Cosmos reasoning with the data pile

(The "maybe third" — it's the payoff, and it depends on 1 + 2, but it can be prototyped now on mock data.)

Objective. Test whether feeding Cosmos a room video plus the reconstruction-derived data pile (progressive markdown KB + scene_facts) yields materially better, grounded reasoning than video alone — especially for metric questions it would otherwise guess at.

In scope.

Assemble context = keyframes/video + structured facts (measurements, plane equations, object inventory, captions) + the progressive markdown library.
Run Cosmos reasoning sessions; evaluate Q&A quality, grounding, and whether Cosmos defers to supplied numbers instead of hallucinating ("how wide is the doorway?" → reads the measured value).
A/B: with vs without the data pile, to quantify the lift.
Decide injection method: in-context markdown vs retrieval (RAG), and the context-budget threshold where RAG becomes necessary.

Key questions. Does grounding measurably cut hallucination? Does Cosmos correctly use provided geometry rather than inventing it? How large can the markdown get before it blows the context window (i.e., when to flip to RAG)? What fact formatting maximizes VLM uptake?

Deliverables. A grounded-reasoning eval (with/without data pile), prompt + fact-format recipes, a defensible RAG-now-vs-later decision, and the data-pile schema validated against real questions.

Done when. There's a demonstrable accuracy lift from grounding and a repeatable way to answer room questions correctly, including metrics.

Shared contracts (define these first — they are the glue)

The tracks only stay aligned if three data contracts are pinned early, even provisionally:

Capture-bundle schema (Track 1 output → Track 2 input): file layout, codecs, the pose/intrinsics track format, world-frame anchor, manifest.
scene_facts.json schema (Track 2 output → Track 3 input): objects (bbox, dimensions), surfaces (plane eqn, normal, corners), measurements, captions, links to keyframes/masks — all in the ARKit world frame.
Data-pile / KB schema (Tracks 1+2 → Track 3): the progressive markdown structure (index → rooms → surfaces → tasks) with front-matter IDs, coordinates, confidence, and S3 URIs.

Write these as living specs on day one. Tracks 2 and 3 stub them with mock data; Track 1 fills them with real data; everyone codes to the same shape.

Sequencing & parallelization

Now: freeze the three contracts (v0). Start Track 2 on stock room videos. Start Track 3 with a hand-authored data pile + mock scene_facts to prototype grounded prompting.
Track 1 in parallel: build capture → local bundle → upload; produce the first real datasets.
Converge: swap Track 2/3's mock inputs for Track 1's real bundles; Track 2's findings finalize scene_facts; Track 3 validates the data pile against real Q&A.
Then: revisit RAG (Track 3 threshold) and live streaming (only if a true interactive feature appears) — both deferred until the file-based pipeline works end to end.

The discipline that makes this work: build value on uploaded captures first, keep the three contracts stable, and let mock data unblock the downstream tracks so nothing waits on the capture app being finished.