Spatial Capture & Understanding System — Design

iOS (Unity + ARKit + RoomPlan) → WebRTC → Modal GPU pipeline (Cosmos 3 / SAM2 / Sapiens2) → S3 → batch enrichment → progressive-markdown knowledge base.

Design choices locked for v1: live feedback = light cues only; compute = Modal-first; knowledge output = progressive markdown (source of truth), RAG derived later.

1. System at a glance

 ┌───────────────────────── iOS DEVICE ──────────────────────────┐
 │  NATIVE CAPTURE CORE (Swift) — owns the ONE ARSession          │
 │   • ARWorldTrackingConfiguration: mesh + sceneDepth + planes   │
 │   • RoomCaptureSession(arSession:) driven from that session    │
 │   • Harvests: RGB, LiDAR depth, intrinsics, pose, points,      │
 │     ARPlaneAnchor, ARMeshAnchor, CapturedRoom/Structure        │
 │                         │ bridge (poses, anchors, cam texture) │
 │   UNITY (UaaL) ◀────────┘  render/UX layer + live-cue overlays │
 │                         │                                       │
 │   ENCODE + PUBLISH                                              │
 │   • H.265 RGB track  • depth track (16-bit packed)             │
 │   • DataChannel: intrinsics+pose+AR stats+RoomPlan (reliable)  │
 └───────────────────────────────┬───────────────────────────────┘
                                  │ WebRTC (media + data)  + TURN
                                  ▼
 ┌──────────────── INGEST EDGE (always-on, NOT Modal) ───────────┐
 │  LiveKit / Pion SFU terminates WebRTC, demuxes streams         │
 │  FRAME GOVERNOR: perceptual-hash + motion gating → keyframes   │
 │  ├─ LIVE-CUE LANE (cheap, <300ms) ──▶ back to app via DataChan │
 │  │    coverage/quality, "missed corner", basic object tags     │
 │  └─ INGEST LANE ─▶ write raw capture bundle to S3 + enqueue    │
 └───────────────┬───────────────────────────────┬───────────────┘
                 │ raw bundle                      │ job messages
                 ▼                                 ▼
        ┌──────────────┐              ┌──────────────────────────┐
        │   S3 (raw)   │              │  Modal queue (near-line + │
        │  per-session │◀────────────▶│  batch GPU functions)     │
        └──────┬───────┘              │                           │
               │                      │  NEAR-LINE (sec–min):     │
               │                      │   • SAM2 track/masks      │
               │                      │   • Cosmos3 captions/QA   │
               │                      │   • Sapiens2 (if person)  │
               │                      │   • depth+pose → cloud    │
               │                      │   • structured scene facts│
               │                      │        │ feed back        │
               │                      │        ▼ Cosmos3 reason    │
               │                      │  BATCH (min–hr):          │
               │                      │   • 3D Gaussian Splats    │
               │                      │   • mesh fuse/optimize    │
               │                      │   • progressive markdown  │
               │                      │   • (optional) RAG embed  │
               │                      └───────────┬───────────────┘
               ▼                                  ▼
        ┌─────────────────────────────────────────────────────────┐
        │  S3 (derived): splats, meshes, masks, scene-facts JSON,   │
        │  progressive markdown library (TOC+index), embeddings     │
        └─────────────────────────────────────────────────────────┘

2. Capture client (iOS): one session owner

The single hard rule, validated against Apple's API and the documented camera-hijack bug: exactly one component owns the ARSession, and it is native Swift — not Unity's AR Foundation.

Why native owns it:

RoomPlan is a native-only framework; AR Foundation doesn't surface it. RoomCaptureSession(arSession:) (iOS 17+) accepts an ARSession that you create and configure with ARWorldTrackingConfiguration, and "preserves all of the AR session's settings". So you configure sceneReconstruction = .mesh, frameSemantics = [.sceneDepth] (or .smoothedSceneDepth), and planeDetection = [.horizontal, .vertical] once, hand the session to RoomPlan, and get the parametric room plus the raw mesh/plane/depth from the same session.
If Unity's ARKit plugin also tries to own a session, you hit the documented black-camera-feed conflict (two sessions fighting for the camera). One owner avoids it entirely.

Unity's role: embedded via Unity as a Library (UaaL) for the interactive 3D/AR UI and live-cue overlays. It is a passive renderer — the native core feeds it the camera texture, world poses, and anchor updates over a bridge; Unity never starts its own ARSession. (Alternative: native plugin inside a Unity-hosted app, but the session must still be created and owned on the native side.)

What the capture core emits, all keyed to ARFrame timestamp + the ARKit world coordinate frame (gravity-aligned, origin at session start — store the session's world-anchor so the server can register everything):

Stream	Source	Rate	Transport
RGB video	ARFrame `capturedImage`	30–60 fps	WebRTC video track, HEVC (VideoToolbox HW encode)
Depth	`sceneDepth.depthMap` (LiDAR, ~256×192)	10–30 fps	2nd video track, 16-bit depth packed into frame, or compressed on data channel
Intrinsics + pose	`camera.intrinsics`, `transform`	per-frame	DataChannel (reliable, ordered)
Feature points	`rawFeaturePoints`	throttled	DataChannel (deltas)
Planes / meshes	`ARPlaneAnchor`, `ARMeshAnchor`	on add/update only	DataChannel (deltas, throttled)
Room model	`CapturedRoom` / `CapturedStructure`	on update	DataChannel (small, parametric)
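The "on add/update only" rows in the table can be sketched as a small change tracker that remembers what was last sent for each anchor. This is a minimal sketch, assuming each anchor arrives as an ID plus a JSON-serializable payload; `AnchorDeltaTracker` and the `upsert`/`remove` message shape are hypothetical and not part of ARKit or WebRTC.

```python
import hashlib
import json


class AnchorDeltaTracker:
    """Remembers a digest of each anchor's last-sent payload (hypothetical helper)."""

    def __init__(self):
        self._sent = {}  # anchor_id -> digest of the payload last sent

    def delta(self, anchors):
        """Return only anchors that are new, changed, or removed since the last call.

        `anchors` maps anchor_id -> JSON-serializable payload for the current frame.
        """
        changed = {}
        for anchor_id, payload in anchors.items():
            blob = json.dumps(payload, sort_keys=True).encode("utf-8")
            digest = hashlib.sha256(blob).hexdigest()
            if self._sent.get(anchor_id) != digest:
                changed[anchor_id] = payload
                self._sent[anchor_id] = digest
        # Anchors we sent before that no longer exist are reported as removals.
        removed = [a for a in self._sent if a not in anchors]
        for anchor_id in removed:
            del self._sent[anchor_id]
        return {"upsert": changed, "remove": sorted(removed)}
```

Calling `delta()` once per throttle tick, rather than once per ARFrame, keeps the reliable data channel from backing up behind large mesh updates.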

3. Transport: WebRTC design (and the honest Modal caveat)

One RTCPeerConnection per session:

Media: HEVC RGB track + a packed-depth track. Depth is encoded as 16-bit (e.g., depth split across two 8-bit channels) so a video codec can carry it without quantizing to 8-bit; reconstruct server-side. Subsample depth to the rate fusion actually needs (10–15 fps is plenty for a TSDF).
Data channels: meta (reliable, ordered) for intrinsics/pose/AR stats/RoomPlan deltas; an optional cues channel server→client for the live hints. Send geometry as deltas (only new/changed anchors) — full mesh every frame will saturate mobile uplink.
Signaling + NAT: a small WebSocket signaling service + a TURN server (coturn) for relay.
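The two-channel depth packing can be sketched as follows. ARKit's sceneDepth map holds 32-bit float depths in meters, so the sketch first quantizes to 16-bit millimetres and then splits each value into a high and a low byte plane. The millimetre scale, the "0 means no depth" convention, and the function names are assumptions for illustration.

```python
import math

MAX_MM = 65535  # largest depth representable in 16 bits, in millimetres


def pack_depth(depth_m):
    """Quantize depths (meters) to uint16 millimetres; return (high, low) byte planes."""
    high, low = bytearray(), bytearray()
    for d in depth_m:
        # Treat NaN/inf/negative readings as "no depth" (0), an assumed convention.
        mm = 0 if not math.isfinite(d) or d < 0 else min(round(d * 1000.0), MAX_MM)
        high.append(mm >> 8)
        low.append(mm & 0xFF)
    return bytes(high), bytes(low)


def unpack_depth(high, low):
    """Rebuild depths in meters from the two byte planes (server side)."""
    return [((h << 8) | l) / 1000.0 for h, l in zip(high, low)]
```

One caution on this design: a lossy video codec perturbs pixel values, and an error in the high-byte plane corresponds to a jump of 256 mm. The packed track therefore likely needs lossless or near-lossless encoder settings, or a packing that tolerates small errors; that choice should be validated with a round-trip test on real captures.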

⚠️ Modal caveat — be explicit about this. WebRTC media ingest needs UDP, ICE, and stable long-lived endpoints. Modal's serverless/HTTP model is not a good fit for terminating WebRTC. So "Modal-first" applies to GPU compute and batch, not the media edge. Run the WebRTC termination on an always-on media server — LiveKit (self-host on a small EC2/Fly instance, or LiveKit Cloud) or a custom Pion (Go) / aiortc (Python) ingest — which demuxes frames and then calls Modal GPU functions / drops bundles to S3. This is the one piece that lives outside Modal.

4. Ingest edge: frame governor + the two lanes

At the media server (this is where the frame governor from the earlier design lives):

Frame governor: perceptual hash + optical-flow/motion gating tags each frame {keyframe?, motion, coverage-delta}. Keyframes trigger the heavy GPU stages; SAM2 is the exception and gets continuous frames for tracking (it has its own streaming memory, so don't starve it).
Live-cue lane (cheap, sub-300 ms, → app): this is the only thing the user sees in real time, and it's intentionally light:
- Coverage/quality from RoomPlan completedEdges + ARMesh coverage → "you haven't captured the back wall / far-right corner."
- A periodic (every few seconds) Cosmos 3 Nano reasoner call on a keyframe for semantic gap detection ("ceiling not scanned," "mirror may be corrupting geometry") — VLM strength, not geometry.
- Optional lightweight detector for basic object tags.
- Results pushed back over the cues data channel and rendered by Unity.
Ingest lane: assemble the raw capture bundle and write to S3; enqueue near-line + batch jobs.
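The governor's keyframe decision can be sketched with a simple average hash standing in for the perceptual hash. This is a sketch under stated assumptions: frames arrive as small grayscale thumbnails (lists of rows), `motion` is a precomputed scalar such as pose translation since the last frame, and both thresholds are placeholder values. The coverage-delta tag is left out.

```python
def average_hash(gray):
    """Average hash: one bit per pixel, set when the pixel is brighter than the mean."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


class FrameGovernor:
    """Tags frames; a keyframe needs enough visual change AND enough motion."""

    def __init__(self, min_hash_distance=10, min_motion=0.05):
        self.min_hash_distance = min_hash_distance  # assumed threshold, in bits
        self.min_motion = min_motion                # assumed threshold, e.g. meters
        self._last_key_hash = None

    def tag(self, gray, motion):
        h = average_hash(gray)
        if self._last_key_hash is None:
            distance, is_key = None, True  # the first frame is always a keyframe
        else:
            distance = hamming(h, self._last_key_hash)
            is_key = distance >= self.min_hash_distance and motion >= self.min_motion
        if is_key:
            self._last_key_hash = h
        return {"keyframe": is_key, "motion": motion, "hash_distance": distance}
```

Comparing against the last keyframe, not the previous frame, keeps slow pans from never crossing the threshold.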

5. Storage: S3 layout

s3://spatial/{tenant}/{session_id}/
  manifest.json            # session meta, world-frame anchor, device, timestamps, codec params
  raw/
    rgb.mp4                # HEVC
    depth.mp4              # 16-bit packed depth
    track.parquet          # per-frame: ts, intrinsics, pose (4x4)
    points.parquet         # sparse feature points (deltas merged)
    planes.jsonl           # ARPlaneAnchor snapshots
    meshes/                # ARMeshAnchor chunks (per anchor id)
    room.usdz / room.json  # CapturedRoom / CapturedStructure (parametric)
    keyframes/             # extracted JPEGs at governor-selected frames
  derived/
    cloud.ply              # fused world point cloud
    mesh_opt.glb           # optimized mesh
    splat.ksplat           # gaussian splat
    masks/                 # SAM2 instance masks + track ids
    scene_facts.json       # structured objects/surfaces/measurements/captions
    kb/                    # progressive markdown library (see §8)
    embeddings/            # optional RAG vectors

Everything in derived/ is regenerable from raw/ — so raw is the durable asset; reprocessing improves results as models improve.
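A small helper can keep every writer on the same key layout. The following is a sketch only: the bucket name and the raw/ and derived/ prefixes come from the layout above, while the function names and the validation rule are hypothetical.

```python
BUCKET = "spatial"  # bucket name taken from the layout above


def session_prefix(tenant, session_id):
    """Build the per-session prefix, rejecting empty or slash-containing parts."""
    for part in (tenant, session_id):
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return f"s3://{BUCKET}/{tenant}/{session_id}/"


def raw_uri(tenant, session_id, name):
    """URI of a durable capture asset, e.g. name='track.parquet'."""
    return session_prefix(tenant, session_id) + "raw/" + name


def derived_uri(tenant, session_id, name):
    """URI of a regenerable asset, e.g. name='kb/index.md'."""
    return session_prefix(tenant, session_id) + "derived/" + name


def is_regenerable(uri):
    """True for anything under derived/, which can be rebuilt from raw/."""
    # 's3://spatial/t/s/derived/x' splits into ['s3:', '', 'spatial', 't', 's', 'derived', 'x']
    parts = uri.split("/")
    return len(parts) > 5 and parts[5] == "derived"
```

A retention or lifecycle job could use `is_regenerable` to decide which objects are safe to expire and rebuild later.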

6. GPU processing on Modal

Each stage is a Modal function (scale-to-zero GPU, S3 in/out), wired through Modal's queue. Two tiers by latency tolerance.

Near-line (seconds–minutes after keyframes land):

SAM2 — segment + track objects across keyframes → instance masks + stable IDs. Detection/re-seed on keyframes, propagate via SAM2 memory between them.
Cosmos 3 reasoner NIM — captions, object/event Q&A, surface labels. Runs as the NIM container (one docker run, no Kubernetes) hosted as a Modal GPU service.
Sapiens2 — conditional: only when a person is detected in frame (occupant/contractor). Pose/parts/normals/pointmap, masked to human pixels via the segmentation head, scale-normalized — useful for people in the scene, irrelevant to empty-room geometry. Don't run it on empty rooms.
Geometry fusion — transform per-frame depth into world frame via poses → accumulate TSDF → point cloud. ARKit already gives metric depth + mesh, so this is fusion, not learned depth; a metric-depth model only fills LiDAR gaps if needed.
Structured scene facts — merge geometry + masks + room model into scene_facts.json (objects with bbox/dimensions, surfaces with plane eqn + normal + corners, measurements, links to keyframes/masks).
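The first step of geometry fusion, lifting one depth pixel into the world frame, can be sketched as below. Assumptions to check against real data: the intrinsics are already scaled to the depth map's resolution (ARKit reports them for the full-resolution RGB image), the pose is a row-major 4x4 camera-to-world matrix, and camera space follows ARKit's convention of +x right, +y up, with the camera looking down -z. The function name is hypothetical, and TSDF accumulation is not shown.

```python
def unproject_to_world(u, v, depth_m, intrinsics, camera_to_world):
    """Map pixel (u, v) with metric depth to a 3D point in the world frame.

    intrinsics: (fx, fy, cx, cy) in depth-map pixels.
    camera_to_world: 4x4 row-major matrix as nested lists.
    """
    fx, fy, cx, cy = intrinsics
    # Image v grows downward while camera +y points up, hence the sign flip on y.
    x_cam = (u - cx) / fx * depth_m
    y_cam = -(v - cy) / fy * depth_m
    z_cam = -depth_m  # the camera looks along -z
    p = (x_cam, y_cam, z_cam, 1.0)
    return tuple(
        sum(camera_to_world[r][c] * p[c] for c in range(4)) for r in range(3)
    )
```

Running this over every valid pixel of every retained depth frame yields the points that the TSDF or point-cloud accumulator consumes.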

The Cosmos feedback loop: scene facts + keyframes are assembled into a structured context and fed back into Cosmos 3 for higher-level reasoning (labeling surfaces, answering "what/why/where," flagging anomalies). Critically, Cosmos consumes the computed numbers as context; it never generates coordinates, because VLMs hallucinate geometry (see §9). Cosmos is a captioner early and a reasoner over structured data late.

Batch (minutes–hours, the "fire up a queue" tier):

3D Gaussian Splatting — you have posed RGB + intrinsics, so skip SfM: pose-known 3DGS training → .ksplat/.ply. GPU-heavy, classic batch job.
Mesh optimization — fuse ARMeshAnchors + depth → Poisson/TSDF mesh → simplify/retopo; refine RoomPlan planes via RANSAC on the fused cloud and recover missed corners by intersecting adjacent fitted planes (RoomPlan ~±5 cm, rectangular, no ceilings — your primitives fix exactly these).
Progressive markdown KB (§8) + optional RAG embeddings.
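Recovering a missed corner by intersecting fitted planes reduces to solving a 3x3 linear system: two adjacent walls plus the floor meet at one point. The sketch below uses Cramer's rule and assumes each plane is given as (a, b, c, d) with a·x + b·y + c·z + d = 0; the RANSAC fitting that produces those planes is not shown, and the function names are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


def intersect_planes(p1, p2, p3, eps=1e-9):
    """Return the point shared by three planes, or None if they are (near) parallel."""
    normals = [list(p[:3]) for p in (p1, p2, p3)]
    rhs = [-p[3] for p in (p1, p2, p3)]  # n . x = -d for each plane
    d = det3(normals)
    if abs(d) < eps:
        return None  # no unique intersection, e.g. two parallel walls
    point = []
    for col in range(3):
        m = [row[:] for row in normals]
        for r in range(3):
            m[r][col] = rhs[r]  # Cramer's rule: swap one column for the right-hand side
        point.append(det3(m) / d)
    return tuple(point)
```

For example, the walls x = 4 and z = -3 with the floor y = 0 are `(1, 0, 0, -4)`, `(0, 0, 1, 3)`, and `(0, 1, 0, 0)`, and they intersect at the corner (4, 0, -3).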

Orchestration: Modal queues + spawned jobs; a small state row per session (DynamoDB/Postgres) tracks stage completion so the app/UX can show "reconstruction ready," "knowledge base ready."
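The per-session state row can be sketched as a set of completed stages from which the UX flags are derived. The stage names and the grouping of stages into "reconstruction ready" and "knowledge base ready" are assumptions for illustration; a real row would live in DynamoDB or Postgres rather than in memory.

```python
from dataclasses import dataclass, field

# Assumed stage groupings; adjust to the stages actually deployed.
RECONSTRUCTION_STAGES = frozenset({"fusion", "mesh_opt", "splat"})
KB_STAGES = frozenset({"scene_facts", "kb"})
ALL_STAGES = RECONSTRUCTION_STAGES | KB_STAGES | {"sam2", "captions", "embeddings"}


@dataclass
class SessionState:
    session_id: str
    done: set = field(default_factory=set)

    def mark_done(self, stage):
        """Record a finished stage; unknown names are rejected to catch typos."""
        if stage not in ALL_STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self.done.add(stage)

    def ready_flags(self):
        """Flags the app can poll to show progress to the user."""
        return {
            "reconstruction_ready": RECONSTRUCTION_STAGES <= self.done,
            "knowledge_base_ready": KB_STAGES <= self.done,
        }
```

Each Modal function would call the equivalent of `mark_done` as its last step, so a retried job simply re-adds a stage that is already recorded.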

7. Real-time vs. deferred — the boundary

Tier	Budget	Work	Where
Live cues	<300 ms	coverage/quality, gap reasoning, basic tags	Ingest edge + occasional Cosmos Nano
Near-line	sec–min	SAM2, Cosmos captions/QA, Sapiens2 (if person), fusion, scene facts	Modal GPU
Batch	min–hr	splats, mesh optimization, full KB, embeddings	Modal queue

Rule of thumb you stated, formalized: if it can't finish inside the live-cue budget, it doesn't block the scan — it lands in S3 and gets queued. The scan UX only ever waits on the cheap lane.

8. Knowledge base: progressive markdown (source of truth)

Markdown-first, authored by Cosmos 3 / an LLM over scene_facts.json + keyframes, with explicit detail tiers and a navigable index:

kb/
  index.md            # TOC + entity index (rooms, surfaces, objects) + asset links
  rooms/
    living_room.md     # L1: dimensions, surfaces, contents, summary
  surfaces/
    living_room.north_wall.md   # L2: plane eqn, normal, corners, openings, condition
  objects/
    sofa_01.md         # L2: bbox, dimensions, material guess, linked masks/keyframes
  tasks/
    leak_under_sink.md # L3: anomaly/repair notes, references to frames + measurements

Each doc carries front-matter (IDs, S3 URIs, confidence, world-frame coordinates) so it's both human-browsable and clean LLM context. Detail is progressive: L0 index → L1 room summaries → L2 per-entity detail → L3 task/anomaly notes. New scans append/refine rather than overwrite.
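Rendering one KB document with front-matter might look like the sketch below. The field names (`id`, `tier`, `confidence`, `bbox_world_m`) are hypothetical examples of the metadata described above, and values are written as JSON, which is also valid YAML flow syntax for scalars and lists.

```python
import json


def render_kb_doc(title, front_matter, body):
    """Render one KB document: front-matter block, a heading, then the body."""
    lines = ["---"]
    for key in sorted(front_matter):
        # Sorted keys keep the output stable, so re-renders produce clean diffs.
        lines.append(f"{key}: {json.dumps(front_matter[key])}")
    lines += ["---", "", f"# {title}", "", body.strip(), ""]
    return "\n".join(lines)


doc = render_kb_doc(
    "sofa_01",
    {
        "id": "sofa_01",
        "tier": "L2",
        "confidence": 0.82,
        "bbox_world_m": [1.0, 0.0, -2.0, 3.1, 0.9, -1.1],
    },
    "Three-seat sofa against the north wall.",
)
```

Because the coordinates come from `scene_facts.json` and are only copied into the front-matter, the authoring model never has to produce numbers itself.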

Markdown-first vs. RAG — the recommendation and when to flip. Start markdown-first: it's inspectable, diffable, trivially fed to any LLM, and for a single home the whole index + relevant docs fit comfortably in a modern context window — so you get grounded Q&A without a vector store. Derive RAG from the markdown when scale crosses a threshold: many homes/sessions, cross-property search, or a corpus too large to fit in context. At that point chunk the markdown (it's already cleanly sectioned) → embed → vector DB, keeping markdown as the source of truth and RAG as an index over it. So you're not choosing — you're sequencing. The case for RAG earlier: if you expect open-ended retrieval across a large multi-home library from day one, build the embedding step into the batch tier now; the markdown structure makes that a small add.
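The "chunk the markdown" step can be sketched as a split on heading lines, which works because the KB docs are already cleanly sectioned. This is a minimal sketch: it assumes `#`-style headings, does not special-case front-matter or `#` characters inside code blocks, and leaves embedding and the vector DB out entirely. The chunk dictionary shape is hypothetical.

```python
def chunk_markdown(text, source):
    """Split a markdown document into one chunk per heading section."""
    chunks = []
    heading, buf = "(preamble)", []
    # A trailing sentinel heading flushes the final section.
    for line in text.splitlines() + ["# "]:
        if line.startswith("#"):
            body = "\n".join(buf).strip()
            if body:  # skip empty sections
                chunks.append({"source": source, "heading": heading, "text": body})
            heading, buf = line.lstrip("#").strip(), []
        else:
            buf.append(line)
    return chunks
```

Keeping `source` on every chunk lets a retrieval hit link back to the markdown file, which stays the source of truth.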

9. Key risks & honest caveats

Session ownership (highest risk). One native ARSession owner; Unity is a passive renderer (UaaL). Mixing in AR Foundation's own session causes the documented black-camera bug.
WebRTC ≠ Modal. Media termination needs an always-on server (LiveKit/Pion). Modal does GPU + batch behind it. Budget for a small always-on edge + TURN.
Mobile uplink + thermals. ARKit + RoomPlan + mesh + depth + HEVC encode + WebRTC is a lot of device load → expect limited continuous capture (minutes) and thermal throttling. Send geometry as deltas; subsample depth. RoomPlan itself recommends ≤ ~9×9 m per pass.
Sapiens2 is conditional. Empty rooms don't need it; gate on person detection. And its pointmap is human-only + scale-normalized, not room geometry.
RoomPlan accuracy. ~±5 cm/wall, rectangular simplification, 16 cm uniform wall thickness, no ceilings — treat it as a prior and refine with fused primitives.
Cosmos never emits coordinates. It reasons over computed geometry; geometry comes from ARKit fusion + RANSAC, not the VLM.
Privacy/security. Streaming home interiors to the cloud → explicit consent, encryption in transit (DTLS/SRTP already in WebRTC) and at rest (S3 SSE), per-tenant isolation, and a retention policy on raw video.

10. Suggested build order

1. Capture core + local record: native ARSession (mesh+depth+planes) + RoomPlan, write a bundle locally. Prove single-session ownership and data completeness before any networking.
2. S3 upload + manifest: offline bundle → S3; stand up the derived-asset schema.
3. Batch first, not realtime: run fusion + mesh refinement + 3DGS + markdown KB on uploaded bundles via Modal. This delivers the core value (model + knowledge base) with zero realtime complexity.
4. WebRTC edge: add LiveKit/Pion ingest + TURN; stream the same bundle live; move ingestion online.
5. Live-cue lane: coverage hints + occasional Cosmos Nano gap reasoning back to the app.
6. Near-line lane + Cosmos feedback loop: SAM2/captions/scene-facts → Cosmos reasoning; surface "ready" states in UX.
7. RAG (if/when scale demands): derive embeddings from the markdown.

Build the value (3 = reconstruction + KB) before the plumbing (4–6). The realtime stream is an optimization over a pipeline that should already work on uploaded captures.