Spatial Capture & Understanding System — Design
iOS (Unity + ARKit + RoomPlan) → WebRTC → Modal GPU pipeline (Cosmos 3 / SAM2 / Sapiens2) → S3 → batch enrichment → progressive-markdown knowledge base.
Design choices locked for v1: live feedback = light cues only; compute = Modal-first; knowledge output = progressive markdown (source of truth), RAG derived later.
1. System at a glance
┌───────────────────────── iOS DEVICE ──────────────────────────┐
│ NATIVE CAPTURE CORE (Swift) — owns the ONE ARSession │
│ • ARWorldTrackingConfiguration: mesh + sceneDepth + planes │
│ • RoomCaptureSession(arSession:) driven from that session │
│ • Harvests: RGB, LiDAR depth, intrinsics, pose, points, │
│ ARPlaneAnchor, ARMeshAnchor, CapturedRoom/Structure │
│ │ bridge (poses, anchors, cam texture) │
│ UNITY (UaaL) ◀────────┘ render/UX layer + live-cue overlays │
│ │ │
│ ENCODE + PUBLISH │
│ • H.265 RGB track • depth track (16-bit packed) │
│ • DataChannel: intrinsics+pose+AR stats+RoomPlan (reliable) │
└───────────────────────────────┬───────────────────────────────┘
│ WebRTC (media + data) + TURN
▼
┌──────────────── INGEST EDGE (always-on, NOT Modal) ───────────┐
│ LiveKit / Pion SFU terminates WebRTC, demuxes streams │
│ FRAME GOVERNOR: perceptual-hash + motion gating → keyframes │
│ ├─ LIVE-CUE LANE (cheap, <300ms) ──▶ back to app via DataChan │
│ │ coverage/quality, "missed corner", basic object tags │
│ └─ INGEST LANE ─▶ write raw capture bundle to S3 + enqueue │
└───────────────┬───────────────────────────────┬───────────────┘
│ raw bundle │ job messages
▼ ▼
┌──────────────┐ ┌──────────────────────────┐
│ S3 (raw) │ │ Modal queue (near-line + │
│ per-session │◀────────────▶│ batch GPU functions) │
└──────┬───────┘ │ │
│ │ NEAR-LINE (sec–min): │
│ │ • SAM2 track/masks │
│ │ • Cosmos3 captions/QA │
│ │ • Sapiens2 (if person) │
│ │ • depth+pose → cloud │
│ │ • structured scene facts│
│ │ │ feed back │
│ │ ▼ Cosmos3 reason │
│ │ BATCH (min–hr): │
│ │ • 3D Gaussian Splats │
│ │ • mesh fuse/optimize │
│ │ • progressive markdown │
│ │ • (optional) RAG embed │
│ └───────────┬───────────────┘
▼ ▼
┌─────────────────────────────────────────────────────────┐
│ S3 (derived): splats, meshes, masks, scene-facts JSON, │
│ progressive markdown library (TOC+index), embeddings │
└─────────────────────────────────────────────────────────┘
2. Capture client (iOS): one session owner
The single hard rule, validated against Apple's API and the documented camera-hijack bug: exactly one component owns the ARSession, and it is native Swift — not Unity's AR Foundation.
Why native owns it:
- RoomPlan is a native-only framework; AR Foundation doesn't surface it.
- RoomCaptureSession(arSession:) accepts your ARWorldTrackingConfiguration session and "preserves all of the AR session's settings", so you configure sceneReconstruction = .mesh, frameSemantics = [.sceneDepth] (or .smoothedSceneDepth), and planeDetection = [.horizontal, .vertical] once, hand it to RoomPlan, and get the parametric room and the raw mesh/plane/depth from the same session.
- If Unity's ARKit plugin also tries to own a session, you hit the documented black-camera-feed conflict (two sessions fighting for the camera). One owner avoids it entirely.
Unity's role: embedded via Unity as a Library (UaaL) for the interactive 3D/AR UI and live-cue overlays. It is a passive renderer — the native core feeds it the camera texture, world poses, and anchor updates over a bridge; Unity never starts its own ARSession. (Alternative: native plugin inside a Unity-hosted app, but the session must still be created and owned on the native side.)
What the capture core emits, all keyed to ARFrame timestamp + the ARKit world coordinate frame (gravity-aligned, origin at session start — store the session's world-anchor so the server can register everything):
| Stream | Source | Rate | Transport |
|---|---|---|---|
| RGB video | ARFrame capturedImage | 30–60 fps | WebRTC video track, HEVC (VideoToolbox HW encode) |
| Depth | sceneDepth.depthMap (LiDAR, ~256×192) | 10–30 fps | 2nd video track, 16-bit depth packed into frame, or compressed on data channel |
| Intrinsics + pose | camera.intrinsics, transform | per-frame | DataChannel (reliable, ordered) |
| Feature points | rawFeaturePoints | throttled | DataChannel (deltas) |
| Planes / meshes | ARPlaneAnchor, ARMeshAnchor | on add/update only | DataChannel (deltas, throttled) |
| Room model | CapturedRoom / CapturedStructure | on update | DataChannel (small, parametric) |
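The "deltas" entries in the table can be illustrated with a small change tracker. This is a minimal sketch in Python for readability only (the capture core itself is Swift); the class name `AnchorDeltaTracker`, the `upsert`/`removed` message shape, and the payload fields are all hypothetical, not something the design specifies.

```python
import hashlib
import json


class AnchorDeltaTracker:
    """Remembers a content hash per anchor id and emits only what changed."""

    def __init__(self):
        self._seen = {}  # anchor id -> hash of the last payload sent

    def diff(self, anchors):
        """anchors: dict of anchor id -> JSON-serializable payload.

        Returns a delta message with added/updated payloads and removed ids.
        """
        changed = {}
        for anchor_id, payload in anchors.items():
            digest = hashlib.sha256(
                json.dumps(payload, sort_keys=True).encode("utf-8")
            ).hexdigest()
            if self._seen.get(anchor_id) != digest:
                changed[anchor_id] = payload
                self._seen[anchor_id] = digest
        removed = [a for a in self._seen if a not in anchors]
        for anchor_id in removed:
            del self._seen[anchor_id]
        return {"upsert": changed, "removed": sorted(removed)}
```

Calling `diff` once per throttled update with the current plane or mesh anchors would yield a message containing only new or changed anchors, which is the behaviour the table asks for.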
3. Transport: WebRTC design (and the honest Modal caveat)
One RTCPeerConnection per session:
- Media: HEVC RGB track + a packed-depth track. Depth is encoded as 16-bit (e.g., depth split across two 8-bit channels) so a video codec can carry it without quantizing to 8-bit; reconstruct server-side. Subsample depth to the rate fusion actually needs (10–15 fps is plenty for a TSDF).
- Data channels: meta (reliable, ordered) for intrinsics/pose/AR stats/RoomPlan deltas; an optional cues channel server→client for the live hints. Send geometry as deltas (only new/changed anchors), because a full mesh every frame will saturate the mobile uplink.
- Signaling + NAT: a small WebSocket signaling service + a TURN server (coturn) for relay.
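The 16-bit depth packing mentioned above can be sketched as a split into a high-byte plane and a low-byte plane. This is an illustration under stated assumptions: depth is quantized to millimeters (so the range is 0 to 65.535 m), and the function names are hypothetical. The design does not fix the quantization step or the plane layout.

```python
import math


def pack_depth_mm(depth_m, max_mm=65535):
    """Quantize metric depth (meters) to 16-bit millimeters, split into two bytes.

    Non-finite or negative depths map to 0 (an assumed "invalid" marker).
    """
    if not math.isfinite(depth_m) or depth_m < 0:
        return 0, 0
    mm = min(int(round(depth_m * 1000.0)), max_mm)
    return (mm >> 8) & 0xFF, mm & 0xFF


def unpack_depth_mm(high, low):
    """Rebuild depth in meters from the two 8-bit channel values."""
    return ((high << 8) | low) / 1000.0


def pack_frame(depth_rows):
    """depth_rows: rows of depths in meters -> (high plane, low plane) as bytes."""
    high_plane, low_plane = bytearray(), bytearray()
    for row in depth_rows:
        for depth in row:
            high, low = pack_depth_mm(depth)
            high_plane.append(high)
            low_plane.append(low)
    return bytes(high_plane), bytes(low_plane)
```

The round trip is exact only if both planes survive encoding unchanged. A lossy video codec can perturb either plane, and an error in the high byte shifts depth by multiples of 256 mm, so the codec settings for the depth track are worth validating before relying on this layout.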
⚠️ Modal caveat — be explicit about this. WebRTC media ingest needs UDP, ICE, and stable long-lived endpoints. Modal's serverless/HTTP model is not a good fit for terminating WebRTC. So "Modal-first" applies to GPU compute and batch, not the media edge. Run the WebRTC termination on an always-on media server — LiveKit (self-host on a small EC2/Fly instance, or LiveKit Cloud) or a custom Pion (Go) / aiortc (Python) ingest — which demuxes frames and then calls Modal GPU functions / drops bundles to S3. This is the one piece that lives outside Modal.
4. Ingest edge: frame governor + the two lanes
At the media server (this is where the frame governor from the earlier design lives):
- Frame governor: perceptual hash + optical-flow/motion gating tags each frame {keyframe?, motion, coverage-delta}. Only keyframes trigger GPU work; SAM2 gets continuous frames for tracking (it has its own streaming memory, so don't starve it).
- Live-cue lane (cheap, sub-300 ms, → app): this is the only thing the user sees in real time, and it's intentionally light:
  - Coverage/quality from RoomPlan completedEdges + ARMesh coverage → "you haven't captured the back wall / far-right corner."
  - A periodic (every few seconds) Cosmos 3 Nano reasoner call on a keyframe for semantic gap detection ("ceiling not scanned," "mirror may be corrupting geometry"); this is a VLM strength, not geometry.
  - Optional lightweight detector for basic object tags.
  - Results pushed back over the cues data channel and rendered by Unity.
- Ingest lane: assemble the raw capture bundle and write to S3; enqueue near-line + batch jobs.
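The frame governor's gating can be sketched as follows. This is a simplified illustration, not the design's implementation: it uses an average hash as the perceptual hash, substitutes camera translation from the ARKit pose for optical flow as the motion signal, and omits coverage-delta. The thresholds and the `FrameGovernor` name are assumptions.

```python
import math


def average_hash(gray):
    """gray: 2D list of 0-255 luminance values (e.g. an 8x8 thumbnail).

    Returns an int whose bits mark the pixels brighter than the mean.
    """
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


class FrameGovernor:
    def __init__(self, min_hash_distance=10, min_motion_m=0.15):
        self.min_hash_distance = min_hash_distance  # assumed threshold (bits)
        self.min_motion_m = min_motion_m            # assumed threshold (meters)
        self._last_hash = None
        self._last_pos = None

    def tag(self, gray, camera_pos):
        """Tag one frame; comparisons are against the last accepted keyframe."""
        frame_hash = average_hash(gray)
        if self._last_hash is None:
            keyframe, distance, motion = True, None, 0.0
        else:
            distance = hamming(frame_hash, self._last_hash)
            motion = math.dist(camera_pos, self._last_pos)
            keyframe = (distance >= self.min_hash_distance
                        or motion >= self.min_motion_m)
        if keyframe:
            self._last_hash, self._last_pos = frame_hash, tuple(camera_pos)
        return {"keyframe": keyframe, "motion": motion, "hash_distance": distance}
```

Comparing against the last keyframe rather than the previous frame means slow drift still accumulates into a new keyframe eventually; non-keyframes can still be forwarded to SAM2 for tracking.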
5. Storage: S3 layout
s3://spatial/{tenant}/{session_id}/
  manifest.json            # session meta, world-frame anchor, device, timestamps, codec params
  raw/
    rgb.mp4                # HEVC
    depth.mp4              # 16-bit packed depth
    track.parquet          # per-frame: ts, intrinsics, pose (4x4)
    points.parquet         # sparse feature points (deltas merged)
    planes.jsonl           # ARPlaneAnchor snapshots
    meshes/                # ARMeshAnchor chunks (per anchor id)
    room.usdz / room.json  # CapturedRoom / CapturedStructure (parametric)
    keyframes/             # extracted JPEGs at governor-selected frames
  derived/
    cloud.ply              # fused world point cloud
    mesh_opt.glb           # optimized mesh
    splat.ksplat           # gaussian splat
    masks/                 # SAM2 instance masks + track ids
    scene_facts.json       # structured objects/surfaces/measurements/captions
    kb/                    # progressive markdown library (see §8)
    embeddings/            # optional RAG vectors
Everything in derived/ is regenerable from raw/ — so raw is the durable asset; reprocessing improves results as models improve.
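A small helper can keep every writer and reader on the same key layout. The sketch below assumes the bucket name `spatial` from the layout above; the function names are hypothetical.

```python
def session_prefix(tenant, session_id, bucket="spatial"):
    """Build the per-session S3 prefix, rejecting components that would nest."""
    for part in (tenant, session_id):
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return f"s3://{bucket}/{tenant}/{session_id}/"


def artifact_uri(tenant, session_id, tier, name):
    """tier is 'raw' (durable capture) or 'derived' (regenerable outputs)."""
    if tier not in ("raw", "derived"):
        raise ValueError(f"unknown tier: {tier!r}")
    return f"{session_prefix(tenant, session_id)}{tier}/{name}"


def is_regenerable(uri):
    """True for anything under derived/, which can be rebuilt from raw/."""
    return "/derived/" in uri
```

For example, `artifact_uri("acme", "s1", "raw", "rgb.mp4")` gives `s3://spatial/acme/s1/raw/rgb.mp4`. A lifecycle or retention policy could use `is_regenerable` to treat the two tiers differently.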
6. GPU processing on Modal
Each stage is a Modal function (scale-to-zero GPU, S3 in/out), wired through Modal's queue. Two tiers by latency tolerance.
Near-line (seconds–minutes after keyframes land):
- SAM2: segment + track objects across keyframes → instance masks + stable IDs. Detection/re-seed on keyframes, propagate via SAM2 memory between them.
- Cosmos 3 reasoner NIM: captions, object/event Q&A, surface labels. Runs as the NIM container (one docker run, no Kubernetes) hosted as a Modal GPU service.
- Sapiens2 (conditional): only when a person is detected in frame (occupant/contractor). Pose/parts/normals/pointmap, masked to human pixels via the segmentation head, scale-normalized; useful for people in the scene, irrelevant to empty-room geometry. Don't run it on empty rooms.
- Geometry fusion: transform per-frame depth into world frame via poses → accumulate TSDF → point cloud. ARKit already gives metric depth + mesh, so this is fusion, not learned depth; a metric-depth model only fills LiDAR gaps if needed.
- Structured scene facts: merge geometry + masks + room model into scene_facts.json (objects with bbox/dimensions, surfaces with plane eqn + normal + corners, measurements, links to keyframes/masks).
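The first step of geometry fusion, lifting a depth pixel into the world frame, can be sketched with a plain pinhole model. Assumptions to check before use: this uses the common computer-vision camera convention (x right, y down, z forward) and a row-major 4x4 camera-to-world matrix. ARKit's camera looks down −Z with +Y up and its matrices are column-major, so an axis flip and a transpose would be needed there. The intrinsics must also be scaled from the RGB image resolution to the depth-map resolution.

```python
def unproject(u, v, depth_m, fx, fy, cx, cy):
    """Pixel (u, v) with metric depth -> 3D point in the camera frame.

    Pinhole model with +Z forward; fx, fy, cx, cy are in depth-map pixels.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def to_world(point_cam, cam_to_world):
    """Apply a row-major 4x4 camera-to-world transform to a camera-frame point."""
    x, y, z = point_cam
    return tuple(
        row[0] * x + row[1] * y + row[2] * z + row[3]
        for row in cam_to_world[:3]
    )
```

Running this over every valid depth pixel of every selected frame gives the raw world-frame samples that a TSDF or point-cloud accumulator would consume.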
The Cosmos feedback loop: scene facts + keyframes are assembled into a structured context and fed back into Cosmos 3 for higher-level reasoning — labeling surfaces, answering "what/why/where," flagging anomalies. Critically, Cosmos consumes the computed numbers as context; it never generates coordinates (VLMs hallucinate geometry — see the pointmap discussion). Cosmos is captioner early, reasoner-over-structured-data late.
Batch (minutes–hours, the "fire up a queue" tier):
- 3D Gaussian Splatting: you have posed RGB + intrinsics, so skip SfM and go straight to pose-known 3DGS training → .ksplat/.ply. GPU-heavy, classic batch job.
- Mesh optimization: fuse ARMeshAnchors + depth → Poisson/TSDF mesh → simplify/retopo; refine RoomPlan planes via RANSAC on the fused cloud and recover missed corners by intersecting adjacent fitted planes (RoomPlan is ~±5 cm, rectangular, and has no ceilings; your primitives fix exactly these).
- Progressive markdown KB (§8) + optional RAG embeddings.
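Recovering a missed corner by intersecting fitted planes reduces to solving a 3×3 linear system: two walls and the floor (or ceiling) meet at one point. A minimal sketch using Cramer's rule, assuming each plane is given as (normal, d) with normal · x = d; the function names are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested sequences."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def intersect_planes(p1, p2, p3, eps=1e-9):
    """Each plane is (normal, d) with normal . x = d.

    Returns the common point, or None if the planes are (near) parallel
    or otherwise fail to meet in a single point.
    """
    normals = [list(p[0]) for p in (p1, p2, p3)]
    ds = [p[1] for p in (p1, p2, p3)]
    det = det3(normals)
    if abs(det) < eps:
        return None
    point = []
    for col in range(3):
        m = [row[:] for row in normals]
        for r in range(3):
            m[r][col] = ds[r]  # Cramer's rule: swap one column for the offsets
        point.append(det3(m) / det)
    return tuple(point)
```

For instance, a wall at x = 0, a wall at z = 3, and a floor at y = 0 intersect at (0, 0, 3). With RANSAC-fitted planes the normals carry noise, so near-parallel pairs should be rejected with a larger `eps` than the default shown here.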
Orchestration: Modal queues + spawned jobs; a small state row per session (DynamoDB/Postgres) tracks stage completion so the app/UX can show "reconstruction ready," "knowledge base ready."
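The per-session state row can be illustrated in memory. In the sketch below the stage identifiers and the mapping from stages to the two UX states are assumptions; the design only names the UX states themselves, and a real row would live in DynamoDB or Postgres.

```python
from dataclasses import dataclass, field

# Hypothetical stage names and groupings; not specified by the design.
READY_STATES = {
    "reconstruction ready": {"fusion", "mesh_opt", "splat"},
    "knowledge base ready": {"scene_facts", "kb"},
}


@dataclass
class SessionState:
    session_id: str
    done: set = field(default_factory=set)

    def mark_done(self, stage):
        """Record that one pipeline stage finished for this session."""
        self.done.add(stage)

    def ready_states(self):
        """UX states whose required stages have all completed."""
        return [name for name, needed in READY_STATES.items()
                if needed <= self.done]
```

Each Modal job would call the equivalent of `mark_done` on completion, and the app would poll or subscribe to `ready_states`.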
7. Real-time vs. deferred — the boundary
| Tier | Budget | Work | Where |
|---|---|---|---|
| Live cues | <300 ms | coverage/quality, gap reasoning, basic tags | Ingest edge + occasional Cosmos Nano |
| Near-line | sec–min | SAM2, Cosmos captions/QA, Sapiens2 (if person), fusion, scene facts | Modal GPU |
| Batch | min–hr | splats, mesh optimization, full KB, embeddings | Modal queue |
Rule of thumb you stated, formalized: if it can't finish inside the live-cue budget, it doesn't block the scan — it lands in S3 and gets queued. The scan UX only ever waits on the cheap lane.
8. Knowledge base: progressive markdown (source of truth)
Markdown-first, authored by Cosmos 3 / an LLM over scene_facts.json + keyframes, with explicit detail tiers and a navigable index:
kb/
  index.md                        # TOC + entity index (rooms, surfaces, objects) + asset links
  rooms/
    living_room.md                # L1: dimensions, surfaces, contents, summary
  surfaces/
    living_room.north_wall.md     # L2: plane eqn, normal, corners, openings, condition
  objects/
    sofa_01.md                    # L2: bbox, dimensions, material guess, linked masks/keyframes
  tasks/
    leak_under_sink.md            # L3: anomaly/repair notes, references to frames + measurements
Each doc carries front-matter (IDs, S3 URIs, confidence, world-frame coordinates) so it's both human-browsable and clean LLM context. Detail is progressive: L0 index → L1 room summaries → L2 per-entity detail → L3 task/anomaly notes. New scans append/refine rather than overwrite.
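A sketch of how one KB document with front-matter might be rendered. The field names (`id`, `level`, `s3_uris`, `confidence`, `centroid_world`) are illustrative assumptions; the design lists the kinds of data carried, not a schema. Values are written as JSON scalars and arrays, which common YAML parsers accept as flow syntax.

```python
import json


def render_kb_doc(entity_id, level, s3_uris, confidence, centroid_world, body):
    """Return a markdown document with a front-matter header.

    level is the detail tier (0 index, 1 room, 2 entity, 3 task/anomaly).
    """
    front = {
        "id": entity_id,
        "level": level,
        "s3_uris": s3_uris,
        "confidence": confidence,
        "centroid_world": centroid_world,  # meters, ARKit world frame
    }
    lines = ["---"]
    for key, value in front.items():
        lines.append(f"{key}: {json.dumps(value)}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + body.rstrip() + "\n"
```

Because the header is machine-readable, a later scan can load an existing document, compare confidence values, and refine it in place instead of overwriting.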
Markdown-first vs. RAG — the recommendation and when to flip. Start markdown-first: it's inspectable, diffable, trivially fed to any LLM, and for a single home the whole index + relevant docs fit comfortably in a modern context window — so you get grounded Q&A without a vector store. Derive RAG from the markdown when scale crosses a threshold: many homes/sessions, cross-property search, or a corpus too large to fit in context. At that point chunk the markdown (it's already cleanly sectioned) → embed → vector DB, keeping markdown as the source of truth and RAG as an index over it. So you're not choosing — you're sequencing. The case for RAG earlier: if you expect open-ended retrieval across a large multi-home library from day one, build the embedding step into the batch tier now; the markdown structure makes that a small add.
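The "chunk the markdown" step can be sketched as a split on headings. This assumes ATX-style headings (lines starting with `#`) and no `#`-prefixed lines inside code blocks; the chunk dictionary shape is hypothetical, and embedding is left out.

```python
def chunk_markdown(text, source):
    """Split a markdown document into one chunk per heading section.

    Each chunk keeps its source path and heading so a retrieval hit
    can be traced back to the markdown, which stays the source of truth.
    """
    chunks = []
    heading, buffer = "", []

    def flush():
        body = "\n".join(buffer).strip()
        if body:
            chunks.append({"source": source, "heading": heading, "text": body})

    for line in text.splitlines():
        if line.startswith("#"):
            flush()
            heading = line.lstrip("#").strip()
            buffer = []
        else:
            buffer.append(line)
    flush()
    return chunks
```

Each returned chunk would then be embedded and written to the vector store with `source` and `heading` as metadata.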
9. Key risks & honest caveats
- Session ownership (highest risk). One native ARSession owner; Unity is a passive renderer (UaaL). Mixing in AR Foundation's own session causes the documented black-camera bug.
- WebRTC ≠ Modal. Media termination needs an always-on server (LiveKit/Pion). Modal does GPU + batch behind it. Budget for a small always-on edge + TURN.
- Mobile uplink + thermals. ARKit + RoomPlan + mesh + depth + HEVC encode + WebRTC is a lot of device load → expect limited continuous capture (minutes) and thermal throttling. Send geometry as deltas; subsample depth. RoomPlan itself recommends ≤ ~9×9 m per pass.
- Sapiens2 is conditional. Empty rooms don't need it; gate on person detection. And its pointmap is human-only + scale-normalized, not room geometry.
- RoomPlan accuracy. ~±5 cm/wall, rectangular simplification, 16 cm uniform wall thickness, no ceilings — treat it as a prior and refine with fused primitives.
- Cosmos never emits coordinates. It reasons over computed geometry; geometry comes from ARKit fusion + RANSAC, not the VLM.
- Privacy/security. Streaming home interiors to the cloud → explicit consent, encryption in transit (DTLS/SRTP already in WebRTC) and at rest (S3 SSE), per-tenant isolation, and a retention policy on raw video.
10. Suggested build order
1. Capture core + local record: native ARSession (mesh+depth+planes) + RoomPlan, write a bundle locally. Prove single-session ownership and data completeness before any networking.
2. S3 upload + manifest: offline bundle → S3; stand up the derived-asset schema.
3. Batch first, not realtime: run fusion + mesh refinement + 3DGS + markdown KB on uploaded bundles via Modal. This delivers the core value (model + knowledge base) with zero realtime complexity.
4. WebRTC edge: add LiveKit/Pion ingest + TURN; stream the same bundle live; move ingestion online.
5. Live-cue lane: coverage hints + occasional Cosmos Nano gap reasoning back to the app.
6. Near-line lane + Cosmos feedback loop: SAM2/captions/scene-facts → Cosmos reasoning; surface "ready" states in UX.
7. RAG (if/when scale demands): derive embeddings from the markdown.
Build the value (3 = reconstruction + KB) before the plumbing (4–6). The realtime stream is an optimization over a pipeline that should already work on uploaded captures.