Development Plan — Task/Test Pairs
Staged plan across the three tracks, written for execution with Claude Code: each task is a self-contained unit of work, and its paired test is the acceptance gate — Claude Code is "done" with a task only when the test passes. Every stage ends in a milestone with a concrete, demonstrable payoff.
Conventions
- Track 1 (Capture) → iOS app ~/gitlab/sparkyunity/iOS/TradesparkCapture
- Track 2 (Perception) and Track 3 (Reasoning) → backend ~/gitlab/grizzlebear/dev/ml/...
- Backend branch: all work on jh; deploy + test to the jh environment.
- iOS branch: feature branches off the TradesparkCapture mainline (suggest jh/<feature>), squash-merge per stage.
- Contracts first. The three schemas from the tracks doc (capture-bundle, scene_facts.json, data-pile/KB) are versioned specs in grizzlebear/dev/ml/contracts/. Every task codes against them; changing a contract is its own task with a version bump.
- Test types: unit (logic), integration (cross-component, in jh), eval (quality/accuracy on a fixed set), smoke (deploy/endpoint health). Each task names which.
- Task ID: S{stage}-{track}{n}, e.g. S1-T2a. SH = shared/cross-track.
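The task ID convention can be checked mechanically, for example in a CI step that rejects malformed IDs. The sketch below is one possible parser, not part of the plan itself. It assumes two spellings, both inferred from the IDs used in this document: the staged form (S1-T2a, S4-SHa) and the shared Stage 0 form (SH-0a). The names parse_task_id and TaskId are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumed grammar, inferred from the IDs that appear in this plan:
#   S{stage}-{track}{n}   e.g. S1-T2a, S4-SHa
#   SH-{stage}{n}         e.g. SH-0a (shared Stage 0 tasks)
STAGED = re.compile(r"^S(?P<stage>\d+)-(?P<track>T[123]|SH)(?P<n>[a-z])$")
SHARED = re.compile(r"^SH-(?P<stage>\d+)(?P<n>[a-z])$")


class TaskId(NamedTuple):
    stage: int
    track: str  # "T1", "T2", "T3", or "SH"
    n: str      # task letter within the stage/track


def parse_task_id(text: str) -> Optional[TaskId]:
    """Return the parsed ID, or None if the text matches neither form."""
    match = STAGED.match(text)
    if match:
        return TaskId(int(match["stage"]), match["track"], match["n"])
    match = SHARED.match(text)
    if match:
        return TaskId(int(match["stage"]), "SH", match["n"])
    return None
```

Returning None instead of raising keeps the helper usable as a simple filter over branch names or commit messages.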
A task is complete when: code merged to the right branch, its test is automated + passing in CI (or the eval threshold is met and recorded), and the artifact is where the contract says it should be.
Stage 0 — Foundations & Contracts
Milestone: the three contracts are frozen at v0, both repos build in CI, the jh backend env is reachable, and a stubbed bundle flows end-to-end. Payoff: every downstream task has something concrete to code against, and "hello world" runs in jh.
Shared
- SH-0a: Author v0 contracts (capture-bundle, scene_facts.json, data-pile) in dev/ml/contracts/ with JSON Schema + example fixtures. Test (unit): a schema-validation script passes on the example fixtures and fails on a deliberately broken one.
- SH-0b: Stand up the jh backend env + CI on the jh branch (lint, test, deploy hooks). Test (smoke): CI green on jh; a /health endpoint returns 200 in the jh env.
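The SH-0a acceptance test has a simple shape: a good fixture must pass and a deliberately broken one must fail. The sketch below illustrates that shape with a stdlib-only required-field and type check, which is a simplified stand-in for real JSON Schema validation. The field names schema_version and objects are hypothetical; the actual v0 contract in dev/ml/contracts/ defines the real ones.

```python
from typing import Any, Dict, List

# Hypothetical v0 shape for scene_facts.json. The real contract may differ;
# these two fields exist only to make the pass/fail pattern concrete.
REQUIRED_FIELDS: Dict[str, type] = {
    "schema_version": str,
    "objects": list,
}


def validation_errors(doc: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the document passes."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in doc:
            errors.append(f"missing field: {name}")
        elif not isinstance(doc[name], expected):
            errors.append(f"{name}: expected {expected.__name__}")
    return errors


# One fixture that should pass and one that should fail.
good_fixture = {"schema_version": "0", "objects": []}
broken_fixture = {"schema_version": 0}  # wrong type, and "objects" is missing
```

The important part is the negative case: a validator that never fails would also pass the positive fixture, so the broken fixture is what proves the gate works.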
Track 1
- S0-T1a: Scaffold TradesparkCapture: ARKit session bootstrap, device capability gate (LiDAR check), empty capture screen. Test (unit): the app builds; on a LiDAR device the session starts and reports sceneReconstruction supported; on an unsupported device, a clean fallback message is shown.
Track 2
- S0-T2a: Scaffold the dev/ml/perception/ harness skeleton: takes a video path, loads a no-op model, writes an empty scene_facts.json that validates against the contract. Test (integration): a harness run on a stock clip produces a schema-valid (empty) scene_facts.json in jh S3.
Track 3
- S0-T3a: Scaffold dev/ml/reasoning/: a runner that loads a hand-authored data pile + a question set and calls a stub LLM. Test (unit): the runner loads the data pile, validates it against the KB contract, and emits a results file.
Stage 1 — Vertical slices
Milestone: each track produces a real artifact from real-ish input, independently. Payoff: a captured bundle on disk, model outputs from a stock room video, and a measurable grounded-vs-baseline reasoning result — three working slices before anything is integrated.
Track 1 — capture records locally
- S1-T1a: Record the full bundle locally: HEVC RGB, 16-bit packed depth, per-frame intrinsics + pose track, points, plane/mesh anchors, CapturedRoom, keyframes, manifest.json (world-frame anchor). Test (integration): an offline script ingests the bundle, validates it against the capture-bundle contract, and reprojects depth + pose into a single registered point cloud with reprojection error under threshold.
- S1-T1b: On-device live coverage cue from RoomPlan completedEdges + ARMesh coverage. Test (unit): a synthetic scan with a deliberately skipped wall raises a "missing surface" cue; a complete scan does not.
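The reprojection step in the S1-T1a test reduces to two operations per depth pixel: unproject it through the camera intrinsics, then move it into the world frame with the frame's pose. The sketch below shows that math in plain Python. It assumes the standard pinhole convention (+z forward, image y pointing down) and a pose given as a 3x3 rotation plus a translation; ARKit's camera axes follow a different convention, so a real ingest script would need an axis conversion first. The function names are hypothetical.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def unproject(u: float, v: float, depth_m: float,
              fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Pixel (u, v) with metric depth -> 3D point in the camera frame."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def to_world(point: Vec3,
             rotation: Sequence[Sequence[float]],
             translation: Vec3) -> Vec3:
    """Apply a camera-to-world pose: world = R * point + t."""
    return tuple(
        sum(rotation[row][col] * point[col] for col in range(3)) + translation[row]
        for row in range(3)
    )


IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))

# A pixel at the principal point lies on the optical axis, so it maps to
# (0, 0, depth) in the camera frame regardless of focal length.
center = unproject(320.0, 240.0, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
shifted = to_world(center, IDENTITY, (1.0, 0.0, 0.0))
```

Running every depth pixel of every frame through these two functions yields the registered cloud; the reprojection error in the test is then a distance between points that should coincide across frames.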
Track 2 — models run on stock video
- S1-T2a: Integrate SAM2 in the harness: masks + track IDs across keyframes from a stock room video. Test (eval): masks produced for ≥N objects; track IDs stable across a held-out clip above an IoU/consistency threshold.
- S1-T2b: Integrate Cosmos 3 reasoner (NIM on Modal jh): captions + scene Q&A on keyframes. Test (eval): on a fixed question set, answers are recorded with a hallucination-rate score; latency + cost per clip logged.
- S1-T2c: Merge model outputs into a populated scene_facts.json. Test (integration): output validates against the scene_facts contract and references real mask/keyframe URIs.
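One way to turn the S1-T2a "IoU/consistency threshold" into a number is to compare each predicted track's mask against a reference mask for the same track ID, frame by frame, and average the IoU. The sketch below does that, assuming the held-out clip has reference masks keyed by track ID (an assumption; the plan does not say how the held-out clip is labeled). Masks are represented as sets of pixel coordinates to keep the example dependency-free.

```python
from typing import Dict, List, Set, Tuple

Pixel = Tuple[int, int]
Mask = Set[Pixel]
Frame = Dict[int, Mask]  # track ID -> mask for one keyframe


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two pixel sets (1.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def track_consistency(predicted: List[Frame], reference: List[Frame]) -> float:
    """Mean IoU over every (frame, track) in the reference.

    A track that is missing from the prediction scores 0 for that frame,
    so dropped or swapped IDs pull the average down.
    """
    scores = []
    for pred_frame, ref_frame in zip(predicted, reference):
        for track_id, ref_mask in ref_frame.items():
            pred_mask = pred_frame.get(track_id)
            scores.append(0.0 if pred_mask is None else mask_iou(pred_mask, ref_mask))
    return sum(scores) / len(scores) if scores else 0.0
```

The eval gate is then a single comparison of track_consistency(...) against the recorded threshold.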
Track 3 — grounded prompting on mock data
- S1-T3a: Build the A/B eval: the same question set answered (a) from video/keyframes only vs (b) from video/keyframes plus the hand-authored data pile. Test (eval): the harness produces a scored comparison; the grounded condition beats video-only on a metric-question subset (e.g., "how wide is the doorway") by a recorded margin.
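The scored comparison in S1-T3a can be as simple as a tolerance-based accuracy per condition and the difference between the two. The sketch below assumes metric answers have already been parsed into metres and that a fixed absolute tolerance defines "correct"; both are assumptions, and every number in it is a placeholder for illustration, not a measured result.

```python
from typing import Dict


def accuracy(answers: Dict[str, float], truths: Dict[str, float],
             tolerance_m: float) -> float:
    """Fraction of questions answered within tolerance of the ground truth.

    A question with no answer counts as wrong.
    """
    correct = sum(
        1 for question, truth in truths.items()
        if question in answers and abs(answers[question] - truth) <= tolerance_m
    )
    return correct / len(truths)


# Placeholder values only, in metres.
truths = {"doorway_width": 0.81, "ceiling_height": 2.44}
video_only = {"doorway_width": 0.90, "ceiling_height": 2.45}
grounded = {"doorway_width": 0.81, "ceiling_height": 2.44}

baseline_acc = accuracy(video_only, truths, tolerance_m=0.02)
grounded_acc = accuracy(grounded, truths, tolerance_m=0.02)
margin = grounded_acc - baseline_acc  # the "recorded margin" in the test
```

Recording both accuracies alongside the margin, not the margin alone, makes it possible to tell later whether a change came from the grounded condition improving or the baseline degrading.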
Stage 2 — Real capture → upload → facts → answers
Milestone: scan a real room on device, it uploads to jh, the pipeline produces scene_facts + a queryable data pile, and grounded reasoning answers metric questions correctly. Payoff: the first true end-to-end demo — point the phone at a room, then ask questions about it and get grounded answers.
Track 1 — reliable upload
- S2-T1a: Resumable background upload (URLSession multipart, wifi-preferred, checksummed) → jh S3 with the manifest. Test (integration): a bundle upload interrupted by a forced network drop resumes and completes; checksum matches; object lands at the contracted S3 prefix.
- S2-T1b: Upload-complete trigger → enqueue pipeline. Test (integration): a completed upload posts an event that lands a job on the jh queue.
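The S2-T1a integration test needs two checks on the receiving side: which parts are still missing after an interruption, and whether the final object matches the checksum in the manifest. The sketch below shows both with the standard library. It assumes SHA-256 as the checksum algorithm and 1-based part numbers; neither is specified in the plan, and the helper names are hypothetical.

```python
import hashlib
import io
from typing import BinaryIO, List, Set


def sha256_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash a stream in fixed-size chunks so large bundles never sit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def remaining_parts(total_parts: int, uploaded: Set[int]) -> List[int]:
    """Part numbers (1-based) that still need to be sent after a resume."""
    return [part for part in range(1, total_parts + 1) if part not in uploaded]


# Chunked hashing gives the same digest as hashing the bytes in one call.
payload = b"capture-bundle bytes"
streamed = sha256_stream(io.BytesIO(payload), chunk_size=4)
```

In the interrupted-upload test, the assertion would be that remaining_parts(...) is empty after the resume and that the streamed digest of the landed object equals the one recorded in the manifest.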
Track 2 — pipeline on real bundles
- S2-T2a: Pipeline consumes a real bundle from S3: SAM2 + Cosmos + depth/pose fusion → scene_facts.json + world point cloud. Test (integration): a real captured bundle yields a schema-valid scene_facts.json and a registered cloud in jh S3; geometry aligns with the bundle's RoomPlan model within tolerance.
- S2-T2b: Conditional Sapiens2: run only when a person is detected; else skip. Test (unit): a clip with a person triggers Sapiens2 outputs; an empty-room clip skips it (verified in the run log).
- S2-T2c: Capability/quality/latency/cost matrix auto-generated per run. Test (eval): a report artifact is produced summarizing each model's outputs, latency, and cost for the run.
Track 3 — grounding on generated facts
- S2-T3a: Auto-generate the data pile from real scene_facts. Test (integration): the generated data pile validates against the KB contract and includes measurements traceable to scene_facts.
- S2-T3b: Grounded reasoning session over the generated pile; verify deferral to supplied numbers. Test (eval): on metric questions, the model returns the supplied measured value (within rounding) rather than a guess, at a recorded accuracy.
Stage 3 — Reconstruction quality, batch enrichment, knowledge base
Milestone: a full-room capture yields polished 3D (gaussian splat + optimized mesh that corrects RoomPlan's misses) and a progressive markdown knowledge base supporting accurate grounded Q&A. Payoff: the headline result — a navigable 3D model + a queryable "room manual" generated from one scan.
Track 1 — capture fidelity & merge
- S3-T1a: Depth/mesh fidelity pass + multi-pass CapturedStructure merge. Test (integration): two overlapping scans merge into one consistent structure; merged coverage exceeds either single scan on a coverage metric.
- S3-T1b: Tune depth encoding for fusion quality. Test (eval): the fused cloud from the tuned encoding beats the Stage-2 baseline on a reconstruction-error metric.
Track 2 — batch reconstruction
- S3-T2a: 3D Gaussian Splatting batch job (pose-known, no SfM) → .ksplat in S3. Test (eval): the splat renders from held-out poses with PSNR/SSIM above threshold.
- S3-T2b: Mesh optimization: TSDF/Poisson fuse + simplify; RANSAC plane refit + corner recovery over the RoomPlan prior. Test (eval): on a room with a known missed corner, the refined model recovers it; wall-dimension error beats raw RoomPlan (target: tighter than RoomPlan's ~±5 cm) on a measured-tape ground truth.
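For the S3-T2a gate, PSNR between a rendered view and the held-out reference frame follows directly from the mean squared error: PSNR = 10 · log10(MAX² / MSE). The sketch below computes it over flat sequences of pixel values, assuming 8-bit images (MAX = 255) by default. SSIM is more involved and is left to an image library.

```python
import math
from typing import Sequence


def psnr(rendered: Sequence[float], reference: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    if len(rendered) != len(reference) or not rendered:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(rendered)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because identical images return infinity, a threshold check of the form psnr(...) >= threshold still behaves correctly in that edge case.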
Track 3 — progressive KB + RAG decision
- S3-T3a: Generate the progressive markdown library (index → rooms → surfaces → tasks) with front-matter + S3 URIs. Test (integration): the KB builds, TOC/index resolve, and every doc validates against the KB contract.
- S3-T3b: Full grounded Q&A over the KB on a held-out question set; make the RAG-now-vs-later call with data. Test (eval): held-out Q&A accuracy above threshold; a recorded measurement of context size vs window that justifies the in-context-vs-RAG decision.
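The "context size vs window" measurement in S3-T3b can start as a rough estimate before a real tokenizer is wired in. The sketch below sums an approximate token count over the KB documents and compares it with a budget. The 4-characters-per-token ratio is a crude heuristic, and the window and reserve sizes are caller-supplied assumptions; a real measurement should use the target model's own tokenizer.

```python
import math
from typing import Dict


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; a heuristic, not a tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def context_report(docs: Dict[str, str], window_tokens: int,
                   reserve_tokens: int = 0) -> Dict[str, object]:
    """Compare the whole KB against the window minus a reserve.

    The reserve covers the question, instructions, and the answer.
    """
    total = sum(estimate_tokens(body) for body in docs.values())
    budget = window_tokens - reserve_tokens
    return {
        "kb_tokens": total,
        "budget_tokens": budget,
        "fits_in_context": total <= budget,
        "utilization": total / budget,
    }


# Placeholder documents and sizes, for illustration only.
report = context_report(
    {"index.md": "a" * 400, "rooms/kitchen.md": "b" * 800},
    window_tokens=1000,
    reserve_tokens=200,
)
```

Storing this report per run gives the trend needed for the RAG decision: if utilization climbs toward 1.0 as rooms are added, retrieval becomes necessary.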
Stage 4 — Hardening & optional live (deferred)
Milestone: production-readiness — performance, privacy/security, observability — plus the optional thin live-preview path if an interactive feature is greenlit. Payoff: a deployable, monitored system safe to put in front of real users' homes.
- S4-T1a: Device perf/thermal budget for capture + encode + record. Test (eval): a sustained capture stays within a defined thermal/throttle envelope on target devices.
- S4-SHa: Privacy/security: encryption at rest (S3 SSE), per-tenant isolation, raw-video retention policy. Test (integration + review): a security-review checklist passes; cross-tenant access is denied in an automated test.
- S4-SHb: Observability: per-session pipeline state + stage timings surfaced to the app. Test (smoke): the jh dashboard shows a session progressing through stages with "ready" states.
- S4-T2a (optional): Thin live-preview path (sampled keyframes + RoomPlan parametric updates only). Test (integration): a low-rate preview reaches the server during capture while the full bundle still records locally and uploads after.
How to drive this with Claude Code
- Point Claude Code at the relevant repo (TradesparkCapture or grizzlebear on jh).
- Give it one task ID at a time with its acceptance test as the definition of done.
- Require the paired test to be written/automated before the task is marked complete (test-first where practical).
- For eval tasks, the threshold + the recorded score are the gate; store results so regressions are visible across stages.
- Keep the three contracts as the source of truth; any task that needs a schema change must bump the contract version and update fixtures first.
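The eval gate described above has two parts: a pass/fail check against the threshold, and a stored history that makes regressions visible. The sketch below shows one possible record shape and a regression scan. It assumes higher scores are better and that records are kept in chronological order; the class name, field names, and example scores are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Tuple


@dataclass
class EvalRecord:
    task_id: str
    metric: str
    score: float
    threshold: float

    @property
    def passed(self) -> bool:
        # Assumes higher is better; invert for error-style metrics.
        return self.score >= self.threshold

    def to_json_line(self) -> str:
        """One line of a JSON Lines results file."""
        return json.dumps(asdict(self), sort_keys=True)


def find_regressions(history: List[EvalRecord]) -> List[EvalRecord]:
    """Records whose score dropped versus the previous run of the same task and metric."""
    last: Dict[Tuple[str, str], EvalRecord] = {}
    dropped = []
    for record in history:
        key = (record.task_id, record.metric)
        if key in last and record.score < last[key].score:
            dropped.append(record)
        last[key] = record
    return dropped


# Placeholder history, for illustration only.
history = [
    EvalRecord("S1-T3a", "metric_accuracy", 0.80, 0.70),
    EvalRecord("S1-T3a", "metric_accuracy", 0.75, 0.70),
]
```

Note that the second record still passes its threshold yet is flagged as a regression; keeping the two checks separate is what surfaces slow quality drift across stages.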
The sequencing discipline from the tracks doc holds: Stage 0–1 prove each slice on stub/stock inputs, Stage 2 makes it real end-to-end, Stage 3 delivers the headline 3D + KB payoff, Stage 4 hardens. Mock data keeps Tracks 2 and 3 unblocked so nothing waits on the capture app being finished.