
Development Plan — Task/Test Pairs

Staged plan across the three tracks, written for execution with Claude Code: each task is a self-contained unit of work, and its paired test is the acceptance gate — Claude Code is "done" with a task only when the test passes. Every stage ends in a milestone with a concrete, demonstrable payoff.

Conventions

  • Track 1 (Capture) → iOS app ~/gitlab/sparkyunity/iOS/TradesparkCapture
  • Track 2 (Perception) and Track 3 (Reasoning) → backend ~/gitlab/grizzlebear/dev/ml/...
  • Backend branch: all work on jh; deploy + test to the jh environment.
  • iOS branch: feature branches off the TradesparkCapture mainline (suggest jh/<feature>), squash-merge per stage.
  • Contracts first. The three schemas from the tracks doc — capture-bundle, scene_facts.json, data-pile/KB — are versioned specs in grizzlebear/dev/ml/contracts/. Every task codes against them; changing a contract is its own task with a version bump.
  • Test types: unit (logic), integration (cross-component, in jh), eval (quality/accuracy on a fixed set), smoke (deploy/endpoint health). Each task names which.
  • Task ID: S{stage}-{track}{n}, where {n} is a letter suffix, e.g. S1-T2a. SH = shared/cross-track; the Stage 0 shared tasks are written SH-0a and SH-0b.

A task is complete when its code is merged to the right branch, its test is automated and passing in CI (or the eval threshold is met and recorded), and the artifact is where the contract says it should be.
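
As a rough illustration of the task ID convention, the sketch below parses IDs into stage, track, and suffix so that tooling (CI job names, eval result files) can key off them. The `TaskId` type and `parse_task_id` helper are hypothetical names, not part of either repo. The second pattern exists only because the Stage 0 shared tasks in this plan are written `SH-0a` rather than `S0-SHa`.

```python
import re
from typing import NamedTuple


class TaskId(NamedTuple):
    stage: int
    track: str   # "T1", "T2", "T3", or "SH"
    suffix: str  # letter distinguishing tasks within one stage and track


# Standard form, e.g. "S1-T2a" or "S4-SHa".
_STANDARD = re.compile(r"^S(?P<stage>\d+)-(?P<track>T[1-3]|SH)(?P<suffix>[a-z])$")
# Stage 0 shared form as written in this plan, e.g. "SH-0a".
_SHARED = re.compile(r"^(?P<track>SH)-(?P<stage>\d+)(?P<suffix>[a-z])$")


def parse_task_id(text: str) -> TaskId:
    """Parse a task ID, raising ValueError if it matches neither form."""
    for pattern in (_STANDARD, _SHARED):
        match = pattern.match(text)
        if match:
            return TaskId(int(match["stage"]), match["track"], match["suffix"])
    raise ValueError(f"not a task ID: {text!r}")
```

Rejecting anything that does not match exactly is deliberate: an ID fused with its description should fail loudly rather than be silently truncated.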


Stage 0 — Foundations & Contracts

Milestone: the three contracts are frozen at v0, both repos build in CI, the jh backend env is reachable, and a stubbed bundle flows end-to-end. Payoff: every downstream task has something concrete to code against, and "hello world" runs in jh.

Shared

  • SH-0a — Author v0 contracts (capture-bundle, scene_facts.json, data-pile) in dev/ml/contracts/ with JSON Schema + example fixtures. Test (unit): a schema-validation script passes on the example fixtures and fails on a deliberately broken one.
  • SH-0b — Stand up jh backend env + CI on the jh branch (lint, test, deploy hooks). Test (smoke): CI green on jh; a /health endpoint returns 200 in the jh env.
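
The SH-0a acceptance test has a simple shape: the validator must accept the example fixture and reject a deliberately broken one. The sketch below shows that shape with a stdlib-only stand-in that checks only required keys and their types. It is not a JSON Schema validator, and the key names (`contract_version`, `objects`) are made up for illustration; the real script would validate against the schemas in dev/ml/contracts/.

```python
import json

# Hypothetical, heavily reduced stand-in for a v0 contract:
# required top-level keys mapped to the Python type each must decode to.
SCENE_FACTS_V0 = {
    "contract_version": str,
    "objects": list,
}


def validate(doc: dict, required: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for key, expected in required.items():
        if key not in doc:
            problems.append(f"missing key: {key}")
        elif not isinstance(doc[key], expected):
            problems.append(f"{key}: expected {expected.__name__}")
    return problems


good_fixture = json.loads('{"contract_version": "0", "objects": []}')
broken_fixture = json.loads('{"contract_version": 0}')  # wrong type, missing key

good_problems = validate(good_fixture, SCENE_FACTS_V0)
broken_problems = validate(broken_fixture, SCENE_FACTS_V0)
```

The broken fixture matters as much as the good one: a validator that accepts everything would also pass the first half of the test.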

Track 1

  • S0-T1a — Scaffold TradesparkCapture: ARKit session bootstrap, device capability gate (LiDAR check), empty capture screen. Test (unit): app builds; on a LiDAR device the session starts and reports sceneReconstruction supported; on an unsupported device the app shows a clean fallback message.

Track 2

  • S0-T2a — Scaffold dev/ml/perception/ harness skeleton: takes a video path, loads a no-op model, writes an empty scene_facts.json that validates against the contract. Test (integration): harness run on a stock clip produces a schema-valid (empty) scene_facts.json in jh S3.

Track 3

  • S0-T3a — Scaffold dev/ml/reasoning/: a runner that loads a hand-authored data pile + a question set and calls a stub LLM. Test (unit): runner loads the data pile, validates it against the KB contract, emits a results file.

Stage 1 — Vertical slices

Milestone: each track produces a real artifact from real-ish input, independently. Payoff: a captured bundle on disk, model outputs from a stock room video, and a measurable grounded-vs-baseline reasoning result — three working slices before anything is integrated.

Track 1 — capture records locally

  • S1-T1a — Record the full bundle locally: HEVC RGB, 16-bit packed depth, per-frame intrinsics+pose track, points, plane/mesh anchors, CapturedRoom, keyframes, manifest.json (world-frame anchor). Test (integration): an offline script ingests the bundle, validates it against the capture-bundle contract, and reprojects depth+pose into a single registered point cloud with reprojection error under threshold.
  • S1-T1b — On-device live coverage cue from RoomPlan completedEdges + ARMesh coverage. Test (unit): a synthetic scan with a deliberately skipped wall raises a "missing surface" cue; a complete scan does not.
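
The S1-T1a test depends on lifting each depth pixel into the world frame using the per-frame intrinsics and pose. A minimal sketch of that step follows, assuming a standard pinhole model with +Z forward and a camera-to-world pose given as a 3x3 rotation plus a translation. Both conventions are assumptions; a real ingest script would follow whatever axis and pose conventions the capture-bundle contract specifies.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def backproject(u: float, v: float, depth_m: float,
                fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Pixel (u, v) with metric depth -> 3D point in the camera frame."""
    return ((u - cx) * depth_m / fx, (v - cy) * depth_m / fy, depth_m)


def to_world(p: Vec3, rotation: Sequence[Sequence[float]], translation: Vec3) -> Vec3:
    """Apply a camera-to-world pose: world = R @ p + t."""
    return tuple(
        sum(rotation[r][c] * p[c] for c in range(3)) + translation[r]
        for r in range(3)
    )


# Illustrative values only: the principal-point pixel at 2 m depth,
# seen from a camera sitting 1 m along world X with no rotation.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
point_cam = backproject(320, 240, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
point_world = to_world(point_cam, identity, (1.0, 0.0, 0.0))
```

Running this over every valid depth pixel in every frame and concatenating the results gives the single registered cloud the test checks.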

Track 2 — models run on stock video

  • S1-T2a — Integrate SAM2 in the harness: masks + track IDs across keyframes from a stock room video. Test (eval): masks produced for ≥N objects; track IDs stable across a held-out clip above an IoU/consistency threshold.
  • S1-T2b — Integrate Cosmos 3 reasoner (NIM on Modal jh): captions + scene Q&A on keyframes. Test (eval): on a fixed question set, answers are recorded with a hallucination-rate score; latency + cost per clip logged.
  • S1-T2c — Merge model outputs into a populated scene_facts.json. Test (integration): output validates against the scene_facts contract and references real mask/keyframe URIs.
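
For the S1-T2a eval, one possible stability proxy is the IoU between consecutive masks that share a track ID. The sketch below uses plain nested lists so it stays dependency-free. Treating consecutive-frame IoU as the consistency metric is an assumption, and the threshold is left to the eval definition; camera motion between keyframes may call for a different measure.

```python
def mask_iou(a, b) -> float:
    """IoU of two binary masks given as equally sized nested lists of 0/1."""
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    # Two empty masks are treated as identical.
    return inter / union if union else 1.0


def track_is_stable(masks_by_frame, min_iou: float) -> bool:
    """True if every consecutive pair of masks for one track meets min_iou."""
    pairs = zip(masks_by_frame, masks_by_frame[1:])
    return all(mask_iou(m1, m2) >= min_iou for m1, m2 in pairs)
```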

Track 3 — grounded prompting on mock data

  • S1-T3a — Build the A/B eval: same question set answered (a) video/keyframes only vs (b) + hand-authored data pile. Test (eval): the harness produces a scored comparison; grounded condition beats video-only on a metric-question subset (e.g., "how wide is the doorway") by a recorded margin.
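
The scored comparison in S1-T3a could be as simple as accuracy-within-tolerance per condition, with the margin being the difference. The sketch below assumes answers have already been reduced to numbers in metres. The question keys, the values, and the 2 cm tolerance are invented for illustration and are not results.

```python
def accuracy(answers: dict, truth: dict, tol_m: float) -> float:
    """Fraction of questions answered within tol_m of ground truth."""
    hits = sum(
        1
        for question, expected in truth.items()
        if question in answers and abs(answers[question] - expected) <= tol_m
    )
    return hits / len(truth)


def grounded_margin(video_only: dict, grounded: dict, truth: dict,
                    tol_m: float = 0.02) -> float:
    """Accuracy of the grounded condition minus the video-only condition."""
    return accuracy(grounded, truth, tol_m) - accuracy(video_only, truth, tol_m)


# Made-up example values, in metres.
truth = {"doorway_width": 0.81, "ceiling_height": 2.44}
video_only = {"doorway_width": 0.90, "ceiling_height": 2.44}
grounded = {"doorway_width": 0.81, "ceiling_height": 2.45}
margin = grounded_margin(video_only, grounded, truth)
```

Unanswered questions count as misses, so a condition cannot improve its score by declining to answer.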

Stage 2 — Real capture → upload → facts → answers

Milestone: scan a real room on device, it uploads to jh, the pipeline produces scene_facts + a queryable data pile, and grounded reasoning answers metric questions correctly. Payoff: the first true end-to-end demo — point the phone at a room, then ask questions about it and get grounded answers.

Track 1 — reliable upload

  • S2-T1a — Resumable background upload (URLSession multipart, wifi-preferred, checksummed) → jh S3 with the manifest. Test (integration): a bundle upload interrupted by a forced network drop resumes and completes; checksum matches; object lands at the contracted S3 prefix.
  • S2-T1b — Upload-complete trigger → enqueue pipeline. Test (integration): completed upload posts an event that lands a job on the jh queue.
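
On the backend side of S2-T1a, the "checksum matches" check could look like the sketch below, which hashes the landed object in chunks so large bundles do not need to fit in memory. SHA-256 and the idea that the manifest carries a hex digest are assumptions; the plan does not name the checksum algorithm.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunk_size pieces."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def upload_is_intact(local_path: str, manifest_sha256: str) -> bool:
    """Compare the file's digest with the (hypothetical) manifest value."""
    return sha256_of_file(local_path) == manifest_sha256.lower()
```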

Track 2 — pipeline on real bundles

  • S2-T2a — Pipeline consumes a real bundle from S3: SAM2 + Cosmos + depth/pose fusion → scene_facts.json + world point cloud. Test (integration): a real captured bundle yields a schema-valid scene_facts.json and a registered cloud in jh S3; geometry aligns with the bundle's RoomPlan model within tolerance.
  • S2-T2b — Conditional Sapiens2: run only when a person is detected; else skip. Test (unit): a clip with a person triggers Sapiens2 outputs; an empty-room clip skips it (verified in the run log).
  • S2-T2c — Capability/quality/latency/cost matrix auto-generated per run. Test (eval): a report artifact is produced summarizing each model's outputs, latency, and cost for the run.
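
S2-T2b and S2-T2c fit together naturally: if every stage goes through one wrapper that records whether it ran and how long it took, the skip is verifiable in the run log and the latency column of the matrix comes for free. The sketch below is one way to do that. `run_stage` is a hypothetical helper, and the lambdas are stubs standing in for the real model calls.

```python
import time


def run_stage(report: list, name: str, fn, enabled: bool = True):
    """Run one pipeline stage if enabled and append a row to the run report."""
    if not enabled:
        report.append({"stage": name, "status": "skipped", "latency_s": 0.0})
        return None
    start = time.perf_counter()
    result = fn()
    report.append({
        "stage": name,
        "status": "ran",
        "latency_s": time.perf_counter() - start,
    })
    return result


# Stub run on an empty-room clip: no "person" label, so Sapiens2 is skipped.
report = []
labels = run_stage(report, "sam2", lambda: ["chair", "table"])
person_present = "person" in labels
run_stage(report, "sapiens2", lambda: {"poses": []}, enabled=person_present)
```

Cost per stage could be added as another field on the same row once a pricing source is decided.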

Track 3 — grounding on generated facts

  • S2-T3a — Auto-generate the data pile from real scene_facts. Test (integration): generated data pile validates against the KB contract and includes measurements traceable to scene_facts.
  • S2-T3b — Grounded reasoning session over the generated pile; verify deferral to supplied numbers. Test (eval): on metric questions, the model returns the supplied measured value (within rounding) rather than a guess, at a recorded accuracy.
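
The deferral check in S2-T3b can be approximated by pulling the numbers out of the model's answer and comparing them with the supplied measurement after rounding. This is a sketch under simplifying assumptions: the answer and the supplied value use the same unit, and two decimal places is the rounding that counts. Neither is specified in the plan.

```python
import re

# Matches unsigned integers and decimals such as "2" or "0.81".
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def defers_to_supplied(answer: str, supplied: float, decimals: int = 2) -> bool:
    """True if any number in the answer equals the supplied value after rounding."""
    target = round(supplied, decimals)
    return any(
        round(float(token), decimals) == target
        for token in _NUMBER.findall(answer)
    )
```

For example, `defers_to_supplied("The doorway is about 0.81 m wide.", 0.812)` is true, while an answer of "Roughly 0.9 m." for the same supplied value is not.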

Stage 3 — Reconstruction quality, batch enrichment, knowledge base

Milestone: a full-room capture yields polished 3D (gaussian splat + optimized mesh that corrects RoomPlan's misses) and a progressive markdown knowledge base supporting accurate grounded Q&A. Payoff: the headline result — a navigable 3D model + a queryable "room manual" generated from one scan.

Track 1 — capture fidelity & merge

  • S3-T1a — Depth/mesh fidelity pass + multi-pass CapturedStructure merge. Test (integration): two overlapping scans merge into one consistent structure; merged coverage exceeds either single scan on a coverage metric.
  • S3-T1b — Tune depth encoding for fusion quality. Test (eval): fused cloud from tuned encoding beats the Stage-2 baseline on a reconstruction-error metric.

Track 2 — batch reconstruction

  • S3-T2a — 3D Gaussian Splatting batch job (pose-known, no SfM) → .ksplat in S3. Test (eval): splat renders from held-out poses with PSNR/SSIM above threshold.
  • S3-T2b — Mesh optimization: TSDF/Poisson fuse + simplify; RANSAC plane refit + corner recovery over the RoomPlan prior. Test (eval): on a room with a known missed corner, the refined model recovers it; wall-dimension error beats raw RoomPlan (target: tighter than RoomPlan's ~±5 cm) on a measured-tape ground truth.
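
For reference, the PSNR half of the S3-T2a gate is a short computation: the mean squared error between the rendered and held-out images, expressed in decibels relative to the peak pixel value. The sketch below works on flat sequences of pixel values to stay dependency-free. The 8-bit peak of 255 is an assumption about the image format.

```python
import math


def psnr(rendered, reference, max_value: float = 255.0) -> float:
    """PSNR in dB between two equally sized flat sequences of pixel values."""
    if len(rendered) != len(reference) or len(rendered) == 0:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(rendered)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher is better, so the eval threshold is a lower bound. SSIM is more involved and is better taken from an established image library than reimplemented.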

Track 3 — progressive KB + RAG decision

  • S3-T3a — Generate the progressive markdown library (index → rooms → surfaces → tasks) with front-matter + S3 URIs. Test (integration): KB builds, TOC/index resolve, every doc validates against the KB contract.
  • S3-T3b — Full grounded Q&A over the KB on a held-out question set; make the RAG-now-vs-later call with data. Test (eval): held-out Q&A accuracy above threshold; a recorded measurement of context size vs window that justifies the in-context-vs-RAG decision.
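
The "context size vs window" measurement in S3-T3b could start from something as small as the sketch below: estimate the KB's token count and compare it with the window minus a reserve for the question and answer. The four-characters-per-token ratio is a rough heuristic used only to keep the example self-contained. A recorded measurement should use the target model's own tokenizer, and the window and reserve sizes are inputs, not known values.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate; replace with the real tokenizer for recorded numbers."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(docs, window_tokens: int, reserve_tokens: int) -> dict:
    """Compare the estimated KB size with the usable part of the context window."""
    total = sum(estimate_tokens(doc) for doc in docs)
    budget = window_tokens - reserve_tokens
    return {"kb_tokens": total, "budget_tokens": budget, "fits": total <= budget}
```

If the KB fits with headroom, in-context grounding remains viable. If it does not, that is the data point arguing for RAG.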

Stage 4 — Hardening & optional live (deferred)

Milestone: production-readiness — performance, privacy/security, observability — plus the optional thin live-preview path if an interactive feature is greenlit. Payoff: a deployable, monitored system safe to put in front of real users' homes.

  • S4-T1a — Device perf/thermal budget for capture+encode+record. Test (eval): a sustained capture stays within a defined thermal/throttle envelope on target devices.
  • S4-SHa — Privacy/security: encryption at rest (S3 SSE), per-tenant isolation, raw-video retention policy. Test (integration + review): a security-review checklist passes; cross-tenant access is denied in an automated test.
  • S4-SHb — Observability: per-session pipeline state + stage timings surfaced to the app. Test (smoke): jh dashboard shows a session progressing through stages with "ready" states.
  • S4-T2a (optional) — Thin live-preview path (sampled keyframes + RoomPlan parametric updates only). Test (integration): a low-rate preview reaches the server during capture while the full bundle still records locally and uploads after.

How to drive this with Claude Code

  1. Point Claude Code at the relevant repo (TradesparkCapture or grizzlebear on jh).
  2. Give it one task ID at a time with its acceptance test as the definition of done.
  3. Require the paired test to be written/automated before the task is marked complete (test-first where practical).
  4. For eval tasks, the threshold + the recorded score are the gate — store results so regressions are visible across stages.
  5. Keep the three contracts as the source of truth; any task that needs a schema change must bump the contract version and update fixtures first.
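
Step 4 implies two checks per eval run: does the score clear the threshold, and is it worse than last time. A minimal sketch of a gate that stores results and reports both follows. It assumes higher scores are better and that a local JSON file is an acceptable history store; `gate_eval` and the record layout are hypothetical.

```python
import json


def gate_eval(task_id: str, score: float, threshold: float, history_path: str) -> dict:
    """Append the score to a JSON history file and report pass/regression."""
    try:
        with open(history_path) as f:
            history = json.load(f)
    except FileNotFoundError:
        history = []  # first recorded run

    previous = [row["score"] for row in history if row["task_id"] == task_id]
    history.append({"task_id": task_id, "score": score, "threshold": threshold})
    with open(history_path, "w") as f:
        json.dump(history, f, indent=2)

    return {
        "passed": score >= threshold,
        # Regression means lower than the most recent recorded score for this task.
        "regressed": bool(previous) and score < previous[-1],
    }
```

A run can pass the threshold and still be flagged as a regression, which is the case step 4 is meant to make visible.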

The sequencing discipline from the tracks doc holds: Stage 0–1 prove each slice on stub/stock inputs, Stage 2 makes it real end-to-end, Stage 3 delivers the headline 3D + KB payoff, Stage 4 hardens. Mock data keeps Tracks 2 and 3 unblocked so nothing waits on the capture app being finished.