Sapiens2 + Cosmos 3 — Summary & Comparison for a Video-Streaming Vision App
Prepared June 1, 2026. Companion to the Cosmos 3 study plan.
The target app: stream video into an endpoint, then surface reasoning / world understanding / object detection / measurement on the processed video. This doc summarizes Sapiens2, then compares it head-to-head with Cosmos 3 against those four capabilities, and proposes an architecture.
Sapiens2 — Summary
Meta AI released Sapiens2 on April 27, 2026 — the second generation of its human-centric vision foundation model. The one-sentence version: it's the best open model for precise, dense, pixel-level understanding of people, and it does nothing for non-human objects.
What it does (five tasks, one frozen backbone + lightweight heads):
- Pose estimation — 308-keypoint full-body skeleton, including a dense 243-keypoint face and 40-keypoint hands. This is far beyond typical 17-keypoint COCO pose.
- Body-part segmentation — 29 semantic classes (added eyeglasses), sharp boundaries.
- Pointmap estimation — per-pixel 3D point in the camera frame (XYZ). This is the "measuring" primitive — but note it's focal-length-normalized + a learned scale scalar, not absolute metric (see "Pointmap limitations" below). Better than relative depth (it's a consistent 3D shape), but not centimeters out of the box.
- Surface normal estimation — per-pixel unit normals; 5B hits 6.73° mean angular error (4K variant median just 3.08°).
- Albedo estimation — true diffuse color/skin tone independent of lighting.
Scale & training:
- Sizes: 0.4B, 0.8B, 1B, 5B params, native 1K resolution; hierarchical windowed-attention variants extend to 4K. The 5B is the highest-FLOPs ViT reported (15.7 TFLOPs).
- Trained on Humans-1B (1B curated human images, filtered from ~4B), pretrained with a joint MAE + DINOv3-style contrastive objective so it keeps both fine texture fidelity (for albedo/normals) and high-level semantics.
- Benchmarks: 82.3 mAP pose, 82.5 mIoU segmentation; the frozen 5B backbone beats even DINOv3-7B (a larger general-purpose model) across every human task.
Nature of the model — important for app design:
- It's an image / per-frame model (a ViT backbone with dense-prediction heads). There is no temporal/video module — for video you run it frame-by-frame (optionally with your own tracking/smoothing).
- It's a specialist: humans only. It is not a VLM, not a reasoner, not an open-vocabulary object detector. No text in, no language out.
- Open weights (facebook/sapiens2 on HF), code at facebookresearch/sapiens2, and a paper (links under Sources). The 0.4B is small enough for real-time per-frame inference on a single mid-range GPU; the 5B/4K variants are batch/offline-grade.
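Because Sapiens2 has no temporal module, per-frame keypoints will jitter unless you smooth them yourself. A minimal sketch of one option, an exponential moving average per tracked person, is below. The `(x, y, confidence)` keypoint layout, the `KeypointSmoother` class, and the `alpha` / `min_conf` defaults are illustrative assumptions, not part of the Sapiens2 release.

```python
from typing import List, Optional, Tuple

Keypoint = Tuple[float, float, float]  # (x, y, confidence); this layout is an assumption


class KeypointSmoother:
    """Exponential moving average over per-frame keypoints for one tracked person."""

    def __init__(self, alpha: float = 0.5, min_conf: float = 0.3) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha        # weight given to the newest frame
        self.min_conf = min_conf  # below this, hold the previous estimate
        self._state: Optional[List[Tuple[float, float]]] = None

    def update(self, keypoints: List[Keypoint]) -> List[Tuple[float, float]]:
        if self._state is None:
            # First frame: nothing to blend with yet.
            self._state = [(x, y) for x, y, _ in keypoints]
            return list(self._state)
        smoothed = []
        for (px, py), (x, y, conf) in zip(self._state, keypoints):
            if conf < self.min_conf:
                smoothed.append((px, py))  # low confidence: keep the last position
            else:
                a = self.alpha
                smoothed.append((a * x + (1 - a) * px, a * y + (1 - a) * py))
        self._state = smoothed
        return list(smoothed)
```

A lower `alpha` gives steadier overlays at the cost of lag. Keep one smoother per track ID so that identities from your tracker never share state.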
Pointmap limitations (important for "measuring")
Two distinct constraints often get conflated — both matter if you're relying on Sapiens2 for measurement:
- It's humans-only by training, not by runtime masking. The pointmap head emits a dense per-pixel 3D point over the whole image, but it was supervised entirely on synthetic human assets (the paper defines the pointmap/normal/albedo losses over human-form pixels). There is no segmentation gate inside the model, so on background and non-human objects the output is out-of-distribution and unreliable: the model still emits numbers there, but they are not meaningful. In practice you use the body-part segmentation head as a mask and keep the pointmap only where it says "human." That masking is your job downstream, not the model's.
- It's scale-normalized, not absolute metric. Because metric scale is ambiguous without known camera intrinsics, Sapiens2 predicts a focal-length-normalized pointmap P̃(u) plus a learned scalar s (P̂ = s·P̃), and is even evaluated "in focal-length normalized canonical coordinates." So you get a metrically consistent 3D shape of the person (correct proportions, limb ratios, surface geometry, self-consistent depth), but not guaranteed centimeters unless you supply known intrinsics or a reference scale.
Consequence: Sapiens2 is excellent for relative/structural human measurement (proportions, pose-derived dimensions, 3D surface, normals), masked to human pixels. It is not a general scene depth estimator and not an absolute-metric tape measure. For room/object measurement — or absolute units even on a person — add a dedicated metric-depth model (Depth-Anything-Metric, stereo/RGB-D) and/or camera calibration.
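The two constraints above map onto two small post-processing steps: mask the pointmap with the segmentation output, then apply a reference scale. The sketch below shows one way to do that, solving for your own scale factor in P̂ = s·P̃ from a single known length. It assumes label id 0 means "not human" and uses plain nested lists to stay dependency-free (in practice these would be NumPy or torch arrays). All function names are hypothetical.

```python
import math
from typing import List, Optional, Tuple

Point3 = Tuple[float, float, float]  # normalized XYZ from the pointmap head

BACKGROUND = 0  # ASSUMPTION: label id the segmentation head uses for "not human"


def mask_pointmap(
    pointmap: List[List[Point3]], seg: List[List[int]]
) -> List[List[Optional[Point3]]]:
    """Keep pointmap values only where the segmentation says "human"; else None."""
    return [
        [p if label != BACKGROUND else None for p, label in zip(p_row, s_row)]
        for p_row, s_row in zip(pointmap, seg)
    ]


def scale_from_reference(p_a: Point3, p_b: Point3, known_length_m: float) -> float:
    """Solve s in P_hat = s * P_tilde from one known real-world length (meters)."""
    normalized = math.dist(p_a, p_b)
    if normalized == 0.0:
        raise ValueError("reference points coincide; cannot recover scale")
    return known_length_m / normalized


def metric_distance(p_a: Point3, p_b: Point3, s: float) -> float:
    """Distance between two normalized points, converted with scale s."""
    return s * math.dist(p_a, p_b)
```

For example, if two shoulder keypoints are 5.0 units apart in normalized coordinates and the person's shoulder width is known to be 0.40 m, then s = 0.08 and every other human-pixel distance in that frame can be converted with the same factor. Without such a reference, report ratios rather than centimeters.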
How the two models actually differ
They're almost orthogonal — different jobs, not competitors.
| Dimension | Cosmos 3 | Sapiens2 |
|---|---|---|
| Origin / date | NVIDIA, Jun 1 2026 | Meta, Apr 27 2026 |
| Core identity | World foundation model: reasoning VLM + physics-aware video/action generator | Human-centric dense-prediction backbone |
| Input → output | text/image/video/audio/action → text (reasoning), video, action | image (per frame) → pose / segmentation / pointmap / normals / albedo |
| Subject scope | General scenes (robots, vehicles, warehouses, indoor spaces, objects, people) | People only |
| Reasoning / language | ✅ chain-of-thought, captioning, Q&A about a scene | ❌ none |
| World understanding | ✅ qualitative — causality, motion, "what's happening / what next" | ⚠️ only the human geometry within a frame |
| Object detection | ⚠️ can describe/localize loosely via VLM, but not a precise detector | ❌ not a detector (segments human parts only) |
| Measurement | ⚠️ qualitative spatial reasoning ("how far is X from Y"), not metric | ✅ 3D pointmap + normals — humans only, and scale-normalized (not absolute metric) |
| Temporal modeling | ✅ native video understanding & generation | ❌ per-frame; bring your own tracking |
| Sizes | 8B (Nano) / 32B (Super) | 0.4B / 0.8B / 1B / 5B (+4K) |
| Real-time latency | Reasoner is heavy → keyframes/clips, not every frame | 0.4B → genuine per-frame real-time |
| Deployment | NIM container (reasoner today) or HF Diffusers | HF weights + task heads (your own serving) |
Mapping to your four app capabilities
- Reasoning → Cosmos 3 reasoner. This is its home turf; Sapiens2 can't do it at all.
- World understanding → Cosmos 3 for the scene-level narrative; Sapiens2 for precise human-body understanding inside the scene. Complementary.
- Object detection → neither is the right tool. Cosmos can loosely describe objects; Sapiens2 only segments human parts. You need a dedicated detector/tracker: for example Grounding DINO for open-vocabulary boxes, YOLO for fast fixed-class detection, and SAM 2 for masks and tracking.
- Measuring → Sapiens2 for people (the pointmap is the standout feature — but it's scale-normalized human geometry, masked to human pixels, not absolute metric; see "Pointmap limitations"). For arbitrary objects, Sapiens2 doesn't apply and Cosmos is only qualitative — add a general metric-depth model (e.g., Depth-Anything-Metric or stereo/RGB-D), and supply camera intrinsics/a reference if you need absolute units anywhere.
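For the non-human half of the Measuring bullet, absolute units come from combining a metric depth value with camera intrinsics through the standard pinhole model: X = (u - cx)·Z/fx, Y = (v - cy)·Z/fy. The sketch below assumes the depth model returns Z-depth in meters along the optical axis (not ray length) and that lens distortion has already been removed. `Intrinsics`, `backproject`, and `distance_m` are hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Intrinsics:
    fx: float  # focal length in pixels (x)
    fy: float  # focal length in pixels (y)
    cx: float  # principal point, pixels
    cy: float


def backproject(u: float, v: float, depth_m: float, k: Intrinsics) -> Tuple[float, float, float]:
    """Pixel (u, v) with Z-depth in meters -> 3D point in the camera frame."""
    x = (u - k.cx) * depth_m / k.fx
    y = (v - k.cy) * depth_m / k.fy
    return (x, y, depth_m)


def distance_m(
    px_a: Tuple[float, float], depth_a: float,
    px_b: Tuple[float, float], depth_b: float,
    k: Intrinsics,
) -> float:
    """Euclidean distance in meters between two pixels with known depth."""
    return math.dist(backproject(*px_a, depth_a, k), backproject(*px_b, depth_b, k))
```

With fx = fy = 1000, two pixels 500 px apart horizontally at 2 m depth are 1 m apart in the scene. The same pixel gap at 4 m depth would be 2 m, which is why pixel distances alone cannot serve as measurements.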
The key takeaway: no single model covers all four. Cosmos 3 and Sapiens2 cover different quadrants well, and "object detection" + "general-object measurement" fall between them — so a production app is an ensemble, not a single endpoint.
Usefulness vs. usability for the streaming app
Usefulness (capability fit)
- Cosmos 3 is the most useful for the "reasoning / world understanding" half: it's the only one that can watch a clip and tell you what's happening and why. It also future-proofs you toward action/event prediction. But it's a generalist brain, weak on pixel precision and metric numbers.
- Sapiens2 is the most useful for the "measurement / human analytics" half: nothing open matches its pose/geometry precision, and the pointmap is close to what "measuring" wants — as long as the thing being measured is a person, you mask to human pixels, and you treat the output as scale-normalized geometry rather than absolute units.
- Object detection is a genuine gap in both; treat it as a third component from day one.
Usability (effort to ship)
- Cosmos 3 — higher ceiling, higher cost. The reasoner NIM is a one-command Docker container with an OpenAI-compatible API (very usable), but it's an 8B–32B model: heavy GPU, real latency. You won't run it on every frame — you sample keyframes or short clips. Generator NIM isn't out yet.
- Sapiens2 — lower friction for inference, more glue required. The 0.4B runs per-frame in real time on modest hardware, but Meta ships weights + task heads, not a packaged serving container — you build the endpoint, batching, and tracking yourself. Outputs are dense tensors (heatmaps, masks, pointmaps) you must post-process into overlays.
- Latency reality for streaming: run the cheap dense models (Sapiens2-0.4B, detector/tracker) on the per-frame hot path; run Cosmos reasoning asynchronously on sampled keyframes/short clips and overlay its narrative a beat behind the live geometry. Don't put a 32B VLM in the frame loop.
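The latency split above can be sketched with a frame loop that never waits on the reasoner. This is a minimal illustration using only `asyncio`: `reason_about` is a stand-in for the slow VLM call, the sleeps stand in for inference time, and every name here is hypothetical. A one-slot queue that drops the stale keyframe keeps the reasoner working on the most recent content.

```python
import asyncio
from typing import Dict, List, Optional


async def reason_about(frame_id: int) -> str:
    """Stand-in for a slow reasoner call (hypothetical; swap in your own client)."""
    await asyncio.sleep(0.05)
    return f"narrative@{frame_id}"


def offer_latest(queue: asyncio.Queue, frame_id: int) -> None:
    """Enqueue a keyframe without blocking, dropping a stale one if still waiting."""
    if queue.full():
        queue.get_nowait()
    queue.put_nowait(frame_id)


async def run_pipeline(num_frames: int, keyframe_every: int) -> List[Optional[str]]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=1)
    latest: Dict[str, Optional[str]] = {"narrative": None}
    overlays: List[Optional[str]] = []

    async def reasoner_worker() -> None:
        while True:
            frame_id = await queue.get()
            latest["narrative"] = await reason_about(frame_id)

    worker = asyncio.create_task(reasoner_worker())
    for frame_id in range(num_frames):
        if frame_id % keyframe_every == 0:
            offer_latest(queue, frame_id)     # slow path: fire and forget
        overlays.append(latest["narrative"])  # hot path renders whatever is ready
        await asyncio.sleep(0.01)             # stand-in for per-frame dense inference
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return overlays
```

Running `asyncio.run(run_pipeline(20, 5))` yields `None` for the first few frames and a narrative that trails the live frames afterwards, which is the "a beat behind" behavior described above.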
Recommended architecture
           ┌────────────────────────────────────────────────────────────────────┐
video ───▶ │ Ingest (WebRTC / RTSP) → decode                                    │
stream     └──────────────────────────────────┬─────────────────────────────────┘
                                              │ every frame
           ┌──────────────────────────────────▼─────────────────────────────────┐
           │ FRAME GOVERNOR (temporal-redundancy gate)                          │
           │ • perceptual hash + optical-flow / scene-change detection          │
           │ • tags each frame: {keyframe?, motion score, ROI changed}          │
           │ NOTE: this is YOUR gate. Cosmos's EVS is internal to the           │
           │       reasoner (token pruning) and is NOT reusable here.           │
           └─────────┬───────────────────────┬───────────────────────┬──────────┘
                     │ every frame           │ keyframes only        │ keyframes/clips
                     │                       │                       │ cadence
           ┌─────────▼──────────┐  ┌─────────▼──────────┐  ┌─────────▼──────────┐
           │ TRACK PATH (RT)    │  │ KEYFRAME PATH      │  │ SLOW PATH (async)  │
           │                    │  │                    │  │                    │
           │ • SAM 2 mask       │  │ • Detector re-seed │  │ • Cosmos 3         │
           │   propagation      │  │   (G-DINO/YOLO)    │  │   Reasoner NIM     │
           │   via streaming    │  │   → boxes for SAM2 │  │   → reasoning,     │
           │   memory           │  │ • Metric-depth     │  │   captions,        │
           │   (wants temporal  │  │   (per-frame,      │  │   events           │
           │   continuity;      │  │   stateless)       │  │                    │
           │   don't starve     │  │ • Sapiens2-0.4B    │  │ (sparser cadence;  │
           │   it of frames)    │  │   pose/parts/      │  │ lags ~1 beat)      │
           │                    │  │   pointmap (mask   │  │                    │
           │                    │  │   via seg head;    │  │                    │
           │                    │  │   scale-norm.)     │  │                    │
           └─────────┬──────────┘  └─────────┬──────────┘  └─────────┬──────────┘
                     │ masks/IDs             │ boxes, depth, geom    │ narrative
           ┌─────────▼───────────────────────▼───────────────────────▼──────────┐
           │ Fusion + overlay service → annotated video + JSON results          │
           │ (boxes, masks, skeletons, measurements, scene reasoning)           │
           └────────────────────────────────────────────────────────────────────┘
Why this shape:
- Frame governor owns temporal-redundancy reduction (see the dedicated section below) and decides which frames trigger expensive work.
- SAM 2 runs on the track path and propagates masks via its own streaming memory — fed continuous frames, re-seeded on keyframes.
- Detector re-seeds object boxes on keyframes (and feeds SAM2's prompts); owns the object-detection gap neither foundation model fills.
- Sapiens2-0.4B owns precise human pose/geometry on the keyframe path. Mask its pointmap with the segmentation head and treat it as scale-normalized; convert to absolute units only with known intrinsics or a reference.
- Metric-depth model covers measurement of non-human objects (Sapiens2's blind spot) — per-frame and stateless, so it benefits most directly from keyframe gating.
- Cosmos 3 reasoner owns the language/reasoning layer asynchronously, so a heavy VLM never blocks live rendering.
- All converge in a fusion layer that produces both the annotated video and a structured JSON stream the app can render.
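One possible shape for the structured JSON stream that the fusion layer emits is sketched below. The field names are assumptions for illustration, not a schema from either model. The detail worth copying is `narrative_frame_id`: because the reasoner lags the live geometry, each record states which frame its narrative actually describes.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TrackedObject:
    track_id: int
    label: str
    box_xyxy: Tuple[float, float, float, float]  # pixel coordinates


@dataclass
class FrameResult:
    frame_id: int
    timestamp_s: float
    is_keyframe: bool
    objects: List[TrackedObject] = field(default_factory=list)
    human_keypoints: List[Tuple[float, float]] = field(default_factory=list)
    measurements_m: Dict[str, float] = field(default_factory=dict)
    narrative: Optional[str] = None           # from the async reasoner; may lag
    narrative_frame_id: Optional[int] = None  # frame the narrative was computed on

    def to_json(self) -> str:
        """Serialize to one compact JSON line for streaming to the client."""
        return json.dumps(asdict(self), separators=(",", ":"))
```

Emitting one such line per frame lets the client draw fresh boxes and skeletons immediately while showing the narrative with an explicit staleness indicator (`frame_id - narrative_frame_id`).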
Frame governor: temporal-redundancy reduction (and why EVS doesn't help here)
The instinct to reuse Cosmos's video-frame optimization for your other models is reasonable, but the mechanism doesn't transfer. Worth being precise:
What Cosmos's EVS actually is. "Efficient Video Sampling" (arXiv 2510.14624) prunes temporally static patches — sub-frame spatial regions unchanged across consecutive frames — at the token level, inside the reasoner, before the transformer attends over them. It preserves positional identity, needs no retraining, and cuts time-to-first-token up to ~4×. Crucially, its output is a shorter sequence of video tokens consumed by Cosmos's own transformer — not RGB frames, and not even whole frames. There is no tap to pull frames out and route them to SAM2 or a depth model. Treat EVS as a private latency knob for the reasoner; it is not a frame source.
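To make the "not a frame source" point concrete, here is a toy illustration of pruning temporally static patches. It is not the EVS implementation: each patch is reduced to a single float, the tolerance is arbitrary, and the function name is hypothetical. It only shows the shape of the output, a ragged list of tokens that keep their (frame, patch) position ids.

```python
from typing import List, Sequence, Tuple

Token = Tuple[int, int, float]  # (frame_index, patch_index, patch_value)


def prune_static_patches(frames: Sequence[Sequence[float]], tol: float = 1e-3) -> List[Token]:
    """Keep every patch of frame 0, then only patches that changed vs. the previous frame."""
    tokens: List[Token] = []
    for t, frame in enumerate(frames):
        for p, value in enumerate(frame):
            if t == 0 or abs(value - frames[t - 1][p]) > tol:
                tokens.append((t, p, value))  # position ids (t, p) survive pruning
    return tokens
```

Three 4-patch frames in which only one patch changes once produce 5 tokens instead of 12. Frame 2 contributes no tokens at all, so there is no complete image left to hand to a depth or segmentation model.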
The idea is reusable; the implementation is per-model. Build your own frame governor (perceptual hash + optical-flow/scene-change detection) that tags each frame. Then each consumer treats it differently:
| Consumer | State? | What it wants from the governor |
|---|---|---|
| Metric-depth model | Stateless, per-frame | Keyframe gating helps directly — skip near-duplicate frames, run on keyframes, hold/interpolate/smooth between. Do not apply EVS-style patch pruning: depth needs the whole frame for geometric context. |
| SAM 2 | Stateful (streaming memory) | Opposite instinct — feed it continuous frames. Higher frame rates give better tracking stability; aggressive frame-dropping causes drift and lost tracks. Run expensive detection/re-seeding on keyframes only and let SAM2's memory propagate masks on intervening frames. SAM2 is the temporal-redundancy optimization for segmentation. |
| Sapiens2 | Stateless, per-frame | Keyframe gating helps; add your own light tracking/smoothing across frames since Sapiens2 has no temporal module. |
| Cosmos 3 reasoner | — | Runs on a sparse keyframe/clip cadence anyway; let its internal EVS handle intra-clip token redundancy. |
One-line rule: a VLM tolerates patch-level token pruning that pixel models (depth, SAM2) don't — so don't reuse EVS; gate at the frame level upstream, and let stateful SAM2 keep its continuity.
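A frame-level governor along these lines can be quite small. The sketch below assumes the decoder already hands you an 8x8 grayscale thumbnail per frame as a flat sequence of 64 ints, computes an average hash for change detection, and forces a keyframe after `max_gap` frames. It swaps optical flow for a cheaper mean-absolute-difference motion score; the class, thresholds, and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


def average_hash(thumb: Sequence[int]) -> int:
    """64-bit average hash of an 8x8 grayscale thumbnail (flat, row-major)."""
    if len(thumb) != 64:
        raise ValueError("expected an 8x8 thumbnail (64 values)")
    mean = sum(thumb) / 64
    bits = 0
    for value in thumb:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


@dataclass
class FrameTag:
    is_keyframe: bool
    motion_score: float  # mean absolute thumbnail difference vs. the last keyframe


class FrameGovernor:
    def __init__(self, hash_threshold: int = 6, max_gap: int = 30) -> None:
        self.hash_threshold = hash_threshold  # bits that must flip to count as a change
        self.max_gap = max_gap                # force a keyframe at least this often
        self._ref_thumb: Optional[Sequence[int]] = None
        self._ref_hash = 0
        self._since_key = 0

    def tag(self, thumb: Sequence[int]) -> FrameTag:
        h = average_hash(thumb)
        if self._ref_thumb is None:
            motion, changed = 0.0, True  # the first frame is always a keyframe
        else:
            motion = sum(abs(a - b) for a, b in zip(thumb, self._ref_thumb)) / 64
            changed = hamming(h, self._ref_hash) >= self.hash_threshold
        self._since_key += 1
        is_key = changed or self._since_key >= self.max_gap
        if is_key:
            self._ref_thumb, self._ref_hash, self._since_key = list(thumb), h, 0
        return FrameTag(is_key, motion)
```

Each consumer then reads the tag differently: the depth model and Sapiens2 run only when `is_keyframe` is true, the detector re-seeds SAM 2 on keyframes, and SAM 2 itself still receives every frame regardless of the tag.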
Hosting: Cosmos reasoner as a NIM on Modal/ECS/EC2 (no Kubernetes, as covered in the Cosmos study plan); Sapiens2, the detector, and the depth model as your own GPU services (Modal functions or a Triton-style inference server). Start with Sapiens2-0.4B plus a YOLO detector and SAM 2 tracker for the real-time path and the Cosmos Nano reasoner on keyframes; scale model sizes up once the pipeline works.
Bottom line
- Cosmos 3 = the reasoning and world-understanding brain (general scenes, language, events). Heavier, async, keyframe-driven.
- Sapiens2 = the precision instrument for people (pose, parts, 3D pointmap, surface detail). Light, real-time, per-frame — but humans only, masked via segmentation, and scale-normalized rather than absolute metric.
- Object detection and general-object measurement belong to neither — budget for a dedicated detector/tracker and a metric-depth model.
- Ship them as an ensemble with a fast per-frame hot path and an async reasoning path, not as one model behind one endpoint.
Sources
- Sapiens2 launch writeup: https://www.marktechpost.com/2026/04/27/meta-ai-releases-sapiens2-a-high-resolution-human-centric-vision-model-for-pose-segmentation-normals-pointmap-and-albedo/
- Sapiens2 paper: https://arxiv.org/abs/2604.21681
- Sapiens2 weights: https://huggingface.co/collections/facebook/sapiens2 · Repo: https://github.com/facebookresearch/sapiens2
- Original Sapiens (ECCV 2024): https://arxiv.org/abs/2408.12569
- Cosmos 3 technical blog: https://developer.nvidia.com/blog/develop-physical-ai-reasoning-world-and-action-models-with-nvidia-cosmos-3/
- Cosmos 3 HF launch: https://huggingface.co/blog/nvidia/cosmos-3-for-physical-ai
- Efficient Video Sampling (EVS) paper: https://arxiv.org/abs/2510.14624
- SAM2 temporal sampling / streaming memory (TSMS-SAM2): https://arxiv.org/abs/2508.05829