Transform any video into an interactive 3D avatar with motion capture, animation labeling, and AI-powered controls.
This project provides a complete pipeline for converting videos (local files or YouTube) into interactive 3D human animations. Built on Meta's SAM 3D Body model, it extracts full-body 3D meshes from video, enables frame-by-frame labeling, and presents everything in a professional web-based 3D studio with voice control.
- 🎥 Video to 3D Pipeline — Upload any video or paste a YouTube URL, get back a fully rigged 3D avatar
- 🌐 Professional Web Studio — Three.js-powered viewer with timeline, playback controls, and segment editor
- 🎙️ Voice Control — Say "show me a kick" and AI navigates to the matching animation
- 📹 YouTube Integration — Download and process videos directly from YouTube URLs
- 🎬 Multiple Jobs — Process multiple videos simultaneously, switch between them instantly
- 💾 Export Ready — Export labeled segments as JSON, animation data as binary for web/game engines
- 🤖 AI-Powered — Uses 5 different AI models working together seamlessly
- 📁 Local Upload — Drag & drop MP4, AVI, MOV, MKV, WebM files (up to 2GB)
- 📺 YouTube Download — Paste any YouTube URL, fetch metadata preview, download & process automatically
- 🔄 Multiple Jobs — Process multiple videos in parallel, switch between them via dropdown
- ⚙️ Configurable — Adjust frame skip (1-10) and inference mode (body-only or full body+hands)
- 🎨 Real-time 3D Viewer — Orbit controls, PBR lighting, wireframe mode, color presets
- 🎬 Video Sync — Source video plays side-by-side with 3D mesh, frame-perfect sync
- 📊 Timeline Scrubber — Color-coded technique segments, playback speed (0.25x–4x)
- ✏️ Segment Editor — Label frame ranges with names and categories (stance, kick, block, etc.)
- 🎙️ Voice Control — Speak commands like "show me a horse stance" powered by Llama 3.3 70B
- ⌨️ Keyboard Shortcuts — Space (play/pause), arrows (step), Home/End (skip to start/end)
- Video Input — Upload file or download from YouTube with yt-dlp
- Frame Extraction — Extract frames at configurable intervals (OpenCV)
- Human Detection — YOLOv8 or ViTDet finds person bounding boxes
- Camera Estimation — MoGe2 predicts field-of-view and intrinsics
- 3D Mesh Recovery — SAM 3D Body extracts 18,439 vertices + 70 joints per frame
- Segmentation (optional) — SAM2 provides high-quality human masks
- Web Export — Binary mesh data for Three.js real-time rendering
- Interactive Studio — Web-based viewer with labeling, voice control, and export
YouTube Video (Karate Heian Sandan)
│
▼
┌─────────────────────────────────────────────────────────┐
│ FRAME EXTRACTION │
│ download_karate_video.py → karate_pose_pipeline.py │
│ OpenCV: extract frames, resize, label with transcript │
└─────────────────────┬───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ 3D MESH RECOVERY (per frame) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────┐ │
│ │ YOLOv8 / │ │ MoGe2 FOV │ │ SAM2 │ │
│ │ ViTDet │──▶│ Estimator │──▶│ Segmentor │ │
│ │ (Detector) │ │ (Camera K) │ │ (optional) │ │
│ └──────┬───────┘ └──────┬───────┘ └─────┬──────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ SAM 3D Body (DINOv3-H+ 840M) │ │
│ │ │ │
│ │ Image Patches → DINOv3 Backbone → Embeddings │ │
│ │ CameraEncoder → Ray-Conditioned Features │ │
│ │ PromptableDecoder (N layers) → Pose Tokens │ │
│ │ MHR Head → 18,439 vertices + 70 joints │ │
│ └──────────────────────┬──────────────────────────┘ │
│ │ │
└─────────────────────────┼───────────────────────────────┘
│
┌─────────────────┼─────────────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────────┐ ┌──────────────┐
│ .npz per │ │ Visualization│ │ Pose Library │
│ frame │ │ JPEG panels │ │ (averaged │
│ (vertices,│ │ (original + │ │ 3D poses per │
│ joints, │ │ skeleton + │ │ technique) │
│ params) │ │ mesh views) │ │ │
└──────┬────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
▼ ▼ ▼
┌────────────────────────────────────────────────────────┐
│ WEB EXPORT & AVATAR STUDIO │
│ │
│ export_web_data.py → faces.bin + frame_*.bin │
│ │
│ avatar_studio.html (Three.js) │
│ ├── 3D Mesh Viewer with orbit controls │
│ ├── Video player synced to 3D frames │
│ ├── Timeline scrubber with colored technique segments │
│ ├── Segment labeling / editing UI │
│ ├── Voice Control Agent (Web Speech API → Groq LLM) │
│ └── Export (JSON labels, animation data) │
│ │
│ llm_pose_controller.py │
│ └── JSON command → interpolated 3D transition video │
└────────────────────────────────────────────────────────┘
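The per-frame stage ordering in the diagram above can be sketched as a plain Python loop. This is an illustrative skeleton, not the project's actual code: the detect, estimate_fov, and recover_mesh callables are hypothetical stand-ins for the YOLOv8/ViTDet detector, the MoGe2 FOV estimator, and the SAM 3D Body forward pass.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class FrameResult:
    """3D recovery output for one video frame (shapes per the MHR model)."""
    vertices: list   # 18,439 x 3 mesh vertices
    joints: list     # 70 x 3 MHR70 joints

def run_pipeline(frames: List[object],
                 detect: Callable,        # person bounding boxes (YOLOv8 / ViTDet)
                 estimate_fov: Callable,  # camera intrinsics K (MoGe2)
                 recover_mesh: Callable,  # SAM 3D Body forward pass
                 frame_skip: int = 6) -> List[FrameResult]:
    """Run detection -> camera estimation -> mesh recovery on every Nth frame."""
    results = []
    for idx, frame in enumerate(frames):
        if idx % frame_skip != 0:
            continue                      # honor the configurable frame skip
        boxes = detect(frame)
        if not boxes:
            continue                      # no person found in this frame
        intrinsics = estimate_fov(frame)
        results.append(recover_mesh(frame, boxes[0], intrinsics))
    return results
```

The detector runs first because the mesh-recovery model is conditioned on a person crop; frames without a detection are simply skipped.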
This project orchestrates 5 different AI models in a single pipeline:
| # | Model | Role | Size | Source |
|---|---|---|---|---|
| 1 | SAM 3D Body (DINOv3-H+) | Core 3D human mesh recovery from single images | 840M params (~3.5 GB) | HuggingFace |
| 2 | ViTDet (Cascade Mask R-CNN) | Human detection — finds person bounding boxes in frames | ~2.5 GB | Detectron2 |
| 3 | MoGe2 | Field-of-view estimation — predicts camera intrinsics K | ~1.2 GB | HuggingFace |
| 4 | SAM2.1 (Hiera-Large) | Human segmentation masks (optional, highest quality) | ~900 MB | Meta |
| 5 | Llama 3.3 70B | Voice command intent recognition (via Groq API) | Cloud API | Groq |
Additional: YOLOv8n is included as a lightweight alternative detector.
The output mesh uses Meta's MHR parametric body model, which produces:
- 18,439 mesh vertices per person per frame
- 70 3D joints (MHR70 skeleton — body, hands, feet)
- Body pose parameters + shape parameters
- Camera translation and focal length
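A minimal sketch of round-tripping one per-frame .npz with the shapes listed above. The key names (vertices, joints) are assumptions based on the output listing; inspect a real file with np.load(path).files to confirm what the pipeline actually stores.

```python
import numpy as np

# Write a dummy per-frame file with the documented shapes
# (key names are illustrative, not guaranteed to match the pipeline's output)
vertices = np.zeros((18439, 3), dtype=np.float32)  # MHR mesh vertices
joints = np.zeros((70, 3), dtype=np.float32)       # MHR70 skeleton joints
np.savez("frame_000000.npz", vertices=vertices, joints=joints)

# Read it back and check the shapes
data = np.load("frame_000000.npz")
print(data["vertices"].shape)  # (18439, 3)
print(data["joints"].shape)    # (70, 3)
```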
- Python 3.11 (recommended) — Download from python.org
- CUDA 11.8+ — NVIDIA CUDA Toolkit
- GPU with 16GB+ VRAM (RTX 3090/4090 or better recommended)
- Git — Download Git for Windows
# Clone the repository
git clone https://github.com/yourusername/sam-3d-body.git
cd sam-3d-body
# Create Python virtual environment
python -m venv venv
venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
pip install yt-dlp imageio-ffmpeg  # For YouTube downloads

# Login to HuggingFace (you need to request access to SAM 3D Body first)
huggingface-cli login
# Download all models (~8GB total)
python download_models.py

Note: You must request access to the SAM 3D Body model on HuggingFace before downloading. The full set of checkpoints is ~8GB.
# Start the web server (handles uploads, processing, YouTube downloads)
python studio_server.py --port 8765

Or simply double-click run_studio.bat.
Navigate to http://localhost:8765 in your browser (Chrome or Edge recommended).
Option A: Upload a Local Video
- Click 📹 Upload Video
- Drag & drop a video file or click to browse
- Configure settings (frame skip: 6, mode: full)
- Click Process
- Wait for processing to complete (~1-2 seconds per frame on RTX 4090)
Option B: Download from YouTube
- Click 📹 Upload Video → 📺 YouTube URL tab
- Paste a YouTube URL (e.g., karate tutorial, dance video, sports clip)
- Click Fetch Info to preview
- Click Download & Process
- The video downloads and processes automatically
- Play/Pause — Space bar or ▶ button
- Scrub Timeline — Click anywhere on the timeline
- Label Segments — Select frame range, add name/category, click Save
- Voice Control — Click 🎙️ Voice Control, say "show me a kick"
- Export — Click 💾 Export JSON to save labeled segments
This project can be used for:
- Martial Arts Training — Analyze karate, taekwondo, kung fu techniques frame-by-frame
- Dance Choreography — Extract 3D poses from dance videos, create move libraries
- Yoga & Exercise — Document poses, create interactive tutorials
- Sports Analysis — Study golf swings, tennis serves, basketball shots in 3D
- Motion Capture — Convert video to 3D animation data without expensive mocap equipment
- Character Animation — Extract realistic human movements for game characters
- Reference Library — Build a searchable database of 3D poses and movements
- Animation Prototyping — Quickly test movement ideas from video reference
- Biomechanics — Study human movement patterns, joint angles, body mechanics
- Physical Therapy — Document patient movements, track rehabilitation progress
- Ergonomics — Analyze workplace movements, optimize body positions
- Academic Research — Dataset creation for computer vision, HMR research
- Video Production — Create 3D visualizations from 2D video footage
- Social Media — Generate unique 3D content from viral videos
- Tutorials — Make interactive 3D guides from instructional videos
- Art Projects — Use 3D human meshes as creative material
| Script | Description |
|---|---|
| download_karate_video.py | Downloads the karate Heian Sandan video from YouTube (with cookie auth support) |
| download_models.py | Downloads all model checkpoints from HuggingFace (SAM 3D Body, ViTDet, MoGe2, SAM2) |
| karate_pose_pipeline.py | Main pipeline — extracts frames, labels with transcript, runs SAM 3D Body, builds pose library |
| process_karate_video.py | Alternative video processing script (no transcript labeling, simpler) |
| export_web_data.py | Exports mesh data to binary format for the Three.js web viewer |
| render_avatar_video.py | Renders a standalone avatar-only video from the mesh data |
| llm_pose_controller.py | Accepts JSON pose commands and generates interpolated transition videos |
| demo.py | Original SAM 3D Body demo — single-image inference |
| run_dancing.py | Quick test script on a sample image |
Problem: "Download produced no file" or "yt-dlp not found"
Solution:
# Install yt-dlp and ffmpeg
pip install yt-dlp imageio-ffmpeg
# Add Python Scripts to PATH (if yt-dlp command not found)
$userScripts = "$env:APPDATA\Python\Python311\Scripts"
[Environment]::SetEnvironmentVariable("Path", "$userScripts;" + [Environment]::GetEnvironmentVariable("Path", "User"), "User")
# Restart your terminal/PowerShell

Problem: "CUDA out of memory" or "RuntimeError: No CUDA GPUs available"
Solutions:
- Increase frame skip: use --frame_skip 10 (process every 10th frame)
- Use body-only mode: set inference type to "body" instead of "full"
- Close other GPU applications (Chrome, games, etc.)
- Check CUDA installation: nvidia-smi should show your GPU
Problem: Models run on CPU (very slow)
Solution:
# Verify PyTorch sees your GPU
python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"
# If False, reinstall PyTorch with CUDA
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Problem: "401 Unauthorized" when downloading models
Solution:
- Go to https://huggingface.co/facebook/sam-3d-body-dinov3
- Click "Request Access" and wait for approval (usually instant)
- Run huggingface-cli login and enter your token
- Retry python download_models.py
Problem: Download interrupted or corrupted
Solution:
# Clear cache and re-download
Remove-Item -Recurse -Force "$env:USERPROFILE\.cache\huggingface\hub"
python download_models.py

Problem: "Address already in use" or port 8765 occupied
Solution:
# Find and kill the process using port 8765
Get-Process -Name python | Where-Object {$_.CommandLine -like '*studio_server*'} | Stop-Process -Force
# Or use a different port
python studio_server.py --port 8080

Problem: Voice control doesn't work
Solution:
- Use Chrome or Edge (Firefox has limited Web Speech API support)
- Allow microphone permissions when prompted
- Get a free Groq API key: https://console.groq.com/
- The studio will prompt for the key on first use
Problem: 3D viewer is black or not loading
Solution:
- Enable hardware acceleration in browser settings
- Update your GPU drivers
- Try a different browser (Chrome recommended)
- Check browser console (F12) for errors
- Frame Skip: Start with 6-10 for testing, use 1-2 for final quality
- Inference Mode: "body" is 2x faster than "full" (body+hands)
- Video Resolution: Downscale large videos to 1080p before processing
- Batch Size: Process shorter clips (30-60 seconds) for faster iteration
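For the downscaling tip, the target size can be computed with a small helper that caps the height at 1080p while preserving aspect ratio (the actual resize would then be done with OpenCV or ffmpeg). This helper is a sketch, not part of the project's scripts.

```python
def target_size(width: int, height: int, max_height: int = 1080) -> tuple:
    """Return (w, h) downscaled so height <= max_height, preserving aspect ratio.

    The width is rounded down to an even number because many video codecs
    require even dimensions.
    """
    if height <= max_height:
        return width, height                         # already small enough
    scale = max_height / height
    new_w = int(round(width * scale)) // 2 * 2       # round down to even
    return new_w, max_height

print(target_size(3840, 2160))  # 4K UHD -> (1920, 1080)
print(target_size(1280, 720))   # already <= 1080p -> (1280, 720)
```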
The avatar_studio.html file is a full-featured web application built with Three.js that provides:
- 3D Mesh Viewer — orbit controls, PBR lighting, wireframe toggle, color presets
- Video Sync — source video synced frame-by-frame to the 3D mesh
- Split View — side-by-side 3D avatar and video
- Timeline Scrubber — color-coded technique segments, playback controls (0.25x–4x speed)
- Segment Editor — label frame ranges with technique names and categories
- Smooth Morphing — vertex interpolation between frames for fluid animation
- Keyboard Shortcuts — Space (play/pause), Arrow keys (step), Home/End (skip)
- Export — save labeled segments as JSON
The studio includes an agentic voice control system:
- Click the Voice Control button (or use Chrome/Edge with microphone)
- Speak a command like "show me a horse stance" or "do a crescent kick"
- The Web Speech API transcribes your voice
- The transcript is sent to Llama 3.3 70B (via Groq API) with the list of available techniques
- The LLM returns a structured JSON action (goto_move, play, stop)
- The studio navigates to the matching 3D animation and starts playback
This requires a Groq API key — the studio will prompt you on first use.
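Validating the LLM's reply defensively matters here, since a malformed model response should never crash playback. The sketch below shows one way to parse the structured action; the exact JSON schema ("action" and "move" keys) is an assumption for illustration, not the studio's documented format.

```python
import json

VALID_ACTIONS = {"goto_move", "play", "stop"}

def parse_voice_action(llm_reply: str) -> dict:
    """Parse the LLM's JSON reply into a safe action dict.

    Falls back to a harmless "stop" on malformed JSON or unknown actions.
    """
    try:
        action = json.loads(llm_reply)
    except json.JSONDecodeError:
        return {"action": "stop"}
    if not isinstance(action, dict) or action.get("action") not in VALID_ACTIONS:
        return {"action": "stop"}
    return action

# e.g. the reply for "show me a horse stance" might look like:
reply = '{"action": "goto_move", "move": "kiba-dachi horse riding stance"}'
print(parse_voice_action(reply)["action"])  # goto_move
```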
The llm_pose_controller.py script enables programmatic control of 3D pose transitions:
command = {
"start_pose": "yoi ready stance",
"end_pose": "kiba-dachi horse riding stance",
"air_time": 2.0, # transition duration in seconds
"rotation": 15 # Y-axis rotation in degrees
}

This will:
- Look up both poses in the pose library (averaged 3D vertices per technique)
- Generate N interpolated frames between start and end
- Apply optional Y-axis rotation with ease-in/out
- Render front + side view video using the SAM 3D Body renderer
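The interpolation step above (without the rotation) can be sketched as eased linear blending between two vertex arrays. This is a simplified illustration assuming poses are stored as (V, 3) arrays, not the script's actual implementation.

```python
import numpy as np

def ease_in_out(t: np.ndarray) -> np.ndarray:
    """Smoothstep easing: 0 -> 0, 1 -> 1, with zero velocity at both ends."""
    return t * t * (3.0 - 2.0 * t)

def interpolate_poses(start: np.ndarray, end: np.ndarray, n_frames: int) -> np.ndarray:
    """Blend two (V, 3) vertex arrays into n_frames eased in-between poses."""
    t = ease_in_out(np.linspace(0.0, 1.0, n_frames))          # eased blend weights
    return start[None] + t[:, None, None] * (end - start)[None]

start = np.zeros((18439, 3), dtype=np.float32)   # e.g. averaged "yoi" stance
end = np.ones((18439, 3), dtype=np.float32)      # e.g. averaged "kiba-dachi"
frames = interpolate_poses(start, end, n_frames=60)
print(frames.shape)  # (60, 18439, 3)
```

With air_time = 2.0 seconds at 30 fps, n_frames would be 60; the first frame equals the start pose and the last equals the end pose.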
After running the full pipeline:
sam-3d-body/
├── karate_frames/ # Extracted video frames
│ └── frame_000000.jpg ...
├── karate_output/
│ ├── visualized_frames/ # 4-panel visualizations per frame
│ │ └── frame_000000.jpg ... # (original | skeleton | mesh front | mesh side)
│ ├── mesh_data/ # Per-frame 3D data (.npz)
│ │ └── frame_000000.npz ... # (vertices, joints, pose params, camera)
│ ├── web/ # Binary data for Three.js viewer
│ │ ├── faces.bin # Mesh face topology (int32)
│ │ ├── frame_*.bin # Per-frame vertices (float32)
│ │ └── manifest.json # Frame metadata index
│ ├── pose_timeline.csv # Frame → technique → 3D joint positions
│ ├── pose_library.npz # Averaged 3D poses per technique
│ ├── pose_library_index.json # Human-readable technique index
│ └── heian_sandan_3d.mp4 # Compiled output video
└── karate_transcript.json # Technique labels with timestamps
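The web binaries can be read back with NumPy. The dtypes (int32 faces, float32 vertices) come from the listing above; the flat (N, 3) layout is an assumption here, so check manifest.json for the authoritative frame metadata.

```python
import numpy as np

# Write sample binaries in the documented dtypes (layout assumed:
# faces.bin = flat int32 triangle indices, frame_*.bin = flat float32 xyz)
faces = np.array([[0, 1, 2], [2, 1, 3]], dtype=np.int32)
verts = np.random.rand(4, 3).astype(np.float32)
faces.tofile("faces.bin")
verts.tofile("frame_000000.bin")

# Read them back, restoring the (N, 3) shape
faces_rt = np.fromfile("faces.bin", dtype=np.int32).reshape(-1, 3)
verts_rt = np.fromfile("frame_000000.bin", dtype=np.float32).reshape(-1, 3)
print(faces_rt.shape, verts_rt.shape)  # (2, 3) (4, 3)
```

Storing raw typed arrays (rather than JSON) keeps per-frame payloads small and lets the Three.js viewer feed them straight into GPU buffers.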
- GPU: NVIDIA RTX 3060 (12GB VRAM) or better
- RAM: 16GB system RAM
- Storage: 20GB free space (models + processed data)
- OS: Windows 10/11 (64-bit)
- GPU: NVIDIA RTX 4090 (24GB VRAM) or RTX 3090 (24GB VRAM)
- RAM: 32GB system RAM
- Storage: 50GB+ free space (SSD recommended)
- OS: Windows 11 (64-bit)
| Mode | VRAM | Speed (RTX 4090) | Quality |
|---|---|---|---|
| Body-only, DINOv3-H+ | ~14 GB | ~0.8s/frame | Excellent |
| Full (body+hands), DINOv3-H+ | ~16 GB | ~1.5s/frame | Best |
| Full + SAM2 mask, DINOv3-H+ | ~20 GB | ~2.5s/frame | Maximum |
Note: Model checkpoints (~8GB total) are not included in this repository. You must download them separately using download_models.py after requesting access on HuggingFace.
This project is built on SAM 3D Body by Meta Superintelligence Labs.
SAM 3D Body (3DB) is a promptable model for single-image full-body 3D human mesh recovery (HMR). It uses an encoder-decoder architecture with a DINOv3-H+ backbone, supports auxiliary prompts (2D keypoints, masks), and is trained on high-quality annotations derived from multi-view geometry and differentiable optimization.
| Backbone (size) | 3DPW (MPJPE) | EMDB (MPJPE) | RICH (PVE) | COCO (PCK@.05) | LSPET (PCK@.05) | Freihand (PA-MPJPE) |
|---|---|---|---|---|---|---|
| DINOv3-H+ (840M) (config, checkpoint) | 54.8 | 61.7 | 60.3 | 86.5 | 68.0 | 5.5 |
| ViT-H (631M) (config, checkpoint) | 54.8 | 62.9 | 61.7 | 86.8 | 68.9 | 5.5 |
import cv2
import numpy as np
from notebook.utils import setup_sam_3d_body
from tools.vis_utils import visualize_sample_together
estimator = setup_sam_3d_body(hf_repo_id="facebook/sam-3d-body-dinov3")
img_bgr = cv2.imread("path/to/image.jpg")
outputs = estimator.process_one_image(cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB))
rend_img = visualize_sample_together(img_bgr, outputs, estimator.faces)
cv2.imwrite("output.jpg", rend_img.astype(np.uint8))

For the complete upstream demo, see notebook/demo_human.ipynb.
sam-3d-body/
├── sam_3d_body/ # Core SAM 3D Body model (Meta upstream)
│ ├── models/ # Encoder, decoder, heads
│ ├── visualization/ # Renderer, skeleton visualizer
│ ├── data/ # Data loading utilities
│ ├── metadata/ # MHR70 joint definitions
│ └── utils/ # Model utilities
├── tools/ # Detector, FOV estimator, segmentor builders
│ ├── build_detector.py # YOLOv8 / ViTDet human detector
│ ├── build_fov_estimator.py # MoGe2 field-of-view estimator
│ ├── build_sam.py # SAM2 human segmentor
│ └── vis_utils.py # Visualization helpers
├── notebook/ # Jupyter demo notebook
├── data/ # Dataset download scripts (upstream)
├── karate_pose_pipeline.py # Main video → 3D mesh pipeline
├── llm_pose_controller.py # LLM-driven pose interpolation
├── avatar_studio.html # Interactive 3D web studio
├── export_web_data.py # Mesh → binary for Three.js
├── download_karate_video.py # YouTube video downloader
├── download_models.py # Model checkpoint downloader
├── karate_transcript.json # Karate technique labels
├── .env.example # API key template
├── INSTALL.md # Dependency installation guide
├── KARATE_PIPELINE_README.md # Detailed pipeline documentation
└── LICENSE # SAM License
The SAM 3D Body model checkpoints and code are licensed under SAM License.
See contributing and the code of conduct.
If you use SAM 3D Body or the SAM 3D Body dataset in your research, please use the following BibTeX entry.
@article{yang2026sam3dbody,
title={SAM 3D Body: Robust Full-Body Human Mesh Recovery},
author={Yang, Xitong and Kukreja, Devansh and Pinkus, Don and Sagar, Anushka and Fan, Taosha and Park, Jinhyung and Shin, Soyong and Cao, Jinkun and Liu, Jiawei and Ugrinovic, Nicolas and Feiszli, Matt and Malik, Jitendra and Dollar, Piotr and Kitani, Kris},
journal={arXiv preprint arXiv:2602.15989},
year={2026}
}

