Transform any video into an interactive 3D avatar with motion capture, animation labeling, and AI-powered controls.
This project provides a complete pipeline for converting videos (local files or YouTube) into interactive 3D human animations. Built on Meta's SAM 3D Body model, it extracts full-body 3D meshes from video, enables frame-by-frame labeling, and presents everything in a professional web-based 3D studio with voice control.
- 🎥 Video to 3D Pipeline — Upload any video or paste a YouTube URL, get back a fully rigged 3D avatar
- 🌐 Professional Web Studio — Three.js-powered viewer with timeline, playback controls, and segment editor
- 🎙️ Voice Control — Say "show me a kick" and AI navigates to the matching animation
- 📹 YouTube Integration — Download and process videos directly from YouTube URLs
- 🎬 Multiple Jobs — Process multiple videos simultaneously, switch between them instantly
- 💾 Export Ready — Export labeled segments as JSON, animation data as binary for web/game engines
- 🤖 AI-Powered — Uses 5 different AI models working together seamlessly
- 📁 Local Upload — Drag & drop MP4, AVI, MOV, MKV, WebM files (up to 2GB)
- 📺 YouTube Download — Paste any YouTube URL, fetch metadata preview, download & process automatically
- 🔄 Multiple Jobs — Process multiple videos in parallel, switch between them via dropdown
- ⚙️ Configurable — Adjust frame skip (1-10) and inference mode (body-only or full body+hands)
- 🎨 Real-time 3D Viewer — Orbit controls, PBR lighting, wireframe mode, color presets
- 🎬 Video Sync — Source video plays side-by-side with 3D mesh, frame-perfect sync
- 📊 Timeline Scrubber — Color-coded technique segments, playback speed (0.25x–4x)
- ✏️ Segment Editor — Label frame ranges with names and categories (stance, kick, block, etc.)
- 🎙️ Voice Control — Speak commands like "show me a horse stance" powered by Llama 3.3 70B
- ⌨️ Keyboard Shortcuts — Space (play/pause), arrows (step), Home/End (skip to start/end)
- Video Input — Upload file or download from YouTube with yt-dlp
- Frame Extraction — Extract frames at configurable intervals (OpenCV)
- Human Detection — YOLOv8 or ViTDet finds person bounding boxes
- Camera Estimation — MoGe2 predicts field-of-view and intrinsics
- 3D Mesh Recovery — SAM 3D Body extracts 18,439 vertices + 70 joints per frame
- Segmentation (optional) — SAM2 provides high-quality human masks
- Web Export — Binary mesh data for Three.js real-time rendering
- Interactive Studio — Web-based viewer with labeling, voice control, and export
YouTube Video (Karate Heian Sandan)
│
▼
┌─────────────────────────────────────────────────────────┐
│ FRAME EXTRACTION │
│ download_karate_video.py → karate_pose_pipeline.py │
│ OpenCV: extract frames, resize, label with transcript │
└─────────────────────┬───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ 3D MESH RECOVERY (per frame) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────┐ │
│ │ YOLOv8 / │ │ MoGe2 FOV │ │ SAM2 │ │
│ │ ViTDet │──▶│ Estimator │──▶│ Segmentor │ │
│ │ (Detector) │ │ (Camera K) │ │ (optional) │ │
│ └──────┬───────┘ └──────┬───────┘ └─────┬──────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ SAM 3D Body (DINOv3-H+ 840M) │ │
│ │ │ │
│ │ Image Patches → DINOv3 Backbone → Embeddings │ │
│ │ CameraEncoder → Ray-Conditioned Features │ │
│ │ PromptableDecoder (N layers) → Pose Tokens │ │
│ │ MHR Head → 18,439 vertices + 70 joints │ │
│ └──────────────────────┬──────────────────────────┘ │
│ │ │
└─────────────────────────┼───────────────────────────────┘
│
┌─────────────────┼─────────────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────────┐ ┌──────────────┐
│ .npz per │ │ Visualization│ │ Pose Library │
│ frame │ │ JPEG panels │ │ (averaged │
│ (vertices,│ │ (original + │ │ 3D poses per │
│ joints, │ │ skeleton + │ │ technique) │
│ params) │ │ mesh views) │ │ │
└──────┬────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
▼ ▼ ▼
┌────────────────────────────────────────────────────────┐
│ WEB EXPORT & AVATAR STUDIO │
│ │
│ export_web_data.py → faces.bin + frame_*.bin │
│ │
│ avatar_studio.html (Three.js) │
│ ├── 3D Mesh Viewer with orbit controls │
│ ├── Video player synced to 3D frames │
│ ├── Timeline scrubber with colored technique segments │
│ ├── Segment labeling / editing UI │
│ ├── Voice Control Agent (Web Speech API → Groq LLM) │
│ └── Export (JSON labels, animation data) │
│ │
│ llm_pose_controller.py │
│ └── JSON command → interpolated 3D transition video │
└────────────────────────────────────────────────────────┘
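The per-frame stage ordering in the diagram above can be sketched as a plain Python loop. This is an illustrative skeleton, not the project's actual code: the detect, estimate_fov, and recover_mesh callables are hypothetical stand-ins for the YOLOv8/ViTDet detector, the MoGe2 FOV estimator, and the SAM 3D Body forward pass.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class FrameResult:
    """3D recovery output for one video frame (shapes per the MHR model)."""
    vertices: list   # 18,439 x 3 mesh vertices
    joints: list     # 70 x 3 MHR70 joints

def run_pipeline(frames: List[object],
                 detect: Callable,        # person bounding boxes (YOLOv8 / ViTDet)
                 estimate_fov: Callable,  # camera intrinsics K (MoGe2)
                 recover_mesh: Callable,  # SAM 3D Body forward pass
                 frame_skip: int = 6) -> List[FrameResult]:
    """Run detection -> camera estimation -> mesh recovery on every Nth frame."""
    results = []
    for idx, frame in enumerate(frames):
        if idx % frame_skip != 0:
            continue                      # honor the configurable frame skip
        boxes = detect(frame)
        if not boxes:
            continue                      # no person found in this frame
        intrinsics = estimate_fov(frame)
        results.append(recover_mesh(frame, boxes[0], intrinsics))
    return results
```

The detector runs first because the mesh-recovery model is conditioned on a person crop; frames without a detection are simply skipped.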
This project orchestrates 5 different AI models in a single pipeline:
| # | Model | Role | Size | Source |
|---|---|---|---|---|
| 1 | SAM 3D Body (DINOv3-H+) | Core 3D human mesh recovery from single images | 840M params (~3.5 GB) | HuggingFace |
| 2 | ViTDet (Cascade Mask R-CNN) | Human detection — finds person bounding boxes in frames | ~2.5 GB | Detectron2 |
| 3 | MoGe2 | Field-of-view estimation — predicts camera intrinsics K | ~1.2 GB | HuggingFace |
| 4 | SAM2.1 (Hiera-Large) | Human segmentation masks (optional, highest quality) | ~900 MB | Meta |
| 5 | Llama 3.3 70B | Voice command intent recognition (via Groq API) | Cloud API | Groq |
Additional: YOLOv8n is included as a lightweight alternative detector.
The output mesh uses Meta's MHR parametric body model, which produces:
- 18,439 mesh vertices per person per frame
- 70 3D joints (MHR70 skeleton — body, hands, feet)
- Body pose parameters + shape parameters
- Camera translation and focal length
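A minimal sketch of round-tripping one per-frame .npz with the shapes listed above. The key names (vertices, joints) are assumptions based on the output listing; inspect a real file with np.load(path).files to confirm what the pipeline actually stores.

```python
import numpy as np

# Write a dummy per-frame file with the documented shapes
# (key names are illustrative, not guaranteed to match the pipeline's output)
vertices = np.zeros((18439, 3), dtype=np.float32)  # MHR mesh vertices
joints = np.zeros((70, 3), dtype=np.float32)       # MHR70 skeleton joints
np.savez("frame_000000.npz", vertices=vertices, joints=joints)

# Read it back and check the shapes
data = np.load("frame_000000.npz")
print(data["vertices"].shape)  # (18439, 3)
print(data["joints"].shape)    # (70, 3)
```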
- Python 3.11 (recommended) — Download from python.org
- CUDA 11.8+ — NVIDIA CUDA Toolkit
- GPU with 16GB+ VRAM (RTX 3090/4090 or better recommended)
- Git — Download Git for Windows
# Clone the repository
git clone https://github.com/yourusername/sam-3d-body.git
cd sam-3d-body
# Create Python virtual environment
python -m venv venv
venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
pip install yt-dlp imageio-ffmpeg  # For YouTube downloads

# Login to HuggingFace (you need to request access to SAM 3D Body first)
huggingface-cli login
# Download all models (~8GB total)
python download_models.py

Note: You must request access to the SAM 3D Body model on HuggingFace before downloading. The full set of checkpoints is ~8GB.
# Start the web server (handles uploads, processing, YouTube downloads)
python studio_server.py --port 8765

Or simply double-click run_studio.bat.
Navigate to http://localhost:8765 in your browser (Chrome or Edge recommended).
Option A: Upload a Local Video
- Click 📹 Upload Video
- Drag & drop a video file or click to browse
- Configure settings (frame skip: 6, mode: full)
- Click Process
- Wait for processing to complete (~1-2 seconds per frame on RTX 4090)
Option B: Download from YouTube
- Click 📹 Upload Video → 📺 YouTube URL tab
- Paste a YouTube URL (e.g., karate tutorial, dance video, sports clip)
- Click Fetch Info to preview
- Click Download & Process
- The video downloads and processes automatically
- Play/Pause — Space bar or ▶ button
- Scrub Timeline — Click anywhere on the timeline
- Label Segments — Select frame range, add name/category, click Save
- Voice Control — Click 🎙️ Voice Control, say "show me a kick"
- Export — Click 💾 Export JSON to save labeled segments
This project can be used for:
- Martial Arts Training — Analyze karate, taekwondo, kung fu techniques frame-by-frame
- Dance Choreography — Extract 3D poses from dance videos, create move libraries
- Yoga & Exercise — Document poses, create interactive tutorials
- Sports Analysis — Study golf swings, tennis serves, basketball shots in 3D
- Motion Capture — Convert video to 3D animation data without expensive mocap equipment
- Character Animation — Extract realistic human movements for game characters
- Reference Library — Build a searchable database of 3D poses and movements
- Animation Prototyping — Quickly test movement ideas from video reference
- Biomechanics — Study human movement patterns, joint angles, body mechanics
- Physical Therapy — Document patient movements, track rehabilitation progress
- Ergonomics — Analyze workplace movements, optimize body positions
- Academic Research — Dataset creation for computer vision, HMR research
- Video Production — Create 3D visualizations from 2D video footage
- Social Media — Generate unique 3D content from viral videos
- Tutorials — Make interactive 3D guides from instructional videos
- Art Projects — Use 3D human meshes as creative material
| Script | Description |
|---|---|
| download_karate_video.py | Downloads the karate Heian Sandan video from YouTube (with cookie auth support) |
| download_models.py | Downloads all model checkpoints from HuggingFace (SAM 3D Body, ViTDet, MoGe2, SAM2) |
| karate_pose_pipeline.py | Main pipeline — extracts frames, labels with transcript, runs SAM 3D Body, builds pose library |
| process_karate_video.py | Alternative video processing script (no transcript labeling, simpler) |
| export_web_data.py | Exports mesh data to binary format for the Three.js web viewer |
| render_avatar_video.py | Renders a standalone avatar-only video from the mesh data |
| llm_pose_controller.py | Accepts JSON pose commands and generates interpolated transition videos |
| demo.py | Original SAM 3D Body demo — single-image inference |
| run_dancing.py | Quick test script on a sample image |
Problem: "Download produced no file" or "yt-dlp not found"
Solution:
# Install yt-dlp and ffmpeg
pip install yt-dlp imageio-ffmpeg
# Add Python Scripts to PATH (if yt-dlp command not found)
$userScripts = "$env:APPDATA\Python\Python311\Scripts"
[Environment]::SetEnvironmentVariable("Path", "$userScripts;" + [Environment]::GetEnvironmentVariable("Path", "User"), "User")
# Restart your terminal/PowerShell

Problem: "CUDA out of memory" or "RuntimeError: No CUDA GPUs available"
Solutions:
- Increase frame skip: use --frame_skip 10 (process every 10th frame)
- Use body-only mode: set inference type to "body" instead of "full"
- Close other GPU applications (Chrome, games, etc.)
- Check CUDA installation: nvidia-smi should show your GPU
Problem: Models run on CPU (very slow)
Solution:
# Verify PyTorch sees your GPU
python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"
# If False, reinstall PyTorch with CUDA
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Problem: "401 Unauthorized" when downloading models
Solution:
- Go to https://huggingface.co/facebook/sam-3d-body-dinov3
- Click "Request Access" and wait for approval (usually instant)
- Run huggingface-cli login and enter your token
- Retry python download_models.py
Problem: Download interrupted or corrupted
Solution:
# Clear cache and re-download
Remove-Item -Recurse -Force "$env:USERPROFILE\.cache\huggingface\hub"
python download_models.py

Problem: "Address already in use" or port 8765 occupied
Solution:
# Find and kill the process using port 8765
Get-Process -Name python | Where-Object {$_.CommandLine -like '*studio_server*'} | Stop-Process -Force
# Or use a different port
python studio_server.py --port 8080

Problem: Voice control doesn't work
Solution:
- Use Chrome or Edge (Firefox has limited Web Speech API support)
- Allow microphone permissions when prompted
- Get a free Groq API key: https://console.groq.com/
- The studio will prompt for the key on first use
Problem: 3D viewer is black or not loading
Solution:
- Enable hardware acceleration in browser settings
- Update your GPU drivers
- Try a different browser (Chrome recommended)
- Check browser console (F12) for errors
- Frame Skip: Start with 6-10 for testing, use 1-2 for final quality
- Inference Mode: "body" is 2x faster than "full" (body+hands)
- Video Resolution: Downscale large videos to 1080p before processing
- Batch Size: Process shorter clips (30-60 seconds) for faster iteration
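For the downscaling tip, the target size can be computed with a small helper that caps the height at 1080p while preserving aspect ratio (the actual resize would then be done with OpenCV or ffmpeg). This helper is a sketch, not part of the project's scripts.

```python
def target_size(width: int, height: int, max_height: int = 1080) -> tuple:
    """Return (w, h) downscaled so height <= max_height, preserving aspect ratio.

    The width is rounded down to an even number because many video codecs
    require even dimensions.
    """
    if height <= max_height:
        return width, height                         # already small enough
    scale = max_height / height
    new_w = int(round(width * scale)) // 2 * 2       # round down to even
    return new_w, max_height

print(target_size(3840, 2160))  # 4K UHD -> (1920, 1080)
print(target_size(1280, 720))   # already <= 1080p -> (1280, 720)
```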
The avatar_studio.html file is a full-featured web application built with Three.js that provides:
- 3D Mesh Viewer — orbit controls, PBR lighting, wireframe toggle, color presets
- Video Sync — source video synced frame-by-frame to the 3D mesh
- Split View — side-by-side 3D avatar and video
- Timeline Scrubber — color-coded technique segments, playback controls (0.25x–4x speed)
- Segment Editor — label frame ranges with technique names and categories
- Smooth Morphing — vertex interpolation between frames for fluid animation
- Keyboard Shortcuts — Space (play/pause), Arrow keys (step), Home/End (skip)
- Export — save labeled segments as JSON
The studio includes an agentic voice control system:
- Click the Voice Control button (or use Chrome/Edge with microphone)
- Speak a command like "show me a horse stance" or "do a crescent kick"
- The Web Speech API transcribes your voice
- The transcript is sent to Llama 3.3 70B (via Groq API) with the list of available techniques
- The LLM returns a structured JSON action (goto_move, play, stop)
- The studio navigates to the matching 3D animation and starts playback
This requires a Groq API key — the studio will prompt you on first use.
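Validating the LLM's reply defensively matters here, since a malformed model response should never crash playback. The sketch below shows one way to parse the structured action; the exact JSON schema ("action" and "move" keys) is an assumption for illustration, not the studio's documented format.

```python
import json

VALID_ACTIONS = {"goto_move", "play", "stop"}

def parse_voice_action(llm_reply: str) -> dict:
    """Parse the LLM's JSON reply into a safe action dict.

    Falls back to a harmless "stop" on malformed JSON or unknown actions.
    """
    try:
        action = json.loads(llm_reply)
    except json.JSONDecodeError:
        return {"action": "stop"}
    if not isinstance(action, dict) or action.get("action") not in VALID_ACTIONS:
        return {"action": "stop"}
    return action

# e.g. the reply for "show me a horse stance" might look like:
reply = '{"action": "goto_move", "move": "kiba-dachi horse riding stance"}'
print(parse_voice_action(reply)["action"])  # goto_move
```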
The llm_pose_controller.py script enables programmatic control of 3D pose transitions:
command = {
"start_pose": "yoi ready stance",
"end_pose": "kiba-dachi horse riding stance",
"air_time": 2.0, # transition duration in seconds
"rotation": 15 # Y-axis rotation in degrees
}

This will:
- Look up both poses in the pose library (averaged 3D vertices per technique)
- Generate N interpolated frames between start and end
- Apply optional Y-axis rotation with ease-in/out
- Render front + side view video using the SAM 3D Body renderer
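The interpolation step above (without the rotation) can be sketched as eased linear blending between two vertex arrays. This is a simplified illustration assuming poses are stored as (V, 3) arrays, not the script's actual implementation.

```python
import numpy as np

def ease_in_out(t: np.ndarray) -> np.ndarray:
    """Smoothstep easing: 0 -> 0, 1 -> 1, with zero velocity at both ends."""
    return t * t * (3.0 - 2.0 * t)

def interpolate_poses(start: np.ndarray, end: np.ndarray, n_frames: int) -> np.ndarray:
    """Blend two (V, 3) vertex arrays into n_frames eased in-between poses."""
    t = ease_in_out(np.linspace(0.0, 1.0, n_frames))          # eased blend weights
    return start[None] + t[:, None, None] * (end - start)[None]

start = np.zeros((18439, 3), dtype=np.float32)   # e.g. averaged "yoi" stance
end = np.ones((18439, 3), dtype=np.float32)      # e.g. averaged "kiba-dachi"
frames = interpolate_poses(start, end, n_frames=60)
print(frames.shape)  # (60, 18439, 3)
```

With air_time = 2.0 seconds at 30 fps, n_frames would be 60; the first frame equals the start pose and the last equals the end pose.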
After running the full pipeline:
sam-3d-body/
├── karate_frames/ # Extracted video frames
│ └── frame_000000.jpg ...
├── karate_output/
│ ├── visualized_frames/ # 4-panel visualizations per frame
│ │ └── frame_000000.jpg ... # (original | skeleton | mesh front | mesh side)
│ ├── mesh_data/ # Per-frame 3D data (.npz)
│ │ └── frame_000000.npz ... # (vertices, joints, pose params, camera)
│ ├── web/ # Binary data for Three.js viewer
│ │ ├── faces.bin # Mesh face topology (int32)
│ │ ├── frame_*.bin # Per-frame vertices (float32)
│ │ └── manifest.json # Frame metadata index
│ ├── pose_timeline.csv # Frame → technique → 3D joint positions
│ ├── pose_library.npz # Averaged 3D poses per technique
│ ├── pose_library_index.json # Human-readable technique index
│ └── heian_sandan_3d.mp4 # Compiled output video
└── karate_transcript.json # Technique labels with timestamps
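The web binaries can be read back with NumPy. The dtypes (int32 faces, float32 vertices) come from the listing above; the flat (N, 3) layout is an assumption here, so check manifest.json for the authoritative frame metadata.

```python
import numpy as np

# Write sample binaries in the documented dtypes (layout assumed:
# faces.bin = flat int32 triangle indices, frame_*.bin = flat float32 xyz)
faces = np.array([[0, 1, 2], [2, 1, 3]], dtype=np.int32)
verts = np.random.rand(4, 3).astype(np.float32)
faces.tofile("faces.bin")
verts.tofile("frame_000000.bin")

# Read them back, restoring the (N, 3) shape
faces_rt = np.fromfile("faces.bin", dtype=np.int32).reshape(-1, 3)
verts_rt = np.fromfile("frame_000000.bin", dtype=np.float32).reshape(-1, 3)
print(faces_rt.shape, verts_rt.shape)  # (2, 3) (4, 3)
```

Storing raw typed arrays (rather than JSON) keeps per-frame payloads small and lets the Three.js viewer feed them straight into GPU buffers.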
- GPU: NVIDIA RTX 3060 (12GB VRAM) or better
- RAM: 16GB system RAM
- Storage: 20GB free space (models + processed data)
- OS: Windows 10/11 (64-bit)
- GPU: NVIDIA RTX 4090 (24GB VRAM) or RTX 3090 (24GB VRAM)
- RAM: 32GB system RAM
- Storage: 50GB+ free space (SSD recommended)
- OS: Windows 11 (64-bit)
| Mode | VRAM | Speed (RTX 4090) | Quality |
|---|---|---|---|
| Body-only, DINOv3-H+ | ~14 GB | ~0.8s/frame | Excellent |
| Full (body+hands), DINOv3-H+ | ~16 GB | ~1.5s/frame | Best |
| Full + SAM2 mask, DINOv3-H+ | ~20 GB | ~2.5s/frame | Maximum |
Note: Model checkpoints (~8GB total) are not included in this repository. You must download them separately using download_models.py after requesting access on HuggingFace.
This project is built on SAM 3D Body by Meta Superintelligence Labs.
SAM 3D Body (3DB) is a promptable model for single-image full-body 3D human mesh recovery (HMR). It uses an encoder-decoder architecture with a DINOv3-H+ backbone, supports auxiliary prompts (2D keypoints, masks), and is trained on high-quality annotations derived from multi-view geometry and differentiable optimization.
| Backbone (size) | 3DPW (MPJPE) | EMDB (MPJPE) | RICH (PVE) | COCO (PCK@.05) | LSPET (PCK@.05) | Freihand (PA-MPJPE) |
|---|---|---|---|---|---|---|
| DINOv3-H+ (840M) (config, checkpoint) | 54.8 | 61.7 | 60.3 | 86.5 | 68.0 | 5.5 |
| ViT-H (631M) (config, checkpoint) | 54.8 | 62.9 | 61.7 | 86.8 | 68.9 | 5.5 |
import cv2
import numpy as np
from notebook.utils import setup_sam_3d_body
from tools.vis_utils import visualize_sample_together
estimator = setup_sam_3d_body(hf_repo_id="facebook/sam-3d-body-dinov3")
img_bgr = cv2.imread("path/to/image.jpg")
outputs = estimator.process_one_image(cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB))
rend_img = visualize_sample_together(img_bgr, outputs, estimator.faces)
cv2.imwrite("output.jpg", rend_img.astype(np.uint8))

For the complete upstream demo, see notebook/demo_human.ipynb.
sam-3d-body/
├── sam_3d_body/ # Core SAM 3D Body model (Meta upstream)
│ ├── models/ # Encoder, decoder, heads
│ ├── visualization/ # Renderer, skeleton visualizer
│ ├── data/ # Data loading utilities
│ ├── metadata/ # MHR70 joint definitions
│ └── utils/ # Model utilities
├── tools/ # Detector, FOV estimator, segmentor builders
│ ├── build_detector.py # YOLOv8 / ViTDet human detector
│ ├── build_fov_estimator.py # MoGe2 field-of-view estimator
│ ├── build_sam.py # SAM2 human segmentor
│ └── vis_utils.py # Visualization helpers
├── notebook/ # Jupyter demo notebook
├── data/ # Dataset download scripts (upstream)
├── karate_pose_pipeline.py # Main video → 3D mesh pipeline
├── llm_pose_controller.py # LLM-driven pose interpolation
├── avatar_studio.html # Interactive 3D web studio
├── export_web_data.py # Mesh → binary for Three.js
├── download_karate_video.py # YouTube video downloader
├── download_models.py # Model checkpoint downloader
├── karate_transcript.json # Karate technique labels
├── .env.example # API key template
├── INSTALL.md # Dependency installation guide
├── KARATE_PIPELINE_README.md # Detailed pipeline documentation
└── LICENSE # SAM License
The SAM 3D Body model checkpoints and code are licensed under SAM License.
See contributing and the code of conduct.
If you use SAM 3D Body or the SAM 3D Body dataset in your research, please use the following BibTeX entry.
@article{yang2026sam3dbody,
title={SAM 3D Body: Robust Full-Body Human Mesh Recovery},
author={Yang, Xitong and Kukreja, Devansh and Pinkus, Don and Sagar, Anushka and Fan, Taosha and Park, Jinhyung and Shin, Soyong and Cao, Jinkun and Liu, Jiawei and Ugrinovic, Nicolas and Feiszli, Matt and Malik, Jitendra and Dollar, Piotr and Kitani, Kris},
journal={arXiv preprint arXiv:2602.15989},
year={2026}
}

