
nodeblackbox/Karate-AI-Agent


🥋 Karate 3D Avatar Studio

Transform any video into an interactive 3D avatar with motion capture, animation labeling, and AI-powered controls.

This project provides a complete pipeline for converting videos (local files or YouTube) into interactive 3D human animations. Built on Meta's SAM 3D Body model, it extracts full-body 3D meshes from video, enables frame-by-frame labeling, and presents everything in a professional web-based 3D studio with voice control.

(Screenshot: Karate 3D Avatar Studio)

✨ What Makes This Special

  • 🎥 Video to 3D Pipeline — Upload any video or paste a YouTube URL, get back a fully rigged 3D avatar
  • 🌐 Professional Web Studio — Three.js-powered viewer with timeline, playback controls, and segment editor
  • 🎙️ Voice Control — Say "show me a kick" and AI navigates to the matching animation
  • 📹 YouTube Integration — Download and process videos directly from YouTube URLs
  • 🎬 Multiple Jobs — Process multiple videos simultaneously, switch between them instantly
  • 💾 Export Ready — Export labeled segments as JSON, animation data as binary for web/game engines
  • 🤖 AI-Powered — Uses 5 different AI models working together seamlessly



🎯 Key Features

Video Processing

  • 📁 Local Upload — Drag & drop MP4, AVI, MOV, MKV, WebM files (up to 2GB)
  • 📺 YouTube Download — Paste any YouTube URL, fetch metadata preview, download & process automatically
  • 🔄 Multiple Jobs — Process multiple videos in parallel, switch between them via dropdown
  • ⚙️ Configurable — Adjust frame skip (1-10) and inference mode (body-only or full body+hands)
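The frame-skip setting trades quality for speed by deciding which frames reach the expensive 3D model. A minimal sketch of the selection logic (the function name is illustrative, not from the repo):

```python
def frames_to_process(total_frames: int, frame_skip: int) -> list[int]:
    """Indices kept when every `frame_skip`-th frame is processed."""
    if frame_skip < 1:
        raise ValueError("frame_skip must be >= 1")
    return list(range(0, total_frames, frame_skip))

# A 30 fps, 10-second clip (300 frames) at the studio default of 6:
kept = frames_to_process(300, 6)
print(len(kept))  # 50 frames sent to the 3D model instead of 300
```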

(Screenshot: YouTube Download Modal)

3D Avatar Studio

  • 🎨 Real-time 3D Viewer — Orbit controls, PBR lighting, wireframe mode, color presets
  • 🎬 Video Sync — Source video plays side-by-side with 3D mesh, frame-perfect sync
  • 📊 Timeline Scrubber — Color-coded technique segments, playback speed (0.25x–4x)
  • ✏️ Segment Editor — Label frame ranges with names and categories (stance, kick, block, etc.)
  • 🎙️ Voice Control — Speak commands like "show me a horse stance" powered by Llama 3.3 70B
  • ⌨️ Keyboard Shortcuts — Space (play/pause), arrows (step), Home/End (skip to start/end)

(Screenshot: YouTube Tab)

Technical Pipeline

  1. Video Input — Upload file or download from YouTube with yt-dlp
  2. Frame Extraction — Extract frames at configurable intervals (OpenCV)
  3. Human Detection — YOLOv8 or ViTDet finds person bounding boxes
  4. Camera Estimation — MoGe2 predicts field-of-view and intrinsics
  5. 3D Mesh Recovery — SAM 3D Body extracts 18,439 vertices + 70 joints per frame
  6. Segmentation (optional) — SAM2 provides high-quality human masks
  7. Web Export — Binary mesh data for Three.js real-time rendering
  8. Interactive Studio — Web-based viewer with labeling, voice control, and export

Architecture Overview

```
YouTube Video (Karate Heian Sandan)
    │
    ▼
┌─────────────────────────────────────────────────────────┐
│  FRAME EXTRACTION                                       │
│  download_karate_video.py → karate_pose_pipeline.py     │
│  OpenCV: extract frames, resize, label with transcript  │
└─────────────────────┬───────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────┐
│  3D MESH RECOVERY (per frame)                           │
│                                                         │
│  ┌──────────────┐   ┌──────────────┐   ┌────────────┐   │
│  │ YOLOv8 /     │   │ MoGe2 FOV    │   │ SAM2       │   │
│  │ ViTDet       │──▶│ Estimator    │──▶│ Segmentor  │   │
│  │ (Detector)   │   │ (Camera K)   │   │ (optional) │   │
│  └──────┬───────┘   └──────┬───────┘   └─────┬──────┘   │
│         │                  │                 │          │
│         ▼                  ▼                 ▼          │
│  ┌─────────────────────────────────────────────────┐    │
│  │         SAM 3D Body (DINOv3-H+ 840M)            │    │
│  │                                                 │    │
│  │  Image Patches → DINOv3 Backbone → Embeddings   │    │
│  │  CameraEncoder → Ray-Conditioned Features       │    │
│  │  PromptableDecoder (N layers) → Pose Tokens     │    │
│  │  MHR Head → 18,439 vertices + 70 joints         │    │
│  └──────────────────────┬──────────────────────────┘    │
│                         │                               │
└─────────────────────────┼───────────────────────────────┘
                          │
        ┌─────────────────┼─────────────────────┐
        ▼                 ▼                     ▼
 ┌───────────┐    ┌──────────────┐     ┌──────────────┐
 │ .npz per  │    │ Visualization│     │ Pose Library │
 │ frame     │    │ JPEG panels  │     │ (averaged    │
 │ (vertices,│    │ (original +  │     │ 3D poses per │
 │  joints,  │    │  skeleton +  │     │ technique)   │
 │  params)  │    │  mesh views) │     │              │
 └──────┬────┘    └──────┬───────┘     └──────┬───────┘
        │                │                    │
        ▼                ▼                    ▼
 ┌────────────────────────────────────────────────────────┐
 │  WEB EXPORT & AVATAR STUDIO                            │
 │                                                        │
 │  export_web_data.py → faces.bin + frame_*.bin          │
 │                                                        │
 │  avatar_studio.html (Three.js)                         │
 │  ├── 3D Mesh Viewer with orbit controls                │
 │  ├── Video player synced to 3D frames                  │
 │  ├── Timeline scrubber with colored technique segments │
 │  ├── Segment labeling / editing UI                     │
 │  ├── Voice Control Agent (Web Speech API → Groq LLM)   │
 │  └── Export (JSON labels, animation data)              │
 │                                                        │
 │  llm_pose_controller.py                                │
 │  └── JSON command → interpolated 3D transition video   │
 └────────────────────────────────────────────────────────┘
```

Models Used

This project orchestrates 5 different AI models in a single pipeline:

| # | Model | Role | Size | Source |
|---|-------|------|------|--------|
| 1 | SAM 3D Body (DINOv3-H+) | Core 3D human mesh recovery from single images | 840M params (~3.5 GB) | HuggingFace |
| 2 | ViTDet (Cascade Mask R-CNN) | Human detection — finds person bounding boxes in frames | ~2.5 GB | Detectron2 |
| 3 | MoGe2 | Field-of-view estimation — predicts camera intrinsics K | ~1.2 GB | HuggingFace |
| 4 | SAM2.1 (Hiera-Large) | Human segmentation masks (optional, highest quality) | ~900 MB | Meta |
| 5 | Llama 3.3 70B | Voice command intent recognition (via Groq API) | Cloud API | Groq |

Additional: YOLOv8n is included as a lightweight alternative detector.

Momentum Human Rig (MHR)

The output mesh uses Meta's MHR parametric body model, which produces:

  • 18,439 mesh vertices per person per frame
  • 70 3D joints (MHR70 skeleton — body, hands, feet)
  • Body pose parameters + shape parameters
  • Camera translation and focal length
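The pipeline stores one `.npz` per frame with these arrays. A small round-trip sketch of what reading one back might look like (the key names `vertices`/`joints` are assumptions based on the README's informal description):

```python
import numpy as np
import os
import tempfile

# Stand-in per-frame arrays with the shapes listed above.
verts = np.zeros((18439, 3), dtype=np.float32)   # MHR mesh vertices
joints = np.zeros((70, 3), dtype=np.float32)     # MHR70 skeleton joints

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "frame_000000.npz")
    np.savez_compressed(path, vertices=verts, joints=joints)
    with np.load(path) as data:
        shapes = {k: data[k].shape for k in data.files}

print(shapes)  # {'vertices': (18439, 3), 'joints': (70, 3)}
```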

🚀 Quick Start (Windows)

Prerequisites

1️⃣ Clone and Setup Environment

```shell
# Clone the repository
git clone https://github.com/yourusername/sam-3d-body.git
cd sam-3d-body

# Create Python virtual environment
python -m venv venv
venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
pip install yt-dlp imageio-ffmpeg  # For YouTube downloads
```

2️⃣ Download Model Checkpoints

```shell
# Login to HuggingFace (you need to request access to SAM 3D Body first)
huggingface-cli login

# Download all models (~8GB total)
python download_models.py
```

Note: You must request access to the SAM 3D Body model on HuggingFace before downloading.

3️⃣ Launch the Studio Server

```shell
# Start the web server (handles uploads, processing, YouTube downloads)
python studio_server.py --port 8765
```

Or simply double-click `run_studio.bat`.

4️⃣ Open the Studio

Navigate to http://localhost:8765 in your browser (Chrome or Edge recommended).

5️⃣ Process Your First Video

Option A: Upload a Local Video

  1. Click 📹 Upload Video
  2. Drag & drop a video file or click to browse
  3. Configure settings (frame skip: 6, mode: full)
  4. Click Process
  5. Wait for processing to complete (~1-2 seconds per frame on RTX 4090)

Option B: Download from YouTube

  1. Click 📹 Upload Video, then switch to the 📺 YouTube URL tab
  2. Paste a YouTube URL (e.g., karate tutorial, dance video, sports clip)
  3. Click Fetch Info to preview
  4. Click Download & Process
  5. The video downloads and processes automatically

6️⃣ Explore the 3D Avatar

  • Play/Pause — Space bar or ▶ button
  • Scrub Timeline — Click anywhere on the timeline
  • Label Segments — Select frame range, add name/category, click Save
  • Voice Control — Click 🎙️ Voice Control, say "show me a kick"
  • Export — Click 💾 Export JSON to save labeled segments

💡 Use Cases

This project can be used for:

Sports & Fitness

  • Martial Arts Training — Analyze karate, taekwondo, kung fu techniques frame-by-frame
  • Dance Choreography — Extract 3D poses from dance videos, create move libraries
  • Yoga & Exercise — Document poses, create interactive tutorials
  • Sports Analysis — Study golf swings, tennis serves, basketball shots in 3D

Animation & Game Development

  • Motion Capture — Convert video to 3D animation data without expensive mocap equipment
  • Character Animation — Extract realistic human movements for game characters
  • Reference Library — Build a searchable database of 3D poses and movements
  • Animation Prototyping — Quickly test movement ideas from video reference

Education & Research

  • Biomechanics — Study human movement patterns, joint angles, body mechanics
  • Physical Therapy — Document patient movements, track rehabilitation progress
  • Ergonomics — Analyze workplace movements, optimize body positions
  • Academic Research — Dataset creation for computer vision, HMR research

Content Creation

  • Video Production — Create 3D visualizations from 2D video footage
  • Social Media — Generate unique 3D content from viral videos
  • Tutorials — Make interactive 3D guides from instructional videos
  • Art Projects — Use 3D human meshes as creative material

Pipeline Scripts

| Script | Description |
|--------|-------------|
| `download_karate_video.py` | Downloads the karate Heian Sandan video from YouTube (with cookie auth support) |
| `download_models.py` | Downloads all model checkpoints from HuggingFace (SAM 3D Body, ViTDet, MoGe2, SAM2) |
| `karate_pose_pipeline.py` | Main pipeline — extracts frames, labels with transcript, runs SAM 3D Body, builds pose library |
| `process_karate_video.py` | Alternative video processing script (no transcript labeling, simpler) |
| `export_web_data.py` | Exports mesh data to binary format for the Three.js web viewer |
| `render_avatar_video.py` | Renders a standalone avatar-only video from the mesh data |
| `llm_pose_controller.py` | Accepts JSON pose commands and generates interpolated transition videos |
| `demo.py` | Original SAM 3D Body demo — single-image inference |
| `run_dancing.py` | Quick test script on a sample image |

🔧 Troubleshooting (Windows)

YouTube Download Issues

Problem: "Download produced no file" or "yt-dlp not found"

Solution:

```powershell
# Install yt-dlp and ffmpeg
pip install yt-dlp imageio-ffmpeg

# Add Python Scripts to PATH (if yt-dlp command not found)
$userScripts = "$env:APPDATA\Python\Python311\Scripts"
[Environment]::SetEnvironmentVariable("Path", "$userScripts;" + [Environment]::GetEnvironmentVariable("Path", "User"), "User")

# Restart your terminal/PowerShell
```

CUDA / GPU Issues

Problem: "CUDA out of memory" or "RuntimeError: No CUDA GPUs available"

Solutions:

  • Increase frame skip: Use --frame_skip 10 (process every 10th frame)
  • Use body-only mode: Set inference type to "body" instead of "full"
  • Close other GPU applications (Chrome, games, etc.)
  • Check CUDA installation: nvidia-smi should show your GPU

Problem: Models run on CPU (very slow)

Solution:

```shell
# Verify PyTorch sees your GPU
python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"

# If False, reinstall PyTorch with CUDA
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

Model Download Issues

Problem: "401 Unauthorized" when downloading models

Solution:

  1. Go to https://huggingface.co/facebook/sam-3d-body-dinov3
  2. Click "Request Access" and wait for approval (usually instant)
  3. Run huggingface-cli login and enter your token
  4. Retry python download_models.py

Problem: Download interrupted or corrupted

Solution:

```powershell
# Clear cache and re-download
Remove-Item -Recurse -Force "$env:USERPROFILE\.cache\huggingface\hub"
python download_models.py
```

Server Won't Start

Problem: "Address already in use" or port 8765 occupied

Solution:

```powershell
# Find and kill the process using port 8765
# (CommandLine is only populated in PowerShell 7+; on Windows PowerShell 5.1,
#  use Get-NetTCPConnection -LocalPort 8765 to find the owning process instead)
Get-Process -Name python | Where-Object {$_.CommandLine -like '*studio_server*'} | Stop-Process -Force

# Or use a different port
python studio_server.py --port 8080
```

Browser Issues

Problem: Voice control doesn't work

Solution:

  • Use Chrome or Edge (Firefox has limited Web Speech API support)
  • Allow microphone permissions when prompted
  • Get a free Groq API key: https://console.groq.com/
  • The studio will prompt for the key on first use

Problem: 3D viewer is black or not loading

Solution:

  • Enable hardware acceleration in browser settings
  • Update your GPU drivers
  • Try a different browser (Chrome recommended)
  • Check browser console (F12) for errors

Performance Tips

  • Frame Skip: Start with 6-10 for testing, use 1-2 for final quality
  • Inference Mode: "body" is 2x faster than "full" (body+hands)
  • Video Resolution: Downscale large videos to 1080p before processing
  • Batch Size: Process shorter clips (30-60 seconds) for faster iteration

3D Avatar Studio (Web UI)

The avatar_studio.html file is a full-featured web application built with Three.js that provides:

  • 3D Mesh Viewer — orbit controls, PBR lighting, wireframe toggle, color presets
  • Video Sync — source video synced frame-by-frame to the 3D mesh
  • Split View — side-by-side 3D avatar and video
  • Timeline Scrubber — color-coded technique segments, playback controls (0.25x–4x speed)
  • Segment Editor — label frame ranges with technique names and categories
  • Smooth Morphing — vertex interpolation between frames for fluid animation
  • Keyboard Shortcuts — Space (play/pause), Arrow keys (step), Home/End (skip)
  • Export — save labeled segments as JSON
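The exported segment JSON might look like the sketch below; the field names are assumptions (the README only specifies that a segment carries a frame range, a name, and a category):

```python
import json

# Hypothetical export schema; the studio's actual field names may differ.
segments = [
    {"name": "yoi ready stance", "category": "stance", "start_frame": 0,  "end_frame": 42},
    {"name": "crescent kick",    "category": "kick",   "start_frame": 43, "end_frame": 97},
]
exported = json.dumps(segments, indent=2)
print(len(json.loads(exported)))  # 2 labeled segments round-tripped
```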

Voice Control Agent

The studio includes an agentic voice control system:

  1. Click the Voice Control button (or use Chrome/Edge with microphone)
  2. Speak a command like "show me a horse stance" or "do a crescent kick"
  3. The Web Speech API transcribes your voice
  4. The transcript is sent to Llama 3.3 70B (via Groq API) with the list of available techniques
  5. The LLM returns a structured JSON action (goto_move, play, stop)
  6. The studio navigates to the matching 3D animation and starts playback

This requires a Groq API key — the studio will prompt you on first use.
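Groq exposes an OpenAI-compatible chat completions endpoint, so steps 4 and 5 amount to building a payload like the one below. This is a sketch only: the model name, prompt wording, and JSON schema are assumptions (the action names come from step 5 above), and the actual network call is omitted.

```python
def build_intent_request(transcript: str, techniques: list[str],
                         model: str = "llama-3.3-70b-versatile") -> dict:
    """Build a chat-completions payload asking the LLM to map a spoken
    command to one of the studio's structured actions."""
    system = (
        "You control a 3D avatar studio. Reply with JSON only, e.g. "
        '{"action": "goto_move", "move": "<technique or null>"}. '
        'Valid actions: "goto_move", "play", "stop". '
        "Known techniques: " + ", ".join(techniques) + "."
    )
    return {
        "model": model,
        # Ask for a JSON object so the reply parses as a structured action.
        "response_format": {"type": "json_object"},
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": transcript},
        ],
    }

payload = build_intent_request("show me a horse stance",
                               ["horse stance", "front kick", "down block"])
print(payload["messages"][1]["content"])  # show me a horse stance
```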


LLM Pose Controller

The llm_pose_controller.py script enables programmatic control of 3D pose transitions:

```python
command = {
    "start_pose": "yoi ready stance",
    "end_pose": "kiba-dachi horse riding stance",
    "air_time": 2.0,   # transition duration in seconds
    "rotation": 15,    # Y-axis rotation in degrees
}
```

This will:

  1. Look up both poses in the pose library (averaged 3D vertices per technique)
  2. Generate N interpolated frames between start and end
  3. Apply optional Y-axis rotation with ease-in/out
  4. Render front + side view video using the SAM 3D Body renderer
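Steps 2 and 3 can be sketched in NumPy: linear vertex blending driven by an eased parameter, plus a Y-axis rotation. Function and argument names here are illustrative, not the script's actual API.

```python
import numpy as np

def ease_in_out(t: float) -> float:
    # Smoothstep: eases in and out with zero velocity at both endpoints.
    return t * t * (3.0 - 2.0 * t)

def interpolate_poses(start_verts, end_verts, n_frames, rotation_deg=0.0):
    """Blend two (V, 3) vertex sets over n_frames with an eased Y rotation."""
    frames = []
    for i in range(n_frames):
        t = ease_in_out(i / (n_frames - 1)) if n_frames > 1 else 1.0
        verts = (1.0 - t) * start_verts + t * end_verts
        theta = np.radians(rotation_deg) * t
        c, s = np.cos(theta), np.sin(theta)
        rot_y = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
        frames.append(verts @ rot_y.T)
    return np.stack(frames)

out = interpolate_poses(np.zeros((4, 3)), np.ones((4, 3)), 3)
print(out[1])  # midpoint frame: all 0.5, since ease_in_out(0.5) == 0.5
```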

Output Structure

After running the full pipeline:

```
sam-3d-body/
├── karate_frames/                  # Extracted video frames
│   └── frame_000000.jpg ...
├── karate_output/
│   ├── visualized_frames/          # 4-panel visualizations per frame
│   │   └── frame_000000.jpg ...    # (original | skeleton | mesh front | mesh side)
│   ├── mesh_data/                  # Per-frame 3D data (.npz)
│   │   └── frame_000000.npz ...    # (vertices, joints, pose params, camera)
│   ├── web/                        # Binary data for Three.js viewer
│   │   ├── faces.bin               # Mesh face topology (int32)
│   │   ├── frame_*.bin             # Per-frame vertices (float32)
│   │   └── manifest.json           # Frame metadata index
│   ├── pose_timeline.csv           # Frame → technique → 3D joint positions
│   ├── pose_library.npz            # Averaged 3D poses per technique
│   ├── pose_library_index.json     # Human-readable technique index
│   └── heian_sandan_3d.mp4         # Compiled output video
└── karate_transcript.json          # Technique labels with timestamps
```
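The `web/` binaries are raw typed arrays for the Three.js viewer. A reader for that layout might look like this; it assumes flat row-major triples (int32 triangle indices, float32 xyz vertices) and the zero-padded frame naming shown above, which are inferences from the tree, not a documented spec:

```python
import numpy as np
import os
import tempfile

def load_web_mesh(web_dir: str, frame_idx: int):
    """Read faces.bin (int32 triangle indices) and frame_XXXXXX.bin
    (float32 xyz vertices) from the web export directory."""
    faces = np.fromfile(os.path.join(web_dir, "faces.bin"),
                        dtype=np.int32).reshape(-1, 3)
    verts = np.fromfile(os.path.join(web_dir, f"frame_{frame_idx:06d}.bin"),
                        dtype=np.float32).reshape(-1, 3)
    return verts, faces

# Round-trip a tiny two-triangle mesh to exercise the reader:
with tempfile.TemporaryDirectory() as d:
    np.array([[0, 1, 2], [1, 2, 3]], dtype=np.int32).tofile(os.path.join(d, "faces.bin"))
    np.zeros((4, 3), dtype=np.float32).tofile(os.path.join(d, "frame_000000.bin"))
    verts, faces = load_web_mesh(d, 0)

print(verts.shape, faces.shape)  # (4, 3) (2, 3)
```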

🖥️ System Requirements

Minimum

  • GPU: NVIDIA RTX 3060 (12GB VRAM) or better
  • RAM: 16GB system RAM
  • Storage: 20GB free space (models + processed data)
  • OS: Windows 10/11 (64-bit)

Recommended

  • GPU: NVIDIA RTX 4090 (24GB VRAM) or RTX 3090 (24GB VRAM)
  • RAM: 32GB system RAM
  • Storage: 50GB+ free space (SSD recommended)
  • OS: Windows 11 (64-bit)

Performance by Mode

| Mode | VRAM | Speed (RTX 4090) | Quality |
|------|------|------------------|---------|
| Body-only, DINOv3-H+ | ~14 GB | ~0.8 s/frame | Excellent |
| Full (body+hands), DINOv3-H+ | ~16 GB | ~1.5 s/frame | Best |
| Full + SAM2 mask, DINOv3-H+ | ~20 GB | ~2.5 s/frame | Maximum |

Note: Model checkpoints (~8GB total) are not included in this repository. You must download them separately using download_models.py after requesting access on HuggingFace.


SAM 3D Body (Upstream Model)

This project is built on SAM 3D Body by Meta Superintelligence Labs.

SAM 3D Body (3DB) is a promptable model for single-image full-body 3D human mesh recovery (HMR). It uses an encoder-decoder architecture with a DINOv3-H+ backbone and supports auxiliary prompts (2D keypoints, masks). Trained on high-quality annotations from multi-view geometry and differentiable optimization.

Checkpoints

| Backbone (size) | 3DPW (MPJPE) | EMDB (MPJPE) | RICH (PVE) | COCO (PCK@.05) | LSPET (PCK@.05) | FreiHand (PA-MPJPE) |
|-----------------|--------------|--------------|------------|----------------|-----------------|---------------------|
| DINOv3-H+ (840M) | 54.8 | 61.7 | 60.3 | 86.5 | 68.0 | 5.5 |
| ViT-H (631M) | 54.8 | 62.9 | 61.7 | 86.8 | 68.9 | 5.5 |

Quick Single-Image Demo

```python
import cv2
import numpy as np
from notebook.utils import setup_sam_3d_body
from tools.vis_utils import visualize_sample_together

estimator = setup_sam_3d_body(hf_repo_id="facebook/sam-3d-body-dinov3")
img_bgr = cv2.imread("path/to/image.jpg")
outputs = estimator.process_one_image(cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB))
rend_img = visualize_sample_together(img_bgr, outputs, estimator.faces)
cv2.imwrite("output.jpg", rend_img.astype(np.uint8))
```

For the complete upstream demo, see notebook/demo_human.ipynb.


Project Structure

```
sam-3d-body/
├── sam_3d_body/                    # Core SAM 3D Body model (Meta upstream)
│   ├── models/                     # Encoder, decoder, heads
│   ├── visualization/              # Renderer, skeleton visualizer
│   ├── data/                       # Data loading utilities
│   ├── metadata/                   # MHR70 joint definitions
│   └── utils/                      # Model utilities
├── tools/                          # Detector, FOV estimator, segmentor builders
│   ├── build_detector.py           # YOLOv8 / ViTDet human detector
│   ├── build_fov_estimator.py      # MoGe2 field-of-view estimator
│   ├── build_sam.py                # SAM2 human segmentor
│   └── vis_utils.py                # Visualization helpers
├── notebook/                       # Jupyter demo notebook
├── data/                           # Dataset download scripts (upstream)
├── karate_pose_pipeline.py         # Main video → 3D mesh pipeline
├── llm_pose_controller.py          # LLM-driven pose interpolation
├── avatar_studio.html              # Interactive 3D web studio
├── export_web_data.py              # Mesh → binary for Three.js
├── download_karate_video.py        # YouTube video downloader
├── download_models.py              # Model checkpoint downloader
├── karate_transcript.json          # Karate technique labels
├── .env.example                    # API key template
├── INSTALL.md                      # Dependency installation guide
├── KARATE_PIPELINE_README.md       # Detailed pipeline documentation
└── LICENSE                         # SAM License
```

License

The SAM 3D Body model checkpoints and code are licensed under SAM License.

Contributing

See contributing and the code of conduct.

Citing SAM 3D Body

If you use SAM 3D Body or the SAM 3D Body dataset in your research, please use the following BibTeX entry.

```bibtex
@article{yang2026sam3dbody,
  title={SAM 3D Body: Robust Full-Body Human Mesh Recovery},
  author={Yang, Xitong and Kukreja, Devansh and Pinkus, Don and Sagar, Anushka and Fan, Taosha and Park, Jinhyung and Shin, Soyong and Cao, Jinkun and Liu, Jiawei and Ugrinovic, Nicolas and Feiszli, Matt and Malik, Jitendra and Dollar, Piotr and Kitani, Kris},
  journal={arXiv preprint arXiv:2602.15989},
  year={2026}
}
```
