Training data: PsiBotAI/SynData for video SDH captioning model + home assistant visual understanding #42

Open

opened 2026-05-31 09:08:58 -07:00 by pyr0ball · 0 comments

pyr0ball commented

2026-05-31 09:08:58 -07:00

Owner

Dataset

HuggingFace: PsiBotAI/SynData
License: CC-BY-4.0
Size: 449k egocentric video clips
Modalities: RGB video, depth, IMU, camera intrinsics
Schema: clip_id, task_key, task_name, caption (0-749 chars), start_idx, end_idx, num_frames, fps, duration_sec
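The schema above can be mirrored as a typed record for sanity-checking metadata rows before any video is downloaded. This is a minimal sketch: the field names come from the schema line, but the Python types, the `ClipRecord` and `check_record` names, and the assumption that `duration_sec` is roughly `num_frames / fps` are all illustrative and not confirmed by the dataset card.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ClipRecord:
    """One metadata row; field names follow the schema, types are assumed."""
    clip_id: str
    task_key: str
    task_name: str
    caption: str
    start_idx: int
    end_idx: int
    num_frames: int
    fps: float
    duration_sec: float


def check_record(rec: ClipRecord, tol_sec: float = 0.05) -> list[str]:
    """Return a list of problems found in one row; empty means it looks sane."""
    problems = []
    # The schema allows 0-length captions, so empty ones are worth flagging.
    if not rec.caption.strip():
        problems.append("empty caption")
    if len(rec.caption) > 749:
        problems.append("caption longer than 749 chars")
    if rec.fps <= 0:
        problems.append("non-positive fps")
    # Assumption: duration_sec is approximately num_frames / fps.
    elif abs(rec.num_frames / rec.fps - rec.duration_sec) > tol_sec:
        problems.append("duration_sec disagrees with num_frames / fps")
    if rec.end_idx < rec.start_idx:
        problems.append("end_idx before start_idx")
    return problems
```

Rows that fail a check are candidates for exclusion or manual inspection, not proof of a dataset bug, since the assumed relationships may not hold.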

What it is

Egocentric (first-person) video of hand-object manipulation tasks: sorting, rearranging, relocating items. Each clip has a structured natural-language caption that describes the action sequence in temporal order.

Example caption: "The task involves relocating packaged items on a table: the right hand first moves a pink packaged item from the table right side to the center, then the left hand moves it to the left side."

Use case 1: SDH captioning model

Most training data for SDH (subtitles for the deaf and hard of hearing) is speech-to-text. This dataset instead provides a training signal for visual event description, pairing video with temporal action narration. The caption format (sequential, descriptive, temporally ordered) matches what good SDH output looks like. Event description is the harder half of the SDH problem, and training data for it is harder to source elsewhere.

Use case 2: Home assistant visual understanding

The egocentric perspective matches a wearable-mounted or robot-mounted camera. A FOSS/OSH home assistant product (see companion roadmap issue) needs to understand what the user is doing in the home. Egocentric manipulation captions are a strong foundation for this visual grounding layer.

The dataset spans 107 unique task types and 167 unique task names across its 449k clips.

Next steps

- [ ] Sample 100 clips and assess caption quality and consistency
- [ ] Identify overlap with household task vocabulary (loading, sorting, fetching, carrying)
- [ ] Evaluate as fine-tune data for a vision-language model (LLaVA, InternVL) for SDH
- [ ] Coordinate with home assistant product design on visual grounding requirements
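The first two steps could start from the metadata alone. The sketch below assumes each row is a plain dict with at least `task_name` and `caption` keys; that layout, the helper names, and the verb stems in `HOUSEHOLD_VERBS` are illustrative choices, not something the dataset defines.

```python
import random
from collections import Counter

# Hypothetical vocabulary: stems of the verbs named in the steps above.
HOUSEHOLD_VERBS = ("load", "sort", "fetch", "carry")


def sample_rows(rows, k=100, seed=0):
    """Draw up to k rows reproducibly for manual caption review."""
    rows = list(rows)
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


def caption_stats(rows):
    """Summarise caption lengths and count empty captions."""
    lengths = [len(r["caption"]) for r in rows]
    return {
        "n": len(rows),
        "empty": sum(1 for n in lengths if n == 0),
        "mean_len": sum(lengths) / len(lengths) if lengths else 0.0,
        "max_len": max(lengths, default=0),
    }


def verb_overlap(rows, verbs=HOUSEHOLD_VERBS):
    """Count rows whose task_name or caption contains each verb stem.

    Plain substring matching: crude, so "load" also matches "unload".
    """
    counts = Counter({v: 0 for v in verbs})
    for r in rows:
        text = f'{r["task_name"]} {r["caption"]}'.lower()
        for v in verbs:
            if v in text:
                counts[v] += 1
    return dict(counts)
```

A fixed seed keeps the 100-clip review set stable across reviewers. The substring counts are only a first pass; a stem with a count of zero suggests that part of the household vocabulary is missing from the captions, but this still needs a manual check.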
