Training data: PsiBotAI/SynData for video SDH captioning model + home assistant visual understanding #42
Reference: Circuit-Forge/roadmap#42
Dataset
HuggingFace: PsiBotAI/SynData
License: CC-BY-4.0
Size: 449k egocentric video clips
Modalities: RGB video, depth, IMU, camera intrinsics
Schema:
clip_id, task_key, task_name, caption (0-749 chars), start_idx, end_idx, num_frames, fps, duration_sec
What it is
Egocentric (first-person) video of hand-object manipulation tasks: sorting, rearranging, relocating items. Each clip has a structured natural-language caption that describes the action sequence in temporal order.
Example caption: "The task involves relocating packaged items on a table: the right hand first moves a pink packaged item from the table right side to the center, then the left hand moves it to the left side."
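A minimal sketch of how one metadata row could be represented and validated in Python, assuming the metadata is available as CSV with the column names listed in the schema. The field types, the `ClipRecord` and `parse_row` names, and the sample row are illustrative assumptions, not taken from the dataset itself.

```python
import csv
import io
from dataclasses import dataclass

# Upper bound on caption length, taken from the schema note "caption (0-749 chars)".
MAX_CAPTION_CHARS = 749


@dataclass(frozen=True)
class ClipRecord:
    """One clip's metadata. Field types are assumptions inferred from the column names."""
    clip_id: str
    task_key: str
    task_name: str
    caption: str
    start_idx: int
    end_idx: int
    num_frames: int
    fps: float
    duration_sec: float


def parse_row(row: dict) -> ClipRecord:
    """Convert a CSV row (all strings) into a typed ClipRecord."""
    caption = row["caption"]
    if len(caption) > MAX_CAPTION_CHARS:
        raise ValueError(f"caption too long for clip {row['clip_id']}")
    return ClipRecord(
        clip_id=row["clip_id"],
        task_key=row["task_key"],
        task_name=row["task_name"],
        caption=caption,
        start_idx=int(row["start_idx"]),
        end_idx=int(row["end_idx"]),
        num_frames=int(row["num_frames"]),
        fps=float(row["fps"]),
        duration_sec=float(row["duration_sec"]),
    )


def load_records(text: str) -> list:
    """Parse CSV text with a header line into a list of ClipRecord objects."""
    return [parse_row(r) for r in csv.DictReader(io.StringIO(text))]


# Hypothetical sample row for illustration only; the values are made up.
SAMPLE = (
    "clip_id,task_key,task_name,caption,start_idx,end_idx,num_frames,fps,duration_sec\n"
    'demo_0001,relocate_items,Relocate packaged items,'
    '"The right hand moves a pink packaged item to the center.",0,149,150,30,5.0\n'
)

records = load_records(SAMPLE)
```

Validating types and caption length at load time keeps malformed rows out of the training pipeline before any video is decoded.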
Use case 1: SDH captioning model
Most training data for SDH (subtitles for the deaf and hard of hearing) is speech-to-text. This dataset supplies the other kind of training signal, visual event description, by pairing video with temporal action narration. The caption format (sequential, descriptive, temporally ordered) matches what good SDH output looks like. Visual event description is the harder half of the SDH problem, and training data for it is harder to source elsewhere.
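As a rough illustration of how a clip caption could be turned into a subtitle target, the sketch below wraps a caption in a single WebVTT cue that spans the clip. Treating the whole caption as one bracketed cue is an assumption made for illustration; a real pipeline would likely split long captions into shorter timed cues. The function names are hypothetical.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def caption_to_vtt(caption: str, duration_sec: float) -> str:
    """Build a one-cue WebVTT document covering the full clip.

    Assumption: the caption is shown as a single non-speech description,
    bracketed in the way SDH subtitles commonly mark non-speech events.
    """
    cue_text = f"[{caption.strip()}]"
    start = format_timestamp(0.0)
    end = format_timestamp(duration_sec)
    return f"WEBVTT\n\n{start} --> {end}\n{cue_text}\n"


# Hypothetical caption and duration, for illustration only.
vtt = caption_to_vtt("The right hand moves a pink packaged item to the center.", 5.0)
```

Pairs of this kind (video in, timed cue text out) are one possible supervised format for the captioning model; the actual training format is not specified in this issue.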
Use case 2: Home assistant visual understanding
The egocentric perspective matches a wearable-mounted or robot-mounted camera. A FOSS/OSH home assistant product (see companion roadmap issue) needs to understand what the user is doing in the home. Egocentric manipulation captions are a strong foundation for this visual grounding layer.
107 unique task types, 167 unique task names across 449k clips.
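Counts like these can be reproduced from the metadata with a short script. The sketch below assumes that "task types" corresponds to the `task_key` column and "task names" to `task_name`; that mapping is an assumption, and the sample rows are made up for illustration.

```python
from collections import defaultdict


def task_stats(rows):
    """Return (unique task_key count, unique task_name count, names grouped by key)."""
    names_by_key = defaultdict(set)
    for row in rows:
        names_by_key[row["task_key"]].add(row["task_name"])
    all_names = set()
    for names in names_by_key.values():
        all_names |= names
    return len(names_by_key), len(all_names), dict(names_by_key)


# Hypothetical rows; real rows would come from the dataset metadata.
sample_rows = [
    {"task_key": "sort", "task_name": "Sort items by color"},
    {"task_key": "sort", "task_name": "Sort items by size"},
    {"task_key": "relocate", "task_name": "Relocate packaged items"},
]

num_keys, num_names, grouped = task_stats(sample_rows)
```

Grouping names under each key also shows which task types have several name variants, which would explain why there are more task names than task types.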
Next steps