Training data: SWE-ZERO-12M-trajectories as mid-training corpus for local LLM agentic tool-use #41

Open · opened 2026-05-31 09:08:22 -07:00 by pyr0ball (Owner) · 0 comments

Dataset

HuggingFace: AlienKevin/SWE-ZERO-12M-trajectories
License: Apache 2.0
Size: 12.3M rollouts, 112B tokens, 122,908 unique PRs, 3,222 repos, 16 languages
Schema: instance_id, repo, messages (multi-turn, {role, content}), trajectory_format, exit_status, duration_sec
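A minimal sketch of a row check against the schema listed above. The field names come from the schema line; the Python types assigned to each field are assumptions, since the issue does not state them, and `validate_row` is a hypothetical helper name.

```python
# Field names are from the schema above; the expected Python types are assumptions.
EXPECTED_FIELDS = {
    "instance_id": str,
    "repo": str,
    "messages": list,
    "trajectory_format": str,
    "exit_status": str,
    "duration_sec": (int, float),
}


def validate_row(row):
    """Return a list of problems found in one row (an empty list means it looks fine)."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in row:
            problems.append(f"missing field: {name}")
        elif not isinstance(row[name], expected):
            problems.append(f"unexpected type for {name}: {type(row[name]).__name__}")

    # Each message is expected to be a {role, content} mapping.
    messages = row.get("messages")
    if isinstance(messages, list):
        for i, msg in enumerate(messages):
            if not isinstance(msg, dict) or "role" not in msg or "content" not in msg:
                problems.append(f"message {i} lacks role/content")
    return problems
```

Running this over a small sample before any filtering would surface schema drift early, for example rows where `messages` is serialized as a string instead of a list.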

What it is

The largest agentic coding trace dataset to date. Each row is a multi-turn agent-environment trajectory where an LLM navigates a real repo (bash, file reads, edits) attempting to resolve a GitHub PR. The stated purpose is mid-training to instill agentic tool-use priors in code models — not SFT.

Submission rate findings

Sampled at 6 offsets spanning the full dataset. Roughly 2-5% of sampled trajectories ended with exit_status=Submitted. Submitted trajectories tend to be short (4-8 messages) and simple (version bumps, single-file edits); complex multi-file refactors rarely reach submission. Submitted patches are not verified against tests, so Submitted does not imply a correct fix.
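The sampling described above could be reproduced with something like the following sketch. It assumes the rows are available as an indexable sequence of dicts (for example a list loaded from a local shard); `sample_at_offsets` and `submission_stats` are hypothetical helper names, and the offsets and window size are left to the caller because the issue does not record them.

```python
from statistics import median


def sample_at_offsets(rows, offsets, window):
    """Yield `window` consecutive rows starting at each offset.

    `rows` is assumed to be a sequence of row dicts that supports slicing.
    """
    for start in offsets:
        yield from rows[start:start + window]


def submission_stats(sample):
    """Summarize how often sampled trajectories end with exit_status == "Submitted"."""
    sample = list(sample)
    submitted = [r for r in sample if r["exit_status"] == "Submitted"]
    lengths = [len(r["messages"]) for r in submitted]
    return {
        "sampled": len(sample),
        "submitted": len(submitted),
        "rate": len(submitted) / len(sample) if sample else 0.0,
        "median_messages": median(lengths) if lengths else None,
    }
```

Reporting the sample size next to the rate makes it easier to judge how much weight the 2-5% range can bear.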

Value for CF

| Use case | Value |
|---|---|
| Mid-training a base coder model | High: all 12.3M rows teach tool navigation |
| SFT for agentic task execution | Medium: filter to Submitted + 4-8 message trajectories |
| CF product-specific fine-tune | Low unfiltered, High filtered + CF SFT on top |

Recommended pipeline: mid-train on full corpus, SFT on filtered Submitted/simple-PR slice, then SFT again on CF-specific task data. This produces a local model with strong tool-use priors before CF product data is ever shown.
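As a rough illustration of how rows would be routed under this pipeline, the sketch below assigns each SWE-ZERO row to the stages it would feed. The stage names and the `assign_stages` helper are hypothetical; the 4-8 message window is taken from the table above. The third stage (CF-specific SFT) does not appear because it draws on CF data, not on this corpus.

```python
def assign_stages(row, min_messages=4, max_messages=8):
    """Return the pipeline stages a single SWE-ZERO row would contribute to."""
    stages = ["mid_train"]  # stage 1: the full corpus is used for mid-training
    n_messages = len(row["messages"])
    if row["exit_status"] == "Submitted" and min_messages <= n_messages <= max_messages:
        stages.append("sft_filtered")  # stage 2: Submitted, short trajectories only
    return stages
```

Keeping the window as parameters makes it cheap to compare the 4-8 slice against looser cutoffs once real counts are available.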

Relevant products: all CF products using local LLM agentic execution — Peregrine research loop, Turnstone log navigation, cf-orch task agents, Magpie scraping pipelines, future home assistant product.

Next steps

  • Decide base model target (Qwen 2.5 Coder, DeepSeek-Coder-V2, etc.)
  • Build filtered subset: exit_status=Submitted, len(messages) <= 10, Python/JS repos only
  • Evaluate mid-training cost on Sif (RTX 5060 Ti 16GB)
  • Define CF-specific SFT dataset format for product tasks
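Two of the steps above lend themselves to a quick sketch: the filtered subset and a back-of-envelope cost estimate. Both functions are illustrative. `repo_language` is a hypothetical `{repo: language}` mapping, needed because the schema has no language field, and the throughput passed to `training_days` is a placeholder, not a measured figure for the RTX 5060 Ti.

```python
def keep_for_sft(row, repo_language, max_messages=10,
                 languages=("Python", "JavaScript")):
    """Filter from the next-steps list: Submitted, short, Python/JS repos only.

    `repo_language` is a hypothetical {repo: language} mapping that would
    have to be built separately, since rows carry no language field.
    """
    return (
        row["exit_status"] == "Submitted"
        and len(row["messages"]) <= max_messages
        and repo_language.get(row["repo"]) in languages
    )


def training_days(total_tokens, tokens_per_second, epochs=1.0):
    """Wall-clock days to process the corpus at a sustained throughput."""
    seconds = total_tokens * epochs / tokens_per_second
    return seconds / 86_400  # seconds per day


# 112B tokens is from the dataset card; 5_000 tokens/s is a placeholder value.
days = training_days(112e9, tokens_per_second=5_000)
```

With that placeholder throughput a single pass works out to roughly 259 days, which suggests measuring real tokens/s on Sif first and considering a token budget well below the full corpus.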
pyr0ball added the priority:backlog and status:concept labels 2026-05-31 09:08:22 -07:00