feat: anonymized repair engagement corpus pipeline for Avocet fine-tuning #38

Open
opened 2026-04-22 10:02:10 -07:00 by pyr0ball · 0 comments
Owner

Summary

Each AI code repair engagement produces high-quality before/after training data: broken AI-generated code in, clean guardrailed code out, with audit notes explaining what went wrong. Once anonymized, this data forms a fine-tuning corpus that should improve the quality of future repairs as it grows.

Data shape per engagement

Each engagement produces:

  • Input: broken codebase snapshot (or representative files) + AI tool context (which tool, what session went wrong)
  • Audit findings: structured notes on failure modes (hallucinated API, missing error handling, security hole, no tests, etc.)
  • Output: repaired code + CLAUDE.md + guardrail config
  • Repair notes: explanation of root cause and fix rationale
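
As a rough sketch, the four artifacts above could be carried as a single record. The class name `RepairEngagement` and all of its field names are assumptions made for illustration; nothing here reflects an existing Avocet schema.

```python
from dataclasses import dataclass, asdict


@dataclass
class RepairEngagement:
    """One engagement's before/after data (hypothetical schema)."""

    broken_files: dict[str, str]    # path -> broken source text (snapshot or representative files)
    ai_tool: str                    # which AI tool produced the code
    session_context: str            # what went wrong in the session
    audit_findings: list[str]       # structured notes on failure modes
    repaired_files: dict[str, str]  # path -> repaired source, CLAUDE.md, guardrail config
    repair_notes: str               # root cause and fix rationale

    def is_complete(self) -> bool:
        """True when every artifact needed for a training pair is present."""
        return bool(
            self.broken_files
            and self.audit_findings
            and self.repaired_files
            and self.repair_notes.strip()
        )


engagement = RepairEngagement(
    broken_files={"app.py": "import nonexistent_sdk"},
    ai_tool="example-tool",
    session_context="Model invented an SDK module",
    audit_findings=["hallucinated API"],
    repaired_files={"app.py": "import os"},
    repair_notes="Replaced the invented module with the standard library.",
)
record = asdict(engagement)  # plain dict, ready for JSON serialization
```

Keeping the record flat and serializable means the later anonymization and formatting steps can work on plain dictionaries.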

Pipeline

1. Engagement capture (during consulting work)

  • Structured intake form captures: stack, AI tool used, failure mode tags
  • Repair notes written in standard format during engagement
  • Client signs data consent clause in contract (anonymized use for model improvement)
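
The capture step could be enforced with a small validation pass over the intake form. This is a minimal sketch: `IntakeForm`, `validate_intake`, and the tag vocabulary are hypothetical, with the tags taken from the failure modes listed earlier in this issue.

```python
from dataclasses import dataclass

# Assumed tag vocabulary, derived from the failure modes named in this issue.
FAILURE_MODE_TAGS = frozenset(
    {"hallucinated_api", "missing_error_handling", "security_hole", "no_tests"}
)


@dataclass(frozen=True)
class IntakeForm:
    """Structured intake data captured at the start of an engagement (hypothetical)."""

    stack: str
    ai_tool: str
    failure_modes: frozenset
    consent_signed: bool


def validate_intake(form: IntakeForm) -> list:
    """Return a list of problems; an empty list means the form is usable."""
    problems = []
    if not form.stack.strip():
        problems.append("stack is required")
    if not form.ai_tool.strip():
        problems.append("AI tool is required")
    unknown = form.failure_modes - FAILURE_MODE_TAGS
    if unknown:
        problems.append(f"unknown failure mode tags: {sorted(unknown)}")
    if not form.consent_signed:
        problems.append("consent clause not signed")
    return problems


ok_form = IntakeForm("python/flask", "example-tool", frozenset({"no_tests"}), True)
bad_form = IntakeForm("", "example-tool", frozenset({"typo_tag"}), False)
```

Returning every problem at once, rather than raising on the first, lets the engineer fix the form in a single pass.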

2. Anonymization

  • Strip PII, company names, proprietary identifiers from code and notes
  • Replace each with a generic placeholder (e.g. for company names and API keys)
  • Human review before adding to corpus
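
A first automated pass for the anonymization step might look like the sketch below. The placeholder tokens (`COMPANY_PLACEHOLDER` and so on), the regular expressions, and the `anonymize` function are all assumptions; this issue does not specify the real placeholder format, and regex matching will miss cases, which is why the human review step stays mandatory.

```python
import re

# Simplified email pattern; an assumption, not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumed shape of a hard-coded secret: a key-like name assigned a quoted string.
SECRET_RE = re.compile(
    r"""(?i)\b(api[_-]?key|secret|token)\b(\s*[:=]\s*)(["'])[^"']+\3"""
)


def anonymize(text: str, client_terms: list) -> tuple:
    """Replace emails, quoted secrets, and known client terms.

    Returns the scrubbed text and the number of replacements made, so a
    reviewer can see how much was changed before approving the sample.
    """
    text, total = EMAIL_RE.subn("EMAIL_PLACEHOLDER", text)

    def _mask_secret(match):
        name, sep, quote = match.group(1), match.group(2), match.group(3)
        return f"{name}{sep}{quote}SECRET_PLACEHOLDER{quote}"

    text, n = SECRET_RE.subn(_mask_secret, text)
    total += n

    # Longest terms first so "Acme Corp" is replaced before "Acme".
    for term in sorted(client_terms, key=len, reverse=True):
        text, n = re.subn(
            re.escape(term), "COMPANY_PLACEHOLDER", text, flags=re.IGNORECASE
        )
        total += n
    return text, total


sample = 'Contact bob@acme.com at Acme Corp. API_KEY = "sk-12345"'
scrubbed, replaced = anonymize(sample, ["Acme Corp", "acme"])
```

Emails are scrubbed before client terms so that a company name inside an email domain does not break the email match.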

3. Avocet integration

  • Store anonymized pairs in data/repair_corpus/
  • Format: instruction-tuning pairs
    • Instruction: "Audit this code for AI-generation failure modes and produce a repair plan"
    • Input: broken code snippet
    • Output: audit findings + repaired code + explanation
  • Feed into Avocet fine-tune harness alongside voice corpus
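
The instruction-tuning format above could be written out as JSON Lines, one pair per line. This is a sketch under assumptions: the key names (`instruction`, `input`, `output`), the `pairs.jsonl` filename, and the helper functions are illustrative, and the actual format expected by the Avocet fine-tune harness is not described in this issue.

```python
import json
from pathlib import Path

INSTRUCTION = (
    "Audit this code for AI-generation failure modes and produce a repair plan"
)


def build_pair(broken_code, audit_findings, repaired_code, explanation):
    """Assemble one instruction-tuning pair from anonymized engagement data."""
    findings = "\n".join(f"- {finding}" for finding in audit_findings)
    output = "\n\n".join(
        [
            "Audit findings:\n" + findings,
            "Repaired code:\n" + repaired_code,
            "Explanation:\n" + explanation,
        ]
    )
    return {"instruction": INSTRUCTION, "input": broken_code, "output": output}


def append_pair(pair, corpus_dir, filename="pairs.jsonl"):
    """Append one pair to a JSONL file under corpus_dir and return its path."""
    corpus_dir = Path(corpus_dir)
    corpus_dir.mkdir(parents=True, exist_ok=True)
    path = corpus_dir / filename
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(pair, ensure_ascii=False) + "\n")
    return path


pair = build_pair(
    broken_code="resp = client.fetch_all_magic()",
    audit_findings=["hallucinated API"],
    repaired_code="resp = client.get('/items')",
    explanation="The original method does not exist on the client.",
)
```

Calling `append_pair(pair, "data/repair_corpus")` would then add the example to the corpus directory named above.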

4. Feedback loop

  • Fine-tuned model used in future engagements → generates first-pass audit
  • Engineer reviews + corrects → correction fed back into corpus
  • Each reviewed correction adds a training example, so first-pass audit quality should improve with each engagement
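
The correction step in this loop could be sketched as follows. `AuditDraft` and `correction_example` are hypothetical names, and the choice to skip audits the engineer left unchanged is an assumption of this sketch rather than a decision made in this issue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuditDraft:
    """A model-generated first-pass audit and its reviewed version (hypothetical)."""

    broken_code: str
    model_audit: str     # first pass from the fine-tuned model
    engineer_audit: str  # the audit after engineer review


def correction_example(draft: AuditDraft, instruction: str) -> Optional[dict]:
    """Return a new training example only if the engineer changed the audit.

    Assumption: unchanged audits add no new signal and are skipped.
    """
    if draft.engineer_audit.strip() == draft.model_audit.strip():
        return None
    return {
        "instruction": instruction,
        "input": draft.broken_code,
        "output": draft.engineer_audit,
    }


corrected = AuditDraft(
    broken_code="open(path).read()",
    model_audit="No issues found.",
    engineer_audit="Missing error handling: file may not exist.",
)
unchanged = AuditDraft("x = 1", "No issues found.", "No issues found. ")
```

Any example produced here would still need to pass the anonymization and human review steps before entering the corpus.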

Notes

  • Consent clause must be in standard consulting contract (update website#13 contract template)
  • Anonymization is a hard requirement -- no client code in corpus without explicit sign-off
  • Corpus is proprietary, not open-sourced (competitive moat)
  • Long-term: enough corpus → specialized "code repair" model as a CF product
Reference: Circuit-Forge/avocet#38