feat: anonymized repair engagement corpus pipeline for Avocet fine-tuning #38


Open

opened 2026-04-22 10:02:10 -07:00 by pyr0ball · 0 comments


Summary

Each AI code repair engagement produces high-quality before/after training data: broken AI-generated code in, clean guardrailed code out, with audit notes explaining what went wrong. Once anonymized, these pairs form a fine-tuning corpus that should improve future repair quality as it grows.

Data shape per engagement

Each engagement produces:

Input: broken codebase snapshot (or representative files) + AI tool context (which tool, what session went wrong)
Audit findings: structured notes on failure modes (hallucinated API, missing error handling, security hole, no tests, etc.)
Output: repaired code + CLAUDE.md + guardrail config
Repair notes: explanation of root cause and fix rationale
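
To make the shape concrete, here is a minimal sketch of one engagement as a Python dataclass. The class name, field names, and the example values are all hypothetical; the issue only specifies the four parts (input, audit findings, output, repair notes), not a schema.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class EngagementRecord:
    """One repair engagement (hypothetical schema, not a settled format)."""
    # Input: broken files plus AI tool context
    broken_files: dict[str, str]            # path -> file contents
    ai_tool: str                            # which tool produced the code
    session_context: str                    # what went wrong in the session
    # Audit findings: structured failure-mode notes
    failure_modes: list[str] = field(default_factory=list)
    # Output: repaired code + CLAUDE.md + guardrail config
    repaired_files: dict[str, str] = field(default_factory=dict)
    claude_md: str = ""
    guardrail_config: dict = field(default_factory=dict)
    # Repair notes: root cause and fix rationale
    repair_notes: str = ""


# Made-up example values, for illustration only
record = EngagementRecord(
    broken_files={"app.py": "resp = client.fetch_all_users()\n"},
    ai_tool="example-assistant",
    session_context="Generated a call to a method that does not exist",
    failure_modes=["hallucinated_api", "no_tests"],
    repaired_files={"app.py": "resp = client.list_users()\n"},
    repair_notes="Root cause: hallucinated API. Fix: use the documented call.",
)
```

`asdict(record)` gives a plain dict that can be serialized for the later pipeline stages.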

Pipeline

1. Engagement capture (during consulting work)

Structured intake form captures: stack, AI tool used, failure mode tags
Repair notes written in standard format during engagement
Client signs data consent clause in contract (anonymized use for model improvement)
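
A possible shape for the intake form, with the consent clause acting as a gate before anything moves further down the pipeline. The tag values are taken from the failure modes listed under "Audit findings"; the class and function names are assumptions for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class FailureMode(str, Enum):
    """Failure mode tags (values taken from the examples in this issue)."""
    HALLUCINATED_API = "hallucinated_api"
    MISSING_ERROR_HANDLING = "missing_error_handling"
    SECURITY_HOLE = "security_hole"
    NO_TESTS = "no_tests"


@dataclass
class IntakeForm:
    """Hypothetical intake form: stack, AI tool used, failure mode tags."""
    stack: list[str]
    ai_tool: str
    failure_modes: list[FailureMode] = field(default_factory=list)
    consent_signed: bool = False   # data consent clause in the contract


def eligible_for_corpus(form: IntakeForm) -> bool:
    """Only engagements with a signed consent clause may enter the pipeline."""
    return form.consent_signed
```

Defaulting `consent_signed` to `False` means an incomplete form is excluded rather than included by accident.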

2. Anonymization

Strip PII, company names, proprietary identifiers from code and notes
Replace each with a generic placeholder (for example, one for company names and one for API keys)
Human review before adding to corpus
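
A rough sketch of the automated part of this step, assuming a per-engagement list of client names and a simple regex for secret-looking literals. The placeholder tokens and the pattern are illustrative assumptions, not the agreed format, and a pass like this only prepares text for the human review; it does not replace it.

```python
import re

# Hypothetical placeholder tokens; the real ones are not specified in this issue.
COMPANY_PLACEHOLDER = "COMPANY_NAME"
SECRET_PLACEHOLDER = "REDACTED_SECRET"

# Illustrative pattern only: quoted values assigned to key/secret/token-like names.
SECRET_RE = re.compile(
    r"""(?ix)
    (\b\w*(?:api_?key|secret|token|password)\w*\s*[:=]\s*)   # name and assignment
    (["'])[^"']+\2                                           # quoted value
    """
)


def anonymize(text: str, client_names: list[str]) -> str:
    """Replace known client names and secret-looking literals with placeholders."""
    for name in client_names:
        text = re.sub(re.escape(name), COMPANY_PLACEHOLDER, text, flags=re.IGNORECASE)
    # Keep the variable name and quotes, swap only the quoted value
    return SECRET_RE.sub(
        lambda m: f"{m.group(1)}{m.group(2)}{SECRET_PLACEHOLDER}{m.group(2)}", text
    )


sample = 'API_KEY = "sk-12345"  # Acme Corp billing'
cleaned = anonymize(sample, ["Acme Corp"])
```

Regex-based stripping will miss identifiers it was not told about (internal hostnames, product codenames), which is why the review step stays mandatory.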

3. Avocet integration

Store anonymized pairs in data/repair_corpus/
Format: instruction-tuning pairs
- Instruction: "Audit this code for AI-generation failure modes and produce a repair plan"
- Input: broken code snippet
- Output: audit findings + repaired code + explanation
Feed into Avocet fine-tune harness alongside voice corpus
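
One way the pairs could be written, assuming JSON Lines with `instruction` / `input` / `output` keys. The key names, the `pairs.jsonl` file name, and the section headings inside `output` are assumptions; whatever format the Avocet harness actually expects takes precedence.

```python
import json
from pathlib import Path

INSTRUCTION = "Audit this code for AI-generation failure modes and produce a repair plan"


def to_pair(broken_code: str, audit: str, repaired_code: str, explanation: str) -> dict:
    """Build one instruction-tuning pair in the Instruction / Input / Output shape."""
    output = (
        f"## Audit findings\n{audit}\n\n"
        f"## Repaired code\n{repaired_code}\n\n"
        f"## Explanation\n{explanation}"
    )
    return {"instruction": INSTRUCTION, "input": broken_code, "output": output}


def append_pair(pair: dict, corpus_dir: Path) -> Path:
    """Append one pair as a JSON line under the corpus directory."""
    corpus_dir.mkdir(parents=True, exist_ok=True)
    path = corpus_dir / "pairs.jsonl"   # hypothetical file name
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(pair, ensure_ascii=False) + "\n")
    return path
```

Usage would look like `append_pair(to_pair(...), Path("data/repair_corpus"))`, called only on text that has already passed anonymization and review.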

4. Feedback loop

Fine-tuned model used in future engagements → generates first-pass audit
Engineer reviews + corrects → correction fed back into corpus
Quality improves with each engagement
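
The loop could be sketched as follows, under the assumption that only audits the engineer actually changed are fed back as new pairs. That filtering policy and the `Review` type are hypothetical; feeding back confirmed audits as well would be an equally reasonable choice.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

INSTRUCTION = "Audit this code for AI-generation failure modes and produce a repair plan"


@dataclass
class Review:
    """Hypothetical record of one model audit and its engineer review."""
    broken_code: str
    model_audit: str      # first-pass audit from the fine-tuned model
    engineer_audit: str   # audit after the engineer's review


def correction_pairs(reviews: Iterable[Review]) -> Iterator[dict]:
    """Yield a training pair for each review where the engineer changed the audit."""
    for review in reviews:
        if review.engineer_audit.strip() != review.model_audit.strip():
            yield {
                "instruction": INSTRUCTION,
                "input": review.broken_code,
                "output": review.engineer_audit,   # the corrected version wins
            }
```

Corrections produced this way still contain client code, so they would go through the same anonymization and review as any other engagement data before landing in the corpus.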

Notes

Consent clause must be in standard consulting contract (update website#13 contract template)
Anonymization is a hard requirement -- no client code in corpus without explicit sign-off
Corpus is proprietary, not open-sourced (competitive moat)
Long-term: enough corpus → specialized "code repair" model as a CF product
