Add embedding model and RAG classifier support to benchmark/finetune harness #55

Open
opened 2026-05-04 16:13:16 -07:00 by pyr0ball · 0 comments

Context

Pagepiper (new CF product) uses nomic-embed-text via Ollama for vector embeddings and a hybrid BM25 + sqlite-vec retrieval pipeline. As we add embedding-based products, Avocet needs to benchmark, label, and fine-tune embedding models, embedding-based classifiers, and RAG pipelines, not just sequence-to-sequence generation models.

Required additions

Embedding model support

  • Add embedding model config to llm.yaml / benchmark config (separate from chat model)
  • Support LLMRouter.embed() (cf-core v0.19.0) as the embedding backend in harness runs
  • Benchmark harness should evaluate retrieval quality (top-k hit rate, MRR), not just generation quality
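
For reference, a minimal sketch of the two retrieval metrics named above. The function names and the `(ranked_ids, relevant_ids)` input shape are assumptions for illustration, not existing harness API; a real run would get `ranked_ids` from the retrieval pipeline under test.

```python
from typing import Dict, Iterable, Sequence, Set, Tuple


def hit_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Return 1.0 if any relevant document appears in the top-k results, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Return 1/rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(
    runs: Iterable[Tuple[Sequence[str], Set[str]]], k: int = 5
) -> Dict[str, float]:
    """Average hit rate @k and MRR over (ranked_ids, relevant_ids) pairs (hypothetical shape)."""
    runs = list(runs)
    if not runs:
        raise ValueError("at least one query result is required")
    n = len(runs)
    return {
        "hit_rate": sum(hit_at_k(ranked, rel, k) for ranked, rel in runs) / n,
        "mrr": sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / n,
    }
```

Both metrics only need document IDs, so they can be computed without calling a generation model.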

Embedding-based classifier support

  • Add classifier type: embedding_similarity (embeds the input, computes cosine similarity to class exemplars, assigns the best-matching label)
  • Complements the current generation-based classifier (which prompts the LLM to output a label)
  • Allows head-to-head benchmarking of zero-shot embedding classifiers vs. fine-tuned generation classifiers
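
A rough sketch of what the embedding_similarity classifier could look like. It works on vectors that have already been computed, because the signature of LLMRouter.embed() is not specified here. Picking the label of the single most similar exemplar is an assumption (comparing against per-class centroids is an equally plausible choice), and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_by_similarity(
    query_vec: Vector, exemplars: Dict[str, List[Vector]]
) -> Tuple[str, float]:
    """Return (label, score) for the class whose closest exemplar best matches the query.

    Assumption: nearest-exemplar (max) scoring; swap in centroids if that fits better.
    """
    best_label, best_score = None, -math.inf
    for label, vectors in exemplars.items():
        for vec in vectors:
            score = cosine_similarity(query_vec, vec)
            if score > best_score:
                best_label, best_score = label, score
    if best_label is None:
        raise ValueError("no exemplars provided")
    return best_label, best_score
```

Returning the score alongside the label leaves room for a confidence threshold later, e.g. falling back to the generation-based classifier when the best match is weak.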

RAG pipeline evaluation

  • Harness support for measuring retrieval recall (was the correct page/chunk in the top-k?)
  • Benchmark dataset format: { query, relevant_doc_ids, expected_answer } for end-to-end RAG eval
  • Metrics: retrieval hit rate @k, answer faithfulness (LLM judge), citation accuracy
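
To make the dataset format concrete, here is a sketch of loading { query, relevant_doc_ids, expected_answer } records and scoring retrieval recall per example. Storing one JSON object per line (JSONL) is an assumption, as are the class and function names; answer faithfulness and citation accuracy are left out because they depend on the judge setup.

```python
import json
from dataclasses import dataclass
from typing import List, Sequence

REQUIRED_FIELDS = {"query", "relevant_doc_ids", "expected_answer"}


@dataclass(frozen=True)
class RagExample:
    query: str
    relevant_doc_ids: List[str]
    expected_answer: str


def parse_rag_examples(jsonl_text: str) -> List[RagExample]:
    """Parse one JSON object per non-blank line (assumed storage format)."""
    examples = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        examples.append(
            RagExample(
                query=record["query"],
                relevant_doc_ids=list(record["relevant_doc_ids"]),
                expected_answer=record["expected_answer"],
            )
        )
    return examples


def recall_at_k(example: RagExample, retrieved_ids: Sequence[str], k: int) -> float:
    """Fraction of the example's relevant documents found in the top-k retrieved IDs."""
    relevant = set(example.relevant_doc_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved_ids[:k])) / len(relevant)
```

Note that recall @k (fraction of relevant chunks found) and hit rate @k (was at least one found) diverge as soon as a query has more than one relevant chunk, so it may be worth reporting both.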

Fine-tuning targets

  • Research: can we fine-tune nomic-embed-text or a similar small embedding model on domain-specific corpora (TTRPG rulebooks, HR docs, etc.)?
  • Add fine-tune harness path for sentence-transformers / MTEB-compatible models
  • Fine-tune data format: { anchor, positive, negative } triplets (contrastive)
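
For illustration, the objective a { anchor, positive, negative } triplet implies can be sketched as a margin loss over cosine distance. This is a plain-Python sketch on precomputed vectors, not the training loop; the margin value and function names are assumptions, and an actual fine-tune run would rely on the loss implementations shipped with the chosen training library.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; raises on zero-length vectors, where it is undefined."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm_a * norm_b)


def triplet_margin_loss(
    anchor: Vector, positive: Vector, negative: Vector, margin: float = 0.2
) -> float:
    """Zero once the positive is closer to the anchor than the negative by at least `margin`."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

A triplet with zero loss carries no training signal, which is why the choice of negatives in the labeled data matters more than the raw number of triplets.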

Acceptance criteria

  • Benchmark harness can run an embedding similarity classifier and report accuracy/F1
  • Benchmark harness can evaluate a RAG pipeline end-to-end (retrieval + generation)
  • Fine-tune harness supports at least one embedding model architecture (sentence-transformers)
  • All new harness modes covered by unit tests
  • Avocet label tool can label retrieval relevance judgments (binary: relevant / not relevant per chunk)
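
As a reference point for the accuracy/F1 criterion, a small sketch of the report computed from gold and predicted labels. Macro-averaging F1 across classes is an assumption (the criterion does not say which averaging is wanted), and the function name is hypothetical.

```python
from typing import Dict, Sequence


def accuracy_and_macro_f1(gold: Sequence[str], predicted: Sequence[str]) -> Dict[str, float]:
    """Accuracy plus unweighted mean of per-class F1 over all labels seen in either list."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and the same length")
    pairs = list(zip(gold, predicted))
    labels = sorted(set(gold) | set(predicted))
    f1_scores = []
    for label in labels:
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); every label here occurs at least once, so denom > 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return {
        "accuracy": sum(1 for g, p in pairs if g == p) / len(pairs),
        "macro_f1": sum(f1_scores) / len(f1_scores),
    }
```

The same function would apply to both the embedding_similarity and generation-based classifiers, which keeps the head-to-head comparison on identical metrics.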

Notes

  • LLMRouter.embed() is available in cf-core v0.19.0 (just merged)
  • Pagepiper uses nomic-embed-text (768-dim) via Ollama, which makes it a good first target for domain fine-tuning
  • Keep MIT/BSL boundary: embedding inference = BSL; retrieval pipeline scaffold = MIT

pyr0ball added the enhancement, backlog, and ml labels 2026-05-04 16:13:16 -07:00
Reference: Circuit-Forge/avocet#55