Add embedding model and RAG classifier support to benchmark/finetune harness #55

Open

opened 2026-05-04 16:13:16 -07:00 by pyr0ball · 0 comments

pyr0ball (Owner) commented 2026-05-04 16:13:16 -07:00

Context

Pagepiper (new CF product) uses nomic-embed-text via Ollama for vector embeddings and a BM25+sqlite-vec hybrid retrieval pipeline. As we add embedding-based products, Avocet needs to be able to benchmark, label, and fine-tune embedding and RAG classifiers, not just sequence-to-sequence generation models.

Required additions

Embedding model support

Add embedding model config to llm.yaml / benchmark config (separate from chat model)
Support LLMRouter.embed() (cf-core v0.19.0) as the embedding backend in harness runs
The benchmark harness should be able to evaluate retrieval quality (top-k hit rate, MRR), not just generation quality
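To make the two retrieval metrics named above concrete, here is a minimal sketch in plain Python. The function names and the input shape (one ranked list of doc ids per query, plus one set of relevant ids per query) are assumptions for illustration, not the harness's actual interface.

```python
from typing import Sequence, Set


def hit_rate_at_k(ranked: Sequence[Sequence[str]],
                  relevant: Sequence[Set[str]],
                  k: int) -> float:
    """Fraction of queries with at least one relevant doc in the top k results."""
    if not ranked:
        return 0.0
    hits = sum(
        1
        for docs, rel in zip(ranked, relevant)
        if any(doc_id in rel for doc_id in docs[:k])
    )
    return hits / len(ranked)


def mean_reciprocal_rank(ranked: Sequence[Sequence[str]],
                         relevant: Sequence[Set[str]]) -> float:
    """Mean of 1/rank of the first relevant doc per query (0 when none is retrieved)."""
    if not ranked:
        return 0.0
    total = 0.0
    for docs, rel in zip(ranked, relevant):
        for rank, doc_id in enumerate(docs, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts toward MRR
    return total / len(ranked)


if __name__ == "__main__":
    # Hypothetical retrieval output for three queries.
    ranked = [["a", "b", "c"], ["x", "y", "z"], ["p", "q", "r"]]
    relevant = [{"b"}, {"x"}, {"missing"}]
    print(hit_rate_at_k(ranked, relevant, k=1))
    print(mean_reciprocal_rank(ranked, relevant))
```

Both functions take plain id lists, so they are independent of how the embeddings were produced.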

Embedding-based classifier support

Add classifier type embedding_similarity: embeds the input, computes cosine similarity to class exemplars, and assigns the label of the most similar class
Complements the current generation-based classifier (which prompts an LLM to output a label)
Allows benchmarking zero-shot embedding classifiers vs. fine-tuned generation classifiers head-to-head
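A rough sketch of how an embedding_similarity classifier could work, assuming the embedding backend is injected as a plain callable. The class name, the max-over-exemplars scoring rule, and `toy_embed` are all hypothetical; `toy_embed` is a bag-of-words stand-in so the example runs offline, and nothing here assumes the real signature of LLMRouter.embed().

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class EmbeddingSimilarityClassifier:
    """Assigns the label whose exemplars are most similar to the input (hypothetical)."""

    def __init__(self, embed: Callable[[str], Vector],
                 exemplars: Dict[str, List[str]]) -> None:
        self._embed = embed
        # Embed exemplars once up front; labels without exemplars are skipped.
        self._exemplar_vectors: Dict[str, List[Vector]] = {
            label: [embed(text) for text in texts]
            for label, texts in exemplars.items()
            if texts
        }
        if not self._exemplar_vectors:
            raise ValueError("at least one label with exemplars is required")

    def predict(self, text: str) -> Tuple[str, float]:
        """Return (label, score), scoring each label by its best-matching exemplar."""
        query = self._embed(text)
        best_label, best_score = "", -math.inf
        for label, vectors in self._exemplar_vectors.items():
            score = max(cosine_similarity(query, v) for v in vectors)
            if score > best_score:
                best_label, best_score = label, score
        return best_label, best_score


# Toy stand-in for a real embedding model: counts of a tiny fixed vocabulary.
_VOCAB = ["dice", "roll", "spell", "vacation", "payroll", "leave"]


def toy_embed(text: str) -> List[float]:
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in _VOCAB]


if __name__ == "__main__":
    clf = EmbeddingSimilarityClassifier(toy_embed, {
        "ttrpg": ["roll the dice", "cast a spell"],
        "hr": ["vacation leave policy", "payroll schedule"],
    })
    print(clf.predict("how do I roll dice for a spell"))
    print(clf.predict("when is payroll"))
```

Scoring by the best-matching exemplar is one choice among several; averaging exemplar vectors into a class centroid is a common alternative and would be worth benchmarking side by side.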

RAG pipeline evaluation

Harness support for measuring retrieval recall (was the correct page/chunk in the top-k?)
Benchmark dataset format: { query, relevant_doc_ids, expected_answer } for end-to-end RAG eval
Metrics: retrieval hit rate @k, answer faithfulness (LLM judge), citation accuracy
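The dataset record described above could be represented roughly as follows. This is a sketch under assumptions: the JSONL storage format, the `RagExample` and `load_rag_examples` names, and the definition of citation accuracy used here (share of cited ids that appear in relevant_doc_ids) are not specified in this issue. Answer faithfulness needs an LLM judge and is left out.

```python
import json
from dataclasses import dataclass
from typing import List, Sequence

REQUIRED_FIELDS = {"query", "relevant_doc_ids", "expected_answer"}


@dataclass(frozen=True)
class RagExample:
    query: str
    relevant_doc_ids: List[str]
    expected_answer: str


def load_rag_examples(jsonl_text: str) -> List[RagExample]:
    """Parse one JSON object per line, rejecting records with missing fields."""
    examples: List[RagExample] = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = REQUIRED_FIELDS - set(record)
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        examples.append(RagExample(
            query=record["query"],
            relevant_doc_ids=list(record["relevant_doc_ids"]),
            expected_answer=record["expected_answer"],
        ))
    return examples


def retrieval_hit(example: RagExample, retrieved_ids: Sequence[str], k: int) -> bool:
    """True if any relevant doc id appears in the top k retrieved ids."""
    relevant = set(example.relevant_doc_ids)
    return any(doc_id in relevant for doc_id in retrieved_ids[:k])


def citation_accuracy(example: RagExample, cited_ids: Sequence[str]) -> float:
    """Share of cited ids that are relevant; 0.0 when nothing is cited (assumed definition)."""
    if not cited_ids:
        return 0.0
    relevant = set(example.relevant_doc_ids)
    return sum(1 for doc_id in cited_ids if doc_id in relevant) / len(cited_ids)


if __name__ == "__main__":
    # Hypothetical sample record; ids and text are made up for illustration.
    sample = '{"query": "How does grappling work?", "relevant_doc_ids": ["p12"], "expected_answer": "..."}'
    example = load_rag_examples(sample)[0]
    print(retrieval_hit(example, ["p03", "p12", "p40"], k=2))
    print(citation_accuracy(example, ["p12", "p99"]))
```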

Fine-tuning targets

Research: can we fine-tune nomic-embed-text or a similar small embedding model on domain-specific corpora (TTRPG rulebooks, HR docs, etc.)?
Add fine-tune harness path for sentence-transformers / MTEB-compatible models
Fine-tune data format: { anchor, positive, negative } triplets (contrastive)
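For reference, the objective that { anchor, positive, negative } triplets usually feed is a triplet margin loss. The sketch below computes it for a single triplet of already-embedded vectors using cosine distance. The margin value and function names are illustrative assumptions; an actual fine-tune run would presumably rely on the loss shipped with the training library rather than this code.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; treated as 1.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def triplet_margin_loss(anchor: Vector, positive: Vector, negative: Vector,
                        margin: float = 0.2) -> float:
    """max(0, d(anchor, positive) - d(anchor, negative) + margin).

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin`; otherwise it grows with the violation.
    """
    return max(
        0.0,
        cosine_distance(anchor, positive) - cosine_distance(anchor, negative) + margin,
    )


if __name__ == "__main__":
    # Well-separated triplet: positive matches the anchor, negative is orthogonal.
    print(triplet_margin_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]))
    # Violating triplet: the "negative" is closer to the anchor than the "positive".
    print(triplet_margin_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0]))
```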

Acceptance criteria

Benchmark harness can run an embedding similarity classifier and report accuracy/F1
Benchmark harness can evaluate a RAG pipeline end-to-end (retrieval + generation)
Fine-tune harness supports at least one embedding model architecture (sentence-transformers)
All new harness modes covered by unit tests
Avocet label tool can label retrieval relevance judgments (binary: relevant / not relevant per chunk)
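For the accuracy/F1 criterion, a minimal dependency-free sketch of the two numbers a classifier run might report is shown below. Macro averaging is an assumption here; the issue does not say which F1 variant the harness should use.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Share of predictions that match the gold label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        return 0.0
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 over every label seen in gold or predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    gold = ["a", "a", "b", "b"]
    pred = ["a", "b", "b", "b"]
    print(accuracy(gold, pred))
    print(macro_f1(gold, pred))
```

The same functions apply to binary relevance judgments if the labels are "relevant" and "not relevant".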

Notes

LLMRouter.embed() is available in cf-core v0.19.0 (just merged)
Pagepiper uses nomic-embed-text 768-dim via Ollama — good first target for domain fine-tune
Keep MIT/BSL boundary: embedding inference = BSL; retrieval pipeline scaffold = MIT
