Embedding model comparison harness #59

Open

opened 2026-05-06 08:21:54 -07:00 by pyr0ball · 0 comments

Goal

Add a way to benchmark and compare embedding model results within Avocet, so we can make informed decisions about which model to use across the menagerie (pagepiper, peregrine, etc.).

Context

Pagepiper embeds page chunks with nomic-embed-text at 1024 dimensions via Ollama. When evaluating a new model (e.g. mxbai-embed-large or all-minilm), there is currently no tooling to compare retrieval quality side-by-side. Avocet already has the label/benchmark infrastructure, which makes it the natural home for this.

Proposed scope

Select two or more embedding models (configured via Ollama BYOK)
Run a shared query set against each model's vector index
Display ranked retrieval results side-by-side with scores
Optionally: allow a human rater to label which result is more relevant (feeds back into the Avocet training corpus)
Export comparison report (JSON / CSV)
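
To make the scope above concrete, here is a minimal sketch of the comparison loop, not a proposal for the actual Avocet implementation. It assumes each model is exposed as a plain "list of texts in, list of vectors out" callable, ranks chunks by cosine similarity, and flattens the results into rows that can be exported as JSON or CSV. The names `Embedder`, `ollama_embedder`, `compare_models`, `to_json`, and `to_csv` are hypothetical, and the Ollama helper assumes a local server at the default `http://localhost:11434` that serves the `/api/embed` endpoint.

```python
import csv
import io
import json
import math
import urllib.request
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: an embedder maps a batch of texts to one vector each.
Embedder = Callable[[List[str]], List[List[float]]]


def ollama_embedder(model: str, host: str = "http://localhost:11434") -> Embedder:
    """Build an embedder backed by a local Ollama server (assumed /api/embed endpoint)."""
    def embed(texts: List[str]) -> List[List[float]]:
        payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
        req = urllib.request.Request(
            f"{host}/api/embed",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["embeddings"]
    return embed


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def compare_models(
    embedders: Dict[str, Embedder],
    chunks: Sequence[str],
    queries: Sequence[str],
    top_k: int = 3,
) -> List[dict]:
    """Run the same queries against each model and return flat, ranked result rows."""
    rows = []
    for name, embed in embedders.items():
        chunk_vecs = embed(list(chunks))   # each model gets its own index
        query_vecs = embed(list(queries))
        for query, qv in zip(queries, query_vecs):
            scored = sorted(
                ((cosine(qv, cv), i) for i, cv in enumerate(chunk_vecs)),
                key=lambda s: (-s[0], s[1]),  # best score first, ties by chunk id
            )
            for rank, (score, idx) in enumerate(scored[:top_k], start=1):
                rows.append({
                    "model": name,
                    "query": query,
                    "rank": rank,
                    "chunk_id": idx,
                    "score": round(score, 4),
                })
    return rows


def to_json(rows: List[dict]) -> str:
    """Serialize the comparison rows as a JSON report."""
    return json.dumps(rows, indent=2)


def to_csv(rows: List[dict]) -> str:
    """Serialize the comparison rows as a CSV report."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["model", "query", "rank", "chunk_id", "score"]
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

With a running Ollama instance, something like `compare_models({"nomic-embed-text": ollama_embedder("nomic-embed-text"), "mxbai-embed-large": ollama_embedder("mxbai-embed-large")}, chunks, queries)` would give the rows for a side-by-side view. Keeping the embedder as a plain callable means the harness can be exercised without a server, and a human-rating column could later be added to the same rows.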

Acceptance criteria

Can compare at least 2 models on a shared document corpus
Results shown side-by-side with BM25 + vector scores visible
Human rating optional but supported
Works on Free tier (local Ollama only)
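
For the "BM25 + vector scores visible" criterion, a rough sketch of how the two scores could sit next to each other for one query. This is an assumption about presentation, not how Avocet or Pagepiper currently score results: it uses a plain Okapi BM25 with whitespace tokenization and the common defaults `k1=1.5`, `b=0.75`, and takes the per-model vector scores as already computed. `tokenize`, `bm25_scores`, and `side_by_side` are hypothetical names.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Naive lowercase whitespace tokenizer (a real harness would likely do more)."""
    return text.lower().split()


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 score of every document for one query."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n if n else 0.0
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenized for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def side_by_side(query: str, docs: List[str], vector_scores: Dict[str, List[float]]) -> str:
    """Tab-separated table: one row per chunk, BM25 next to each model's vector score."""
    bm25 = bm25_scores(query, docs)
    models = sorted(vector_scores)
    lines = ["\t".join(["chunk", "bm25"] + models)]
    for i in range(len(docs)):
        cells = [str(i), f"{bm25[i]:.3f}"]
        cells += [f"{vector_scores[m][i]:.3f}" for m in models]
        lines.append("\t".join(cells))
    return "\n".join(lines)
```

BM25 depends only on the corpus and the query, so it acts as a model-independent baseline column, which makes it easier to spot where an embedding model disagrees with plain lexical matching.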

Labels

enhancement, backlog
