Embedding model comparison harness #59

Open
opened 2026-05-06 08:21:54 -07:00 by pyr0ball · 0 comments
Owner

Goal

Add a way to benchmark and compare embedding model results within Avocet, so we can make informed decisions about which model to use across the menagerie (pagepiper, peregrine, etc.).

Context

Pagepiper embeds page chunks with nomic-embed-text at 1024 dimensions via Ollama. When evaluating a new model (e.g. mxbai-embed-large or all-minilm), there is currently no tooling to compare retrieval quality side by side. Avocet already has the label/benchmark infrastructure, which makes it the natural home for this.

Proposed scope

  • Select two or more embedding models (configured via Ollama BYOK)
  • Run a shared query set against each model's vector index
  • Display ranked retrieval results side-by-side with scores
  • Optionally: allow a human rater to label which result is more relevant (feeds back into the Avocet training corpus)
  • Export comparison report (JSON / CSV)
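
To make the scope above concrete, here is a rough sketch of the core loop: embed a shared corpus and query set with each model, rank documents by cosine similarity, and serialize the result. This is not Avocet code. `compare_models`, `export_json`, and the report shape are hypothetical names, brute-force cosine ranking stands in for whatever vector index each model would really use, and `ollama_embed` assumes a local Ollama server exposing `POST /api/embed` on the default port.

```python
import json
import math
import urllib.request
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical types: an embedder maps (model name, texts) to one vector per text.
Embedder = Callable[[str, Sequence[str]], List[List[float]]]
Report = Dict[str, Dict[str, List[Tuple[str, float]]]]


def ollama_embed(model: str, texts: Sequence[str],
                 host: str = "http://localhost:11434") -> List[List[float]]:
    """Embed texts via a local Ollama server (assumes the /api/embed endpoint)."""
    payload = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    req = urllib.request.Request(
        f"{host}/api/embed",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embeddings"]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def compare_models(models: Sequence[str], queries: Sequence[str],
                   docs: Sequence[str], embed: Embedder = ollama_embed,
                   top_k: int = 3) -> Report:
    """Rank the shared corpus for every query, once per model."""
    report: Report = {}
    for model in models:
        doc_vecs = embed(model, docs)
        query_vecs = embed(model, queries)
        report[model] = {}
        for query, q_vec in zip(queries, query_vecs):
            scored = [(doc, cosine(q_vec, d_vec))
                      for doc, d_vec in zip(docs, doc_vecs)]
            scored.sort(key=lambda pair: pair[1], reverse=True)
            report[model][query] = scored[:top_k]
    return report


def export_json(report: Report) -> str:
    """Serialize the comparison report; (doc, score) tuples become JSON arrays."""
    return json.dumps(report, indent=2)
```

Passing the embedder as a parameter keeps the ranking logic testable without a running Ollama instance, and would let a side-by-side view read each model's top-k list for the same query straight from the report.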

Acceptance criteria

  • Can compare at least 2 models on a shared document corpus
  • Results shown side-by-side with BM25 + vector scores visible
  • Human rating optional but supported
  • Works on Free tier (local Ollama only)
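
The BM25 criterion above could be met with a lexical score computed next to each vector score. The issue does not say how Avocet or Pagepiper computes BM25, so the following is only a minimal sketch under stated assumptions: whitespace tokenization, the common defaults `k1 = 1.5` and `b = 0.75`, and the non-negative IDF variant `ln(1 + (N - n + 0.5) / (n + 0.5))`. `bm25_scores` is a hypothetical helper name.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Return one Okapi BM25 score per document for the given query."""
    # Assumption: lowercase whitespace tokenization, no stemming or stop words.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs if n_docs else 0.0

    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores
```

Because the lexical score is the same whichever embedding model is selected, showing it beside each model's vector score gives the rater a fixed baseline when judging two rankings.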

Labels

enhancement, backlog

Reference: Circuit-Forge/avocet#59