feat: hybrid BM25 + vector RAG for diagnose — pattern recognition and red herring suppression #15

Closed
opened 2026-05-11 13:08:29 -07:00 by pyr0ball · 0 comments
Owner

Updated: Hybrid BM25 + Vector Search Architecture

Design spec: circuitforge-plans/turnstone/superpowers/specs/2026-05-24-hybrid-rag-multiagent-diagnose-design.md

The gap

The current FTS5 BM25 search matches on literal tokens, so it misses log entries that describe the same failure in different vocabulary:

  • Query: "database connection failed"
  • Misses: ECONNREFUSED, backend gone away, max retries exceeded, connection reset by peer

Hybrid score

score = (alpha * bm25_score) + (beta * cosine_similarity)

Start with alpha=0.6 and beta=0.4 (both tunable). The existing pattern-tag boost is preserved; the vector score is added on top of it.
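
A minimal sketch of the blend, assuming both inputs are first brought onto a comparable 0..1 scale. SQLite's FTS5 `bm25()` returns numerically lower values for better matches, so the sketch negates and min-max normalizes the raw values before weighting. The names `normalize_bm25` and `hybrid_scores`, the normalization step, and the clamping of negative cosine values to zero are illustrative assumptions, not part of the design spec. The pattern-tag boost is left out because its form is not described here.

```python
from typing import Sequence

ALPHA = 0.6  # BM25 weight (starting value from this issue)
BETA = 0.4   # vector weight (starting value from this issue)


def normalize_bm25(raw: Sequence[float]) -> list[float]:
    """Map raw FTS5 bm25() values (lower = better) onto 0..1 (higher = better)."""
    if not raw:
        return []
    flipped = [-r for r in raw]  # make "higher is better"
    lo, hi = min(flipped), max(flipped)
    if hi == lo:
        # All candidates matched equally well; treat them all as top-ranked.
        return [1.0] * len(flipped)
    return [(f - lo) / (hi - lo) for f in flipped]


def hybrid_scores(
    raw_bm25: Sequence[float],
    cosine: Sequence[float],
    alpha: float = ALPHA,
    beta: float = BETA,
) -> list[float]:
    """score = alpha * bm25_score + beta * cosine_similarity, per candidate."""
    if len(raw_bm25) != len(cosine):
        raise ValueError("raw_bm25 and cosine must have the same length")
    bm25 = normalize_bm25(raw_bm25)
    # Assumption: negative cosine values carry no signal and are clamped to 0.
    return [alpha * b + beta * max(0.0, c) for b, c in zip(bm25, cosine)]
```

With this shape, a candidate that BM25 ranks last can still surface if its cosine similarity is high, which is the behavior the `ECONNREFUSED` example above needs.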

Vector index — implementation options (preference order)

  1. In-process numpy cosine — load context_chunks.embedding BLOB, compute in Python. Zero new dependencies. Fast for <100K entries. Start here.
  2. sqlite-vec — SQLite extension for ANN search. No new service. Preferred for >100K entries.
  3. Chroma / Qdrant / Weaviate — unnecessary infra for Turnstone scale. Do not use.
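
Option 1 could look roughly like the sketch below. It assumes embeddings are stored as raw little-endian float32 bytes and that `context_chunks` has an `id` column; the issue states neither, so treat both as placeholders. The function names `load_embeddings` and `cosine_top_k` are hypothetical.

```python
import sqlite3

import numpy as np


def load_embeddings(conn: sqlite3.Connection):
    """Load every stored embedding into one (n_chunks, dim) float32 matrix."""
    rows = conn.execute(
        "SELECT id, embedding FROM context_chunks WHERE embedding IS NOT NULL"
    ).fetchall()
    ids = [row[0] for row in rows]
    if not rows:
        return ids, np.empty((0, 0), dtype=np.float32)
    # Assumption: each BLOB is a packed little-endian float32 vector.
    matrix = np.vstack([np.frombuffer(row[1], dtype="<f4") for row in rows])
    return ids, matrix


def cosine_top_k(query, ids, matrix, k=5):
    """Return the k (chunk_id, cosine_similarity) pairs closest to `query`."""
    if matrix.shape[0] == 0:
        return []
    q = np.asarray(query, dtype=np.float32)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(q)
    norms[norms == 0] = 1.0  # zero vectors get similarity 0 instead of NaN
    sims = (matrix @ q) / norms
    order = np.argsort(-sims)[:k]  # highest similarity first
    return [(ids[i], float(sims[i])) for i in order]
```

The whole comparison is a single matrix-vector product, which is why a brute-force scan stays practical at the sizes mentioned above. Loading the matrix once and reusing it across queries avoids re-reading the BLOBs on every search.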

RAG beyond search

Vector retrieval also drives the multi-agent diagnose pipeline (#29):

  • Stage 3 (root-cause): embed anomaly list, retrieve runbook + past incident chunks
  • Stage 4 (false-positive suppressor): cosine similarity vs. known-good corpus
  • Stage 5 (cross-incident): embed new summary, retrieve similar historical incidents
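
As an illustration of Stage 4, the sketch below keeps only the anomalies whose best cosine match against a known-good corpus falls under a cutoff. The function name `suppress_known_good`, the 0.9 threshold, and the "best match wins" rule are assumptions made for the example; the design spec may define the suppressor differently.

```python
import numpy as np


def _unit_rows(vectors) -> np.ndarray:
    """Scale each row of a 2-D array to unit length (zero rows stay zero)."""
    m = np.asarray(vectors, dtype=np.float32)
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return m / norms


def suppress_known_good(anomalies, known_good, threshold=0.9):
    """Return indices of anomalies that do NOT resemble the known-good corpus.

    `anomalies` and `known_good` are sequences of equal-length embeddings.
    """
    if len(anomalies) == 0:
        return []
    if len(known_good) == 0:
        # Nothing to compare against, so nothing can be suppressed.
        return list(range(len(anomalies)))
    a = _unit_rows(anomalies)
    g = _unit_rows(known_good)
    best = (a @ g.T).max(axis=1)  # best cosine match per anomaly
    return [i for i, sim in enumerate(best) if sim < threshold]
```

The threshold trades missed incidents against noise: a lower value suppresses more aggressively, so it would need tuning against real logs before being trusted.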

Embedding infrastructure (prerequisite)

  • Model: local, CPU-capable, under 500 MB. A Hugging Face search for the best fit is in progress.
  • app/services/embeddings.py — embed text, persist to context_chunks.embedding
  • Backfill embeddings for existing context_chunks
  • Hook into document upload pipeline (chunks exist, just need embeddings generated)

context_chunks.embedding BLOB already exists in schema — no migration needed.
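
The backfill step might be sketched as follows. Because no model has been chosen yet, `embed` is a stand-in callable supplied by the caller; the `id` and `content` column names and the float32 BLOB packing are likewise assumptions rather than the actual schema.

```python
import sqlite3
from array import array
from typing import Callable, Sequence


def backfill_embeddings(
    conn: sqlite3.Connection,
    embed: Callable[[str], Sequence[float]],
    batch_size: int = 256,
) -> int:
    """Embed every chunk whose embedding is NULL; return how many were updated."""
    updated = 0
    while True:
        rows = conn.execute(
            "SELECT id, content FROM context_chunks "
            "WHERE embedding IS NULL LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            return updated
        for chunk_id, content in rows:
            # Assumption: vectors are stored as packed float32 in native byte order.
            blob = array("f", embed(content)).tobytes()
            conn.execute(
                "UPDATE context_chunks SET embedding = ? WHERE id = ?",
                (blob, chunk_id),
            )
        conn.commit()  # commit per batch so an interrupted run can resume
        updated += len(rows)
```

Selecting on `embedding IS NULL` makes the job idempotent: rerunning it after a crash or after new uploads only touches chunks that still lack a vector. The same function could serve as the upload-pipeline hook if it is called after chunks are inserted.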

Relates to: #29, #32

pyr0ball added this to the beta milestone 2026-06-01 15:09:59 -07:00
Reference: Circuit-Forge/turnstone#15