feat: evaluate Agent-ModernColBERT as semantic upgrade to FTS5 log search #18

Open
opened 2026-05-13 15:46:18 -07:00 by pyr0ball · 0 comments
Owner

Background

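Turnstone currently uses SQLite FTS5 with Porter stemming for log retrieval (app/services/search.py). This is fast and zero-dependency, but purely keyword-based: it cannot connect a natural-language query to log entries that share none of its terms. For example, it cannot match: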

  • "why is my service restarting" → OOM kill / segfault log entries
  • "network unreachable" → DHCP failure / routing table entries
  • "disk full" → inode exhaustion / write error log entries

These semantic gaps are exactly what makes homelab diagnosis hard.
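The gap can be reproduced in a few lines. This is a minimal sketch that assumes the local SQLite build includes FTS5; the single-column `logs` table and the sample messages are hypothetical and are not Turnstone's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical single-column table; the real schema lives in app/services/search.py.
conn.execute("CREATE VIRTUAL TABLE logs USING fts5(message, tokenize='porter')")
conn.executemany(
    "INSERT INTO logs(message) VALUES (?)",
    [
        ("Out of memory: Killed process 1234 (nginx)",),
        ("nginx.service: Main process exited, code=killed, status=9/KILL",),
    ],
)


def fts_hits(match_expr: str) -> list[str]:
    """Return the messages matching an FTS5 MATCH expression."""
    rows = conn.execute("SELECT message FROM logs WHERE logs MATCH ?", (match_expr,))
    return [row[0] for row in rows]


# Porter stemming bridges word forms: "killing" and "Killed" share a stem.
keyword_hits = fts_hits("killing")

# It cannot bridge intent: neither entry contains a form of "restarting",
# so the implicit AND of the two terms matches nothing.
intent_hits = fts_hits("service restarting")

print(len(keyword_hits), len(intent_hits))
```

Both sample entries describe why a service would restart, yet the natural-language query returns nothing because stemming only normalizes word forms.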

Proposed upgrade

lightonai/Agent-ModernColBERT is a late-interaction, token-level retriever designed for agentic/multi-hop queries. It would sit alongside FTS5, not replace it:

  1. FTS5 handles keyword-exact matches (fast, cheap, good for structured log fields)
  2. ColBERT handles semantic intent matches (slower, richer, good for natural language queries)
  3. Results are merged/reranked before passing to the LLM summarization step in llm.py

The model is already registered in the cf-orch model registry as agent-moderncolbert (~800MB VRAM).
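For context on what "late-interaction" means in the proposal: ColBERT-style models keep one embedding per token and score a document by summing, for each query token, its best similarity against any document token (often called MaxSim). The sketch below illustrates that scoring rule with hand-written 2-D toy vectors; it does not call the model, and the function names are illustrative only.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def maxsim_score(query_tokens: list[list[float]], doc_tokens: list[list[float]]) -> float:
    """Sum over query tokens of the best similarity to any document token."""
    if not doc_tokens:
        return 0.0
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)


# Toy token embeddings; real ones would come from the retriever model.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # covers both query tokens
doc_b = [[1.0, 0.0]]                          # covers only the first

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Because every query token is matched independently, a log entry can score well without containing any of the literal query words, which is the behavior FTS5 lacks.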

What to evaluate

  • Does ColBERT retrieval meaningfully improve diagnosis quality vs FTS5 alone on a sample of real homelab queries?
  • Is the index build time acceptable for a corpus of typical log volumes (~100k entries)?
  • Does the larger index footprint (per-token embeddings) fit within /devl/ storage budget?
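Two of the questions above can be made concrete with small helpers. This is a sketch under stated assumptions: recall@k is only one possible proxy for diagnosis quality, and the footprint figures (32 tokens per entry, 128-dimensional embeddings, 2 bytes per value) are placeholders to be replaced with measured values. Real index libraries may also compress embeddings, so the result is an uncompressed upper-bound style estimate.

```python
def recall_at_k(retrieved_ids: list[int], relevant_ids: list[int], k: int) -> float:
    """Fraction of the relevant log entries that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def index_bytes(n_entries: int, avg_tokens: int, dim: int, bytes_per_value: int) -> int:
    """Uncompressed size of a per-token embedding index."""
    return n_entries * avg_tokens * dim * bytes_per_value


# Hypothetical labelled query: entries 1 and 9 are the relevant ones.
recall = recall_at_k(retrieved_ids=[3, 1, 7], relevant_ids=[1, 9], k=2)

# ~100k entries (from the issue); the other three numbers are assumptions.
estimate = index_bytes(100_000, avg_tokens=32, dim=128, bytes_per_value=2)
print(recall, f"{estimate / 2**30:.2f} GiB")
```

Running the same labelled queries through FTS5 alone and through the combined pipeline would give a like-for-like comparison.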

Implementation sketch (if evaluation passes)

  • Add pylate (LightOn ColBERT library) to dependencies
  • app/services/search.py: add colbert_search(query, db_path) alongside fts_search()
  • app/services/diagnose.py (or wherever search is called): merge FTS5 + ColBERT results, deduplicate, pass merged list to llm.summarize()
  • Add turnstone.log_retrieve to assignments.yaml in cf-orch
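The merge and deduplicate step could look roughly like the following. The issue does not specify a merge strategy, so this sketch swaps in reciprocal rank fusion (RRF) as one common option. `merge_results`, the `SearchFn` signature, and the stand-in searchers are all hypothetical; the real `fts_search()` and `colbert_search()` would be adapted to return log row IDs, best first.

```python
from collections import defaultdict
from typing import Callable, Sequence

# Hypothetical signature: a searcher takes a query and returns row IDs, best first.
SearchFn = Callable[[str], Sequence[int]]


def merge_results(query: str, searchers: Sequence[SearchFn],
                  k: int = 60, limit: int = 20) -> list[int]:
    """Fuse ranked lists with reciprocal rank fusion; each row ID appears once."""
    scores: dict[int, float] = defaultdict(float)
    for search in searchers:
        # dict.fromkeys drops repeats within one list while keeping rank order.
        for rank, row_id in enumerate(dict.fromkeys(search(query)), start=1):
            scores[row_id] += 1.0 / (k + rank)
    # Highest fused score first; ties fall back to row ID for a stable order.
    return sorted(scores, key=lambda row_id: (-scores[row_id], row_id))[:limit]


# Stand-ins for fts_search() and colbert_search(); the IDs are made up.
def fake_fts(query: str) -> list[int]:
    return [10, 11, 12]


def fake_colbert(query: str) -> list[int]:
    return [12, 13]


merged = merge_results("why is my service restarting", [fake_fts, fake_colbert])
print(merged)
```

RRF only needs ranks, so it avoids having to normalize BM25 scores against ColBERT similarity scores. Row 12 rises to the top here because both retrievers returned it.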

FTS5 stays

Do not remove FTS5 — it handles exact log-level/source filters efficiently and is the right tool for structured field queries. ColBERT is additive.
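As an illustration of the structured queries FTS5 should keep handling, the sketch below uses FTS5 column filters. It assumes an FTS5-enabled SQLite build, and the `level`/`source`/`message` columns are a hypothetical schema that may differ from what app/services/search.py defines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema, for illustration only.
conn.execute(
    "CREATE VIRTUAL TABLE logs USING fts5(level, source, message, tokenize='porter')"
)
conn.executemany(
    "INSERT INTO logs(level, source, message) VALUES (?, ?, ?)",
    [
        ("error", "kernel", "EXT4-fs error: no space left on device"),
        ("info", "kernel", "eth0: link up"),
        ("error", "sshd", "Failed password for root"),
    ],
)

# Column filters restrict each term to one field: exact, cheap, no model needed.
rows = conn.execute(
    "SELECT source, message FROM logs WHERE logs MATCH ?",
    ("level:error AND source:kernel",),
).fetchall()
print(rows)
```

An embedding model adds nothing to a filter like this, which is why the two retrievers are complementary.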

Related

  • cf-orch model registry: agent-moderncolbert (already registered)
  • app/services/search.py (FTS5 implementation)
  • app/services/diagnose.py, app/services/llm.py
  • pagepiper ticket for same model integration
pyr0ball added this to the v1.0 milestone 2026-06-01 15:10:00 -07:00
Reference: Circuit-Forge/turnstone#18