Circuit-Forge/turnstone

feat(diagnose): 5-stage multi-agent diagnose pipeline (#29) #39

Merged

pyr0ball merged 17 commits from feat/29-multi-agent-diagnose into main

2026-05-25 19:59:35 -07:00

pyr0ball commented

2026-05-25 18:59:38 -07:00

Owner

Summary

Replaces the single-LLM summarize() call in diagnose_stream with a 5-stage ML pipeline, gated behind TURNSTONE_MULTI_AGENT_DIAGNOSE=true. Backward compatible: with the flag off the legacy path runs unchanged, the existing SSE events keep their shape (the new events are additive), and the Vue frontend requires no changes.

Pipeline Architecture

Stage 1 — TimelineReconstructor (pure Python)
  → clusters log entries by time window, detects bursts and gaps

Stage 2 — SeverityClassifier (ML → pattern_tags → regex fallback)
  → HuggingFace byviz/bylastic_classification_logs (DistilBERT)
  → degrades to tag-based then regex if model unavailable

Stage 3 — RootCauseHypothesizer (LLM + RAG context)
  → cf-orch task endpoint + OpenAI-compat fallback
  → returns [] when no LLM, never blocks pipeline

Stage 4 — FalsePositiveSuppressor (embedding cosine similarity)
  → sentence-transformers/all-MiniLM-L6-v2
  → compares hypotheses against resolved incidents in SQLite
  → full passthrough when no embedding model

Stage 5 — SummarySynthesizer (LLM + deterministic fallback)
  → produces final markdown summary
  → deterministic fallback when no LLM

New Files

app/services/diagnose/__init__.py — feature flag wiring + legacy path
app/services/diagnose/legacy.py — verbatim copy of old diagnose.py
app/services/diagnose/models.py — 5 frozen dataclasses
app/services/diagnose/pipeline.py — async generator orchestrator
app/services/diagnose/timeline.py — Stage 1
app/services/diagnose/classifier.py — Stage 2
app/services/diagnose/hypothesizer.py — Stage 3
app/services/diagnose/suppressor.py — Stage 4
app/services/diagnose/synthesizer.py — Stage 5
tests/test_diagnose_timeline.py — 15 tests
tests/test_diagnose_classifier.py — 10 tests
tests/test_diagnose_hypothesizer.py — 12 tests
tests/test_diagnose_suppressor.py — 10 tests (incl. borderline boundary)
tests/test_diagnose_synthesizer.py — 8 tests
tests/test_diagnose_pipeline.py — 13 tests

Test Results

372 passed (up from 303 baseline, +69 new tests)

Notable Design Decisions

All dataclasses are frozen=True with tuple fields (no mutable lists)
MULTI_AGENT_ENABLED is evaluated once at import time from the env var; there is no runtime reload, so a changed value takes effect on the next process start
ML model singletons are module-level, reset by autouse fixtures in tests
asyncio.to_thread used for synchronous ML inference to avoid blocking the event loop
New SSE events: pipeline_stage (4x) and hypotheses; existing summary/entries/reasoning/done events unchanged
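
As an illustration of the frozen-dataclass decision above, here is a minimal sketch. The class name HypothesisSketch and its exact field set are assumptions made for this example; only the frozen=True and tuple-field conventions come from the PR.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class HypothesisSketch:
    """Hypothetical stand-in for a pipeline model; not the real Hypothesis class."""

    id: str
    confidence: float
    # A tuple instead of a list: nothing can be appended after construction,
    # and the instance stays hashable.
    supporting_cluster_ids: tuple[str, ...] = ()


h = HypothesisSketch(id="h1", confidence=0.7, supporting_cluster_ids=("c1", "c2"))

try:
    h.confidence = 0.1  # any attribute assignment on a frozen instance fails
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Because every field is immutable, instances can be shared between stages (and between threads used for inference) without defensive copies.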

Fixes Caught in Code Review

HIGH (fixed): suppress_threshold semantics were inverted: the suppressor fired when similarity > 0.15 instead of >= 0.85. Fixed to suppress = max_sim >= similarity_threshold (parameter renamed from suppress_threshold). Added a borderline test.
MEDIUM (fixed): suppression_reason display guard was too loose in synthesizer — tightened to if rh.suppress and rh.suppression_reason

Follow-up Issues Filed

#33 — MappingProxyType for ClassifiedTimeline.cluster_severities
#34 — Remove dead suppression branch in synthesizer
#35 — Extract shared _call_llm helper
#36 — Per-stage error isolation in pipeline.py
#37 — Move format_context_block() inside legacy branch
#38 — Coerce supporting_cluster_ids to str

Activation

To enable after merge:

echo "TURNSTONE_MULTI_AGENT_DIAGNOSE=true" >> .env
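
For readers wondering how such a flag is typically consumed, here is a minimal sketch of an import-time read. The helper read_flag and the set of accepted truthy spellings are assumptions; the PR only confirms the variable name, the value true, and that the flag is off by default.

```python
import os
from typing import Mapping, Optional


def read_flag(name: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the variable holds a truthy spelling (assumed set)."""
    source = os.environ if env is None else env
    return source.get(name, "").strip().lower() in {"1", "true", "yes", "on"}


# Evaluated once, when the module is imported. Editing .env afterwards has
# no effect until the process is restarted.
MULTI_AGENT_ENABLED = read_flag("TURNSTONE_MULTI_AGENT_DIAGNOSE")
```

Passing an explicit mapping, for example read_flag("TURNSTONE_MULTI_AGENT_DIAGNOSE", {"TURNSTONE_MULTI_AGENT_DIAGNOSE": "true"}), keeps the parsing testable without touching the real environment.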

Closing #29.


pyr0ball added 17 commits 2026-05-25 18:59:38 -07:00

refactor: rename ingest → glean throughout codebase 12cd0a23d5

Renames the app/ingest/ package to app/glean/ and updates all
references across Python modules, shell scripts, Vue components,
tests, and documentation.

Intentionally preserved:
- SQLite column name ingest_time (avoids schema migration)
- RetrievedEntry.ingest_time field (maps to the column above)
- Any public-facing JSON keys that reference ingest_time

Changes by category:
- app/ingest/ → app/glean/ (full package move, all parsers)
- app/tasks/ingest_scheduler.py → app/tasks/glean_scheduler.py
- scripts/ingest_corpus.py → scripts/glean_corpus.py
- tests/test_ingest_*.py → tests/test_glean_*.py
- Docstrings, log messages, comments: ingest → glean
- Env var: TURNSTONE_INGEST_INTERVAL → TURNSTONE_GLEAN_INTERVAL
- Shell scripts: glean.log, glean_corpus.py references
- README.md: multi-source ingest → multi-source glean
- .env.example: updated env var name
- patterns/: new diagnostic patterns from 2026-05-20 SSH incident
  (service_crash_loop, pkg_daemon_restart, ssh_forward_conflict)
- SourcesView.vue: pipeline label updated
- All test import paths updated to app.glean.*

285 tests passing.

feat: SSH remote host glean — transport layer and pipeline integration (closes #22, backend) 81a9b0f49d

Adds SSH-based log collection from remote hosts via Paramiko.
One SSH connection per host, multiple log types per connection.

New files:
- app/glean/ssh.py: SSHTransport context manager + command builders
  for journald, syslog, plaintext, and docker log types
- tests/test_glean_ssh.py: 18 tests for transport layer (all mocked)
- tests/test_glean_pipeline_ssh.py: 15 tests for pipeline integration

Pipeline changes (app/glean/pipeline.py):
- glean_sources() now splits sources into local-file and SSH categories
- SSH sources use transport: ssh + glean: list schema in sources.yaml
- _glean_ssh_source(): one SSHTransport per host, N commands per connection
- _stream_and_write(): SSHCommandError caught per-item so one bad
  command does not abort the rest of the host's glean items
- SSHConnectionError skips the entire host with a warning log

SSH source schema (sources.yaml):
  - id: rack01
    transport: ssh
    host: 192.168.1.10
    user: admin
    key_path: ~/.ssh/id_ed25519
    glean:
      - type: journald
        args: [--since, 2 hours ago]
      - type: syslog
        path: /var/log/syslog
      - type: plaintext
        path: /var/log/app/error.log
      - type: docker
        containers: [myapp, nginx]

Key design decisions:
- Key-based auth only (no password prompts in daemon context)
- exit-status check fires after all stdout lines yielded; callers
  drain the iterator to trigger it
- Local file sources path unchanged; SSH sources co-exist in same yaml
- Docker multi-container: one exec_stream call per container,
  source_id scoped as host_id/type/container_name
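
The deferred exit-status check described above can be sketched without any SSH dependency. The function name exec_stream_sketch and its signature are hypothetical (the real exec_stream wraps a Paramiko channel); only the "raise after the last line is yielded" behaviour is taken from the commit message.

```python
from typing import Iterable, Iterator


class SSHCommandError(RuntimeError):
    """Local stand-in for the exception named in the commit message."""


def exec_stream_sketch(lines: Iterable[str], exit_status: int) -> Iterator[str]:
    """Yield every stdout line, then check the exit status."""
    for line in lines:
        yield line
    # This point is reached only after the caller has consumed every line,
    # which is why callers must drain the iterator to see a failure.
    if exit_status != 0:
        raise SSHCommandError(f"command exited with status {exit_status}")


received = []
failed = False
try:
    for line in exec_stream_sketch(["line 1", "line 2"], exit_status=1):
        received.append(line)
except SSHCommandError:
    failed = True
```

A caller that stops iterating early never reaches the check, so a partial read cannot report a non-zero exit.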

Remaining for #22: REST endpoint, SourcesView UI, sources.yaml docs.
285 → 285 tests passing (33 new SSH tests).

feat: SSH remote glean — transport layer, pipeline integration, REST + UI (#22) e746d55730

Closes turnstone#22.

## Transport layer (app/glean/ssh.py)
- SSHTransport context manager: key-only auth, paramiko backend
- SSHConnectionError / SSHCommandError exception hierarchy
- exec_stream() generator: yields stdout lines, raises SSHCommandError on
  non-zero exit (isinstance(int) guard for test-mock safety)
- Command builders: _build_journald_command, _build_syslog_command,
  _build_plaintext_command, _build_docker_command
- 18 unit tests in tests/test_glean_ssh.py

## Pipeline integration (app/glean/pipeline.py)
- _stream_and_write(): per-item error isolation — SSHCommandError skips
  one glean item without aborting the rest of the host connection
- _glean_ssh_source(): one SSHTransport per host, dispatches all glean
  items (journald/syslog/plaintext/docker); SSHConnectionError aborts host
- glean_sources(): splits local vs SSH sources; local → _glean_files();
  SSH → _glean_ssh_source(); shared compiled patterns and DB connection
- glean_ssh_source(): public wrapper for REST use — manages DB connection,
  pattern compilation, FTS rebuild lifecycle
- 15 integration tests in tests/test_glean_pipeline_ssh.py
- All 285 tests passing

## REST layer (app/rest.py)
- GET /api/sources/configured: reads sources.yaml and enriches with DB
  stats; SSH sources appear before first glean (entry_count=0); sub-source
  IDs (rack01/journald, rack01/docker/myapp) aggregated per host entry
- POST /api/sources/{id}/glean: detects transport:ssh and dispatches to
  glean_ssh_source() wrapper; local sources unchanged
- Import: glean_ssh_source as _glean_ssh_source

## Frontend (web/src/views/SourcesView.vue)
- Fetches /api/sources/configured (primary) + /api/sources (DB-only) in
  parallel; merges into unified SourceRow list
- SSH sources show: ssh badge (with user@host tooltip), glean-type pills
  (journald/syslog/docker/etc.), host subtitle
- SSH sub-source IDs (rack01/journald) suppressed from the DB-only list
  since they are covered by the parent SSH row
- DB-only sources (uploads) appear below configured sources with 'uploaded'
  badge; reglean button disabled (not in sources.yaml)
- Delete zeroes out configured-source stats in-place rather than removing
  the row (so the source remains visible for re-gleaning)

feat: fingerprint-based incremental glean — skip unchanged files (#30) 2fde3a1814

- Add glean_fingerprints table to schema (sha256 + mtime + size)
- _fingerprint(), _fp_unchanged(), _save_fingerprint() helpers in pipeline.py
- _glean_files() now checks fingerprint; skips file if hash unchanged
- force=True param threads through glean_dir → glean_file → glean_sources
- POST /api/tasks/glean and POST /api/sources/{id}/glean accept force=true
- 14 unit tests in tests/test_glean_fingerprint.py, all passing
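
A minimal sketch of the fingerprint check, assuming the comparison is made on the SHA-256 digest as the commit message states. The names Fingerprint, fingerprint, and is_unchanged are illustrative and are not the private helpers in pipeline.py.

```python
import hashlib
from pathlib import Path
from typing import NamedTuple, Optional


class Fingerprint(NamedTuple):
    sha256: str
    mtime: float
    size: int


def fingerprint(path: Path) -> Fingerprint:
    """Hash the file in chunks and record mtime and size alongside the digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    stat = path.stat()
    return Fingerprint(digest.hexdigest(), stat.st_mtime, stat.st_size)


def is_unchanged(path: Path, stored: Optional[Fingerprint], force: bool = False) -> bool:
    """True when the file can be skipped; force=True always re-gleans."""
    if force or stored is None:
        return False
    return fingerprint(path).sha256 == stored.sha256
```

Comparing the digest rather than mtime alone means a file that was touched but not modified is still skipped.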

Closes: #30

refactor: extract embeddings service layer — decouple context embedder from Ollama 5f32a6678d

- New app/services/embeddings.py: TURNSTONE_EMBED_* env vars, multi-backend support
- embedder.py delegates to service layer; re-exports EMBEDDING_AVAILABLE for compat
- retriever.py updated to use service layer
- Test coverage updated in tests/context/test_embedder.py

refactor: convert diagnose module to package for multi-agent pipeline (issue #29) 664ab50433

- Move app/services/diagnose.py verbatim to app/services/diagnose/legacy.py
- Create app/services/diagnose/__init__.py with full implementation so that
  patch('app.services.diagnose._HAS_DATEPARSER') targets the correct namespace
  and all 303 existing tests continue to pass without modification
- Add app/services/diagnose/models.py with 5 pipeline dataclasses:
  EventCluster, TimelineResult, ClassifiedTimeline, Hypothesis, RankedHypothesis
- Add app/services/diagnose/pipeline.py with run_pipeline() stub (Task 6)
- Add MULTI_AGENT_ENABLED feature flag (off by default via env var)
- Zero behavior change; ruff clean

Closes: #29

fix: frozen dataclasses, clean __all__, improve exception logging in diagnose package 959a6cbf1c

feat: Stage 1 — TimelineReconstructor for multi-agent diagnose pipeline (issue #29) 7cff98b1c3

- Add app/services/diagnose/timeline.py: pure-Python TimelineReconstructor
  - Sorts entries by timestamp_iso (None entries appended at end)
  - Sliding-window clustering anchored to first entry in each cluster
  - Computes cluster_id (sha1[:12]), severity (highest wins), burst flag,
    gap_before_seconds, representative_text (highest rank, longest text tiebreak)
  - Builds TimelineResult with dominant_sources sorted by entry count descending
- Update pipeline.py stub to import TimelineReconstructor (Task 6 wiring prep)
- Add tests/test_diagnose_timeline.py: 15 tests covering all 13 required cases
  plus null-timestamp edge case variant; all 318 tests passing
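
The anchored sliding window can be sketched as follows. The function name, the 60-second default, and the inclusive boundary are assumptions for illustration; the commit only states that each window is anchored to the first entry of its cluster.

```python
from datetime import datetime, timedelta


def group_by_window(
    timestamps: list[datetime], window_seconds: float = 60.0
) -> list[list[datetime]]:
    """Group sorted timestamps; each cluster spans at most window_seconds
    measured from its first entry, not from the previous entry."""
    clusters: list[list[datetime]] = []
    for ts in sorted(timestamps):
        anchor = clusters[-1][0] if clusters else None
        if anchor is not None and (ts - anchor).total_seconds() <= window_seconds:
            clusters[-1].append(ts)
        else:
            clusters.append([ts])  # ts becomes the anchor of a new cluster
    return clusters


base = datetime(2026, 5, 20, 12, 0, 0)
stamps = [base + timedelta(seconds=s) for s in (0, 30, 59, 70, 200)]
sizes = [len(c) for c in group_by_window(stamps)]
```

Anchoring to the first entry keeps a slow trickle of events from chaining into one unbounded cluster: the entry at 70 s is only 11 s after its predecessor but still starts a new cluster.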

Closes: #29

refactor: split TimelineReconstructor.reconstruct into helpers, fix magic number + error handling 3b04c81a2b

- Add gap_significance_seconds constructor param (default 30) to replace hardcoded magic number in gap_count computation
- _parse_iso now returns datetime | None with try/except on ValueError; all callers handle None return by treating malformed timestamps as absent
- Extract reconstruct into four private helpers: _sort_entries, _group_into_raw_clusters, _build_cluster, _dominant_sources_tuple
- Promote _sort_key to module-level function (was nested inside reconstruct)
- Rename old module-level _build_cluster to _make_event_cluster to avoid name collision with new instance method
- Add explanatory comment to type: ignore[arg-type] at _highest_severity call site
- Black-formatted

feat: Stage 2 — SeverityClassifier for multi-agent diagnose pipeline (issue #29) 912ba7ac16

Three-path classification: ML (transformers pipeline, lazy singleton) →
pattern_tags (YAML pattern severity dict) → regex (detect_severity).

- Path A: HF text-classification pipeline loaded lazily on first classify()
  call via module-level singleton; shim promotes ERROR+keyword hits to CRITICAL
  and demotes low-confidence INFO to DEBUG.
- Path B: maps cluster.pattern_tags through the loaded pattern severity dict;
  picks the highest severity across matching tags.
- Path C: falls back to detect_severity() regex scan on representative_text;
  defaults to INFO when no keyword matches.
- Pattern file resolved from constructor arg or TURNSTONE_PATTERNS env var
  (mirrors app/rest.py convention).
- No crash when transformers is not installed; ImportError on per-cluster ML
  inference triggers clean per-cluster fallback to pattern_tags/regex.
- ClassifiedTimeline.classifier_used reflects the primary session path.
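
The three-path degradation above might look roughly like this. Everything here is a simplified sketch: classify, regex_severity, the keyword regex, and the callable ml_model are hypothetical stand-ins, and the CRITICAL promotion and DEBUG demotion shim is omitted.

```python
import re
from typing import Callable, Mapping, Optional, Sequence

SEVERITY_RANK = {"DEBUG": 0, "INFO": 1, "WARN": 2, "ERROR": 3, "CRITICAL": 4}
_KEYWORDS = re.compile(r"\b(critical|error|warn(?:ing)?)\b", re.IGNORECASE)


def regex_severity(text: str) -> str:
    """Path C: keyword scan, defaulting to INFO when nothing matches."""
    match = _KEYWORDS.search(text)
    if match is None:
        return "INFO"
    word = match.group(1).lower()
    return {"critical": "CRITICAL", "error": "ERROR"}.get(word, "WARN")


def classify(
    text: str,
    pattern_tags: Sequence[str],
    tag_severity: Mapping[str, str],
    ml_model: Optional[Callable[[str], str]] = None,
) -> tuple[str, str]:
    """Return (severity, path_used), trying ML, then tags, then regex."""
    if ml_model is not None:
        try:
            return ml_model(text), "ml"  # Path A
        except ImportError:
            pass  # model backend missing: fall through for this cluster
    matched = [tag_severity[t] for t in pattern_tags if t in tag_severity]
    if matched:  # Path B: highest severity across matching tags
        return max(matched, key=SEVERITY_RANK.__getitem__), "pattern_tags"
    return regex_severity(text), "regex"
```

Returning the path alongside the severity is one way to populate a field like classifier_used without a second pass.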

Tests (10 new, 328 total, all passing):
- ML ERROR, CRITICAL promotion, DEBUG demotion, WARNING→WARN
- pattern_tags resolution from YAML fixture
- regex ERROR detection and INFO default
- ImportError clean fallback
- empty timeline no-crash
- ClassifiedTimeline FrozenInstanceError on mutation

Closes: #29

feat: Stage 3 — RootCauseHypothesizer for multi-agent diagnose pipeline (issue #29) eefd65f903

- Add app/services/diagnose/hypothesizer.py with RootCauseHypothesizer class
- Stage 3 of the multi-agent diagnose pipeline: accepts ClassifiedTimeline +
  RetrievedContext, builds a structured JSON prompt, calls the LLM via the
  same cf-orch task → OpenAI-compat fallback pattern used by llm.py
- Parses JSON array response into list[Hypothesis] dataclasses with UUID ids,
  severity validation (WARNING→WARN, unknown→ERROR), confidence coercion
- Gracefully returns [] when llm_url/llm_model absent or clusters empty
- Add tests/test_diagnose_hypothesizer.py: 12 tests, all mocked, no LLM I/O
  covering: valid response, UUID generation, malformed JSON, non-list JSON,
  empty clusters, missing URL/model, max_hypotheses cap, severity mapping,
  confidence string coercion
- 340 tests passing (328 prior + 12 new)

Closes: #29

fix: defensive coercion for LLM confidence and cluster fields in hypothesizer e8c66972fa

- Add _coerce_float() module-level helper: catches TypeError/ValueError from
  non-numeric LLM output (e.g. 'high', 'N/A') and returns a caller-supplied
  default instead of raising.
- Replace float(item.get('confidence', 0.5)) with
  _coerce_float(item.get('confidence'), 0.5) in _parse_response.
- Guard supporting_cluster_ids: tuple(item.get(...) or []) so a JSON null
  from the LLM does not cause TypeError('NoneType is not iterable').
- runbook_refs is hardcoded as () and not sourced from LLM output; no change
  needed there.
- Add test_non_numeric_confidence_uses_default (Test 10) to cover the 'high'
  string case: asserts no exception and confidence == 0.5.
- 341 tests passing (+1).
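
A minimal sketch of the two guards described in this commit. The helper name coerce_float mirrors the described _coerce_float, and the sample item dict is invented for the example.

```python
from typing import Any


def coerce_float(value: Any, default: float) -> float:
    """Convert LLM output to float, returning default on non-numeric input."""
    try:
        return float(value)
    except (TypeError, ValueError):
        # TypeError covers None (missing key); ValueError covers 'high', 'N/A'.
        return default


# Hypothetical parsed LLM item with a non-numeric confidence and a JSON null.
item = {"confidence": "high", "supporting_cluster_ids": None}

confidence = coerce_float(item.get("confidence"), 0.5)
# "or []" turns a JSON null into an empty sequence before tuple() sees it.
cluster_ids = tuple(item.get("supporting_cluster_ids") or [])
```

Note that item.get("supporting_cluster_ids", []) alone would not help here: the key exists, so the default is never used and None would still reach tuple().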

Closes: #29

feat: Stage 4 — FalsePositiveSuppressor for multi-agent diagnose pipeline (issue #29) 174cb126e6

- Implements FalsePositiveSuppressor using embedding cosine similarity
- Lazy corpus embedding via get_embedder() with module-level cache keyed by db_path
- Cache invalidated automatically when the resolved incident corpus changes
- Suppresses hypotheses with novelty_score below configurable threshold (default 0.85)
- Full fallback path (novelty=1.0, no suppression) when model_id empty, embedding
  service unavailable, or no resolved incidents found in DB
- Graceful handling of missing incidents table and DB query failures
- Numpy bool_ leakage prevented by explicit float()/bool() coercion at assignment
- Pure-Python cosine fallback for environments without numpy
- 9 new tests (all mocked, no real model downloads): passthrough, suppress, no-suppress,
  empty list, ranking, empty corpus, DB failure, service unavailable, cache invalidation
- 350 total tests passing (341 pre-existing + 9 new)
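
A pure-Python cosine similarity of the kind mentioned above could be written as below. This is a generic sketch, not the suppressor's code; returning 0.0 for a zero vector is an assumption.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors, without numpy."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention: a zero vector matches nothing
    # The result is a plain Python float, so no numpy scalar types can leak.
    return dot / (norm_a * norm_b)
```

For example, cosine_similarity([1.0, 0.0], [0.0, 1.0]) is 0.0 for orthogonal vectors, and identical unit vectors give 1.0.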

Closes: #29

refactor: extract _score_hypothesis helper, fix exception types, pass device in suppressor 9bfae16b54

feat: Stage 5 synthesizer + pipeline orchestrator + feature flag wiring (issue #29) 8cbd981ec7

- Add app/services/diagnose/synthesizer.py: SummarySynthesizer (Stage 5)
  - Builds structured LLM prompt from ranked hypotheses, timeline, RAG context
  - Excludes suppressed hypotheses from the narrative prompt
  - Deterministic fallback when no LLM configured or LLM call fails
  - Same cf-orch task endpoint + direct OpenAI-compat fallback pattern as other stages

- Replace pipeline.py stub with full run_pipeline() async generator
  - Orchestrates all 5 stages via asyncio.to_thread for each synchronous stage
  - Yields typed SSE event dicts: status, pipeline_stage (1-4), hypotheses, reasoning, done
  - Suppressor counts (active vs suppressed) reported in stage 4 event message

- Wire MULTI_AGENT_ENABLED feature flag into diagnose_stream()
  - TURNSTONE_MULTI_AGENT_DIAGNOSE=true routes through run_pipeline()
  - pipeline emits its own done event; legacy path unchanged when flag is false
  - Import of run_pipeline added to __init__.py

- Add 21 new tests (350 -> 371 passing):
  - tests/test_diagnose_synthesizer.py: 8 tests (with/without LLM, suppressed,
    empty ranked, LLM failure fallback)
  - tests/test_diagnose_pipeline.py: 13 tests (flag off, flag on event sequence,
    empty entries, no LLM, stage 1 cluster count message)
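
The orchestration pattern (an async generator that runs each synchronous stage via asyncio.to_thread and yields event dicts) can be sketched as follows. The function name, the event dict keys, and the toy stages are assumptions; the real run_pipeline emits more event types and passes typed models between stages.

```python
import asyncio
from typing import Any, AsyncIterator, Callable, Sequence

Stage = tuple[str, Callable[[Any], Any]]


async def run_pipeline_sketch(
    entries: Any, stages: Sequence[Stage]
) -> AsyncIterator[dict]:
    """Run synchronous stages off the event loop, yielding one event per stage."""
    yield {"type": "status", "message": "starting"}
    data = entries
    for number, (name, fn) in enumerate(stages, start=1):
        # to_thread keeps blocking work (e.g. model inference) off the loop.
        data = await asyncio.to_thread(fn, data)
        yield {"type": "pipeline_stage", "stage": number, "message": name}
    yield {"type": "done", "result": data}


async def collect() -> list[dict]:
    stages: list[Stage] = [("sort", sorted), ("reverse", lambda xs: xs[::-1])]
    return [event async for event in run_pipeline_sketch([3, 1, 2], stages)]


events = asyncio.run(collect())
```

Because the generator yields between stages, an SSE handler can forward each event to the client as soon as that stage finishes rather than after the whole pipeline completes.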

Closes: #29

fix: tighten suppression_reason display guard, document unused since/until params 255c9111d4

fix: invert suppress_threshold semantics to similarity_threshold in FalsePositiveSuppressor 86361f6c79

Was suppressing when novelty_score < 0.85 (i.e. similarity > 0.15), which
would suppress nearly every hypothesis once embeddings are active.

Now suppresses when max_sim >= similarity_threshold (0.85), meaning only
hypotheses that are 85%+ similar to a resolved incident are suppressed.

Also renames suppress_threshold → similarity_threshold for clarity and
adds a borderline boundary test (0.85 suppressed, 0.84 not suppressed).
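
The corrected rule and the boundary case can be sketched like this. The function should_suppress is illustrative; the relation novelty = 1 - max_sim is inferred from the commit's own arithmetic (novelty < 0.85 equating to similarity > 0.15).

```python
from typing import Sequence


def should_suppress(
    similarities: Sequence[float], similarity_threshold: float = 0.85
) -> tuple[bool, float]:
    """Return (suppress, novelty_score) for one hypothesis.

    similarities holds the cosine similarity against each resolved incident;
    an empty sequence means full passthrough (novelty 1.0, no suppression).
    """
    max_sim = max(similarities, default=0.0)
    novelty = 1.0 - max_sim
    return max_sim >= similarity_threshold, novelty


# Old, inverted rule for comparison: a hypothesis only 20% similar to any
# resolved incident has novelty 0.8, which is below 0.85, so it was suppressed.
old_rule_suppresses = (1.0 - 0.2) < 0.85

at_boundary = should_suppress([0.10, 0.85])[0]
just_below = should_suppress([0.10, 0.84])[0]
```

Under the corrected rule the same 20%-similar hypothesis passes through, and only the 0.85 case is suppressed.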

Closes: #29