Compare commits

139 commits

Author SHA1 Message Date
db359d35b2 fix(search): qualify ambiguous column names with table alias in FTS JOIN
Both log_fts and log_entries have timestamp_iso, severity, source_id, and
matched_patterns columns. After the JOIN, unqualified references to any of
these caused SQLite to raise 'ambiguous column name', silently falling back
to the non-FTS scan path on every time-filtered or severity-filtered query.

Prefix all filter conditions that touch FTS-mirror columns with f. to
resolve the ambiguity. The e. prefix on tenant_id was already correct since
tenant_id is not present in the FTS virtual table.
2026-06-17 11:27:38 -07:00
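
The ambiguity fixed in db359d35b2 can be reproduced in a few lines. This is a minimal sketch, not the project's query: two ordinary tables stand in for log_fts and log_entries (so it runs even where FTS5 is not compiled in), and the simplified schemas and the entry_id join column are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Plain tables stand in for the FTS mirror (alias f) and the base table (alias e).
# Both carry a severity column, which is what makes the bare name ambiguous.
conn.executescript("""
    CREATE TABLE log_entries (id INTEGER PRIMARY KEY, severity TEXT, tenant_id TEXT);
    CREATE TABLE log_fts (entry_id INTEGER PRIMARY KEY, severity TEXT);
    INSERT INTO log_entries VALUES (1, 'ERROR', '');
    INSERT INTO log_fts VALUES (1, 'ERROR');
""")

JOIN = "FROM log_fts f JOIN log_entries e ON e.id = f.entry_id"


def run(where: str):
    """Run the joined query with the given WHERE clause."""
    return conn.execute(f"SELECT e.id {JOIN} WHERE {where}", ("ERROR",)).fetchall()


try:
    run("severity = ?")          # unqualified: SQLite cannot pick a table
    ambiguous = False
except sqlite3.OperationalError as exc:
    ambiguous = "ambiguous column name" in str(exc)

rows = run("f.severity = ?")     # qualified with the f. alias: resolves cleanly
```

If the caller catches OperationalError and falls back to another code path, the failure stays invisible, which matches the silent fallback the commit describes.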
04013757e7 chore: bump version to v0.7.0
Beta milestone complete: all open beta tickets closed.
2026-06-17 09:41:10 -07:00
5da8db2bcd fix(diagnose): pass full timeline clusters and hypothesis descriptions to synthesizer LLM
Stage 5 (SummarySynthesizer) was only sending aggregate timeline stats to the
LLM (cluster count, burst count, gap count) — the actual sequenced cluster data
that Stage 1 reconstructed was never included. The LLM had no per-cluster
timestamps, severity, burst flags, silence gaps, or representative text to
write the TIMELINE section from.

Added _build_timeline_block() to emit a numbered per-cluster summary matching
the format Stage 3 uses for the hypothesizer, and included it in the user
message alongside the hypothesis block.

Also fixed _build_hypothesis_block() to include the 2-4 sentence description
each hypothesis carries — previously only the title and novelty score reached
the LLM.

11 new tests cover _build_timeline_block() directly (burst label, gap threshold,
pattern tags, text truncation at 200 chars, null start_iso, multi-cluster
numbering). 529 tests passing.
2026-06-16 21:46:01 -07:00
4c1940d12e fix: strip reasoning-model thinking tags; surface untracked node names
- app/services/diagnose/_llm_client.py: strip <think>…</think> blocks
  (case-insensitive, multiline) from LLM response content before it
  reaches the UI or any JSON parser — affects DeepSeek-R1, Qwen QwQ,
  and any other model that emits chain-of-thought in content
- app/rest.py: suggest_sources now also returns untracked_names — query
  tokens that look like hostnames/service names but don't appear in any
  monitored source, so the UI can prompt the user to add them
- web/src/components/ChatDiagnose.vue: show amber "Not monitoring: X"
  banner with "Add as a log source →" link when untracked_names present
- tests/test_llm_client.py: 13 tests covering think-strip edge cases
  (single/multi-line, multiple blocks, case-insensitive, only-thinking)
  plus existing extract_content and JSON-fence helpers
2026-06-16 09:42:44 -07:00
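
The think-tag stripping in 4c1940d12e could look roughly like the sketch below. The function name strip_thinking and the exact regex are assumptions; the commit only states that the match is case-insensitive and spans multiple lines.

```python
import re

# DOTALL lets "." cross newlines; the non-greedy ".*?" keeps separate
# <think> blocks from being merged into one giant match.
_THINK_RE = re.compile(r"<think>.*?</think>", re.IGNORECASE | re.DOTALL)


def strip_thinking(content: str) -> str:
    """Remove every <think>...</think> block and trim surrounding whitespace."""
    return _THINK_RE.sub("", content).strip()
```

Running this before JSON parsing means a response such as `<think>...</think>{"ok": true}` reaches the parser as plain JSON.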
6039ab2464 feat: incident ticket export — Notion and Jira integration (#12)
- app/services/ticket_export.py: plugin-dispatch architecture; Notion
  exporter (Notion API v1, blocks-based, 50 entry cap, 2000-char
  truncation per block); Jira exporter (REST API v3, Basic Auth, ADF
  description, configurable issue type defaulting to Bug)
- app/rest.py: POST /api/incidents/{id}/export endpoint; Notion/Jira
  credential fields added to SettingsBody and PATCH /api/settings handler
- web/src/views/IncidentsView.vue: "Export ticket ▾" dropdown in
  incident detail drawer — click-outside close, inline URL link on success
- web/src/views/SettingsView.vue: Ticket Trackers section with Notion
  token + database ID, Jira URL/email/token/project/issue-type; show/hide
  for secret fields
- tests/test_ticket_export.py: 17 tests covering dispatch, Notion
  success/error/config/payload/truncation paths, Jira success/error/
  auth/project/summary/default-issue-type
2026-06-14 15:46:11 -07:00
b8f766fb74 feat: SSH target manager — GUI editor for remote host configuration (#24)
- app/services/ssh_targets.py: full CRUD service with lazy paramiko
  import, key-path validation, permission warning, and test_connection
- app/db/schema.py: ssh_targets table (id, label, host, port, user,
  key_path, last_tested, last_ok, last_error, timestamps)
- app/rest.py: GET/POST /api/ssh-targets, PATCH/DELETE /{id},
  POST /{id}/test — key contents never returned in any response
- web/src/views/SettingsView.vue: Remote Hosts section with add/edit
  form, inline connection status badges, test-connection flow, delete
  with confirmation; new Set() pattern for reactive sshTesting state
- tests/test_ssh_targets.py: 22 tests — schema, CRUD, validation,
  key-warning, serialization, paramiko-absent path
2026-06-14 15:27:12 -07:00
7a2ab0bb46 feat(orchard): auto-enrollment API for branch node provisioning (#27)
Implements the Orchard branch grafting system for harvest.circuitforge.tech:

- POST /api/orchard/graft: provisions data dir, starts a new
  turnstone-submissions-<slug> Docker container on the next free port
  (ORCHARD_PORT_BASE=8538+), injects a handle_path block into the
  Caddyfile dynamic-branches marker section, restarts caddy-proxy,
  returns {submit_endpoint, api_key}
- GET /api/orchard/branches: list active/inactive branches (admin-only)
- DELETE /api/orchard/branches/<slug>: deactivate branch + stop container
- POST /api/orchard/branches/<slug>/anonymize: HMAC-based IP/username
  pseudonymization worker over a branch DB
- POST /api/glean/batch: optional TURNSTONE_BRANCH_KEY auth guard
- anonymized column added to log_entries schema (migration-safe)
- Updated Caddyfile with /huginn/* route (port 8536), /node2/* (8537),
  and dynamic-branch marker section
- All endpoints admin-gated via TURNSTONE_ORCHARD_ADMIN_KEY

Closes: #27
2026-06-14 14:30:18 -07:00
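
HMAC-based pseudonymization, as mentioned for the anonymize endpoint in 7a2ab0bb46, can be sketched as follows. The function name, the SHA-256 digest, the 16-character truncation, and the "anon" prefix are all assumptions made for illustration; the commit does not describe the worker's actual output format.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes, prefix: str = "anon") -> str:
    """Map an IP or username to a stable token that cannot be reversed without the key."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncation keeps tokens short; this length is an arbitrary choice for the sketch.
    return f"{prefix}-{digest[:16]}"
```

Because the mapping is keyed and deterministic, the same IP always yields the same token within a branch, so correlation across log lines survives while the raw value does not.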
600e5a9eac feat(sources): context-aware filesystem log scanner (#23)
Add scan_log_directories() to discover.py that recursively walks
/var/log and /opt, filters to readable log files, and scores each
candidate by recency (mtime, 0.7 weight), file size (0.3), and
keyword match against an optional problem-context query (shifts
weights to 0.4/0.2/0.4 when a query is provided).

- GET /api/setup/scan?query=...&max_results=N — new API endpoint
- SourcesView: "Scan" button opens a panel with ranked candidates,
  checkboxes, and "Add selected" to write to sources.yaml
- 13 new unit tests, 466 passing total

Closes: #23
2026-06-14 14:01:45 -07:00
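
The weighting scheme from 600e5a9eac can be written out as a small function. This sketch assumes each component has already been normalized to [0, 1] and that the 0.4/0.2/0.4 weights are in recency/size/keyword order; the commit does not say how scan_log_directories() normalizes mtime or file size.

```python
def score_candidate(recency: float, size: float, keyword: float = 0.0,
                    has_query: bool = False) -> float:
    """Combine normalized components into one ranking score.

    Without a query: 0.7 * recency + 0.3 * size.
    With a query:    0.4 * recency + 0.2 * size + 0.4 * keyword.
    """
    if has_query:
        w_recency, w_size, w_keyword = 0.4, 0.2, 0.4
    else:
        w_recency, w_size, w_keyword = 0.7, 0.3, 0.0
    return w_recency * recency + w_size * size + w_keyword * keyword
```

Both weight sets sum to 1.0, so scores stay comparable whether or not a problem-context query is supplied.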
7ed01fbd48 chore: sanitize contributor names and personal node IDs
- docker-compose.submissions.yml: rename submissions-contrib1/contrib2
  to submissions-contrib1/contrib2; update paths and host env vars
- podman-standalone.sh: replace 'Contributor's instance' with generic
  'WireGuard-connected Docker hosts'
- docker-standalone.sh: replace personal node-id in harvest endpoint
2026-06-13 22:17:38 -07:00
58680b3b27 chore: replace vendor product name with generic ext_device throughout
- Rename _EXT_DEVICE_CODES → _EXT_DEVICE_CODES, gen_ext_device → gen_ext_device
- Rename corpus output directory ext_device/ → ext_device/
- Update default.yaml placeholder pattern name and description
- Update tests to match new directory and class names
- Corresponding Forgejo issue titles updated (#43, #44, #54)
2026-06-13 22:03:26 -07:00
be134a4465 chore: replace personal node-id in harvest endpoint example
2026-06-13 21:58:22 -07:00
8006d79a11 Merge feat/42-50-postgres-multitenant: dual-backend + full feature set
Brings in 18 commits since v0.6.2:
- Dual-backend SQLite/Postgres + multi-tenant source namespacing
- Anomaly scoring pipeline + cybersec zero-shot scoring
- Security alerts tab — full scorer integration
- Audio domain patterns (PipeWire/ALSA xrun, quantum)
- Incidents: auto-incident detection, timeline visualizer
- Diagnose: conversational chat mode, NL source discovery
- Corpus: synthetic log generator, watermark-preserving updates
- UI: security alert dedup/collapse, clickable criticals with inline
  LLM explanation, loading shimmer animations, default diagnose prompt
- Backend: DB-lock retry in anomaly scorer, FTS build via get_conn(),
  timeline_events in stats_summary
- Sanitize: internal hostnames and IPs replaced with generic placeholders
2026-06-13 10:02:59 -07:00
7c76217149 chore: sanitize internal hostnames and IP references
- Rename patterns/sources-example-node.yaml → patterns/sources-example.yaml
  and update header/comments to be host-agnostic
- Replace internal node names in gen_corpus.py _HOSTS with generic names
- Replace example-node hostname in syslog test fixtures with testhost
- Replace example-node example in mcp_server.py doc with myserver
- Replace private LAN IP (<YOUR_HOST_IP>) in docker-standalone.sh with
  <HEIMDALL_LAN_IP> placeholder
- Replace private IPs in sources-cluster.yaml comments with <YOUR_HOST_IP>
- Remove instance-specific hostname from llm.py fallback comment
- Replace Caddy example domain in podman-standalone.sh with placeholder
2026-06-13 10:02:46 -07:00
502ff54fd0 feat(ui): security alert dedup, clickable criticals, loading shimmer
Security Alerts:
- Client-side duplicate collapsing via anomaly_label + text fingerprint
- ×N count badge chip on collapsed rows; toggle to expand
- Skeleton shimmer rows replace "Loading..." text

Dashboard:
- Clickable Recent Criticals — inline LLM explanation via SSE stream
- ±5 min time window scoped to source_id for useful context
- Explanation cache keyed by entry_id (no re-fetch on re-expand)
- Default diagnose query injected on Diagnose button navigation to
  prevent local models hallucinating from bare log data
- Stat card and source-health skeleton shimmer loading states

Backend:
- anomaly.py: 4-attempt retry on "database is locked" with 10s backoff
- search.py: migrate build_fts_index to get_conn() (WAL race fix);
  add timeline_events to stats_summary for clickable criticals feature
- theme.css: @keyframes shimmer + .loading-shimmer utility;
  prefers-reduced-motion degrades gracefully to static muted block
2026-06-13 09:32:26 -07:00
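
The "database is locked" retry added to anomaly.py in 502ff54fd0 might be shaped like this. The helper name with_lock_retry is hypothetical, and the sketch assumes a fixed 10 s pause between attempts; the commit does not say whether the backoff grows.

```python
import sqlite3
import time


def with_lock_retry(fn, attempts: int = 4, backoff_s: float = 10.0, sleep=time.sleep):
    """Call fn(), retrying only when SQLite reports a locked database."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except sqlite3.OperationalError as exc:
            # Re-raise unrelated errors immediately, and give up after the last attempt.
            if "database is locked" not in str(exc) or attempt == attempts:
                raise
            sleep(backoff_s)
```

The injectable sleep argument exists only so the behaviour can be tested without waiting.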
f3d807d991 feat(diagnose): conversational chat mode + NL source discovery
- New ChatDiagnose.vue: multi-turn chat UI in the Diagnose tab
  - Textarea input (auto-grows) for long free-form problem descriptions
  - Source suggestion pre-flight: debounced POST /api/sources/suggest
    identifies relevant log sources from the query text and shows them
    as interactive chips (deselect to exclude before searching)
  - Conversation history preserved across turns with LLM reasoning,
    collapsible log entries, and "Save as incident" per turn
  - Reuses existing /api/diagnose/stream — no new pipeline
- DiagnoseView.vue: Chat is now default tab; viewport-height layout
- POST /api/sources/suggest: token-overlap source ranking, no LLM
- Fix: add missing 'import re' causing 500 on suggest route
2026-06-11 22:04:53 -07:00
b6b69e2150 feat(incidents): auto-incident detection + example-node Podman setup
Auto-incident detector:
- New app/tasks/incident_detector.py: post-glean error cluster detector
  - Sliding window algorithm: source + N errors within window_s seconds
  - Deduplication via issue_type='auto:{source_id}' + interval overlap check
  - Respects TURNSTONE_AUTO_INCIDENT_THRESHOLD (default 5) and
    TURNSTONE_AUTO_INCIDENT_WINDOW (default 600s) env vars
  - 20 tests all passing
- Wired into glean_scheduler.run_once() and scheduler_loop()
- TURNSTONE_AUTO_INCIDENT env var to disable (default enabled)

Podman standalone improvements:
- REPO_DIR auto-detected from script location (no longer hardcoded to /opt/turnstone)
- DATA_DIR/PATTERNS_DIR/HF_CACHE_DIR configurable via env vars
- Bootstrap step copies host-specific sources-<hostname>.yaml on first run
- Auto-incident env vars passed through

example-node sources:
- patterns/sources-example-node.yaml: Sonarr, Radarr, Bazarr, Prowlarr,
  Tautulli, autoscan, organizr, nextcloud, journal export
2026-06-11 18:37:53 -07:00
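
The sliding-window rule in b6b69e2150 (N errors from one source within window_s seconds) can be sketched as below. This is an illustration only: timestamps are plain epoch seconds for a single source, and the in-memory interval merge stands in for the database-backed deduplication the commit describes.

```python
from collections import deque


def find_error_bursts(timestamps, threshold: int = 5, window_s: float = 600):
    """Return merged (start, end) intervals where >= threshold errors fall within window_s."""
    window = deque()
    bursts = []
    for ts in sorted(timestamps):
        window.append(ts)
        # Drop errors that are now older than the window.
        while ts - window[0] > window_s:
            window.popleft()
        if len(window) >= threshold:
            start, end = window[0], ts
            if bursts and start <= bursts[-1][1]:
                # Overlaps the previous burst: extend it instead of opening a new one.
                bursts[-1] = (bursts[-1][0], end)
            else:
                bursts.append((start, end))
    return bursts
```

The defaults mirror TURNSTONE_AUTO_INCIDENT_THRESHOLD (5) and TURNSTONE_AUTO_INCIDENT_WINDOW (600s) from the commit message.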
74c9de9ccf fix(corpus): glean_dir now recurses subdirs; fix docker SOURCE prefix
- Changed glob → rglob in glean_dir so corpus directories with format
  subfolders (journald/, docker/, etc.) are fully ingested
- Fixed gen_corpus.py docker SOURCE to emit "docker:<service>" prefix
  so the pipeline correctly detects format as 'docker' not 'plaintext'
- 17/17 gen_corpus tests passing

Closes: #46
2026-06-11 16:30:28 -07:00
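
The glob to rglob change in 74c9de9ccf is easy to see with a throwaway directory. The "*.log" pattern and the file names are assumptions for the demo; the commit does not state which pattern glean_dir uses.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "journald").mkdir()
    (root / "top.log").write_text("a\n")
    (root / "journald" / "nested.log").write_text("b\n")

    # glob() only matches direct children of root.
    flat = sorted(p.name for p in root.glob("*.log"))
    # rglob() descends into format subfolders such as journald/.
    deep = sorted(p.name for p in root.rglob("*.log"))
```

Here flat misses the file under journald/, which is the ingestion gap the commit closes.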
5816ed69ae feat(corpus): synthetic log corpus generator for demos and testing
Adds scripts/gen_corpus.py that produces realistic-but-artificial log
files across all four supported formats (journald JSON, docker envelope,
qBittorrent hotio, EXT_DEVICE plaintext). Output feeds directly into
glean_corpus.py for demo environments and parser regression tests with
no production data required.

- Seed-based RNG with independent per-source sub-streams (same seed =
  same sequence for each file regardless of source count changes)
- Controllable time range, event density, and error injection rate
- Severity distribution mirrors real infrastructure (70% INFO, ~6% ERROR,
  ~2% CRITICAL) with adjustable boost via --error-rate
- 17 tests covering output structure, reproducibility, format correctness,
  parser round-trip, and CLI acceptance criteria
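
One way to get the independent per-source sub-streams described above is to derive each source's seed from the master seed plus the source name. The helper source_rng and the SHA-256 derivation are assumptions; gen_corpus.py may derive its streams differently.

```python
import hashlib
import random


def source_rng(seed: int, source: str) -> random.Random:
    """Return a generator whose sequence depends only on (seed, source)."""
    digest = hashlib.sha256(f"{seed}:{source}".encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))
```

Because no generator is shared, adding or removing a source never shifts the sequence another source sees, which is what keeps each output file stable for a given seed.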

Also fixes a latent bug in app/glean/plaintext.py: ISO 8601 timestamps
were silently failing to parse because the T separator was normalised to
space in the input string but the strptime format string still contained T.
Fix: apply the same normalisation to the format before calling strptime.
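
The timestamp bug and its fix can be illustrated as follows. The function name parse_timestamp and the default format string are assumptions; only the mismatch between normalised input and un-normalised format comes from the commit.

```python
from datetime import datetime


def parse_timestamp(raw: str, fmt: str = "%Y-%m-%dT%H:%M:%S") -> datetime:
    """Parse an ISO 8601 timestamp whether it uses 'T' or a space as separator."""
    normalised_raw = raw.replace("T", " ")
    # The fix: normalise the format the same way as the input, so the two agree.
    normalised_fmt = fmt.replace("T", " ")
    return datetime.strptime(normalised_raw, normalised_fmt)


# The pre-fix behaviour: input normalised to a space, format still containing 'T'.
try:
    datetime.strptime("2026-06-11 10:57:20", "%Y-%m-%dT%H:%M:%S")
    old_path_failed = False
except ValueError:
    old_path_failed = True
```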

Closes: #46
2026-06-11 10:57:20 -07:00
4dcc1a441a feat(incidents): incident timeline visualizer + fix entry lookup using wrong DB path
Adds IncidentTimeline.vue — a pure SVG time-axis component rendered inside the
incident detail drawer when entries are present:
- Horizontal strip scaled to incident window (preserveAspectRatio=none)
- Event ticks colored by severity, height proportional to severity level
- 50-bin density shading shows burst periods as blue bands
- Gap markers (dashed lines) for silence > 10% of window or > 60s
- Hover tooltip showing nearest entry's severity, time, and truncated text
- Click-to-scroll: clicking a tick highlights and scrolls to its entry in the list below
- Legend showing only severity levels present in the incident

Also fixes a pre-existing bug: get_incident_endpoint and both build_bundle callers
were passing INCIDENTS_DB_PATH to get_incident_entries/build_bundle, causing all
incident entry lookups to silently search the empty incidents DB instead of the
main log DB. This made all incident detail views show "No log entries found".

Closes: #57
2026-06-10 16:02:24 -07:00
5f7296ad6d chore(corpus): preserve watermark files across updates; document corpus env vars
update.sh now backs up data/corpus_watermark.txt and data/incident_watermark.txt
before git pull and restores them after, mirroring the existing watch.yaml pattern.
Without this, an update would reset watermarks to zero and re-push all corpus
entries from the beginning on the next export run.

.env.example adds a corpus export section documenting the three env vars
needed to opt a node into the Avocet training pipeline.

Closes: #6
2026-06-10 15:01:19 -07:00
313b25e0d0 feat(alerts): security alerts tab — full scorer integration
- Fix loadScorerStatus: was spreading data.state + data.config (both
  undefined); API returns flat object; now uses data directly
- Fix v-for to use filteredDetections (was using raw detections array,
  breaking the Unacknowledged tab filter)
- Fix double-prefix URL bug: BASE already contains /turnstone, so
  fetches to ${BASE}/turnstone/api/... doubled the prefix → returned
  SPA HTML → silent JSON parse failure. Fixed all fetch URLs to use
  ${BASE}/api/... in SecurityAlertsView and DashboardView
- Add CybersecStatus interface to replace Record<string, unknown>
- Add scorer field to Detection interface; show 'cybersec' badge in
  label cell when scorer !== 'anomaly'
- Add cybersecStatus.running to cybersec badge (pulse animation)
- Add ANOMALY / CYBERSEC stats rows side-by-side
- Add 'Run cybersec' button with cybersecTriggerLoading state and
  runCybersec() function posting to /api/cybersec/run
- Rename 'Run scorer' → 'Run anomaly' for clarity

Closes: #11
2026-06-10 14:32:43 -07:00
61816c26bd fix(cybersec): clean up debug traceback logging
Replaced manual traceback import with exc_info=True, which is the
idiomatic logging pattern and produces the same output.
2026-06-10 13:20:56 -07:00
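
For reference, the exc_info=True pattern mentioned in 61816c26bd looks like this in isolation. The logger name and message are placeholders, and the StringIO handler is there only to capture the output for inspection.

```python
import io
import logging

stream = io.StringIO()
logger = logging.getLogger("cybersec-demo")
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.ERROR)
logger.propagate = False  # keep the demo output out of the root logger

try:
    1 / 0
except ZeroDivisionError:
    # exc_info=True appends the active exception's traceback to the log record,
    # so no manual "import traceback" / format_exc() call is needed.
    logger.error("scoring failed", exc_info=True)

output = stream.getvalue()
```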
971a859c0d fix(watcher): remove per-flush FTS sync to eliminate SQLite write lock contention
Each WatchSource was calling build_fts_index() every 3 flushes (~30s).
With 70+ active sources, this produced a near-continuous stream of FTS
INSERT operations, each holding the SQLite write lock for several seconds
while scanning the 5.4GB log_entries table. Every other writer (other
watcher flushes, cybersec scorer) timed out with 'database is locked'.

FTS index is now only updated by the glean scheduler (every 900s) and
the manual `build-fts` command — both already call build_fts_index()
through glean_dir(). Real-time freshness of watcher-ingested entries
in FTS was ~30s before; it's now up to ~15min, which is acceptable.

This is the root cause of the persistent 'database is locked' errors
blocking the cybersec scorer (issue #9).

Closes: #9
2026-06-10 12:42:24 -07:00
c17c6c42ea feat(patterns): add audio domain — PipeWire/ALSA xrun and quantum patterns
Six new patterns covering the PipeWire + ALSA audio failure modes that
surface as crackling/stuttering on Linux desktops:

- pipewire_overflow: protocol-pulse OVERFLOW channel messages (confirmed
  present in Muninn journal — dozens per incident)
- pipewire_underrun: pw.node/spa.alsa underrun messages
- alsa_xrun: ALSA-level xrun from kernel or ALSA lib (snd_pcm)
- pipewire_quantum_mismatch: sample-rate/quantum mismatch detection
- pipewire_node_error: PipeWire node failures (device unavailable)
- pipewire_jackdbus_missing: harmless JACK probe at INFO — suppresses
  false positives from daily PipeWire restarts

Also adds 'audio' as a valid domain value in the header comment.

Companion Robin knowledge doc:
  circuitforge-plans/robin/known-issues/pipewire-alsa-quantum-xrun.md
2026-06-10 11:33:19 -07:00
cffe6bcd31 feat: cybersec zero-shot scoring pipeline (#9)
Second-pass cybersec classifier using DeBERTa-v3-base-mnli (already
cached — no download required). Runs after each anomaly scoring pass on
entries flagged by the anomaly scorer or with pattern matches.

Architecture:
- app/services/cybersec.py: zero-shot-classification pipeline with 5
  cybersec candidate labels (auth failure, privilege escalation, network
  intrusion, malware, data exfiltration). Writes ml_score/ml_label/
  ml_scored_at to log_entries; inserts high-confidence hits into
  detections with scorer='cybersec'.
- app/tasks/cybersec_scorer.py: async background task (same shape as
  anomaly_scorer.py).
- REST: GET/POST /turnstone/api/cybersec/status|run|detections.
  GET /turnstone/api/anomaly/detections now accepts scorer= filter.

Schema: ml_score, ml_label, ml_scored_at added to log_entries; scorer
column added to detections (idempotent migrations + DDL for both SQLite
and Postgres).

UI: Security Alerts view gains Source dropdown (All / Anomaly / Cybersec)
and cybersec scorer status badge. Label dropdown split into optgroups.

Deployment: TURNSTONE_CYBERSEC_MODEL/DEVICE/THRESHOLD vars added to
.env.example, docker-compose.yml, docker-standalone.sh.

Tests: 10 new tests — no model, no eligible entries, scoring, detection
creation, normal label suppression, threshold filtering, pattern-tag
filtering, idempotency, list filtering, scorer column filter.
416/416 passing.

Closes: #9
2026-06-10 01:03:25 -07:00
6e228fe0bf feat: security alerts tab — UI view for anomaly detections (#11)
New SecurityAlertsView (/alerts route) surfaces the detections table built
in #10. Features:
- All / Unacknowledged tab filter with live counts
- Label dropdown (SECURITY_ANOMALY, SYSTEM_FAILURE, NETWORK_ANOMALY, etc.)
- Score confidence bar per detection (colour-coded by threshold)
- Acknowledge drawer: full log text, optional notes, in-place row dim on save
- Scorer status badge + manual "Run scorer" button
- Config warning when TURNSTONE_ANOMALY_MODEL is unset

Dashboard: new "Unreviewed Alerts" stat card (red border when > 0) links
to /alerts so alerts surface on the landing page without navigating away.

Closes: #11
2026-06-10 00:28:15 -07:00
40694a30e5 chore: wire anomaly scoring pipeline into deployment config
Add TURNSTONE_ANOMALY_* env vars to docker-compose.yml, docker-standalone.sh,
and .env.example. Mount shared HF model cache (/Library/Assets/LLM on Heimdall)
as read-only bind in both compose and standalone — avoids re-downloading models
that are already cached by the diagnose pipeline.

Heimdall: byviz/bylastic_classification_logs already cached, threshold 0.80,
glean-triggered only (TURNSTONE_ANOMALY_INTERVAL=0).
2026-06-09 23:01:48 -07:00
0693e1fd54 feat: anomaly scoring pipeline (#10)
- Add app/services/anomaly.py: batch scorer using HF text-classification
  pipeline; rewrites anomaly_score/anomaly_label/anomaly_scored_at on
  log_entries; inserts high-confidence hits into detections table
- Add app/tasks/anomaly_scorer.py: background task (same shape as
  glean_scheduler); triggered after each glean cycle when
  TURNSTONE_ANOMALY_MODEL is set
- DB schema: add anomaly_score/anomaly_label/anomaly_scored_at columns to
  log_entries (idempotent ALTER TABLE migration); add detections table
- Wire scorer into scheduler_loop and glean_scheduler.run_once; no-op when
  model env var is empty (safe to leave unconfigured)
- REST endpoints: GET/POST /api/anomaly/status, /api/anomaly/run,
  GET /api/anomaly/detections, POST /api/anomaly/detections/{id}/acknowledge
- Reuses Hybrid-BERT label map from diagnose/classifier.py; works with any
  HF text-classification model
- 12 new tests; 406/406 passing

Closes: #10
2026-06-09 11:15:13 -07:00
0311d72e53 feat: dual-backend SQLite/Postgres + multi-tenant source namespacing
- Add app/db/ abstraction layer: Backend enum, DbConn wrapper,
  dialect helper (q() for ? vs %s paramstyle), get_conn(), tenant_id()
- Auto-detect backend from DATABASE_URL; SQLite remains default when
  unset — no config change for local deployments
- Add tenant_id column to all three logical DBs (main, context, incidents);
  idempotent ALTER TABLE migration runs before schema scripts on existing DBs
- All INSERTs inject tenant_id; SELECTs use (tenant_id = ? OR tenant_id = '')
  for backward compat with pre-namespacing rows
- Add docker-compose.yml with named volume turnstone_pgdata (survives rebuilds)
  and optional external Postgres support via DATABASE_URL override
- Add scripts/migrate_sqlite_to_postgres.py — one-shot idempotent migration
  for existing SQLite data; ON CONFLICT DO NOTHING for safe re-runs
- Fix SSH glean path in pipeline.py to use ensure_schema + get_conn
  (was still using raw sqlite3.connect + old _SCHEMA without tenant_id)
- Fix FTS5 JOIN ambiguity: qualify repeat_count as f.repeat_count in search
- Update all tests to use ensure_*_schema fixtures; add row_factory where needed
- 394/394 tests passing

Closes: #42
Closes: #50
2026-06-08 08:37:54 -07:00
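
The tenant filter and the paramstyle helper from 0311d72e53 might combine as in this sketch. The body of q() is a guess at what a ? versus %s helper could do (a naive replace that would also touch a ? inside a string literal), and the table layout is simplified.

```python
import sqlite3


def q(sql: str, backend: str) -> str:
    """Hypothetical dialect helper: rewrite '?' placeholders to '%s' for Postgres."""
    return sql.replace("?", "%s") if backend == "postgres" else sql


# Rows written before namespacing have tenant_id = '' and stay visible to every tenant.
TENANT_FILTER = "(tenant_id = ? OR tenant_id = '')"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE log_entries (id INTEGER PRIMARY KEY, tenant_id TEXT NOT NULL DEFAULT '')"
)
conn.executemany(
    "INSERT INTO log_entries (tenant_id) VALUES (?)", [("",), ("acme",), ("other",)]
)

sql = q(f"SELECT id FROM log_entries WHERE {TENANT_FILTER} ORDER BY id", "sqlite")
visible = [row[0] for row in conn.execute(sql, ("acme",))]
pg_sql = q(f"SELECT id FROM log_entries WHERE {TENANT_FILTER}", "postgres")
```

Tenant "acme" sees its own row plus the legacy untagged row, but not the row belonging to "other".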
1de156ebde fix: reset browser UA button chrome for dark mode
HTML buttons get a ~#efefef background and 2px outset border from the
browser UA stylesheet. In light mode these blend in; in dark mode they
render as stark white boxes. Adding a global button reset in theme.css
clears the UA defaults — explicit bg-* utility classes still win.

Affects: theme toggle, hamburger nav button, dashboard diagnose buttons,
and all other icon/text buttons that had no explicit bg class.

Bumps version to 0.6.2.
2026-06-05 09:55:08 -07:00
93975dcc0c fix: settings page CSS — selected card bg and toggle switch thumb
- Replace bg-accent/10 with bg-accent-muted on selected radio cards
  (opacity modifiers on CSS variable colors are silently dropped by
  UnoCSS, causing full-opacity solid blue backgrounds)
- Add explicit left-0.5 to toggle switch thumb and set off-state to
  translate-x-0 — without an explicit left the browser auto-placed the
  thumb 18px inside the track, causing 14px overflow when translated on
2026-06-02 11:54:35 -07:00
876cfb9a63 fix: group journal sources by prefix:host stem in source health
source_ids with 3+ colon segments (e.g. muninn-journal:Muninn:ssh.service)
are now aggregated by their prefix:host key at the SQL level in both
list_sources() and stats_summary(). This collapses ~19K transient systemd
unit rows (crash-loop scope entries from Muninn) into ~24 grouped rows.

- list_sources: SQL CASE/INSTR group-by stem + unit_count field
- stats_summary: same stem grouping for dashboard source health table
- delete endpoint: LIKE-based cascade delete covers grouped stems
- SourcesView: unit_count badge (e.g. "2686 units") on grouped rows;
  delete confirmation names the unit count when deleting a group
- Bump version to v0.6.1
2026-06-02 04:35:26 -07:00
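
The prefix:host grouping from 876cfb9a63 can be expressed with CASE and INSTR roughly as follows. The exact SQL in list_sources() is not shown in the commit, so this expression and the single-column table are an illustrative reconstruction.

```python
import sqlite3

# Keep everything before the second colon when there are 3+ segments;
# otherwise keep the source_id unchanged.
STEM = """
CASE
  WHEN instr(source_id, ':') > 0
   AND instr(substr(source_id, instr(source_id, ':') + 1), ':') > 0
  THEN substr(
         source_id, 1,
         instr(source_id, ':')
         + instr(substr(source_id, instr(source_id, ':') + 1), ':') - 1)
  ELSE source_id
END
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log_entries (source_id TEXT)")
conn.executemany(
    "INSERT INTO log_entries VALUES (?)",
    [
        ("muninn-journal:Muninn:ssh.service",),
        ("muninn-journal:Muninn:cron.service",),
        ("docker:sonarr",),
    ],
)
rows = conn.execute(
    f"SELECT {STEM} AS stem, COUNT(DISTINCT source_id) AS unit_count "
    "FROM log_entries GROUP BY stem ORDER BY stem"
).fetchall()
```

The two journal units collapse into one muninn-journal:Muninn row with unit_count 2, while the two-segment docker source is left alone.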
9cd7450591 chore: bump version to 0.6.0
Release summary:
- #60 split incidents tables to turnstone-incidents.db (eliminates FTS5 write lock starvation)
- #41 Hybrid-BERT label mapping shim (7-class vocabulary support in classifier)
- #15 hybrid BM25 + vector re-ranking for diagnose search (semantic=True, alpha=0.6/beta=0.4)
- #32 domain-view mapping: 42 patterns annotated across 10 domains, by_domain in diagnose summary
2026-06-01 20:52:35 -07:00
ce2a2b55a6 Merge feat/32-domain-view: domain-view mapping for patterns and diagnose output (#32)
2026-06-01 20:01:19 -07:00
eac9a4ba28 Merge feat/15-hybrid-rag: hybrid BM25 + vector re-ranking for diagnose search (#15)
2026-06-01 20:00:02 -07:00
cfddff6a2a Merge feat/41-hybrid-bert-shim: Hybrid-BERT label mapping shim (#41)
2026-06-01 19:59:34 -07:00
48816f4ef3 Merge feat/60-incidents-db: split incidents tables to dedicated DB (#60)
2026-06-01 19:58:49 -07:00
b1f3d68724 feat: domain-view mapping for patterns and diagnose output (#32)
Adds a domain: field to the pattern taxonomy and surfaces per-domain
hit counts in diagnose summaries for faster triage.

Changes:
- LogPattern gains domain: str = "" (backward-compatible default)
- load_patterns() reads domain from YAML via p.get("domain", "")
- All 42 patterns in default.yaml annotated across 10 domains:
    service_health | networking | auth | storage | memory |
    kernel | power | web_proxy | media | gpu
- _pattern_domain dict built at startup from compiled patterns
- _domain_counts() helper: maps matched_patterns tags to domains,
  counts hits per domain across a result set
- diagnose POST: summary includes by_domain: {domain: count}
- diagnose stream: summary SSE event includes by_domain when
  pattern_domain is provided (passed from rest.py at startup)
- /api/search gains ?domain= filter: post-filters results to entries
  whose matched_patterns include at least one tag in the given domain

Test fixtures: patch _pattern_domain={} and CONTEXT_DB_PATH in
test_blocklist_endpoints.py and test_glean_tautulli.py (worktree
has no data/ dir; same fix as feat/60-incidents-db).

372 tests passing.

Closes: #32
2026-06-01 19:57:16 -07:00
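
A helper in the spirit of _domain_counts() from b1f3d68724 could be as small as the sketch below. It assumes each result is a dict whose matched_patterns field is a list of tags; the real storage format and the example tag names are not taken from the commit.

```python
from collections import Counter


def domain_counts(results, pattern_domain):
    """Count pattern hits per domain across a result set.

    pattern_domain maps a pattern tag to its domain; tags without a
    domain (empty string or missing) are ignored.
    """
    counts = Counter()
    for entry in results:
        for tag in entry.get("matched_patterns", []):
            domain = pattern_domain.get(tag)
            if domain:
                counts[domain] += 1
    return dict(counts)
```

The returned dict has the same shape as the by_domain: {domain: count} field described for the diagnose summary.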
1abdcfb1f3 feat: hybrid BM25 + vector re-ranking for diagnose search (#15)
Adds late-fusion hybrid search to Turnstone's log retrieval layer:

  hybrid_score = 0.6 * bm25_normalized + 0.4 * cosine_similarity

Implementation:
- _bm25_search() extracts the existing FTS5 BM25 path as a named helper
- _hybrid_search() fetches an oversized BM25 candidate pool (5x limit,
  min 100), embeds the query and each candidate text in-process via the
  existing embeddings service, normalizes BM25 rank to [0,1], combines
  with cosine similarity, and re-ranks
- search() gets semantic=False param that dispatches to _hybrid_search()
  when True; pure BM25 remains the default for all existing call sites
- diagnose_stream() enables semantic=True so symptom-based queries
  ("database connection failed") surface semantically equivalent entries
  ("ECONNREFUSED", "backend gone away", "max retries exceeded")
- /api/search REST endpoint exposes ?semantic=true query param

Graceful degradation: falls back silently to pure BM25 when the embedding
backend is unavailable (EMBEDDING_AVAILABLE=False) or when embed_batch
raises an exception. No new infra — in-process numpy cosine, no vector DB.

11 new tests: BM25 helper, hybrid re-ranking, fallback paths, dispatcher.
372 + 11 = 383 tests passing.

Closes: #15
2026-06-01 18:13:09 -07:00
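
The late-fusion formula from 1abdcfb1f3 can be tried out with the pure-Python sketch below. The min-max normalization is an assumption (the commit only says BM25 rank is normalized to [0,1]), and plain lists replace the numpy arrays and embeddings service used in the project. FTS5's bm25() returns lower values for better matches, so the best rank maps to 1.0.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def normalize_bm25(ranks):
    """Min-max normalize so the lowest (best) rank becomes 1.0 and the highest 0.0."""
    lo, hi = min(ranks), max(ranks)
    if hi == lo:
        return [1.0 for _ in ranks]
    return [(hi - r) / (hi - lo) for r in ranks]


def hybrid_rerank(candidates, query_vec, alpha=0.6, beta=0.4):
    """Re-rank (text, bm25_rank, embedding) tuples by alpha * bm25 + beta * cosine."""
    if not candidates:
        return []
    norms = normalize_bm25([rank for _, rank, _ in candidates])
    scored = [
        (alpha * n + beta * cosine(query_vec, emb), text)
        for n, (text, _, emb) in zip(norms, candidates)
    ]
    return [text for _, text in sorted(scored, key=lambda s: s[0], reverse=True)]
```

A candidate with a middling BM25 rank but a close embedding can overtake the top lexical hit, which is the effect the commit describes for symptom-style queries.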
503a36d76c feat(classifier): add Hybrid-BERT label mapping shim (#41)
Adds _HYBRID_BERT_LABEL_MAP to translate the 7-class output vocabulary of
krishnas4415/log-anomaly-detection-models (Hybrid-BERT, MIT) to Turnstone
SeverityLabel. _map_label now checks the Hybrid-BERT map before the standard
map so either model family works via TURNSTONE_CLASSIFIER_MODEL without any
additional code path.

Mapping (confirmed from model config.json):
  normal            → INFO
  security_anomaly  → ERROR
  system_failure    → CRITICAL
  performance_issue → WARN
  network_anomaly   → WARN
  config_error      → ERROR
  hardware_issue    → CRITICAL

Keyword-based CRITICAL promotion and low-confidence DEBUG demotion apply on
top of the base mapping (same rules as the standard vocabulary).

11 new tests covering all 7 Hybrid-BERT labels, case-insensitivity, and
regression on standard-vocabulary labels. 372 tests passing total.

Note: custom loading code for the non-standard .pt checkpoint format is
explicitly out of scope — evaluate better-packaged HF alternatives first
(see #41 for candidate list).

Closes: #41
2026-06-01 16:20:31 -07:00
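
The mapping table from 503a36d76c translates directly into a dict. The lookup function below is a sketch: the name map_hybrid_bert_label and the choice to return None for unknown labels (so a caller can fall through to the standard map) are assumptions, and the keyword promotion and low-confidence demotion steps are omitted.

```python
HYBRID_BERT_LABEL_MAP = {
    "normal": "INFO",
    "security_anomaly": "ERROR",
    "system_failure": "CRITICAL",
    "performance_issue": "WARN",
    "network_anomaly": "WARN",
    "config_error": "ERROR",
    "hardware_issue": "CRITICAL",
}


def map_hybrid_bert_label(raw: str):
    """Case-insensitive lookup; returns None when the label is not a Hybrid-BERT class."""
    return HYBRID_BERT_LABEL_MAP.get(raw.strip().lower())
```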
bd3923e163 fix: split incidents tables to dedicated turnstone-incidents.db (#60)
FTS5 bulk-insert write locks starved the incident API and bundle endpoints
during log bursts (sonarr/radarr, high-volume docker sources). Fix mirrors
the context_facts split (context -> turnstone-context.db):

- Add INCIDENTS_DB_PATH / TURNSTONE_INCIDENTS_DB env var in rest.py
- Add _INCIDENTS_SCHEMA, ensure_incidents_schema(), and
  migrate_incidents_to_dedicated_db() in glean/pipeline.py
- Stub out incidents/received_bundles/sent_bundles in _SCHEMA (no-op
  CREATE IF NOT EXISTS) so legacy single-file deployments still open
- Thread incidents_db_path through diagnose_stream -> run_pipeline ->
  FalsePositiveSuppressor.suppress -> _fetch_resolved_incidents
- One-shot migration on startup: copy existing rows from main DB to
  incidents DB via INSERT OR IGNORE (idempotent, safe to re-run)
- Fix test_blocklist_endpoints fixtures to patch CONTEXT_DB_PATH and
  INCIDENTS_DB_PATH alongside DB_PATH (worktree has no data/ dir)

372 tests passing.

Closes: #60
2026-06-01 15:54:23 -07:00
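
The idempotent one-shot migration in bd3923e163 relies on INSERT OR IGNORE. A minimal sketch, assuming a two-column incidents table and two separate connections (the real schema and the body of migrate_incidents_to_dedicated_db() are not shown in the commit):

```python
import sqlite3

SCHEMA = "CREATE TABLE IF NOT EXISTS incidents (id INTEGER PRIMARY KEY, title TEXT)"


def migrate_incidents(main: sqlite3.Connection, incidents: sqlite3.Connection) -> int:
    """Copy incident rows from the main DB; rows already present are skipped."""
    incidents.execute(SCHEMA)
    rows = main.execute("SELECT id, title FROM incidents").fetchall()
    # OR IGNORE turns primary-key conflicts into no-ops, so re-running is safe.
    incidents.executemany("INSERT OR IGNORE INTO incidents (id, title) VALUES (?, ?)", rows)
    incidents.commit()
    return incidents.execute("SELECT COUNT(*) FROM incidents").fetchone()[0]


main_db = sqlite3.connect(":memory:")
main_db.execute(SCHEMA)
main_db.executemany("INSERT INTO incidents VALUES (?, ?)", [(1, "disk full"), (2, "oom")])
incidents_db = sqlite3.connect(":memory:")

first_run = migrate_incidents(main_db, incidents_db)
second_run = migrate_incidents(main_db, incidents_db)
```

Running the migration on every startup therefore costs a scan of the old table but never duplicates rows.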
1131816666 feat: bundle PII sanitization, onboarding wizard, NL source addition (#51, #52, #53)
Bundle export (#51):
- _redact_text() with 5 compiled regex patterns (IPv4, email, user=, host=, password=)
- build_bundle(sanitize=False) — per-entry redaction at export time
- sent_bundles table tracks every outgoing export (GET and POST /send)
- GET /api/sent-bundles exposes history; SentBundle model added
- BundlesView: Received/Sent tabs, sanitized badge, 5-entry preview, re-download
- IncidentsView: Sanitize PII checkbox next to Send Bundle
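
A redaction pass along the lines of _redact_text() might look like the sketch below. The five categories (IPv4, email, user=, host=, password=) come from the list above; the specific regexes and replacement tokens are assumptions and are deliberately simple.

```python
import re

# (pattern, replacement) pairs, applied in order.
_REDACTIONS = [
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IPV4]"),
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"\buser=\S+"), "user=[REDACTED]"),
    (re.compile(r"\bhost=\S+"), "host=[REDACTED]"),
    (re.compile(r"\bpassword=\S+"), "password=[REDACTED]"),
]


def redact_text(text: str) -> str:
    """Replace common PII-bearing substrings in one log line."""
    for pattern, replacement in _REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```

Compiling the patterns once at import time keeps per-entry redaction cheap when a bundle contains many entries.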

Onboarding wizard (#52):
- app/services/discover.py: journald/Docker/file detection (best-effort, safe in containers)
- GET /api/setup/status, /discover, POST /api/setup/write (additive, appends to existing)
- SetupWizard.vue: 3-step Detect → Select → Confirm
  - Step 1 shows grouped summary (journald/file/docker counts)
  - Step 2: collapsible groups with All/None section toggles
    - journald + file: pre-selected; docker: collapsed, none pre-selected
  - Step 3: YAML preview before write
- SourcesView: shows wizard on first run; Add Source button reuses it

NL source addition (#53):
- app/services/nl_source.py: keyword shortcut (13 well-known apps) + LLM fallback
- POST /api/setup/interpret: keyword → LLM → null (graceful fallback)
- NL field in wizard step 2; manual form shown when interpretation fails
- Added sources appear in grouped list immediately
2026-05-29 14:14:28 -07:00
054ebfa0e3 feat(diagnose): tech-level post-processor, offline mode, API auth, context harvest
- synthesizer: 3 system prompts (sysadmin/homelab/executive) selected by tech_level pref
- settings: tech_level selector (UI + backend) persisted in preferences.json
- QuickCapture: shows active level label in diagnosis card header
- TURNSTONE_OFFLINE_MODE=1: sets HF_HUB_OFFLINE + TRANSFORMERS_OFFLINE before lib load
- TURNSTONE_API_KEY: bearer token auth on all /api/ routes (hmac.compare_digest)
- /health always open; unset key = no auth (backward compatible)
- docs/air-gapped-deployment.md: full offline deployment guide
- scripts/harvest_docs.py: generalized context doc bulk-uploader with manifest support
- scripts/manifests/: heimdall-devops.yaml (10 docs ingested) + example.yaml template
- fix: _ingest_upload -> _glean_upload in context doc upload endpoint (was 500)
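
A minimal sketch of the bearer-token check, assuming the token arrives in a standard `Authorization: Bearer ...` header. The function name `is_authorized` and its signature are hypothetical; only `hmac.compare_digest` and the "unset key = no auth" behaviour come from the notes above.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], api_key: str) -> bool:
    """Return True when the request may proceed.

    An empty api_key means auth is disabled (backward compatible).
    """
    if not api_key:
        return True
    prefix = "Bearer "
    if not authorization_header or not authorization_header.startswith(prefix):
        return False
    presented = authorization_header[len(prefix):]
    # Constant-time comparison avoids leaking key contents through timing.
    return hmac.compare_digest(presented.encode(), api_key.encode())
```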

Closes: #56
Closes: #45
Closes: #47
Closes: #49
Closes: #21
2026-05-28 08:51:05 -07:00
73a14bd782 fix(diagnose): add max_tokens to all LLM calls; fix reasoning card contrast
Truncation fix: call_llm() in _llm_client.py now accepts max_tokens (default
2048) and passes it in both the cf-orch task payload and the OpenAI-compat
fallback body. Hypothesizer uses max_tokens=1024 (JSON array output);
synthesizer and legacy summarize use 2048 (structured 5-section narrative).
Without this, backends use their own default (often 512 tokens), causing
mid-sentence truncation of the diagnosis output.

UI fix: reasoning card changed from bg-accent/5 border-accent/30 (opacity
modifiers on CSS variables don't compose reliably across themes) to the
callout pattern: bg-surface-raised with a solid border-l-4 border-accent.
Header label changed from text-text-dim to text-accent for visual anchoring.
Text remains text-text-primary for guaranteed contrast on both light and dark
themes.

Tracks: #56 (technical-level post-processor, filed as follow-on feature)
2026-05-27 22:23:36 -07:00
7f49961ec4 fix(db): add timeout=30s to all sqlite3.connect() calls across app
Watcher, REST endpoints, services (search, incidents, blocklist),
MCP server, context retriever, embedder, glean_scheduler, and
doc_upload all used the default 5-second SQLite busy timeout.
During collect glean write phases, watcher flush threads were hitting
'database is locked' errors when the glean held the write lock longer
than 5 seconds.

All connections now use timeout=30.0, matching the pipeline fix
from commit 5a9281a. No logic changes.
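
For illustration, the change amounts to passing `timeout` on every connection. The helper name `connect` and the constant are assumptions; the codebase may simply pass `timeout=30.0` inline at each call site.

```python
import sqlite3

# Assumed constant name; the commit only states the value (30 seconds).
DB_TIMEOUT_SECONDS = 30.0


def connect(db_path: str) -> sqlite3.Connection:
    """Open a connection that waits up to 30s for a lock before raising
    'database is locked' (the sqlite3 module default is 5 seconds)."""
    return sqlite3.connect(db_path, timeout=DB_TIMEOUT_SECONDS)
```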
2026-05-26 23:12:48 -07:00
5a9281a686 fix(glean): add timeout=30s to all pipeline DB connections; add --force flag; new patterns
pipeline.py:
- Add timeout=30.0 to all sqlite3.connect() calls (5 total).
  Previously only ensure_context_schema() had it. The main glean
  writers would fail immediately under lock contention from the live
  watcher or concurrent manual glean runs.

glean_corpus.py:
- Add --force flag (passed through to glean_sources/glean_file/glean_dir).
  Without it, unchanged-fingerprint files were silently skipped even
  after pattern updates. Use after editing patterns/default.yaml.

patterns/default.yaml:
- Add 9 new patterns for Muninn / cluster-wide coverage:
    vpn_tunnel_fail     WireGuard/tunnel service failures
    vpn_handshake       WireGuard peer handshake events
    dns_degraded        systemd-resolved DNS fallback/degradation
    nvidia_api_mismatch NVIDIA kernel module vs userspace mismatch
    nvidia_xid          NVIDIA Xid GPU hardware faults
    nvidia_gpu_reset    NVIDIA GPU reset / NVLink faults
    acpi_error          ACPI firmware _DSM evaluation failures
    thermal_throttle    CPU/GPU thermal throttling / RAPL unavailable
    undervoltage        PSU undervoltage / brownout events
- Sync from /devl/turnstone-cluster/patterns/default.yaml (authoritative
  live copy updated first; repo copy was stale)
2026-05-26 22:36:45 -07:00
09b4912c8e fix(cluster): add Muninn to SSH collection, fix ingest_corpus → glean_corpus rename
- Add [muninn] to NODES map in collect_cluster_logs.sh
  Muninn is accessible via WireGuard (ssh muninn).
  One-time 7-day backfill already gleaned: 262,659 entries.
- Fix broken script reference: ingest_corpus.py was renamed to
  glean_corpus.py; the ongoing cluster glean had been silently broken since the rename
2026-05-26 17:02:53 -07:00
74e0d5fcd6 docs(container): fix GPU_SERVER_URL for Contributor2 — use public orch.circuitforge.tech
Contributor2's example-node.tv has no WireGuard route to Heimdall's LAN (10.1.10.x),
so the <YOUR_HOST_IP>:7700 private address is unreachable from there.

Use the public cf-orch endpoint instead:
  GPU_SERVER_URL=https://orch.circuitforge.tech

Contributor's Huginn has WireGuard to Heimdall LAN — <YOUR_HOST_IP>:7700 stays correct.
Added both options to docker-standalone.sh for clarity.
2026-05-26 13:39:38 -07:00
3a83e0e31d feat(container): add docker-standalone.sh for Docker hosts (Contributor/Huginn)
Mirrors podman-standalone.sh for Docker-native setups. Key differences:
- Uses ~/turnstone as default REPO_DIR (no /opt assumption)
- -p 8534:8534 port mapping instead of --net=host
- No systemd unit generation (Docker --restart=unless-stopped handles reboots)
- Volume mounts without :Z (Docker SELinux labeling differs from Podman)

Documents the multi-agent setup steps for Huginn:
  export GPU_SERVER_URL=http://<YOUR_HOST_IP>:7700
  export TURNSTONE_MULTI_AGENT_DIAGNOSE=true
  bash ~/turnstone/docker-standalone.sh
2026-05-26 13:21:54 -07:00
2a4a5a5152 feat(container): multi-agent env vars, HF cache mount, and ML deps
podman-standalone.sh:
- Add HF_CACHE_DIR=/opt/turnstone/hf-cache with mkdir guard
- Mount HF_HOME=/hf-cache so model weights persist across restarts
- Forward all multi-agent env vars (TURNSTONE_MULTI_AGENT_DIAGNOSE,
  GPU_SERVER_URL, TURNSTONE_CLASSIFIER_MODEL, TURNSTONE_EMBED_*)
- Add documentation comments for Contributor/Contributor2 remote instance setup

requirements.txt:
- Add torch (CPU-only), transformers, sentence-transformers for the
  5-stage multi-agent diagnose pipeline (classifier + suppressor stages)
- Use --extra-index-url for cpu wheel to keep image ~2GB lighter
- Both modules keep ImportError guards so server starts without them,
  but container images should ship fully capable
2026-05-26 13:20:26 -07:00
3cfd587d16 fix: separate context KB into own SQLite file to eliminate write-lock contention
context_facts, context_documents, and context_chunks now live in
turnstone-context.db (sibling of turnstone.db).  The glean scheduler
held write locks on the main DB long enough to cause 5-second timeout
failures on context fact inserts; separate files have independent WAL
write locks so they never contend.

Changes:
- pipeline.py: extract _CONTEXT_SCHEMA + ensure_context_schema()
- rest.py: CONTEXT_DB_PATH (TURNSTONE_CONTEXT_DB env var, defaults to
  sibling file); init via ensure_context_schema(); all context routes
  pass CONTEXT_DB_PATH; diagnose_stream receives context_db_path kwarg
- diagnose/__init__.py: diagnose_stream() accepts context_db_path
  (falls back to db_path for backward compat); retrieve_context uses it
- store.py: sqlite3.connect() timeout=30.0 — Python driver retry loop
  is independent of PRAGMA busy_timeout; needed for any remaining
  contention during test or single-file deployments

Closes: #42
2026-05-25 21:19:32 -07:00
e851099e5c fix(hypothesizer): extract first JSON array to handle reasoning model double-output
Reasoning models (e.g. foundation-sec-8b) emit valid JSON then repeat it
inside a markdown fence block. json.loads() fails on the combined text.

extract_first_json_array() scans for the first '[' and walks to its
matching ']' with proper string/escape/nesting handling, then returns
just that slice. Combined with strip_json_fences(), this handles all
observed output patterns:
  - bare JSON array (standard models)
  - fenced JSON array (fence-wrapping models)
  - bare array followed by fenced repeat (reasoning models)
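
The bracket-matching walk described above could look roughly like this. It is a sketch of the technique, not the shipped `extract_first_json_array()`; in particular it assumes the first `[` in the text opens the array, and it returns None when no balanced array is found.

```python
import json
from typing import Optional


def extract_first_json_array(text: str) -> Optional[str]:
    """Return the slice from the first '[' to its matching ']', or None."""
    start = text.find("[")
    if start == -1:
        return None
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            # Inside a JSON string: brackets are literal, watch for escapes.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth == 0:
                return text[start : i + 1]
    return None
```

The returned slice can be passed to `json.loads()` directly, so a repeated copy of the array later in the response is never seen by the parser.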
2026-05-25 21:01:14 -07:00
b19bea8f2a Merge pull request 'refactor: pipeline cleanup — 6 follow-up fixes (#33–#38)' (#40) from feat/pipeline-cleanup into main 2026-05-25 20:00:11 -07:00
f302f27350 Merge pull request 'feat(diagnose): 5-stage multi-agent diagnose pipeline (#29)' (#39) from feat/29-multi-agent-diagnose into main 2026-05-25 19:59:34 -07:00
39ef1320b0 feat(manage): source .env before starting uvicorn
Enables TURNSTONE_MULTI_AGENT_DIAGNOSE and other env vars set in
.env to reach the running process without manual export. Variables
already set in the caller's environment take precedence.
2026-05-25 19:15:33 -07:00
2375e073ba feat(pipeline): add TURNSTONE_CLASSIFIER_MODEL env var for Stage 2 ML config
Makes the HuggingFace classifier model for Stage 2 configurable via
TURNSTONE_CLASSIFIER_MODEL. When unset (default), Stage 2 falls back
to pattern_tags then regex — no download required on first run.

Also documents TURNSTONE_MULTI_AGENT_DIAGNOSE, TURNSTONE_CLASSIFIER_MODEL,
TURNSTONE_EMBED_BACKEND/MODEL/DEVICE in .env.example.
2026-05-25 19:11:32 -07:00
85e7a70536 refactor: pipeline cleanup — 6 follow-up fixes (#33-#38)
- #33: Wrap ClassifiedTimeline.cluster_severities in MappingProxyType for
  true immutability (frozen=True only blocks field reassignment, not dict
  mutation).

- #34: Remove dead suppression branch in synthesizer._build_hypothesis_block.
  active[] is already filtered to not rh.suppress, so the 'Yes — suppressed'
  branch was unreachable. Now shows novelty score only.

- #35: Extract shared _llm_client.py with call_llm() + extract_content() +
  strip_json_fences(). Both RootCauseHypothesizer and SummarySynthesizer
  now import from one source. Also strips JSON fences from LLM output before
  parsing in hypothesizer._parse_response.

- #36: Add per-stage try/except in pipeline.run_pipeline(). Unhandled
  stage exceptions now emit {type: 'error'} + {type: 'done'} SSE events
  instead of silently closing the stream.

- #37: Move format_context_block() call inside the legacy LLM branch in
  diagnose/__init__.py — it was being computed unconditionally but only
  used in the non-pipeline path.

- #38: Coerce supporting_cluster_ids items to str() in hypothesizer
  _parse_response to guard against LLMs returning integers instead of
  string cluster IDs.
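
To illustrate the #33 point, here is a sketch of a frozen dataclass whose dict field is wrapped in `MappingProxyType`. The class name `ClassifiedTimelineSketch` is a stand-in; the real `ClassifiedTimeline` has more fields and may do the wrapping elsewhere.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class ClassifiedTimelineSketch:
    cluster_severities: Mapping[str, str]

    def __post_init__(self) -> None:
        # frozen=True blocks normal assignment, so use object.__setattr__.
        # Copy first so later changes to the caller's dict cannot leak in.
        object.__setattr__(
            self,
            "cluster_severities",
            MappingProxyType(dict(self.cluster_severities)),
        )
```

Item assignment on the wrapped field now raises `TypeError`, whereas a plain dict inside a frozen dataclass would accept it silently.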
2026-05-25 19:05:56 -07:00
25b7ae340b fix: invert suppress_threshold semantics to similarity_threshold in FalsePositiveSuppressor
Was suppressing when novelty_score < 0.85 (i.e. similarity > 0.15), which
would suppress nearly every hypothesis once embeddings are active.

Now suppresses when max_sim >= similarity_threshold (0.85), meaning only
hypotheses that are 85%+ similar to a resolved incident are suppressed.

Also renames suppress_threshold → similarity_threshold for clarity and
adds a borderline boundary test (0.85 suppressed, 0.84 not suppressed).
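
The corrected decision rule, sketched under the assumption that novelty is defined as `1 - max_sim` (consistent with the numbers above). The function name `score_hypothesis` is illustrative.

```python
from typing import Tuple


def score_hypothesis(max_sim: float, similarity_threshold: float = 0.85) -> Tuple[float, bool]:
    """Return (novelty_score, suppress) for a hypothesis.

    max_sim is the highest cosine similarity against resolved incidents.
    """
    novelty = 1.0 - max_sim
    # Suppress only when the hypothesis closely matches a resolved incident.
    suppress = max_sim >= similarity_threshold
    return float(novelty), bool(suppress)
```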

Closes: #29
2026-05-25 18:58:52 -07:00
1b949337da fix: tighten suppression_reason display guard, document unused since/until params 2026-05-25 15:02:48 -07:00
1865ba1f02 feat: Stage 5 synthesizer + pipeline orchestrator + feature flag wiring (issue #29)
- Add app/services/diagnose/synthesizer.py: SummarySynthesizer (Stage 5)
  - Builds structured LLM prompt from ranked hypotheses, timeline, RAG context
  - Excludes suppressed hypotheses from the narrative prompt
  - Deterministic fallback when no LLM configured or LLM call fails
  - Same cf-orch task endpoint + direct OpenAI-compat fallback pattern as other stages

- Replace pipeline.py stub with full run_pipeline() async generator
  - Orchestrates all 5 stages via asyncio.to_thread for each synchronous stage
  - Yields typed SSE event dicts: status, pipeline_stage (1-4), hypotheses, reasoning, done
  - Suppressor counts (active vs suppressed) reported in stage 4 event message

- Wire MULTI_AGENT_ENABLED feature flag into diagnose_stream()
  - TURNSTONE_MULTI_AGENT_DIAGNOSE=true routes through run_pipeline()
  - pipeline emits its own done event; legacy path unchanged when flag is false
  - Import of run_pipeline added to __init__.py

- Add 21 new tests (350 -> 371 passing):
  - tests/test_diagnose_synthesizer.py: 8 tests (with/without LLM, suppressed,
    empty ranked, LLM failure fallback)
  - tests/test_diagnose_pipeline.py: 13 tests (flag off, flag on event sequence,
    empty entries, no LLM, stage 1 cluster count message)

Closes: #29
2026-05-25 14:56:25 -07:00
54d4ec5325 refactor: extract _score_hypothesis helper, fix exception types, pass device in suppressor 2026-05-25 14:41:33 -07:00
84e0cf5245 feat: Stage 4 — FalsePositiveSuppressor for multi-agent diagnose pipeline (issue #29)
- Implements FalsePositiveSuppressor using embedding cosine similarity
- Lazy corpus embedding via get_embedder() with module-level cache keyed by db_path
- Cache invalidated automatically when the resolved incident corpus changes
- Suppresses hypotheses with novelty_score below configurable threshold (default 0.85)
- Full fallback path (novelty=1.0, no suppression) when model_id empty, embedding
  service unavailable, or no resolved incidents found in DB
- Graceful handling of missing incidents table and DB query failures
- Numpy bool_ leakage prevented by explicit float()/bool() coercion at assignment
- Pure-Python cosine fallback for environments without numpy
- 9 new tests (all mocked, no real model downloads): passthrough, suppress, no-suppress,
  empty list, ranking, empty corpus, DB failure, service unavailable, cache invalidation
- 350 total tests passing (341 pre-existing + 9 new)
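
A possible shape for the pure-Python cosine fallback mentioned above. Returning 0.0 for a zero-length vector is an assumption made here for safety; the actual fallback may handle that case differently.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity without numpy; 0.0 if either vector has zero norm."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```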

Closes: #29
2026-05-25 14:28:31 -07:00
a2916f958a fix: defensive coercion for LLM confidence and cluster fields in hypothesizer
- Add _coerce_float() module-level helper: catches TypeError/ValueError from
  non-numeric LLM output (e.g. 'high', 'N/A') and returns a caller-supplied
  default instead of raising.
- Replace float(item.get('confidence', 0.5)) with
  _coerce_float(item.get('confidence'), 0.5) in _parse_response.
- Guard supporting_cluster_ids: tuple(item.get(...) or []) so a JSON null
  from the LLM does not raise TypeError ('NoneType' object is not iterable).
- runbook_refs is hardcoded as () and not sourced from LLM output; no change
  needed there.
- Add test_non_numeric_confidence_uses_default (Test 10) to cover the 'high'
  string case: asserts no exception and confidence == 0.5.
- 341 tests passing (+1).

Closes: #29
2026-05-25 14:00:30 -07:00
34fb8f501d feat: Stage 3 — RootCauseHypothesizer for multi-agent diagnose pipeline (issue #29)
- Add app/services/diagnose/hypothesizer.py with RootCauseHypothesizer class
- Stage 3 of the multi-agent diagnose pipeline: accepts ClassifiedTimeline +
  RetrievedContext, builds a structured JSON prompt, calls the LLM via the
  same cf-orch task → OpenAI-compat fallback pattern used by llm.py
- Parses JSON array response into list[Hypothesis] dataclasses with UUID ids,
  severity validation (WARNING→WARN, unknown→ERROR), confidence coercion
- Gracefully returns [] when llm_url/llm_model absent or clusters empty
- Add tests/test_diagnose_hypothesizer.py: 12 tests, all mocked, no LLM I/O
  covering: valid response, UUID generation, malformed JSON, non-list JSON,
  empty clusters, missing URL/model, max_hypotheses cap, severity mapping,
  confidence string coercion
- 340 tests passing (328 prior + 12 new)

Closes: #29
2026-05-25 13:49:18 -07:00
6ea8fbfec1 feat: Stage 2 — SeverityClassifier for multi-agent diagnose pipeline (issue #29)
Three-path classification: ML (transformers pipeline, lazy singleton) →
pattern_tags (YAML pattern severity dict) → regex (detect_severity).

- Path A: HF text-classification pipeline loaded lazily on first classify()
  call via module-level singleton; shim promotes ERROR+keyword hits to CRITICAL
  and demotes low-confidence INFO to DEBUG.
- Path B: maps cluster.pattern_tags through the loaded pattern severity dict;
  picks the highest severity across matching tags.
- Path C: falls back to detect_severity() regex scan on representative_text;
  defaults to INFO when no keyword matches.
- Pattern file resolved from constructor arg or TURNSTONE_PATTERNS env var
  (mirrors app/rest.py convention).
- No crash when transformers is not installed; ImportError on per-cluster ML
  inference triggers clean per-cluster fallback to pattern_tags/regex.
- ClassifiedTimeline.classifier_used reflects the primary session path.

Tests (10 new, 328 total, all passing):
- ML ERROR, CRITICAL promotion, DEBUG demotion, WARNING→WARN
- pattern_tags resolution from YAML fixture
- regex ERROR detection and INFO default
- ImportError clean fallback
- empty timeline no-crash
- ClassifiedTimeline FrozenInstanceError on mutation

Closes: #29
2026-05-25 13:27:17 -07:00
7abb76e628 refactor: split TimelineReconstructor.reconstruct into helpers, fix magic number + error handling
- Add gap_significance_seconds constructor param (default 30) to replace hardcoded magic number in gap_count computation
- _parse_iso now returns datetime | None with try/except on ValueError; all callers handle None return by treating malformed timestamps as absent
- Extract reconstruct into four private helpers: _sort_entries, _group_into_raw_clusters, _build_cluster, _dominant_sources_tuple
- Promote _sort_key to module-level function (was nested inside reconstruct)
- Rename old module-level _build_cluster to _make_event_cluster to avoid name collision with new instance method
- Add explanatory comment to type: ignore[arg-type] at _highest_severity call site
- Black-formatted
2026-05-25 13:22:18 -07:00
f7429ee963 feat: Stage 1 — TimelineReconstructor for multi-agent diagnose pipeline (issue #29)
- Add app/services/diagnose/timeline.py: pure-Python TimelineReconstructor
  - Sorts entries by timestamp_iso (None entries appended at end)
  - Sliding-window clustering anchored to first entry in each cluster
  - Computes cluster_id (sha1[:12]), severity (highest wins), burst flag,
    gap_before_seconds, representative_text (highest rank, longest text tiebreak)
  - Builds TimelineResult with dominant_sources sorted by entry count descending
- Update pipeline.py stub to import TimelineReconstructor (Task 6 wiring prep)
- Add tests/test_diagnose_timeline.py: 15 tests covering all 13 required cases
  plus null-timestamp edge case variant; all 318 tests passing
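
A sketch of sliding-window clustering anchored to the first entry of each cluster, as described for Stage 1. It works on bare timestamps only; the 60-second window and the name `cluster_by_window` are assumptions, and the real reconstructor also computes severity, burst flags and gaps.

```python
from datetime import datetime
from typing import List


def cluster_by_window(timestamps: List[datetime], window_seconds: float = 60.0) -> List[List[datetime]]:
    """Group sorted timestamps; an entry joins the current cluster only if it
    falls within window_seconds of that cluster's FIRST entry."""
    clusters: List[List[datetime]] = []
    for ts in sorted(timestamps):
        if clusters and (ts - clusters[-1][0]).total_seconds() <= window_seconds:
            clusters[-1].append(ts)
        else:
            clusters.append([ts])
    return clusters
```

Anchoring to the first entry (rather than the previous one) bounds each cluster's duration, so a slow trickle of events cannot chain into one unbounded cluster.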

Closes: #29
2026-05-25 12:54:15 -07:00
afab3ca869 fix: frozen dataclasses, clean __all__, improve exception logging in diagnose package 2026-05-25 12:31:07 -07:00
da28757a20 refactor: convert diagnose module to package for multi-agent pipeline (issue #29)
- Move app/services/diagnose.py verbatim to app/services/diagnose/legacy.py
- Create app/services/diagnose/__init__.py with full implementation so that
  patch('app.services.diagnose._HAS_DATEPARSER') targets the correct namespace
  and all 303 existing tests continue to pass without modification
- Add app/services/diagnose/models.py with 5 pipeline dataclasses:
  EventCluster, TimelineResult, ClassifiedTimeline, Hypothesis, RankedHypothesis
- Add app/services/diagnose/pipeline.py with run_pipeline() stub (Task 6)
- Add MULTI_AGENT_ENABLED feature flag (off by default via env var)
- Zero behavior change; ruff clean

Closes: #29
2026-05-25 11:12:39 -07:00
f7bcc6c9b7 refactor: extract embeddings service layer — decouple context embedder from Ollama
- New app/services/embeddings.py: TURNSTONE_EMBED_* env vars, multi-backend support
- embedder.py delegates to service layer; re-exports EMBEDDING_AVAILABLE for compat
- retriever.py updated to use service layer
- Test coverage updated in tests/context/test_embedder.py
2026-05-25 11:01:25 -07:00
6fec294a53 feat: fingerprint-based incremental glean — skip unchanged files (#30)
- Add glean_fingerprints table to schema (sha256 + mtime + size)
- _fingerprint(), _fp_unchanged(), _save_fingerprint() helpers in pipeline.py
- _glean_files() now checks fingerprint; skips file if hash unchanged
- force=True param threads through glean_dir → glean_file → glean_sources
- POST /api/tasks/glean and POST /api/sources/{id}/glean accept force=true
- 14 unit tests in tests/test_glean_fingerprint.py, all passing
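
Roughly, the fingerprint check could work as follows. The function names and the dict layout are hypothetical; only the sha256 + mtime + size triple, the skip-on-unchanged-hash rule, and the `force` override come from the notes above.

```python
import hashlib
import os
from typing import Optional


def fingerprint(path: str) -> dict:
    """Compute sha256, mtime and size for a file, reading in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    st = os.stat(path)
    return {"sha256": digest.hexdigest(), "mtime": st.st_mtime, "size": st.st_size}


def should_skip(path: str, stored: Optional[dict], force: bool = False) -> bool:
    """Skip a file only when a stored fingerprint exists, the hash is
    unchanged, and the caller did not request a forced re-glean."""
    if force or stored is None:
        return False
    return fingerprint(path)["sha256"] == stored["sha256"]
```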

Closes: #30
2026-05-25 11:01:18 -07:00
41fc89c474 feat: SSH remote glean — transport layer, pipeline integration, REST + UI (#22)
Closes turnstone#22.

## Transport layer (app/glean/ssh.py)
- SSHTransport context manager: key-only auth, paramiko backend
- SSHConnectionError / SSHCommandError exception hierarchy
- exec_stream() generator: yields stdout lines, raises SSHCommandError on
  non-zero exit (isinstance(int) guard for test-mock safety)
- Command builders: _build_journald_command, _build_syslog_command,
  _build_plaintext_command, _build_docker_command
- 18 unit tests in tests/test_glean_ssh.py

## Pipeline integration (app/glean/pipeline.py)
- _stream_and_write(): per-item error isolation — SSHCommandError skips
  one glean item without aborting the rest of the host connection
- _glean_ssh_source(): one SSHTransport per host, dispatches all glean
  items (journald/syslog/plaintext/docker); SSHConnectionError aborts host
- glean_sources(): splits local vs SSH sources; local → _glean_files();
  SSH → _glean_ssh_source(); shared compiled patterns and DB connection
- glean_ssh_source(): public wrapper for REST use — manages DB connection,
  pattern compilation, FTS rebuild lifecycle
- 15 integration tests in tests/test_glean_pipeline_ssh.py
- All 285 tests passing

## REST layer (app/rest.py)
- GET /api/sources/configured: reads sources.yaml and enriches with DB
  stats; SSH sources appear before first glean (entry_count=0); sub-source
  IDs (rack01/journald, rack01/docker/myapp) aggregated per host entry
- POST /api/sources/{id}/glean: detects transport:ssh and dispatches to
  glean_ssh_source() wrapper; local sources unchanged
- Import: glean_ssh_source as _glean_ssh_source

## Frontend (web/src/views/SourcesView.vue)
- Fetches /api/sources/configured (primary) + /api/sources (DB-only) in
  parallel; merges into unified SourceRow list
- SSH sources show: ssh badge (with user@host tooltip), glean-type pills
  (journald/syslog/docker/etc.), host subtitle
- SSH sub-source IDs (rack01/journald) suppressed from the DB-only list
  since they are covered by the parent SSH row
- DB-only sources (uploads) appear below configured sources with 'uploaded'
  badge; reglean button disabled (not in sources.yaml)
- Delete zeroes out configured-source stats in-place rather than removing
  the row (so the source remains visible for re-gleaning)
2026-05-21 12:37:30 -07:00
39c13f39ba feat: SSH remote host glean — transport layer and pipeline integration (closes #22, backend)
Adds SSH-based log collection from remote hosts via Paramiko.
One SSH connection per host, multiple log types per connection.

New files:
- app/glean/ssh.py: SSHTransport context manager + command builders
  for journald, syslog, plaintext, and docker log types
- tests/test_glean_ssh.py: 18 tests for transport layer (all mocked)
- tests/test_glean_pipeline_ssh.py: 15 tests for pipeline integration

Pipeline changes (app/glean/pipeline.py):
- glean_sources() now splits sources into local-file and SSH categories
- SSH sources use transport: ssh + glean: list schema in sources.yaml
- _glean_ssh_source(): one SSHTransport per host, N commands per connection
- _stream_and_write(): SSHCommandError caught per-item so one bad
  command does not abort the rest of the host's glean items
- SSHConnectionError skips the entire host with a warning log

SSH source schema (sources.yaml):
  - id: rack01
    transport: ssh
    host: 192.168.1.10
    user: admin
    key_path: ~/.ssh/id_ed25519
    glean:
      - type: journald
        args: [--since, 2 hours ago]
      - type: syslog
        path: /var/log/syslog
      - type: plaintext
        path: /var/log/app/error.log
      - type: docker
        containers: [myapp, nginx]

Key design decisions:
- Key-based auth only (no password prompts in daemon context)
- The exit-status check fires only after all stdout lines have been
  yielded; callers must drain the iterator to trigger it
- Local file sources path unchanged; SSH sources co-exist in same yaml
- Docker multi-container: one exec_stream call per container,
  source_id scoped as host_id/type/container_name

Remaining for #22: REST endpoint, SourcesView UI, sources.yaml docs.
285 tests passing (including 33 new SSH tests).
2026-05-20 23:03:13 -07:00
828b69768a refactor: rename ingest → glean throughout codebase
Renames the app/ingest/ package to app/glean/ and updates all
references across Python modules, shell scripts, Vue components,
tests, and documentation.

Intentionally preserved:
- SQLite column name ingest_time (avoids schema migration)
- RetrievedEntry.ingest_time field (maps to the column above)
- Any public-facing JSON keys that reference ingest_time

Changes by category:
- app/ingest/ → app/glean/ (full package move, all parsers)
- app/tasks/ingest_scheduler.py → app/tasks/glean_scheduler.py
- scripts/ingest_corpus.py → scripts/glean_corpus.py
- tests/test_ingest_*.py → tests/test_glean_*.py
- Docstrings, log messages, comments: ingest → glean
- Env var: TURNSTONE_INGEST_INTERVAL → TURNSTONE_GLEAN_INTERVAL
- Shell scripts: glean.log, glean_corpus.py references
- README.md: multi-source ingest → multi-source glean
- .env.example: updated env var name
- patterns/: new diagnostic patterns from 2026-05-20 SSH incident
  (service_crash_loop, pkg_daemon_restart, ssh_forward_conflict)
- SourcesView.vue: pipeline label updated
- All test import paths updated to app.glean.*

285 tests passing.
2026-05-20 23:02:55 -07:00
63c742a708 feat: periodic ingest scheduler + Orchard submission pipeline
Adds asyncio-native background scheduler (TURNSTONE_INGEST_INTERVAL,
default 900s) that runs batch ingest then pushes pattern-matched entries
to a remote CF harvest endpoint (TURNSTONE_SUBMIT_ENDPOINT).

- app/tasks/ingest_scheduler.py: IngestState, scheduler_loop, run_once,
  submit_matched, _query_matched_since — asyncio.Lock prevents concurrent runs
- app/rest.py: POST /api/ingest/batch (pre-parsed entry receiver),
  GET /api/tasks/ingest/status, POST /api/tasks/ingest (manual trigger),
  TURNSTONE_INGEST_INTERVAL + TURNSTONE_SUBMIT_ENDPOINT env wiring in lifespan
- docker-compose.submissions.yml: segregated contrib1 (8536) + contrib2 (8537)
  receiving instances on Heimdall, isolated DBs under
  /devl/docker/turnstone-submissions/<node>/
- podman-standalone.sh: pass-through for TURNSTONE_SUBMIT_ENDPOINT +
  TURNSTONE_SOURCE_HOST
- app/ingest/mqtt_subscriber.py: MQTT log source adapter
- app/ingest/wazuh.py: Wazuh alert JSON adapter
- tests/test_ingest_wazuh.py: Wazuh adapter test suite
2026-05-20 08:57:25 -07:00
6144ba99d9 fix: make sqlite-vec download non-fatal in Dockerfile 2026-05-19 13:02:15 -07:00
510499aba3 fix: use curl instead of wget for sqlite-vec download in Dockerfile 2026-05-19 13:01:45 -07:00
ed0a4bb469 feat: Alpha milestone — corpus management, upload ingest, harvester agent
Closes #1 (incident tagging — already implemented), #2, #3, #5.

- feat(api): DELETE /api/sources/{id} — purge entries + FTS rows for a source
- feat(api): POST /api/sources/{id}/ingest — re-ingest from sources.yaml
- feat(api): POST /api/ingest/upload — multipart log file upload with auto-detect
- feat(ui): SourcesView reingest + delete buttons and upload file input (#2)
- feat(harvester): harvester.py push + incident subcommands (#5)
- feat(harvester): Dockerfile, docker-compose.yml, harvester.sh (containerless)
- feat(config): GPU_SERVER_URL → CF_ORCH_URL resolution + write-back (#20)
- docs: .env.example, README Configuration table, version bump to 0.5.0
2026-05-19 07:45:58 -07:00
1361547c36 docs: bump version badge to match latest Forgejo release 2026-05-17 11:19:13 -07:00
9f2ae5464a fix(ui): nested overflow wrapper to prevent overflow-hidden clipping table columns
overflow-hidden and overflow-x-auto on the same element conflict in Tailwind's
CSS generation order. The shorthand overflow:hidden can override overflow-x:auto,
clipping the rightmost column (diagnose buttons). Fix: outer div keeps
overflow-hidden for rounded corners, inner div handles overflow-x-auto scrolling.
2026-05-16 09:11:42 -07:00
0d60533576 feat(ui): mobile fixes for Dashboard and Diagnose views
- DashboardView: p-4 sm:p-6 padding, overflow-x-auto on source health table
- DiagnoseView: p-4 sm:p-6 padding
- QuickCapture: px-4 sm:px-6 + shrink-0 on Search button to avoid input squeeze
2026-05-16 09:04:37 -07:00
807fe516a6 feat(ui): mobile responsive layout
- App: hamburger menu on mobile, nav links hidden below md breakpoint
- LogSearch: collapsible sidebar on mobile, stacks above results vertically
- Incidents/Sources: overflow-x-auto on table containers, min-w to preserve
  column layout on desktop; drawer action buttons flex-wrap on small screens
- Bundles: flex-wrap on header row, hide source_host + timestamp below sm
- General: p-4 sm:p-6 padding on all standard views
2026-05-16 02:11:58 -07:00
83691ceb94 fix(blocklist): render llm_score, fix load() error handling, fix severity override mutations
- BlocklistView: display llm_score/llm_reason when non-null (spec gap)
- BlocklistView: set scanError on non-ok load() response (was silent)
- SettingsView: replace in-place splice/property mutation with immutable
  spread pattern in toggleOverride/deleteOverride
2026-05-16 01:57:18 -07:00
175bdff9cd feat(blocklist): BlocklistView + Pi-hole settings UI 2026-05-15 21:23:03 -07:00
5263a67fb3 fix(blocklist): get_candidate for O(1) push/unblock, 400 on malformed device_names JSON 2026-05-15 21:19:02 -07:00
1e186591d7 feat(blocklist): 6 REST endpoints + Pi-hole settings fields
Add blocklist candidate listing, scan trigger, status update,
push/unblock to Pi-hole, and connection test endpoints.
Add pihole_url/version/api_key and router_source_ids/device_names
fields to SettingsBody and prefs handling in patch_settings.
Add PiholeClient.__post_init__ validation so 503 fires naturally
when url/api_key are unconfigured (mock-safe: bypassed in tests).
2026-05-15 21:15:09 -07:00
aa55a1ce24 feat(blocklist): extraction scan + candidate CRUD + full test suite 2026-05-15 21:05:49 -07:00
38138dc0c0 fix(blocklist): validate _v6_auth session JSON, add auth-failure test 2026-05-15 21:03:03 -07:00
dceb2d30ca feat(blocklist): Pi-hole v5/v6 API client + tests
PiholeClient dataclass supporting both Pi-hole v5 (PHP /admin/api.php)
and v6 (REST /api/) with public block/unblock/test_connection methods.
9 tests covering both API versions, auth flow, and error handling.
2026-05-15 21:00:01 -07:00
383b855483 fix(blocklist): remove premature imports from blocklist.py (Task 2 scope) 2026-05-15 20:58:04 -07:00
f469692c52 feat(blocklist): telemetry YAML list + loader + domain matcher
Adds patterns/telemetry.yaml with 6 rule groups (samsung, belkin, roku, lg, amazon, advertising).
Adds app/services/blocklist.py with TelemetryRule and BlocklistCandidate dataclasses, load_telemetry_rules(), and matches_telemetry() with exact and subdomain matching.
6 new TestTelemetry tests pass; 199 total passing.
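
Exact and subdomain matching can be sketched as below. The function name `matches_domain` and the normalisation (lowercasing, stripping a trailing dot) are assumptions; `matches_telemetry()` itself operates on TelemetryRule objects that are not shown here.

```python
def matches_domain(hostname: str, rule_domain: str) -> bool:
    """True when hostname equals rule_domain or is a subdomain of it."""
    host = hostname.lower().rstrip(".")
    rule = rule_domain.lower().rstrip(".")
    # Require a dot boundary so 'notexample.com' does not match 'example.com'.
    return host == rule or host.endswith("." + rule)
```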
2026-05-15 20:54:40 -07:00
4d7c436721 feat(blocklist): blocklist_candidates schema + tests
Add blocklist_candidates table and indexes to _SCHEMA in pipeline.py.
Add TestSchema tests verifying table existence, column set, and status/hit_count defaults.
All 193 tests pass.
2026-05-15 20:51:00 -07:00
135dd02423 docs: update status badge to beta 2026-05-15 20:13:47 -07:00
842d83b68e chore: remove stale load_patterns import from rest.py 2026-05-13 21:52:03 -07:00
279b01902f fix: tautulli — hmac token compare, public pattern loader, startup cache, endpoint tests 2026-05-13 19:08:49 -07:00
581e0314b4 fix: tautulli — entry_id collision on missing ts, token settings, test coverage 2026-05-13 19:04:07 -07:00
4fbac2554e feat: Tautulli webhook ingest endpoint — plex events -> log_entries
POST /turnstone/api/ingest/tautulli accepts Tautulli notification agent
payloads and stores them as log_entries under source 'tautulli'. Severity
maps error->CRITICAL, buffer->WARN, all others->None. Optional shared-token
auth via the X-Tautulli-Token header, checked against the tautulli_token pref. FTS index rebuilt
as a background task after each write. 28 new tests, all passing.
2026-05-13 18:41:03 -07:00
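A rough sketch of the severity mapping and the constant-time token check mentioned in this commit and its follow-up fix. The function names and the idea that the event type arrives as a plain string are assumptions; only the mapping itself (error to CRITICAL, buffer to WARN, everything else to None) and the use of an HMAC-style comparison come from the commit messages.

```python
import hmac
from typing import Optional

# Mapping taken from the commit message; any other event type gets no severity.
_SEVERITY_BY_EVENT = {"error": "CRITICAL", "buffer": "WARN"}


def map_severity(event_type: str) -> Optional[str]:
    """Map a Tautulli event type (assumed to be a string) to a severity."""
    return _SEVERITY_BY_EVENT.get(event_type.lower())


def token_ok(supplied: Optional[str], expected: str) -> bool:
    """Check an optional shared token in constant time.

    An empty expected token means auth is disabled (assumed behaviour).
    """
    if not expected:
        return True
    if supplied is None:
        return False
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(supplied.encode(), expected.encode())
```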
3501240231 fix: time window regex misses fuzzy quantifiers like 'last few hours'
The relative-time regex only matched digits between 'last/past' and
the unit, so 'last few hours' fell through to dateparser which then
found the bare word 'hours' and resolved it as midnight local time.

Extended the regex to capture 'few', 'couple of', 'several', 'a few'
as approximate quantifiers, mapped to 3 units each. Numeric expressions
and bare 'last hour' still work as before.
2026-05-13 18:32:54 -07:00
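One way the extended relative-time regex could look, as a sketch under assumptions: the pattern, the unit list, and the name `parse_relative_window` are hypothetical, and the choice to treat a bare "last hour" as one unit is inferred from the message rather than confirmed. The fuzzy quantifiers map to 3 units as stated above.

```python
import re
from datetime import timedelta
from typing import Optional

# Hypothetical pattern: number or fuzzy quantifier is optional before the unit.
_RELATIVE_RE = re.compile(
    r"\b(?:last|past)\s+"
    r"(?:(?P<num>\d+)\s+"
    r"|(?P<fuzzy>a\s+few|few|a\s+couple\s+of|couple\s+of|several)\s+)?"
    r"(?P<unit>minute|hour|day|week)s?\b",
    re.IGNORECASE,
)

_FUZZY_COUNT = 3  # "few", "couple of", "several" all resolve to 3 units


def parse_relative_window(text: str) -> Optional[timedelta]:
    """Return the look-back window named in text, or None if there is none."""
    m = _RELATIVE_RE.search(text)
    if m is None:
        return None
    if m.group("num"):
        count = int(m.group("num"))
    elif m.group("fuzzy"):
        count = _FUZZY_COUNT
    else:
        count = 1  # bare "last hour"
    return timedelta(**{m.group("unit").lower() + "s": count})
```

Returning `None` on no match leaves room for a fallback parser such as dateparser to handle absolute expressions.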
0b3d95cd26 fix: ingestors treat naive log timestamps as local time, not UTC
All five parsers (plex, syslog, servarr, qbittorrent, plaintext) were
using .replace(tzinfo=timezone.utc) on naive datetimes parsed from log
files, which slaps a UTC label on what is actually local-time data.
On a UTC-7 system a 2pm entry was stored as 14:00Z instead of 21:00Z,
causing time-window searches to return zero results.

Fix: use .astimezone(timezone.utc) instead, which treats the naive
datetime as local time and converts correctly.

Tests updated to round-trip back to local time for assertion so they
pass on any timezone, not just UTC.
2026-05-13 18:16:33 -07:00
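The difference between the two calls can be seen in a few lines of standard-library Python. The helper names here are illustrative only; the behaviour of `replace` and `astimezone` on naive datetimes is as documented for Python 3.6 and later.

```python
from datetime import datetime, timezone


def label_as_utc(naive: datetime) -> datetime:
    """Old behaviour: keep the wall-clock digits and just attach a UTC label."""
    return naive.replace(tzinfo=timezone.utc)


def local_to_utc(naive: datetime) -> datetime:
    """Fixed behaviour: treat the naive value as system local time, then convert."""
    return naive.astimezone(timezone.utc)


if __name__ == "__main__":
    entry = datetime(2026, 5, 13, 14, 0)
    print("labelled :", label_as_utc(entry).isoformat())
    print("converted:", local_to_utc(entry).isoformat())
```

On a host whose local zone is UTC the two results coincide, which is why the tests round-trip back to local time instead of asserting a fixed UTC hour.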
251109ae96 fix: final review fixes — port guard, network error handling, wizard back nav, tablist arrow keys, dialog focus trap
- wizard.py: wrap syslog_port int() in try/except to default 514 on non-numeric input
- ContextView: add try/catch to doDelete, doDeleteFact, addFact for network errors
- ContextView: arrow-key navigation for tablist (ArrowLeft/ArrowRight)
- DiagnoseView: arrow-key navigation for tablist (ArrowLeft/ArrowRight)
- WizardOverlay: reset current_step to last schema step when clicking 'Go back and edit'
- WizardOverlay: focus trap on Tab/Shift+Tab within dialog element
2026-05-13 17:40:40 -07:00
a047555031 fix: drag flicker guard, error body parsing, wizard session restore answer 2026-05-13 17:07:56 -07:00
5068fabb54 feat: WizardOverlay and DocUploadZone — accessible multi-step wizard and upload UI 2026-05-13 17:04:15 -07:00
6f9cfb8018 fix: add error handling to context doc/fact load functions 2026-05-13 17:00:29 -07:00
4096f890be feat: Context view — document and fact management with accessible tables
Adds /context route with tabbed UI for managing uploaded documents and
manually-entered environment facts. Includes inline confirm-before-delete,
add-fact form with category/key/value fields, wizard CTA panel, and
stub components for DocUploadZone and WizardOverlay (Task 14).
2026-05-13 16:57:38 -07:00
88b27a1454 fix: a11y — tab panels v-show, radio roving-tabindex, table header label 2026-05-13 16:53:41 -07:00
29fb31d76c fix: a11y — tablist, health dots, table headers, switch roles, nav landmark 2026-05-13 16:48:38 -07:00
b7e71b0e78 fix: a11y — QuickCapture label/role/aria-live/spinner, LogEntryRow expand button 2026-05-13 16:42:46 -07:00
e0bfa11642 feat: optional sqlite-vec embedding pipeline for Paid-tier RAG 2026-05-13 16:32:57 -07:00
d8c3eba0f8 feat: context REST API — docs, facts, wizard, and debug endpoints
Wires the context/RAG layer into FastAPI via a dedicated _ctx router
(/turnstone/api/context/*): document upload (POST/GET/DELETE /docs),
fact CRUD (POST/GET/DELETE /facts), wizard state machine
(/wizard/schema, /wizard/step, /wizard/apply), and a debug search
endpoint (/debug/search). All blocking DB calls are dispatched via
asyncio.to_thread to keep the event loop free.
2026-05-13 16:31:07 -07:00
b5ce0a24b2 feat: inject environment context into diagnose pipeline and LLM prompt
- Add context_block param to summarize() and thread it into _PROMPT_TEMPLATE
- Wire retrieve_context/format_context_block into diagnose_stream() before
  log search; emit context SSE event (facts + chunks) to the client
- 3 new tests covering prompt injection and SSE event emission (155 total, all pass)
2026-05-13 16:29:26 -07:00
783edbe496 feat: wizard state machine — structured Q&A writes context facts and source config 2026-05-13 16:25:52 -07:00
ef8d164188 feat: context retriever — keyword fact lookup and chunk search 2026-05-13 16:23:54 -07:00
ebbb1af32d feat: doc upload adapter — writes facts, document, and chunks to context store 2026-05-13 16:21:55 -07:00
b23a60a602 feat: context chunker — type detection, YAML extraction, text chunking
- Implement document type detection for yaml/json/markdown/text
- Extract service facts from docker-compose YAML (names, images, ports)
- Split text into overlapping word chunks (300-word default with 50-word overlap)
- Enforce 5 MB file size limit
- Comprehensive TDD test suite: 15 tests passing
2026-05-13 15:54:51 -07:00
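The overlapping word chunking could be sketched like this. The function name and signature are hypothetical; only the 300-word size and 50-word overlap defaults come from the commit message.

```python
from typing import List


def chunk_words(text: str, size: int = 300, overlap: int = 50) -> List[str]:
    """Split text into chunks of up to `size` words, sharing `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap  # each chunk starts this many words after the last
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the final chunk already reaches the end of the text
    return chunks
```

With the defaults, a 700-word document yields three chunks starting at words 0, 250, and 500, so each boundary is covered by two chunks.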
54c756dfe8 feat: context store — fact and document CRUD 2026-05-13 15:53:03 -07:00
7461953021 feat: add context_facts, context_documents, context_chunks tables to schema 2026-05-13 15:51:19 -07:00
625b55324a fix: a11y foundation — text-dim contrast, focus-visible, prefers-reduced-motion 2026-05-13 15:48:12 -07:00
784a4072b4 feat: SSE streaming diagnose, severity filter pills, per-source-cap search
- diagnose_stream() async generator: status/summary/entries/reasoning/done events
- POST /api/diagnose/stream SSE endpoint wired in rest.py
- entries_in_window() gains per_source_cap to prevent high-volume sources crowding results
- QuickCapture: severity filter pills, filtered entries view, pipeline status spinner
- llm.py: remove overly broad HTTPStatusError re-raise
2026-05-13 15:45:35 -07:00
b70c89e7b5 feat: try cf-orch task endpoint first; fall back to direct model call
POST /api/inference/task with product=turnstone task=log_analysis routes to
the security reasoning model assigned in cf-orch. Falls back to the OpenAI-
compat /v1/chat/completions path on 404 (no assignment) or if the task
endpoint is absent (local instances, example-node).
2026-05-13 08:20:29 -07:00
caa85b3d30 feat: source-scoped diagnose; multi-node Docker log collection
- Diagnose: add source_filter param threaded through entries_in_window,
  search, _diagnose, and DiagnoseRequest — clicking diagnose on a
  dashboard source now scopes both keyword and window hits to that source
- QuickCapture: read route.query.source; show scope badge with clear ✕;
  auto-run when source param is present without a query
- DashboardView: pass source= (not q=) when navigating to diagnose
- collect_cluster_logs.sh: auto-discover Docker containers on all nodes
  (Heimdall non-watched, Navi, Strahl via SSH); collect Cass Plex logs
  via SSH; write to per-node dirs for directory-mode ingest
- turnstone-cluster.service: add --reload for hot-reload during dev
2026-05-13 08:10:42 -07:00
53fa350adf fix: correct cf-orch port to 7700; fix relative time parsing in diagnose; fix syslog PRI prefix 2026-05-13 05:33:41 -07:00
c1fa5ef70d fix: write ingest log to data dir (alan lacks /var/log write access) 2026-05-13 05:20:56 -07:00
e9faabc07f fix: run collect service as alan user; call ingest directly without Docker 2026-05-13 05:17:43 -07:00
a4ec5a6951 feat: add UDP syslog receiver for network device log collection
scripts/syslog_receiver.py: asyncio UDP server listening on port 5140,
appends raw syslog lines to network-syslog.txt for the Turnstone live
watcher to tail. Requires no root — port 5140 is non-privileged.

scripts/turnstone-syslog-receiver.service: systemd unit for auto-start.

app/ingest/syslog.py: strip optional RFC 3164 <PRI> prefix before
parsing so network-forwarded syslog (OpenWRT logd, Arista EOS, etc.)
is handled correctly without the PRI value breaking the regex.
2026-05-13 04:58:51 -07:00
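Stripping the optional `<PRI>` prefix might look like the sketch below. The function name and return shape are assumptions; the prefix format (angle brackets around a 1 to 3 digit number, equal to facility times 8 plus severity) follows RFC 3164.

```python
import re
from typing import Optional, Tuple

_PRI_RE = re.compile(r"^<(\d{1,3})>")


def strip_pri(line: str) -> Tuple[str, Optional[int]]:
    """Return (line without the PRI prefix, PRI value or None if absent)."""
    m = _PRI_RE.match(line)
    if m is None:
        return line, None
    return line[m.end():], int(m.group(1))


if __name__ == "__main__":
    rest, pri = strip_pri("<34>Oct 11 22:14:15 mymachine su: auth failed")
    facility, severity = divmod(pri, 8)  # PRI = facility * 8 + severity
    print(rest, facility, severity)
```

Lines without a prefix pass through unchanged, so locally written syslog files and network-forwarded lines can share one parser.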
d769be04d4 refactor: use live watcher + systemd timer instead of cron for cluster ingest
Local Heimdall sources (journal, Docker containers, network syslog) are now
tailed continuously by the built-in watcher via watch.yaml — no periodic
collection needed for those.

SSH collection of remote node journals is now handled by a systemd timer
(turnstone-cluster-collect.service/.timer) instead of cron.
collect_cluster_logs.sh simplified to only SSH-collect remote nodes and
trigger ingest directly.

docker-cluster.sh updated to mount:
  - /var/run/docker.sock (so watcher can run docker logs -f)
  - /run/systemd/journal (so watcher can run journalctl -f)
  - /devl/turnstone-cluster/patterns/ (cluster-specific watch.yaml)
2026-05-13 04:55:25 -07:00
1e8a118f71 feat: add cluster-wide log collection and Heimdall Turnstone deployment
- scripts/collect_cluster_logs.sh: collects journals from Heimdall (local),
  Navi, Sif, Cass, Strahl (SSH), Docker services, and a network syslog
  placeholder; designed for 15-min cron before ingest
- patterns/sources-cluster.yaml: ingest sources config for the full
  CircuitForge cluster stack; points at /devl/turnstone-cluster/data/
- scripts/docker-cluster.sh: Docker deployment for Heimdall cluster monitor;
  seeds preferences.json with cf-orch coordinator URL (localhost:7701) so
  LLM summarization works on first ingest without manual UI config
2026-05-12 18:53:58 -07:00
a21c158917 fix: increase LLM summarize timeout to 120s for remote cf-orch routing
20s was too tight for first-request model swaps in Ollama (model cold load
can take 30-60s). 120s matches coordinator inference timeout.
2026-05-12 18:27:52 -07:00
bb94f8251c fix: podman-standalone.sh builds image and regenerates systemd unit on each run
Running the script after a git pull previously left a stale image in place.
Now: build → run → regenerate systemd unit → daemon-reload, all in one step.
2026-05-12 16:18:37 -07:00
7d46314e86 feat: switch LLM backend to OpenAI-compat; add cf-orch remote inference support
Turnstone now calls /v1/chat/completions instead of Ollama's /api/generate.
This format works with both local Ollama (>=0.1.24) and a remote cf-orch
coordinator, enabling GPU-less nodes like Contributor2's to route diagnoses through
the cluster without any local model.

- llm.py: OpenAI-compat messages format, optional Bearer auth header
- diagnose.py: thread llm_api_key through the call chain
- rest.py: llm_api_key pref (default empty), SettingsBody field, passed to diagnose
- SettingsView.vue: API Key field, label updated from "Ollama URL" to "LLM Endpoint URL"
- tests: updated mocks for new response shape; added bearer token assertion test
2026-05-12 12:58:38 -07:00
afcac6ff05 feat: periodic corpus export — push ERROR/CRITICAL entries and incidents to Avocet
Watermark-based batch export script (scripts/export_corpus.py) pushes up to 500
ERROR/CRITICAL entries and labeled incidents per run to AVOCET_CORPUS_ENDPOINT.
Uses SQLite rowid watermark (entry log) and ISO timestamp watermark (incidents).
Skips silently when AVOCET_CORPUS_ENDPOINT is not set. 19 tests. Closes turnstone#6.
2026-05-11 17:08:35 -07:00
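A minimal sketch of a rowid-watermark batch query, assuming a `log_entries` table with `severity` and `text` columns (the `text` column name and the function name are hypothetical; the real export script is not shown here).

```python
import sqlite3
from typing import List, Tuple


def export_batch(
    conn: sqlite3.Connection, watermark: int, limit: int = 500
) -> Tuple[List[tuple], int]:
    """Fetch up to `limit` ERROR/CRITICAL rows with rowid above the watermark.

    Returns the rows and the new watermark (unchanged if nothing was found).
    """
    rows = conn.execute(
        "SELECT rowid, severity, text FROM log_entries "
        "WHERE rowid > ? AND severity IN ('ERROR', 'CRITICAL') "
        "ORDER BY rowid LIMIT ?",
        (watermark, limit),
    ).fetchall()
    new_watermark = rows[-1][0] if rows else watermark
    return rows, new_watermark
```

Advancing the watermark only to the last exported rowid means a run that is cut short can resume without skipping or re-sending entries, provided the watermark is persisted after a successful push.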
85785a3f76 chore: add update.sh deploy script; gitignore patterns/watch.yaml
update.sh pulls a named branch (default: main), preserves the local
watch.yaml around the pull, rebuilds the image, restarts the service,
and polls health until ready.

Usage: sudo bash /opt/turnstone/scripts/update.sh [branch]

patterns/watch.yaml is site-specific config — gitignored so host
customizations survive git pulls. The template is preserved in git
history (feat/live-watch) for reference.
2026-05-11 16:07:07 -07:00
bb8206d5a1 Merge pull request 'feat: live watch mode — tail journald/docker/podman continuously (#4)' (#16) from feat/live-watch into main 2026-05-11 15:45:30 -07:00
9cc8bf3662 feat: add file tail source type; configure example-node watchers
- type: file uses tail -F (handles rotation) with auto-format detection
- _parse_lines dispatches to journald/servarr/qbit/caddy/syslog/plaintext
  based on first-line format detection — same logic as batch ingest
- watch.yaml updated with file type docs and example-node-specific example
- scripts/journal-bridge.sh + .service written directly to example-node

Contributor2's watch.yaml covers: system-journal-live (via bridge file),
sonarr, radarr, lidarr, prowlarr, bazarr, qbittorrent, nzbget, tautulli
2026-05-11 15:44:10 -07:00
3fd81e5ab1 feat: live watch mode — tail journald/docker/podman sources continuously (#4)
Adds background watcher that tails active log sources and ingests entries
in near-real-time, keeping the DB fresh without manual ingest runs.

- app/watch/watcher.py: Watcher + WatchSource using subprocess + select
  loop; flushes every 10s or 100 lines; syncs FTS index every 3 flushes
- patterns/watch.yaml: declarative source config (journald/docker/podman)
- app/rest.py: lifespan context manager starts/stops watcher on app
  startup/shutdown; GET /api/watch/status + POST /api/watch/reload
- web/src/views/DashboardView.vue: live/manual indicator chip + stale
  banner copy adapts to whether live watching is active
- tests/test_watch_watcher.py: 16 tests covering config load, command
  building, docker timestamp stripping, orchestrator lifecycle

Closes #4
2026-05-11 15:34:13 -07:00
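The "flush every 10s or 100 lines" rule can be illustrated with a small buffer class. This is a simplified sketch, not the watcher itself: the class name is hypothetical, and it only checks the age limit when a line arrives, whereas a select-based loop like the one described above can also flush on an idle timeout.

```python
import time
from typing import Callable, List


class FlushBuffer:
    """Collect lines and hand them off when a size or age limit is reached."""

    def __init__(
        self,
        flush: Callable[[List[str]], None],
        max_lines: int = 100,
        max_age: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush
        self._max_lines = max_lines
        self._max_age = max_age
        self._clock = clock
        self._lines: List[str] = []
        self._last_flush = clock()

    def add(self, line: str) -> None:
        self._lines.append(line)
        too_many = len(self._lines) >= self._max_lines
        too_old = self._clock() - self._last_flush >= self._max_age
        if too_many or too_old:
            self.flush()

    def flush(self) -> None:
        if self._lines:
            self._flush(list(self._lines))  # pass a copy, then reset
            self._lines.clear()
        self._last_flush = self._clock()
```

Injecting the clock keeps the timing rule testable without sleeping.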
3c758d3626 Merge pull request 'feat: LLM reasoning, severity overrides, dashboard freshness' (#14) from feat/llm-reasoning into main 2026-05-11 13:00:52 -07:00
c12cc6d68a feat: severity overrides + last_ingested timestamp on dashboard 2026-05-11 13:00:11 -07:00
b3c02eebf7 docs: add README — diagnostic log intelligence layer 2026-05-11 12:57:32 -07:00
0882083755 feat: LLM reasoning layer — Ollama summarization on diagnose results 2026-05-11 11:35:07 -07:00
18d80cbfad Merge pull request 'feat: frictionless incident capture' (#13) from feat/frictionless-capture into main 2026-05-11 09:53:25 -07:00
165 changed files with 26559 additions and 1069 deletions

104
.env.example Normal file

@@ -0,0 +1,104 @@
# Turnstone environment variables
# Copy to .env and adjust for your setup. All variables are optional unless noted.
# --- Database & paths ---
# TURNSTONE_DB=/data/turnstone.db
# TURNSTONE_PATTERNS=/patterns
# TURNSTONE_SOURCE_HOST=my-server
# --- GPU / LLM inference ---
# GPU_SERVER_URL — URL of your GPU inference server (Ollama, vLLM, or cf-orch coordinator).
# Paid+ users: leave unset to auto-default to https://orch.circuitforge.tech via CF_LICENSE_KEY.
# Local Ollama (default if unset): http://localhost:11434
# Local cf-orch coordinator: http://<YOUR_HOST_IP>:7700
# CF_ORCH_URL is also accepted as a backward-compatible alias.
# GPU_SERVER_URL=http://localhost:11434
# --- CircuitForge license (Paid+) ---
# Enables cloud GPU inference and premium features.
# When set, GPU_SERVER_URL defaults to https://orch.circuitforge.tech automatically.
# CF_LICENSE_KEY=CFG-TRSN-XXXX-XXXX-XXXX
# --- Bundle endpoint (optional) ---
# Remote endpoint to push diagnostic bundles for escalation.
# TURNSTONE_BUNDLE_ENDPOINT=https://example.com/api/bundles
# --- Log corpus export to Avocet (optional) ---
# Push ERROR/CRITICAL entries and labeled incidents to the Avocet corpus endpoint
# for logreading fine-tune training. Requires a consent token issued by CF.
# Contact alan@circuitforge.tech to register your node and receive a token.
# Watermarks are stored at data/corpus_watermark.txt and data/incident_watermark.txt.
# AVOCET_CORPUS_ENDPOINT=https://avocet.circuitforge.tech/api/corpus/log-batch
# AVOCET_CONSENT_TOKEN=your-uuid-token-here
# TURNSTONE_SOURCE_HOST=my-server-name # defaults to system hostname if unset
# --- Periodic batch glean ---
# Seconds between automatic glean runs from sources.yaml. Set to 0 to disable.
# TURNSTONE_GLEAN_INTERVAL=900
# --- Multi-agent diagnose pipeline (experimental) ---
# Enable the 5-stage ML pipeline instead of the single-LLM summarize() call.
# TURNSTONE_MULTI_AGENT_DIAGNOSE=true
# Stage 2 — ML severity classifier (optional; falls back to pattern_tags then regex).
# Recommended: byviz/bylastic_classification_logs (~300MB, downloaded from HuggingFace)
# TURNSTONE_CLASSIFIER_MODEL=byviz/bylastic_classification_logs
# Stage 4 — Embedding backend for false-positive suppression.
# sentence_transformers: in-process local model (downloads on first use)
# ollama: uses a running Ollama instance (no download needed if model is already pulled)
# TURNSTONE_EMBED_BACKEND=sentence_transformers
# TURNSTONE_EMBED_MODEL=BAAI/bge-small-en-v1.5
# TURNSTONE_EMBED_DEVICE=cpu
# --- Cybersec scoring pipeline (zero-shot, second-pass on flagged entries) ---
# Runs a zero-shot classifier on entries already flagged by the anomaly scorer
# or that have pattern matches — a focused second opinion using cybersec vocabulary.
# The DeBERTa-v3-base-mnli model (required by the diagnose pipeline) is the recommended
# zero-shot classifier — it produces human-readable cybersec labels with no fine-tuning.
# TURNSTONE_CYBERSEC_MODEL=MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli
# TURNSTONE_CYBERSEC_DEVICE=cpu
# TURNSTONE_CYBERSEC_THRESHOLD=0.60 # lower than anomaly threshold (zero-shot is calibrated differently)
# --- Anomaly scoring pipeline (IDS / watchdog) ---
# Batch-scores every ingested log entry after each glean cycle.
# Any HuggingFace text-classification model works; the byviz classifier (already
# required by the diagnose pipeline) is the recommended starting point.
# Detections above the threshold are inserted into the detections table and
# surfaced in the Security Alerts tab.
#
# Set TURNSTONE_ANOMALY_MODEL to enable; leave unset to disable (safe default).
# TURNSTONE_ANOMALY_MODEL=byviz/bylastic_classification_logs
# TURNSTONE_ANOMALY_DEVICE=cpu # or "cuda" / "mps" for GPU inference
# TURNSTONE_ANOMALY_THRESHOLD=0.80 # confidence floor for detection insertion
# TURNSTONE_ANOMALY_INTERVAL=0 # standalone loop (0 = glean-triggered only)
#
# HuggingFace model cache — share with the host to avoid re-downloading models.
# HF_HOME=/hf_cache # inside container (set in docker-compose)
# HF_CACHE_PATH=/Library/Assets/LLM # host bind-mount source (docker-compose only)
# --- Air-gapped / offline deployment ---
# Set to 1 to block all HuggingFace hub network access at runtime.
# Pre-download models to ~/.cache/huggingface/ before deploying — see docs/air-gapped-deployment.md.
# TURNSTONE_OFFLINE_MODE=1
# --- API authentication ---
# When set, all /api/ requests require: Authorization: Bearer <token>
# Generate a token: python -c "import secrets; print(secrets.token_urlsafe(32))"
# TURNSTONE_API_KEY=your-secret-token-here
# --- The Orchard (harvest receiver only) ---
# Set on the central harvest.circuitforge.tech instance to enable branch management.
# TURNSTONE_ORCHARD_ADMIN_KEY=your-admin-secret-here
# TURNSTONE_ORCHARD_DATA_ROOT=/devl/docker/turnstone-submissions
# TURNSTONE_ORCHARD_CADDYFILE=/devl/caddy-proxy/Caddyfile
# TURNSTONE_ORCHARD_CADDY_CONTAINER=caddy-proxy
# TURNSTONE_ORCHARD_HARVEST_HOST=https://harvest.circuitforge.tech
# TURNSTONE_ORCHARD_PORT_BASE=8538
# TURNSTONE_ORCHARD_IMAGE=localhost/turnstone:latest
# --- Orchard branch (submitting node) ---
# Set TURNSTONE_SUBMIT_ENDPOINT to push pattern-matched log entries to the harvest receiver.
# Generate your branch slug and API key via: POST /api/orchard/graft on the harvest instance.
# TURNSTONE_SUBMIT_ENDPOINT=https://harvest.circuitforge.tech/your-slug
# TURNSTONE_BRANCH_KEY=api-key-from-graft-response

1
.gitignore vendored

@@ -1,5 +1,6 @@
data/
corpus/raw/
patterns/watch.yaml
log/
__pycache__/
*.pyc

308
.nfs0000000000bbcf52000002e7 Executable file

@@ -0,0 +1,308 @@
#!/usr/bin/env bash
# manage.sh — Turnstone diagnostic intelligence layer
# Usage: ./manage.sh <command> [args]
set -euo pipefail
# Only emit color codes when stdout is a real terminal
if [[ -t 1 ]]; then
RED='\033[0;31m'; GREEN='\033[0;32m'; YELLOW='\033[1;33m'; BLUE='\033[0;34m'; NC='\033[0m'
else
RED=''; GREEN=''; YELLOW=''; BLUE=''; NC=''
fi
info() { echo -e "${BLUE}[turnstone]${NC} $*"; }
success() { echo -e "${GREEN}[turnstone]${NC} $*"; }
warn() { echo -e "${YELLOW}[turnstone]${NC} $*"; }
error() { echo -e "${RED}[turnstone]${NC} $*" >&2; exit 1; }
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$SCRIPT_DIR"
API_PORT=8534 # FastAPI: serves REST API + built Vue SPA
VITE_PORT=5174 # Vite HMR port in dev mode (proxies /api → 8534)
LOG_DIR="log"
API_PID_FILE=".turnstone-api.pid"
DB="${TURNSTONE_DB:-${SCRIPT_DIR}/data/turnstone.db}"
CONDA_BASE="${CONDA_BASE:-/devl/miniconda3}"
PYTHON="${CONDA_BASE}/envs/cf/bin/python"
# ── Helpers ───────────────────────────────────────────────────────────────────
_is_alive() {
local pid_file="$1"
[[ -f "$pid_file" ]] && kill -0 "$(<"$pid_file")" 2>/dev/null
}
_kill_pid_file() {
local pid_file="$1" label="$2"
if [[ -f "$pid_file" ]]; then
local pid
pid=$(<"$pid_file")
if kill -0 "$pid" 2>/dev/null; then
kill "$pid" && rm -f "$pid_file"
success "$label stopped (PID $pid)."
else
warn "Stale PID file for $label (PID $pid not running). Cleaning up."
rm -f "$pid_file"
fi
else
warn "$label not running."
fi
}
_wait_for_port() {
local port="$1" label="$2" pid_file="$3"
for _i in $(seq 1 20); do
sleep 0.5
(echo "" >/dev/tcp/127.0.0.1/"$port") 2>/dev/null && return 0
if ! _is_alive "$pid_file"; then
rm -f "$pid_file"
error "$label died during startup. Check ${LOG_DIR}/api.log"
fi
done
error "$label did not bind to port $port within 10 s."
}
# ── Usage ─────────────────────────────────────────────────────────────────────
usage() {
echo ""
echo -e " ${BLUE}Turnstone — Diagnostic Log Intelligence${NC}"
echo ""
echo " Usage: ./manage.sh <command> [args]"
echo ""
echo " Production-like (built SPA + uvicorn):"
echo -e " ${GREEN}start${NC} Build Vue SPA, start FastAPI + SPA on :${API_PORT}"
echo -e " ${GREEN}stop${NC} Stop the server"
echo -e " ${GREEN}restart${NC} Stop then start"
echo -e " ${GREEN}status${NC} Show running process"
echo -e " ${GREEN}logs${NC} Tail server log"
echo -e " ${GREEN}open${NC} Open UI in browser"
echo ""
echo " Development (hot-reload):"
echo -e " ${GREEN}dev${NC} uvicorn --reload (:${API_PORT}) + Vite HMR (:${VITE_PORT})"
echo ""
echo " Data:"
echo -e " ${GREEN}ingest PATH [DB]${NC} Ingest a log file or corpus directory"
echo -e " ${GREEN}ingest-plex [HOST]${NC} Pull Plex log from Cass (or HOST) and ingest"
echo -e " ${GREEN}ingest-qbit [HOST]${NC} Pull qBittorrent log locally or from HOST via SSH"
echo -e " ${GREEN}build-fts${NC} Rebuild the FTS search index"
echo ""
echo " Tests:"
echo -e " ${GREEN}test [args]${NC} Run pytest suite"
echo ""
echo " DB: ${DB}"
echo " Conda env: cf"
echo ""
echo " Examples:"
echo " ./manage.sh start"
echo " ./manage.sh dev"
echo " ./manage.sh ingest corpus/raw/"
echo " ./manage.sh ingest corpus/raw/ data/custom.db"
echo ""
}
# ── Commands ──────────────────────────────────────────────────────────────────
CMD="${1:-help}"
shift || true
case "$CMD" in
start)
if _is_alive "$API_PID_FILE"; then
warn "Already running (PID $(<"$API_PID_FILE")) — use 'restart' to rebuild."
exit 0
fi
mkdir -p "$LOG_DIR" data
info "Building Vue SPA…"
(cd web && npm run build) 2>&1 | tee "${LOG_DIR}/build.log" | grep -E "built in|error" || true
success "SPA built → web/dist/"
info "Starting on port ${API_PORT}…"
TURNSTONE_DB="$DB" nohup "$PYTHON" -m uvicorn app.rest:app \
--host 0.0.0.0 --port "$API_PORT" \
>> "${LOG_DIR}/api.log" 2>&1 &
echo $! > "$API_PID_FILE"
_wait_for_port "$API_PORT" "Turnstone" "$API_PID_FILE"
success "Running → http://localhost:${API_PORT} (PID $(<"$API_PID_FILE"))"
;;
stop)
_kill_pid_file "$API_PID_FILE" "Turnstone"
;;
restart)
bash "$0" stop
exec bash "$0" start
;;
status)
echo ""
if _is_alive "$API_PID_FILE"; then
success "Turnstone RUNNING PID $(<"$API_PID_FILE") → http://localhost:${API_PORT}"
else
echo -e " Turnstone ${RED}STOPPED${NC}"
fi
echo ""
;;
logs)
tail -f "${LOG_DIR}/api.log"
;;
open)
URL="http://localhost:${API_PORT}"
info "Opening ${URL}"
if command -v xdg-open &>/dev/null; then xdg-open "$URL"
elif command -v open &>/dev/null; then open "$URL"
else echo "$URL"
fi
;;
dev)
DEV_API_PID=".turnstone-dev-api.pid"
mkdir -p "$LOG_DIR" data
if _is_alive "$DEV_API_PID"; then
warn "Dev API already running (PID $(<"$DEV_API_PID"))"
else
info "Starting uvicorn --reload on port ${API_PORT}…"
TURNSTONE_DB="$DB" nohup "$PYTHON" -m uvicorn app.rest:app \
--host 0.0.0.0 --port "$API_PORT" --reload \
>> "${LOG_DIR}/api.log" 2>&1 &
echo $! > "$DEV_API_PID"
_wait_for_port "$API_PORT" "FastAPI (dev)" "$DEV_API_PID"
success "API (hot-reload) → http://localhost:${API_PORT}"
fi
_cleanup_dev() {
local pid
pid=$(<"$DEV_API_PID" 2>/dev/null) || true
[[ -n "${pid:-}" ]] && kill "$pid" 2>/dev/null && rm -f "$DEV_API_PID"
info "Dev servers stopped."
}
trap _cleanup_dev EXIT INT TERM
info "Starting Vite HMR on port ${VITE_PORT}…"
success "Frontend (HMR) → http://localhost:${VITE_PORT}"
(cd web && npm run dev -- --port "$VITE_PORT")
;;
ingest)
if [[ $# -lt 1 ]]; then
error "Usage: ./manage.sh ingest <file_or_dir> [DB_PATH]"
fi
info "Ingesting $1 → ${2:-$DB}…"
"$PYTHON" scripts/ingest_corpus.py "$1" "${2:-$DB}"
;;
ingest-plex)
PLEX_HOST="${1:-cass}"
PLEX_LOG_DIR="/var/lib/plexmediaserver/Library/Application Support/Plex Media Server/Logs"
TMP_DIR="/tmp/turnstone-plex-$$"
mkdir -p "$TMP_DIR"
info "Listing Plex logs on ${PLEX_HOST}…"
# Get list of all rotated + active Plex logs
mapfile -t REMOTE_LOGS < <(ssh "$PLEX_HOST" \
"ls '${PLEX_LOG_DIR}'/Plex\ Media\ Server*.log 2>/dev/null") \
|| { rm -rf "$TMP_DIR"; error "SSH to ${PLEX_HOST} failed."; }
if [[ ${#REMOTE_LOGS[@]} -eq 0 ]]; then
rm -rf "$TMP_DIR"
error "No Plex logs found on ${PLEX_HOST} at ${PLEX_LOG_DIR}"
fi
for remote_path in "${REMOTE_LOGS[@]}"; do
# Plex Media Server.1.log → cass-plex_media_server.1.log
local_name="${PLEX_HOST}-$(basename "$remote_path" | tr ' ' '_' | tr '[:upper:]' '[:lower:]')"
local_path="${TMP_DIR}/${local_name}"
info " ← $(basename "$remote_path")"
ssh "$PLEX_HOST" "cat '${remote_path}'" > "$local_path"
done
info "Ingesting ${#REMOTE_LOGS[@]} log file(s) into ${DB}…"
for f in "$TMP_DIR"/*.log; do
"$PYTHON" scripts/ingest_corpus.py "$f" "$DB"
done
rm -rf "$TMP_DIR"
info "Done. Restarting server…"
exec bash "$0" restart
;;
ingest-qbit)
QBIT_HOST="${1:-}"
# Default log locations in priority order
QBIT_LOG_PATHS=(
"$HOME/.local/share/qBittorrent/logs/qbittorrent.log"
"$HOME/.config/qBittorrent/logs/qbittorrent.log"
"/var/log/qbittorrent/qbittorrent.log"
)
TMP_DIR="/tmp/turnstone-qbit-$$"
mkdir -p "$TMP_DIR"
if [[ -n "$QBIT_HOST" ]]; then
info "Fetching qBittorrent log from ${QBIT_HOST}…"
REMOTE_LOG=""
for p in "${QBIT_LOG_PATHS[@]}"; do
if ssh "$QBIT_HOST" "test -f '$p'" 2>/dev/null; then
REMOTE_LOG="$p"
break
fi
done
if [[ -z "$REMOTE_LOG" ]]; then
rm -rf "$TMP_DIR"
error "No qBittorrent log found on ${QBIT_HOST}. Tried: ${QBIT_LOG_PATHS[*]}"
fi
local_name="${QBIT_HOST}-qbittorrent.log"
ssh "$QBIT_HOST" "cat '$REMOTE_LOG'" > "${TMP_DIR}/${local_name}"
info " ← ${REMOTE_LOG} (${QBIT_HOST})"
else
LOCAL_LOG=""
for p in "${QBIT_LOG_PATHS[@]}"; do
if [[ -f "$p" ]]; then
LOCAL_LOG="$p"
break
fi
done
if [[ -z "$LOCAL_LOG" ]]; then
rm -rf "$TMP_DIR"
error "No qBittorrent log found locally. Tried: ${QBIT_LOG_PATHS[*]}"
fi
cp "$LOCAL_LOG" "${TMP_DIR}/qbittorrent.log"
info " ← ${LOCAL_LOG}"
fi
info "Ingesting into ${DB}…"
"$PYTHON" scripts/ingest_corpus.py "${TMP_DIR}"/*.log "$DB"
rm -rf "$TMP_DIR"
info "Done. Restarting server…"
exec bash "$0" restart
;;
build-fts)
info "Rebuilding FTS index for ${DB}…"
TURNSTONE_DB="$DB" "$PYTHON" scripts/build_fts_index.py "$DB"
success "FTS index rebuilt."
;;
test)
info "Running test suite…"
PYTEST="${CONDA_BASE}/envs/cf/bin/pytest"
[[ -x "$PYTEST" ]] || error "pytest not found in cf env at ${PYTEST}"
TURNSTONE_DB=":memory:" "$PYTEST" tests/ -v "$@"
;;
help|--help|-h)
usage
;;
*)
error "Unknown command: ${CMD}. Run './manage.sh help' for usage."
;;
esac


@@ -17,6 +17,21 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# sqlite-vec: optional vector search extension for context embedding (Paid tier)
RUN set -e; \
SVEC_VER=0.1.6; \
ARCH=$(uname -m); \
case "$ARCH" in \
x86_64) SVEC_ARCH="x86_64-linux-gnu" ;; \
aarch64) SVEC_ARCH="aarch64-linux-gnu" ;; \
*) echo "sqlite-vec: unsupported arch $ARCH — skipping" && exit 0 ;; \
esac; \
curl -fsSL -o /tmp/sqlite_vec.tar.gz \
"https://github.com/asg017/sqlite-vec/releases/download/v${SVEC_VER}/sqlite-vec-${SVEC_VER}-loadable-linux-${SVEC_ARCH}.tar.gz" \
&& tar -xz -C /usr/lib/python3/ -f /tmp/sqlite_vec.tar.gz --wildcards '*.so' \
&& rm /tmp/sqlite_vec.tar.gz \
|| echo "sqlite-vec optional extension unavailable — vector search disabled"
COPY app/ ./app/
COPY patterns/ ./patterns/
COPY scripts/ ./scripts/

175
README.md Normal file

@@ -0,0 +1,175 @@
# Turnstone
> **Diagnostic log intelligence for self-hosted infrastructure.**
[![Status](https://img.shields.io/badge/status-beta-blue)](https://git.opensourcesolarpunk.com/Circuit-Forge/turnstone)
[![Version](https://img.shields.io/badge/version-0.5.0-green)](https://git.opensourcesolarpunk.com/Circuit-Forge/turnstone/releases)
[![License](https://img.shields.io/badge/license-private-red)](LICENSE)
[![Python](https://img.shields.io/badge/python-3.11%2B-blue)](requirements.txt)
Turnstone ingests logs from your services, indexes them for full-text and pattern search, and lets you tag incidents, build diagnostic bundles, and query across your infrastructure — from a web UI or an MCP-compatible agent client.
---
## What it does
```
Service logs (journald, Docker, syslog, Caddy, Plex, arr stack, qBittorrent, dmesg)
→ Ingest pipeline (auto-detect format, parse, deduplicate, pattern-tag)
→ SQLite + FTS index
→ REST API → Vue web UI / MCP server → agent clients (Orchard)
```
**Human workflow:** Search logs by symptom or time window, create incidents, attach relevant log entries, bundle everything into a diagnostic package for hand-off or archival.
**Agent workflow:** MCP tools expose search, incident management, and diagnose over a standard protocol — Orchard agents can query Turnstone as part of automated triage and resolution pipelines.
---
## Features
- **Multi-source glean** — journald, Docker, syslog, Caddy, dmesg, Plex, Servarr (arr stack), qBittorrent, plaintext; paths configured in `patterns/sources.yaml`
- **Pattern tagging** — named regex patterns applied at glean time (`service_restart`, `auth_failure`, `oom`, `segfault`, `disk_full`, `timeout`, …); extend in `patterns/default.yaml`
- **Full-text search** — SQLite FTS5 index across all ingested entries; filter by source, severity, time window
- **Natural-language time queries** — "what happened yesterday morning", "show me errors from the last 3 hours"; powered by dateparser
- **Incident management** — create, label, and track incidents; attach supporting log entries
- **Diagnostic bundles** — group log entries + incident metadata into a shareable bundle for escalation or archival
- **MCP server** — exposes search, incident, and diagnose tools to MCP-compatible agent clients
- **Dark/light theme** — Vue 3 + UnoCSS, system-aware
---
## Quick start (Docker)
```bash
git clone https://git.opensourcesolarpunk.com/Circuit-Forge/turnstone.git
cd turnstone
# Edit sources to match your paths
cp patterns/sources.yaml.example patterns/sources.yaml
$EDITOR patterns/sources.yaml
docker build -t turnstone:latest .
docker run -d --name turnstone \
-p 8534:8534 \
-v $(pwd)/data:/data \
-v $(pwd)/patterns:/patterns \
turnstone:latest
```
Open `http://localhost:8534/turnstone/`
---
## Quick start (dev)
```bash
# Backend
conda run -n cf pip install -r requirements.txt
conda run -n cf bash manage.sh start
# Frontend (separate terminal, hot-reload)
cd web && npm install && npm run dev
```
API: `http://localhost:8534/turnstone/docs`
UI: `http://localhost:5174/`
---
## Deployment (Podman + systemd)
See [`podman-standalone.sh`](podman-standalone.sh) for rootful Podman setup with systemd unit generation. Suitable for hosts that run system Podman rather than Docker Compose.
For Caddy reverse-proxy setup (e.g. `menagerie.circuitforge.tech/turnstone`), see [`docs/caddy-routing-pattern.md`](docs/caddy-routing-pattern.md) — all routes are pre-mounted at `/turnstone` so no prefix stripping is needed.
---
## Log source configuration
Edit `patterns/sources.yaml` to tell Turnstone where your logs live (container-side paths):
```yaml
sources:
- id: system-journal
path: /data/journal-export.jsonl # exported by export_journal.sh on host
- id: docker-logs
path: /var/log/docker # bind-mounted from host
- id: caddy
path: /var/log/caddy/access.log
```
For `journald` sources, run `scripts/export_journal.sh` on the host before each glean (e.g. via cron). Missing paths are skipped with a warning — safe to leave entries for services that are temporarily down.
---
## Pattern library
Named patterns in `patterns/default.yaml` are matched against every log entry at glean time. Matched pattern names are stored and used to boost search relevance for diagnostic queries.
```yaml
patterns:
- name: oom
pattern: "(out of memory|OOM|killed process|cannot allocate)"
severity: CRITICAL
description: Out-of-memory condition
```
Add domain-specific patterns for your stack. Multiple patterns can match a single entry.
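As a rough illustration of glean-time tagging, the sketch below applies named regexes to a log line and collects every matching name. It is not Turnstone's matcher: the `timeout` regex and severity, the `tag_entry` helper, and the case-insensitive flag are assumptions made for the example. Only the `oom` regex is taken from the snippet above.

```python
import re

# Hypothetical in-memory form of two entries from patterns/default.yaml.
# The oom regex is copied from the README example; the timeout entry is illustrative only.
PATTERNS = [
    {"name": "oom", "pattern": r"(out of memory|OOM|killed process|cannot allocate)", "severity": "CRITICAL"},
    {"name": "timeout", "pattern": r"(timed out|timeout)", "severity": "WARNING"},
]

# Assumption: matching is case-insensitive. The real matcher may behave differently.
COMPILED = [(p["name"], re.compile(p["pattern"], re.IGNORECASE)) for p in PATTERNS]


def tag_entry(text: str) -> list[str]:
    """Return the names of every pattern that matches *text*, in declaration order."""
    return [name for name, rx in COMPILED if rx.search(text)]


print(tag_entry("kernel: Out of memory: Killed process 4242 (plexmediaserver)"))
# ['oom']
print(tag_entry("upstream request timed out; cannot allocate buffer"))
# ['oom', 'timeout']
```

The second call shows several patterns matching a single entry, which is why the result is a list of names and not a single tag.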
---
## MCP server
Turnstone exposes an MCP (Model Context Protocol) server for agent clients. Start it alongside the REST API:
```bash
conda run -n cf python -m app.mcp_server
```
Tools exposed: `search`, `diagnose`, `create_incident`, `list_incidents`, `build_bundle`.
---
## Manage script
```bash
bash manage.sh start # start API (and Vite dev server if --dev)
bash manage.sh stop # stop API
bash manage.sh restart # restart
bash manage.sh status # show process state and port bindings
bash manage.sh logs # tail API log
```
---
## Configuration
Copy `.env.example` to `.env` (or pass as `-e` flags to Docker/Podman). All variables are optional.
| Variable | Default | Description |
|----------|---------|-------------|
| `GPU_SERVER_URL` | `http://localhost:11434` | GPU inference server (Ollama, vLLM, or cf-orch). `CF_ORCH_URL` is accepted as a backward-compat alias. Paid+ users: leave unset — auto-defaults to `https://orch.circuitforge.tech` when `CF_LICENSE_KEY` is present. |
| `CF_LICENSE_KEY` | — | CircuitForge Paid+ license key. Enables cloud GPU inference and premium features. |
| `TURNSTONE_DB` | `/data/turnstone.db` | Path to the SQLite database. |
| `TURNSTONE_PATTERNS` | `./patterns` | Pattern directory (default.yaml, sources.yaml, watch.yaml). |
| `TURNSTONE_SOURCE_HOST` | `unknown` | Host identifier stamped on ingested entries. |
| `TURNSTONE_BUNDLE_ENDPOINT` | — | Remote URL to push diagnostic bundles for escalation. |
| `TURNSTONE_GLEAN_INTERVAL` | `900` | Seconds between automatic batch glean runs. Set to `0` to disable. |
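A minimal sketch of how these variables could be resolved, assuming plain environment lookups with the defaults from the table. The `load_settings` helper, the `glean_enabled` key, and the rule that `GPU_SERVER_URL` wins over `CF_ORCH_URL` when both are set are assumptions for illustration. The `CF_LICENSE_KEY` auto-default is not modeled here.

```python
import os
from typing import Mapping, Optional

# Defaults copied from the table above. None means the variable has no default.
DEFAULTS: dict[str, Optional[str]] = {
    "GPU_SERVER_URL": "http://localhost:11434",
    "TURNSTONE_DB": "/data/turnstone.db",
    "TURNSTONE_PATTERNS": "./patterns",
    "TURNSTONE_SOURCE_HOST": "unknown",
    "TURNSTONE_BUNDLE_ENDPOINT": None,
    "TURNSTONE_GLEAN_INTERVAL": "900",
}


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict[str, object]:
    """Resolve settings from *env* (os.environ by default), falling back to DEFAULTS."""
    env = os.environ if env is None else env
    settings: dict[str, object] = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    # Backward-compat alias. Assumption: the primary name wins when both are set.
    if "GPU_SERVER_URL" not in env and "CF_ORCH_URL" in env:
        settings["GPU_SERVER_URL"] = env["CF_ORCH_URL"]
    interval = int(str(settings["TURNSTONE_GLEAN_INTERVAL"]))
    settings["TURNSTONE_GLEAN_INTERVAL"] = interval
    # 0 disables automatic glean runs; treating negatives the same way is an assumption.
    settings["glean_enabled"] = interval > 0
    return settings


print(load_settings({"CF_ORCH_URL": "http://gpu:8000", "TURNSTONE_GLEAN_INTERVAL": "0"}))
```

Passing an explicit mapping instead of reading `os.environ` directly keeps the helper easy to exercise in tests.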
---
## Ports
| Service | Port | Notes |
|---------|------|-------|
| FastAPI + Vue SPA | `8534` | Production: REST API + built frontend |
| Vite HMR | `5174` | Dev only: hot-reload frontend, proxies `/api` → 8534 |
---
## License
Private — CircuitForge internal tooling. Not licensed for redistribution.

97
app/context/chunker.py Normal file

@ -0,0 +1,97 @@
"""Document type detection, fact extraction, and text chunking — MIT licensed."""
from __future__ import annotations
import json
from dataclasses import dataclass
from pathlib import Path
ACCEPTED_SUFFIXES = {".md", ".txt", ".yaml", ".yml", ".json", ".conf", ".config", ".toml"}
MAX_FILE_BYTES = 5 * 1024 * 1024 # 5 MB
CHUNK_WORDS = 300
CHUNK_OVERLAP = 50
class UnsupportedDocType(Exception):
pass
class FileTooLarge(Exception):
pass
@dataclass(frozen=True)
class ExtractedFact:
category: str
key: str
value: str
def detect_type(filename: str, content: bytes) -> str: # noqa: ARG001
suffix = Path(filename).suffix.lower()
if suffix not in ACCEPTED_SUFFIXES:
raise UnsupportedDocType(
f"File type {suffix!r} is not supported. "
f"Accepted: {', '.join(sorted(ACCEPTED_SUFFIXES))}"
)
if suffix in {".yaml", ".yml"}:
return "yaml"
if suffix == ".json":
return "json"
if suffix == ".md":
return "markdown"
return "text"
def extract_facts_from_yaml(text: str) -> list[ExtractedFact]:
"""Extract service names and ports from docker-compose-style YAML."""
try:
import yaml
data = yaml.safe_load(text)
except Exception:
return []
if not isinstance(data, dict):
return []
services = data.get("services")
if not isinstance(services, dict):
return []
facts = []
for name, definition in services.items():
if not isinstance(definition, dict):
continue
parts: list[str] = []
image = definition.get("image")
if image:
parts.append(f"image:{image}")
for port in definition.get("ports", []):
parts.append(f"port:{port}")
facts.append(ExtractedFact(
category="service",
key=str(name),
value=" ".join(parts) if parts else "configured",
))
return facts
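def chunk_text(text: str, chunk_size: int = CHUNK_WORDS, overlap: int = CHUNK_OVERLAP) -> list[str]:
    # The window advances by chunk_size - overlap words, so that stride must be
    # positive or the loop below would never terminate.
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("chunk_size must be positive and overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    if not words:
        return []
    chunks: list[str] = []
    i = 0
    while i < len(words):
        chunks.append(" ".join(words[i: i + chunk_size]))
        i += chunk_size - overlap
    return chunks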
def process_upload(filename: str, content: bytes) -> tuple[str, list[ExtractedFact], list[str]]:
"""Return (doc_type, extracted_facts, text_chunks). Raises on bad type or size."""
if len(content) > MAX_FILE_BYTES:
raise FileTooLarge(f"File exceeds {MAX_FILE_BYTES // (1024 * 1024)} MB limit.")
text = content.decode("utf-8", errors="replace")
doc_type = detect_type(filename, content)
facts: list[ExtractedFact] = []
if doc_type == "yaml":
facts = extract_facts_from_yaml(text)
chunks = chunk_text(text)
return doc_type, facts, chunks
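To make the overlap arithmetic in `chunk_text` concrete, the sketch below re-implements the same sliding window with small numbers (5-word chunks, 2-word overlap) so the stride of `chunk_size - overlap` is visible. `sliding_chunks` is a hypothetical standalone copy written for illustration, not an import from the module.

```python
def sliding_chunks(words: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Same windowing as chunk_text, but on a pre-split word list."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    stride = chunk_size - overlap  # each window starts this many words after the previous one
    return [words[i: i + chunk_size] for i in range(0, len(words), stride)]


words = [f"w{n}" for n in range(12)]
for chunk in sliding_chunks(words, chunk_size=5, overlap=2):
    print(" ".join(chunk))
# w0 w1 w2 w3 w4
# w3 w4 w5 w6 w7
# w6 w7 w8 w9 w10
# w9 w10 w11
```

With the module defaults (300 words, 50 overlap) the stride is 250, so a 300-word document yields windows starting at word 0 and word 250; the second is a short tail that the first window already covers.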

81
app/context/embedder.py Normal file

@ -0,0 +1,81 @@
"""Context chunk embedding — BSL licensed.
Thin wrapper around app.services.embeddings that handles the DB I/O for
context_chunks. All backend configuration (model, device, backend type) is
delegated to the service layer via TURNSTONE_EMBED_* env vars.
Re-exports EMBEDDING_AVAILABLE so callers that imported it from here continue
to work without changes.
"""
from __future__ import annotations
import logging
import sqlite3
from pathlib import Path
from app.services.embeddings import (
EMBEDDING_AVAILABLE, # re-export for backward compat
get_embedder,
pack_vector,
)
__all__ = ["EMBEDDING_AVAILABLE", "embed_chunks"]
logger = logging.getLogger(__name__)
def embed_chunks(
db_path: Path,
document_id: str,
# Legacy params kept for backward compat — ignored when the ST backend is active.
llm_url: str = "",
model: str = "",
timeout: float = 60.0,
) -> int:
"""Embed all un-embedded chunks for *document_id*.
Uses the configured embedder (sentence-transformers by default; Ollama when
TURNSTONE_EMBED_BACKEND=ollama). Returns the count of newly embedded chunks.
Returns 0 silently when no embedder is available.
The legacy ``llm_url`` and ``model`` parameters are accepted but ignored when
the sentence-transformers backend is active; configure via env vars instead.
"""
embedder = get_embedder()
if embedder is None:
return 0
conn = sqlite3.connect(str(db_path), timeout=30.0)
conn.execute("PRAGMA journal_mode=WAL")
conn.row_factory = sqlite3.Row
rows = conn.execute(
"SELECT id, text FROM context_chunks WHERE document_id = ? AND embedding IS NULL",
(document_id,),
).fetchall()
if not rows:
conn.close()
return 0
texts = [r["text"] for r in rows]
ids = [r["id"] for r in rows]
count = 0
try:
vectors = embedder.embed_batch(texts)
for chunk_id, vec in zip(ids, vectors):
blob = pack_vector(vec)
conn.execute(
"UPDATE context_chunks SET embedding = ? WHERE id = ?",
(blob, chunk_id),
)
count += 1
conn.commit()
except Exception as exc:
logger.warning("Batch embedding failed for document %s: %s", document_id, exc)
finally:
conn.close()
logger.debug("Embedded %d chunk(s) for document %s", count, document_id)
return count

183
app/context/retriever.py Normal file

@ -0,0 +1,183 @@
"""Context retrieval — structured keyword lookup (Free) + chunk search — MIT licensed.
Two retrieval modes for context_chunks:
    Vector search:  cosine similarity over stored embeddings (when available)
    Keyword search: LIKE-based fallback when no embedder is configured
Both modes are called from retrieve_context(); the best available mode is used
automatically so callers need not check EMBEDDING_AVAILABLE themselves.
"""
from __future__ import annotations
import logging
import sqlite3
from dataclasses import dataclass, field
from pathlib import Path
import numpy as np
from app.services.embeddings import (
EMBEDDING_AVAILABLE,
cosine_similarity,
get_embedder,
unpack_vector,
)
logger = logging.getLogger(__name__)
@dataclass
class RetrievedContext:
facts: list[dict[str, str]] = field(default_factory=list)
chunks: list[dict[str, str]] = field(default_factory=list)
# ── Structured fact retrieval (always runs) ───────────────────────────────────
def get_relevant_facts(db_path: Path, query: str) -> list[dict[str, str]]:
"""Keyword match against context_facts. Always runs — Free tier."""
try:
conn = sqlite3.connect(str(db_path), timeout=30.0)
conn.execute("PRAGMA journal_mode=WAL")
conn.row_factory = sqlite3.Row
keywords = [w.lower() for w in query.split() if len(w) > 2]
if not keywords:
rows = conn.execute(
"SELECT category, key, value, source FROM context_facts"
" ORDER BY category LIMIT 20"
).fetchall()
else:
conditions = " OR ".join(
"(LOWER(key) LIKE ? OR LOWER(value) LIKE ?)" for _ in keywords
)
params: list[str] = []
for kw in keywords:
params.extend([f"%{kw}%", f"%{kw}%"])
rows = conn.execute(
f"SELECT category, key, value, source FROM context_facts"
f" WHERE {conditions} ORDER BY category LIMIT 10",
params,
).fetchall()
conn.close()
return [dict(r) for r in rows]
except sqlite3.OperationalError:
return []
# ── Chunk retrieval: vector path ──────────────────────────────────────────────
def _search_chunks_vector(
db_path: Path,
query: str,
top_k: int = 3,
) -> list[dict[str, str]]:
"""Cosine similarity search over embedded context_chunks.
Loads all stored embeddings into memory and scores in-process with numpy.
Skips any chunk whose BLOB dimension does not match the current model dim
(stale embeddings from a previous model; they will be re-embedded on the
next document upload).
Returns at most *top_k* results ordered by similarity descending.
"""
embedder = get_embedder()
if embedder is None:
return []
try:
query_vec: np.ndarray = embedder.embed(query)
model_dim: int = embedder.dim
except Exception as exc:
logger.warning("Query embedding failed: %s", exc)
return []
try:
conn = sqlite3.connect(str(db_path), timeout=30.0)
conn.execute("PRAGMA journal_mode=WAL")
conn.row_factory = sqlite3.Row
rows = conn.execute(
"SELECT cc.id, cc.text, cc.embedding, cd.filename"
" FROM context_chunks cc"
" JOIN context_documents cd ON cc.document_id = cd.id"
" WHERE cc.embedding IS NOT NULL"
).fetchall()
conn.close()
except sqlite3.OperationalError:
return []
scored: list[tuple[float, dict[str, str]]] = []
for row in rows:
blob: bytes = row["embedding"]
# Guard against blobs from a different-dimension model
if len(blob) // 4 != model_dim:
continue
try:
chunk_vec = unpack_vector(blob)
score = cosine_similarity(query_vec, chunk_vec)
scored.append((score, {"text": row["text"], "filename": row["filename"]}))
except Exception:
continue
scored.sort(key=lambda t: t[0], reverse=True)
return [item for _, item in scored[:top_k]]
# ── Chunk retrieval: keyword fallback ─────────────────────────────────────────
def _search_chunks_keyword(db_path: Path, query: str) -> list[dict[str, str]]:
"""LIKE-based keyword search across context_chunks. Fallback when no embedder."""
try:
conn = sqlite3.connect(str(db_path), timeout=30.0)
conn.execute("PRAGMA journal_mode=WAL")
conn.row_factory = sqlite3.Row
keywords = [w.lower() for w in query.split() if len(w) > 2][:5]
if not keywords:
conn.close()
return []
conditions = " OR ".join("LOWER(cc.text) LIKE ?" for _ in keywords)
params = [f"%{kw}%" for kw in keywords]
rows = conn.execute(
f"SELECT cc.text, cd.filename FROM context_chunks cc"
f" JOIN context_documents cd ON cc.document_id = cd.id"
f" WHERE {conditions} LIMIT 3",
params,
).fetchall()
conn.close()
return [{"text": r["text"], "filename": r["filename"]} for r in rows]
except sqlite3.OperationalError:
return []
# ── Public interface ──────────────────────────────────────────────────────────
def retrieve_context(db_path: Path, query: str) -> RetrievedContext:
"""Retrieve structured facts and relevant chunks for a query.
Chunk retrieval uses vector search when an embedder is available and at
least one embedded chunk exists; falls back to keyword search otherwise.
"""
facts = get_relevant_facts(db_path, query)
if EMBEDDING_AVAILABLE:
chunks = _search_chunks_vector(db_path, query)
if not chunks:
# Vector search returned nothing (no embedded chunks yet) — fall back.
chunks = _search_chunks_keyword(db_path, query)
else:
chunks = _search_chunks_keyword(db_path, query)
return RetrievedContext(facts=facts, chunks=chunks)
def format_context_block(ctx: RetrievedContext) -> str | None:
"""Format context for injection into an LLM prompt. Returns None when empty."""
lines: list[str] = []
if ctx.facts:
lines.append("Known environment facts:")
for f in ctx.facts:
lines.append(f" [{f['category']}] {f['key']}: {f['value']}")
if ctx.chunks:
lines.append("Relevant documentation:")
for c in ctx.chunks:
lines.append(f" [{c['filename']}] {c['text'][:200]}")
return "\n".join(lines) if lines else None
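The vector path in `_search_chunks_vector` can be sketched without the embedding service. The version below is a stdlib-only approximation: `pack_vector`, `unpack_vector`, and `cosine_similarity` here are hypothetical stand-ins for the helpers in `app.services.embeddings` (not shown in this diff), and the little-endian float32 layout is an assumption inferred from the `len(blob) // 4` dimension check.

```python
import math
import struct


def pack_vector(vec: list[float]) -> bytes:
    # Assumed layout: consecutive little-endian float32 values, 4 bytes each.
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack_vector(blob: bytes) -> list[float]:
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 0.0 if norm_a == 0 or norm_b == 0 else dot / (norm_a * norm_b)


def top_k(query_vec: list[float], rows: list[tuple[str, bytes]], model_dim: int, k: int = 3) -> list[str]:
    """Score every stored blob against the query and return the k best texts."""
    scored: list[tuple[float, str]] = []
    for text, blob in rows:
        if len(blob) // 4 != model_dim:  # stale embedding from a different model
            continue
        scored.append((cosine_similarity(query_vec, unpack_vector(blob)), text))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [text for _, text in scored[:k]]


rows = [
    ("orthogonal", pack_vector([0.0, 1.0])),
    ("exact match", pack_vector([1.0, 0.0])),
    ("stale 3-dim blob", pack_vector([1.0, 0.0, 0.0])),
]
print(top_k([1.0, 0.0], rows, model_dim=2, k=2))
# ['exact match', 'orthogonal']
```

Because every stored embedding is loaded and scored in-process, the cost grows linearly with the number of embedded chunks; that is the same trade-off the docstring of `_search_chunks_vector` describes.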

135
app/context/store.py Normal file

@ -0,0 +1,135 @@
"""Context fact and document CRUD — MIT licensed."""
from __future__ import annotations
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from app.db import get_conn, resolve_tenant_id
@dataclass(frozen=True)
class ContextFact:
id: str
category: str
key: str
value: str
source: str | None
created_at: str
@dataclass(frozen=True)
class ContextDocument:
id: str
filename: str
doc_type: str
full_text: str
file_size: int | None
uploaded_at: str
def add_fact(db_path: Path, category: str, key: str, value: str, source: str | None = None) -> ContextFact:
tid = resolve_tenant_id()
fact = ContextFact(
id=str(uuid.uuid4()),
category=category,
key=key,
value=value,
source=source,
created_at=datetime.now(timezone.utc).isoformat(),
)
with get_conn(db_path) as conn:
conn.execute(
"INSERT INTO context_facts(id, tenant_id, category, key, value, source, created_at) VALUES (?,?,?,?,?,?,?)",
(fact.id, tid, fact.category, fact.key, fact.value, fact.source, fact.created_at),
)
conn.commit()
return fact
def list_facts(db_path: Path, category: str | None = None) -> list[ContextFact]:
tid = resolve_tenant_id()
with get_conn(db_path) as conn:
if category:
rows = conn.execute(
"SELECT * FROM context_facts WHERE category=? AND (tenant_id=? OR tenant_id='') ORDER BY created_at",
(category, tid),
).fetchall()
else:
rows = conn.execute(
"SELECT * FROM context_facts WHERE (tenant_id=? OR tenant_id='') ORDER BY category, created_at",
(tid,),
).fetchall()
return [
ContextFact(
id=r["id"], category=r["category"], key=r["key"],
value=r["value"], source=r["source"], created_at=r["created_at"],
)
for r in rows
]
def delete_fact(db_path: Path, fact_id: str) -> bool:
tid = resolve_tenant_id()
with get_conn(db_path) as conn:
cursor = conn.execute(
"DELETE FROM context_facts WHERE id=? AND (tenant_id=? OR tenant_id='')",
(fact_id, tid),
)
conn.commit()
return cursor.rowcount > 0
def add_document(
db_path: Path,
filename: str,
doc_type: str,
full_text: str,
file_size: int | None = None,
) -> ContextDocument:
tid = resolve_tenant_id()
doc = ContextDocument(
id=str(uuid.uuid4()),
filename=filename,
doc_type=doc_type,
full_text=full_text,
file_size=file_size,
uploaded_at=datetime.now(timezone.utc).isoformat(),
)
with get_conn(db_path) as conn:
conn.execute(
"INSERT INTO context_documents(id, tenant_id, filename, doc_type, full_text, file_size, uploaded_at)"
" VALUES (?,?,?,?,?,?,?)",
(doc.id, tid, doc.filename, doc.doc_type, doc.full_text, doc.file_size, doc.uploaded_at),
)
conn.commit()
return doc
def list_documents(db_path: Path) -> list[ContextDocument]:
tid = resolve_tenant_id()
with get_conn(db_path) as conn:
rows = conn.execute(
"SELECT id, filename, doc_type, full_text, file_size, uploaded_at"
" FROM context_documents WHERE (tenant_id=? OR tenant_id='') ORDER BY uploaded_at DESC",
(tid,),
).fetchall()
return [
ContextDocument(
id=r["id"], filename=r["filename"], doc_type=r["doc_type"],
full_text=r["full_text"], file_size=r["file_size"], uploaded_at=r["uploaded_at"],
)
for r in rows
]
def delete_document(db_path: Path, doc_id: str) -> bool:
tid = resolve_tenant_id()
with get_conn(db_path) as conn:
cursor = conn.execute(
"DELETE FROM context_documents WHERE id=? AND (tenant_id=? OR tenant_id='')",
(doc_id, tid),
)
conn.commit()
return cursor.rowcount > 0
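Every query in this module filters on `(tenant_id=? OR tenant_id='')`. The sketch below shows what that predicate does against an in-memory SQLite table: rows with an empty `tenant_id` are visible to every tenant, while other rows stay private. The table is a simplified stand-in with made-up rows, not the real `context_facts` schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical table: only the columns needed to show the predicate.
conn.execute(
    "CREATE TABLE context_facts ("
    " id TEXT PRIMARY KEY, tenant_id TEXT NOT NULL DEFAULT '',"
    " key TEXT NOT NULL, value TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO context_facts(id, tenant_id, key, value) VALUES (?,?,?,?)",
    [
        ("1", "", "timezone", "UTC"),            # shared: empty tenant_id
        ("2", "alpha", "hostname", "heimdall"),  # private to tenant alpha
        ("3", "beta", "hostname", "bifrost"),    # private to tenant beta
    ],
)


def visible_facts(tenant_id: str) -> list[tuple[str, str]]:
    """Return (key, value) pairs a tenant can see: its own rows plus shared rows."""
    rows = conn.execute(
        "SELECT key, value FROM context_facts"
        " WHERE (tenant_id=? OR tenant_id='') ORDER BY key",
        (tenant_id,),
    ).fetchall()
    return [(key, value) for key, value in rows]


print(visible_facts("alpha"))
# [('hostname', 'heimdall'), ('timezone', 'UTC')]
```

The same predicate appears in `delete_fact` and `delete_document`, so as written any tenant can also delete rows whose `tenant_id` is empty.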

128
app/context/wizard.py Normal file

@ -0,0 +1,128 @@
"""Wizard state machine — MIT (structured Q&A); BSL gate reserved for LLM path."""
from __future__ import annotations
from pathlib import Path
from typing import Any
from app.context.store import add_fact
WIZARD_STEPS: list[dict[str, Any]] = [
{
"step": 1,
"id": "os",
"title": "What operating system is this server running?",
"type": "select",
"options": ["Linux (systemd/journald)", "Linux (other init)", "Other"],
"optional": False,
"help": "This determines which log sources Turnstone can watch.",
},
{
"step": 2,
"id": "hostname",
"title": "What is this server's hostname?",
"type": "text",
"placeholder": "e.g. heimdall.local",
"optional": False,
"help": "Used to label log sources in diagnosis results.",
},
{
"step": 3,
"id": "services",
"title": "Which named systemd services do you want to monitor?",
"type": "text",
"placeholder": "e.g. plex.service, sonarr.service",
"optional": True,
"help": "Comma-separated. Leave blank to collect all journal output.",
},
{
"step": 4,
"id": "docker",
"title": "Are you running Docker or Podman containers?",
"type": "select",
"options": ["Yes — Docker", "Yes — Podman", "No"],
"optional": False,
"help": "Turnstone can tail container log streams directly.",
},
{
"step": 5,
"id": "syslog",
"title": "Do any network devices send syslog UDP to this server?",
"type": "select",
"options": ["Yes — UDP 514", "Yes — custom port", "No"],
"optional": False,
"help": "Routers, switches, and APs can forward logs via UDP syslog.",
},
{
"step": 6,
"id": "syslog_port",
"title": "What UDP port does your syslog device send to?",
"type": "text",
"placeholder": "e.g. 514",
"optional": True,
"condition": {"step_id": "syslog", "value": "Yes — custom port"},
"help": "Only needed if you chose 'custom port' above.",
},
]
TOTAL_STEPS: int = len(WIZARD_STEPS)
def get_schema() -> list[dict[str, Any]]:
"""Return the wizard schema (all steps)."""
return WIZARD_STEPS
def advance_step(session: dict[str, Any], step_id: str, answer: Any) -> dict[str, Any]:
"""Return a new session dict with the answer recorded and current_step incremented."""
answers = {**session.get("answers", {}), step_id: answer}
return {**session, "answers": answers, "current_step": session.get("current_step", 1) + 1}
def is_complete(session: dict[str, Any]) -> bool:
"""Check if wizard session has progressed past all steps."""
return session.get("current_step", 1) > TOTAL_STEPS
def apply_session(db_path: Path, session: dict[str, Any]) -> dict[str, Any]:
"""Write context facts and return source config from a completed wizard session."""
answers: dict[str, Any] = session.get("answers", {})
facts_written = 0
hostname = str(answers.get("hostname") or "this-host")
if answers.get("hostname"):
add_fact(db_path, "host", "hostname", hostname, source="wizard")
facts_written += 1
if answers.get("os"):
add_fact(db_path, "host", "os", str(answers["os"]), source="wizard")
facts_written += 1
sources: list[dict[str, Any]] = []
services_raw = str(answers.get("services") or "")
services = [s.strip() for s in services_raw.split(",") if s.strip()]
if services:
for svc in services:
sources.append({"type": "journald", "id": f"journal:{hostname}:{svc}", "unit": svc})
else:
sources.append({"type": "journald", "id": f"journal:{hostname}"})
docker_answer = str(answers.get("docker") or "No")
if "Docker" in docker_answer:
sources.append({"type": "docker", "id": f"docker:{hostname}"})
elif "Podman" in docker_answer:
sources.append({"type": "docker", "id": f"podman:{hostname}", "runtime": "podman"})
syslog_answer = str(answers.get("syslog") or "No")
if syslog_answer.startswith("Yes"):
try:
port = int(answers.get("syslog_port") or 514)
except (ValueError, TypeError):
port = 514
sources.append({"type": "syslog", "id": f"syslog:{hostname}", "port": port})
return {
"facts_written": facts_written,
"sources": sources,
"source_count": len(sources),
}
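A short walk-through of the session state machine, assuming a caller that submits one answer per step in order. `advance_step` and `is_complete` are copied from the module so the example runs on its own; the starting session shape and the sample answers are assumptions.

```python
from typing import Any

TOTAL_STEPS = 6  # len(WIZARD_STEPS) in the module above


def advance_step(session: dict[str, Any], step_id: str, answer: Any) -> dict[str, Any]:
    """Return a new session dict with the answer recorded and current_step incremented."""
    answers = {**session.get("answers", {}), step_id: answer}
    return {**session, "answers": answers, "current_step": session.get("current_step", 1) + 1}


def is_complete(session: dict[str, Any]) -> bool:
    return session.get("current_step", 1) > TOTAL_STEPS


start: dict[str, Any] = {"current_step": 1, "answers": {}}  # assumed initial shape
session = start
sample_answers = [
    ("os", "Linux (systemd/journald)"),
    ("hostname", "heimdall.local"),
    ("services", "plex.service, sonarr.service"),
    ("docker", "No"),
    ("syslog", "No"),
    ("syslog_port", ""),  # conditional step, submitted blank here
]
for step_id, answer in sample_answers:
    session = advance_step(session, step_id, answer)

print(session["current_step"], is_complete(session))  # 7 True
print(start)  # {'current_step': 1, 'answers': {}}
```

Each call returns a fresh dict, so earlier session snapshots are never mutated. The functions shown do not evaluate the `condition` on step 6, so skipping it when the syslog answer is not "custom port" would be the caller's job.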

36
app/db/__init__.py Normal file

@ -0,0 +1,36 @@
"""Turnstone database abstraction — unified SQLite / Postgres interface.
Public API:
    BACKEND              Backend.SQLITE or Backend.POSTGRES
    get_conn(path)       context manager yielding a DbConn
    resolve_tenant_id()  this node's tenant ID (env or hostname)
    q(sql)               rewrite ? placeholders to %s for Postgres
    frag                 SQL fragment helpers (insert_or_ignore, source_group_expr, ...)
    ensure_schema        idempotent schema init
    close_pool           call during shutdown when using Postgres
"""
from app.db.backend import BACKEND, Backend
from app.db.conn import DbConn, close_pool, get_conn
from app.db.dialect import frag, q
from app.db.schema import (
ensure_context_schema,
ensure_incidents_schema,
ensure_schema,
migrate_incidents_to_dedicated_db,
)
from app.db.tenant import resolve_tenant_id
__all__ = [
"BACKEND",
"Backend",
"DbConn",
"close_pool",
"get_conn",
"frag",
"q",
"ensure_schema",
"ensure_context_schema",
"ensure_incidents_schema",
"migrate_incidents_to_dedicated_db",
"resolve_tenant_id",
]

20
app/db/backend.py Normal file

@ -0,0 +1,20 @@
"""Backend detection — SQLITE (default) or POSTGRES based on DATABASE_URL."""
from __future__ import annotations
import os
from enum import Enum
class Backend(Enum):
SQLITE = "sqlite"
POSTGRES = "postgres"
def _detect() -> Backend:
url = os.environ.get("DATABASE_URL", "")
if url.startswith(("postgresql://", "postgres://", "postgresql+psycopg://")):
return Backend.POSTGRES
return Backend.SQLITE
BACKEND: Backend = _detect()

137
app/db/conn.py Normal file

@ -0,0 +1,137 @@
"""Uniform connection wrapper over sqlite3 and psycopg3.
Usage:
    with get_conn(db_path) as conn:
        conn.execute("SELECT ...", (param,))
        conn.commit()
For Postgres, db_path is ignored; all connections go through the shared pool.
The pool is initialized lazily on first use from DATABASE_URL.
"""
from __future__ import annotations
import logging
import os
import sqlite3
from contextlib import contextmanager
from pathlib import Path
from typing import Any, Generator
from app.db.backend import BACKEND, Backend
logger = logging.getLogger(__name__)
_pool: Any = None # psycopg_pool.ConnectionPool, typed as Any to avoid import-time errors
class _NopCursor:
"""Returned when a PRAGMA or other SQLite-only statement is skipped on Postgres."""
rowcount = 0
def fetchall(self) -> list:
return []
def fetchone(self) -> None:
return None
def __iter__(self):
return iter([])
class DbConn:
"""Wraps a raw sqlite3 or psycopg connection with a uniform execute API.
Row access is always dict-like:
- SQLite: conn.row_factory = sqlite3.Row (supports row["col"] and row[0])
- Postgres: row_factory = dict_row (returns plain dicts)
"""
__slots__ = ("_c", "_backend")
def __init__(self, raw: Any, backend: Backend) -> None:
self._c = raw
self._backend = backend
def _prep(self, sql: str) -> str | None:
"""Return None to skip (PRAGMA on Postgres), else return ready-to-execute SQL."""
stripped = sql.strip()
if self._backend == Backend.POSTGRES and stripped.lower().startswith("pragma"):
return None
if self._backend == Backend.POSTGRES:
return stripped.replace("?", "%s")
return stripped
def execute(self, sql: str, params: Any = ()) -> Any:
prepared = self._prep(sql)
if prepared is None:
return _NopCursor()
return self._c.execute(prepared, params)
def executemany(self, sql: str, params_seq: Any) -> Any:
prepared = self._prep(sql)
if prepared is None:
return _NopCursor()
return self._c.executemany(prepared, params_seq)
def commit(self) -> None:
self._c.commit()
def close(self) -> None:
self._c.close()
def __enter__(self) -> "DbConn":
return self
def __exit__(self, *_: Any) -> None:
self.close()
def _get_pool() -> Any:
global _pool
if _pool is not None:
return _pool
try:
from psycopg_pool import ConnectionPool # type: ignore[import]
url = os.environ["DATABASE_URL"]
_pool = ConnectionPool(url, min_size=2, max_size=10, open=True)
logger.info("Postgres connection pool opened (DATABASE_URL set)")
return _pool
except ImportError as exc:
raise RuntimeError(
"psycopg[binary,pool] is required for Postgres backend. "
"Run: pip install 'psycopg[binary,pool]'"
) from exc
except KeyError:
raise RuntimeError("DATABASE_URL must be set when using Postgres backend") from None
@contextmanager
def get_conn(db_path: Path | None = None) -> Generator[DbConn, None, None]:
"""Yield a DbConn backed by sqlite3 (db_path required) or the Postgres pool."""
if BACKEND == Backend.POSTGRES:
pool = _get_pool()
from psycopg.rows import dict_row # type: ignore[import]
with pool.connection() as raw:
raw.row_factory = dict_row
yield DbConn(raw, BACKEND)
else:
if db_path is None:
raise ValueError("db_path is required for SQLite backend")
raw = sqlite3.connect(str(db_path), timeout=90.0)
raw.row_factory = sqlite3.Row
try:
raw.execute("PRAGMA journal_mode=WAL")
raw.execute("PRAGMA busy_timeout=90000")
raw.execute("PRAGMA foreign_keys=ON")
yield DbConn(raw, BACKEND)
finally:
raw.close()
def close_pool() -> None:
"""Close the Postgres connection pool — call during application shutdown."""
global _pool
if _pool is not None:
_pool.close()
_pool = None
logger.info("Postgres connection pool closed")
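The statement preparation in `DbConn._prep` can be exercised in isolation. The function below is a standalone copy of that logic with the backend passed as a boolean (the `prep` name and the `postgres` flag are illustrative, not part of the module).

```python
from typing import Optional


def prep(sql: str, postgres: bool) -> Optional[str]:
    """Mirror of DbConn._prep: None means 'skip this statement'."""
    stripped = sql.strip()
    if postgres and stripped.lower().startswith("pragma"):
        return None  # PRAGMA is SQLite-only, so it is skipped on Postgres
    if postgres:
        return stripped.replace("?", "%s")  # qmark style to psycopg format style
    return stripped


print(prep("PRAGMA journal_mode=WAL", postgres=True))
# None
print(prep("SELECT * FROM t WHERE a=? AND b=?", postgres=True))
# SELECT * FROM t WHERE a=%s AND b=%s
print(prep("SELECT * FROM t WHERE note='why?'", postgres=True))
# SELECT * FROM t WHERE note='why%s'
```

The last call shows a limit of plain string replacement: a question mark inside a SQL string literal is rewritten as well, so literal text containing `?` is safer passed as a bound parameter.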

93
app/db/dialect.py Normal file

@ -0,0 +1,93 @@
"""Per-backend SQL fragments and placeholder rewriting.
All production SQL should be written with SQLite-style `?` placeholders.
Call q(sql) before passing to execute/executemany; it rewrites to %s for
Postgres and leaves SQLite queries untouched.
"""
from __future__ import annotations
from app.db.backend import BACKEND, Backend
def q(sql: str) -> str:
"""Rewrite ? placeholders to %s for Postgres; no-op for SQLite."""
if BACKEND == Backend.POSTGRES:
return sql.replace("?", "%s")
return sql
class _Fragments:
"""SQL fragments that differ between backends."""
@property
def insert_or_ignore(self) -> str:
return "INSERT" if BACKEND == Backend.POSTGRES else "INSERT OR IGNORE"
@property
def on_conflict_ignore(self) -> str:
# Caller must substitute the column name(s) at use time when using Postgres.
# For log_entries: ON CONFLICT (tenant_id, id) DO NOTHING
# For generic use this property is a no-op sentinel; prefer insert_ignore_entries() with entries_conflict_clause().
return ""
def insert_ignore_entries(self) -> str:
"""Full INSERT ... ON CONFLICT clause for log_entries."""
if BACKEND == Backend.POSTGRES:
return "INSERT INTO log_entries"
return "INSERT OR IGNORE INTO log_entries"
def entries_conflict_clause(self) -> str:
if BACKEND == Backend.POSTGRES:
return "ON CONFLICT (tenant_id, id) DO NOTHING"
return ""
def fingerprint_upsert(self) -> str:
if BACKEND == Backend.POSTGRES:
return (
"INSERT INTO glean_fingerprints (tenant_id, path, mtime, size, gleaned_at)"
" VALUES (%s, %s, %s, %s, %s)"
" ON CONFLICT (tenant_id, path)"
" DO UPDATE SET mtime=EXCLUDED.mtime, size=EXCLUDED.size, gleaned_at=EXCLUDED.gleaned_at"
)
return (
"INSERT OR REPLACE INTO glean_fingerprints (tenant_id, path, mtime, size, gleaned_at)"
" VALUES (?,?,?,?,?)"
)
def source_group_expr(self, col: str = "source_id") -> str:
"""SQL expression that collapses prefix:host:unit → prefix:host stem."""
if BACKEND == Backend.POSTGRES:
return f"""
CASE
WHEN array_length(string_to_array({col}, ':'), 1) >= 3
THEN split_part({col}, ':', 1) || ':' || split_part({col}, ':', 2)
ELSE {col}
END
"""
return f"""
CASE
WHEN INSTR(SUBSTR({col}, INSTR({col}, ':')+1), ':') > 0
THEN SUBSTR({col}, 1,
INSTR({col}, ':')
+ INSTR(SUBSTR({col}, INSTR({col}, ':')+1), ':')
- 1)
ELSE {col}
END
"""
def fts_match_clause(self) -> str:
"""WHERE clause fragment for FTS query. Caller supplies the query param."""
if BACKEND == Backend.POSTGRES:
return "text_tsv @@ websearch_to_tsquery('english', %s)"
return "log_fts MATCH ?"
def fts_rank_expr(self) -> str:
"""ORDER BY expression for FTS rank (best match first). Postgres needs the query twice."""
if BACKEND == Backend.POSTGRES:
# ts_rank returns 0..1 where higher is better; pass the query again as param
return "ts_rank(text_tsv, websearch_to_tsquery('english', %s)) DESC"
# FTS5 rank is negative BM25; ASC = most-negative = best match
return "rank ASC"
frag = _Fragments()
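The SQLite branch of `source_group_expr` is easier to follow when run against sample IDs. The sketch below copies that CASE expression into a standalone helper and groups a few made-up `source_id` values in an in-memory database; the one-column `log_entries` table and the `group_counts` helper are assumptions for the demo.

```python
import sqlite3


def source_group_expr(col: str = "source_id") -> str:
    """SQLite form of the expression: prefix:host:unit collapses to prefix:host."""
    return f"""
        CASE
            WHEN INSTR(SUBSTR({col}, INSTR({col}, ':')+1), ':') > 0
            THEN SUBSTR({col}, 1,
                        INSTR({col}, ':')
                        + INSTR(SUBSTR({col}, INSTR({col}, ':')+1), ':')
                        - 1)
            ELSE {col}
        END
    """


def group_counts(source_ids: list[str]) -> dict[str, int]:
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE log_entries (source_id TEXT NOT NULL)")  # demo table only
        conn.executemany("INSERT INTO log_entries(source_id) VALUES (?)", [(s,) for s in source_ids])
        rows = conn.execute(
            f"SELECT {source_group_expr()} AS grp, COUNT(*) FROM log_entries GROUP BY grp ORDER BY grp"
        ).fetchall()
        return {grp: count for grp, count in rows}
    finally:
        conn.close()


print(group_counts([
    "journal:heimdall:plex.service",
    "journal:heimdall:sonarr.service",
    "docker:heimdall",
    "caddy",
]))
# {'caddy': 1, 'docker:heimdall': 1, 'journal:heimdall': 2}
```

IDs with fewer than two colons fall through the `ELSE` branch unchanged, which is why `docker:heimdall` and `caddy` keep their full value.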

537
app/db/schema.py Normal file

@ -0,0 +1,537 @@
"""Schema creation and idempotent migrations for all Turnstone databases.
Three logical databases (main, context, incidents) map to:
- SQLite: three separate .db files (avoids write-lock contention)
- Postgres: three table-groups in one physical DB (row-level locking makes separation unnecessary)
All ensure_* functions are idempotent: safe to call on every startup.
"""
from __future__ import annotations
import logging
import sqlite3
from pathlib import Path
from app.db.backend import BACKEND, Backend
from app.db.conn import get_conn
logger = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# SQLite DDL — kept as executescript strings (SQLite only)
# ---------------------------------------------------------------------------
_MAIN_SCHEMA_SQLITE = """
CREATE TABLE IF NOT EXISTS log_entries (
id TEXT NOT NULL,
tenant_id TEXT NOT NULL DEFAULT '',
source_id TEXT NOT NULL,
sequence INTEGER NOT NULL,
timestamp_raw TEXT,
timestamp_iso TEXT,
ingest_time TEXT NOT NULL,
severity TEXT,
repeat_count INTEGER DEFAULT 1,
out_of_order INTEGER DEFAULT 0,
matched_patterns TEXT DEFAULT '[]',
text TEXT NOT NULL,
anomaly_score REAL,
anomaly_label TEXT,
anomaly_scored_at TEXT,
ml_score REAL,
ml_label TEXT,
ml_scored_at TEXT,
PRIMARY KEY (tenant_id, id)
);
CREATE INDEX IF NOT EXISTS idx_source ON log_entries(source_id);
CREATE INDEX IF NOT EXISTS idx_tenant_src ON log_entries(tenant_id, source_id);
CREATE INDEX IF NOT EXISTS idx_timestamp ON log_entries(timestamp_iso);
CREATE INDEX IF NOT EXISTS idx_ts_repeat ON log_entries(timestamp_iso, repeat_count);
CREATE INDEX IF NOT EXISTS idx_severity ON log_entries(tenant_id, severity);
CREATE INDEX IF NOT EXISTS idx_patterns ON log_entries(matched_patterns);
CREATE INDEX IF NOT EXISTS idx_anomaly ON log_entries(tenant_id, anomaly_score);
CREATE INDEX IF NOT EXISTS idx_ml_scored ON log_entries(tenant_id, ml_scored_at);
CREATE TABLE IF NOT EXISTS detections (
id TEXT PRIMARY KEY,
tenant_id TEXT NOT NULL DEFAULT '',
entry_id TEXT NOT NULL,
source_id TEXT NOT NULL,
anomaly_label TEXT NOT NULL,
anomaly_score REAL NOT NULL,
severity TEXT NOT NULL,
text TEXT NOT NULL,
timestamp_iso TEXT,
detected_at TEXT NOT NULL,
acknowledged INTEGER NOT NULL DEFAULT 0,
acknowledged_at TEXT,
notes TEXT NOT NULL DEFAULT '',
scorer TEXT NOT NULL DEFAULT 'anomaly'
);
CREATE INDEX IF NOT EXISTS idx_detections_tenant ON detections(tenant_id, detected_at);
CREATE INDEX IF NOT EXISTS idx_detections_ack ON detections(acknowledged);
CREATE INDEX IF NOT EXISTS idx_detections_label ON detections(anomaly_label);
CREATE INDEX IF NOT EXISTS idx_detections_entry ON detections(entry_id);
CREATE INDEX IF NOT EXISTS idx_detections_scorer ON detections(scorer);
CREATE TABLE IF NOT EXISTS glean_fingerprints (
tenant_id TEXT NOT NULL DEFAULT '',
path TEXT NOT NULL,
mtime REAL NOT NULL,
size INTEGER NOT NULL,
gleaned_at TEXT NOT NULL,
PRIMARY KEY (tenant_id, path)
);
CREATE TABLE IF NOT EXISTS incidents (
id TEXT PRIMARY KEY,
tenant_id TEXT NOT NULL DEFAULT '',
label TEXT NOT NULL,
issue_type TEXT NOT NULL DEFAULT '',
started_at TEXT,
ended_at TEXT,
notes TEXT NOT NULL DEFAULT '',
created_at TEXT NOT NULL,
severity TEXT NOT NULL DEFAULT 'medium'
);
CREATE INDEX IF NOT EXISTS idx_incidents_time ON incidents(started_at, ended_at);
CREATE INDEX IF NOT EXISTS idx_incidents_tenant ON incidents(tenant_id);
CREATE TABLE IF NOT EXISTS received_bundles (
id TEXT PRIMARY KEY,
tenant_id TEXT NOT NULL DEFAULT '',
source_host TEXT NOT NULL,
issue_type TEXT NOT NULL DEFAULT '',
label TEXT NOT NULL,
severity TEXT NOT NULL DEFAULT 'medium',
started_at TEXT,
bundled_at TEXT NOT NULL,
entry_count INTEGER NOT NULL DEFAULT 0,
bundle_json TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_bundles_bundled ON received_bundles(bundled_at);
CREATE INDEX IF NOT EXISTS idx_bundles_type ON received_bundles(issue_type);
CREATE TABLE IF NOT EXISTS sent_bundles (
id TEXT PRIMARY KEY,
tenant_id TEXT NOT NULL DEFAULT '',
incident_id TEXT NOT NULL,
exported_at TEXT NOT NULL,
sanitized INTEGER NOT NULL DEFAULT 0,
entry_count INTEGER NOT NULL DEFAULT 0,
bundle_json TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_sent_bundles_incident ON sent_bundles(incident_id);
CREATE INDEX IF NOT EXISTS idx_sent_bundles_time ON sent_bundles(exported_at);
CREATE TABLE IF NOT EXISTS blocklist_candidates (
id TEXT PRIMARY KEY,
tenant_id TEXT NOT NULL DEFAULT '',
domain_or_ip TEXT NOT NULL,
source_device_ip TEXT,
source_device_name TEXT,
first_seen TEXT NOT NULL,
last_seen TEXT NOT NULL,
hit_count INTEGER DEFAULT 1,
status TEXT DEFAULT 'pending',
pushed_at TEXT,
log_evidence TEXT DEFAULT '[]',
matched_rule TEXT,
llm_score REAL,
llm_reason TEXT
);
CREATE INDEX IF NOT EXISTS idx_blocklist_device ON blocklist_candidates(source_device_ip);
CREATE INDEX IF NOT EXISTS idx_blocklist_status ON blocklist_candidates(status);
CREATE INDEX IF NOT EXISTS idx_blocklist_domain ON blocklist_candidates(domain_or_ip);
CREATE INDEX IF NOT EXISTS idx_blocklist_tenant ON blocklist_candidates(tenant_id);
CREATE TABLE IF NOT EXISTS ssh_targets (
id TEXT PRIMARY KEY,
label TEXT NOT NULL,
host TEXT NOT NULL,
port INTEGER NOT NULL DEFAULT 22,
user TEXT NOT NULL,
key_path TEXT NOT NULL,
last_tested TEXT,
last_ok INTEGER DEFAULT NULL,
last_error TEXT,
created_at TEXT NOT NULL,
updated_at TEXT NOT NULL
);
"""
_CONTEXT_SCHEMA_SQLITE = """
CREATE TABLE IF NOT EXISTS context_facts (
id TEXT PRIMARY KEY,
tenant_id TEXT NOT NULL DEFAULT '',
category TEXT NOT NULL,
key TEXT NOT NULL,
value TEXT NOT NULL,
source TEXT,
created_at TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_facts_category ON context_facts(category);
CREATE INDEX IF NOT EXISTS idx_facts_key ON context_facts(key);
CREATE INDEX IF NOT EXISTS idx_facts_tenant ON context_facts(tenant_id);
CREATE TABLE IF NOT EXISTS context_documents (
id TEXT PRIMARY KEY,
tenant_id TEXT NOT NULL DEFAULT '',
filename TEXT NOT NULL,
doc_type TEXT NOT NULL,
full_text TEXT NOT NULL,
file_size INTEGER,
uploaded_at TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_docs_tenant ON context_documents(tenant_id);
CREATE TABLE IF NOT EXISTS context_chunks (
id TEXT PRIMARY KEY,
tenant_id TEXT NOT NULL DEFAULT '',
document_id TEXT NOT NULL REFERENCES context_documents(id) ON DELETE CASCADE,
chunk_index INTEGER NOT NULL,
text TEXT NOT NULL,
embedding BLOB
);
CREATE INDEX IF NOT EXISTS idx_chunks_doc ON context_chunks(document_id);
CREATE INDEX IF NOT EXISTS idx_chunks_tenant ON context_chunks(tenant_id);
"""
# ---------------------------------------------------------------------------
# Postgres DDL — executed statement-by-statement
# ---------------------------------------------------------------------------
_MAIN_SCHEMA_PG_STMTS = [
"""
CREATE TABLE IF NOT EXISTS log_entries (
id TEXT NOT NULL,
tenant_id TEXT NOT NULL DEFAULT '',
source_id TEXT NOT NULL,
sequence INTEGER NOT NULL,
timestamp_raw TEXT,
timestamp_iso TEXT,
ingest_time TEXT NOT NULL,
severity TEXT,
repeat_count INTEGER DEFAULT 1,
out_of_order INTEGER DEFAULT 0,
matched_patterns TEXT DEFAULT '[]',
text TEXT NOT NULL,
text_tsv tsvector,
anomaly_score DOUBLE PRECISION,
anomaly_label TEXT,
anomaly_scored_at TEXT,
ml_score DOUBLE PRECISION,
ml_label TEXT,
ml_scored_at TEXT,
PRIMARY KEY (tenant_id, id)
)
""",
"CREATE INDEX IF NOT EXISTS idx_tenant_src ON log_entries(tenant_id, source_id)",
"CREATE INDEX IF NOT EXISTS idx_timestamp ON log_entries(timestamp_iso)",
"CREATE INDEX IF NOT EXISTS idx_severity ON log_entries(tenant_id, severity)",
"CREATE INDEX IF NOT EXISTS idx_patterns ON log_entries(matched_patterns)",
"CREATE INDEX IF NOT EXISTS idx_fts_gin ON log_entries USING GIN(text_tsv)",
"CREATE INDEX IF NOT EXISTS idx_anomaly ON log_entries(tenant_id, anomaly_score)",
"CREATE INDEX IF NOT EXISTS idx_ml_scored ON log_entries(tenant_id, ml_scored_at)",
"""
CREATE TABLE IF NOT EXISTS detections (
id TEXT PRIMARY KEY,
tenant_id TEXT NOT NULL DEFAULT '',
entry_id TEXT NOT NULL,
source_id TEXT NOT NULL,
anomaly_label TEXT NOT NULL,
anomaly_score DOUBLE PRECISION NOT NULL,
severity TEXT NOT NULL,
text TEXT NOT NULL,
timestamp_iso TEXT,
detected_at TEXT NOT NULL,
acknowledged INTEGER NOT NULL DEFAULT 0,
acknowledged_at TEXT,
notes TEXT NOT NULL DEFAULT '',
scorer TEXT NOT NULL DEFAULT 'anomaly'
)
""",
"CREATE INDEX IF NOT EXISTS idx_detections_tenant ON detections(tenant_id, detected_at)",
"CREATE INDEX IF NOT EXISTS idx_detections_ack ON detections(acknowledged)",
"CREATE INDEX IF NOT EXISTS idx_detections_label ON detections(anomaly_label)",
"CREATE INDEX IF NOT EXISTS idx_detections_entry ON detections(entry_id)",
"CREATE INDEX IF NOT EXISTS idx_detections_scorer ON detections(scorer)",
"""
CREATE OR REPLACE FUNCTION _ts_update_text_tsv() RETURNS trigger AS $$
BEGIN
NEW.text_tsv := to_tsvector('english', COALESCE(NEW.text, ''));
RETURN NEW;
END;
$$ LANGUAGE plpgsql
""",
"""
DO $$ BEGIN
IF NOT EXISTS (
SELECT 1 FROM pg_trigger WHERE tgname = 'trig_log_entries_tsv'
) THEN
CREATE TRIGGER trig_log_entries_tsv
BEFORE INSERT OR UPDATE OF text ON log_entries
FOR EACH ROW EXECUTE FUNCTION _ts_update_text_tsv();
END IF;
END $$
""",
"""
CREATE TABLE IF NOT EXISTS glean_fingerprints (
tenant_id TEXT NOT NULL DEFAULT '',
path TEXT NOT NULL,
mtime DOUBLE PRECISION NOT NULL,
size BIGINT NOT NULL,
gleaned_at TEXT NOT NULL,
PRIMARY KEY (tenant_id, path)
)
""",
"""
CREATE TABLE IF NOT EXISTS incidents (
id TEXT PRIMARY KEY,
tenant_id TEXT NOT NULL DEFAULT '',
label TEXT NOT NULL,
issue_type TEXT NOT NULL DEFAULT '',
started_at TEXT,
ended_at TEXT,
notes TEXT NOT NULL DEFAULT '',
created_at TEXT NOT NULL,
severity TEXT NOT NULL DEFAULT 'medium'
)
""",
"CREATE INDEX IF NOT EXISTS idx_incidents_time ON incidents(started_at, ended_at)",
"CREATE INDEX IF NOT EXISTS idx_incidents_tenant ON incidents(tenant_id)",
"""
CREATE TABLE IF NOT EXISTS received_bundles (
id TEXT PRIMARY KEY,
tenant_id TEXT NOT NULL DEFAULT '',
source_host TEXT NOT NULL,
issue_type TEXT NOT NULL DEFAULT '',
label TEXT NOT NULL,
severity TEXT NOT NULL DEFAULT 'medium',
started_at TEXT,
bundled_at TEXT NOT NULL,
entry_count INTEGER NOT NULL DEFAULT 0,
bundle_json TEXT NOT NULL
)
""",
"CREATE INDEX IF NOT EXISTS idx_bundles_bundled ON received_bundles(bundled_at)",
"CREATE INDEX IF NOT EXISTS idx_bundles_type ON received_bundles(issue_type)",
"""
CREATE TABLE IF NOT EXISTS sent_bundles (
id TEXT PRIMARY KEY,
tenant_id TEXT NOT NULL DEFAULT '',
incident_id TEXT NOT NULL,
exported_at TEXT NOT NULL,
sanitized INTEGER NOT NULL DEFAULT 0,
entry_count INTEGER NOT NULL DEFAULT 0,
bundle_json TEXT NOT NULL
)
""",
"CREATE INDEX IF NOT EXISTS idx_sent_bundles_incident ON sent_bundles(incident_id)",
"CREATE INDEX IF NOT EXISTS idx_sent_bundles_time ON sent_bundles(exported_at)",
"""
CREATE TABLE IF NOT EXISTS blocklist_candidates (
id TEXT PRIMARY KEY,
tenant_id TEXT NOT NULL DEFAULT '',
domain_or_ip TEXT NOT NULL,
source_device_ip TEXT,
source_device_name TEXT,
first_seen TEXT NOT NULL,
last_seen TEXT NOT NULL,
hit_count INTEGER DEFAULT 1,
status TEXT DEFAULT 'pending',
pushed_at TEXT,
log_evidence TEXT DEFAULT '[]',
matched_rule TEXT,
llm_score DOUBLE PRECISION,
llm_reason TEXT
)
""",
"CREATE INDEX IF NOT EXISTS idx_blocklist_device ON blocklist_candidates(source_device_ip)",
"CREATE INDEX IF NOT EXISTS idx_blocklist_status ON blocklist_candidates(status)",
"CREATE INDEX IF NOT EXISTS idx_blocklist_domain ON blocklist_candidates(domain_or_ip)",
"CREATE INDEX IF NOT EXISTS idx_blocklist_tenant ON blocklist_candidates(tenant_id)",
]
_CONTEXT_SCHEMA_PG_STMTS = [
"""
CREATE TABLE IF NOT EXISTS context_facts (
id TEXT PRIMARY KEY,
tenant_id TEXT NOT NULL DEFAULT '',
category TEXT NOT NULL,
key TEXT NOT NULL,
value TEXT NOT NULL,
source TEXT,
created_at TEXT NOT NULL
)
""",
"CREATE INDEX IF NOT EXISTS idx_facts_category ON context_facts(category)",
"CREATE INDEX IF NOT EXISTS idx_facts_key ON context_facts(key)",
"CREATE INDEX IF NOT EXISTS idx_facts_tenant ON context_facts(tenant_id)",
"""
CREATE TABLE IF NOT EXISTS context_documents (
id TEXT PRIMARY KEY,
tenant_id TEXT NOT NULL DEFAULT '',
filename TEXT NOT NULL,
doc_type TEXT NOT NULL,
full_text TEXT NOT NULL,
file_size BIGINT,
uploaded_at TEXT NOT NULL
)
""",
"CREATE INDEX IF NOT EXISTS idx_docs_tenant ON context_documents(tenant_id)",
"""
CREATE TABLE IF NOT EXISTS context_chunks (
id TEXT PRIMARY KEY,
tenant_id TEXT NOT NULL DEFAULT '',
document_id TEXT NOT NULL REFERENCES context_documents(id) ON DELETE CASCADE,
chunk_index INTEGER NOT NULL,
text TEXT NOT NULL,
embedding BYTEA
)
""",
"CREATE INDEX IF NOT EXISTS idx_chunks_doc ON context_chunks(document_id)",
"CREATE INDEX IF NOT EXISTS idx_chunks_tenant ON context_chunks(tenant_id)",
]
# ---------------------------------------------------------------------------
# SQLite additive column migrations — applied after CREATE TABLE on every boot
# ---------------------------------------------------------------------------
_MAIN_MIGRATIONS_SQLITE = [
"ALTER TABLE log_entries ADD COLUMN tenant_id TEXT NOT NULL DEFAULT ''",
"ALTER TABLE incidents ADD COLUMN issue_type TEXT NOT NULL DEFAULT ''",
"ALTER TABLE incidents ADD COLUMN tenant_id TEXT NOT NULL DEFAULT ''",
"ALTER TABLE received_bundles ADD COLUMN tenant_id TEXT NOT NULL DEFAULT ''",
"ALTER TABLE sent_bundles ADD COLUMN tenant_id TEXT NOT NULL DEFAULT ''",
"ALTER TABLE blocklist_candidates ADD COLUMN tenant_id TEXT NOT NULL DEFAULT ''",
"ALTER TABLE glean_fingerprints ADD COLUMN tenant_id TEXT NOT NULL DEFAULT ''",
"ALTER TABLE glean_fingerprints ADD COLUMN mtime REAL",
"ALTER TABLE glean_fingerprints ADD COLUMN size INTEGER",
"ALTER TABLE glean_fingerprints ADD COLUMN gleaned_at TEXT",
"ALTER TABLE log_entries ADD COLUMN anomaly_score REAL",
"ALTER TABLE log_entries ADD COLUMN anomaly_label TEXT",
"ALTER TABLE log_entries ADD COLUMN anomaly_scored_at TEXT",
"ALTER TABLE log_entries ADD COLUMN ml_score REAL",
"ALTER TABLE log_entries ADD COLUMN ml_label TEXT",
"ALTER TABLE log_entries ADD COLUMN ml_scored_at TEXT",
"ALTER TABLE detections ADD COLUMN scorer TEXT NOT NULL DEFAULT 'anomaly'",
"ALTER TABLE log_entries ADD COLUMN anonymized INTEGER DEFAULT NULL",
]
_CONTEXT_MIGRATIONS_SQLITE = [
"ALTER TABLE context_facts ADD COLUMN tenant_id TEXT NOT NULL DEFAULT ''",
"ALTER TABLE context_documents ADD COLUMN tenant_id TEXT NOT NULL DEFAULT ''",
"ALTER TABLE context_chunks ADD COLUMN tenant_id TEXT NOT NULL DEFAULT ''",
]
def _run_sqlite_migrations(conn: sqlite3.Connection, stmts: list[str]) -> None:
for stmt in stmts:
try:
conn.execute(stmt)
except sqlite3.OperationalError:
pass # column already exists or table not present yet — both are fine
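The migration runner above depends on SQLite raising `sqlite3.OperationalError` when a column already exists, which is what makes re-running the same `ALTER TABLE` list on every boot harmless. A minimal sketch of that behavior against an in-memory database follows; the `demo` table and the `run_additive_migrations` name are illustrative assumptions, not part of the codebase.

```python
import sqlite3


def run_additive_migrations(conn: sqlite3.Connection, stmts: list[str]) -> int:
    """Apply ALTER TABLE statements; return how many actually changed the schema."""
    applied = 0
    for stmt in stmts:
        try:
            conn.execute(stmt)
            applied += 1
        except sqlite3.OperationalError:
            # Typically "duplicate column name" or "no such table"; both are skipped.
            continue
    return applied


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id TEXT PRIMARY KEY)")  # hypothetical table
stmts = ["ALTER TABLE demo ADD COLUMN tenant_id TEXT NOT NULL DEFAULT ''"]

first_run = run_additive_migrations(conn, stmts)   # column is added
second_run = run_additive_migrations(conn, stmts)  # duplicate column is skipped
columns = [row[1] for row in conn.execute("PRAGMA table_info(demo)")]
```

One trade-off of catching every `OperationalError` is that a genuinely malformed statement is skipped as quietly as a duplicate column, so a typo in the migration list would go unnoticed.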
def _run_pg_stmts(stmts: list[str]) -> None:
"""Execute Postgres DDL statements — each in its own transaction for IF NOT EXISTS safety."""
from psycopg import connect as pg_connect # type: ignore[import]
import os
url = os.environ["DATABASE_URL"]
with pg_connect(url, autocommit=True) as conn:
for stmt in stmts:
stripped = stmt.strip()
if stripped:
conn.execute(stripped)
# ---------------------------------------------------------------------------
# Public API
# ---------------------------------------------------------------------------
def ensure_schema(db_path: Path) -> None:
"""Ensure main log/incidents/blocklist tables exist. Idempotent."""
if BACKEND == Backend.POSTGRES:
_run_pg_stmts(_MAIN_SCHEMA_PG_STMTS)
logger.debug("Postgres main schema verified")
return
conn = sqlite3.connect(str(db_path), timeout=30.0)
conn.execute("PRAGMA journal_mode=WAL")
# Migrations first: add tenant_id to existing tables BEFORE index creation touches it
_run_sqlite_migrations(conn, _MAIN_MIGRATIONS_SQLITE)
conn.commit()
conn.executescript(_MAIN_SCHEMA_SQLITE)
conn.close()
logger.debug("SQLite main schema verified at %s", db_path)
def ensure_context_schema(db_path: Path) -> None:
"""Ensure context KB tables exist. Idempotent."""
if BACKEND == Backend.POSTGRES:
_run_pg_stmts(_CONTEXT_SCHEMA_PG_STMTS)
logger.debug("Postgres context schema verified")
return
conn = sqlite3.connect(str(db_path), timeout=30.0)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA foreign_keys=ON")
_run_sqlite_migrations(conn, _CONTEXT_MIGRATIONS_SQLITE)
conn.commit()
conn.executescript(_CONTEXT_SCHEMA_SQLITE)
conn.close()
logger.debug("SQLite context schema verified at %s", db_path)
def migrate_incidents_to_dedicated_db(main_db: Path, incidents_db: Path) -> int:
"""One-shot migration: copy incidents/bundles rows from main DB to incidents DB.
Safe to call on every startup: rows already in incidents_db are skipped.
No-op for Postgres (single DB, no migration needed).
"""
if BACKEND == Backend.POSTGRES:
return 0
src = sqlite3.connect(str(main_db), timeout=30.0)
src.row_factory = sqlite3.Row
dst = sqlite3.connect(str(incidents_db), timeout=30.0)
migrated = 0
for table in ("incidents", "received_bundles", "sent_bundles"):
try:
rows = src.execute(f"SELECT * FROM {table}").fetchall() # noqa: S608
except sqlite3.OperationalError:
continue
if not rows:
continue
cols = ", ".join(rows[0].keys())
placeholders = ", ".join("?" * len(rows[0].keys()))
dst.executemany(
f"INSERT OR IGNORE INTO {table} ({cols}) VALUES ({placeholders})", # noqa: S608
[tuple(r) for r in rows],
)
migrated += len(rows)
dst.commit()
src.close()
dst.close()
return migrated
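The copy step above can be sketched with two in-memory databases. This is an illustration under assumptions: the two-column `incidents` table is a cut-down stand-in, and `copy_rows` is a hypothetical helper. Unlike the function above, which adds `len(rows)` for every source row, the sketch uses `total_changes` to count only rows that were actually written.

```python
import sqlite3


def copy_rows(src: sqlite3.Connection, dst: sqlite3.Connection, table: str) -> int:
    """Copy every row of *table* from src to dst; return rows actually inserted."""
    rows = src.execute(f"SELECT * FROM {table}").fetchall()
    if not rows:
        return 0
    cols = ", ".join(rows[0].keys())
    placeholders = ", ".join("?" * len(rows[0].keys()))
    before = dst.total_changes
    dst.executemany(
        f"INSERT OR IGNORE INTO {table} ({cols}) VALUES ({placeholders})",
        [tuple(r) for r in rows],
    )
    dst.commit()
    # Rows skipped by OR IGNORE do not increase total_changes.
    return dst.total_changes - before


ddl = "CREATE TABLE incidents (id TEXT PRIMARY KEY, label TEXT NOT NULL)"
src = sqlite3.connect(":memory:")
src.row_factory = sqlite3.Row  # needed so rows[0].keys() yields column names
dst = sqlite3.connect(":memory:")
src.execute(ddl)
dst.execute(ddl)
src.executemany(
    "INSERT INTO incidents VALUES (?, ?)",
    [("i1", "disk full"), ("i2", "oom")],
)
src.commit()

first_pass = copy_rows(src, dst, "incidents")
second_pass = copy_rows(src, dst, "incidents")
total = dst.execute("SELECT COUNT(*) FROM incidents").fetchone()[0]
```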
def ensure_incidents_schema(db_path: Path) -> None:
"""Ensure incidents/bundles tables exist. Idempotent.
For Postgres, incidents live in the same DB as log_entries (already created by
ensure_schema), so this is a no-op: the tables were created above.
"""
if BACKEND == Backend.POSTGRES:
return
conn = sqlite3.connect(str(db_path), timeout=30.0)
conn.execute("PRAGMA journal_mode=WAL")
_run_sqlite_migrations(conn, _MAIN_MIGRATIONS_SQLITE)
conn.commit()
conn.executescript(_MAIN_SCHEMA_SQLITE)
conn.close()
logger.debug("SQLite incidents schema verified at %s", db_path)

12
app/db/tenant.py Normal file

@ -0,0 +1,12 @@
"""Tenant ID resolution — TURNSTONE_TENANT_ID env var, hostname fallback."""
from __future__ import annotations
import os
import socket
from functools import lru_cache
@lru_cache(maxsize=1)
def resolve_tenant_id() -> str:
"""Return this node's tenant ID. Result is cached after first call."""
return os.environ.get("TURNSTONE_TENANT_ID") or socket.gethostname()
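Because the resolver is wrapped in `lru_cache`, a later change to `TURNSTONE_TENANT_ID` in the same process is not picked up until the cache is cleared. The sketch below reproduces the function to show that effect; the tenant values are made up for illustration.

```python
import os
import socket
from functools import lru_cache


@lru_cache(maxsize=1)
def resolve_tenant_id() -> str:
    """Env var wins; hostname is the fallback. Cached after the first call."""
    return os.environ.get("TURNSTONE_TENANT_ID") or socket.gethostname()


os.environ["TURNSTONE_TENANT_ID"] = "tenant-a"
resolve_tenant_id.cache_clear()
first = resolve_tenant_id()

os.environ["TURNSTONE_TENANT_ID"] = "tenant-b"
stale = resolve_tenant_id()       # still the cached value

resolve_tenant_id.cache_clear()   # e.g. in a test fixture
fresh = resolve_tenant_id()
```

Tests that patch the environment variable would likely need the `cache_clear()` call shown here.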

0
app/glean/__init__.py Normal file

View file

@ -33,6 +33,7 @@ def load_patterns(path: Path) -> list[LogPattern]:
pattern=p["pattern"],
severity=p["severity"],
description=p["description"],
+domain=p.get("domain", ""),
)
for p in raw.get("patterns", [])
]
@ -42,6 +43,11 @@ def _compile(patterns: list[LogPattern]) -> list[tuple[LogPattern, re.Pattern]]:
return [(p, re.compile(p.pattern, re.IGNORECASE)) for p in patterns]
+def load_compiled_patterns(path: Path) -> list[tuple[LogPattern, object]]:
+"""Load and compile patterns from a YAML file. Public API over the private _compile."""
+return _compile(load_patterns(path))
def apply_patterns(
text: str,
compiled: list[tuple[LogPattern, re.Pattern]],

View file

@ -4,7 +4,7 @@ from __future__ import annotations
import json
from typing import Iterator
-from app.ingest.base import (
+from app.glean.base import (
SourceState, apply_patterns, epoch_float_to_iso,
make_entry_id, now_iso,
)

View file

@ -18,7 +18,7 @@ import re
from datetime import datetime, timezone
from typing import Iterator
-from app.ingest.base import (
+from app.glean.base import (
SourceState, apply_patterns, detect_severity, make_entry_id, now_iso,
)
from app.services.models import LogPattern, RetrievedEntry

42
app/glean/doc_upload.py Normal file

@ -0,0 +1,42 @@
"""Upload adapter: processes file bytes and writes to context store — MIT licensed."""
from __future__ import annotations
import uuid
from pathlib import Path
from typing import Any
from app.context.chunker import process_upload
from app.context.store import add_document, add_fact
from app.db import get_conn, resolve_tenant_id
def glean_upload(db_path: Path, filename: str, content: bytes) -> dict[str, Any]:
"""Process an uploaded file and write to context store. Returns result summary."""
doc_type, facts, chunks = process_upload(filename, content)
tid = resolve_tenant_id()
doc = add_document(
db_path,
filename=filename,
doc_type=doc_type,
full_text=content.decode("utf-8", errors="replace"),
file_size=len(content),
)
for fact in facts:
add_fact(db_path, fact.category, fact.key, fact.value, source="upload")
with get_conn(db_path) as conn:
for i, chunk_text in enumerate(chunks):
conn.execute(
"INSERT INTO context_chunks(id, tenant_id, document_id, chunk_index, text) VALUES (?,?,?,?,?)",
(str(uuid.uuid4()), tid, doc.id, i, chunk_text),
)
conn.commit()
return {
"document_id": doc.id,
"doc_type": doc_type,
"facts_written": len(facts),
"chunks_written": len(chunks),
}

View file

@ -4,7 +4,7 @@ from __future__ import annotations
import json
from typing import Iterator
-from app.ingest.base import (
+from app.glean.base import (
SourceState, apply_patterns, detect_severity,
make_entry_id, now_iso,
)

View file

@ -4,7 +4,7 @@ from __future__ import annotations
import json
from typing import Iterator
-from app.ingest.base import (
+from app.glean.base import (
SourceState, apply_patterns, epoch_micros_to_iso,
make_entry_id, now_iso, SYSLOG_PRIORITY,
)

View file

@ -0,0 +1,166 @@
"""Live MQTT glean subscriber for Turnstone.
Reads ``type: mqtt`` entries from sources.yaml and subscribes to each broker
in the background. Incoming messages are normalized to RetrievedEntry and
written to the Turnstone SQLite database as they arrive.
This runs as an asyncio task alongside the batch glean scheduler. It is
started from the FastAPI lifespan in rest.py.
MQTT source config format in sources.yaml::
sources:
- id: meshtastic-home
type: mqtt
broker_host: 10.1.10.5
broker_port: 1883 # optional, default 1883
broker_username: ~ # optional
broker_password: ~ # optional
topics:
- msh/# # one or more topic patterns
severity: INFO # optional default severity for all messages
- id: iot-sensors
type: mqtt
broker_host: localhost
topics:
- home/+/temperature
- home/+/humidity
"""
from __future__ import annotations
import asyncio
import hashlib
import json
import logging
import sqlite3
from datetime import datetime, timezone
from pathlib import Path
import yaml
from app.services.models import RetrievedEntry
logger = logging.getLogger(__name__)
def _load_mqtt_sources(sources_file: Path) -> list[dict]:
"""Return only the ``type: mqtt`` entries from sources.yaml."""
if not sources_file.exists():
return []
with sources_file.open() as f:
data = yaml.safe_load(f) or {}
return [s for s in data.get("sources", []) if s.get("type") == "mqtt"]
def _make_entry_id(source_id: str, seq: int, text: str) -> str:
h = hashlib.sha1(f"{source_id}:{seq}:{text}".encode()).hexdigest()[:16]
return f"{source_id}:{seq}:{h}"
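The ID scheme above is deterministic: the same source, sequence number, and text always produce the same ID. A short sketch of what that implies follows, with made-up sample messages. One consequence worth noting, as an observation rather than a confirmed bug: since the subscriber's sequence counter starts at zero in each process, a message repeated at the same position after a restart would map to an existing ID and be dropped by `INSERT OR IGNORE`.

```python
import hashlib


def make_entry_id(source_id: str, seq: int, text: str) -> str:
    """Same construction as above: source, sequence, and a 16-hex-char SHA-1 prefix."""
    h = hashlib.sha1(f"{source_id}:{seq}:{text}".encode()).hexdigest()[:16]
    return f"{source_id}:{seq}:{h}"


a = make_entry_id("iot-sensors", 1, "temp=21.5")
b = make_entry_id("iot-sensors", 1, "temp=21.5")   # identical inputs
c = make_entry_id("iot-sensors", 1, "temp=22.0")   # same seq, different text
suffix = a.rsplit(":", 1)[1]
```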
def _write_entry(db_path: Path, entry: RetrievedEntry) -> None:
with sqlite3.connect(db_path, timeout=30.0) as conn:
conn.execute(
"""
INSERT OR IGNORE INTO log_entries
(id, source_id, sequence, timestamp_raw, timestamp_iso,
ingest_time, severity, repeat_count, out_of_order,
matched_patterns, text)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""",
(
entry.entry_id,
entry.source_id,
entry.sequence,
entry.timestamp_raw,
entry.timestamp_iso,
entry.ingest_time,
entry.severity,
entry.repeat_count,
1 if entry.out_of_order else 0,
json.dumps(entry.matched_patterns),
entry.text,
),
)
async def _run_source_subscriber(source: dict, db_path: Path) -> None:
"""Maintain a subscription to one MQTT source, reconnecting on error."""
try:
from circuitforge_core.mqtt import MQTTClient, MQTTConfig
except ImportError:
logger.error(
"circuitforge-core[mqtt] is not installed — MQTT source %r skipped. "
"Run: pip install circuitforge-core[mqtt]",
source.get("id"),
)
return
source_id: str = source["id"]
host: str = source["broker_host"]
port: int = int(source.get("broker_port", 1883))
username: str | None = source.get("broker_username") or source.get("username")
password: str | None = source.get("broker_password") or source.get("password")
topics: list[str] = source.get("topics", ["#"])
default_severity: str = source.get("severity", "INFO")
cfg = MQTTConfig(
host=host,
port=port,
username=username,
password=password,
client_id=f"turnstone-{source_id}",
)
client = MQTTClient(cfg)
seq = 0
for topic in topics:
@client.on(topic)
async def _handle(msg, _src=source_id, _sev=default_severity):
nonlocal seq
seq += 1
now = datetime.now(tz=timezone.utc).isoformat()
text = msg.text()
entry = RetrievedEntry(
entry_id=_make_entry_id(_src, seq, text),
source_id=_src,
sequence=seq,
timestamp_raw=now,
timestamp_iso=now,
ingest_time=now,
severity=_sev,
repeat_count=1,
out_of_order=False,
matched_patterns=[],
text=f"[{msg.topic}] {text}",
)
_write_entry(db_path, entry)
logger.debug("MQTT[%s] %s: %s", _src, msg.topic, text[:120])
logger.info("MQTT subscriber starting: %s -> %s:%d topics=%s", source_id, host, port, topics)
await client.run()
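The defaults the subscriber applies to a source entry (port 1883, topic `#`, severity `INFO`) can be sketched on plain dictionaries, without a broker or a YAML parser. The helper names and the non-MQTT `file` entry are assumptions for illustration only.

```python
from typing import Any


def mqtt_sources(config: dict[str, Any]) -> list[dict[str, Any]]:
    """Keep only the entries whose type is 'mqtt'."""
    return [s for s in config.get("sources", []) if s.get("type") == "mqtt"]


def connection_settings(source: dict[str, Any]) -> dict[str, Any]:
    """Apply the same fallbacks the subscriber uses for optional keys."""
    return {
        "id": source["id"],
        "host": source["broker_host"],
        "port": int(source.get("broker_port", 1883)),
        "topics": source.get("topics", ["#"]),
        "severity": source.get("severity", "INFO"),
    }


config = {
    "sources": [
        {"id": "meshtastic-home", "type": "mqtt",
         "broker_host": "10.1.10.5", "topics": ["msh/#"]},
        # Hypothetical non-MQTT entry; it should be filtered out.
        {"id": "nginx", "type": "file", "path": "/var/log/nginx/access.log"},
    ]
}
settings = [connection_settings(s) for s in mqtt_sources(config)]
```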
async def run_mqtt_subscribers(sources_file: Path, db_path: Path) -> None:
"""Start one subscriber task per MQTT source. Runs until cancelled."""
sources = _load_mqtt_sources(sources_file)
if not sources:
logger.debug("No MQTT sources configured in %s", sources_file)
return
logger.info("Starting %d MQTT subscriber(s)", len(sources))
tasks = [
asyncio.create_task(
_run_source_subscriber(src, db_path),
name=f"mqtt-{src.get('id', i)}",
)
for i, src in enumerate(sources)
]
try:
await asyncio.gather(*tasks)
except asyncio.CancelledError:
for t in tasks:
t.cancel()
await asyncio.gather(*tasks, return_exceptions=True)
raise
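The supervisor pattern above (gather the child tasks, cancel them on `CancelledError`, wait for them to finish, then re-raise) can be exercised without any MQTT dependency. In this sketch the `worker` coroutine is a stand-in that sleeps instead of subscribing; the names are assumptions for illustration.

```python
import asyncio


async def worker(name: str, stopped: list[str]) -> None:
    """Stand-in for a subscriber: runs until cancelled, then records its shutdown."""
    try:
        await asyncio.sleep(3600)
    except asyncio.CancelledError:
        stopped.append(name)
        raise


async def run_all(names: list[str], stopped: list[str]) -> None:
    tasks = [
        asyncio.create_task(worker(n, stopped), name=f"mqtt-{n}") for n in names
    ]
    try:
        await asyncio.gather(*tasks)
    except asyncio.CancelledError:
        for t in tasks:
            t.cancel()
        # Wait for every child to unwind before propagating the cancellation.
        await asyncio.gather(*tasks, return_exceptions=True)
        raise


async def main() -> list[str]:
    stopped: list[str] = []
    supervisor = asyncio.create_task(run_all(["a", "b"], stopped))
    await asyncio.sleep(0.01)  # let the workers start
    supervisor.cancel()
    try:
        await supervisor
    except asyncio.CancelledError:
        pass
    return sorted(stopped)


result = asyncio.run(main())
```

Re-raising `CancelledError` after cleanup keeps the supervisor task itself marked as cancelled, which is what a lifespan shutdown handler would normally expect.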

641
app/glean/pipeline.py Normal file

@ -0,0 +1,641 @@
"""Glean pipeline: auto-detect format, parse, write to SQLite or Postgres."""
from __future__ import annotations
import json
import logging
import re
import sqlite3 # still used in migrate_incidents_to_dedicated_db (SQLite-only migration)
from pathlib import Path
from typing import Any, Iterator
from app.db import (
frag,
get_conn,
resolve_tenant_id,
)
from app.db.schema import (
ensure_context_schema,
ensure_incidents_schema,
ensure_schema,
migrate_incidents_to_dedicated_db,
)
import yaml
from app.glean import caddy, dmesg_log, docker_log, journald, plaintext, plex, qbittorrent, servarr, syslog, wazuh
from app.glean.base import _compile, load_patterns, now_iso
from app.glean.ssh import (
SSHTransport,
SSHConnectionError,
SSHCommandError,
_build_docker_command,
_build_journald_command,
_build_plaintext_command,
_build_syslog_command,
)
from app.services.models import LogPattern, RetrievedEntry
from app.services.search import build_fts_index
logger = logging.getLogger(__name__)
_SCHEMA = """
CREATE TABLE IF NOT EXISTS log_entries (
id TEXT PRIMARY KEY,
source_id TEXT NOT NULL,
sequence INTEGER NOT NULL,
timestamp_raw TEXT,
timestamp_iso TEXT,
ingest_time TEXT NOT NULL,
severity TEXT,
repeat_count INTEGER DEFAULT 1,
out_of_order INTEGER DEFAULT 0,
matched_patterns TEXT DEFAULT '[]',
text TEXT NOT NULL,
anonymized INTEGER DEFAULT NULL
);
CREATE INDEX IF NOT EXISTS idx_source ON log_entries(source_id);
CREATE INDEX IF NOT EXISTS idx_timestamp ON log_entries(timestamp_iso);
CREATE INDEX IF NOT EXISTS idx_ts_repeat ON log_entries(timestamp_iso, repeat_count);
CREATE INDEX IF NOT EXISTS idx_severity ON log_entries(severity);
CREATE INDEX IF NOT EXISTS idx_patterns ON log_entries(matched_patterns);
-- incidents tables moved to ensure_incidents_schema() / INCIDENTS_DB_PATH
-- kept here as no-ops so legacy single-file deployments still work
CREATE TABLE IF NOT EXISTS incidents (
id TEXT PRIMARY KEY,
label TEXT NOT NULL,
issue_type TEXT NOT NULL DEFAULT '',
started_at TEXT,
ended_at TEXT,
notes TEXT NOT NULL DEFAULT '',
created_at TEXT NOT NULL,
severity TEXT NOT NULL DEFAULT 'medium'
);
CREATE TABLE IF NOT EXISTS received_bundles (
id TEXT PRIMARY KEY,
source_host TEXT NOT NULL,
issue_type TEXT NOT NULL DEFAULT '',
label TEXT NOT NULL,
severity TEXT NOT NULL DEFAULT 'medium',
started_at TEXT,
bundled_at TEXT NOT NULL,
entry_count INTEGER NOT NULL DEFAULT 0,
bundle_json TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS sent_bundles (
id TEXT PRIMARY KEY,
incident_id TEXT NOT NULL,
exported_at TEXT NOT NULL,
sanitized INTEGER NOT NULL DEFAULT 0,
entry_count INTEGER NOT NULL DEFAULT 0,
bundle_json TEXT NOT NULL
);
-- context tables moved to ensure_context_schema() / CONTEXT_DB_PATH
-- kept here as no-ops so legacy single-file deployments still work
CREATE TABLE IF NOT EXISTS context_facts (
id TEXT PRIMARY KEY,
category TEXT NOT NULL,
key TEXT NOT NULL,
value TEXT NOT NULL,
source TEXT,
created_at TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_facts_category ON context_facts(category);
CREATE INDEX IF NOT EXISTS idx_facts_key ON context_facts(key);
CREATE TABLE IF NOT EXISTS context_documents (
id TEXT PRIMARY KEY,
filename TEXT NOT NULL,
doc_type TEXT NOT NULL,
full_text TEXT NOT NULL,
file_size INTEGER,
uploaded_at TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS context_chunks (
id TEXT PRIMARY KEY,
document_id TEXT NOT NULL REFERENCES context_documents(id) ON DELETE CASCADE,
chunk_index INTEGER NOT NULL,
text TEXT NOT NULL,
embedding BLOB
);
CREATE INDEX IF NOT EXISTS idx_chunks_doc ON context_chunks(document_id);
CREATE TABLE IF NOT EXISTS blocklist_candidates (
id TEXT PRIMARY KEY,
domain_or_ip TEXT NOT NULL,
source_device_ip TEXT,
source_device_name TEXT,
first_seen TEXT NOT NULL,
last_seen TEXT NOT NULL,
hit_count INTEGER DEFAULT 1,
status TEXT DEFAULT 'pending',
pushed_at TEXT,
log_evidence TEXT DEFAULT '[]',
matched_rule TEXT,
llm_score REAL,
llm_reason TEXT
);
CREATE INDEX IF NOT EXISTS idx_blocklist_device ON blocklist_candidates(source_device_ip);
CREATE INDEX IF NOT EXISTS idx_blocklist_status ON blocklist_candidates(status);
CREATE INDEX IF NOT EXISTS idx_blocklist_domain ON blocklist_candidates(domain_or_ip);
CREATE TABLE IF NOT EXISTS glean_fingerprints (
path TEXT PRIMARY KEY,
mtime REAL NOT NULL,
size INTEGER NOT NULL,
gleaned_at TEXT NOT NULL
);
"""
_CONTEXT_SCHEMA = """
CREATE TABLE IF NOT EXISTS context_facts (
id TEXT PRIMARY KEY,
category TEXT NOT NULL,
key TEXT NOT NULL,
value TEXT NOT NULL,
source TEXT,
created_at TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_facts_category ON context_facts(category);
CREATE INDEX IF NOT EXISTS idx_facts_key ON context_facts(key);
CREATE TABLE IF NOT EXISTS context_documents (
id TEXT PRIMARY KEY,
filename TEXT NOT NULL,
doc_type TEXT NOT NULL,
full_text TEXT NOT NULL,
file_size INTEGER,
uploaded_at TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS context_chunks (
id TEXT PRIMARY KEY,
document_id TEXT NOT NULL REFERENCES context_documents(id) ON DELETE CASCADE,
chunk_index INTEGER NOT NULL,
text TEXT NOT NULL,
embedding BLOB
);
CREATE INDEX IF NOT EXISTS idx_chunks_doc ON context_chunks(document_id);
"""
# ensure_schema / ensure_context_schema / ensure_incidents_schema / migrate_incidents_to_dedicated_db
# are now implemented in app/db/schema.py and re-exported via app/db/__init__.py.
# The imports at the top of this file bring them in; these names are kept as module-level
# symbols so existing callers (rest.py, tests) still find them here without changes.
# _INCIDENTS_SCHEMA and its ensure_/migrate_ functions moved to app/db/schema.py
def _fingerprint(path: Path) -> tuple[float, int]:
"""Return (mtime, size) for a file — cheap identity check, no content read needed."""
st = path.stat()
return st.st_mtime, st.st_size
def _fp_unchanged(conn: Any, path: Path, mtime: float, size: int) -> bool:
"""Return True only when the stored fingerprint exactly matches (mtime, size)."""
tid = resolve_tenant_id()
row = conn.execute(
"SELECT mtime, size FROM glean_fingerprints WHERE path = ? AND (tenant_id = ? OR tenant_id = '')",
(str(path), tid),
).fetchone()
if row is None:
return False
return row["mtime"] == mtime and row["size"] == size
def _save_fingerprint(
conn: Any,
path: Path,
mtime: float,
size: int,
gleaned_at: str,
) -> None:
"""Upsert the fingerprint for *path* after a successful glean."""
tid = resolve_tenant_id()
conn.execute(frag.fingerprint_upsert(), (tid, str(path), mtime, size, gleaned_at))
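The fingerprint check and upsert above can be sketched end to end in SQLite. The body of `frag.fingerprint_upsert()` is not shown in this diff, so the `ON CONFLICT ... DO UPDATE` statement below is an assumption about what a SQLite variant could look like (it needs SQLite 3.24 or newer); the path, tenant, and timestamps are made up.

```python
import sqlite3

# Assumed SQLite form of the upsert; the real fragment may differ.
UPSERT = """
INSERT INTO glean_fingerprints (tenant_id, path, mtime, size, gleaned_at)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (tenant_id, path) DO UPDATE SET
    mtime = excluded.mtime,
    size = excluded.size,
    gleaned_at = excluded.gleaned_at
"""


def unchanged(conn, tenant: str, path: str, mtime: float, size: int) -> bool:
    """True only when a stored fingerprint matches (mtime, size) exactly."""
    row = conn.execute(
        "SELECT mtime, size FROM glean_fingerprints WHERE tenant_id = ? AND path = ?",
        (tenant, path),
    ).fetchone()
    return row is not None and row["mtime"] == mtime and row["size"] == size


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # row["mtime"] lookups need a mapping-style row
conn.execute(
    "CREATE TABLE glean_fingerprints ("
    "tenant_id TEXT NOT NULL DEFAULT '', path TEXT NOT NULL, "
    "mtime REAL NOT NULL, size INTEGER NOT NULL, gleaned_at TEXT NOT NULL, "
    "PRIMARY KEY (tenant_id, path))"
)
key = ("node-1", "/var/log/app.log")

seen_before = unchanged(conn, *key, 1700000000.5, 120)
conn.execute(UPSERT, (*key, 1700000000.5, 120, "2026-06-17T00:00:00Z"))
seen_after = unchanged(conn, *key, 1700000000.5, 120)

# The file grew: the same key is updated in place rather than duplicated.
conn.execute(UPSERT, (*key, 1700000100.0, 240, "2026-06-17T00:05:00Z"))
stale = unchanged(conn, *key, 1700000000.5, 120)
row_count = conn.execute("SELECT COUNT(*) FROM glean_fingerprints").fetchone()[0]
```

A fingerprint of `(mtime, size)` is cheap but not exact: a rewrite that keeps the same size within the filesystem's timestamp resolution would be treated as unchanged, which is presumably why a `force` flag exists.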
def _detect_format(first_line: str) -> str:
try:
obj = json.loads(first_line)
if "__REALTIME_TIMESTAMP" in obj:
return "journald"
if "SOURCE" in obj and str(obj.get("SOURCE", "")).startswith("docker:"):
return "docker"
if wazuh.is_wazuh_alert(obj):
return "wazuh"
if "ts" in obj and ("msg" in obj or "message" in obj or "request" in obj):
return "caddy"
except (json.JSONDecodeError, AttributeError, TypeError):
pass
if plex.is_plex_log(first_line):
return "plex"
if qbittorrent.is_qbit_log(first_line):
return "qbittorrent"
if servarr.is_servarr_log(first_line):
return "servarr"
if dmesg_log.is_dmesg_log(first_line):
return "dmesg"
if syslog.is_syslog(first_line):
return "syslog"
return "plaintext"
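The JSON half of the detection order above can be sketched in isolation. This simplified version covers only the journald, docker, and caddy key checks and omits the wazuh check and every line-based detector; `detect_json_format` is a hypothetical name. It also guards against first lines that parse as JSON but are not objects, such as a bare number.

```python
import json


def detect_json_format(first_line: str) -> str:
    """Classify a first line by its JSON keys; anything else is 'plaintext'."""
    try:
        obj = json.loads(first_line)
    except json.JSONDecodeError:
        return "plaintext"
    if not isinstance(obj, dict):
        # "42", "null", or a JSON array are valid JSON but not log records.
        return "plaintext"
    if "__REALTIME_TIMESTAMP" in obj:
        return "journald"
    if str(obj.get("SOURCE", "")).startswith("docker:"):
        return "docker"
    if "ts" in obj and ("msg" in obj or "message" in obj or "request" in obj):
        return "caddy"
    return "plaintext"


journald_fmt = detect_json_format('{"__REALTIME_TIMESTAMP": "1700000000000000"}')
docker_fmt = detect_json_format('{"SOURCE": "docker:web", "MESSAGE": "started"}')
caddy_fmt = detect_json_format('{"ts": 1700000000.0, "msg": "handled request"}')
number_fmt = detect_json_format("42")
text_fmt = detect_json_format("Jun 17 11:27:38 host sshd[1]: session opened")
```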
def _parse_file(
path: Path,
compiled: list[tuple[LogPattern, object]],
ingest_time: str,
source_id: str | None = None,
) -> Iterator[RetrievedEntry]:
source_id = source_id or path.stem
with path.open("r", errors="replace") as f:
lines = iter(f)
try:
first = next(lines)
except StopIteration:
return
fmt = _detect_format(first.strip())
logger.info("Detected format %r for %s", fmt, path.name)
def all_lines():
yield first
yield from lines
if fmt == "journald":
yield from journald.parse(all_lines(), source_id, compiled, ingest_time)
elif fmt == "wazuh":
yield from wazuh.parse(all_lines(), source_id, compiled, ingest_time)
elif fmt == "docker":
yield from docker_log.parse(all_lines(), source_id, compiled, ingest_time)
elif fmt == "caddy":
yield from caddy.parse(all_lines(), source_id, compiled, ingest_time)
elif fmt == "plex":
yield from plex.parse(all_lines(), source_id, compiled, ingest_time)
elif fmt == "qbittorrent":
yield from qbittorrent.parse(all_lines(), source_id, compiled, ingest_time)
elif fmt == "servarr":
yield from servarr.parse(all_lines(), source_id, compiled, ingest_time)
elif fmt == "dmesg":
yield from dmesg_log.parse(all_lines(), source_id, compiled, ingest_time)
elif fmt == "syslog":
yield from syslog.parse(all_lines(), source_id, compiled, ingest_time)
else:
yield from plaintext.parse(all_lines(), source_id, compiled, ingest_time)
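The parser above reads one line to detect the format and then replays it ahead of the remaining lines, so the chosen parser still sees the whole file. A minimal sketch of that peek-and-replay step follows, using `itertools.chain` in place of the inner generator; `peek_first_line` is a hypothetical helper.

```python
import io
from itertools import chain
from typing import Iterable, Iterator, Optional


def peek_first_line(stream: Iterable[str]) -> tuple[Optional[str], Iterator[str]]:
    """Return (first line, iterator over all lines including the first)."""
    lines = iter(stream)
    try:
        first = next(lines)
    except StopIteration:
        return None, iter(())  # empty input: nothing to detect or parse
    return first, chain([first], lines)


first, replay = peek_first_line(io.StringIO("alpha\nbeta\ngamma\n"))
collected = list(replay)

empty_first, empty_replay = peek_first_line(io.StringIO(""))
empty_collected = list(empty_replay)
```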
def _write_batch(conn: Any, batch: list[RetrievedEntry]) -> None:
tid = resolve_tenant_id()
conflict = frag.entries_conflict_clause()
sql = f"""
{frag.insert_ignore_entries()}
(tenant_id, id, source_id, sequence, timestamp_raw, timestamp_iso,
ingest_time, severity, repeat_count, out_of_order,
matched_patterns, text)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?)
{conflict}
"""
conn.executemany(
sql,
[
(
tid, e.entry_id, e.source_id, e.sequence,
e.timestamp_raw, e.timestamp_iso, e.ingest_time,
e.severity, e.repeat_count, int(e.out_of_order),
json.dumps(list(e.matched_patterns)), e.text,
)
for e in batch
],
)
def _glean_files(
files: list[Path],
db_path: Path,
pattern_file: Path | None = None,
batch_size: int = 1000,
source_id_map: dict[Path, str] | None = None,
force: bool = False,
) -> dict[str, int]:
pattern_file = pattern_file or Path("patterns/default.yaml")
patterns = load_patterns(pattern_file)
compiled = _compile(patterns)
ingest_time = now_iso()
source_id_map = source_id_map or {}
ensure_schema(db_path)
with get_conn(db_path) as conn:
stats: dict[str, int] = {}
skipped: list[str] = []
for log_file in files:
source_id = source_id_map.get(log_file, log_file.stem)
mtime, size = _fingerprint(log_file)
if not force and _fp_unchanged(conn, log_file, mtime, size):
logger.debug("Skipping unchanged file: %s", log_file.name)
skipped.append(log_file.name)
stats[source_id] = stats.get(source_id, 0)
continue
count = 0
batch: list[RetrievedEntry] = []
for entry in _parse_file(log_file, compiled, ingest_time, source_id=source_id):
batch.append(entry)
if len(batch) >= batch_size:
_write_batch(conn, batch)
conn.commit()
count += len(batch)
batch.clear()
if batch:
_write_batch(conn, batch)
conn.commit()
count += len(batch)
_save_fingerprint(conn, log_file, mtime, size, ingest_time)
conn.commit()
stats[source_id] = stats.get(source_id, 0) + count
logger.info("Gleaned %d entries from %s (source: %s)", count, log_file.name, source_id)
if skipped:
logger.info("Skipped %d unchanged file(s): %s", len(skipped), ", ".join(skipped))
logger.info("Building FTS index...")
build_fts_index(db_path)
logger.info("FTS index ready")
return stats
def _stream_and_write(
transport: SSHTransport,
cmd: str,
parser,
source_id: str,
compiled: list[tuple[LogPattern, object]],
ingest_time: str,
conn: Any,
batch_size: int,
) -> int:
"""Stream *cmd* output through *parser* and write entries to *conn*.
Catches SSHCommandError per-item so one bad command doesn't abort the rest
of the glean items for this host. Returns the number of entries written.
"""
count = 0
batch: list[RetrievedEntry] = []
try:
for entry in parser(transport.exec_stream(cmd), source_id, compiled, ingest_time):
batch.append(entry)
if len(batch) >= batch_size:
_write_batch(conn, batch)
conn.commit()
count += len(batch)
batch.clear()
if batch:
_write_batch(conn, batch)
conn.commit()
count += len(batch)
except SSHCommandError as exc:
logger.warning("SSH command failed for source %r (cmd: %s): %s", source_id, cmd, exc)
logger.info("Gleaned %d entries from SSH source %s", count, source_id)
return count
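Reviewer note: both `_glean_files` and `_stream_and_write` use the same accumulate, flush, commit loop. A minimal standalone sketch of that loop follows, assuming a plain `sqlite3` connection and a hypothetical two-column `entries` table (the real schema and the `frag.*` SQL fragments are not reproduced here).

```python
import sqlite3
from collections.abc import Iterable


def write_in_batches(
    conn: sqlite3.Connection,
    rows: Iterable[tuple[str, str]],
    batch_size: int = 1000,
) -> int:
    """Insert rows in fixed-size batches; return how many rows were submitted."""
    sql = "INSERT OR IGNORE INTO entries (id, text) VALUES (?, ?)"
    count = 0
    batch: list[tuple[str, str]] = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany(sql, batch)
            conn.commit()
            count += len(batch)
            batch.clear()
    if batch:
        # Flush the final partial batch.
        conn.executemany(sql, batch)
        conn.commit()
        count += len(batch)
    return count


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id TEXT PRIMARY KEY, text TEXT NOT NULL)")
rows = [(f"id-{i}", f"line {i}") for i in range(5)] + [("id-0", "duplicate")]
submitted = write_in_batches(conn, rows, batch_size=2)
stored = conn.execute("SELECT COUNT(*) FROM entries").fetchone()[0]
```

In this sketch the returned count is the number of rows submitted, not the number stored: the duplicate `id-0` is silently ignored. If the project's insert fragment also ignores conflicts, the "Gleaned %d entries" log lines may overstate new rows on a re-glean.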
def _glean_ssh_source(
src: dict, # type: ignore[type-arg]
compiled: list[tuple[LogPattern, object]],
ingest_time: str,
conn: Any,
batch_size: int,
) -> dict[str, int]:
"""Open one SSHTransport connection for *src* and glean all its glean items.
One SSH connection is shared across all items in the ``glean:`` list so
the handshake overhead is paid only once per host per glean run.
Returns a stats dict mapping ``{source_id: entry_count}`` for each item.
Gracefully skips the entire source on SSHConnectionError.
"""
host_id = src.get("id", src.get("host", "unknown"))
host = src["host"]
user = src["user"]
key_path = str(Path(src["key_path"]).expanduser())
port = int(src.get("port", 22))
glean_items: list[dict] = src.get("glean", []) # type: ignore[type-arg]
stats: dict[str, int] = {}
try:
with SSHTransport(host=host, user=user, key_path=key_path, port=port) as t:
for item in glean_items:
item_type = item.get("type", "plaintext")
# Per-item source_id — falls back to host_id/type for un-labelled items
item_id = item.get("id") or f"{host_id}/{item_type}"
if item_type == "journald":
cmd = _build_journald_command(item)
count = _stream_and_write(
t, cmd, journald.parse, item_id, compiled, ingest_time, conn, batch_size
)
stats[item_id] = stats.get(item_id, 0) + count
elif item_type == "syslog":
cmd = _build_syslog_command(item)
count = _stream_and_write(
t, cmd, syslog.parse, item_id, compiled, ingest_time, conn, batch_size
)
stats[item_id] = stats.get(item_id, 0) + count
elif item_type == "plaintext":
cmd = _build_plaintext_command(item)
count = _stream_and_write(
t, cmd, plaintext.parse, item_id, compiled, ingest_time, conn, batch_size
)
stats[item_id] = stats.get(item_id, 0) + count
elif item_type == "docker":
cmds = _build_docker_command(item)
if isinstance(cmds, str):
cmds = [cmds]
containers: list[str] = item.get("containers", [])
for i, cmd in enumerate(cmds):
# Use the container name as the final path segment when available
container_name = containers[i] if i < len(containers) else str(i)
container_id = f"{item_id}/{container_name}" if len(cmds) > 1 else item_id
count = _stream_and_write(
t, cmd, docker_log.parse, container_id,
compiled, ingest_time, conn, batch_size,
)
stats[container_id] = stats.get(container_id, 0) + count
else:
logger.warning(
"Unknown SSH glean type %r for source %r — skipping item",
item_type, host_id,
)
except SSHConnectionError as exc:
logger.warning("SSH connection failed for source %r: %s", host_id, exc)
return stats
def glean_ssh_source(
src: dict, # type: ignore[type-arg]
db_path: Path,
pattern_file: Path | None = None,
batch_size: int = 1000,
) -> dict[str, int]:
"""Glean a single SSH source dict and write results to *db_path*.
Public wrapper around :func:`_glean_ssh_source` for the REST layer.
Manages the DB connection, pattern compilation, and FTS rebuild so callers
don't have to deal with those lifecycle concerns.
Returns stats mapping ``{sub_source_id: entry_count}``.
"""
effective_pattern_file = pattern_file or Path("patterns/default.yaml")
compiled = _compile(load_patterns(effective_pattern_file))
ingest_time = now_iso()
ensure_schema(db_path)
with get_conn(db_path) as conn:
stats = _glean_ssh_source(src, compiled, ingest_time, conn, batch_size)
logger.info("Rebuilding FTS index after SSH source glean...")
build_fts_index(db_path)
return stats
def glean_dir(
corpus_dir: Path,
db_path: Path,
pattern_file: Path | None = None,
batch_size: int = 1000,
force: bool = False,
) -> dict[str, int]:
"""Glean all .jsonl and .log files from a corpus directory.
Pass ``force=True`` to bypass fingerprint checks and re-glean all files
regardless of whether they have changed since the last run.
"""
files = sorted(corpus_dir.rglob("*.jsonl")) + sorted(corpus_dir.rglob("*.log"))
return _glean_files(files, db_path, pattern_file, batch_size, force=force)
def glean_file(
log_file: Path,
db_path: Path,
pattern_file: Path | None = None,
force: bool = False,
) -> dict[str, int]:
"""Glean a single log file (any supported format).
Pass ``force=True`` to re-glean even when the file fingerprint is unchanged.
"""
return _glean_files([log_file], db_path, pattern_file, force=force)
def glean_sources(
sources_file: Path,
db_path: Path,
pattern_file: Path | None = None,
batch_size: int = 1000,
force: bool = False,
) -> dict[str, int]:
"""Glean all sources listed in a sources.yaml config file.
Supports two source types:
Local file sources (default):
sources:
- id: sonarr
path: /opt/sonarr/config/logs/sonarr.0.txt
SSH remote sources (transport: ssh):
sources:
- id: rack01
transport: ssh
host: 192.168.1.10
user: admin
key_path: ~/.ssh/id_ed25519
glean:
- type: journald
args: ["--since", "2 hours ago"]
- type: syslog
path: /var/log/syslog
- type: plaintext
path: /var/log/app/error.log
- type: docker
containers: [myapp, nginx]
Missing local paths and SSH connection failures are logged as warnings
so the cron keeps running when a source is temporarily down.
"""
with open(sources_file) as f:
config = yaml.safe_load(f)
local_sources: list[dict] = [] # type: ignore[type-arg]
ssh_sources: list[dict] = [] # type: ignore[type-arg]
for src in config.get("sources", []):
if src.get("transport") == "ssh":
ssh_sources.append(src)
else:
local_sources.append(src)
# ── Local file sources ─────────────────────────────────────────────────
files: list[Path] = []
source_id_map: dict[Path, str] = {}
for src in local_sources:
path = Path(src["path"])
if not path.exists():
logger.warning("Source %r not found, skipping: %s", src.get("id", "?"), path)
continue
files.append(path)
if "id" in src:
source_id_map[path] = src["id"]
if not files and not ssh_sources:
logger.warning("No sources found — check sources.yaml paths")
return {}
stats: dict[str, int] = {}
if files:
stats.update(_glean_files(files, db_path, pattern_file, batch_size, source_id_map, force=force))
# ── SSH remote sources ─────────────────────────────────────────────────
if not ssh_sources:
return stats
# Compile patterns once, share across all SSH sources in this run.
effective_pattern_file = pattern_file or Path("patterns/default.yaml")
compiled = _compile(load_patterns(effective_pattern_file))
ingest_time = now_iso()
ensure_schema(db_path)
with get_conn(db_path) as conn:
for src in ssh_sources:
ssh_stats = _glean_ssh_source(src, compiled, ingest_time, conn, batch_size)
for k, v in ssh_stats.items():
stats[k] = stats.get(k, 0) + v
conn.commit()
# Rebuild FTS only when SSH sources added entries (_glean_files already
# rebuilds when local sources are present; safe to call again if both ran).
if ssh_sources:
logger.info("Rebuilding FTS index after SSH glean...")
build_fts_index(db_path)
return stats


@ -10,7 +10,7 @@ import re
 from datetime import datetime, timezone
 from typing import Iterator
-from app.ingest.base import (
+from app.glean.base import (
 SourceState, apply_patterns, detect_severity, make_entry_id, now_iso,
 )
 from app.services.models import LogPattern, RetrievedEntry
@ -32,13 +32,14 @@ def _extract_ts(line: str) -> tuple[str, str]:
 if m:
 ts_raw = m.group("ts")
 try:
-# Strip fractional seconds / TZ for strptime compat
+# Strip fractional seconds / TZ for strptime compat.
+# Normalise ISO 8601 T-separator to space so strptime format matches.
 clean = re.sub(r"(\.\d+)?([Zz]|[+-]\d{2}:?\d{2})?$", "", ts_raw).strip()
+clean = clean.replace("T", " ")
-dt = datetime.strptime(clean, fmt)
+dt = datetime.strptime(clean, fmt.replace("T", " "))
 if dt.year == 1900:
 dt = dt.replace(year=datetime.now().year)
-dt = dt.replace(tzinfo=timezone.utc)
+dt = dt.astimezone(timezone.utc)
 return ts_raw, dt.isoformat()
 except ValueError:
 pass
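Reviewer note: several hunks in this comparison swap `.replace(tzinfo=timezone.utc)` for `.astimezone(timezone.utc)` on a naive `datetime`. The two calls are not equivalent: the first stamps the wall-clock reading as UTC without shifting it, while the second interprets the reading in the system's local zone and then converts. A small sketch of the difference, using an explicit UTC-7 zone so the result does not depend on the host's settings (the sample timestamp is illustrative only):

```python
from datetime import datetime, timedelta, timezone

naive = datetime.strptime("2026-06-17 11:27:38", "%Y-%m-%d %H:%M:%S")

# Old behaviour: label the wall-clock reading as UTC, no shift applied.
stamped = naive.replace(tzinfo=timezone.utc)

# New behaviour: treat the reading as local time, then convert to UTC.
# The result depends on the time zone of the machine running the parser.
converted = naive.astimezone(timezone.utc)

# Deterministic illustration with an explicit local zone of UTC-7.
utc_minus_7 = timezone(timedelta(hours=-7))
explicit = naive.replace(tzinfo=utc_minus_7).astimezone(timezone.utc)
```

One consequence worth keeping in mind: the conversion uses the zone of the host doing the gleaning, so logs fetched from a machine in a different zone would still be shifted by the wrong amount unless the zone is carried alongside the source.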


@ -12,7 +12,7 @@ import re
 from datetime import datetime, timezone
 from typing import Iterator
-from app.ingest.base import (
+from app.glean.base import (
 SourceState, apply_patterns, make_entry_id, now_iso,
 )
 from app.services.models import LogPattern, RetrievedEntry
@ -39,8 +39,8 @@ _LEVEL_MAP = {
 def _parse_ts(month: str, day: str, year: str, time: str) -> tuple[str, str]:
 raw = f"{month} {day}, {year} {time}"
 try:
-# Plex logs are local time — treat as UTC for now (no TZ in log)
-dt = datetime.strptime(raw, "%b %d, %Y %H:%M:%S.%f").replace(tzinfo=timezone.utc)
+# Plex logs use local time; convert to UTC for consistent DB storage
+dt = datetime.strptime(raw, "%b %d, %Y %H:%M:%S.%f").astimezone(timezone.utc)
 return raw, dt.isoformat()
 except ValueError:
 return raw, ""


@ -18,7 +18,7 @@ import re
 from datetime import datetime, timezone
 from typing import Iterator
-from app.ingest.base import (
+from app.glean.base import (
 SourceState, apply_patterns, detect_severity, make_entry_id, now_iso,
 )
 from app.services.models import LogPattern, RetrievedEntry
@ -55,7 +55,7 @@ def _parse_ts(ts_str: str) -> tuple[str, str]:
 """Return (raw, iso). Handles classic (space sep) and hotio (T sep) timestamps."""
 for fmt in ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S", "%Y/%m/%d %H:%M:%S"):
 try:
-dt = datetime.strptime(ts_str, fmt).replace(tzinfo=timezone.utc)
+dt = datetime.strptime(ts_str, fmt).astimezone(timezone.utc)
 return ts_str, dt.isoformat()
 except ValueError:
 continue


@ -12,7 +12,7 @@ import re
 from datetime import datetime, timezone
 from typing import Iterator
-from app.ingest.base import (
+from app.glean.base import (
 SourceState, apply_patterns, detect_severity, make_entry_id, now_iso,
 )
 from app.services.models import LogPattern, RetrievedEntry
@ -38,7 +38,7 @@ _LEVEL_MAP: dict[str, str | None] = {
 def _parse_ts(ts_str: str) -> tuple[str, str]:
 base = ts_str.split(".")[0]
 try:
-dt = datetime.strptime(base, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
+dt = datetime.strptime(base, "%Y-%m-%d %H:%M:%S").astimezone(timezone.utc)
 return ts_str, dt.isoformat()
 except ValueError:
 return ts_str, ""

app/glean/ssh.py (new file, 225 lines)

@ -0,0 +1,225 @@
"""SSH transport layer for remote log gleaning (issue #22).
Wraps Paramiko to provide a clean context-manager interface for executing
remote commands and streaming their stdout output. All format parsing is
delegated to the existing per-format parsers (journald, syslog, plaintext,
docker); this module is transport only.
Key design choices:
- Key-based auth only: no password prompts in a daemon context.
- exec_stream is a generator; exit-status check fires after all lines are
yielded, so callers must drain the iterator (e.g. list()) to trigger it.
- Command builders live here because they encode SSH/remote-execution idioms
(journalctl flags, docker logs invocation) that the generic parsers don't
need to know about.
Example sources.yaml snippet::
sources:
- id: rack01
transport: ssh
host: 192.168.1.10
user: admin
key_path: ~/.ssh/id_ed25519
glean:
- type: journald
args: ["--since", "2 hours ago"]
- type: syslog
path: /var/log/syslog
- type: plaintext
path: /var/log/app/error.log
- type: docker
containers: [myapp, nginx]
"""
from __future__ import annotations
import shlex
from collections.abc import Iterator
from typing import Union
import paramiko
__all__ = [
"SSHConnectionError",
"SSHCommandError",
"SSHTransport",
"_build_journald_command",
"_build_syslog_command",
"_build_plaintext_command",
"_build_docker_command",
]
# Default syslog path used when none is specified in the source spec.
_SYSLOG_DEFAULT_PATH = "/var/log/syslog"
# ── Custom exceptions ─────────────────────────────────────────────────────────
class SSHConnectionError(Exception):
"""Raised when the SSH connection cannot be established or authenticated."""
class SSHCommandError(Exception):
"""Raised when a remote command exits with a non-zero status code."""
# ── Transport context manager ─────────────────────────────────────────────────
class SSHTransport:
"""Context manager wrapping a Paramiko SSH connection.
Opens the connection on ``__enter__`` and closes it on ``__exit__``,
even if an exception propagates. Key-based authentication only.
Usage::
with SSHTransport(host="10.0.0.1", user="admin",
key_path="~/.ssh/id_ed25519") as t:
for line in t.exec_stream("journalctl -o json --since '1 hour ago'"):
process(line)
"""
def __init__(
self,
host: str,
user: str,
key_path: str,
port: int = 22,
) -> None:
self._host = host
self._user = user
self._key_path = key_path
self._port = port
self._client: paramiko.SSHClient | None = None
# ── context manager protocol ──────────────────────────────────────────────
def __enter__(self) -> "SSHTransport":
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
try:
client.connect(
hostname=self._host,
username=self._user,
key_filename=self._key_path,
port=self._port,
)
except paramiko.AuthenticationException as exc:
client.close()
raise SSHConnectionError(
f"SSH auth failed for {self._user}@{self._host}: {exc}"
) from exc
except paramiko.SSHException as exc:
client.close()
raise SSHConnectionError(
f"SSH connection failed to {self._host}: {exc}"
) from exc
self._client = client
return self
def __exit__(self, exc_type, exc_val, exc_tb) -> None: # type: ignore[override]
if self._client is not None:
self._client.close()
self._client = None
# Return None (falsy) so any in-flight exception is not suppressed.
# ── remote execution ──────────────────────────────────────────────────────
def exec_stream(self, command: str) -> Iterator[str]:
"""Execute *command* on the remote host and yield stdout lines.
The exit-status check runs after all stdout lines have been yielded,
so callers must drain the iterator to trigger it::
list(transport.exec_stream(cmd)) # raises if exit != 0
Raises:
SSHConnectionError: if called outside a ``with`` block.
SSHCommandError: if the remote command exits non-zero.
"""
if self._client is None:
raise SSHConnectionError(
"Not connected — use SSHTransport as a context manager"
)
_, stdout, stderr = self._client.exec_command(command)
for line in stdout:
yield line
exit_code = stdout.channel.recv_exit_status()
# Guard against MagicMock in tests: only a real, non-zero integer exit code counts as a failure.
if isinstance(exit_code, int) and exit_code != 0:
error_msg = stderr.read().decode(errors="replace").strip()
raise SSHCommandError(
f"Command failed (exit {exit_code}): {error_msg}"
)
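Reviewer note: because `exec_stream` is a generator, the exit-status check only runs once the consumer has pulled the last line. A minimal sketch of that behaviour without any SSH involved; `stream_with_status` and `CommandFailed` are hypothetical stand-ins for `exec_stream` and `SSHCommandError`.

```python
from collections.abc import Iterator


class CommandFailed(Exception):
    """Hypothetical stand-in for SSHCommandError."""


def stream_with_status(lines: list[str], exit_code: int) -> Iterator[str]:
    for line in lines:
        yield line
    # This check is reached only after every line has been consumed.
    if exit_code != 0:
        raise CommandFailed(f"exit {exit_code}")


gen = stream_with_status(["a\n", "b\n"], exit_code=1)
seen = [next(gen)]  # No error yet, even though the command "failed".
error = None
try:
    for line in gen:
        seen.append(line)
except CommandFailed as exc:
    error = str(exc)
```

A caller that stops iterating early never sees the failure, and all lines yielded before the error have already been processed by the time it is raised.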
# ── Command builders ──────────────────────────────────────────────────────────
def _build_journald_command(spec: dict) -> str: # type: ignore[type-arg]
"""Build a ``journalctl`` command string from a glean source spec.
Spec keys:
- ``args``: list of extra journalctl arguments; each one is shell-quoted before being appended.
- ``unit``: shorthand for ``--unit <name>`` (inserted before ``args``).
Returns a single shell command string.
"""
parts = ["journalctl", "-o", "json", "--no-pager"]
if "unit" in spec:
parts.extend(["--unit", str(spec["unit"])])
if "args" in spec:
parts.extend(str(a) for a in spec["args"])
# Quote every token so values containing spaces (e.g. "2 hours ago") survive the remote shell.
return shlex.join(parts)
def _build_syslog_command(spec: dict) -> str: # type: ignore[type-arg]
"""Build a ``cat`` command for a syslog-format log file.
Spec keys:
- ``path``: path to the file (default: ``/var/log/syslog``).
Returns a single shell command string.
"""
path = spec.get("path", _SYSLOG_DEFAULT_PATH)
return f"cat {shlex.quote(path)}"
def _build_plaintext_command(spec: dict) -> str: # type: ignore[type-arg]
"""Build a ``cat`` command for an arbitrary plaintext log file.
Spec keys:
- ``path``: **required** path to the log file.
Raises:
KeyError: if ``path`` is absent from the spec.
"""
path = spec["path"] # intentional KeyError if missing — callers must supply it
return f"cat {shlex.quote(path)}"
def _build_docker_command(
spec: dict, # type: ignore[type-arg]
) -> Union[str, list[str]]:
"""Build ``docker logs`` command(s) for one or more named containers.
Spec keys:
- ``containers``: **required** list of container names or IDs.
Returns a single command string when there is one container, or a list
of command strings when there are multiple (one command per container so
each can be streamed independently).
Raises:
KeyError: if ``containers`` is absent from the spec.
ValueError: if ``containers`` is an empty list.
"""
containers = spec["containers"] # intentional KeyError if missing
if not containers:
raise ValueError("'containers' must be a non-empty list")
commands = [f"docker logs {shlex.quote(c)}" for c in containers]
return commands[0] if len(commands) == 1 else commands
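Reviewer note: command strings passed to a remote shell are split on whitespace again on the far side, so an argument such as `"2 hours ago"` (used in the sources.yaml example) needs quoting to stay a single token. A short sketch of the difference between plain joining and `shlex.join`, using only the standard library:

```python
import shlex

args = ["--since", "2 hours ago"]
base = ["journalctl", "-o", "json", "--no-pager"]

# Plain joining loses the boundary around "2 hours ago".
unquoted = " ".join([*base, *args])

# shlex.join quotes each token, so a POSIX shell recovers the original argv.
quoted = shlex.join([*base, *args])

# shlex.split approximates what the remote shell would see.
unquoted_argv = shlex.split(unquoted)
quoted_argv = shlex.split(quoted)
```

The `cat` and `docker logs` builders in this file already call `shlex.quote` on their single argument; this applies the same treatment to a whole argument list.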


@ -14,7 +14,7 @@ import re
 from datetime import datetime, timezone
 from typing import Iterator
-from app.ingest.base import (
+from app.glean.base import (
 SourceState, apply_patterns, detect_severity, make_entry_id, now_iso,
 )
 from app.services.models import LogPattern, RetrievedEntry
@ -26,6 +26,8 @@ _MONTHS = {
 # May 11 14:23:01 hostname ident[pid]: message
 # May 1 04:00:00 hostname ident: message (no pid, day may be space-padded)
+# <134>May 11 14:23:01 ... (optional RFC 3164 PRI prefix from network syslog)
+_PRI_RE = re.compile(r"^<\d{1,3}>")
 _LINE_RE = re.compile(
 r"^(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
 r"\s+(?P<day>\d{1,2})\s+(?P<time>\d{2}:\d{2}:\d{2})"
@ -35,7 +37,8 @@ _LINE_RE = re.compile(
 def is_syslog(first_line: str) -> bool:
-return bool(_LINE_RE.match(first_line.strip()))
+stripped = _PRI_RE.sub("", first_line.strip(), count=1)
+return bool(_LINE_RE.match(stripped))
 def _parse_ts(month_str: str, day: str, time_str: str) -> tuple[str, str]:
@ -44,8 +47,7 @@ def _parse_ts(month_str: str, day: str, time_str: str) -> tuple[str, str]:
 ts_raw = f"{month_str} {int(day):2d} {time_str}"
 try:
 dt = datetime(year, month, int(day),
-*[int(p) for p in time_str.split(":")],
-tzinfo=timezone.utc)
+*[int(p) for p in time_str.split(":")]).astimezone(timezone.utc)
 return ts_raw, dt.isoformat()
 except ValueError:
 return ts_raw, ""
@ -80,7 +82,7 @@ def parse(
 )
 for raw_line in lines:
-line = raw_line.rstrip("\n")
+line = _PRI_RE.sub("", raw_line.rstrip("\n"), count=1)
 m = _LINE_RE.match(line)
 if m:
 if pending_text is not None:
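Reviewer note: the syslog change strips the optional `<PRI>` prefix and discards it. Under RFC 3164 the number encodes `facility * 8 + severity`, so it could also be decoded if that information ever becomes useful. A sketch under that assumption; `split_pri` is a hypothetical helper, and its regex adds a capture group that the project's `_PRI_RE` does not have.

```python
from __future__ import annotations

import re

# Same shape as the project's pattern, plus a capture group for the number.
PRI_WITH_GROUP = re.compile(r"^<(\d{1,3})>")


def split_pri(line: str) -> tuple[int | None, int | None, str]:
    """Return (facility, severity, rest) for a line with an optional <PRI> prefix."""
    m = PRI_WITH_GROUP.match(line)
    if not m:
        return None, None, line
    pri = int(m.group(1))
    # RFC 3164: PRI = facility * 8 + severity
    return pri // 8, pri % 8, line[m.end():]


with_pri = split_pri("<134>May 11 14:23:01 host app[42]: hello")
without_pri = split_pri("May 11 14:23:01 host app[42]: hello")
```

Here `<134>` decodes to facility 16 (local0) and severity 6 (informational).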

app/glean/tautulli.py (new file, 100 lines)

@ -0,0 +1,100 @@
"""Tautulli webhook ingestor.
Parses a Tautulli notification agent JSON payload into a single RetrievedEntry.
Tautulli sends all template values as strings, so all fields are treated as str.
"""
from __future__ import annotations
from app.glean.base import (
apply_patterns,
epoch_float_to_iso,
make_entry_id,
now_iso,
)
from app.services.models import LogPattern, RetrievedEntry
_ACTION_SEVERITY: dict[str, str | None] = {
"error": "CRITICAL",
"buffer": "WARN",
}
def _severity(action: str) -> str | None:
return _ACTION_SEVERITY.get(action.lower())
def _format_text(p: dict) -> str:
action = p.get("action", "").lower()
user = p.get("user") or "unknown"
player = p.get("player") or "unknown player"
grandparent = p.get("grandparent_title", "").strip()
title = p.get("title", "").strip()
media = f'"{grandparent} - {title}"' if grandparent else f'"{title}"'
quality = p.get("quality", "")
video_dec = p.get("video_decision", "")
stream = f"{quality}, {video_dec}" if quality and video_dec else quality or video_dec
err = p.get("error_message", "").strip()
if action == "error":
base = f"[plex:error] {user} on {player}: {media}"
return f"{base} - {err}" if err else base
if action == "buffer":
return f"[plex:buffer] {user} on {player}: {media} is buffering"
if action in ("play", "resume"):
parts = [f"[plex:{action}] {user} on {player}: {media}"]
if stream:
parts.append(f"({stream})")
return " ".join(parts)
if action == "stop":
return f"[plex:stop] {user} stopped {media} on {player}"
if action == "pause":
return f"[plex:pause] {user} paused {media} on {player}"
return f"[plex:{action}] {user}: {media} on {player}"
def is_tautulli_payload(payload: dict) -> bool:
"""Return True if the payload looks like a Tautulli webhook."""
return "action" in payload and "session_key" in payload
def parse_webhook(
payload: dict,
compiled_patterns: list[tuple[LogPattern, object]],
) -> RetrievedEntry:
"""Parse a Tautulli webhook payload into a single RetrievedEntry."""
source_id = "tautulli"
action = payload.get("action", "")
text = _format_text(payload)
raw_ts = payload.get("timestamp") or ""
try:
ts_float = float(raw_ts) if raw_ts else 0.0
except (ValueError, TypeError):
ts_float = 0.0
if ts_float:
timestamp_iso: str | None = epoch_float_to_iso(ts_float)
timestamp_raw: str | None = raw_ts
else:
timestamp_iso = now_iso()
timestamp_raw = None
ingest_time = now_iso()
severity = _severity(action)
matched = apply_patterns(text, compiled_patterns)
id_ts = str(raw_ts) if raw_ts else ingest_time
entry_id = make_entry_id(source_id, 0, id_ts + text)
return RetrievedEntry(
entry_id=entry_id,
source_id=source_id,
sequence=0,
timestamp_raw=timestamp_raw,
timestamp_iso=timestamp_iso,
ingest_time=ingest_time,
severity=severity,
repeat_count=1,
out_of_order=False,
matched_patterns=matched,
text=text,
)
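Reviewer note: `parse_webhook` accepts the epoch timestamp as a string and falls back to the ingest time when it is missing or unparseable. The fallback logic can be sketched in isolation. `webhook_timestamp` is a hypothetical helper, and `datetime.fromtimestamp(..., tz=timezone.utc)` is assumed to behave like the project's `epoch_float_to_iso`, whose implementation is not shown in this comparison.

```python
from __future__ import annotations

from datetime import datetime, timezone


def webhook_timestamp(raw: str | None, fallback_iso: str) -> tuple[str, str | None]:
    """Return (timestamp_iso, timestamp_raw) for a string epoch value."""
    try:
        ts = float(raw) if raw else 0.0
    except (ValueError, TypeError):
        ts = 0.0
    if ts:
        # Assumption: equivalent to the project's epoch_float_to_iso helper.
        return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat(), raw
    # Missing, zero, or malformed value: use the fallback and record no raw value.
    return fallback_iso, None


good = webhook_timestamp("1718582400", "FALLBACK")
missing = webhook_timestamp("", "FALLBACK")
malformed = webhook_timestamp("not-a-number", "FALLBACK")
```

Note that a literal epoch of `"0"` is treated the same as a missing timestamp, which matches the truthiness check in `parse_webhook`.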

app/glean/wazuh.py (new file, 161 lines)

@ -0,0 +1,161 @@
"""Wazuh SIEM alert parser.
Handles Wazuh's alerts.json format (JSON Lines — one alert object per line):
/var/ossec/logs/alerts/alerts.json (on the Wazuh manager)
Each line is a complete JSON object. Key fields used:
timestamp: ISO 8601 with timezone offset ("2024-01-15T10:23:45.123+0000")
rule.level: 1-15 (maps to Turnstone severity)
rule.id: Wazuh rule ID
rule.description: human-readable rule description (primary message text)
rule.groups: list of category tags
agent.name: hostname that generated the original event
agent.ip: agent IP address
full_log: original raw log line that triggered the alert
location: log file or input that was monitored
data: dict of decoded fields (srcip, dstip, url, etc.)
"""
from __future__ import annotations
import json
from datetime import datetime, timezone
from typing import Iterator
from app.glean.base import (
SourceState, apply_patterns, make_entry_id, now_iso,
)
from app.services.models import LogPattern, RetrievedEntry
# Wazuh rule levels 1-15 → Turnstone severity labels.
# Levels < 4 are normally informational, 7+ begin to matter operationally,
# 10+ correspond to SIEM-worthy events, 13+ are critical.
_LEVEL_SEVERITY: dict[int, str] = {
1: "DEBUG", 2: "DEBUG", 3: "DEBUG",
4: "INFO", 5: "INFO", 6: "NOTICE",
7: "WARN", 8: "WARN", 9: "WARN",
10: "ERROR", 11: "ERROR", 12: "ERROR",
13: "CRITICAL", 14: "CRITICAL", 15: "CRITICAL",
}
def is_wazuh_alert(obj: dict) -> bool:
"""Return True if a parsed JSON object looks like a Wazuh alert."""
return (
isinstance(obj.get("rule"), dict)
and isinstance(obj.get("agent"), dict)
and ("timestamp" in obj or "manager" in obj)
)
def _parse_timestamp(raw: str) -> str:
"""Convert Wazuh's ISO 8601 timestamp to UTC ISO 8601."""
if not raw:
return ""
for fmt in (
"%Y-%m-%dT%H:%M:%S.%f%z",
"%Y-%m-%dT%H:%M:%S%z",
"%Y-%m-%dT%H:%M:%S.%fZ",
"%Y-%m-%dT%H:%M:%SZ",
):
try:
dt = datetime.strptime(raw, fmt)
return dt.astimezone(timezone.utc).isoformat()
except ValueError:
continue
return raw
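Reviewer note: a reduced sketch of the timestamp normalisation above, using only the two `%z` formats. On Python 3.7 and later, `%z` also accepts a literal `Z`, so under that assumption the two `...Z` formats in `_parse_timestamp` would not normally be reached; if they were, `strptime` would return a naive value and `astimezone` would interpret it as local time rather than UTC. `to_utc_iso` is a hypothetical name for this illustration.

```python
from datetime import datetime, timezone


def to_utc_iso(raw: str) -> str:
    """Convert an ISO 8601 timestamp with an offset (or Z) to UTC ISO 8601."""
    for fmt in ("%Y-%m-%dT%H:%M:%S.%f%z", "%Y-%m-%dT%H:%M:%S%z"):
        try:
            return datetime.strptime(raw, fmt).astimezone(timezone.utc).isoformat()
        except ValueError:
            continue
    # Unparseable input is passed through unchanged, as in the original.
    return raw


with_millis = to_utc_iso("2024-01-15T10:23:45.123+0000")
with_offset = to_utc_iso("2024-01-15T12:23:45+0200")
with_z = to_utc_iso("2024-01-15T10:23:45Z")
unparseable = to_utc_iso("garbage")
```

Returning the raw string on failure means `timestamp_iso` can hold a non-ISO value; callers that sort or filter on that column may want an empty string instead, as the other parsers in this comparison return.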
def _build_text(alert: dict) -> str:
"""Compose a readable, searchable text representation of the alert."""
rule = alert.get("rule", {})
agent = alert.get("agent", {})
agent_name = agent.get("name", "unknown")
agent_ip = agent.get("ip", "")
rule_id = rule.get("id", "")
rule_desc = rule.get("description", "(no description)")
groups = rule.get("groups", [])
location = alert.get("location", "")
full_log = alert.get("full_log", "")
parts: list[str] = []
# Header line: agent + rule context
agent_tag = f"{agent_name}/{agent_ip}" if agent_ip else agent_name
group_tag = ",".join(groups) if groups else ""
header = f"[wazuh][agent:{agent_tag}][rule:{rule_id}]"
if group_tag:
header += f"[{group_tag}]"
parts.append(f"{header} {rule_desc}")
if location:
parts.append(f"location: {location}")
# Extra decoded fields (srcip, dstip, url, user, etc.)
data = alert.get("data", {})
if isinstance(data, dict) and data:
kv = " | ".join(f"{k}={v}" for k, v in sorted(data.items()) if v)
if kv:
parts.append(kv)
if full_log and full_log.strip() != rule_desc.strip():
parts.append(f"raw: {full_log.strip()}")
return "\n".join(parts)
def parse(
lines: Iterator[str],
source_id: str,
compiled_patterns: list[tuple[LogPattern, object]],
ingest_time: str | None = None,
) -> Iterator[RetrievedEntry]:
ingest_time = ingest_time or now_iso()
state = SourceState()
for raw_line in lines:
raw_line = raw_line.strip()
if not raw_line:
continue
try:
alert = json.loads(raw_line)
except json.JSONDecodeError:
continue
if not isinstance(alert, dict):
continue
rule = alert.get("rule", {})
agent = alert.get("agent", {})
ts_raw = alert.get("timestamp", "")
ts_iso = _parse_timestamp(ts_raw)
level = int(rule.get("level", 0))
severity = _LEVEL_SEVERITY.get(level, "INFO")
# Qualify source_id by agent so logs from different hosts stay separate.
agent_name = agent.get("name", "")
src = f"{source_id}:{agent_name}" if agent_name else source_id
text = _build_text(alert)
if not text:
continue
repeat, out_of_order = state.observe(text, ts_iso)
matched = apply_patterns(text, compiled_patterns)
yield RetrievedEntry(
entry_id=make_entry_id(src, state.sequence, text),
source_id=src,
sequence=state.sequence,
timestamp_raw=ts_raw,
timestamp_iso=ts_iso,
ingest_time=ingest_time,
severity=severity,
repeat_count=repeat,
out_of_order=out_of_order,
matched_patterns=matched,
text=text,
)


@ -1,276 +0,0 @@
"""Ingest pipeline: auto-detect format, parse, write to SQLite."""
from __future__ import annotations
import json
import logging
import re
import sqlite3
from pathlib import Path
from typing import Iterator
import yaml
from app.ingest import caddy, dmesg_log, docker_log, journald, plaintext, plex, qbittorrent, servarr, syslog
from app.ingest.base import _compile, load_patterns, now_iso
from app.services.models import LogPattern, RetrievedEntry
from app.services.search import build_fts_index
logger = logging.getLogger(__name__)
_SCHEMA = """
CREATE TABLE IF NOT EXISTS log_entries (
id TEXT PRIMARY KEY,
source_id TEXT NOT NULL,
sequence INTEGER NOT NULL,
timestamp_raw TEXT,
timestamp_iso TEXT,
ingest_time TEXT NOT NULL,
severity TEXT,
repeat_count INTEGER DEFAULT 1,
out_of_order INTEGER DEFAULT 0,
matched_patterns TEXT DEFAULT '[]',
text TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_source ON log_entries(source_id);
CREATE INDEX IF NOT EXISTS idx_timestamp ON log_entries(timestamp_iso);
CREATE INDEX IF NOT EXISTS idx_ts_repeat ON log_entries(timestamp_iso, repeat_count);
CREATE INDEX IF NOT EXISTS idx_severity ON log_entries(severity);
CREATE INDEX IF NOT EXISTS idx_patterns ON log_entries(matched_patterns);
CREATE TABLE IF NOT EXISTS incidents (
id TEXT PRIMARY KEY,
label TEXT NOT NULL,
issue_type TEXT NOT NULL DEFAULT '',
started_at TEXT,
ended_at TEXT,
notes TEXT NOT NULL DEFAULT '',
created_at TEXT NOT NULL,
severity TEXT NOT NULL DEFAULT 'medium'
);
CREATE INDEX IF NOT EXISTS idx_incidents_time ON incidents(started_at, ended_at);
CREATE TABLE IF NOT EXISTS received_bundles (
id TEXT PRIMARY KEY,
source_host TEXT NOT NULL,
issue_type TEXT NOT NULL DEFAULT '',
label TEXT NOT NULL,
severity TEXT NOT NULL DEFAULT 'medium',
started_at TEXT,
bundled_at TEXT NOT NULL,
entry_count INTEGER NOT NULL DEFAULT 0,
bundle_json TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_bundles_bundled ON received_bundles(bundled_at);
CREATE INDEX IF NOT EXISTS idx_bundles_type ON received_bundles(issue_type);
"""
def ensure_schema(db_path: Path) -> None:
"""Create all tables and apply additive migrations. Safe to call on every startup."""
conn = sqlite3.connect(str(db_path))
conn.execute("PRAGMA journal_mode=WAL")
conn.executescript(_SCHEMA)
# Additive column migrations: SQLite raises OperationalError ("duplicate column name") when the column already exists; that error is swallowed below
for stmt in [
"ALTER TABLE incidents ADD COLUMN issue_type TEXT NOT NULL DEFAULT ''",
]:
try:
conn.execute(stmt)
except sqlite3.OperationalError:
pass
conn.commit()
conn.close()
def _detect_format(first_line: str) -> str:
try:
obj = json.loads(first_line)
if "__REALTIME_TIMESTAMP" in obj:
return "journald"
if "SOURCE" in obj and str(obj.get("SOURCE", "")).startswith("docker:"):
return "docker"
if "ts" in obj and ("msg" in obj or "message" in obj or "request" in obj):
return "caddy"
except (json.JSONDecodeError, AttributeError, TypeError):  # TypeError: first line parsed to a bare scalar such as 42 or null
pass
if plex.is_plex_log(first_line):
return "plex"
if qbittorrent.is_qbit_log(first_line):
return "qbittorrent"
if servarr.is_servarr_log(first_line):
return "servarr"
if dmesg_log.is_dmesg_log(first_line):
return "dmesg"
if syslog.is_syslog(first_line):
return "syslog"
return "plaintext"
def _parse_file(
path: Path,
compiled: list[tuple[LogPattern, object]],
ingest_time: str,
source_id: str | None = None,
) -> Iterator[RetrievedEntry]:
source_id = source_id or path.stem
with path.open("r", errors="replace") as f:
lines = iter(f)
try:
first = next(lines)
except StopIteration:
return
fmt = _detect_format(first.strip())
logger.info("Detected format %r for %s", fmt, path.name)
def all_lines():
yield first
yield from lines
if fmt == "journald":
yield from journald.parse(all_lines(), source_id, compiled, ingest_time)
elif fmt == "docker":
yield from docker_log.parse(all_lines(), source_id, compiled, ingest_time)
elif fmt == "caddy":
yield from caddy.parse(all_lines(), source_id, compiled, ingest_time)
elif fmt == "plex":
yield from plex.parse(all_lines(), source_id, compiled, ingest_time)
elif fmt == "qbittorrent":
yield from qbittorrent.parse(all_lines(), source_id, compiled, ingest_time)
elif fmt == "servarr":
yield from servarr.parse(all_lines(), source_id, compiled, ingest_time)
elif fmt == "dmesg":
yield from dmesg_log.parse(all_lines(), source_id, compiled, ingest_time)
elif fmt == "syslog":
yield from syslog.parse(all_lines(), source_id, compiled, ingest_time)
else:
yield from plaintext.parse(all_lines(), source_id, compiled, ingest_time)
def _write_batch(conn: sqlite3.Connection, batch: list[RetrievedEntry]) -> None:
conn.executemany(
"""
INSERT OR IGNORE INTO log_entries
(id, source_id, sequence, timestamp_raw, timestamp_iso,
ingest_time, severity, repeat_count, out_of_order,
matched_patterns, text)
VALUES (?,?,?,?,?,?,?,?,?,?,?)
""",
[
(
e.entry_id, e.source_id, e.sequence,
e.timestamp_raw, e.timestamp_iso, e.ingest_time,
e.severity, e.repeat_count, int(e.out_of_order),
json.dumps(list(e.matched_patterns)), e.text,
)
for e in batch
],
)
def _ingest_files(
files: list[Path],
db_path: Path,
pattern_file: Path | None = None,
batch_size: int = 1000,
source_id_map: dict[Path, str] | None = None,
) -> dict[str, int]:
pattern_file = pattern_file or Path("patterns/default.yaml")
patterns = load_patterns(pattern_file)
compiled = _compile(patterns)
ingest_time = now_iso()
source_id_map = source_id_map or {}
conn = sqlite3.connect(str(db_path))
conn.execute("PRAGMA journal_mode=WAL")
conn.executescript(_SCHEMA)
conn.commit()
stats: dict[str, int] = {}
for log_file in files:
source_id = source_id_map.get(log_file, log_file.stem)
count = 0
batch: list[RetrievedEntry] = []
for entry in _parse_file(log_file, compiled, ingest_time, source_id=source_id):
batch.append(entry)
if len(batch) >= batch_size:
_write_batch(conn, batch)
conn.commit()
count += len(batch)
batch.clear()
if batch:
_write_batch(conn, batch)
conn.commit()
count += len(batch)
stats[source_id] = stats.get(source_id, 0) + count
logger.info("Ingested %d entries from %s (source: %s)", count, log_file.name, source_id)
conn.close()
logger.info("Building FTS index...")
build_fts_index(db_path)
logger.info("FTS index ready")
return stats
def ingest(
corpus_dir: Path,
db_path: Path,
pattern_file: Path | None = None,
batch_size: int = 1000,
) -> dict[str, int]:
"""Ingest all .jsonl and .log files from a corpus directory."""
files = sorted(corpus_dir.glob("*.jsonl")) + sorted(corpus_dir.glob("*.log"))
return _ingest_files(files, db_path, pattern_file, batch_size)
def ingest_file(
log_file: Path,
db_path: Path,
pattern_file: Path | None = None,
) -> dict[str, int]:
"""Ingest a single log file (any supported format)."""
return _ingest_files([log_file], db_path, pattern_file)
def ingest_sources(
sources_file: Path,
db_path: Path,
pattern_file: Path | None = None,
batch_size: int = 1000,
) -> dict[str, int]:
"""Ingest all sources listed in a sources.yaml config file.
sources.yaml format:
    sources:
      - id: sonarr
        path: /opt/sonarr/config/logs/sonarr.0.txt
      - id: qbittorrent
        path: /opt/qbittorrent/config/data/logs/qbittorrent.log
Missing paths are skipped with a warning so the cron keeps running
when a service is temporarily down.
"""
with open(sources_file) as f:
config = yaml.safe_load(f)
files: list[Path] = []
source_id_map: dict[Path, str] = {}
for src in config.get("sources", []):
path = Path(src["path"])
if not path.exists():
logger.warning("Source %r not found, skipping: %s", src.get("id", "?"), path)
continue
files.append(path)
if "id" in src:
source_id_map[path] = src["id"]
if not files:
logger.warning("No source files found — check sources.yaml paths")
return {}
return _ingest_files(files, db_path, pattern_file, batch_size, source_id_map)
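
The JSON branch of `_detect_format` above can be exercised in isolation. The following is a minimal sketch, not the project's code: `sniff_json_format` is a hypothetical name, and it reproduces only the three key checks (journald, docker, caddy). It returns "plaintext" where the real function would go on to try the non-JSON detectors, and it assumes that a first line which parses to something other than a JSON object should be treated as non-JSON.

```python
import json


def sniff_json_format(first_line: str) -> str:
    """Classify a first log line using the same JSON key checks as _detect_format."""
    try:
        obj = json.loads(first_line)
    except json.JSONDecodeError:
        return "plaintext"
    # json.loads can also return a number, string, list, or None; only objects carry keys.
    if not isinstance(obj, dict):
        return "plaintext"
    if "__REALTIME_TIMESTAMP" in obj:
        return "journald"
    if str(obj.get("SOURCE", "")).startswith("docker:"):
        return "docker"
    if "ts" in obj and any(k in obj for k in ("msg", "message", "request")):
        return "caddy"
    return "plaintext"
```

Checking `isinstance(obj, dict)` up front keeps the key lookups from ever running against a scalar, so a log whose first line happens to be a bare number falls through cleanly.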


@ -11,7 +11,7 @@ from __future__ import annotations
 import logging
 import os
-import sqlite3
+import sqlite3 # still used for the pre-index-check on SQLite backend
 import sys
 from pathlib import Path
@ -53,15 +53,15 @@ _index_ready = False
 def _ensure_index() -> None:
-"""Build FTS index on first use; skip if already present."""
+"""Build FTS index on first use; skip if already present (SQLite only)."""
 global _index_ready
 if _index_ready:
 return
 try:
-conn = sqlite3.connect(str(DB_PATH))
-count = conn.execute("SELECT COUNT(*) FROM log_fts").fetchone()[0]
-conn.close()
+raw = sqlite3.connect(str(DB_PATH), timeout=30.0)
+count = raw.execute("SELECT COUNT(*) FROM log_fts").fetchone()[0]
+raw.close()
 if count > 0:
 _index_ready = True
 logger.info("FTS index present (%d entries)", count)
@ -93,8 +93,8 @@ def search_logs(
 Example: '"connection refused" OR "connection lost"'
 severity: Filter by level EMERGENCY, ALERT, CRITICAL, ERROR, WARN, NOTICE, INFO, DEBUG.
 source: Partial match on source_id. Format is 'corpus:host:service'.
-Example: 'example-node:caddy' matches all Caddy entries from example-node.
-pattern: Filter by named pattern tag applied at ingest time.
+Example: 'myserver:caddy' matches all Caddy entries from myserver.
+pattern: Filter by named pattern tag applied at glean time.
 Known tags: auth_failure, connection_lost, oom, segfault, disk_full,
 timeout, caddy_tls_error, caddy_config_error, caddy_auth_error,
 caddy_upstream_error, service_restart, service_update,
@ -176,7 +176,7 @@ def list_log_sources() -> str:
 """
 sources = list_sources(DB_PATH)
 if not sources:
-return "No log sources found. Has the corpus been ingested? Run: python scripts/ingest_corpus.py"
+return "No log sources found. Has the corpus been gleaned? Run: python scripts/glean_corpus.py"
 lines = [f"Corpus: {DB_PATH}", f"Sources ({len(sources)} total):\n"]
 for s in sources:
@ -192,7 +192,7 @@ def list_log_sources() -> str:
 if __name__ == "__main__":
 if not DB_PATH.exists():
 logger.error("Database not found: %s", DB_PATH)
-logger.error("Run: python scripts/ingest_corpus.py <corpus_dir> <db_path>")
+logger.error("Run: python scripts/glean_corpus.py <corpus_dir> <db_path>")
 sys.exit(1)
 logger.info("Starting Turnstone MCP server (DB: %s)", DB_PATH)
 mcp.run()

File diff suppressed because it is too large

app/services/anomaly.py (normal file, 305 lines changed)

@ -0,0 +1,305 @@
"""Anomaly scoring pipeline — batch-score log_entries with a HF classifier.
Designed to run after each glean cycle (or standalone). When no model is
configured the scorer is a no-op and returns immediately, so it is always
safe to wire into the glean pipeline.
Model: any HuggingFace text-classification model. The existing Hybrid-BERT
label map (from diagnose/classifier.py) is reused when the model produces
NORMAL, SECURITY_ANOMALY, or other labels from that vocabulary; other models get a generic severity map.
Scoring strategy
----------------
- Query unscored rows in batches (WHERE anomaly_scored_at IS NULL)
- Run each entry text through the HF pipeline
- Write anomaly_score + anomaly_label + anomaly_scored_at back
- INSERT high-confidence hits (score >= threshold) into detections table,
skipping duplicates so the scorer is safe to re-run
"""
from __future__ import annotations
import logging
import os
import time
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Any
from app.db import get_conn, resolve_tenant_id
from app.db.dialect import q
logger = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# Label maps — reuse Hybrid-BERT vocabulary from diagnose/classifier.py
# ---------------------------------------------------------------------------
_HYBRID_BERT_SEVERITY: dict[str, str] = {
"NORMAL": "INFO",
"SECURITY_ANOMALY": "ERROR",
"SYSTEM_FAILURE": "CRITICAL",
"PERFORMANCE_ISSUE": "WARN",
"NETWORK_ANOMALY": "WARN",
"CONFIG_ERROR": "ERROR",
"HARDWARE_ISSUE": "CRITICAL",
}
_GENERIC_SEVERITY: dict[str, str] = {
"CRITICAL": "CRITICAL",
"ERROR": "ERROR",
"WARNING": "WARN",
"WARN": "WARN",
"INFO": "INFO",
"DEBUG": "DEBUG",
}
_ANOMALOUS_LABELS: frozenset[str] = frozenset(
{
"SECURITY_ANOMALY",
"SYSTEM_FAILURE",
"PERFORMANCE_ISSUE",
"NETWORK_ANOMALY",
"CONFIG_ERROR",
"HARDWARE_ISSUE",
"CRITICAL",
"ERROR",
}
)
_DEFAULT_THRESHOLD = float(os.environ.get("TURNSTONE_ANOMALY_THRESHOLD", "0.75"))
_DEFAULT_MODEL = os.environ.get("TURNSTONE_ANOMALY_MODEL", "")
_DEFAULT_DEVICE = os.environ.get("TURNSTONE_ANOMALY_DEVICE", "cpu")
_DEFAULT_BATCH = int(os.environ.get("TURNSTONE_ANOMALY_BATCH", "256"))
# ---------------------------------------------------------------------------
# ML singleton
# ---------------------------------------------------------------------------
_pipeline: Any | None = None
def _get_pipeline(model_id: str, device: str) -> Any:
global _pipeline # noqa: PLW0603
if _pipeline is None:
from transformers import pipeline as hf_pipeline # type: ignore[import-untyped]
_pipeline = hf_pipeline("text-classification", model=model_id, device=device)
return _pipeline
def reset_pipeline() -> None:
"""Reset the cached pipeline singleton (test helper)."""
global _pipeline # noqa: PLW0603
_pipeline = None
# ---------------------------------------------------------------------------
# Result types
# ---------------------------------------------------------------------------
@dataclass
class ScoringResult:
scored: int = 0
detections: int = 0
skipped: bool = False
error: str | None = None
# ---------------------------------------------------------------------------
# Internal helpers
# ---------------------------------------------------------------------------
def _map_label(raw_label: str, score: float) -> tuple[str, str]:
"""Return (normalised_label, severity) for a raw model output label."""
upper = raw_label.upper()
if upper in _HYBRID_BERT_SEVERITY:
return upper, _HYBRID_BERT_SEVERITY[upper]
sev = _GENERIC_SEVERITY.get(upper, "WARN")
return upper, sev
def _fetch_unscored(conn: Any, tenant_id: str, limit: int) -> list[dict]:
rows = conn.execute(
q("""
SELECT id, source_id, text, timestamp_iso, severity
FROM log_entries
WHERE anomaly_scored_at IS NULL
AND (tenant_id = ? OR tenant_id = '')
ORDER BY ingest_time DESC
LIMIT ?
"""),
(tenant_id, limit),
).fetchall()
return [dict(r) for r in rows]
def _write_scores(
conn: Any,
rows: list[dict],
scored_at: str,
) -> None:
conn.executemany(
q("UPDATE log_entries SET anomaly_score = ?, anomaly_label = ?, anomaly_scored_at = ? WHERE id = ?"),
[(r["anomaly_score"], r["anomaly_label"], scored_at, r["id"]) for r in rows],
)
def _insert_detections(conn: Any, rows: list[dict], tenant_id: str, detected_at: str) -> int:
inserted = 0
for r in rows:
try:
conn.execute(
q("""
INSERT INTO detections
(id, tenant_id, entry_id, source_id, anomaly_label, anomaly_score,
severity, text, timestamp_iso, detected_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
"""),
(
str(uuid.uuid4()),
tenant_id,
r["id"],
r["source_id"],
r["anomaly_label"],
r["anomaly_score"],
r["severity"],
r["text"][:2000],
r.get("timestamp_iso"),
detected_at,
),
)
inserted += 1
except Exception: # noqa: BLE001
pass # duplicate entry_id or constraint violation — skip
return inserted
# ---------------------------------------------------------------------------
# Public API
# ---------------------------------------------------------------------------
def score_unscored(
db_path: Path,
model_id: str = _DEFAULT_MODEL,
device: str = _DEFAULT_DEVICE,
batch_size: int = _DEFAULT_BATCH,
threshold: float = _DEFAULT_THRESHOLD,
) -> ScoringResult:
"""Score all unscored log_entries in batches.
Returns immediately (skipped=True) when model_id is empty, which allows
unconditional wiring without requiring the model to be configured.
"""
if not model_id:
return ScoringResult(skipped=True)
try:
pipe = _get_pipeline(model_id, device)
except Exception as exc:
logger.error("Failed to load anomaly model %r: %s", model_id, exc)
return ScoringResult(error=str(exc))
tenant_id = resolve_tenant_id()
total_scored = 0
total_detections = 0
while True:
with get_conn(db_path) as conn:
batch = _fetch_unscored(conn, tenant_id, batch_size)
if not batch:
break
texts = [r["text"][:512] for r in batch]
try:
predictions = pipe(texts, truncation=True, max_length=512)
except Exception as exc:
logger.error("Inference error on batch of %d: %s", len(batch), exc)
return ScoringResult(scored=total_scored, detections=total_detections, error=str(exc))
scored_at = datetime.now(tz=timezone.utc).isoformat()
scored_rows: list[dict] = []
detection_rows: list[dict] = []
for row, pred in zip(batch, predictions):
label, severity = _map_label(pred["label"], pred["score"])
enriched = {**row, "anomaly_score": pred["score"], "anomaly_label": label, "severity": severity}
scored_rows.append(enriched)
if label in _ANOMALOUS_LABELS and pred["score"] >= threshold:
detection_rows.append(enriched)
for _attempt in range(4):
try:
with get_conn(db_path) as conn:
_write_scores(conn, scored_rows, scored_at)
det_count = _insert_detections(conn, detection_rows, tenant_id, scored_at)
conn.commit()
break
except Exception as exc:
if "database is locked" in str(exc).lower() and _attempt < 3:
logger.warning("DB locked, retrying write in 10s (attempt %d/4)", _attempt + 1)
time.sleep(10)
else:
raise
total_scored += len(scored_rows)
total_detections += det_count
logger.info(
"Scored %d entries, %d detections (threshold=%.2f)",
len(scored_rows), det_count, threshold,
)
if len(batch) < batch_size:
break
return ScoringResult(scored=total_scored, detections=total_detections)
def list_detections(
db_path: Path,
limit: int = 100,
unacked_only: bool = False,
label: str | None = None,
scorer: str | None = None,
) -> list[dict]:
"""Return detections ordered by detected_at DESC."""
tenant_id = resolve_tenant_id()
conditions = ["(tenant_id = ? OR tenant_id = '')"]
params: list[Any] = [tenant_id]
if unacked_only:
conditions.append("acknowledged = 0")
if label:
conditions.append(q("anomaly_label = ?"))
params.append(label.upper())
if scorer:
conditions.append(q("scorer = ?"))
params.append(scorer.lower())
where = " AND ".join(conditions)
with get_conn(db_path) as conn:
rows = conn.execute(
q(f"SELECT * FROM detections WHERE {where} ORDER BY detected_at DESC LIMIT ?"), # noqa: S608
(*params, limit),
).fetchall()
return [dict(r) for r in rows]
def acknowledge_detection(db_path: Path, detection_id: str, notes: str = "") -> bool:
"""Mark a detection as acknowledged. Returns True if a row was updated."""
tenant_id = resolve_tenant_id()
acked_at = datetime.now(tz=timezone.utc).isoformat()
with get_conn(db_path) as conn:
cur = conn.execute(
q("""
UPDATE detections
SET acknowledged = 1, acknowledged_at = ?, notes = ?
WHERE id = ? AND (tenant_id = ? OR tenant_id = '')
"""),
(acked_at, notes, detection_id, tenant_id),
)
conn.commit()
return cur.rowcount > 0
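
The detection gate in `score_unscored` (anomalous label and score at or above the threshold) can be sketched as a pure function. This is an illustration under assumptions, not the module's code: `select_detections` and the trimmed `ANOMALOUS` set are hypothetical, and the input is assumed to be the list of `{"label", "score"}` dicts that a HuggingFace text-classification pipeline returns for single-label output.

```python
# Hypothetical, trimmed-down stand-in for _ANOMALOUS_LABELS.
ANOMALOUS = frozenset({"SECURITY_ANOMALY", "SYSTEM_FAILURE", "ERROR", "CRITICAL"})


def select_detections(predictions, threshold=0.75):
    """Keep predictions whose label is anomalous and whose score clears the threshold."""
    hits = []
    for pred in predictions:
        label = pred["label"].upper()  # normalise case, as _map_label does
        if label in ANOMALOUS and pred["score"] >= threshold:
            hits.append({"anomaly_label": label, "anomaly_score": pred["score"]})
    return hits
```

Both conditions must hold: a confident NORMAL prediction is never a detection, and an anomalous label below the threshold is scored but not promoted.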

app/services/blocklist.py (normal file, 291 lines changed)

@ -0,0 +1,291 @@
"""Blocklist candidate extraction, management, and telemetry matching."""
from __future__ import annotations
import dataclasses
import json
import re
import uuid
from datetime import datetime, timezone
from pathlib import Path
from typing import Any
from app.db import get_conn, resolve_tenant_id
import yaml
# ---------------------------------------------------------------------------
# Data models
# ---------------------------------------------------------------------------
@dataclasses.dataclass(frozen=True)
class TelemetryRule:
name: str
domains: tuple[str, ...]
category: str
description: str
@dataclasses.dataclass
class BlocklistCandidate:
id: str
domain_or_ip: str
source_device_ip: str | None
source_device_name: str | None
first_seen: str
last_seen: str
hit_count: int
status: str
pushed_at: str | None
log_evidence: list[str]
matched_rule: str | None
llm_score: float | None
llm_reason: str | None
# ---------------------------------------------------------------------------
# Telemetry list
# ---------------------------------------------------------------------------
def load_telemetry_rules(path: Path) -> list[TelemetryRule]:
"""Load telemetry rules from a YAML file."""
data = yaml.safe_load(path.read_text())
return [
TelemetryRule(
name=r["name"],
domains=tuple(d.lower().strip(".") for d in r["domains"]),
category=r["category"],
description=r.get("description", ""),
)
for r in data.get("rules", [])
]
def matches_telemetry(domain: str, rules: list[TelemetryRule]) -> TelemetryRule | None:
"""Return the first rule whose domains include domain or a parent domain, else None."""
d = domain.lower().strip(".")
for rule in rules:
for rd in rule.domains:
if d == rd or d.endswith("." + rd):
return rule
return None
# ---------------------------------------------------------------------------
# Regex extractors for router log entries
# ---------------------------------------------------------------------------
_DNSMASQ_RE = re.compile(
r"query\[A{1,4}\]\s+(?P<domain>\S+)\s+from\s+(?P<src>[\d.]+)"
)
_IPTABLES_RE = re.compile(
r"SRC=(?P<src>[\d.]+).*?DST=(?P<dst>[\d.a-zA-Z.-]+)"
)
_VALID_STATUSES = {"pending", "approved", "rejected", "pushed", "unblocked"}
# ---------------------------------------------------------------------------
# DB helpers
# ---------------------------------------------------------------------------
def _now_iso() -> str:
return datetime.now(timezone.utc).isoformat()
def _row_to_candidate(row: Any) -> BlocklistCandidate:
return BlocklistCandidate(
id=row["id"],
domain_or_ip=row["domain_or_ip"],
source_device_ip=row["source_device_ip"],
source_device_name=row["source_device_name"],
first_seen=row["first_seen"],
last_seen=row["last_seen"],
hit_count=row["hit_count"],
status=row["status"],
pushed_at=row["pushed_at"],
log_evidence=json.loads(row["log_evidence"] or "[]"),
matched_rule=row["matched_rule"],
llm_score=row["llm_score"],
llm_reason=row["llm_reason"],
)
def _upsert_candidate(
conn: Any,
domain_or_ip: str,
source_device_ip: str | None,
source_device_name: str | None,
matched_rule: str | None,
entry_id: str,
now: str,
) -> bool:
"""Insert or update a candidate. Returns True if a new row was created."""
tid = resolve_tenant_id()
row = conn.execute(
"SELECT id, hit_count, log_evidence FROM blocklist_candidates "
"WHERE domain_or_ip = ? AND source_device_ip IS ? AND (tenant_id = ? OR tenant_id = '')",
(domain_or_ip, source_device_ip, tid),
).fetchone()
if row is None:
conn.execute(
"""INSERT INTO blocklist_candidates
(id, tenant_id, domain_or_ip, source_device_ip, source_device_name,
first_seen, last_seen, hit_count, status, pushed_at, log_evidence, matched_rule)
VALUES (?, ?, ?, ?, ?, ?, ?, 1, 'pending', NULL, ?, ?)""",
(
str(uuid.uuid4()), tid, domain_or_ip, source_device_ip, source_device_name,
now, now, json.dumps([entry_id]), matched_rule,
),
)
return True
existing_id = row["id"]
hit_count = row["hit_count"]
existing_evidence = row["log_evidence"]
evidence = json.loads(existing_evidence or "[]")
if entry_id not in evidence:
evidence.append(entry_id)
evidence = evidence[-10:] # cap at 10
conn.execute(
"UPDATE blocklist_candidates SET last_seen=?, hit_count=?, log_evidence=? WHERE id=?",
(now, hit_count + 1, json.dumps(evidence), existing_id),
)
return False
# ---------------------------------------------------------------------------
# Extraction scan
# ---------------------------------------------------------------------------
def run_scan(
db_path: Path,
router_source_ids: list[str],
device_map: dict[str, str],
telemetry_rules: list[TelemetryRule],
) -> int:
"""Scan log_entries from router sources, upsert blocklist candidates.
Only entries whose source IP is in device_map are recorded.
Returns the total number of rows created or updated.
"""
if not router_source_ids or not device_map:
return 0
placeholders = ",".join("?" for _ in router_source_ids)
now = _now_iso()
count = 0
tid = resolve_tenant_id()
with get_conn(db_path) as conn:
rows = conn.execute(
f"SELECT id, text FROM log_entries WHERE source_id IN ({placeholders}) AND (tenant_id = ? OR tenant_id = '')", # noqa: S608
(*router_source_ids, tid),
).fetchall()
for row in rows:
entry_id, text = row["id"], row["text"]
# Try the dnsmasq query format first, then fall back to an iptables SRC/DST line.
src_ip: str | None = None
dst: str | None = None
m = _DNSMASQ_RE.search(text)
if m:
src_ip = m.group("src")
dst = m.group("domain")
else:
m = _IPTABLES_RE.search(text)
if m:
src_ip = m.group("src")
dst = m.group("dst")
if src_ip is None or src_ip not in device_map:
continue
device_name = device_map[src_ip]
rule = matches_telemetry(dst, telemetry_rules) if dst else None
matched_rule_name = rule.name if rule else None
_upsert_candidate(conn, dst or "unknown", src_ip, device_name, matched_rule_name, entry_id, now)
count += 1
conn.commit()
return count
# ---------------------------------------------------------------------------
# Candidate CRUD
# ---------------------------------------------------------------------------
_CANDIDATE_SELECT = (
"SELECT id,domain_or_ip,source_device_ip,source_device_name,"
"first_seen,last_seen,hit_count,status,pushed_at,log_evidence,"
"matched_rule,llm_score,llm_reason FROM blocklist_candidates"
)
def list_candidates(
db_path: Path,
status: str | None = None,
device_ip: str | None = None,
) -> list[BlocklistCandidate]:
tid = resolve_tenant_id()
conditions = ["(tenant_id = ? OR tenant_id = '')"]
params: list = [tid]
if status and status != "all":
conditions.append("status = ?")
params.append(status)
if device_ip:
conditions.append("source_device_ip = ?")
params.append(device_ip)
where = " AND ".join(conditions)
with get_conn(db_path) as conn:
rows = conn.execute(
f"{_CANDIDATE_SELECT} WHERE {where} ORDER BY last_seen DESC", # noqa: S608
params,
).fetchall()
return [_row_to_candidate(r) for r in rows]
def _get_candidate(conn: Any, candidate_id: str) -> BlocklistCandidate:
row = conn.execute(
f"{_CANDIDATE_SELECT} WHERE id=?", # noqa: S608
(candidate_id,),
).fetchone()
if row is None:
raise KeyError(f"Candidate {candidate_id!r} not found")
return _row_to_candidate(row)
def get_candidate(db_path: Path, candidate_id: str) -> BlocklistCandidate:
"""Fetch a single candidate by ID. Raises KeyError if not found."""
with get_conn(db_path) as conn:
return _get_candidate(conn, candidate_id)
def update_candidate_status(db_path: Path, candidate_id: str, new_status: str) -> BlocklistCandidate:
if new_status not in _VALID_STATUSES:
raise ValueError(f"Invalid status {new_status!r}. Must be one of {_VALID_STATUSES}")
with get_conn(db_path) as conn:
conn.execute("UPDATE blocklist_candidates SET status=? WHERE id=?", (new_status, candidate_id))
conn.commit()
return _get_candidate(conn, candidate_id)
def mark_pushed(db_path: Path, candidate_id: str) -> BlocklistCandidate:
with get_conn(db_path) as conn:
conn.execute(
"UPDATE blocklist_candidates SET status='pushed', pushed_at=? WHERE id=?",
(_now_iso(), candidate_id),
)
conn.commit()
return _get_candidate(conn, candidate_id)
def mark_unblocked(db_path: Path, candidate_id: str) -> BlocklistCandidate:
with get_conn(db_path) as conn:
conn.execute("UPDATE blocklist_candidates SET status='unblocked' WHERE id=?", (candidate_id,))
conn.commit()
return _get_candidate(conn, candidate_id)
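
The parent-domain test used by `matches_telemetry` is worth seeing on its own, since the leading dot in the suffix check is what prevents false positives. A minimal sketch follows; `domain_matches` is a hypothetical helper name, and the domains in the usage are made up for illustration.

```python
def domain_matches(domain: str, rule_domain: str) -> bool:
    """True if domain equals rule_domain or is a subdomain of it."""
    d = domain.lower().strip(".")
    rd = rule_domain.lower().strip(".")
    # Prefixing "." means "notexample.com" does not match "example.com".
    return d == rd or d.endswith("." + rd)
```

For example, `domain_matches("telemetry.example.com", "example.com")` is true, while `domain_matches("notexample.com", "example.com")` is false because the suffix must begin at a label boundary.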

app/services/cybersec.py (normal file, 241 lines changed)

@ -0,0 +1,241 @@
"""Cybersecurity-focused scoring pipeline using zero-shot classification.
Runs a second-pass analysis on entries that were already flagged by the
anomaly scorer or that have pattern matches. Uses a zero-shot classification
model (DeBERTa-v3-base-mnli is cached locally) so no fine-tuning is needed.
The scorer writes ml_score / ml_label / ml_scored_at to log_entries and
inserts high-confidence non-normal hits into the detections table tagged
with scorer='cybersec'.
Env vars
--------
TURNSTONE_CYBERSEC_MODEL HF model id for zero-shot classification.
Recommended: MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli
(already cached from the diagnose pipeline).
Set to empty string to disable (safe default).
TURNSTONE_CYBERSEC_DEVICE 'cpu' (default) or 'cuda'
TURNSTONE_CYBERSEC_THRESHOLD float confidence floor for detection insertion (default 0.60)
"""
from __future__ import annotations
import logging
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Any
from app.db import get_conn, resolve_tenant_id
from app.db.dialect import q
logger = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# Candidate labels — cybersec vocabulary for zero-shot inference
# ---------------------------------------------------------------------------
CYBERSEC_LABELS: list[str] = [
"authentication failure or brute force attack",
"privilege escalation or unauthorized access",
"network intrusion or port scan",
"malware or suspicious process activity",
"data exfiltration or unusual outbound traffic",
"normal system operation",
]
_NORMAL_LABEL = "normal system operation"
_LABEL_SEVERITY: dict[str, str] = {
"authentication failure or brute force attack": "ERROR",
"privilege escalation or unauthorized access": "CRITICAL",
"network intrusion or port scan": "ERROR",
"malware or suspicious process activity": "CRITICAL",
"data exfiltration or unusual outbound traffic":"CRITICAL",
"normal system operation": "INFO",
}
# ---------------------------------------------------------------------------
# Pipeline singleton
# ---------------------------------------------------------------------------
_pipeline: Any = None
def _get_pipeline(model_id: str, device: str) -> Any:
global _pipeline # noqa: PLW0603
if _pipeline is None:
from transformers import pipeline # type: ignore[import-untyped]
logger.info("loading cybersec zero-shot pipeline: %s on %s", model_id, device)
_pipeline = pipeline(
"zero-shot-classification",
model=model_id,
device=0 if device == "cuda" else -1,
)
logger.info("cybersec pipeline ready")
return _pipeline
def reset_pipeline() -> None:
"""Clear the cached pipeline — for testing only."""
global _pipeline # noqa: PLW0603
_pipeline = None
# ---------------------------------------------------------------------------
# Result type
# ---------------------------------------------------------------------------
@dataclass
class CybersecResult:
scored: int = 0
detections: int = 0
skipped: bool = False
error: str | None = None
# ---------------------------------------------------------------------------
# Core scoring function
# ---------------------------------------------------------------------------
def score_security_entries(
db_path: Path,
model_id: str,
device: str = "cpu",
batch_size: int = 32,
threshold: float = 0.60,
) -> CybersecResult:
"""Score entries that were anomaly-flagged or pattern-matched.
Only entries with ml_scored_at IS NULL are processed (idempotent).
Writes ml_score / ml_label / ml_scored_at and inserts high-confidence
hits into detections with scorer='cybersec'.
"""
if not model_id:
return CybersecResult(skipped=True)
tenant_id = resolve_tenant_id()
try:
pipe = _get_pipeline(model_id, device)
except Exception as exc:
logger.error("failed to load cybersec pipeline: %s", exc)
return CybersecResult(error=str(exc))
total_scored = 0
total_detections = 0
try:
with get_conn(db_path) as conn:
# Only score entries that are worth a second look:
# anomaly-flagged (non-normal) OR have at least one pattern match.
rows = conn.execute(
q("""
SELECT id, source_id, text, timestamp_iso
FROM log_entries
WHERE (tenant_id = ? OR tenant_id = '')
AND ml_scored_at IS NULL
AND (
(anomaly_label IS NOT NULL AND anomaly_label != 'NORMAL')
OR (matched_patterns IS NOT NULL AND matched_patterns != '[]' AND matched_patterns != '')
)
LIMIT ?
"""),
(tenant_id, batch_size * 10),
).fetchall()
if not rows:
return CybersecResult(skipped=True)
# Process in chunks to avoid OOM on large backlogs
for i in range(0, len(rows), batch_size):
chunk = rows[i : i + batch_size]
texts = [r["text"] for r in chunk]
try:
results = pipe(texts, candidate_labels=CYBERSEC_LABELS, multi_label=False)
except Exception as exc:
logger.warning("zero-shot inference error on chunk %d: %s", i, exc)
continue
now = datetime.now(tz=timezone.utc).isoformat()
with get_conn(db_path) as conn:
for row, result in zip(chunk, results):
top_label: str = result["labels"][0]
top_score: float = result["scores"][0]
conn.execute(
q("""
UPDATE log_entries
SET ml_score = ?, ml_label = ?, ml_scored_at = ?
WHERE id = ? AND (tenant_id = ? OR tenant_id = '')
"""),
(top_score, top_label, now, row["id"], tenant_id),
)
total_scored += 1
if top_score >= threshold and top_label != _NORMAL_LABEL:
severity = _LABEL_SEVERITY.get(top_label, "WARN")
try:
conn.execute(
q("""
INSERT INTO detections
(id, tenant_id, entry_id, source_id, anomaly_label,
anomaly_score, severity, text, timestamp_iso,
detected_at, scorer)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, 'cybersec')
"""),
(
str(uuid.uuid4()),
tenant_id,
row["id"],
row["source_id"],
top_label,
top_score,
severity,
row["text"],
row["timestamp_iso"],
now,
),
)
total_detections += 1
except Exception:
pass # entry may already have a detection — skip
conn.commit()
except Exception as exc:
logger.error("cybersec scoring failed: %s", exc, exc_info=True)
return CybersecResult(scored=total_scored, detections=total_detections, error=str(exc))
return CybersecResult(scored=total_scored, detections=total_detections)
# ---------------------------------------------------------------------------
# Query helpers (used by REST layer)
# ---------------------------------------------------------------------------
def list_cybersec_detections(
db_path: Path,
limit: int = 100,
unacked_only: bool = False,
label: str | None = None,
) -> list[dict]:
"""Return cybersec detections ordered by detected_at DESC."""
tenant_id = resolve_tenant_id()
conditions = ["(tenant_id = ? OR tenant_id = '')", "scorer = 'cybersec'"]
params: list[Any] = [tenant_id]
if unacked_only:
conditions.append("acknowledged = 0")
if label:
conditions.append(q("anomaly_label = ?"))
params.append(label)
where = " AND ".join(conditions)
with get_conn(db_path) as conn:
rows = conn.execute(
q(f"SELECT * FROM detections WHERE {where} ORDER BY detected_at DESC LIMIT ?"), # noqa: S608
(*params, limit),
).fetchall()
return [dict(r) for r in rows]
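
The filter-building pattern in list_cybersec_detections can be tried in isolation. The sketch below is a simplified stand-in, not the project's code: it uses an in-memory SQLite database, a reduced detections schema, and made-up label values (BRUTE_FORCE, PORT_SCAN), and it leaves out the q() and get_conn() helpers. The point it illustrates is that only fixed SQL fragments are joined into the WHERE clause, while every caller-supplied value travels as a bound parameter.

```python
import sqlite3


def list_detections(conn, tenant_id, limit=100, unacked_only=False, label=None):
    """Assemble WHERE from fixed fragments; user values are bound, never interpolated."""
    conditions = ["(tenant_id = ? OR tenant_id = '')", "scorer = 'cybersec'"]
    params = [tenant_id]
    if unacked_only:
        conditions.append("acknowledged = 0")
    if label:
        conditions.append("anomaly_label = ?")
        params.append(label)
    where = " AND ".join(conditions)
    sql = f"SELECT * FROM detections WHERE {where} ORDER BY detected_at DESC LIMIT ?"
    return [dict(r) for r in conn.execute(sql, (*params, limit)).fetchall()]


# Hypothetical, reduced schema and rows for illustration only.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    "CREATE TABLE detections (id TEXT, tenant_id TEXT, scorer TEXT, "
    "anomaly_label TEXT, acknowledged INTEGER, detected_at TEXT)"
)
conn.executemany(
    "INSERT INTO detections VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("d1", "t1", "cybersec", "BRUTE_FORCE", 0, "2026-06-17T10:00:00+00:00"),
        ("d2", "t1", "cybersec", "PORT_SCAN", 1, "2026-06-17T11:00:00+00:00"),
        ("d3", "t2", "cybersec", "BRUTE_FORCE", 0, "2026-06-17T12:00:00+00:00"),
        ("d4", "", "cybersec", "PORT_SCAN", 0, "2026-06-17T09:00:00+00:00"),
        ("d5", "t1", "anomaly", "PORT_SCAN", 0, "2026-06-17T13:00:00+00:00"),
    ],
)

newest_first = [r["id"] for r in list_detections(conn, "t1")]
unacked = [r["id"] for r in list_detections(conn, "t1", unacked_only=True)]
port_scans = [r["id"] for r in list_detections(conn, "t1", label="PORT_SCAN")]
```

Rows with an empty tenant_id are visible to every tenant, which matches the `tenant_id = ? OR tenant_id = ''` condition used throughout this file.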


@@ -1,100 +0,0 @@
"""Frictionless diagnose service — NL time extraction + layered log search."""
from __future__ import annotations
import logging
import re
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Any
from app.services.search import SearchResult, entries_in_window, search
logger = logging.getLogger(__name__)
try:
from dateparser.search import search_dates as _search_dates # type: ignore[import]
_HAS_DATEPARSER = True
except ImportError:
_search_dates = None # type: ignore[assignment]
_HAS_DATEPARSER = False
def parse_time_window(query: str) -> tuple[str | None, str | None, str]:
"""Extract a time window from a natural-language query string.
Returns (since_iso, until_iso, keywords) where keywords is the query with
the matched time phrase stripped. Falls back to last-60-min window.
"""
if _HAS_DATEPARSER and _search_dates is not None:
try:
results = _search_dates(query, languages=["en"], settings={"PREFER_DATES_FROM": "past"})
except Exception:
logger.warning("dateparser failed on query %r — falling back to 60-min window", query)
results = None
if results:
phrase, dt = results[0]
if dt.tzinfo is None:
dt = dt.replace(tzinfo=timezone.utc)
since = (dt - timedelta(minutes=30)).isoformat()
until = (dt + timedelta(minutes=30)).isoformat()
keywords = re.sub(r"\s{2,}", " ", query.replace(phrase, " ").strip())
return since, until, keywords or query
return _last_n_minutes(60), _now_iso(), query
def diagnose(
db_path: Path,
query: str,
since: str | None = None,
until: str | None = None,
) -> dict[str, Any]:
"""Run layered log search with NL time extraction. Returns summary + entries."""
time_detected = since is not None and until is not None
if not time_detected:
parsed_since, parsed_until, keywords = parse_time_window(query)
since = since or parsed_since
until = until or parsed_until
time_detected = keywords != query
else:
keywords = query
keyword_hits = search(db_path, query=keywords, since=since, until=until, limit=150, or_mode=True)
window_hits = entries_in_window(db_path, since=since, until=until, limit=50)
seen: set[str] = set()
merged: list[SearchResult] = []
for r in keyword_hits + window_hits:
if r.entry_id not in seen:
seen.add(r.entry_id)
merged.append(r)
combined = sorted(merged, key=lambda r: (r.timestamp_iso or "\xff", r.sequence))[:200]
by_severity: dict[str, int] = {"CRITICAL": 0, "ERROR": 0, "WARN": 0, "INFO": 0}
by_source: dict[str, int] = {}
for r in combined:
sev = (r.severity or "INFO").upper()
if sev in by_severity:
by_severity[sev] += 1
by_source[r.source_id] = by_source.get(r.source_id, 0) + 1
return {
"summary": {
"total": len(combined),
"window_start": since,
"window_end": until,
"time_detected": time_detected,
"by_severity": by_severity,
"by_source": by_source,
},
"entries": combined,
}
def _now_iso() -> str:
return datetime.now(timezone.utc).isoformat()
def _last_n_minutes(n: int) -> str:
return (datetime.now(timezone.utc) - timedelta(minutes=n)).isoformat()
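
The merge step in diagnose() (keyword hits first, then window hits, de-duplicated by entry_id and sorted by timestamp) can be sketched on its own. The Hit dataclass below is a hypothetical stand-in for SearchResult that carries only the three fields the merge reads; the sort key mirrors the source, where a missing timestamp is replaced by "\xff" so that undated entries sort after every ISO timestamp.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical stand-in for SearchResult with only the fields used here."""
    entry_id: str
    timestamp_iso: str | None
    sequence: int


def merge_hits(keyword_hits: list[Hit], window_hits: list[Hit], cap: int = 200) -> list[Hit]:
    """De-duplicate by entry_id (first occurrence wins), then sort chronologically."""
    seen: set[str] = set()
    merged: list[Hit] = []
    for hit in keyword_hits + window_hits:
        if hit.entry_id not in seen:
            seen.add(hit.entry_id)
            merged.append(hit)
    # "\xff" compares greater than any ISO-8601 digit, so undated entries go last.
    return sorted(merged, key=lambda h: (h.timestamp_iso or "\xff", h.sequence))[:cap]


keyword_hits = [
    Hit("b", "2026-06-17T10:05:00+00:00", 2),
    Hit("a", "2026-06-17T10:00:00+00:00", 1),
]
window_hits = [
    Hit("a", "2026-06-17T10:00:00+00:00", 1),  # duplicate of a keyword hit
    Hit("c", None, 3),                          # no timestamp
    Hit("d", "2026-06-17T10:05:00+00:00", 1),
]
ordered_ids = [h.entry_id for h in merge_hits(keyword_hits, window_hits)]
```

Because the cap is applied after sorting, a full result set keeps the earliest entries and drops the latest ones, along with anything undated.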


@@ -0,0 +1,377 @@
"""Frictionless diagnose service — NL time extraction + layered log search.
This module is the public interface for the diagnose package.
Full implementation lives here so that patch("app.services.diagnose._HAS_DATEPARSER")
and patch("app.services.diagnose._search_dates") continue to target the correct
namespace, preserving backward compatibility with existing tests.
The verbatim original is preserved in legacy.py for reference.
"""
from __future__ import annotations
import asyncio
import dataclasses
import logging
import os
import re
from collections.abc import AsyncGenerator
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Any
from app.context.retriever import retrieve_context, format_context_block
from app.services.llm import summarize
from app.services.search import SearchResult, entries_in_window, search
from app.services.diagnose.pipeline import run_pipeline
logger = logging.getLogger(__name__)
try:
from dateparser.search import search_dates as _search_dates # type: ignore[import]
_HAS_DATEPARSER = True
except ImportError:
_search_dates = None # type: ignore[assignment]
_HAS_DATEPARSER = False
_RELATIVE_RE = re.compile(
r"\b(?:last|past)\s+(?:(?P<n>\d+)|(?P<approx>a\s+few|few|couple(?:\s+of)?|several))?\s*(?P<unit>minute|hour|day|week)s?\b",
re.IGNORECASE,
)
_RELATIVE_UNITS = {"minute": 1, "hour": 60, "day": 1440, "week": 10080}
# Fuzzy quantifiers map to a reasonable span so "last few hours" → 3h window
_APPROX_N = 3
def _relative_window(match: re.Match) -> tuple[str, str]:
"""Convert a relative time match to (since_iso, until_iso)."""
n_str = match.group("n")
approx = match.group("approx")
unit = match.group("unit").lower()
n = int(n_str) if n_str else (_APPROX_N if approx else 1)
minutes = n * _RELATIVE_UNITS[unit]
return _last_n_minutes(minutes), _now_iso()
def parse_time_window(query: str) -> tuple[str | None, str | None, str]:
"""Extract a time window from a natural-language query string.
Returns (since_iso, until_iso, keywords) where keywords is the query with
the matched time phrase stripped. Falls back to last-60-min window.
"""
# Handle relative expressions first ("last hour", "past 30 minutes", etc.)
# dateparser misinterprets these as absolute times.
m = _RELATIVE_RE.search(query)
if m:
since, until = _relative_window(m)
keywords = re.sub(r"\s{2,}", " ", query[: m.start()] + query[m.end() :]).strip()
return since, until, keywords or query
if _HAS_DATEPARSER and _search_dates is not None:
# Tell dateparser the user's timezone so "3:35 am" is read as local time.
# PREFER_DATES_FROM="past" resolves "3:35 am" to the most recent past
# occurrence rather than a future one.
# NOTE: offset_h is truncated to whole hours, so fractional offsets
# (e.g. UTC+5:30) lose their minutes component.
local_offset = datetime.now().astimezone().utcoffset()
offset_h = int((local_offset.total_seconds() if local_offset else 0) / 3600)
tz_str = f"UTC{'+' if offset_h >= 0 else ''}{offset_h}"
try:
results = _search_dates(
query,
languages=["en"],
settings={
"PREFER_DATES_FROM": "past",
"TIMEZONE": tz_str,
"RETURN_AS_TIMEZONE_AWARE": True,
},
)
except Exception as e:
logger.warning(
"dateparser failed (%s) on query %r — falling back to 60-min window",
type(e).__name__,
query,
)
results = None
if results:
phrase, dt = results[0]
if dt.tzinfo is None:
dt = dt.replace(tzinfo=timezone.utc)
else:
dt = dt.astimezone(
timezone.utc
) # normalise to UTC for SQLite string compare
since = (dt - timedelta(minutes=30)).isoformat()
until = (dt + timedelta(minutes=30)).isoformat()
keywords = re.sub(r"\s{2,}", " ", query.replace(phrase, " ").strip())
return since, until, keywords or query
return _last_n_minutes(60), _now_iso(), query
def diagnose(
db_path: Path,
query: str,
since: str | None = None,
until: str | None = None,
source_filter: str | None = None,
llm_url: str | None = None,
llm_model: str | None = None,
llm_api_key: str | None = None,
) -> dict[str, Any]:
"""Run layered log search with NL time extraction. Returns summary + entries."""
time_detected = since is not None and until is not None
if not time_detected:
parsed_since, parsed_until, keywords = parse_time_window(query)
since = since or parsed_since
until = until or parsed_until
time_detected = keywords != query
else:
keywords = query
keyword_hits = search(
db_path,
query=keywords,
since=since,
until=until,
source_filter=source_filter,
limit=150,
or_mode=True,
)
window_hits = entries_in_window(
db_path,
since=since,
until=until,
source_filter=source_filter,
limit=50,
per_source_cap=15,
)
seen: set[str] = set()
merged: list[SearchResult] = []
for r in keyword_hits + window_hits:
if r.entry_id not in seen:
seen.add(r.entry_id)
merged.append(r)
combined = sorted(merged, key=lambda r: (r.timestamp_iso or "\xff", r.sequence))[
:200
]
by_severity: dict[str, int] = {"CRITICAL": 0, "ERROR": 0, "WARN": 0, "INFO": 0}
by_source: dict[str, int] = {}
for r in combined:
sev = (r.severity or "INFO").upper()
if sev in by_severity:
by_severity[sev] += 1
by_source[r.source_id] = by_source.get(r.source_id, 0) + 1
reasoning: str | None = None
if llm_url and llm_model:
reasoning = summarize(
query, combined, llm_url=llm_url, llm_model=llm_model, api_key=llm_api_key
)
return {
"summary": {
"total": len(combined),
"window_start": since,
"window_end": until,
"time_detected": time_detected,
"by_severity": by_severity,
"by_source": by_source,
},
"reasoning": reasoning,
"entries": combined,
}
async def diagnose_stream(
db_path: Path,
query: str,
since: str | None = None,
until: str | None = None,
source_filter: str | None = None,
llm_url: str | None = None,
llm_model: str | None = None,
llm_api_key: str | None = None,
context_db_path: Path | None = None,
incidents_db_path: Path | None = None,
tech_level: str = "sysadmin",
pattern_domain: dict[str, str] | None = None,
) -> AsyncGenerator[dict[str, Any], None]:
"""Async generator yielding SSE event dicts for the diagnose pipeline.
Yields events in order:
{"type": "status", "message": "..."} pipeline progress
{"type": "context", "facts": ..., "chunks": ...} retrieved environment context
{"type": "summary", "data": {...}} window + severity counts (fast, from DB)
{"type": "entries", "data": [...]} log entries (fast, from DB)
{"type": "reasoning", "text": "..."} LLM analysis (slow, optional)
{"type": "done"}
"""
keywords = query.strip()
source_browse = not keywords and source_filter is not None
if source_browse:
# No keyword — browsing a source directly. Use 24h window; skip FTS entirely.
yield {"type": "status", "message": f"Loading {source_filter}"}
since = since or _last_n_minutes(60 * 24)
until = until or _now_iso()
time_detected = False
else:
yield {"type": "status", "message": "Parsing time window…"}
time_detected = since is not None and until is not None
if not time_detected:
parsed_since, parsed_until, keywords = await asyncio.to_thread(
parse_time_window, query
)
since = since or parsed_since
until = until or parsed_until
time_detected = keywords != query
yield {"type": "status", "message": "Loading environment context…"}
_ctx_db = context_db_path or db_path
ctx = await asyncio.to_thread(lambda: retrieve_context(_ctx_db, query))
yield {
"type": "context",
"facts": ctx.facts,
"chunks": ctx.chunks,
}
yield {"type": "status", "message": "Searching logs…"}
if source_browse:
keyword_hits: list[SearchResult] = []
window_hits = await asyncio.to_thread(
lambda: entries_in_window(
db_path,
since,
until,
source_filter=source_filter,
limit=200,
)
)
else:
keyword_hits, window_hits = await asyncio.gather(
asyncio.to_thread(
lambda: search(
db_path,
keywords,
source_filter=source_filter,
since=since,
until=until,
limit=150,
or_mode=True,
semantic=True,
)
),
asyncio.to_thread(
lambda: entries_in_window(
db_path,
since,
until,
source_filter=source_filter,
limit=50,
per_source_cap=15,
)
),
)
seen: set[str] = set()
merged: list[SearchResult] = []
for r in keyword_hits + window_hits:
if r.entry_id not in seen:
seen.add(r.entry_id)
merged.append(r)
combined = sorted(merged, key=lambda r: (r.timestamp_iso or "\xff", r.sequence))[
:200
]
by_severity: dict[str, int] = {"CRITICAL": 0, "ERROR": 0, "WARN": 0, "INFO": 0}
by_source: dict[str, int] = {}
for r in combined:
sev = (r.severity or "INFO").upper()
if sev in by_severity:
by_severity[sev] += 1
by_source[r.source_id] = by_source.get(r.source_id, 0) + 1
by_domain: dict[str, int] = {}
if pattern_domain:
for r in combined:
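# Count each domain at most once per entry; a separate name avoids
# shadowing the entry-id set used for de-duplication above.
seen_domains: set[str] = set()
for tag in (r.matched_patterns or []):
d = pattern_domain.get(tag, "")
if d and d not in seen_domains:
seen_domains.add(d)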
by_domain[d] = by_domain.get(d, 0) + 1
yield {
"type": "summary",
"data": {
"total": len(combined),
"window_start": since,
"window_end": until,
"time_detected": time_detected,
"by_severity": by_severity,
"by_source": by_source,
"by_domain": by_domain,
},
}
yield {"type": "entries", "data": [dataclasses.asdict(r) for r in combined]}
if MULTI_AGENT_ENABLED:
async for event in run_pipeline(
db_path=db_path,
entries=combined,
ctx=ctx,
query=query,
since=since,
until=until,
llm_url=llm_url,
llm_model=llm_model,
llm_api_key=llm_api_key,
tech_level=tech_level,
incidents_db_path=incidents_db_path,
):
yield event
return # pipeline emits its own "done" event
if llm_url and llm_model and combined:
# Only compute context_block in the legacy path — pipeline uses ctx directly.
context_block = format_context_block(ctx)
yield {"type": "status", "message": "Analyzing with LLM…"}
reasoning = await asyncio.to_thread(
lambda: summarize(
query,
combined,
llm_url,
llm_model,
llm_api_key,
context_block=context_block,
)
)
if reasoning:
yield {"type": "reasoning", "text": reasoning}
yield {"type": "done"}
def _now_iso() -> str:
return datetime.now(timezone.utc).isoformat()
def _last_n_minutes(n: int) -> str:
return (datetime.now(timezone.utc) - timedelta(minutes=n)).isoformat()
__all__ = [
"diagnose",
"diagnose_stream",
"parse_time_window",
]
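# Feature flag (Task 6): set TURNSTONE_MULTI_AGENT_DIAGNOSE=true to route
# diagnose_stream through run_pipeline instead of the single summarize() call.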
MULTI_AGENT_ENABLED = (
os.getenv("TURNSTONE_MULTI_AGENT_DIAGNOSE", "false").lower() == "true"
)
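
The relative-time branch of parse_time_window above can be exercised without a database or dateparser. The sketch below copies the regex and unit table from the module and wraps them in a hypothetical helper, relative_minutes, that returns the window length in minutes plus the query with the time phrase stripped; the real function returns ISO timestamps instead.

```python
from __future__ import annotations

import re

RELATIVE_RE = re.compile(
    r"\b(?:last|past)\s+(?:(?P<n>\d+)|(?P<approx>a\s+few|few|couple(?:\s+of)?|several))?"
    r"\s*(?P<unit>minute|hour|day|week)s?\b",
    re.IGNORECASE,
)
UNIT_MINUTES = {"minute": 1, "hour": 60, "day": 1440, "week": 10080}
APPROX_N = 3  # "few", "couple", "several" all map to 3 units


def relative_minutes(query: str) -> tuple[int, str] | None:
    """Return (window_minutes, remaining_keywords), or None if no relative phrase."""
    m = RELATIVE_RE.search(query)
    if m is None:
        return None
    if m.group("n"):
        n = int(m.group("n"))
    else:
        n = APPROX_N if m.group("approx") else 1
    minutes = n * UNIT_MINUTES[m.group("unit").lower()]
    # Cut the matched phrase out and collapse the double space it leaves behind.
    keywords = re.sub(r"\s{2,}", " ", query[: m.start()] + query[m.end() :]).strip()
    return minutes, keywords


few_hours = relative_minutes("nginx errors in the last few hours")
thirty = relative_minutes("ssh failures past 30 minutes")
bare_unit = relative_minutes("disk full last hour on db1")
absolute = relative_minutes("what happened at 3:35 am")
```

A query with no "last" or "past" phrase returns None here, which corresponds to the point where the module hands the query to dateparser.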


@@ -0,0 +1,174 @@
"""Shared LLM client for the multi-agent diagnose pipeline.
Both Stage 3 (RootCauseHypothesizer) and Stage 5 (SummarySynthesizer) send
messages to the same LLM backend using the same two-step pattern:
1. Try the cf-orch task endpoint (product-scoped inference routing).
2. Fall back to a direct OpenAI-compatible model call by name.
Centralising here means changes to auth headers, timeouts, retry logic, or
cf-orch payload structure only need to be made once.
"""
from __future__ import annotations
import logging
import re
import httpx
logger = logging.getLogger(__name__)
# Regex that strips ```json … ``` or ``` … ``` fences from LLM output.
_JSON_FENCE_RE = re.compile(
r"^```(?:json)?\s*|\s*```$",
re.MULTILINE,
)
# Reasoning models (DeepSeek-R1, Qwen QwQ, Llama thinking variants) embed
# chain-of-thought inside <think>…</think> tags in the content field.
# Strip them so only the final response reaches the UI.
_THINK_TAG_RE = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
def _strip_thinking(text: str) -> str:
"""Remove <think>…</think> blocks and trim surrounding whitespace."""
return _THINK_TAG_RE.sub("", text).strip()
def extract_content(resp_json: dict) -> str | None:
"""Pull text content from an OpenAI-compat chat completion response.
Strips reasoning-model thinking tags before returning.
Returns None when the response has no choices or empty content.
"""
choices = resp_json.get("choices") or []
if not choices:
return None
raw = (choices[0].get("message", {}).get("content") or "").strip()
if not raw:
return None
return _strip_thinking(raw) or None
def strip_json_fences(raw: str) -> str:
"""Remove markdown code fences that some LLMs wrap around JSON output.
Example: '```json\\n[...]\\n```' -> '[...]'
"""
return _JSON_FENCE_RE.sub("", raw).strip()
def extract_first_json_array(raw: str) -> str:
"""Extract the first complete JSON array from a string.
Reasoning models (e.g. foundation-sec-8b) sometimes emit valid JSON and
then repeat it inside a markdown fence. Standard json.loads() fails on the
combined text. This function scans for the first '[' and walks to its
matching ']', handling nested structures.
Returns the extracted substring, or the original string if no array found
(so the caller's json.loads() fails with the usual error message).
"""
start = raw.find("[")
if start == -1:
return raw
depth = 0
in_string = False
escape_next = False
for i, ch in enumerate(raw[start:], start=start):
if escape_next:
escape_next = False
continue
if ch == "\\" and in_string:
escape_next = True
continue
if ch == '"':
in_string = not in_string
continue
if in_string:
continue
if ch == "[":
depth += 1
elif ch == "]":
depth -= 1
if depth == 0:
return raw[start : i + 1]
return raw # unbalanced — return as-is so caller sees the error
def call_llm(
llm_url: str,
llm_model: str,
llm_api_key: str | None,
messages: list[dict],
task_name: str = "log_analysis",
timeout: float = 120.0,
max_tokens: int = 2048,
) -> str | None:
"""Send messages to the LLM; return raw text or None on failure.
Tries the cf-orch task endpoint first (product-routed inference).
Falls back to a direct OpenAI-compat ``/v1/chat/completions`` call when:
- The task endpoint returns 404 (no assignment for this task).
- The task endpoint is unreachable (connection error, timeout, etc.).
Args:
llm_url: Base URL of the LLM backend (e.g. ``http://<YOUR_HOST_IP>:7700``).
llm_model: Model identifier used in the OpenAI-compat fallback call.
llm_api_key: Optional bearer token for authenticated endpoints.
messages: OpenAI-style message list (system + user turns).
task_name: cf-orch task name for product-routed inference (default: ``log_analysis``).
timeout: Request timeout in seconds (default: 120).
max_tokens: Maximum tokens to generate (default: 2048). Prevents mid-sentence
truncation when the backend default is lower than the output needs.
Returns:
Raw text content string, or None if both paths fail.
"""
headers: dict[str, str] = {}
if llm_api_key:
headers["Authorization"] = f"Bearer {llm_api_key}"
# --- Path 1: cf-orch task endpoint ---
task_url = f"{llm_url.rstrip('/')}/api/inference/task"
try:
resp = httpx.post(
task_url,
json={
"product": "turnstone",
"task": task_name,
"payload": {"messages": messages, "stream": False, "max_tokens": max_tokens},
},
headers=headers,
timeout=timeout,
)
if resp.status_code == 200:
return extract_content(resp.json())
if resp.status_code != 404:
resp.raise_for_status()
logger.debug(
"No task assignment for turnstone.%s — falling back to direct model",
task_name,
)
except Exception as exc: # noqa: BLE001
# Broad catch is intentional: captures network errors, timeouts, and
# any backend-specific exceptions so the pipeline can fall back.
logger.debug(
"Task endpoint unavailable (%s) — falling back to direct model", exc
)
# --- Path 2: OpenAI-compat fallback ---
try:
resp = httpx.post(
f"{llm_url.rstrip('/')}/v1/chat/completions",
json={"model": llm_model, "messages": messages, "stream": False, "max_tokens": max_tokens},
headers=headers,
timeout=timeout,
)
resp.raise_for_status()
return extract_content(resp.json())
except Exception as exc: # noqa: BLE001
logger.warning("LLM call failed (%s): %s", type(exc).__name__, exc)
return None
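
The two clean-up passes in this module (dropping <think> blocks, then dropping markdown fences) can be checked against a sample response. The sketch below is an illustration rather than the module itself: clean_llm_json is a hypothetical helper that chains both steps, the fence pattern is written with `{3}` quantifiers (equivalent to the three literal backticks used above), and the sample text is invented.

```python
import json
import re

THINK_TAG_RE = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
# Same pattern as _JSON_FENCE_RE, with the backtick run written as `{3}.
JSON_FENCE_RE = re.compile(r"^`{3}(?:json)?\s*|\s*`{3}$", re.MULTILINE)


def clean_llm_json(raw: str) -> str:
    """Strip reasoning blocks first, then any code fence wrapped around the JSON."""
    without_thinking = THINK_TAG_RE.sub("", raw).strip()
    return JSON_FENCE_RE.sub("", without_thinking).strip()


FENCE = "`" * 3
sample = (
    "<THINK>\nThe OOM killer fired before the service restart.\n</THINK>\n"
    f"{FENCE}json\n"
    '[{"title": "Memory exhaustion", "confidence": 0.8}]\n'
    f"{FENCE}"
)
parsed = json.loads(clean_llm_json(sample))
```

Stripping the thinking block before the fences matters: the fence pattern is anchored to line starts and ends, so removing the leading block first leaves the opening fence at the very start of the text.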


@@ -0,0 +1,274 @@
"""Stage 2: Severity Classifier — ML with two fallback levels.
Classification strategy (in priority order):
Path A (ML): Hugging Face text-classification pipeline, loaded lazily.
Path B (pattern_tags): Map cluster.pattern_tags through the loaded pattern
severity dict; pick the highest severity across matching tags.
Path C (regex): Call detect_severity() from app.glean.base on the cluster's
representative_text.
Each cluster is classified independently. The ``classifier_used`` field on the
returned ``ClassifiedTimeline`` reflects the primary path (the one that governed
the overall classification session, not individual cluster fallbacks).
"""
from __future__ import annotations
import logging
import os
from pathlib import Path
from typing import Any
from types import MappingProxyType
from app.services.diagnose.models import (
ClassifiedTimeline,
EventCluster,
SeverityLabel,
TimelineResult,
)
logger = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# Module-level ML singleton — reset to None between tests via the fixture
# ---------------------------------------------------------------------------
_ml_classifier: Any | None = None
def _get_ml_classifier(model_id: str, device: str) -> Any:
"""Return the cached HF pipeline, loading it on first call."""
global _ml_classifier # noqa: PLW0603
if _ml_classifier is None:
from transformers import pipeline as hf_pipeline # type: ignore[import-untyped]
_ml_classifier = hf_pipeline(
"text-classification", model=model_id, device=device
)
return _ml_classifier
# ---------------------------------------------------------------------------
# Label mapping
# ---------------------------------------------------------------------------
_LABEL_MAP: dict[str, SeverityLabel] = {
"ERROR": "ERROR",
"WARNING": "WARN",
"WARN": "WARN",
"INFO": "INFO",
"DEBUG": "DEBUG",
"CRITICAL": "CRITICAL",
}
# Label shim for krishnas4415/log-anomaly-detection-models (Hybrid-BERT, MIT).
# Maps the model's 7-class output vocabulary to Turnstone SeverityLabel.
# Checked against the model config.json — labels confirmed in turnstone#41.
_HYBRID_BERT_LABEL_MAP: dict[str, SeverityLabel] = {
"NORMAL": "INFO",
"SECURITY_ANOMALY": "ERROR",
"SYSTEM_FAILURE": "CRITICAL",
"PERFORMANCE_ISSUE": "WARN",
"NETWORK_ANOMALY": "WARN",
"CONFIG_ERROR": "ERROR",
"HARDWARE_ISSUE": "CRITICAL",
}
_CRITICAL_KEYWORDS: frozenset[str] = frozenset(
{
"panic",
"oom",
"fatal",
"critical",
"kernel panic",
"out of memory",
"segfault",
"segmentation fault",
}
)
_SEVERITY_ORDER: dict[str | None, int] = {
"CRITICAL": 5,
"ERROR": 4,
"WARN": 3,
"WARNING": 3,
"INFO": 2,
"DEBUG": 1,
None: 0,
}
def _map_label(label: str, score: float, text: str) -> SeverityLabel:
"""Translate a raw model output label to a Turnstone SeverityLabel.
Handles two model vocabularies:
- Standard (ERROR/WARN/INFO/CRITICAL/DEBUG): byviz/bylastic_classification_logs
- Hybrid-BERT (normal/security_anomaly/...): krishnas4415/log-anomaly-detection-models
Applies keyword-based CRITICAL promotion and low-confidence DEBUG demotion
on top of the base mapping.
"""
upper = label.upper()
# Resolve via Hybrid-BERT map first, then standard map, then UNKNOWN.
base: SeverityLabel = _HYBRID_BERT_LABEL_MAP.get(upper) or _LABEL_MAP.get(upper, "UNKNOWN") # type: ignore[assignment]
if base == "ERROR" and score > 0.95 and any(
k in text.lower() for k in _CRITICAL_KEYWORDS
):
return "CRITICAL"
if base == "INFO" and score < 0.4:
return "DEBUG"
return base
def _highest_from_tags(
tags: tuple[str, ...], severity_map: dict[str, str]
) -> SeverityLabel | None:
"""Return the highest severity from the pattern_tags that appear in severity_map."""
best: str | None = None
best_rank = -1
for tag in tags:
sev = severity_map.get(tag)
rank = _SEVERITY_ORDER.get(sev, 0)
if rank > best_rank:
best_rank = rank
best = sev
if best is None:
return None
normalised = "WARN" if best.upper() == "WARNING" else best.upper()
return normalised # type: ignore[return-value]
# ---------------------------------------------------------------------------
# SeverityClassifier
# ---------------------------------------------------------------------------
class SeverityClassifier:
"""Classify each EventCluster's severity using ML, patterns, or regex fallback.
Parameters
----------
model_id:
Hugging Face model identifier. When empty (default), ML is skipped.
device:
Torch device string passed to the HF pipeline (e.g. ``"cpu"`` or ``"cuda:0"``).
pattern_file:
Path to the YAML pattern file. When ``None`` the classifier reads
``TURNSTONE_PATTERNS`` env var (same logic as ``app/rest.py``).
"""
def __init__(
self,
model_id: str = "",
device: str = "cpu",
pattern_file: Path | None = None,
) -> None:
self._model_id = model_id
self._device = device
self._pattern_file: Path | None = pattern_file
self._pattern_severity: dict[str, str] = {}
self._patterns_loaded = False
# ------------------------------------------------------------------
# Lazy loaders
# ------------------------------------------------------------------
def _resolve_pattern_file(self) -> Path | None:
"""Resolve pattern file from constructor arg or env var."""
if self._pattern_file is not None:
return self._pattern_file
env_dir = os.environ.get("TURNSTONE_PATTERNS")
if env_dir:
return Path(env_dir) / "default.yaml"
return None
def _ensure_patterns_loaded(self) -> None:
"""Populate _pattern_severity from the pattern YAML file (once)."""
if self._patterns_loaded:
return
self._patterns_loaded = True
path = self._resolve_pattern_file()
if path is None:
return
from app.glean.base import load_patterns
patterns = load_patterns(path)
self._pattern_severity = {p.name: p.severity for p in patterns}
# ------------------------------------------------------------------
# Per-cluster classification helpers
# ------------------------------------------------------------------
def _classify_cluster_ml(self, cluster: EventCluster) -> SeverityLabel | None:
"""Attempt ML classification. Returns None on any inference failure."""
try:
pipe = _get_ml_classifier(self._model_id, self._device)
results = pipe(cluster.representative_text)
if not results:
return None
hit = results[0]
return _map_label(hit["label"], hit["score"], cluster.representative_text)
except Exception: # noqa: BLE001
logger.warning(
"ML inference failed for cluster %s — falling back",
cluster.cluster_id,
)
return None
def _classify_cluster_pattern_tags(
self, cluster: EventCluster
) -> SeverityLabel | None:
"""Derive severity from the cluster's pattern_tags. Returns None if no match."""
return _highest_from_tags(cluster.pattern_tags, self._pattern_severity)
def _classify_cluster_regex(self, cluster: EventCluster) -> SeverityLabel:
"""Classify by scanning representative_text with the severity regex."""
from app.glean.base import detect_severity
raw = detect_severity(cluster.representative_text)
if raw is None:
return "INFO"
return _LABEL_MAP.get(raw.upper(), "INFO") # type: ignore[return-value]
# ------------------------------------------------------------------
# Public API
# ------------------------------------------------------------------
def classify(self, timeline: TimelineResult) -> ClassifiedTimeline:
"""Classify every cluster in *timeline* and return a ``ClassifiedTimeline``."""
self._ensure_patterns_loaded()
# Determine which primary path governs this session
ml_available = bool(self._model_id)
patterns_available = bool(self._pattern_severity)
if ml_available:
classifier_used: str = "ml"
elif patterns_available:
classifier_used = "pattern_tags"
else:
classifier_used = "regex"
cluster_severities: dict[str, SeverityLabel] = {}
for cluster in timeline.clusters:
severity: SeverityLabel | None = None
if ml_available:
severity = self._classify_cluster_ml(cluster)
if severity is None and patterns_available:
severity = self._classify_cluster_pattern_tags(cluster)
if severity is None:
severity = self._classify_cluster_regex(cluster)
cluster_severities[cluster.cluster_id] = severity
return ClassifiedTimeline(
timeline=timeline,
cluster_severities=MappingProxyType(cluster_severities),
classifier_used=classifier_used, # type: ignore[arg-type]
model_id=self._model_id if ml_available else None,
)
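
The ML, pattern_tags, regex ordering used by SeverityClassifier.classify is an instance of a first-non-None fallback chain. The sketch below shows that idea with stub classifiers; every name in it (classify_with_fallback, fake_ml, by_tag, by_regex) is hypothetical. Unlike the class above, it reports which path produced each individual label, whereas classifier_used in ClassifiedTimeline records only the primary path for the whole session.

```python
from typing import Callable, Optional

Classifier = Callable[[str], Optional[str]]


def classify_with_fallback(
    text: str,
    classifiers: list[tuple[str, Classifier]],
    default: str = "INFO",
) -> tuple[str, str]:
    """Try each classifier in order; the first non-None label wins."""
    for name, classify in classifiers:
        try:
            label = classify(text)
        except Exception:  # a failing path must not block the later ones
            label = None
        if label is not None:
            return label, name
    return default, "default"


def fake_ml(text: str) -> Optional[str]:
    """Stub for an ML path whose model is unavailable."""
    raise RuntimeError("model not loaded")


def by_tag(text: str) -> Optional[str]:
    """Stub for a pattern-tag lookup."""
    return "ERROR" if "auth" in text else None


def by_regex(text: str) -> Optional[str]:
    """Stub for a severity regex."""
    return "WARN" if "warn" in text.lower() else None


chain = [("ml", fake_ml), ("pattern_tags", by_tag), ("regex", by_regex)]
auth_result = classify_with_fallback("auth failure for root", chain)
warn_result = classify_with_fallback("WARN: disk at 91%", chain)
plain_result = classify_with_fallback("service started", chain)
```

Treating an exception the same as "no answer" is what lets an unavailable model degrade to the cheaper paths instead of failing the whole classification.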


@@ -0,0 +1,167 @@
"""Stage 3: Root-Cause Hypothesizer — LLM + RAG context."""
from __future__ import annotations
import json
import logging
from uuid import uuid4
from app.context.retriever import RetrievedContext
from app.services.diagnose._llm_client import call_llm, extract_first_json_array, strip_json_fences
from app.services.diagnose.models import (
ClassifiedTimeline,
EventCluster,
Hypothesis,
SeverityLabel,
)
logger = logging.getLogger(__name__)
_VALID_SEVERITIES: frozenset[str] = frozenset({"CRITICAL", "ERROR", "WARN", "INFO", "DEBUG"})
_SYSTEM_PROMPT = (
"You are a Linux sysadmin log analyst. Analyze the following clustered log timeline "
"and generate 2-4 root cause hypotheses as a JSON array.\n\n"
"Each hypothesis must follow this exact JSON schema:\n"
'{"title": str (≤80 chars), "description": str (2-4 sentences), '
'"confidence": float (0.0-1.0), "severity": str (one of: CRITICAL, ERROR, WARN, INFO), '
'"supporting_clusters": [str list of cluster IDs]}\n\n'
"Return ONLY a valid JSON array. No prose, no markdown, no explanation outside the JSON."
)
def _coerce_float(val: object, default: float) -> float:
"""Safely coerce LLM output to float, returning default on failure."""
try:
return float(val) # type: ignore[arg-type]
except (TypeError, ValueError):
return default
def _validate_severity(s: str) -> SeverityLabel:
"""Map a raw severity string to a valid SeverityLabel, defaulting to ERROR."""
upper = s.upper()
if upper == "WARNING":
return "WARN"
return upper if upper in _VALID_SEVERITIES else "ERROR" # type: ignore[return-value]
def _cluster_summary(cluster: EventCluster, severity: str) -> str:
"""Build a condensed single-line summary of a cluster for the prompt."""
sources = ", ".join(list(cluster.source_ids)[:3])
patterns = ", ".join(list(cluster.pattern_tags)[:5])
text_preview = cluster.representative_text[:200]
summary = (
f"[{severity}] {cluster.start_iso or 'unknown'} "
f"({sources}) — {text_preview}"
)
if patterns:
summary += f" [patterns: {patterns}]"
return summary
class RootCauseHypothesizer:
"""Generate ranked root-cause hypotheses from a classified log timeline."""
def __init__(self, max_hypotheses: int = 4) -> None:
self._max_hypotheses = max_hypotheses
def hypothesize(
self,
classified: ClassifiedTimeline,
ctx: RetrievedContext,
query: str,
llm_url: str | None = None,
llm_model: str | None = None,
llm_api_key: str | None = None,
) -> list[Hypothesis]:
"""Generate hypotheses from a classified timeline and RAG context.
Returns an empty list when no LLM is configured or there are no
clusters to analyse.
"""
if not llm_url or not llm_model:
return []
clusters = classified.timeline.clusters
if not clusters:
return []
cluster_lines = [
_cluster_summary(c, classified.cluster_severities.get(c.cluster_id, c.severity))
for c in clusters
]
cluster_block = "\n".join(cluster_lines)
context_parts: list[str] = []
for chunk in ctx.chunks[:5]:
filename = chunk.get("filename", "unknown")
text = chunk.get("text", "")[:300]
context_parts.append(f"[{filename}] {text}")
context_block = "\n".join(context_parts) if context_parts else "(none)"
user_message = (
f"Query: {query}\n\n"
f"Context from runbooks and known patterns:\n{context_block}\n\n"
f"Log timeline (clustered, {len(clusters)} clusters):\n{cluster_block}\n\n"
f"Generate up to {self._max_hypotheses} hypotheses. Return JSON array only."
)
messages = [
{"role": "system", "content": _SYSTEM_PROMPT},
{"role": "user", "content": user_message},
]
raw_response = call_llm(
llm_url=llm_url,
llm_model=llm_model,
llm_api_key=llm_api_key,
messages=messages,
max_tokens=1024, # JSON array of 2-4 hypotheses; 1024 is sufficient
)
if raw_response is None:
return []
return self._parse_response(raw_response)
def _parse_response(self, raw: str) -> list[Hypothesis]:
"""Parse the LLM JSON response into a list of Hypothesis objects.
Strips markdown code fences before parsing, since some LLMs wrap JSON in
triple-backtick fences despite being instructed not to.
"""
try:
# extract_first_json_array handles reasoning models that emit valid
# JSON then repeat it inside a markdown fence block.
data = json.loads(extract_first_json_array(strip_json_fences(raw)))
except json.JSONDecodeError:
logger.warning(
"Hypothesizer: invalid JSON from LLM (truncated): %.120s", raw
)
return []
if not isinstance(data, list):
logger.warning(
"Hypothesizer: expected JSON array, got %s", type(data).__name__
)
return []
hypotheses: list[Hypothesis] = []
for item in data[: self._max_hypotheses]:
if not isinstance(item, dict):
continue
severity_raw = item.get("severity", "ERROR")
severity = _validate_severity(str(severity_raw))
hypothesis = Hypothesis(
hypothesis_id=str(uuid4()),
title=str(item.get("title", "Unknown"))[:80],
description=str(item.get("description", "")),
confidence=_coerce_float(item.get("confidence"), 0.5),
supporting_cluster_ids=tuple(
str(x) for x in (item.get("supporting_clusters") or [])
),
runbook_refs=(),
severity=severity,
)
hypotheses.append(hypothesis)
return hypotheses
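
The defensive parsing in _parse_response can be illustrated with plain dicts instead of Hypothesis objects. The sketch below is a reduced version under that assumption: parse_hypotheses is a hypothetical helper, the sample payload is invented, and only the title, confidence, and severity fields are kept. It shows how malformed items are skipped, unparseable confidences fall back to 0.5, and unknown severities fall back to ERROR.

```python
import json

VALID_SEVERITIES = frozenset({"CRITICAL", "ERROR", "WARN", "INFO", "DEBUG"})


def coerce_float(val: object, default: float) -> float:
    """Coerce LLM output to float, returning the default on failure."""
    try:
        return float(val)  # type: ignore[arg-type]
    except (TypeError, ValueError):
        return default


def normalise_severity(raw: object) -> str:
    """Map WARNING to WARN; anything outside the known set becomes ERROR."""
    upper = str(raw).upper()
    if upper == "WARNING":
        return "WARN"
    return upper if upper in VALID_SEVERITIES else "ERROR"


def parse_hypotheses(raw_json: str, max_items: int = 4) -> list[dict]:
    """Parse a JSON array of hypotheses, tolerating bad items and bad fields."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    parsed = []
    for item in data[:max_items]:
        if not isinstance(item, dict):
            continue  # skip stray strings or numbers in the array
        parsed.append(
            {
                "title": str(item.get("title", "Unknown"))[:80],
                "confidence": coerce_float(item.get("confidence"), 0.5),
                "severity": normalise_severity(item.get("severity", "ERROR")),
            }
        )
    return parsed


sample = (
    '[{"title": "Disk full on /var", "confidence": "0.9", "severity": "warning"},'
    ' "stray string",'
    ' {"title": "Unknown cause", "confidence": "high", "severity": "fatal"}]'
)
hypotheses = parse_hypotheses(sample)
```

As in the source, the confidence is coerced but not clamped, so a model that returns 1.7 keeps that value; callers that need a strict 0.0 to 1.0 range would have to clamp it themselves.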


@ -0,0 +1,318 @@
"""Frictionless diagnose service — NL time extraction + layered log search."""
from __future__ import annotations
import asyncio
import dataclasses
import logging
import re
from collections.abc import AsyncGenerator
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Any
from app.context.retriever import retrieve_context, format_context_block
from app.services.llm import summarize
from app.services.search import SearchResult, entries_in_window, search
logger = logging.getLogger(__name__)
try:
from dateparser.search import search_dates as _search_dates # type: ignore[import]
_HAS_DATEPARSER = True
except ImportError:
_search_dates = None # type: ignore[assignment]
_HAS_DATEPARSER = False
_RELATIVE_RE = re.compile(
r"\b(?:last|past)\s+(?:(?P<n>\d+)|(?P<approx>a\s+few|few|couple(?:\s+of)?|several))?\s*(?P<unit>minute|hour|day|week)s?\b",
re.IGNORECASE,
)
_RELATIVE_UNITS = {"minute": 1, "hour": 60, "day": 1440, "week": 10080}
# Fuzzy quantifiers map to a reasonable span so "last few hours" → 3h window
_APPROX_N = 3
def _relative_window(match: re.Match) -> tuple[str, str]:
    """Convert a relative time match to (since_iso, until_iso)."""
    n_str = match.group("n")
    approx = match.group("approx")
    unit = match.group("unit").lower()
    # An explicit count wins; fuzzy quantifiers use _APPROX_N; a bare unit ("last hour") means 1.
    n = int(n_str) if n_str else (_APPROX_N if approx else 1)
    minutes = n * _RELATIVE_UNITS[unit]
    return _last_n_minutes(minutes), _now_iso()
def parse_time_window(query: str) -> tuple[str | None, str | None, str]:
"""Extract a time window from a natural-language query string.
Returns (since_iso, until_iso, keywords) where keywords is the query with
the matched time phrase stripped. Falls back to last-60-min window.
"""
# Handle relative expressions first ("last hour", "past 30 minutes", etc.)
# dateparser misinterprets these as absolute times.
m = _RELATIVE_RE.search(query)
if m:
since, until = _relative_window(m)
keywords = re.sub(r"\s{2,}", " ", query[: m.start()] + query[m.end() :]).strip()
return since, until, keywords or query
if _HAS_DATEPARSER and _search_dates is not None:
# Tell dateparser the user's timezone so "3:35 am" is read as local time.
# PREFER_DATES_FROM=past makes "3:35 am" resolve to the most recent past
# occurrence rather than a future one. The offset below is truncated to whole
# hours, so half-hour zones such as UTC+5:30 lose their minutes.
local_offset = datetime.now().astimezone().utcoffset()
offset_h = int((local_offset.total_seconds() if local_offset else 0) / 3600)
tz_str = f"UTC{'+' if offset_h >= 0 else ''}{offset_h}"
try:
results = _search_dates(
query,
languages=["en"],
settings={
"PREFER_DATES_FROM": "past",
"TIMEZONE": tz_str,
"RETURN_AS_TIMEZONE_AWARE": True,
},
)
except Exception:
logger.warning(
"dateparser failed on query %r — falling back to 60-min window", query
)
results = None
if results:
phrase, dt = results[0]
if dt.tzinfo is None:
dt = dt.replace(tzinfo=timezone.utc)
else:
dt = dt.astimezone(
timezone.utc
) # normalise to UTC for SQLite string compare
since = (dt - timedelta(minutes=30)).isoformat()
until = (dt + timedelta(minutes=30)).isoformat()
keywords = re.sub(r"\s{2,}", " ", query.replace(phrase, " ").strip())
return since, until, keywords or query
return _last_n_minutes(60), _now_iso(), query
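As a worked example of the relative-expression branch above, the sketch below reproduces the same pattern and unit table and returns the window length in minutes instead of ISO timestamps, so the result does not depend on the current time. The function name `window_minutes` is an assumption made for illustration.

```python
from __future__ import annotations

import re

# Same pattern and unit table as the module above, copied so the sketch runs alone.
RELATIVE_RE = re.compile(
    r"\b(?:last|past)\s+(?:(?P<n>\d+)|(?P<approx>a\s+few|few|couple(?:\s+of)?|several))?"
    r"\s*(?P<unit>minute|hour|day|week)s?\b",
    re.IGNORECASE,
)
UNIT_MINUTES = {"minute": 1, "hour": 60, "day": 1440, "week": 10080}
APPROX_N = 3


def window_minutes(query: str) -> tuple[int, str] | None:
    """Return (window length in minutes, remaining keywords), or None if nothing matched."""
    m = RELATIVE_RE.search(query)
    if m is None:
        return None
    n = int(m.group("n")) if m.group("n") else (APPROX_N if m.group("approx") else 1)
    minutes = n * UNIT_MINUTES[m.group("unit").lower()]
    # Cut the matched phrase out of the query and collapse the leftover whitespace.
    keywords = re.sub(r"\s{2,}", " ", query[: m.start()] + query[m.end():]).strip()
    return minutes, keywords
```

With this sketch, "nginx errors in the last few hours" gives a 180-minute window and leaves "nginx errors in the" as keywords, while a query with no "last" or "past" phrase returns `None` and would fall through to the dateparser branch.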
def diagnose(
db_path: Path,
query: str,
since: str | None = None,
until: str | None = None,
source_filter: str | None = None,
llm_url: str | None = None,
llm_model: str | None = None,
llm_api_key: str | None = None,
) -> dict[str, Any]:
"""Run layered log search with NL time extraction. Returns summary + entries."""
time_detected = since is not None and until is not None
if not time_detected:
parsed_since, parsed_until, keywords = parse_time_window(query)
since = since or parsed_since
until = until or parsed_until
time_detected = keywords != query
else:
keywords = query
keyword_hits = search(
db_path,
query=keywords,
since=since,
until=until,
source_filter=source_filter,
limit=150,
or_mode=True,
)
window_hits = entries_in_window(
db_path,
since=since,
until=until,
source_filter=source_filter,
limit=50,
per_source_cap=15,
)
seen: set[str] = set()
merged: list[SearchResult] = []
for r in keyword_hits + window_hits:
if r.entry_id not in seen:
seen.add(r.entry_id)
merged.append(r)
combined = sorted(merged, key=lambda r: (r.timestamp_iso or "\xff", r.sequence))[
:200
]
by_severity: dict[str, int] = {"CRITICAL": 0, "ERROR": 0, "WARN": 0, "INFO": 0}
by_source: dict[str, int] = {}
for r in combined:
sev = (r.severity or "INFO").upper()
if sev in by_severity:
by_severity[sev] += 1
by_source[r.source_id] = by_source.get(r.source_id, 0) + 1
reasoning: str | None = None
if llm_url and llm_model:
reasoning = summarize(
query, combined, llm_url=llm_url, llm_model=llm_model, api_key=llm_api_key
)
return {
"summary": {
"total": len(combined),
"window_start": since,
"window_end": until,
"time_detected": time_detected,
"by_severity": by_severity,
"by_source": by_source,
},
"reasoning": reasoning,
"entries": combined,
}
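The merge step in `diagnose` (dedupe by `entry_id`, sort by timestamp then sequence, cap at 200) can be sketched in isolation. `Row` is a hypothetical stand-in for `SearchResult` that carries only the fields the merge reads.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """Hypothetical stand-in for SearchResult with only the fields used here."""

    entry_id: str
    timestamp_iso: str | None
    sequence: int


def merge_hits(keyword_hits: list[Row], window_hits: list[Row], cap: int = 200) -> list[Row]:
    """Dedupe by entry_id (first occurrence wins), then sort chronologically."""
    seen: set[str] = set()
    merged: list[Row] = []
    for row in keyword_hits + window_hits:
        if row.entry_id not in seen:
            seen.add(row.entry_id)
            merged.append(row)
    # "\xff" compares greater than any ISO 8601 digit, so rows without a timestamp go last.
    merged.sort(key=lambda r: (r.timestamp_iso or "\xff", r.sequence))
    return merged[:cap]
```

Because keyword hits are iterated first, a row found by both queries keeps the keyword-search copy.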
async def diagnose_stream(
db_path: Path,
query: str,
since: str | None = None,
until: str | None = None,
source_filter: str | None = None,
llm_url: str | None = None,
llm_model: str | None = None,
llm_api_key: str | None = None,
) -> AsyncGenerator[dict[str, Any], None]:
"""Async generator yielding SSE event dicts for the diagnose pipeline.
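Yields events in order:
{"type":"status","message":""} pipeline progress (emitted before each step)
{"type":"context","facts":...,"chunks":...} retrieved environment context
{"type":"summary","data":{}} window + severity counts (fast, from DB)
{"type":"entries","data":[]} log entries (fast, from DB)
{"type":"reasoning","text":""} LLM analysis (slow; only when an LLM is configured and entries exist)
{"type":"done"}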
"""
keywords = query.strip()
source_browse = not keywords and source_filter is not None
if source_browse:
# No keyword — browsing a source directly. Use 24h window; skip FTS entirely.
yield {"type": "status", "message": f"Loading {source_filter}"}
since = since or _last_n_minutes(60 * 24)
until = until or _now_iso()
time_detected = False
else:
yield {"type": "status", "message": "Parsing time window…"}
time_detected = since is not None and until is not None
if not time_detected:
parsed_since, parsed_until, keywords = await asyncio.to_thread(
parse_time_window, query
)
since = since or parsed_since
until = until or parsed_until
time_detected = keywords != query
yield {"type": "status", "message": "Loading environment context…"}
ctx = await asyncio.to_thread(lambda: retrieve_context(db_path, query))
context_block = format_context_block(ctx)
yield {
"type": "context",
"facts": ctx.facts,
"chunks": ctx.chunks,
}
yield {"type": "status", "message": "Searching logs…"}
if source_browse:
keyword_hits: list[SearchResult] = []
window_hits = await asyncio.to_thread(
lambda: entries_in_window(
db_path,
since,
until,
source_filter=source_filter,
limit=200,
)
)
else:
keyword_hits, window_hits = await asyncio.gather(
asyncio.to_thread(
lambda: search(
db_path,
keywords,
source_filter=source_filter,
since=since,
until=until,
limit=150,
or_mode=True,
)
),
asyncio.to_thread(
lambda: entries_in_window(
db_path,
since,
until,
source_filter=source_filter,
limit=50,
per_source_cap=15,
)
),
)
seen: set[str] = set()
merged: list[SearchResult] = []
for r in keyword_hits + window_hits:
if r.entry_id not in seen:
seen.add(r.entry_id)
merged.append(r)
combined = sorted(merged, key=lambda r: (r.timestamp_iso or "\xff", r.sequence))[
:200
]
by_severity: dict[str, int] = {"CRITICAL": 0, "ERROR": 0, "WARN": 0, "INFO": 0}
by_source: dict[str, int] = {}
for r in combined:
sev = (r.severity or "INFO").upper()
if sev in by_severity:
by_severity[sev] += 1
by_source[r.source_id] = by_source.get(r.source_id, 0) + 1
yield {
"type": "summary",
"data": {
"total": len(combined),
"window_start": since,
"window_end": until,
"time_detected": time_detected,
"by_severity": by_severity,
"by_source": by_source,
},
}
yield {"type": "entries", "data": [dataclasses.asdict(r) for r in combined]}
if llm_url and llm_model and combined:
yield {"type": "status", "message": "Analyzing with LLM…"}
reasoning = await asyncio.to_thread(
lambda: summarize(
query,
combined,
llm_url,
llm_model,
llm_api_key,
context_block=context_block,
)
)
if reasoning:
yield {"type": "reasoning", "text": reasoning}
yield {"type": "done"}
def _now_iso() -> str:
    return datetime.now(timezone.utc).isoformat()


def _last_n_minutes(n: int) -> str:
    return (datetime.now(timezone.utc) - timedelta(minutes=n)).isoformat()
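The event dicts yielded by `diagnose_stream` still have to be turned into Server-Sent Events frames somewhere; that route code is not part of this diff. A minimal sketch of one way to do it, assuming each event becomes a single `data:` frame. `fake_events`, `to_sse_frame`, and `collect_frames` are hypothetical names.

```python
import asyncio
import json
from collections.abc import AsyncGenerator
from typing import Any


async def fake_events() -> AsyncGenerator[dict[str, Any], None]:
    """Hypothetical stand-in for diagnose_stream(); yields the same event shapes."""
    yield {"type": "status", "message": "Searching logs"}
    yield {"type": "summary", "data": {"total": 0}}
    yield {"type": "done"}


def to_sse_frame(event: dict[str, Any]) -> str:
    """Serialise one event dict as an SSE frame: a data line followed by a blank line."""
    return f"data: {json.dumps(event)}\n\n"


async def collect_frames(events: AsyncGenerator[dict[str, Any], None]) -> list[str]:
    """Drain the generator into frames, stopping once the terminal event is seen."""
    frames: list[str] = []
    async for event in events:
        frames.append(to_sse_frame(event))
        if event.get("type") == "done":
            break
    return frames


frames = asyncio.run(collect_frames(fake_events()))
```

In a real endpoint the frames would be streamed to the client as they are produced rather than collected into a list; the list is used here only to keep the example self-contained.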

@ -0,0 +1,77 @@
"""Pipeline data types for the multi-agent diagnose pipeline."""
from __future__ import annotations
from dataclasses import dataclass
from types import MappingProxyType
from typing import Literal
SeverityLabel = Literal["CRITICAL", "ERROR", "WARN", "INFO", "DEBUG", "UNKNOWN"]
@dataclass(frozen=True)
class EventCluster:
"""A time-correlated group of log entries within the timeline."""
cluster_id: str
entries: tuple[str, ...] # entry_id refs
start_iso: str | None
end_iso: str | None
duration_seconds: float
source_ids: tuple[str, ...]
pattern_tags: tuple[str, ...]
severity: SeverityLabel
burst: bool
gap_before_seconds: float
representative_text: str
@dataclass(frozen=True)
class TimelineResult:
"""Structured timeline of event clusters built from log entries."""
clusters: tuple[EventCluster, ...]
total_entries: int
window_start: str | None
window_end: str | None
gap_count: int
burst_count: int
dominant_sources: tuple[str, ...]
@dataclass(frozen=True)
class ClassifiedTimeline:
    """Timeline annotated with a classifier-assigned severity per cluster.

    ``cluster_severities`` is a ``MappingProxyType`` so the mapping is
    read-only, consistent with the ``frozen=True`` intent.
    """

    timeline: TimelineResult
    cluster_severities: MappingProxyType  # MappingProxyType[str, SeverityLabel]
    classifier_used: Literal["ml", "pattern_tags", "regex"]
    model_id: str | None
@dataclass(frozen=True)
class Hypothesis:
"""A root-cause hypothesis generated by Stage 3."""
hypothesis_id: str
title: str
description: str
confidence: float
supporting_cluster_ids: tuple[str, ...]
runbook_refs: tuple[str, ...]
severity: SeverityLabel
@dataclass(frozen=True)
class RankedHypothesis:
"""A hypothesis enriched by Stage 4 false-positive suppression."""
hypothesis: Hypothesis
novelty_score: float
similarity_to_known: float
suppress: bool
suppression_reason: str | None
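`frozen=True` only blocks attribute reassignment; a plain `dict` field could still be mutated in place, which is presumably why `ClassifiedTimeline` wraps its mapping in `MappingProxyType`. The sketch below illustrates that pattern with a trimmed-down, hypothetical dataclass (`ClassifiedSketch` and `freeze` are not part of the project).

```python
from __future__ import annotations

from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class ClassifiedSketch:
    """Hypothetical, trimmed-down analogue of ClassifiedTimeline."""

    cluster_severities: MappingProxyType


def freeze(severities: dict[str, str]) -> ClassifiedSketch:
    """Wrap a private copy of the dict in a read-only proxy."""
    # Copy first: the proxy is a live view, so later changes to the caller's
    # dict would otherwise show through it.
    return ClassifiedSketch(cluster_severities=MappingProxyType(dict(severities)))
```

Item assignment through the proxy raises `TypeError`, and because `freeze` copies the input, later edits to the caller's dictionary do not change the stored mapping.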

@ -0,0 +1,173 @@
"""Multi-agent diagnose pipeline orchestrator — Stage 15 wiring."""
from __future__ import annotations
import asyncio
import dataclasses
import logging
import os
from collections.abc import AsyncGenerator
from pathlib import Path
from typing import Any
# Optional ML classifier model for Stage 2.
# When empty (default), Stage 2 falls back to pattern_tags then regex.
# Set TURNSTONE_CLASSIFIER_MODEL to a HuggingFace model ID to enable ML classification.
# Recommended: byviz/bylastic_classification_logs (DistilBERT, ~300MB)
_CLASSIFIER_MODEL: str = os.environ.get("TURNSTONE_CLASSIFIER_MODEL", "")
from app.context.retriever import RetrievedContext
from app.services.diagnose.classifier import SeverityClassifier
from app.services.diagnose.hypothesizer import RootCauseHypothesizer
from app.services.diagnose.suppressor import FalsePositiveSuppressor
from app.services.diagnose.synthesizer import SummarySynthesizer
from app.services.diagnose.timeline import TimelineReconstructor
from app.services.search import SearchResult
logger = logging.getLogger(__name__)
async def run_pipeline(
db_path: Path,
entries: list[SearchResult],
ctx: RetrievedContext,
query: str,
since: str | None, # reserved for future range-filtering in stage queries (#29 follow-up)
until: str | None, # reserved for future range-filtering in stage queries (#29 follow-up)
llm_url: str | None,
llm_model: str | None,
llm_api_key: str | None,
tech_level: str = "sysadmin",
incidents_db_path: Path | None = None,
) -> AsyncGenerator[dict[str, Any], None]:
"""Async generator that runs all 5 pipeline stages and yields SSE event dicts.
Stages:
1. TimelineReconstructor: cluster log entries by time
2. SeverityClassifier: annotate clusters with severity
3. RootCauseHypothesizer: generate hypotheses via LLM
4. FalsePositiveSuppressor: rank and suppress known patterns
5. SummarySynthesizer: produce a narrative diagnosis
Yields events in order:
{"type": "status", "message": "Building timeline…"}
{"type": "pipeline_stage", "stage": 1, ...}
{"type": "pipeline_stage", "stage": 2, ...}
{"type": "pipeline_stage", "stage": 3, ...}
{"type": "pipeline_stage", "stage": 4, ...}
{"type": "hypotheses", "data": [...]}
{"type": "status", "message": "Synthesizing…"}
{"type": "reasoning", "text": "..."} only when synthesis produces text
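{"type": "done"}
If any stage raises, the generator instead yields
{"type": "error", "message": "Pipeline error in stage N (name)"} followed by
{"type": "done"} and stops.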
"""
# Stage 1: Timeline reconstruction
yield {"type": "status", "message": "Building timeline…"}
try:
timeline = await asyncio.to_thread(
TimelineReconstructor().reconstruct, entries
)
except Exception as exc:
logger.exception("Stage 1 (timeline) failed: %s", exc)
yield {"type": "error", "message": "Pipeline error in stage 1 (timeline)"}
yield {"type": "done"}
return
n_clusters = len(timeline.clusters)
burst = timeline.burst_count
yield {
"type": "pipeline_stage",
"stage": 1,
"name": "timeline",
"message": f"Built {n_clusters} clusters, {burst} bursts",
}
# Stage 2: Severity classification
try:
classified = await asyncio.to_thread(
SeverityClassifier(model_id=_CLASSIFIER_MODEL).classify, timeline
)
except Exception as exc:
logger.exception("Stage 2 (classifier) failed: %s", exc)
yield {"type": "error", "message": "Pipeline error in stage 2 (classifier)"}
yield {"type": "done"}
return
sev_counts: dict[str, int] = {}
for sev in classified.cluster_severities.values():
sev_counts[sev] = sev_counts.get(sev, 0) + 1
counts_str = ", ".join(f"{k}:{v}" for k, v in sorted(sev_counts.items()))
yield {
"type": "pipeline_stage",
"stage": 2,
"name": "classifier",
"message": f"{classified.classifier_used} classifier: {counts_str}",
}
# Stage 3: Root-cause hypotheses
try:
hypotheses = await asyncio.to_thread(
RootCauseHypothesizer().hypothesize,
classified,
ctx,
query,
llm_url,
llm_model,
llm_api_key,
)
except Exception as exc:
logger.exception("Stage 3 (hypothesizer) failed: %s", exc)
yield {"type": "error", "message": "Pipeline error in stage 3 (hypothesizer)"}
yield {"type": "done"}
return
yield {
"type": "pipeline_stage",
"stage": 3,
"name": "hypotheses",
"message": f"{len(hypotheses)} hypotheses generated",
}
# Stage 4: False-positive suppression
_incidents_db = incidents_db_path or db_path
try:
ranked = await asyncio.to_thread(
FalsePositiveSuppressor().suppress, hypotheses, _incidents_db
)
except Exception as exc:
logger.exception("Stage 4 (suppressor) failed: %s", exc)
yield {"type": "error", "message": "Pipeline error in stage 4 (suppressor)"}
yield {"type": "done"}
return
suppressed = sum(1 for rh in ranked if rh.suppress)
active = len(ranked) - suppressed
yield {
"type": "pipeline_stage",
"stage": 4,
"name": "suppressor",
"message": f"{suppressed} suppressed, {active} active",
}
yield {
"type": "hypotheses",
"data": [dataclasses.asdict(rh) for rh in ranked],
}
# Stage 5: Summary synthesis
yield {"type": "status", "message": "Synthesizing…"}
try:
synthesis_text = await asyncio.to_thread(
SummarySynthesizer().synthesize,
ranked,
timeline,
ctx,
query,
llm_url,
llm_model,
llm_api_key,
tech_level,
)
except Exception as exc:
logger.exception("Stage 5 (synthesizer) failed: %s", exc)
yield {"type": "error", "message": "Pipeline error in stage 5 (synthesizer)"}
yield {"type": "done"}
return
if synthesis_text:
yield {"type": "reasoning", "text": synthesis_text}
yield {"type": "done"}

@ -0,0 +1,275 @@
"""Stage 4: False-Positive Suppressor — embedding cosine similarity.
Compares each hypothesis against a corpus of resolved incidents using
embedding cosine similarity. Hypotheses that closely match a previously
resolved incident are suppressed as likely false positives.
When no embedding model is configured or the service is unavailable, all
hypotheses pass through with novelty_score=1.0 (full novelty assumed).
"""
from __future__ import annotations
import logging
import sqlite3
from pathlib import Path
from typing import Any
from app.services.diagnose.models import Hypothesis, RankedHypothesis
logger = logging.getLogger(__name__)
# Module-level corpus cache: db_path_str -> (corpus_texts, embeddings)
# Invalidated when the corpus text list changes between calls.
_corpus_cache: dict[str, tuple[list[str], Any]] = {}
# ---------------------------------------------------------------------------
# Cosine similarity helpers
# ---------------------------------------------------------------------------
try:
import numpy as np
def _cosine_similarities(
query_emb: list[float], corpus_embs: list[list[float]]
) -> list[float]:
"""Batch cosine similarity of one query embedding against all corpus embeddings."""
q = np.array(query_emb, dtype=np.float32)
c = np.array(corpus_embs, dtype=np.float32)
q_norm = q / (np.linalg.norm(q) + 1e-10)
c_norm = c / (np.linalg.norm(c, axis=1, keepdims=True) + 1e-10)
return list(c_norm @ q_norm)
_HAS_NUMPY = True
except ImportError: # pragma: no cover
import math
_HAS_NUMPY = False
def _dot(a: list[float], b: list[float]) -> float:
return sum(x * y for x, y in zip(a, b))
def _norm(a: list[float]) -> float:
return math.sqrt(sum(x * x for x in a)) + 1e-10
def _cosine(a: list[float], b: list[float]) -> float:
return _dot(a, b) / (_norm(a) * _norm(b))
def _cosine_similarities(
query_emb: list[float], corpus_embs: list[list[float]]
) -> list[float]:
return [_cosine(query_emb, c) for c in corpus_embs]
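Both branches above compute the same quantity, the dot product of two vectors divided by the product of their lengths. A small worked example with the pure-Python form may help; the two-dimensional vectors are made up for illustration and are not real embeddings.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity with the same 1e-10 guard against zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a)) + 1e-10
    norm_b = math.sqrt(sum(x * x for x in b)) + 1e-10
    return dot / (norm_a * norm_b)


query = [1.0, 0.0]
corpus = [
    [1.0, 0.0],  # same direction as the query: similarity close to 1
    [0.0, 1.0],  # orthogonal to the query: similarity 0
    [1.0, 1.0],  # 45 degrees from the query: similarity close to sqrt(0.5)
]
sims = [cosine(query, c) for c in corpus]
```

The epsilon makes the identical-vector case come out marginally below 1.0, which does not matter for a threshold such as 0.85.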
# ---------------------------------------------------------------------------
# DB helpers
# ---------------------------------------------------------------------------
def _fetch_resolved_incidents(incidents_db_path: Path) -> list[str]:
"""Fetch resolved incident texts from the incidents database.
Returns a list of non-empty combined strings for each resolved incident.
Returns an empty list on any error (missing table, connection failure, etc.).
"""
try:
with sqlite3.connect(str(incidents_db_path), timeout=30.0) as conn:
cursor = conn.execute(
"SELECT label, notes FROM incidents WHERE ended_at IS NOT NULL LIMIT 200"
)
rows = cursor.fetchall()
except sqlite3.OperationalError as exc:
logger.warning("Could not query resolved incidents (%s) — treating as empty corpus", exc)
return []
except sqlite3.Error as exc:
# Catches all remaining SQLite-family errors (IntegrityError, DatabaseError, etc.)
logger.warning("Unexpected SQLite error fetching resolved incidents (%s) — treating as empty corpus", exc)
return []
texts: list[str] = []
for label, notes in rows:
label = (label or "").strip()
notes = (notes or "").strip()
combined = f"{label}. {notes}" if label and notes else (label or notes)
if combined:
texts.append(combined)
return texts
# ---------------------------------------------------------------------------
# Public class
# ---------------------------------------------------------------------------
class FalsePositiveSuppressor:
"""Stage 4 of the multi-agent diagnose pipeline.
Uses embedding cosine similarity to detect hypotheses that closely match
previously resolved incidents and suppress them as likely false positives.
When model_id is empty or the embedding service is unavailable, all
hypotheses pass through with novelty_score=1.0 (no suppression).
"""
def __init__(
self,
model_id: str = "",
device: str = "cpu",
similarity_threshold: float = 0.85,
) -> None:
self._model_id = model_id
self._device = device
# _device stored for future use when get_embedder() supports device selection
# Suppress when cosine similarity to a known resolved incident >= threshold.
# A threshold of 0.85 means "suppress if 85%+ similar to something already resolved."
self._similarity_threshold = similarity_threshold
def suppress(
self,
hypotheses: list[Hypothesis],
incidents_db_path: Path,
) -> list[RankedHypothesis]:
"""Rank hypotheses by novelty, suppressing those matching resolved incidents.
Args:
hypotheses: Candidate hypotheses from Stage 3.
incidents_db_path: Path to the dedicated incidents SQLite database.
Returns:
List of RankedHypothesis sorted by (novelty_score * confidence) descending.
Non-suppressed hypotheses appear first in practice.
"""
if not hypotheses:
return []
# No model configured — full passthrough, rank by confidence only.
if not self._model_id:
return self._passthrough(hypotheses)
# Attempt to obtain an embedder; fall back to passthrough on failure.
embedder = self._load_embedder()
if embedder is None:
logger.warning(
"Embedding service unavailable for model %r — skipping suppression",
self._model_id,
)
return self._passthrough(hypotheses)
# Fetch corpus texts from incidents DB; fall back to passthrough if empty.
corpus_texts = _fetch_resolved_incidents(incidents_db_path)
if not corpus_texts:
logger.debug("No resolved incidents found — all hypotheses treated as novel")
return self._passthrough(hypotheses)
# Embed corpus (with caching).
corpus_embeddings = self._get_corpus_embeddings(embedder, corpus_texts, incidents_db_path)
# Score each hypothesis and sort by novelty * confidence descending.
ranked = [
self._score_hypothesis(h, embedder, corpus_embeddings)
for h in hypotheses
]
ranked.sort(key=lambda rh: rh.novelty_score * rh.hypothesis.confidence, reverse=True)
return ranked
# ------------------------------------------------------------------
# Private helpers
# ------------------------------------------------------------------
def _score_hypothesis(
self,
hypothesis: Hypothesis,
embedder: Any,
corpus_embeddings: list[list[float]],
) -> RankedHypothesis:
"""Score a single hypothesis against the resolved incident corpus."""
try:
query_text = f"{hypothesis.title}. {hypothesis.description}"
h_emb = embedder.embed(query_text)
# Convert numpy array to plain Python list for _cosine_similarities
h_emb_list: list[float] = h_emb.tolist() if hasattr(h_emb, "tolist") else list(h_emb)
sims = _cosine_similarities(h_emb_list, corpus_embeddings)
max_sim = float(max(sims)) if sims else 0.0
except Exception as exc:
# Broad catch is intentional: catches unknown embedder runtime errors
# (e.g. CUDA OOM, backend crashes) so one bad hypothesis never halts the pipeline.
logger.warning("Embedding failed for hypothesis %r: %s — treating as novel", hypothesis.title, exc)
return RankedHypothesis(
hypothesis=hypothesis,
novelty_score=1.0,
similarity_to_known=0.0,
suppress=False,
suppression_reason=None,
)
novelty_score = 1.0 - max_sim
suppress = bool(max_sim >= self._similarity_threshold)
suppression_reason = (
f"Similar to resolved incident (similarity {max_sim:.2f})"
if suppress
else None
)
return RankedHypothesis(
hypothesis=hypothesis,
novelty_score=novelty_score,
similarity_to_known=max_sim,
suppress=suppress,
suppression_reason=suppression_reason,
)
def _load_embedder(self) -> Any | None:
"""Load the embedding service. Returns None if unavailable."""
try:
from app.services.embeddings import get_embedder
return get_embedder()
except Exception as exc:
# Broad catch is intentional: get_embedder() may raise on import or
# backend init failures from any number of third-party libraries.
logger.warning("Failed to import/initialise embedding service: %s", exc)
return None
def _get_corpus_embeddings(
self,
embedder: Any,
corpus_texts: list[str],
incidents_db_path: Path,
) -> list[list[float]]:
"""Return cached corpus embeddings, re-embedding if the corpus has changed."""
cache_key = str(incidents_db_path)
cached = _corpus_cache.get(cache_key)
if cached is not None:
cached_texts, cached_embeddings = cached
if cached_texts == corpus_texts:
return cached_embeddings
logger.debug("Embedding corpus of %d resolved incidents", len(corpus_texts))
try:
raw_embeddings = embedder.embed_batch(corpus_texts)
# Normalise each embedding to a plain Python list for portability
corpus_embeddings: list[list[float]] = [
e.tolist() if hasattr(e, "tolist") else list(e)
for e in raw_embeddings
]
except Exception as exc:
# Broad catch is intentional: embed_batch() may raise from any backend
# (network timeout, CUDA error, etc.) — treat as empty corpus so the
# pipeline can continue without suppression.
logger.warning("Corpus embedding failed: %s — treating as empty corpus", exc)
return []
_corpus_cache[cache_key] = (corpus_texts, corpus_embeddings)
return corpus_embeddings
def _passthrough(self, hypotheses: list[Hypothesis]) -> list[RankedHypothesis]:
"""Return all hypotheses as non-suppressed, ranked by confidence descending."""
ranked = [
RankedHypothesis(
hypothesis=h,
novelty_score=1.0,
similarity_to_known=0.0,
suppress=False,
suppression_reason=None,
)
for h in hypotheses
]
ranked.sort(key=lambda rh: rh.hypothesis.confidence, reverse=True)
return ranked
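To make the ranking rule concrete: each hypothesis is scored as novelty times confidence, where novelty is one minus the highest similarity to a resolved incident, and it is flagged for suppression when that similarity reaches the threshold. The sketch below uses a flattened, hypothetical `Scored` record and invented numbers rather than the real `RankedHypothesis`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scored:
    """Hypothetical, flattened analogue of RankedHypothesis for illustration."""

    title: str
    confidence: float
    max_similarity: float

    @property
    def novelty(self) -> float:
        return 1.0 - self.max_similarity


def rank(items: list[Scored], threshold: float = 0.85) -> list[tuple[str, bool]]:
    """Sort by novelty * confidence (descending); flag items at or above the threshold."""
    ordered = sorted(items, key=lambda s: s.novelty * s.confidence, reverse=True)
    return [(s.title, s.max_similarity >= threshold) for s in ordered]


items = [
    Scored("known disk alert", confidence=0.9, max_similarity=0.92),  # score 0.072
    Scored("new OOM kill", confidence=0.6, max_similarity=0.10),      # score 0.54
    Scored("dns flap", confidence=0.8, max_similarity=0.50),          # score 0.40
]
ranking = rank(items)
```

A suppressed item has low novelty by construction, so it tends to sink to the bottom of the list even when its confidence is high, as "known disk alert" does here.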

@ -0,0 +1,203 @@
"""Stage 5: Summary Synthesizer — deterministic narrative from ranked hypotheses.
Streaming upgrade (async SSE chunks) is tracked as a follow-up enhancement.
This implementation is synchronous to match the rest of the pipeline.
"""
from __future__ import annotations
import logging
from app.context.retriever import RetrievedContext
from app.services.diagnose._llm_client import call_llm
from app.services.diagnose.models import RankedHypothesis, TimelineResult
logger = logging.getLogger(__name__)
_SYSTEM_PROMPTS: dict[str, str] = {
"sysadmin": (
"You are a Linux sysadmin diagnosing a system incident. "
"Write a concise, actionable incident diagnosis.\n\n"
"Format your response exactly as:\n"
"1. VERDICT: [CRITICAL|ERROR|WARN|INFO] — <what happened> (<X>% confidence)\n"
"2. TIMELINE: <what the logs show in sequence, 2-3 sentences>\n"
"3. ROOT CAUSES:\n"
" - <hypothesis 1 title> (<confidence>%)\n"
" - <hypothesis 2 title> (<confidence>%)\n"
"4. RECOMMENDED ACTIONS:\n"
" - <action based on hypotheses>\n"
"5. INVESTIGATE FURTHER: <open questions, if any>"
),
"homelab": (
"You are explaining a system incident to a home lab enthusiast — someone "
"comfortable with Linux basics but not necessarily familiar with every daemon "
"or kernel subsystem. Be clear about what each service does; spell out "
"abbreviations; explain why each action helps.\n\n"
"Format your response exactly as:\n"
"1. VERDICT: [CRITICAL|ERROR|WARN|INFO] — <what happened in plain terms> (<X>% confidence)\n"
"2. TIMELINE: <what happened in sequence, 2-3 sentences; explain what each service is>\n"
"3. ROOT CAUSES:\n"
" - <hypothesis title — one sentence explaining what it means> (<confidence>%)\n"
"4. RECOMMENDED ACTIONS:\n"
" - <command or step — explain what it does and why>\n"
"5. INVESTIGATE FURTHER: <open questions in plain language>"
),
"executive": (
"You are summarizing a technical system incident for a non-technical stakeholder. "
"Focus on what broke, what the business impact was, and what the technical team is doing about it. "
"Use plain English. Do not use daemon names, kernel terms, log syntax, or technical jargon.\n\n"
"Format your response exactly as:\n"
"1. WHAT HAPPENED: <1-2 sentences describing the problem in plain English>\n"
"2. IMPACT: <which services or users were affected, and how>\n"
"3. CONFIDENCE: <High / Medium / Low — how certain we are of the diagnosis>\n"
"4. ACTION NEEDED: <what the IT team is doing or should do, in plain terms>"
),
}
def _build_hypothesis_block(ranked: list[RankedHypothesis]) -> str:
"""Build the hypothesis block for the prompt (non-suppressed only, top 3)."""
active = [rh for rh in ranked if not rh.suppress][:3]
if not active:
return "(none)"
lines: list[str] = []
for rh in active:
h = rh.hypothesis
conf_pct = int(h.confidence * 100)
novelty = f"{rh.novelty_score:.2f}"
desc = f"\n {h.description}" if h.description else ""
lines.append(
f"- [{h.severity}, {conf_pct}% conf, novelty {novelty}] {h.title}{desc}"
)
return "\n".join(lines)
def _build_timeline_block(timeline: TimelineResult) -> str:
"""Build a sequenced cluster block so the synthesizer can narrate what happened.
Mirrors the format used by the hypothesizer, but adds gap information so the
LLM can reason about silence windows between bursts.
"""
if not timeline.clusters:
return "(no clusters)"
lines: list[str] = []
for i, c in enumerate(timeline.clusters):
ts = c.start_iso or "unknown"
sources = ", ".join(list(c.source_ids)[:3])
tags = ", ".join(list(c.pattern_tags)[:4])
burst_label = " [BURST]" if c.burst else ""
gap_label = (
f" (+{int(c.gap_before_seconds)}s silence)"
if c.gap_before_seconds > 30
else ""
)
text_preview = c.representative_text[:200]
line = (
f"Cluster {i + 1}{burst_label}{gap_label} @ {ts} [{c.severity}] "
f"({sources}) — {text_preview}"
)
if tags:
line += f" [patterns: {tags}]"
lines.append(line)
return "\n".join(lines)
def _build_context_block(ctx: RetrievedContext) -> str:
"""Build the runbook context block for the prompt."""
parts: list[str] = []
for chunk in ctx.chunks[:5]:
filename = chunk.get("filename", "unknown")
text = chunk.get("text", "")[:300]
parts.append(f"[{filename}] {text}")
return "\n".join(parts) if parts else "(none)"
def _deterministic_fallback(
ranked: list[RankedHypothesis],
timeline: TimelineResult,
) -> str:
"""Build a deterministic fallback text when no LLM is available."""
active = [rh for rh in ranked if not rh.suppress][:3]
if active:
top = active[0]
verdict_severity = top.hypothesis.severity
verdict_title = top.hypothesis.title
verdict_conf = int(top.hypothesis.confidence * 100)
elif ranked:
top = ranked[0]
verdict_severity = top.hypothesis.severity
verdict_title = top.hypothesis.title
verdict_conf = int(top.hypothesis.confidence * 100)
else:
verdict_severity = "UNKNOWN"
verdict_title = "No hypotheses generated"
verdict_conf = 0
root_causes = ", ".join(
rh.hypothesis.title for rh in (active or ranked[:3])
) or "None"
return (
f"VERDICT: {verdict_severity} - {verdict_title} ({verdict_conf}% confidence)\n"
f"TIMELINE: {timeline.total_entries} entries across {len(timeline.clusters)} clusters.\n"
f"ROOT CAUSES: {root_causes}"
)
class SummarySynthesizer:
"""Stage 5 of the multi-agent diagnose pipeline.
Synthesizes a human-readable incident narrative from ranked hypotheses,
the reconstructed timeline, and RAG context. When no LLM is configured,
returns a deterministic fallback built from the hypothesis data.
"""
def synthesize(
self,
ranked: list[RankedHypothesis],
timeline: TimelineResult,
ctx: RetrievedContext,
query: str,
llm_url: str | None = None,
llm_model: str | None = None,
llm_api_key: str | None = None,
tech_level: str = "sysadmin",
) -> str:
"""Return synthesis text (single string, synchronous).
Falls back to a deterministic narrative when no LLM URL or model is
provided, or when the LLM call fails.
"""
fallback = _deterministic_fallback(ranked, timeline)
if not llm_url or not llm_model:
return fallback
system_prompt = _SYSTEM_PROMPTS.get(tech_level, _SYSTEM_PROMPTS["sysadmin"])
hypothesis_block = _build_hypothesis_block(ranked)
timeline_block = _build_timeline_block(timeline)
context_block = _build_context_block(ctx)
dominant = ", ".join(timeline.dominant_sources[:5]) or "none"
user_message = (
f"Query: {query}\n\n"
f"Timeline ({len(timeline.clusters)} clusters, "
f"{timeline.burst_count} bursts, "
f"{timeline.gap_count} silence gaps; "
f"primary sources: {dominant}):\n"
f"{timeline_block}\n\n"
f"Root-cause hypotheses:\n{hypothesis_block}\n\n"
f"Context from runbooks:\n{context_block}"
)
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_message},
]
result = call_llm(
llm_url=llm_url,
llm_model=llm_model,
llm_api_key=llm_api_key,
messages=messages,
)
return result if result else fallback

@ -0,0 +1,272 @@
"""Stage 1: Timeline Reconstructor — pure Python, no ML."""
from __future__ import annotations
import hashlib
import logging
from collections import defaultdict
from datetime import datetime, timezone
from app.services.diagnose.models import EventCluster, TimelineResult
from app.services.search import SearchResult
logger = logging.getLogger(__name__)
_SEVERITY_ORDER: dict[str | None, int] = {
"CRITICAL": 5,
"ERROR": 4,
"WARN": 3,
"WARNING": 3,
"INFO": 2,
"DEBUG": 1,
None: 0,
}
def _parse_iso(s: str) -> datetime | None:
    """Parse ISO 8601 string to UTC-aware datetime. Returns None on parse failure."""
    try:
        dt = datetime.fromisoformat(s)
    except ValueError:
        logger.warning("Unparseable timestamp in log entry, treating as None: %r", s)
        return None
    if dt.tzinfo is None:
        logger.debug("Naive timestamp treated as UTC: %s", s)
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)
def _sort_key(e: SearchResult) -> tuple[int, str]:
"""Sort key: timestamped entries first (ascending), then None-timestamp entries."""
if e.timestamp_iso is None:
return (1, "")
return (0, e.timestamp_iso)
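One caveat: `_sort_key` compares raw ISO strings, which orders entries correctly only when every timestamp uses the same offset and format. The sketch below sorts on the parsed UTC instant instead. It is an illustrative alternative, not the module's behaviour, and `utc_sort_key` is a hypothetical name.

```python
from __future__ import annotations

from datetime import datetime, timezone


def utc_sort_key(ts: str | None) -> tuple[int, datetime]:
    """Timestamped values first (by UTC instant), then None or unparseable ones."""
    floor = datetime.min.replace(tzinfo=timezone.utc)
    if ts is None:
        return (1, floor)
    try:
        dt = datetime.fromisoformat(ts)
    except ValueError:
        return (1, floor)
    if dt.tzinfo is None:
        # Same convention as _parse_iso: naive timestamps are treated as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return (0, dt.astimezone(timezone.utc))


stamps = ["2026-06-17T12:00:00+02:00", None, "2026-06-17T11:00:00+00:00"]
# String order puts 11:00+00:00 first, although 12:00+02:00 is 10:00 UTC.
by_string = sorted(s for s in stamps if s is not None)
by_instant = sorted(stamps, key=utc_sort_key)
```

If every source already emits UTC timestamps in one format, the string comparison in the original is sufficient and cheaper.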
def _highest_severity(entries: list[SearchResult]) -> str:
"""Return the highest severity label across all entries."""
best: str | None = None
best_rank = -1
for entry in entries:
sev = entry.severity
rank = _SEVERITY_ORDER.get(sev, 0)
if rank > best_rank:
best_rank = rank
best = sev
# SeverityLabel requires a valid literal; fall back to "UNKNOWN" if None
if best is None:
return "UNKNOWN"
# Normalise WARNING -> WARN for the output type
if best == "WARNING":
return "WARN"
return best
def _representative_text(entries: list[SearchResult]) -> str:
"""Return text of the entry with highest rank; tie-break on longest text."""
if not entries:
return ""
best = max(entries, key=lambda e: (e.rank, len(e.text)))
return best.text
def _cluster_id(entry_ids: list[str]) -> str:
"""Compute a 12-char hex cluster ID from a sorted list of entry IDs."""
payload = ",".join(sorted(entry_ids)).encode()
return hashlib.sha1(payload).hexdigest()[:12] # noqa: S324 — not used for security
def _make_event_cluster(
cluster_entries: list[SearchResult],
gap_before_seconds: float,
burst_threshold: int,
burst_window_seconds: int,
) -> EventCluster:
"""Construct an EventCluster from a list of SearchResult entries."""
timestamps = [
ts
for e in cluster_entries
if e.timestamp_iso is not None
for ts in (_parse_iso(e.timestamp_iso),)
if ts is not None
]
start_iso: str | None = None
end_iso: str | None = None
duration_seconds = 0.0
if timestamps:
ts_min = min(timestamps)
ts_max = max(timestamps)
start_iso = ts_min.isoformat()
end_iso = ts_max.isoformat()
duration_seconds = (ts_max - ts_min).total_seconds()
entry_ids = [e.entry_id for e in cluster_entries]
burst = (
len(cluster_entries) >= burst_threshold
and duration_seconds <= burst_window_seconds
)
return EventCluster(
cluster_id=_cluster_id(entry_ids),
entries=tuple(entry_ids),
start_iso=start_iso,
end_iso=end_iso,
duration_seconds=duration_seconds,
source_ids=tuple(sorted(set(e.source_id for e in cluster_entries))),
pattern_tags=tuple(
sorted(set(tag for e in cluster_entries for tag in e.matched_patterns))
),
severity=_highest_severity(cluster_entries), # type: ignore[arg-type] # SeverityLabel is a Literal; _highest_severity returns a compatible str
burst=burst,
gap_before_seconds=gap_before_seconds,
representative_text=_representative_text(cluster_entries),
)
class TimelineReconstructor:
"""Reconstruct a structured timeline of event clusters from log entries.
Pure Python, with no ML or LLM calls. Designed as Stage 1 of the multi-agent
diagnose pipeline.
"""
def __init__(
self,
cluster_window_seconds: int = 30,
burst_threshold: int = 10,
burst_window_seconds: int = 5,
gap_significance_seconds: int = 30,
) -> None:
self._cluster_window = cluster_window_seconds
self._burst_threshold = burst_threshold
self._burst_window = burst_window_seconds
self._gap_significance_seconds: int = gap_significance_seconds
def _sort_entries(self, entries: list[SearchResult]) -> list[SearchResult]:
"""Sort entries: timestamped first (ascending), then None-timestamp entries."""
return sorted(entries, key=_sort_key)
def _group_into_raw_clusters(
self, sorted_entries: list[SearchResult]
) -> list[list[SearchResult]]:
"""Group sorted entries into time-window clusters."""
raw_clusters: list[list[SearchResult]] = []
current: list[SearchResult] = []
cluster_anchor: datetime | None = None
for entry in sorted_entries:
if not current:
current.append(entry)
if entry.timestamp_iso is not None:
cluster_anchor = _parse_iso(entry.timestamp_iso)
continue
if entry.timestamp_iso is None:
# No timestamp — always joins the current cluster
current.append(entry)
continue
entry_dt = _parse_iso(entry.timestamp_iso)
if entry_dt is None:
# Malformed timestamp — treat same as None: join current cluster
current.append(entry)
continue
if cluster_anchor is None:
# Current cluster has no anchor yet — set it, stay in cluster
cluster_anchor = entry_dt
current.append(entry)
continue
delta = (entry_dt - cluster_anchor).total_seconds()
if delta > self._cluster_window:
raw_clusters.append(current)
current = [entry]
cluster_anchor = entry_dt
else:
current.append(entry)
if current:
raw_clusters.append(current)
return raw_clusters
def _build_cluster(
self,
cluster_entries: list[SearchResult],
prev_end_iso: str | None,
) -> EventCluster:
"""Build an EventCluster from a list of SearchResult entries."""
gap_before = 0.0
if prev_end_iso is not None:
ts_list = [
ts
for e in cluster_entries
if e.timestamp_iso is not None
for ts in (_parse_iso(e.timestamp_iso),)
if ts is not None
]
if ts_list:
this_start = min(ts_list)
prev_end = _parse_iso(prev_end_iso)
if prev_end is not None:
gap_before = (this_start - prev_end).total_seconds()
return _make_event_cluster(
cluster_entries,
gap_before_seconds=gap_before,
burst_threshold=self._burst_threshold,
burst_window_seconds=self._burst_window,
)
def _dominant_sources_tuple(self, entries: list[SearchResult]) -> tuple[str, ...]:
"""Return source_ids sorted by total entry count descending."""
source_counts: dict[str, int] = defaultdict(int)
for entry in entries:
source_counts[entry.source_id] += 1
return tuple(
src for src, _ in sorted(source_counts.items(), key=lambda kv: -kv[1])
)
def reconstruct(self, entries: list[SearchResult]) -> TimelineResult:
"""Build a structured timeline from a flat list of log entries."""
if not entries:
return TimelineResult(
clusters=(),
total_entries=0,
window_start=None,
window_end=None,
gap_count=0,
burst_count=0,
dominant_sources=(),
)
sorted_entries = self._sort_entries(entries)
raw_clusters = self._group_into_raw_clusters(sorted_entries)
clusters: list[EventCluster] = []
prev_end: str | None = None
for raw in raw_clusters:
c = self._build_cluster(raw, prev_end)
clusters.append(c)
prev_end = c.end_iso
clusters_tuple = tuple(clusters)
gap_count = sum(
1
for c in clusters_tuple
if c.gap_before_seconds > self._gap_significance_seconds
)
return TimelineResult(
clusters=clusters_tuple,
total_entries=len(entries),
window_start=clusters_tuple[0].start_iso if clusters_tuple else None,
window_end=clusters_tuple[-1].end_iso if clusters_tuple else None,
gap_count=gap_count,
burst_count=sum(1 for c in clusters_tuple if c.burst),
dominant_sources=self._dominant_sources_tuple(entries),
)
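The grouping rule in `_group_into_raw_clusters` is anchored on the first entry of each cluster, not on the previous entry. A minimal sketch on plain second offsets makes the consequence visible. `group_by_anchor` and `gaps_before` are hypothetical helpers, and the real code works on `SearchResult` objects with optional timestamps.

```python
from __future__ import annotations


def group_by_anchor(times: list[float], window: float) -> list[list[float]]:
    """Group ascending timestamps; a new cluster starts once an entry is more
    than `window` seconds after the FIRST entry of the current cluster."""
    clusters: list[list[float]] = []
    anchor: float | None = None
    for t in times:
        if anchor is None or t - anchor > window:
            clusters.append([t])
            anchor = t
        else:
            clusters[-1].append(t)
    return clusters


def gaps_before(clusters: list[list[float]]) -> list[float]:
    """Seconds between each cluster's start and the previous cluster's end."""
    gaps = [0.0] if clusters else []
    for prev, cur in zip(clusters, clusters[1:]):
        gaps.append(cur[0] - prev[-1])
    return gaps


clusters = group_by_anchor([0, 10, 29, 31, 40, 100], window=30)
gaps = gaps_before(clusters)
# With a 30-second significance threshold, only the last gap counts.
significant = sum(1 for g in gaps if g > 30)
```

Entries at 29 s and 31 s land in different clusters even though they are two seconds apart, because 31 s is more than 30 s past the anchor at 0 s. The resulting gap of 2 s is below the significance threshold, so it does not inflate `gap_count`.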

285
app/services/discover.py Normal file

@@ -0,0 +1,285 @@
"""Environment auto-discovery for the onboarding wizard.
All checks are best-effort: every function returns an empty list on failure
so the wizard degrades gracefully in containers, VMs, and minimal environments.
"""
from __future__ import annotations
import json
import logging
import os
import re
import shutil
import subprocess
import time
from pathlib import Path
from typing import Any
logger = logging.getLogger(__name__)
# Common log file candidates: (id, path, description)
_KNOWN_PATHS: list[tuple[str, str, str]] = [
("syslog", "/var/log/syslog", "System syslog (Debian/Ubuntu)"),
("syslog", "/var/log/messages", "System messages (RHEL/Rocky)"),
("auth", "/var/log/auth.log", "Auth log"),
("kern", "/var/log/kern.log", "Kernel log"),
("nginx-access", "/var/log/nginx/access.log", "Nginx access log"),
("nginx-error", "/var/log/nginx/error.log", "Nginx error log"),
("apache", "/var/log/apache2/access.log", "Apache access log"),
("apache-error", "/var/log/apache2/error.log", "Apache error log"),
("caddy", "/var/log/caddy/access.log", "Caddy access log"),
("docker-daemon","/var/log/docker.log", "Docker daemon log"),
("fail2ban", "/var/log/fail2ban.log", "Fail2ban log"),
("ufw", "/var/log/ufw.log", "UFW firewall log"),
]
def _run(cmd: list[str], timeout: float = 5.0) -> str | None:
"""Run a command and return stdout, or None on any error."""
try:
result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
return result.stdout if result.returncode == 0 else None
except Exception:
return None
def discover_journald() -> list[dict[str, Any]]:
"""Return a journald source candidate if journalctl is available."""
if not shutil.which("journalctl"):
return []
hostname = _run(["hostname"]) or "localhost"
hostname = hostname.strip()
return [{
"type": "journald",
"id": f"journal:{hostname}",
"label": f"System journal ({hostname})",
"description": "All systemd journal output from this host",
"available": True,
}]
def discover_docker() -> list[dict[str, Any]]:
"""Return container candidates from the first runtime (docker, then podman) that lists running containers."""
for runtime in ("docker", "podman"):
if not shutil.which(runtime):
continue
out = _run([runtime, "ps", "--format", "{{json .}}"])
if out is None:
continue
containers = []
for line in out.splitlines():
line = line.strip()
if not line:
continue
try:
obj = json.loads(line)
name = obj.get("Names") or obj.get("Name") or obj.get("ID", "unknown")
# podman returns a list for Names
if isinstance(name, list):
name = name[0] if name else "unknown"
name = name.lstrip("/")
containers.append({
"type": "docker",
"id": f"{runtime}:{name}",
"label": f"{runtime.capitalize()} - {name}",
"description": f"Container log stream for {name}",
"container": name,
"runtime": runtime,
"available": True,
})
except (json.JSONDecodeError, KeyError):
continue
if containers:
return containers
return []
def discover_files() -> list[dict[str, Any]]:
"""Return file-based source candidates for well-known log paths."""
found = []
seen_ids: set[str] = set()
for source_id, path, description in _KNOWN_PATHS:
if not os.path.exists(path):
continue
# deduplicate when both syslog and messages exist — take first match
if source_id in seen_ids:
continue
seen_ids.add(source_id)
found.append({
"type": "file",
"id": source_id,
"label": description,
"path": path,
"description": f"Read from {path}",
"available": True,
})
return found
def discover_all() -> dict[str, Any]:
"""Run all discovery checks and return a structured candidate list."""
candidates: list[dict[str, Any]] = []
candidates.extend(discover_journald())
candidates.extend(discover_docker())
candidates.extend(discover_files())
return {
"candidates": candidates,
"has_journald": any(c["type"] == "journald" for c in candidates),
"has_docker": any(c["type"] == "docker" for c in candidates),
"has_files": any(c["type"] == "file" for c in candidates),
}
def build_sources_yaml(selected: list[dict[str, Any]]) -> str:
    """Generate sources.yaml content from a list of selected candidates.
    Each item must have: type, id, and type-specific fields (path, container, etc.).
    """
    lines = [
        "# Turnstone log sources — generated by the setup wizard.",
        "# Edit this file to add, remove, or modify sources.",
        "sources:",
    ]
    for src in selected:
        src_type = src.get("type", "file")
        src_id = src.get("id", "unknown")
        if src_type == "journald":
            unit = src.get("unit")
            lines.append(f"  - id: {src_id}")
            lines.append("    type: journald")
            if unit:
                lines.append(f"    unit: {unit}")
        elif src_type == "docker":
            runtime = src.get("runtime", "docker")
            container = src.get("container", src_id.split(":")[-1])
            lines.append(f"  - id: {src_id}")
            lines.append("    type: docker")
            lines.append(f"    runtime: {runtime}")
            lines.append(f"    container: {container}")
        else:
            path = src.get("path", "")
            lines.append(f"  - id: {src_id}")
            lines.append(f"    path: {path}")
    return "\n".join(lines) + "\n"
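To make the output shape of `build_sources_yaml` concrete, here is a condensed sketch of the same emitter with a sample input. `render_sources` is a hypothetical name, the header comments are omitted, and the two-space list indent with four-space keys is an assumption about the intended YAML layout.

```python
from __future__ import annotations

from typing import Any


def render_sources(selected: list[dict[str, Any]]) -> str:
    """Emit one YAML list item per source; file sources carry only id and path."""
    lines = ["sources:"]
    for src in selected:
        lines.append(f"  - id: {src['id']}")
        kind = src.get("type", "file")
        if kind == "journald":
            lines.append("    type: journald")
        elif kind == "docker":
            lines.append("    type: docker")
            lines.append(f"    runtime: {src.get('runtime', 'docker')}")
            lines.append(f"    container: {src['container']}")
        else:
            lines.append(f"    path: {src['path']}")
    return "\n".join(lines) + "\n"


rendered = render_sources([
    {"type": "journald", "id": "journal:web01"},
    {"type": "docker", "id": "docker:nginx", "container": "nginx"},
    {"type": "file", "id": "syslog", "path": "/var/log/syslog"},
])
```

Values are written unquoted, so an id or path containing `: ` or a leading special character would need quoting before the result is valid YAML.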
def validate_source(src: dict[str, Any]) -> str | None:
"""Return an error string if the source definition is invalid, else None."""
if not src.get("id"):
return "Source is missing 'id'"
src_type = src.get("type", "file")
if src_type == "file" and not src.get("path"):
return f"File source '{src['id']}' is missing 'path'"
if src_type == "docker" and not src.get("container"):
return f"Docker source '{src['id']}' is missing 'container'"
return None
# Extensions considered as log files in the filesystem scanner.
_LOG_EXTENSIONS = {"", ".log", ".txt", ".out", ".err"}
# Max file size to consider (500 MB).
_MAX_SIZE = 500 * 1024 * 1024
# Recency window in days: the recency score falls linearly to 0 at this age.
_RECENCY_HALFLIFE_DAYS = 30
def _path_to_source_id(path: Path) -> str:
"""Convert an absolute path to a kebab-case source ID."""
raw = re.sub(r"[^a-zA-Z0-9]+", "-", str(path)).strip("-").lower()
return raw[:64]
def scan_log_directories(
query: str | None = None,
dirs: list[str] | None = None,
max_depth: int = 4,
max_results: int = 25,
) -> list[dict[str, Any]]:
"""Scan filesystem directories for log files ranked by recency and keyword match.
Scoring components:
- Recency (0-1): 1.0 for a file modified just now, falling linearly to 0 at 30 days
- Size (0-1): prefer 1 KB to 50 MB; files under 1 KB or above 50 MB score lower
- Keyword (0-1): fraction of query words (3+ characters) found in the file path
Weights are 0.4 recency / 0.2 size / 0.4 keyword with a query, or 0.7 recency / 0.3 size without one.
Empty files and files over 500 MB are skipped.
Returns up to *max_results* candidates sorted by descending score.
"""
if dirs is None:
dirs = ["/var/log", "/opt"]
now = time.time()
query_stems: list[str] = []
if query:
query_stems = [w.lower() for w in re.split(r"\W+", query) if len(w) >= 3]
candidates: list[dict[str, Any]] = []
def _walk(root: Path, depth: int) -> None:
if depth > max_depth:
return
try:
entries = list(root.iterdir())
except OSError:
return
for entry in entries:
if entry.name.startswith("."):
continue
if entry.is_symlink():
continue
if entry.is_dir():
_walk(entry, depth + 1)
continue
if not entry.is_file():
continue
if entry.suffix.lower() not in _LOG_EXTENSIONS:
continue
# Skip compressed archives
if entry.name.endswith((".gz", ".bz2", ".xz", ".zst")):
continue
try:
stat = entry.stat()
except OSError:
continue
if stat.st_size == 0 or stat.st_size > _MAX_SIZE:
continue
if not os.access(entry, os.R_OK):
continue
age_days = (now - stat.st_mtime) / 86400
recency = max(0.0, 1.0 - age_days / _RECENCY_HALFLIFE_DAYS)
if stat.st_size < 1024:
size_score = 0.3
elif stat.st_size <= 50 * 1024 * 1024:
size_score = 1.0
else:
# Large files: linear decay from 50 MB to 500 MB
size_score = max(0.1, 1.0 - (stat.st_size - 50 * 1024 * 1024) / _MAX_SIZE)
keyword_score = 0.0
if query_stems:
path_lower = str(entry).lower()
matches = sum(1 for stem in query_stems if stem in path_lower)
keyword_score = min(1.0, matches / max(len(query_stems), 1))
if query_stems:
total = recency * 0.4 + size_score * 0.2 + keyword_score * 0.4
else:
total = recency * 0.7 + size_score * 0.3
candidates.append({
"type": "file",
"id": _path_to_source_id(entry),
"path": str(entry),
"label": entry.name,
"size_bytes": stat.st_size,
"mtime": stat.st_mtime,
"score": round(total, 3),
"available": True,
})
for d in dirs:
_walk(Path(d), depth=0)
candidates.sort(key=lambda c: c["score"], reverse=True)
return candidates[:max_results]
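The scoring arithmetic inside `_walk` can be checked by hand with a standalone sketch. `score_candidate` is a hypothetical helper that mirrors the formulas above with the constants written inline.

```python
from __future__ import annotations

MB = 1024 * 1024


def score_candidate(
    age_days: float, size_bytes: int, path: str, query_stems: list[str]
) -> float:
    """Combine recency, size, and keyword scores with the weights used in _walk."""
    recency = max(0.0, 1.0 - age_days / 30)  # linear, reaches 0 at 30 days
    if size_bytes < 1024:
        size_score = 0.3
    elif size_bytes <= 50 * MB:
        size_score = 1.0
    else:
        size_score = max(0.1, 1.0 - (size_bytes - 50 * MB) / (500 * MB))
    if not query_stems:
        return round(recency * 0.7 + size_score * 0.3, 3)
    hits = sum(1 for stem in query_stems if stem in path.lower())
    keyword = min(1.0, hits / len(query_stems))
    return round(recency * 0.4 + size_score * 0.2 + keyword * 0.4, 3)


# 3-day-old, 2 MB nginx error log; two of three query words match the path:
# 0.9 * 0.4 + 1.0 * 0.2 + (2 / 3) * 0.4 = 0.827 (rounded)
with_query = score_candidate(3, 2 * MB, "/var/log/nginx/error.log",
                             ["nginx", "error", "timeout"])
# Same file with no query: 0.9 * 0.7 + 1.0 * 0.3 = 0.93
without_query = score_candidate(3, 2 * MB, "/var/log/nginx/error.log", [])
```

Because recency is clamped at zero, any file older than 30 days can only score through size and keyword matches.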

229
app/services/embeddings.py Normal file

@@ -0,0 +1,229 @@
"""Configurable embedding service — BSL licensed.
Backends:
- sentence_transformers: local in-process inference (default, no server needed)
- ollama: HTTP to a running Ollama instance
Configuration (env vars):
- TURNSTONE_EMBED_BACKEND: sentence_transformers | ollama (default: sentence_transformers)
- TURNSTONE_EMBED_MODEL: model name/path (backend-specific default)
- TURNSTONE_EMBED_DEVICE: cpu | cuda (default: cpu; ST backend only)
- TURNSTONE_LLM_URL: Ollama base URL (default: http://localhost:11434)
When no backend is importable/reachable, EMBEDDING_AVAILABLE stays False and
get_embedder() returns None; callers must handle this gracefully.
"""
from __future__ import annotations
import logging
import os
import struct
from typing import Protocol, runtime_checkable
import numpy as np
logger = logging.getLogger(__name__)
# ── Public availability flag ──────────────────────────────────────────────────
EMBEDDING_AVAILABLE: bool = False
# ── Config ────────────────────────────────────────────────────────────────────
_BACKEND = os.environ.get("TURNSTONE_EMBED_BACKEND", "sentence_transformers").lower()
_DEVICE = os.environ.get("TURNSTONE_EMBED_DEVICE", "cpu").lower()
_LLM_URL = os.environ.get("TURNSTONE_LLM_URL", "http://localhost:11434")
# BAAI/bge-small-en-v1.5: 33M parameters, MIT, 49M downloads/month, 384-dim, 512-token max.
# A strong quality-to-size ratio among small English embedding models (MTEB 62.17).
# all-MiniLM-L6-v2 is a viable lighter alternative (23M parameters, 256-token max) if
# inference speed is the primary constraint.
_DEFAULT_MODEL: dict[str, str] = {
"sentence_transformers": "BAAI/bge-small-en-v1.5",
"ollama": "nomic-embed-text",
}
_MODEL = os.environ.get(
"TURNSTONE_EMBED_MODEL",
_DEFAULT_MODEL.get(_BACKEND, "sentence-transformers/all-MiniLM-L6-v2"),
)
# ── Protocol ──────────────────────────────────────────────────────────────────
@runtime_checkable
class Embedder(Protocol):
"""Minimal interface all embedding backends must satisfy."""
@property
def dim(self) -> int:
"""Embedding dimension produced by this model."""
...
@property
def model_name(self) -> str:
"""Human-readable model identifier."""
...
def embed(self, text: str) -> np.ndarray:
"""Embed a single string. Returns 1-D float32 array of length dim."""
...
def embed_batch(self, texts: list[str]) -> list[np.ndarray]:
"""Embed a list of strings. Returns list of 1-D float32 arrays."""
...
# ── sentence-transformers backend ─────────────────────────────────────────────
class SentenceTransformerEmbedder:
"""Local in-process embedding via the sentence-transformers library.
The model is downloaded from HuggingFace on first instantiation and cached
at ~/.cache/huggingface/. Subsequent starts use the local cache.
"""
def __init__(self, model_name: str = _MODEL, device: str = _DEVICE) -> None:
from sentence_transformers import SentenceTransformer # type: ignore[import]
logger.info("Loading embedding model %r on device %r ...", model_name, device)
self._model = SentenceTransformer(model_name, device=device)
self._model_name = model_name
# Infer dimension from a test embed rather than hard-coding
self._dim: int = int(self._model.encode("test").shape[0])
logger.info("Embedding model ready — dim=%d", self._dim)
@property
def dim(self) -> int:
return self._dim
@property
def model_name(self) -> str:
return self._model_name
def embed(self, text: str) -> np.ndarray:
vec = self._model.encode(text, convert_to_numpy=True, normalize_embeddings=True)
return vec.astype(np.float32)
def embed_batch(self, texts: list[str]) -> list[np.ndarray]:
if not texts:
return []
vecs = self._model.encode(
texts, convert_to_numpy=True, normalize_embeddings=True, batch_size=32
)
return [v.astype(np.float32) for v in vecs]
# ── Ollama backend ────────────────────────────────────────────────────────────
class OllamaEmbedder:
"""HTTP embedding via a running Ollama instance."""
def __init__(
self,
model_name: str = _MODEL,
llm_url: str = _LLM_URL,
timeout: float = 30.0,
) -> None:
import httpx # already a project dependency
self._model_name = model_name
self._url = f"{llm_url.rstrip('/')}/api/embeddings"
self._timeout = timeout
self._client = httpx.Client(timeout=timeout)
# Probe dimension with a test call
self._dim = self._probe_dim()
def _probe_dim(self) -> int:
try:
vec = self._raw_embed("probe")
return len(vec)
except Exception as exc:
logger.warning("Ollama dim probe failed (%s) — defaulting to 768", exc)
return 768
def _raw_embed(self, text: str) -> list[float]:
resp = self._client.post(
self._url, json={"model": self._model_name, "prompt": text}
)
resp.raise_for_status()
return resp.json().get("embedding") or []
@property
def dim(self) -> int:
return self._dim
@property
def model_name(self) -> str:
return self._model_name
def embed(self, text: str) -> np.ndarray:
vec = self._raw_embed(text)
return np.array(vec, dtype=np.float32)
def embed_batch(self, texts: list[str]) -> list[np.ndarray]:
return [self.embed(t) for t in texts]
# ── Singleton factory ─────────────────────────────────────────────────────────
_embedder: Embedder | None = None
def get_embedder() -> Embedder | None:
"""Return the configured embedder singleton, or None when unavailable.
Lazy-initialises on first call. EMBEDDING_AVAILABLE is only set to True by a
successful call, so callers should test the return value for None rather than
reading the flag beforehand.
"""
global _embedder, EMBEDDING_AVAILABLE
if _embedder is not None:
return _embedder
if _BACKEND == "sentence_transformers":
try:
_embedder = SentenceTransformerEmbedder(_MODEL, _DEVICE)
EMBEDDING_AVAILABLE = True
except ImportError:
logger.warning(
"sentence-transformers not installed — embeddings disabled. "
"Install with: pip install sentence-transformers"
)
except Exception as exc:
logger.warning("Failed to load sentence-transformers model %r: %s", _MODEL, exc)
elif _BACKEND == "ollama":
try:
_embedder = OllamaEmbedder(_MODEL, _LLM_URL)
EMBEDDING_AVAILABLE = True
except Exception as exc:
logger.warning("Ollama embedder init failed: %s", exc)
else:
logger.warning("Unknown TURNSTONE_EMBED_BACKEND %r — embeddings disabled", _BACKEND)
return _embedder
# ── BLOB serialisation helpers ────────────────────────────────────────────────
def pack_vector(vec: np.ndarray) -> bytes:
"""Serialise a float32 numpy vector to a SQLite BLOB."""
arr = vec.astype(np.float32)
return struct.pack(f"{len(arr)}f", *arr.tolist())
def unpack_vector(blob: bytes) -> np.ndarray:
"""Deserialise a SQLite BLOB back to a float32 numpy vector."""
n = len(blob) // 4 # 4 bytes per float32
return np.array(struct.unpack(f"{n}f", blob), dtype=np.float32)
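`pack_vector` and `unpack_vector` use the format `f"{n}f"`, which means native byte order. A round trip is easy to verify with the standard library alone. The sketch below is a variation, not the module's code: it works on plain lists and pins little-endian order with `<`, which would keep blobs portable if a database file moves between machines. `pack_f32` and `unpack_f32` are hypothetical names.

```python
from __future__ import annotations

import struct


def pack_f32(values: list[float]) -> bytes:
    """Serialise floats as consecutive little-endian float32 values."""
    return struct.pack(f"<{len(values)}f", *values)


def unpack_f32(blob: bytes) -> list[float]:
    """Inverse of pack_f32; the element count is derived from the blob length."""
    n = len(blob) // 4  # 4 bytes per float32
    return list(struct.unpack(f"<{n}f", blob))


# Values chosen to be exactly representable in float32, so the round trip is exact.
original = [0.5, -1.25, 3.0]
blob = pack_f32(original)
restored = unpack_f32(blob)
```

A 384-dimensional embedding occupies 1536 bytes in this layout. Values that are not exactly representable in float32 (0.1, for example) come back slightly changed, so comparisons after a round trip should use a tolerance.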
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
"""Cosine similarity between two vectors.
The dot product is divided by both norms, so callers need not pre-normalise.
Returns 0.0 when either vector has zero norm.
"""
norm_a = np.linalg.norm(a)
norm_b = np.linalg.norm(b)
if norm_a == 0.0 or norm_b == 0.0:
return 0.0
return float(np.dot(a, b) / (norm_a * norm_b))
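The zero-norm guard and the scale invariance of `cosine_similarity` can be illustrated without NumPy. `cosine` below is a hypothetical pure-Python equivalent, intended only as a sketch for small vectors.

```python
from __future__ import annotations

import math


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product divided by both norms; 0.0 when either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


orthogonal = cosine([1.0, 0.0], [0.0, 1.0])   # no shared direction
parallel = cosine([1.0, 2.0], [2.0, 4.0])     # same direction, different length
degenerate = cosine([0.0, 0.0], [1.0, 1.0])   # guarded: avoids division by zero
```

Because both backends return unit-length vectors only when normalisation is requested (the sentence-transformers calls pass `normalize_embeddings=True`, the Ollama path does not normalise), dividing by the norms keeps scores comparable across backends.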


@@ -2,16 +2,31 @@
 from __future__ import annotations
 import json
-import sqlite3
+import re
 import uuid
 from pathlib import Path
-from app.ingest.base import now_iso
-from app.services.models import Incident, ReceivedBundle
+from app.db import get_conn, resolve_tenant_id
+from app.glean.base import now_iso
+from app.services.models import Incident, ReceivedBundle, SentBundle
 from app.services.search import SearchResult, entries_in_window, search
+_REDACT_PATTERNS: list[tuple[re.Pattern, str]] = [
+    (re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"), "[IP]"),
+    (re.compile(r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}"), "[EMAIL]"),
+    (re.compile(r"(?i)\b(user(?:name)?|uid)\s*[=:]\s*\S+"), r"\1=[USER]"),
+    (re.compile(r"(?i)\bhost\s*[=:]\s*\S+"), "host=[HOST]"),
+    (re.compile(r"(?i)\bpassword\s*[=:]\s*\S+"), "password=[REDACTED]"),
+]
+def _redact_text(text: str) -> str:
+    for pattern, replacement in _REDACT_PATTERNS:
+        text = pattern.sub(replacement, text)
+    return text
-def _row_to_incident(row: sqlite3.Row) -> Incident:
+def _row_to_incident(row) -> Incident:
     return Incident(
         id=row["id"],
         label=row["label"],
@@ -24,7 +39,7 @@ def _row_to_incident(row: sqlite3.Row) -> Incident:
     )
-def _row_to_bundle(row: sqlite3.Row) -> ReceivedBundle:
+def _row_to_bundle(row) -> ReceivedBundle:
     return ReceivedBundle(
         id=row["id"],
         source_host=row["source_host"],
@@ -47,6 +62,7 @@ def create_incident(
     notes: str = "",
     severity: str = "medium",
 ) -> Incident:
+    tid = resolve_tenant_id()
     incident = Incident(
         id=str(uuid.uuid4()),
         label=label,
@@ -57,47 +73,45 @@ def create_incident(
         created_at=now_iso(),
         severity=severity,
     )
-    conn = sqlite3.connect(str(db_path))
-    conn.execute("PRAGMA journal_mode=WAL")
-    conn.execute(
-        "INSERT INTO incidents (id, label, issue_type, started_at, ended_at, notes, created_at, severity) "
-        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
-        (incident.id, incident.label, incident.issue_type, incident.started_at,
-         incident.ended_at, incident.notes, incident.created_at, incident.severity),
-    )
-    conn.commit()
-    conn.close()
+    with get_conn(db_path) as conn:
+        conn.execute(
+            "INSERT INTO incidents (id, tenant_id, label, issue_type, started_at, ended_at, notes, created_at, severity) "
+            "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
+            (incident.id, tid, incident.label, incident.issue_type, incident.started_at,
+             incident.ended_at, incident.notes, incident.created_at, incident.severity),
+        )
+        conn.commit()
     return incident
 def list_incidents(db_path: Path) -> list[Incident]:
-    conn = sqlite3.connect(str(db_path))
-    conn.execute("PRAGMA journal_mode=WAL")
-    conn.row_factory = sqlite3.Row
-    rows = conn.execute(
-        "SELECT * FROM incidents ORDER BY created_at DESC"
-    ).fetchall()
-    conn.close()
+    tid = resolve_tenant_id()
+    with get_conn(db_path) as conn:
+        rows = conn.execute(
+            "SELECT * FROM incidents WHERE (tenant_id = ? OR tenant_id = '') ORDER BY created_at DESC",
+            (tid,),
+        ).fetchall()
     return [_row_to_incident(r) for r in rows]
 def get_incident(db_path: Path, incident_id: str) -> Incident | None:
-    conn = sqlite3.connect(str(db_path))
-    conn.execute("PRAGMA journal_mode=WAL")
-    conn.row_factory = sqlite3.Row
-    row = conn.execute(
-        "SELECT * FROM incidents WHERE id = ?", (incident_id,)
-    ).fetchone()
-    conn.close()
+    tid = resolve_tenant_id()
+    with get_conn(db_path) as conn:
+        row = conn.execute(
+            "SELECT * FROM incidents WHERE id = ? AND (tenant_id = ? OR tenant_id = '')",
+            (incident_id, tid),
+        ).fetchone()
     return _row_to_incident(row) if row else None
 def delete_incident(db_path: Path, incident_id: str) -> bool:
-    conn = sqlite3.connect(str(db_path))
-    conn.execute("PRAGMA journal_mode=WAL")
-    cur = conn.execute("DELETE FROM incidents WHERE id = ?", (incident_id,))
-    conn.commit()
-    conn.close()
+    tid = resolve_tenant_id()
+    with get_conn(db_path) as conn:
+        cur = conn.execute(
+            "DELETE FROM incidents WHERE id = ? AND (tenant_id = ? OR tenant_id = '')",
+            (incident_id, tid),
+        )
+        conn.commit()
     return cur.rowcount > 0
@@ -142,6 +156,7 @@ def build_bundle(
     incident: Incident,
     source_host: str,
     limit: int = 200,
+    sanitize: bool = False,
 ) -> dict:
     """Assemble a labeled bundle: incident metadata + related log entries."""
     entries = get_incident_entries(db_path, incident, limit=limit)
@@ -149,6 +164,7 @@ def build_bundle(
         "bundle_version": 1,
         "source_host": source_host,
         "bundled_at": now_iso(),
+        "sanitized": sanitize,
         "incident": {
             "id": incident.id,
             "label": incident.label,
@@ -164,7 +180,7 @@ def build_bundle(
                 "source_id": e.source_id,
                 "timestamp_iso": e.timestamp_iso,
                 "severity": e.severity,
-                "text": e.text,
+                "text": _redact_text(e.text) if sanitize else e.text,
                 "matched_patterns": list(e.matched_patterns),
             }
             for e in entries
@@ -172,8 +188,52 @@ def build_bundle(
     }
+def record_sent_bundle(db_path: Path, incident_id: str, bundle: dict, sanitized: bool) -> SentBundle:
+    """Log an outgoing bundle export to the sent_bundles table."""
+    tid = resolve_tenant_id()
+    record = SentBundle(
+        id=str(uuid.uuid4()),
+        incident_id=incident_id,
+        exported_at=now_iso(),
+        sanitized=sanitized,
+        entry_count=len(bundle.get("log_entries", [])),
+        bundle_json=json.dumps(bundle),
+    )
+    with get_conn(db_path) as conn:
+        conn.execute(
+            "INSERT INTO sent_bundles (id, tenant_id, incident_id, exported_at, sanitized, entry_count, bundle_json) "
+            "VALUES (?, ?, ?, ?, ?, ?, ?)",
+            (record.id, tid, record.incident_id, record.exported_at,
+             int(record.sanitized), record.entry_count, record.bundle_json),
+        )
+        conn.commit()
+    return record
+def list_sent_bundles(db_path: Path) -> list[SentBundle]:
+    tid = resolve_tenant_id()
+    with get_conn(db_path) as conn:
+        rows = conn.execute(
+            "SELECT id, incident_id, exported_at, sanitized, entry_count, bundle_json "
+            "FROM sent_bundles WHERE (tenant_id = ? OR tenant_id = '') ORDER BY exported_at DESC",
+            (tid,),
+        ).fetchall()
+    return [
+        SentBundle(
+            id=r["id"],
+            incident_id=r["incident_id"],
+            exported_at=r["exported_at"],
+            sanitized=bool(r["sanitized"]),
+            entry_count=r["entry_count"],
+            bundle_json=r["bundle_json"],
+        )
+        for r in rows
+    ]
 def store_bundle(db_path: Path, bundle: dict) -> ReceivedBundle:
     """Store an incoming bundle from a remote Turnstone instance."""
+    tid = resolve_tenant_id()
     inc = bundle.get("incident", {})
     record = ReceivedBundle(
         id=str(uuid.uuid4()),
@@ -186,38 +246,34 @@ def store_bundle(db_path: Path, bundle: dict) -> ReceivedBundle:
         entry_count=len(bundle.get("log_entries", [])),
         bundle_json=json.dumps(bundle),
     )
-    conn = sqlite3.connect(str(db_path))
-    conn.execute("PRAGMA journal_mode=WAL")
-    conn.execute(
-        "INSERT INTO received_bundles "
-        "(id, source_host, issue_type, label, severity, started_at, bundled_at, entry_count, bundle_json) "
-        "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
-        (record.id, record.source_host, record.issue_type, record.label,
-         record.severity, record.started_at, record.bundled_at, record.entry_count, record.bundle_json),
-    )
-    conn.commit()
-    conn.close()
+    with get_conn(db_path) as conn:
+        conn.execute(
+            "INSERT INTO received_bundles "
+            "(id, tenant_id, source_host, issue_type, label, severity, started_at, bundled_at, entry_count, bundle_json) "
+            "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
+            (record.id, tid, record.source_host, record.issue_type, record.label,
+             record.severity, record.started_at, record.bundled_at, record.entry_count, record.bundle_json),
+        )
+        conn.commit()
     return record
 def list_bundles(db_path: Path) -> list[ReceivedBundle]:
-    conn = sqlite3.connect(str(db_path))
-    conn.execute("PRAGMA journal_mode=WAL")
-    conn.row_factory = sqlite3.Row
-    rows = conn.execute(
-        "SELECT id, source_host, issue_type, label, severity, started_at, bundled_at, entry_count, bundle_json "
-        "FROM received_bundles ORDER BY bundled_at DESC"
-    ).fetchall()
-    conn.close()
+    tid = resolve_tenant_id()
+    with get_conn(db_path) as conn:
+        rows = conn.execute(
+            "SELECT id, source_host, issue_type, label, severity, started_at, bundled_at, entry_count, bundle_json "
+            "FROM received_bundles WHERE (tenant_id = ? OR tenant_id = '') ORDER BY bundled_at DESC",
+            (tid,),
+        ).fetchall()
     return [_row_to_bundle(r) for r in rows]
 def get_bundle(db_path: Path, bundle_id: str) -> ReceivedBundle | None:
-    conn = sqlite3.connect(str(db_path))
-    conn.execute("PRAGMA journal_mode=WAL")
-    conn.row_factory = sqlite3.Row
-    row = conn.execute(
-        "SELECT * FROM received_bundles WHERE id = ?", (bundle_id,)
-    ).fetchone()
-    conn.close()
+    tid = resolve_tenant_id()
+    with get_conn(db_path) as conn:
+        row = conn.execute(
+            "SELECT * FROM received_bundles WHERE id = ? AND (tenant_id = ? OR tenant_id = '')",
+            (bundle_id, tid),
+        ).fetchone()
     return _row_to_bundle(row) if row else None
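The redaction pass added for sanitized bundles applies its patterns in a fixed order, and each substitution sees the output of the previous one. The sketch below copies the five patterns from the diff and runs them on a made-up log line. The sample text is illustrative, and `redact` is a local stand-in for `_redact_text`.

```python
import re

# Patterns copied from the diff above; the order matters.
REDACT_PATTERNS = [
    (re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"), "[IP]"),
    (re.compile(r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}"), "[EMAIL]"),
    (re.compile(r"(?i)\b(user(?:name)?|uid)\s*[=:]\s*\S+"), r"\1=[USER]"),
    (re.compile(r"(?i)\bhost\s*[=:]\s*\S+"), "host=[HOST]"),
    (re.compile(r"(?i)\bpassword\s*[=:]\s*\S+"), "password=[REDACTED]"),
]


def redact(text: str) -> str:
    """Apply every pattern in sequence to the running result."""
    for pattern, replacement in REDACT_PATTERNS:
        text = pattern.sub(replacement, text)
    return text


line = ("login failed user=alice host=db01 from 10.0.0.5 "
        "password: hunter2 contact ops@example.com")
clean = redact(line)
# The IP pattern is purely syntactic, so dotted version strings match too.
version = redact("nginx 1.2.3.4 started")
```

The patterns are heuristics: they catch `key=value` and `key: value` forms, but a bare hostname or username with no key in front of it passes through unchanged, so a sanitized bundle should still be reviewed before it leaves the host.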

103
app/services/llm.py Normal file

@ -0,0 +1,103 @@
import logging
import httpx
from app.services.search import SearchResult
logger = logging.getLogger(__name__)
_SEVERITY_RANK = {"CRITICAL": 0, "ERROR": 1, "WARN": 2, "WARNING": 2}
_PROMPT_TEMPLATE = """\
You are a homelab diagnostic assistant. A user described a symptom and the system retrieved relevant log entries.
Analyze the log entries below and write a 2-4 sentence plain-language diagnosis. Focus on errors and their likely root cause. Be specific and concise: name the services involved, not generic platitudes.
User query: {query}
{context_section}
Log entries ({n} shown, highest severity first):
{log_block}
Diagnosis:"""
def _build_context(entries: list[SearchResult], max_entries: int = 25) -> str:
ranked = sorted(
entries,
key=lambda e: (_SEVERITY_RANK.get(e.severity or "", 3), e.timestamp_iso or ""),
)[:max_entries]
return "\n".join(
f"[{e.timestamp_iso or '?'}] [{e.severity or 'INFO'}] {e.text[:200]}"
for e in ranked
)
def _extract_content(resp_json: dict) -> str | None:
"""Pull text content from an OpenAI-compat chat completion response."""
choices = resp_json.get("choices") or []
if not choices:
return None
return (choices[0].get("message", {}).get("content") or "").strip() or None
def summarize(
query: str,
entries: list[SearchResult],
llm_url: str,
llm_model: str,
api_key: str | None = None,
timeout: float = 120.0,
context_block: str | None = None,
) -> str | None:
if not entries:
return None
log_block = _build_context(entries)
context_section = (
f"\nEnvironment context:\n{context_block}\n" if context_block else ""
)
prompt = _PROMPT_TEMPLATE.format(
query=query,
n=min(len(entries), 25),
log_block=log_block,
context_section=context_section,
)
headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
messages = [{"role": "user", "content": prompt}]
# Try cf-orch task-based endpoint first (routes to the security reasoning model
# assigned to turnstone.log_analysis without needing an explicit model name).
task_url = f"{llm_url.rstrip('/')}/api/inference/task"
try:
resp = httpx.post(
task_url,
json={
"product": "turnstone",
"task": "log_analysis",
"payload": {"messages": messages, "stream": False, "max_tokens": 1024},
},
headers=headers,
timeout=timeout,
)
if resp.status_code == 200:
return _extract_content(resp.json())
if resp.status_code != 404:
resp.raise_for_status()
# 404 means no assignment configured — fall through to direct model call
logger.debug("No task assignment for turnstone.log_analysis — falling back to direct model")
except Exception as exc:
logger.debug("Task endpoint unavailable (%s) — falling back to direct model", exc)
# Fallback: OpenAI-compat endpoint with explicit model name (local instances,
# or any cf-orch node that doesn't have task assignments loaded).
try:
resp = httpx.post(
f"{llm_url.rstrip('/')}/v1/chat/completions",
json={"model": llm_model, "messages": messages, "stream": False, "max_tokens": 1024},
headers=headers,
timeout=timeout,
)
resp.raise_for_status()
return _extract_content(resp.json())
except Exception as exc:
logger.warning("LLM summarization failed (%s): %s", type(exc).__name__, exc)
return None
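
The ordering that `_build_context` applies before the prompt is built can be sketched on its own. This is an illustrative sketch, assuming a stand-in `Entry` dataclass in place of `SearchResult` (the `Entry` name and the sample data are hypothetical). Entries sort by severity rank first, then by ISO timestamp, and anything not in the rank table falls to the end.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for SearchResult with only the fields used here."""
    timestamp_iso: str | None
    severity: str | None
    text: str


SEVERITY_RANK = {"CRITICAL": 0, "ERROR": 1, "WARN": 2, "WARNING": 2}


def build_context(entries: list[Entry], max_entries: int = 25) -> str:
    # Unknown or missing severities get rank 3, so they sort after WARN/WARNING.
    ranked = sorted(
        entries,
        key=lambda e: (SEVERITY_RANK.get(e.severity or "", 3), e.timestamp_iso or ""),
    )[:max_entries]
    # Each entry's text is truncated to 200 characters to bound the prompt size.
    return "\n".join(
        f"[{e.timestamp_iso or '?'}] [{e.severity or 'INFO'}] {e.text[:200]}"
        for e in ranked
    )


sample = [
    Entry("2026-01-01T00:00:02", "INFO", "service started"),
    Entry("2026-01-01T00:00:01", "ERROR", "connection refused"),
    Entry(None, "CRITICAL", "disk failure"),
]
context_lines = build_context(sample).splitlines()
```

One consequence worth noting: an entry with no timestamp sorts as the empty string, so within a severity tier it lands ahead of timestamped entries.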


@ -10,7 +10,7 @@ class RetrievedEntry:
entry_id: str entry_id: str
source_id: str # log file path or service name source_id: str # log file path or service name
sequence: int # original line number — ingest order, not wall-clock order sequence: int # original line number — glean order, not wall-clock order
timestamp_raw: str | None # timestamp as it appeared in the log timestamp_raw: str | None # timestamp as it appeared in the log
timestamp_iso: str | None # parsed to ISO 8601 for sorting; None if unparseable timestamp_iso: str | None # parsed to ISO 8601 for sorting; None if unparseable
ingest_time: str # when Turnstone indexed this entry (wall clock) ingest_time: str # when Turnstone indexed this entry (wall clock)
@ -25,12 +25,13 @@ class RetrievedEntry:
@dataclass(frozen=True) @dataclass(frozen=True)
class LogPattern: class LogPattern:
"""A named regex pattern for tagging entries at ingest time.""" """A named regex pattern for tagging entries at glean time."""
name: str # e.g. "device_disconnect", "auth_failure" name: str # e.g. "device_disconnect", "auth_failure"
pattern: str # regex string pattern: str # regex string
severity: str # suggested severity if not present in log line severity: str # suggested severity if not present in log line
description: str # human-readable explanation for the UI description: str # human-readable explanation for the UI
domain: str = "" # service health domain (networking, storage, auth, etc.)
@dataclass(frozen=True) @dataclass(frozen=True)
@ -60,3 +61,15 @@ class ReceivedBundle:
bundled_at: str bundled_at: str
entry_count: int entry_count: int
bundle_json: str # full bundle serialized as JSON string bundle_json: str # full bundle serialized as JSON string
@dataclass(frozen=True)
class SentBundle:
"""A record of a bundle exported or sent from this instance."""
id: str
incident_id: str
exported_at: str
sanitized: bool
entry_count: int
bundle_json: str

134
app/services/nl_source.py Normal file

@ -0,0 +1,134 @@
"""Natural-language log source interpretation (LLM path for #53).
BSL-gated feature: the structured form fallback is MIT; the LLM interpretation
requires the LLM service to be configured. The caller always validates the
output against the source schema before writing anything.
"""
from __future__ import annotations
import json
import logging
import re
from typing import Any
import httpx
logger = logging.getLogger(__name__)
_SYSTEM_PROMPT = """\
You are a Turnstone log-source configuration assistant.
The operator will describe a log source in plain English.
Respond ONLY with a JSON object matching this schema (no prose, no markdown):
{
"id": "short-kebab-case identifier",
"type": "file" | "journald" | "docker",
"path": "/absolute/path (file type only)",
"container": "container-name (docker type only)",
"runtime": "docker" | "podman" (docker type only, default docker),
"unit": "service.service (journald type only, omit for all-journal)",
"label": "Human-readable name for the UI"
}
Rules:
- For well-known apps (nginx, apache, caddy, sonarr, radarr, qbittorrent, plex, jellyfin),
use the conventional default log path.
- If the operator mentions a Docker/Podman container, use type=docker.
- If the operator mentions journald or a systemd service, use type=journald.
- If uncertain, use type=file with the most likely path.
- The "id" must be lowercase, hyphens only (no spaces, slashes, dots).
- Never include trailing commas or comments in your JSON.
"""
# Well-known path lookup for common apps — used as a deterministic fallback
_KNOWN_APPS: dict[str, dict[str, Any]] = {
"nginx": {"id": "nginx-access", "type": "file", "path": "/var/log/nginx/access.log"},
"apache": {"id": "apache", "type": "file", "path": "/var/log/apache2/access.log"},
"caddy": {"id": "caddy", "type": "file", "path": "/var/log/caddy/access.log"},
"sonarr": {"id": "sonarr", "type": "file", "path": "/var/log/sonarr/sonarr.0.txt"},
"radarr": {"id": "radarr", "type": "file", "path": "/var/log/radarr/radarr.0.txt"},
"qbittorrent": {"id": "qbittorrent", "type": "file", "path": "/var/log/qbittorrent/qbittorrent.log"},
"plex": {"id": "plex", "type": "file", "path": "/var/lib/plexmediaserver/Library/Application Support/Plex Media Server/Logs/Plex Media Server.log"},
"jellyfin": {"id": "jellyfin", "type": "file", "path": "/var/log/jellyfin/jellyfin.log"},
"syslog": {"id": "syslog", "type": "file", "path": "/var/log/syslog"},
"auth": {"id": "auth", "type": "file", "path": "/var/log/auth.log"},
"fail2ban": {"id": "fail2ban", "type": "file", "path": "/var/log/fail2ban.log"},
"docker": {"id": "docker-daemon", "type": "file", "path": "/var/log/docker.log"},
"journal": {"id": "journal", "type": "journald"},
"journald": {"id": "journal", "type": "journald"},
"systemd": {"id": "journal", "type": "journald"},
}
def _keyword_match(description: str) -> dict[str, Any] | None:
"""Try a simple keyword match before spending an LLM call."""
lower = description.lower()
for keyword, template in _KNOWN_APPS.items():
if keyword in lower:
result = dict(template)
result.setdefault("label", keyword.capitalize() + " log")
return result
return None
def _extract_json(text: str) -> dict[str, Any] | None:
"""Pull the first {...} block out of an LLM response."""
match = re.search(r"\{[^{}]+\}", text, re.DOTALL)
if not match:
return None
try:
return json.loads(match.group())
except json.JSONDecodeError:
return None
def interpret(
description: str,
llm_url: str | None,
llm_model: str | None,
api_key: str | None = None,
timeout: float = 30.0,
) -> dict[str, Any] | None:
"""Interpret a natural-language source description.
Returns a source dict or None if interpretation fails.
The caller must validate the result with discover.validate_source()
before writing anything to disk.
"""
# 1. Keyword shortcut — no LLM needed for well-known apps
kw = _keyword_match(description)
if kw:
logger.debug("NL source: keyword match for %r", description)
return kw
# 2. LLM path
if not llm_url or not llm_model:
logger.debug("NL source: no LLM configured, returning None")
return None
messages = [
{"role": "system", "content": _SYSTEM_PROMPT},
{"role": "user", "content": description},
]
headers = {"Content-Type": "application/json"}
if api_key:
headers["Authorization"] = f"Bearer {api_key}"
try:
resp = httpx.post(
f"{llm_url.rstrip('/')}/v1/chat/completions",
json={"model": llm_model, "messages": messages, "stream": False, "max_tokens": 256},
headers=headers,
timeout=timeout,
)
resp.raise_for_status()
content = resp.json()["choices"][0]["message"]["content"]
parsed = _extract_json(content)
if parsed:
parsed.setdefault("label", description[:60])
return parsed
logger.warning("NL source: could not extract JSON from LLM response")
except Exception as exc:
logger.warning("NL source: LLM call failed (%s): %s", type(exc).__name__, exc)
return None

327
app/services/orchard.py Normal file

@ -0,0 +1,327 @@
"""The Orchard — auto-enrollment of new Turnstone branch nodes.
A "branch" is an external Turnstone instance that submits pattern-matched log
entries to a central harvest receiver (harvest.circuitforge.tech). Grafting
provisions the receiving infrastructure for a new branch:
1. Creates a data dir at ORCHARD_DATA_ROOT/<slug>/
2. Starts a new turnstone-submissions-<slug> Docker container
3. Injects a handle_path block into the Caddyfile marker section
4. Restarts caddy-proxy to activate the route
5. Persists the branch registry to orchard-branches.yaml
Admin auth: the graft/deactivate endpoints require
Authorization: Bearer <TURNSTONE_ORCHARD_ADMIN_KEY>
Set TURNSTONE_ORCHARD_ADMIN_KEY in the environment on the harvest instance.
If unset, the endpoints return 501 Not Implemented (feature is off).
Anonymization: a separate pass (run_anonymization) replaces IPs and usernames
in branch DBs with stable pseudonyms before Avocet reads them.
"""
from __future__ import annotations
import hashlib
import hmac
import ipaddress
import json
import logging
import os
import re
import secrets
import sqlite3
import subprocess
import time
from pathlib import Path
from typing import Any
logger = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# Config (read from env on the harvest instance)
# ---------------------------------------------------------------------------
ORCHARD_DATA_ROOT = Path(os.environ.get("TURNSTONE_ORCHARD_DATA_ROOT", "/devl/docker/turnstone-submissions"))
ORCHARD_CADDYFILE = Path(os.environ.get("TURNSTONE_ORCHARD_CADDYFILE", "/devl/caddy-proxy/Caddyfile"))
ORCHARD_CADDY_CONTAINER = os.environ.get("TURNSTONE_ORCHARD_CADDY_CONTAINER", "caddy-proxy")
ORCHARD_HARVEST_HOST = os.environ.get("TURNSTONE_ORCHARD_HARVEST_HOST", "https://harvest.circuitforge.tech")
ORCHARD_IMAGE = os.environ.get("TURNSTONE_ORCHARD_IMAGE", "localhost/turnstone:latest")
# Ports for submission containers start here and scan upward.
ORCHARD_PORT_BASE = int(os.environ.get("TURNSTONE_ORCHARD_PORT_BASE", "8538"))
_REGISTRY_FILE = ORCHARD_DATA_ROOT / "orchard-branches.yaml"
_CADDY_BRANCH_START = "# --- ORCHARD BRANCHES: auto-managed by POST /api/orchard/graft, do not edit manually ---"
_CADDY_BRANCH_END = "# --- END ORCHARD BRANCHES ---"
_SLUG_RE = re.compile(r"^[a-z0-9][a-z0-9-]{1,30}[a-z0-9]$")
# ---------------------------------------------------------------------------
# Branch registry
# ---------------------------------------------------------------------------
def _load_registry() -> list[dict[str, Any]]:
if not _REGISTRY_FILE.exists():
return []
import yaml as _yaml
try:
data = _yaml.safe_load(_REGISTRY_FILE.read_text()) or {}
return data.get("branches", [])
except Exception:
return []
def _save_registry(branches: list[dict[str, Any]]) -> None:
import yaml as _yaml
_REGISTRY_FILE.parent.mkdir(parents=True, exist_ok=True)
_REGISTRY_FILE.write_text(_yaml.dump({"branches": branches}, default_flow_style=False))
def list_branches() -> list[dict[str, Any]]:
return _load_registry()
# ---------------------------------------------------------------------------
# Port allocation
# ---------------------------------------------------------------------------
def _next_free_port() -> int:
used = {b["port"] for b in _load_registry() if "port" in b}
port = ORCHARD_PORT_BASE
while port in used:
port += 1
return port
# ---------------------------------------------------------------------------
# Caddy route injection
# ---------------------------------------------------------------------------
def _build_branch_block(slug: str, port: int) -> str:
return (
f" handle_path /{slug}/* {{\n"
f" reverse_proxy http://host.docker.internal:{port} {{\n"
f" header_up X-Real-IP {{remote_host}}\n"
f" header_up X-Forwarded-Proto {{scheme}}\n"
f" flush_interval -1\n"
f" transport http {{\n"
f" response_header_timeout 0\n"
f" read_timeout 0\n"
f" }}\n"
f" }}\n"
f" }}"
)
def _rewrite_caddy_branches(branches: list[dict[str, Any]]) -> None:
"""Replace the auto-managed section in the Caddyfile with current branches."""
if not ORCHARD_CADDYFILE.exists():
raise RuntimeError(f"Caddyfile not found at {ORCHARD_CADDYFILE}")
text = ORCHARD_CADDYFILE.read_text()
start_idx = text.find(_CADDY_BRANCH_START)
end_idx = text.find(_CADDY_BRANCH_END)
if start_idx == -1 or end_idx == -1:
raise RuntimeError("Caddyfile is missing the ORCHARD BRANCHES marker section")
active = [b for b in branches if b.get("active", True)]
blocks = "\n".join(_build_branch_block(b["slug"], b["port"]) for b in active)
replacement = f"{_CADDY_BRANCH_START}\n{blocks}\n {_CADDY_BRANCH_END}"
new_text = text[:start_idx] + replacement + text[end_idx + len(_CADDY_BRANCH_END):]
ORCHARD_CADDYFILE.write_text(new_text)
logger.info("Caddyfile updated with %d active branch routes", len(active))
def _reload_caddy() -> None:
result = subprocess.run(
["docker", "restart", ORCHARD_CADDY_CONTAINER],
capture_output=True, text=True, timeout=30,
)
if result.returncode != 0:
raise RuntimeError(f"docker restart {ORCHARD_CADDY_CONTAINER} failed: {result.stderr}")
logger.info("Restarted %s", ORCHARD_CADDY_CONTAINER)
# ---------------------------------------------------------------------------
# Container provisioning
# ---------------------------------------------------------------------------
def _start_branch_container(slug: str, port: int, data_dir: Path) -> None:
patterns_dir = data_dir / "patterns"
patterns_dir.mkdir(parents=True, exist_ok=True)
data_dir.mkdir(parents=True, exist_ok=True)
# Seed default patterns if not already present
repo_patterns = Path(__file__).parent.parent.parent / "patterns"
for yaml_file in ("default.yaml", "sources-example.yaml"):
src = repo_patterns / yaml_file
dst = patterns_dir / yaml_file
if src.exists() and not dst.exists():
dst.write_text(src.read_text())
container_name = f"turnstone-submissions-{slug}"
cmd = [
"docker", "run", "-d",
"--name", container_name,
"--restart", "unless-stopped",
"-p", f"{port}:8534",
"-v", f"{data_dir}:/data",
"-v", f"{patterns_dir}:/patterns",
"-e", f"TURNSTONE_DB=/data/turnstone.db",
"-e", f"TURNSTONE_SOURCE_HOST={slug}",
"-e", "PYTHONUNBUFFERED=1",
"-e", "TZ=America/Los_Angeles",
ORCHARD_IMAGE,
]
# Remove any stale container with the same name first
subprocess.run(["docker", "rm", "-f", container_name], capture_output=True)
result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
if result.returncode != 0:
raise RuntimeError(f"docker run for {container_name} failed: {result.stderr}")
logger.info("Started container %s on port %d", container_name, port)
def _stop_branch_container(slug: str) -> None:
container_name = f"turnstone-submissions-{slug}"
subprocess.run(["docker", "rm", "-f", container_name], capture_output=True, timeout=30)
logger.info("Removed container %s", container_name)
# ---------------------------------------------------------------------------
# Public API
# ---------------------------------------------------------------------------
def graft(slug: str, contact_email: str, agreed_to_terms: bool) -> dict[str, Any]:
"""Provision a new Orchard branch and return connection details."""
if not agreed_to_terms:
raise ValueError("agreed_to_terms must be true")
if not _SLUG_RE.match(slug):
raise ValueError(
f"Invalid slug {slug!r}: must be 3-32 lowercase alphanumeric/hyphen, "
"cannot start or end with a hyphen"
)
branches = _load_registry()
if any(b["slug"] == slug for b in branches):
raise ValueError(f"Branch {slug!r} already exists")
port = _next_free_port()
data_dir = ORCHARD_DATA_ROOT / slug
api_key = secrets.token_urlsafe(32)
branch: dict[str, Any] = {
"slug": slug,
"port": port,
"contact_email": contact_email,
"api_key_hash": hashlib.sha256(api_key.encode()).hexdigest(),
"grafted_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
"active": True,
}
_start_branch_container(slug, port, data_dir)
branches.append(branch)
_save_registry(branches)
_rewrite_caddy_branches(branches)
_reload_caddy()
submit_endpoint = f"{ORCHARD_HARVEST_HOST}/{slug}"
logger.info("Grafted branch %r at %s", slug, submit_endpoint)
return {
"slug": slug,
"submit_endpoint": submit_endpoint,
"api_key": api_key,
"port": port,
}
def deactivate(slug: str) -> dict[str, Any]:
"""Deactivate a branch: stop its container and remove its Caddy route."""
branches = _load_registry()
branch = next((b for b in branches if b["slug"] == slug), None)
if branch is None:
raise KeyError(f"Branch {slug!r} not found")
_stop_branch_container(slug)
branch["active"] = False
_save_registry(branches)
_rewrite_caddy_branches(branches)
_reload_caddy()
return {"slug": slug, "deactivated": True}
def verify_api_key(slug: str, key: str) -> bool:
"""Check whether *key* is valid for the given branch slug."""
branches = _load_registry()
branch = next((b for b in branches if b["slug"] == slug and b.get("active")), None)
if branch is None:
return False
expected = branch.get("api_key_hash", "")
provided = hashlib.sha256(key.encode()).hexdigest()
return hmac.compare_digest(expected, provided)
# ---------------------------------------------------------------------------
# Anonymization worker
# ---------------------------------------------------------------------------
_IP_RE = re.compile(
r"\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b"
)
_USERNAME_RE = re.compile(r"\bfor\s+(\w+)\b|\buser\s+(\w+)\b|\bsession\s+opened\s+for\s+(\w+)\b", re.IGNORECASE)
def _pseudonym(value: str, salt: bytes, prefix: str) -> str:
digest = hmac.new(salt, value.encode(), "sha256").hexdigest()[:10]
return f"{prefix}-{digest}"
def _anonymize_text(text: str, salt: bytes) -> str:
def replace_ip(m: re.Match) -> str:
return _pseudonym(m.group(), salt, "ip")
def replace_user(m: re.Match) -> str:
user = next(g for g in m.groups() if g)
return m.group().replace(user, _pseudonym(user, salt, "user"))
text = _IP_RE.sub(replace_ip, text)
text = _USERNAME_RE.sub(replace_user, text)
return text
def run_anonymization(slug: str) -> dict[str, Any]:
"""Anonymize IPs and usernames in a branch DB in-place.
Uses a stable per-branch salt so pseudonyms are consistent across runs
but not reversible without the salt.
"""
branch = next((b for b in _load_registry() if b["slug"] == slug), None)
if branch is None:
raise KeyError(f"Branch {slug!r} not found")
db_path = ORCHARD_DATA_ROOT / slug / "turnstone.db"
if not db_path.exists():
return {"slug": slug, "anonymized": 0}
# Per-branch salt derived from api_key_hash for stability
salt = branch["api_key_hash"].encode()[:32].ljust(32, b"0")
conn = sqlite3.connect(str(db_path), timeout=30)
conn.execute("PRAGMA journal_mode=WAL")
rows = conn.execute("SELECT id, text FROM log_entries WHERE anonymized IS NULL OR anonymized = 0").fetchall()
updated = 0
for row_id, text in rows:
clean = _anonymize_text(text or "", salt)
if clean != text:
conn.execute("UPDATE log_entries SET text = ?, anonymized = 1 WHERE id = ?", (clean, row_id))
updated += 1
else:
conn.execute("UPDATE log_entries SET anonymized = 1 WHERE id = ?", (row_id,))
conn.commit()
conn.close()
logger.info("Anonymized %d/%d entries in branch %r", updated, len(rows), slug)
return {"slug": slug, "anonymized": updated, "total_processed": len(rows)}
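
The pseudonym scheme used by the anonymization pass can be sketched without a database. This is a minimal illustration, assuming the same HMAC-SHA256 construction and IPv4 pattern as above; the salts and the log line are invented, and the username pass is left out for brevity. It shows that one salt maps an IP to the same token every time, and that a different salt produces a different token.

```python
import hmac
import re

IP_RE = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b"
)


def pseudonym(value: str, salt: bytes, prefix: str) -> str:
    # Keyed hash: without the salt, the token cannot be recomputed from the value.
    digest = hmac.new(salt, value.encode(), "sha256").hexdigest()[:10]
    return f"{prefix}-{digest}"


def anonymize_ips(text: str, salt: bytes) -> str:
    return IP_RE.sub(lambda m: pseudonym(m.group(), salt, "ip"), text)


line = "Failed login from 192.168.1.10, retry from 192.168.1.10"
out_a = anonymize_ips(line, b"hypothetical-salt-a")
out_b = anonymize_ips(line, b"hypothetical-salt-b")
tokens_a = re.findall(r"ip-[0-9a-f]{10}", out_a)
```

Because the tokens contain no dots, re-running the pass over already-anonymized text leaves it unchanged, which fits the `anonymized` flag being used only to skip work rather than to prevent double replacement.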

96
app/services/pihole.py Normal file

@ -0,0 +1,96 @@
"""Pi-hole API client supporting v5 (PHP) and v6 (REST) APIs."""
from __future__ import annotations
import dataclasses
import httpx
@dataclasses.dataclass
class PiholeClient:
url: str
api_key: str
version: str = "v6" # "v5" | "v6"
def __post_init__(self) -> None:
self.url = self.url.rstrip("/")
if not self.url or not self.api_key:
raise ValueError("PiholeClient requires a non-empty url and api_key")
# ── Public API ────────────────────────────────────────────────────────
def block(self, domain: str, comment: str = "Turnstone block") -> None:
if self.version == "v5":
self._v5_get("black", "add", domain)
else:
sid = self._v6_auth()
self._v6_post_domain(sid, domain, comment)
def unblock(self, domain: str) -> None:
if self.version == "v5":
self._v5_get("black", "sub", domain)
else:
sid = self._v6_auth()
self._v6_delete_domain(sid, domain)
def test_connection(self) -> dict:
try:
if self.version == "v5":
return self._v5_test()
return self._v6_test()
except Exception as exc:
return {"ok": False, "version": self.version, "domain_count": 0, "error": str(exc)}
# ── v5 helpers ────────────────────────────────────────────────────────
def _v5_get(self, list_type: str, action: str, domain: str) -> None:
params = {"list": list_type, action: domain, "auth": self.api_key}
with httpx.Client(timeout=10) as c:
c.get(f"{self.url}/admin/api.php", params=params).raise_for_status()
def _v5_test(self) -> dict:
with httpx.Client(timeout=10) as c:
r = c.get(f"{self.url}/admin/api.php", params={"summaryRaw": "", "auth": self.api_key})
r.raise_for_status()
data = r.json()
return {
"ok": True,
"version": "v5",
"domain_count": int(data.get("domains_being_blocked", 0)),
"error": None,
}
# ── v6 helpers ────────────────────────────────────────────────────────
def _v6_auth(self) -> str:
with httpx.Client(timeout=10) as c:
r = c.post(f"{self.url}/api/auth", json={"password": self.api_key})
r.raise_for_status()
data = r.json()
sid = data.get("session", {}).get("sid")
if not sid:
msg = data.get("session", {}).get("message", "no sid returned")
raise ValueError(f"Pi-hole v6 auth failed: {msg}")
return sid
def _v6_post_domain(self, sid: str, domain: str, comment: str) -> None:
body = [{"domain": domain, "comment": comment, "enabled": True}]
with httpx.Client(timeout=10, cookies={"sid": sid}) as c:
c.post(f"{self.url}/api/domains/deny", json=body).raise_for_status()
def _v6_delete_domain(self, sid: str, domain: str) -> None:
with httpx.Client(timeout=10, cookies={"sid": sid}) as c:
c.delete(f"{self.url}/api/domains/deny/{domain}").raise_for_status()
def _v6_test(self) -> dict:
sid = self._v6_auth()
with httpx.Client(timeout=10, cookies={"sid": sid}) as c:
r = c.get(f"{self.url}/api/domains/deny")
r.raise_for_status()
data = r.json()
return {
"ok": True,
"version": "v6",
"domain_count": len(data.get("data", [])),
"error": None,
}


@ -1,4 +1,8 @@
"""FTS5-based log search with severity, source, and pattern filters.""" """FTS-based log search with optional hybrid BM25 + vector re-ranking.
SQLite backend: FTS5 virtual table with Porter stemmer.
Postgres backend: tsvector column with GIN index + websearch_to_tsquery.
"""
from __future__ import annotations from __future__ import annotations
import json import json
@ -6,8 +10,11 @@ import logging
import re import re
import sqlite3 import sqlite3
from dataclasses import dataclass from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from pathlib import Path from pathlib import Path
from app.db import BACKEND, Backend, frag, get_conn, resolve_tenant_id
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
@ -28,48 +35,47 @@ class SearchResult:
def build_fts_index(db_path: Path) -> None: def build_fts_index(db_path: Path) -> None:
"""Build (or rebuild) the FTS5 index from log_entries. Safe to re-run. """Build (or rebuild) the FTS5 index from log_entries. Safe to re-run.
Drops and recreates the table if the schema is stale (missing sequence column). For Postgres, the tsvector column is maintained by a trigger, so this is a no-op.
""" """
conn = sqlite3.connect(str(db_path)) if BACKEND == Backend.POSTGRES:
conn.execute("PRAGMA journal_mode=WAL") return
# Check whether existing table has the sequence column; rebuild if not. with get_conn(db_path) as conn:
needs_rebuild = False needs_rebuild = False
try: try:
conn.execute("SELECT sequence FROM log_fts LIMIT 0") conn.execute("SELECT sequence FROM log_fts LIMIT 0")
except sqlite3.OperationalError: except Exception:
needs_rebuild = True needs_rebuild = True
if needs_rebuild: if needs_rebuild:
conn.execute("DROP TABLE IF EXISTS log_fts") conn.execute("DROP TABLE IF EXISTS log_fts")
conn.commit()
conn.executescript(""" conn.execute("""
CREATE VIRTUAL TABLE IF NOT EXISTS log_fts USING fts5( CREATE VIRTUAL TABLE IF NOT EXISTS log_fts USING fts5(
text, text,
entry_id UNINDEXED, entry_id UNINDEXED,
source_id UNINDEXED, source_id UNINDEXED,
sequence UNINDEXED, sequence UNINDEXED,
severity UNINDEXED, severity UNINDEXED,
timestamp_iso UNINDEXED, timestamp_iso UNINDEXED,
matched_patterns UNINDEXED, matched_patterns UNINDEXED,
repeat_count UNINDEXED, repeat_count UNINDEXED,
out_of_order UNINDEXED, out_of_order UNINDEXED,
tokenize = 'porter ascii' tokenize = 'porter ascii'
); )
""") """)
# Only insert rows not already indexed conn.execute("""
conn.execute(""" INSERT INTO log_fts(text, entry_id, source_id, sequence, severity,
INSERT INTO log_fts(text, entry_id, source_id, sequence, severity, timestamp_iso, matched_patterns,
timestamp_iso, matched_patterns, repeat_count, out_of_order)
repeat_count, out_of_order) SELECT e.text, e.id, e.source_id, e.sequence, e.severity,
SELECT e.text, e.id, e.source_id, e.sequence, e.severity, e.timestamp_iso, e.matched_patterns,
e.timestamp_iso, e.matched_patterns, e.repeat_count, e.out_of_order
e.repeat_count, e.out_of_order FROM log_entries e
FROM log_entries e WHERE e.id NOT IN (SELECT entry_id FROM log_fts WHERE entry_id IS NOT NULL)
WHERE e.id NOT IN (SELECT entry_id FROM log_fts WHERE entry_id IS NOT NULL) """)
""") conn.commit()
conn.commit()
conn.close()
def _sanitize_fts_query(raw: str, or_mode: bool = False) -> str: def _sanitize_fts_query(raw: str, or_mode: bool = False) -> str:
@ -96,117 +102,188 @@ def search(
limit: int = 20, limit: int = 20,
include_repeats: bool = False, include_repeats: bool = False,
or_mode: bool = False, or_mode: bool = False,
semantic: bool = False,
) -> list[SearchResult]: ) -> list[SearchResult]:
"""Full-text search with optional filters. Returns results ranked by relevance.""" """Full-text search with optional filters. Returns results ranked by relevance.
conn = sqlite3.connect(str(db_path))
conn.execute("PRAGMA journal_mode=WAL")
conn.row_factory = sqlite3.Row
When ``semantic=True`` and an embedding backend is configured, the BM25
candidate pool is re-ranked using hybrid scoring (BM25 + cosine similarity).
Falls back silently to pure BM25 when the embedder is unavailable.
"""
if semantic:
return _hybrid_search(
db_path, query, severity=severity, source_filter=source_filter,
pattern_filter=pattern_filter, since=since, until=until, limit=limit,
include_repeats=include_repeats, or_mode=or_mode,
)
return _bm25_search(
db_path, query, severity=severity, source_filter=source_filter,
pattern_filter=pattern_filter, since=since, until=until, limit=limit,
include_repeats=include_repeats, or_mode=or_mode,
)
def _hybrid_search(
db_path: Path,
query: str,
severity: str | None = None,
source_filter: str | None = None,
pattern_filter: str | None = None,
since: str | None = None,
until: str | None = None,
limit: int = 20,
include_repeats: bool = False,
or_mode: bool = False,
alpha: float = 0.6,
beta: float = 0.4,
) -> list[SearchResult]:
"""BM25 + vector re-ranking (late-fusion hybrid search).
Fetches an oversized BM25 candidate pool, embeds the query and each
candidate text in-process, then combines scores:
hybrid_score = alpha * bm25_normalized + beta * cosine_sim
BM25 normalization: FTS5 rank is negative (more negative = better match).
We flip the sign and divide by the pool maximum so all BM25 scores land
in (0, 1]: 1.0 for the top BM25 hit, approaching 0 for the weakest.
Falls back to pure BM25 when the embedding backend is unavailable.
"""
from app.services.embeddings import EMBEDDING_AVAILABLE, cosine_similarity, get_embedder
# Fetch a large candidate pool — 5x limit, minimum 100 entries.
pool_limit = max(limit * 5, 100)
candidates = _bm25_search(
db_path, query, severity=severity, source_filter=source_filter,
pattern_filter=pattern_filter, since=since, until=until,
limit=pool_limit, include_repeats=include_repeats, or_mode=or_mode,
)
if not candidates:
return []
if not EMBEDDING_AVAILABLE:
return candidates[:limit]
embedder = get_embedder()
if embedder is None:
return candidates[:limit]
try:
query_vec = embedder.embed(query)
candidate_vecs = embedder.embed_batch([r.text for r in candidates])
except Exception as exc:
logger.warning("Hybrid search embedding failed (%s) — falling back to BM25", exc)
return candidates[:limit]
# Normalize BM25 ranks: FTS5 rank is negative, flip and scale to [0, 1].
abs_ranks = [abs(r.rank) for r in candidates]
max_rank = max(abs_ranks) or 1.0
scored: list[tuple[float, SearchResult]] = []
for result, abs_rank, cand_vec in zip(candidates, abs_ranks, candidate_vecs):
bm25_norm = abs_rank / max_rank
cos_sim = cosine_similarity(query_vec, cand_vec)
hybrid = alpha * bm25_norm + beta * max(cos_sim, 0.0)
scored.append((hybrid, result))
scored.sort(key=lambda x: x[0], reverse=True)
return [r for _, r in scored[:limit]]
def _bm25_search(
db_path: Path,
query: str,
severity: str | None = None,
source_filter: str | None = None,
pattern_filter: str | None = None,
since: str | None = None,
until: str | None = None,
limit: int = 20,
include_repeats: bool = False,
or_mode: bool = False,
) -> list[SearchResult]:
"""FTS search — BM25 via FTS5 (SQLite) or tsvector (Postgres)."""
tid = resolve_tenant_id()
if BACKEND == Backend.POSTGRES:
return _pg_fts_search(
db_path, query, tid,
severity=severity, source_filter=source_filter,
pattern_filter=pattern_filter, since=since, until=until,
limit=limit, include_repeats=include_repeats,
)
return _sqlite_fts_search(
db_path, query, tid,
severity=severity, source_filter=source_filter,
pattern_filter=pattern_filter, since=since, until=until,
limit=limit, include_repeats=include_repeats, or_mode=or_mode,
)
def _sqlite_fts_search(
db_path: Path,
query: str,
tid: str,
severity: str | None,
source_filter: str | None,
pattern_filter: str | None,
since: str | None,
until: str | None,
limit: int,
include_repeats: bool,
or_mode: bool,
) -> list[SearchResult]:
fts_query = _sanitize_fts_query(query, or_mode=or_mode) fts_query = _sanitize_fts_query(query, or_mode=or_mode)
conditions = ["log_fts MATCH ?"] conditions = [
params: list = [fts_query] "log_fts MATCH ?",
"(e.tenant_id = ? OR e.tenant_id = '')",
]
params: list = [fts_query, tid]
if severity: if severity:
conditions.append("severity = ?") conditions.append("f.severity = ?")
params.append(severity.upper()) params.append(severity.upper())
if source_filter: if source_filter:
conditions.append("source_id LIKE ?") conditions.append("f.source_id LIKE ?")
params.append(f"%{source_filter}%") params.append(f"%{source_filter}%")
if pattern_filter: if pattern_filter:
conditions.append("matched_patterns LIKE ?") conditions.append("f.matched_patterns LIKE ?")
params.append(f'%"{pattern_filter}"%') params.append(f'%"{pattern_filter}"%')
if since: if since:
conditions.append("timestamp_iso >= ?") conditions.append("f.timestamp_iso >= ?")
params.append(since) params.append(since)
if until: if until:
conditions.append("timestamp_iso <= ?") conditions.append("f.timestamp_iso <= ?")
params.append(until) params.append(until)
if not include_repeats: if not include_repeats:
conditions.append("repeat_count = 1") conditions.append("f.repeat_count = 1")
where = " AND ".join(conditions) where = " AND ".join(conditions)
params.append(limit) params.append(limit)
raw = sqlite3.connect(str(db_path), timeout=30.0)
raw.row_factory = sqlite3.Row
try: try:
rows = conn.execute( rows = raw.execute(
f""" f"""
SELECT entry_id, source_id, sequence, timestamp_iso, severity, SELECT f.entry_id, f.source_id, f.sequence, f.timestamp_iso, f.severity,
repeat_count, out_of_order, matched_patterns, text, rank f.repeat_count, f.out_of_order, f.matched_patterns, f.text, f.rank
FROM log_fts FROM log_fts f
JOIN log_entries e ON e.id = f.entry_id
WHERE {where} WHERE {where}
ORDER BY rank ORDER BY f.rank
LIMIT ? LIMIT ?
""", """,
params, params,
).fetchall() ).fetchall()
except sqlite3.OperationalError as e: except sqlite3.OperationalError as exc:
logger.warning("FTS query failed (%s) — index may not be built yet", e) logger.warning("FTS query failed (%s) — index may not be built yet", exc)
conn.close()
return [] return []
finally:
results = [ raw.close()
SearchResult(
entry_id=r["entry_id"],
source_id=r["source_id"],
sequence=r["sequence"],
timestamp_iso=r["timestamp_iso"],
severity=r["severity"],
repeat_count=r["repeat_count"],
out_of_order=bool(r["out_of_order"]),
matched_patterns=json.loads(r["matched_patterns"] or "[]"),
text=r["text"],
rank=r["rank"],
)
for r in rows
]
conn.close()
return results
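Note: the ambiguity that the `f.` prefixes resolve can be reproduced without FTS5. This is a small sketch using two ordinary in-memory tables; `log_mirror` is a hypothetical stand-in for the `log_fts` virtual table, chosen so the example runs on SQLite builds without FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE log_entries (id INTEGER PRIMARY KEY, tenant_id TEXT, severity TEXT);
    CREATE TABLE log_mirror (entry_id INTEGER, severity TEXT, text TEXT);
    INSERT INTO log_entries VALUES (1, 't1', 'ERROR');
    INSERT INTO log_mirror VALUES (1, 'ERROR', 'disk full');
    """
)

JOIN = "FROM log_mirror f JOIN log_entries e ON e.id = f.entry_id"


def run(where: str) -> list[tuple]:
    """Run the joined query with a single severity parameter."""
    return conn.execute(f"SELECT f.text {JOIN} WHERE {where}", ("ERROR",)).fetchall()


# Both tables have a severity column, so the bare name cannot be resolved.
try:
    run("severity = ?")
    ambiguous_error = None
except sqlite3.OperationalError as exc:
    ambiguous_error = str(exc)

# Qualifying the column with the table alias removes the ambiguity.
qualified_rows = run("f.severity = ?")
```

Because the failure surfaces as `sqlite3.OperationalError`, the same exception the search path catches for a missing index, an unqualified filter is easy to mistake for an unbuilt index in the logs.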
def entries_in_window(
db_path: Path,
since: str | None,
until: str | None,
severity: str | None = None,
limit: int = 100,
) -> list[SearchResult]:
"""Return log entries within a time window using a plain SQL scan (no FTS).
Used as a fallback when keyword search returns nothing; ensures incident
detail always shows the raw log activity in the window even if no keywords match.
"""
conn = sqlite3.connect(str(db_path))
conn.execute("PRAGMA journal_mode=WAL")
conn.row_factory = sqlite3.Row
conditions: list[str] = ["repeat_count = 1"]
params: list = []
if since:
conditions.append("timestamp_iso >= ?")
params.append(since)
if until:
conditions.append("timestamp_iso <= ?")
params.append(until)
if severity:
conditions.append("severity = ?")
params.append(severity.upper())
where = " AND ".join(conditions)
params.append(limit)
rows = conn.execute(
f"""
SELECT id as entry_id, source_id, sequence, timestamp_iso, severity,
repeat_count, out_of_order, matched_patterns, text, 0.0 as rank
FROM log_entries
WHERE {where}
ORDER BY timestamp_iso ASC
LIMIT ?
""",
params,
).fetchall()
conn.close()
return [ return [
SearchResult( SearchResult(
@@ -219,7 +296,178 @@ def entries_in_window(
out_of_order=bool(r["out_of_order"]), out_of_order=bool(r["out_of_order"]),
matched_patterns=json.loads(r["matched_patterns"] or "[]"), matched_patterns=json.loads(r["matched_patterns"] or "[]"),
text=r["text"], text=r["text"],
rank=r["rank"], rank=float(r["rank"]),
)
for r in rows
]
def _pg_fts_search(
db_path: Path,
query: str,
tid: str,
severity: str | None,
source_filter: str | None,
pattern_filter: str | None,
since: str | None,
until: str | None,
limit: int,
include_repeats: bool,
) -> list[SearchResult]:
"""Postgres FTS via tsvector column and websearch_to_tsquery."""
tsq = "websearch_to_tsquery('english', %s)"
conditions = [
f"text_tsv @@ {tsq}",
"(tenant_id = %s OR tenant_id = '')",
]
params: list = [query, tid]
if severity:
conditions.append("severity = %s")
params.append(severity.upper())
if source_filter:
conditions.append("source_id LIKE %s")
params.append(f"%{source_filter}%")
if pattern_filter:
conditions.append("matched_patterns LIKE %s")
params.append(f'%"{pattern_filter}"%')
if since:
conditions.append("timestamp_iso >= %s")
params.append(since)
if until:
conditions.append("timestamp_iso <= %s")
params.append(until)
if not include_repeats:
conditions.append("repeat_count = 1")
where = " AND ".join(conditions)
# ts_rank needs the tsquery again — append it then the limit
params.extend([query, limit])
with get_conn(db_path) as conn:
rows = conn.execute(
f"""
SELECT id AS entry_id, source_id, sequence, timestamp_iso, severity,
repeat_count, out_of_order, matched_patterns, text,
ts_rank(text_tsv, {tsq}) AS rank
FROM log_entries
WHERE {where}
ORDER BY rank DESC
LIMIT %s
""",
params,
).fetchall()
return [
SearchResult(
entry_id=r["entry_id"],
source_id=r["source_id"],
sequence=r["sequence"],
timestamp_iso=r["timestamp_iso"],
severity=r["severity"],
repeat_count=r["repeat_count"],
out_of_order=bool(r["out_of_order"]),
matched_patterns=json.loads(r["matched_patterns"] or "[]"),
text=r["text"],
rank=float(r["rank"]),
)
for r in rows
]
def entries_in_window(
db_path: Path,
since: str | None,
until: str | None,
severity: str | None = None,
source_filter: str | None = None,
limit: int = 100,
per_source_cap: int | None = None,
) -> list[SearchResult]:
"""Return log entries within a time window using a plain SQL scan (no FTS).
Used as a fallback when keyword search returns nothing; ensures incident
detail always shows the raw log activity in the window even if no keywords match.
per_source_cap: when set, limits rows per source_id so high-volume sources
(e.g. network-syslog) don't crowd out lower-volume but more interesting ones.
Errors/warnings are ranked first within each source partition.
"""
tid = resolve_tenant_id()
conditions: list[str] = [
"repeat_count = 1",
"(tenant_id = ? OR tenant_id = '')",
]
params: list = [tid]
if since:
conditions.append("timestamp_iso >= ?")
params.append(since)
if until:
conditions.append("timestamp_iso <= ?")
params.append(until)
if severity:
conditions.append("severity = ?")
params.append(severity.upper())
if source_filter:
conditions.append("source_id LIKE ?")
params.append(f"%{source_filter}%")
where = " AND ".join(conditions)
if per_source_cap is not None:
sql = f"""
WITH ranked AS (
SELECT id as entry_id, source_id, sequence, timestamp_iso, severity,
repeat_count, out_of_order, matched_patterns, text, 0.0 as rank,
ROW_NUMBER() OVER (
PARTITION BY source_id
ORDER BY
CASE UPPER(severity)
WHEN 'CRITICAL' THEN 0
WHEN 'ERROR' THEN 1
WHEN 'WARN' THEN 2
WHEN 'WARNING' THEN 2
ELSE 3
END,
timestamp_iso
) AS rn
FROM log_entries
WHERE {where}
)
SELECT entry_id, source_id, sequence, timestamp_iso, severity,
repeat_count, out_of_order, matched_patterns, text, rank
FROM ranked
WHERE rn <= ?
ORDER BY timestamp_iso ASC
LIMIT ?
"""
params.extend([per_source_cap, limit])
else:
sql = f"""
SELECT id as entry_id, source_id, sequence, timestamp_iso, severity,
repeat_count, out_of_order, matched_patterns, text, 0.0 as rank
FROM log_entries
WHERE {where}
ORDER BY timestamp_iso ASC
LIMIT ?
"""
params.append(limit)
with get_conn(db_path) as conn:
rows = conn.execute(sql, params).fetchall()
return [
SearchResult(
entry_id=r["entry_id"],
source_id=r["source_id"],
sequence=r["sequence"],
timestamp_iso=r["timestamp_iso"],
severity=r["severity"],
repeat_count=r["repeat_count"],
out_of_order=bool(r["out_of_order"]),
matched_patterns=json.loads(r["matched_patterns"] or "[]"),
text=r["text"],
rank=float(r["rank"]),
) )
for r in rows for r in rows
] ]
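Note: the `per_source_cap` branch relies on `ROW_NUMBER()` partitioned by source, which needs SQLite 3.25 or newer. The sketch below runs the same ranking against a reduced, hypothetical schema (only the columns the ranking touches) to show how a noisy source is capped while its most severe rows survive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    "CREATE TABLE log_entries ("
    "id INTEGER PRIMARY KEY, source_id TEXT, severity TEXT, timestamp_iso TEXT)"
)
conn.executemany(
    "INSERT INTO log_entries (source_id, severity, timestamp_iso) VALUES (?, ?, ?)",
    [
        ("syslog", "INFO", "2026-06-17T10:00:00"),
        ("syslog", "INFO", "2026-06-17T10:00:01"),
        ("syslog", "ERROR", "2026-06-17T10:00:02"),
        ("app", "WARN", "2026-06-17T10:00:03"),
    ],
)


def capped(per_source_cap: int, limit: int) -> list[tuple]:
    """Keep at most per_source_cap rows per source, most severe first."""
    sql = """
        WITH ranked AS (
            SELECT source_id, severity, timestamp_iso,
                   ROW_NUMBER() OVER (
                       PARTITION BY source_id
                       ORDER BY CASE UPPER(severity)
                                    WHEN 'CRITICAL' THEN 0
                                    WHEN 'ERROR' THEN 1
                                    WHEN 'WARN' THEN 2
                                    ELSE 3
                                END,
                                timestamp_iso
                   ) AS rn
            FROM log_entries
        )
        SELECT source_id, severity, timestamp_iso
        FROM ranked
        WHERE rn <= ?
        ORDER BY timestamp_iso ASC
        LIMIT ?
    """
    return [tuple(r) for r in conn.execute(sql, (per_source_cap, limit)).fetchall()]


rows = capped(per_source_cap=2, limit=10)
```

With a cap of 2, the later `INFO` row from `syslog` is dropped while its `ERROR` row is kept, and the final result is still returned in timestamp order.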
@@ -238,16 +486,14 @@ def recent_source_errors(
Bypasses FTS ranking so text content doesn't affect which errors surface. Bypasses FTS ranking so text content doesn't affect which errors surface.
Used by diagnose when FTS keyword search returns nothing for a known source. Used by diagnose when FTS keyword search returns nothing for a known source.
""" """
conn = sqlite3.connect(str(db_path)) tid = resolve_tenant_id()
conn.execute("PRAGMA journal_mode=WAL")
conn.row_factory = sqlite3.Row
conditions = [ conditions = [
"source_id LIKE ?", "source_id LIKE ?",
"severity = ?", "severity = ?",
"repeat_count = 1", "repeat_count = 1",
"(tenant_id = ? OR tenant_id = '')",
] ]
params: list = [f"%{source_filter}%", severity.upper()] params: list = [f"%{source_filter}%", severity.upper(), tid]
if since: if since:
conditions.append("timestamp_iso >= ?") conditions.append("timestamp_iso >= ?")
@@ -259,18 +505,18 @@ def recent_source_errors(
params.append(limit) params.append(limit)
where = " AND ".join(conditions) where = " AND ".join(conditions)
rows = conn.execute( with get_conn(db_path) as conn:
f""" rows = conn.execute(
SELECT id as entry_id, source_id, sequence, timestamp_iso, severity, f"""
repeat_count, out_of_order, matched_patterns, text, 0.0 as rank SELECT id as entry_id, source_id, sequence, timestamp_iso, severity,
FROM log_entries repeat_count, out_of_order, matched_patterns, text, 0.0 as rank
WHERE {where} FROM log_entries
ORDER BY timestamp_iso DESC WHERE {where}
LIMIT ? ORDER BY timestamp_iso DESC
""", LIMIT ?
params, """,
).fetchall() params,
conn.close() ).fetchall()
return [ return [
SearchResult( SearchResult(
@@ -283,81 +529,157 @@ def recent_source_errors(
out_of_order=bool(r["out_of_order"]), out_of_order=bool(r["out_of_order"]),
matched_patterns=json.loads(r["matched_patterns"] or "[]"), matched_patterns=json.loads(r["matched_patterns"] or "[]"),
text=r["text"], text=r["text"],
rank=r["rank"], rank=float(r["rank"]),
) )
for r in rows for r in rows
] ]
def list_sources(db_path: Path) -> list[dict]: def list_sources(db_path: Path) -> list[dict]:
"""Return distinct sources with entry counts and time ranges.""" """Return sources with entry counts, grouped by prefix:host stem.
conn = sqlite3.connect(str(db_path))
conn.execute("PRAGMA journal_mode=WAL") source_ids with three or more colon-separated segments (e.g.
rows = conn.execute(""" ``muninn-journal:Muninn:ssh.service``) are collapsed to their first two
SELECT segments (``muninn-journal:Muninn``). Single- or two-segment IDs are
source_id, returned as-is. ``unit_count`` reports how many distinct sub-units were
COUNT(*) as entry_count, merged into each row.
MIN(timestamp_iso) as earliest, """
MAX(timestamp_iso) as latest, tid = resolve_tenant_id()
SUM(CASE WHEN severity IN ('ERROR','CRITICAL','EMERGENCY','ALERT') THEN 1 ELSE 0 END) as error_count group_expr = frag.source_group_expr("source_id")
FROM log_entries with get_conn(db_path) as conn:
GROUP BY source_id rows = conn.execute(
ORDER BY entry_count DESC f"""
""").fetchall() SELECT
conn.close() {group_expr} AS group_id,
COUNT(DISTINCT source_id) AS unit_count,
COUNT(*) AS entry_count,
MIN(timestamp_iso) AS earliest,
MAX(timestamp_iso) AS latest,
SUM(CASE WHEN severity IN ('ERROR','CRITICAL','EMERGENCY','ALERT')
THEN 1 ELSE 0 END) AS error_count
FROM log_entries
WHERE (tenant_id = ? OR tenant_id = '')
GROUP BY group_id
ORDER BY entry_count DESC
""",
(tid,),
).fetchall()
return [ return [
{ {
"source_id": r[0], "source_id": r["group_id"],
"entry_count": r[1], "unit_count": r["unit_count"],
"earliest": r[2], "entry_count": r["entry_count"],
"latest": r[3], "earliest": r["earliest"],
"error_count": r[4], "latest": r["latest"],
"error_count": r["error_count"],
} }
for r in rows for r in rows
] ]
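Note: the grouping rule in the new `list_sources` docstring can be expressed in plain Python. `frag.source_group_expr` is not shown in this diff, so the sketch below is an assumed Python equivalent of the rule as the docstring describes it, not the SQL fragment itself; `source_group` and `group_counts` are hypothetical names.

```python
from __future__ import annotations

from collections import defaultdict


def source_group(source_id: str) -> str:
    """Collapse IDs with three or more colon segments to their first two."""
    parts = source_id.split(":")
    return ":".join(parts[:2]) if len(parts) >= 3 else source_id


def group_counts(source_ids: list[str]) -> dict[str, dict[str, int]]:
    """Compute unit_count (distinct source_ids) and entry_count per group."""
    units: dict[str, set[str]] = defaultdict(set)
    entries: dict[str, int] = defaultdict(int)
    for sid in source_ids:
        group = source_group(sid)
        units[group].add(sid)
        entries[group] += 1
    return {
        g: {"unit_count": len(units[g]), "entry_count": entries[g]} for g in entries
    }


summary = group_counts(
    [
        "muninn-journal:Muninn:ssh.service",
        "muninn-journal:Muninn:ssh.service",
        "muninn-journal:Muninn:cron.service",
        "nginx:web1",
    ]
)
```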
def stats_summary(db_path: Path, window_hours: int = 24) -> dict: def _compile_overrides(overrides: list[dict]) -> list[tuple[re.Pattern[str], str]]:
"""Return (compiled_pattern, override_severity) pairs for enabled rules."""
compiled = []
for rule in overrides:
if not rule.get("enabled", True):
continue
try:
compiled.append((re.compile(rule["pattern"], re.IGNORECASE), rule["override_severity"]))
except re.error:
pass
return compiled
def _apply_overrides(text: str, original_severity: str, rules: list[tuple[re.Pattern[str], str]]) -> str:
for pattern, new_sev in rules:
if pattern.search(text):
return new_sev
return original_severity
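Note: the two override helpers above behave as first-match-wins over the enabled, valid rules. The sketch below restates them under public names (an illustration, not the module's API) with a small rule list that includes a disabled rule and an invalid pattern, to show which rules actually take effect.

```python
from __future__ import annotations

import re


def compile_overrides(overrides: list[dict]) -> list[tuple[re.Pattern[str], str]]:
    """Compile enabled rules; rules with an invalid regex are skipped."""
    compiled = []
    for rule in overrides:
        if not rule.get("enabled", True):
            continue
        try:
            pattern = re.compile(rule["pattern"], re.IGNORECASE)
        except re.error:
            continue  # invalid regex: ignore this rule
        compiled.append((pattern, rule["override_severity"]))
    return compiled


def apply_overrides(text: str, original: str, rules) -> str:
    """Return the severity of the first matching rule, else the original."""
    for pattern, new_severity in rules:
        if pattern.search(text):
            return new_severity
    return original


rules = compile_overrides(
    [
        {"pattern": r"health[- ]check", "override_severity": "INFO"},
        {"pattern": r"(unclosed", "override_severity": "DEBUG"},  # invalid regex
        {"pattern": r"timeout", "override_severity": "WARN", "enabled": False},
    ]
)
```

Since invalid patterns are dropped silently, a rule with a typo simply never fires; logging the skipped pattern would make that visible to whoever maintains the rule list.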
def stats_summary(db_path: Path, window_hours: int = 24, severity_overrides: list[dict] | None = None) -> dict:
"""Return aggregate health stats for the dashboard. """Return aggregate health stats for the dashboard.
Queries plain log_entries (not FTS) so it works even before the index is built. Queries plain log_entries (not FTS) so it works even before the index is built.
""" """
conn = sqlite3.connect(str(db_path)) rules = _compile_overrides(severity_overrides or [])
conn.execute("PRAGMA journal_mode=WAL") tid = resolve_tenant_id()
conn.row_factory = sqlite3.Row group_expr = frag.source_group_expr("source_id")
since_iso = (
datetime.now(timezone.utc) - timedelta(hours=window_hours)
).strftime("%Y-%m-%dT%H:%M:%S")
since_expr = f"strftime('%Y-%m-%dT%H:%M:%S', 'now', '-{window_hours} hours')" with get_conn(db_path) as conn:
row = conn.execute(
"""
SELECT
COUNT(*) AS total,
SUM(CASE WHEN severity = 'CRITICAL' THEN 1 ELSE 0 END) AS criticals,
SUM(CASE WHEN severity IN ('ERROR','CRITICAL','EMERGENCY','ALERT') THEN 1 ELSE 0 END) AS errors
FROM log_entries
WHERE timestamp_iso >= ?
AND repeat_count = 1
AND (tenant_id = ? OR tenant_id = '')
""",
(since_iso, tid),
).fetchone()
total_24h = int(row["total"] or 0)
criticals_24h = int(row["criticals"] or 0)
errors_24h = int(row["errors"] or 0)
# Overall counts in window source_rows = conn.execute(
row = conn.execute(f""" f"""
SELECT SELECT
COUNT(*) AS total, {group_expr} AS group_id,
SUM(CASE WHEN severity = 'CRITICAL' THEN 1 ELSE 0 END) AS criticals, COUNT(*) AS entry_count,
SUM(CASE WHEN severity IN ('ERROR','CRITICAL','EMERGENCY','ALERT') THEN 1 ELSE 0 END) AS errors SUM(CASE WHEN severity IN ('ERROR','CRITICAL','EMERGENCY','ALERT') THEN 1 ELSE 0 END) AS error_count,
FROM log_entries MAX(timestamp_iso) AS latest
WHERE timestamp_iso >= {since_expr} FROM log_entries
AND repeat_count = 1 WHERE timestamp_iso >= ?
""").fetchone() AND repeat_count = 1
total_24h = int(row["total"] or 0) AND (tenant_id = ? OR tenant_id = '')
criticals_24h = int(row["criticals"] or 0) GROUP BY group_id
errors_24h = int(row["errors"] or 0) ORDER BY error_count DESC, entry_count DESC
""",
(since_iso, tid),
).fetchall()
crit_rows = conn.execute(
"""
SELECT id as entry_id, source_id, timestamp_iso, severity, text
FROM log_entries
WHERE severity = 'CRITICAL'
AND repeat_count = 1
AND (tenant_id = ? OR tenant_id = '')
ORDER BY timestamp_iso DESC
LIMIT 25
""",
(tid,),
).fetchall()
timeline_rows = conn.execute(
"""
SELECT id as entry_id, source_id, timestamp_iso, severity, text
FROM log_entries
WHERE severity IN ('CRITICAL','ERROR','WARN','WARNING','EMERGENCY','ALERT')
AND timestamp_iso >= ?
AND timestamp_iso IS NOT NULL
AND repeat_count = 1
AND (tenant_id = ? OR tenant_id = '')
ORDER BY timestamp_iso DESC
LIMIT 300
""",
(since_iso, tid),
).fetchall()
last_row = conn.execute(
"SELECT MAX(ingest_time) AS t FROM log_entries WHERE (tenant_id = ? OR tenant_id = '')",
(tid,),
).fetchone()
# Per-source breakdown
source_rows = conn.execute(f"""
SELECT
source_id,
COUNT(*) AS entry_count,
SUM(CASE WHEN severity IN ('ERROR','CRITICAL','EMERGENCY','ALERT') THEN 1 ELSE 0 END) AS error_count,
MAX(timestamp_iso) AS latest
FROM log_entries
WHERE timestamp_iso >= {since_expr}
AND repeat_count = 1
GROUP BY source_id
ORDER BY error_count DESC, entry_count DESC
""").fetchall()
source_health = [ source_health = [
{ {
"source_id": r["source_id"], "source_id": r["group_id"],
"entry_count": int(r["entry_count"]), "entry_count": int(r["entry_count"]),
"error_count": int(r["error_count"]), "error_count": int(r["error_count"]),
"latest": r["latest"], "latest": r["latest"],
@@ -365,16 +687,24 @@ def stats_summary(db_path: Path, window_hours: int = 24) -> dict:
for r in source_rows for r in source_rows
] ]
# 5 most recent critical entries suppressed = 0
crit_rows = conn.execute(""" recent_criticals = []
SELECT id as entry_id, source_id, sequence, timestamp_iso, severity, for r in crit_rows:
repeat_count, out_of_order, matched_patterns, text, 0.0 as rank effective = _apply_overrides(r["text"], r["severity"], rules)
FROM log_entries if effective == "CRITICAL":
WHERE severity = 'CRITICAL' AND repeat_count = 1 recent_criticals.append({
ORDER BY timestamp_iso DESC "entry_id": r["entry_id"],
LIMIT 5 "source_id": r["source_id"],
""").fetchall() "timestamp_iso": r["timestamp_iso"],
recent_criticals = [ "severity": r["severity"],
"text": r["text"],
})
if len(recent_criticals) == 5:
break
else:
suppressed += 1
timeline_events = [
{ {
"entry_id": r["entry_id"], "entry_id": r["entry_id"],
"source_id": r["source_id"], "source_id": r["source_id"],
@@ -382,10 +712,10 @@ def stats_summary(db_path: Path, window_hours: int = 24) -> dict:
"severity": r["severity"], "severity": r["severity"],
"text": r["text"], "text": r["text"],
} }
for r in crit_rows for r in timeline_rows
] ]
conn.close() last_gleaned: str | None = last_row["t"] if last_row else None
return { return {
"window_hours": window_hours, "window_hours": window_hours,
@@ -394,6 +724,9 @@ def stats_summary(db_path: Path, window_hours: int = 24) -> dict:
"errors_24h": errors_24h, "errors_24h": errors_24h,
"source_health": source_health, "source_health": source_health,
"recent_criticals": recent_criticals, "recent_criticals": recent_criticals,
"suppressed_criticals": suppressed,
"last_gleaned": last_gleaned,
"timeline_events": timeline_events,
} }

265
app/services/ssh_targets.py Normal file

@@ -0,0 +1,265 @@
"""SSH target registry — persisted in the main SQLite DB.
Targets are stored as path references only. The private key is never
read into the database, logged, or returned by any API response.
"""
from __future__ import annotations
import os
import sqlite3
import stat
import time
import uuid
from dataclasses import dataclass
from pathlib import Path
from typing import Any
@dataclass
class SshTarget:
id: str
label: str
host: str
port: int
user: str
key_path: str
last_tested: str | None
last_ok: bool | None
last_error: str | None
created_at: str
updated_at: str
def _row_to_target(row: tuple) -> SshTarget:
return SshTarget(
id=row[0],
label=row[1],
host=row[2],
port=row[3],
user=row[4],
key_path=row[5],
last_tested=row[6],
last_ok=bool(row[7]) if row[7] is not None else None,
last_error=row[8],
created_at=row[9],
updated_at=row[10],
)
def _now() -> str:
return time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
# ---------------------------------------------------------------------------
# CRUD
# ---------------------------------------------------------------------------
def list_targets(db_path: Path) -> list[SshTarget]:
conn = sqlite3.connect(str(db_path), timeout=10)
rows = conn.execute(
"SELECT id, label, host, port, user, key_path, last_tested, last_ok, last_error, created_at, updated_at "
"FROM ssh_targets ORDER BY label"
).fetchall()
conn.close()
return [_row_to_target(r) for r in rows]
def get_target(db_path: Path, target_id: str) -> SshTarget | None:
conn = sqlite3.connect(str(db_path), timeout=10)
row = conn.execute(
"SELECT id, label, host, port, user, key_path, last_tested, last_ok, last_error, created_at, updated_at "
"FROM ssh_targets WHERE id = ?",
(target_id,),
).fetchone()
conn.close()
return _row_to_target(row) if row else None
def create_target(
db_path: Path,
label: str,
host: str,
port: int,
user: str,
key_path: str,
) -> SshTarget:
resolved = _validate_key_path(key_path)
now = _now()
target_id = str(uuid.uuid4())
conn = sqlite3.connect(str(db_path), timeout=10)
conn.execute(
"INSERT INTO ssh_targets (id, label, host, port, user, key_path, created_at, updated_at) "
"VALUES (?,?,?,?,?,?,?,?)",
(target_id, label, host, port, user, str(resolved), now, now),
)
conn.commit()
conn.close()
return get_target(db_path, target_id) # type: ignore[return-value]
def update_target(
db_path: Path,
target_id: str,
*,
label: str | None = None,
host: str | None = None,
port: int | None = None,
user: str | None = None,
key_path: str | None = None,
) -> SshTarget | None:
existing = get_target(db_path, target_id)
if existing is None:
return None
resolved_key = str(_validate_key_path(key_path)) if key_path else existing.key_path
conn = sqlite3.connect(str(db_path), timeout=10)
conn.execute(
"UPDATE ssh_targets SET label=?, host=?, port=?, user=?, key_path=?, updated_at=? WHERE id=?",
(
label if label is not None else existing.label,
host if host is not None else existing.host,
port if port is not None else existing.port,
user if user is not None else existing.user,
resolved_key,
_now(),
target_id,
),
)
conn.commit()
conn.close()
return get_target(db_path, target_id)
def delete_target(db_path: Path, target_id: str) -> bool:
conn = sqlite3.connect(str(db_path), timeout=10)
cur = conn.execute("DELETE FROM ssh_targets WHERE id = ?", (target_id,))
conn.commit()
conn.close()
return cur.rowcount > 0
# ---------------------------------------------------------------------------
# Test connection
# ---------------------------------------------------------------------------
def test_connection(db_path: Path, target_id: str) -> dict[str, Any]:
"""Attempt an SSH no-op and record the result.
Runs `true` on the remote host; no data is pulled. Returns
{ok: bool, error: str|null, tested_at: str}.
"""
target = get_target(db_path, target_id)
if target is None:
raise KeyError(f"SSH target {target_id!r} not found")
# Lazy import — paramiko is optional
try:
from paramiko import SSHClient, AutoAddPolicy, AuthenticationException, SSHException
except ImportError:
_record_test(db_path, target_id, ok=False, error="paramiko not installed")
return {"ok": False, "error": "paramiko not installed — run: pip install paramiko", "tested_at": _now()}
key_path = str(Path(target.key_path).expanduser())
error: str | None = None
ok = False
try:
client = SSHClient()
client.set_missing_host_key_policy(AutoAddPolicy())
client.connect(
hostname=target.host,
port=target.port,
username=target.user,
key_filename=key_path,
timeout=10,
banner_timeout=10,
)
stdin, stdout, stderr = client.exec_command("true", timeout=10)
exit_code = stdout.channel.recv_exit_status()
client.close()
ok = exit_code == 0
if not ok:
error = f"Remote command exited with code {exit_code}"
except AuthenticationException:
error = f"Authentication failed — check key path and remote authorized_keys"
except SSHException as exc:
error = f"SSH error: {exc}"
except OSError as exc:
error = f"Connection failed: {exc}"
except Exception as exc:
error = f"Unexpected error: {exc}"
tested_at = _now()
_record_test(db_path, target_id, ok=ok, error=error, tested_at=tested_at)
return {"ok": ok, "error": error, "tested_at": tested_at}
def _record_test(
db_path: Path,
target_id: str,
*,
ok: bool,
error: str | None,
tested_at: str | None = None,
) -> None:
if tested_at is None:
tested_at = _now()
conn = sqlite3.connect(str(db_path), timeout=10)
conn.execute(
"UPDATE ssh_targets SET last_tested=?, last_ok=?, last_error=?, updated_at=? WHERE id=?",
(tested_at, 1 if ok else 0, error, _now(), target_id),
)
conn.commit()
conn.close()
# ---------------------------------------------------------------------------
# Validation
# ---------------------------------------------------------------------------
def _validate_key_path(raw: str) -> Path:
"""Resolve and validate the SSH key path.
Returns the resolved Path. Raises ValueError with a user-readable message
if the file is missing or is not a regular file. Overly open permissions do
not raise here; key_path_warning() reports them so the UI can show a
non-blocking warning.
"""
p = Path(raw).expanduser()
if not p.exists():
raise ValueError(f"Key file not found: {p}")
if not p.is_file():
raise ValueError(f"Key path is not a file: {p}")
return p
def key_path_warning(key_path: str) -> str | None:
"""Return a warning string if the key file has overly permissive mode, else None."""
try:
p = Path(key_path).expanduser()
mode = p.stat().st_mode
if mode & (stat.S_IRGRP | stat.S_IWGRP | stat.S_IROTH | stat.S_IWOTH):
perms = oct(mode & 0o777)
return f"Key file permissions are too open ({perms}). SSH may refuse to use it — run: chmod 600 {p}"
except OSError:
pass
return None
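Note: the permission test in `key_path_warning` is a bitmask check on the file mode. The following sketch isolates that check and exercises it on a temporary file. It assumes a POSIX filesystem where `chmod` is honoured; `is_too_open` and `demo` are hypothetical names used only for this illustration.

```python
import os
import stat
import tempfile

# Same bits the warning checks: group/other read and write.
_OPEN_BITS = stat.S_IRGRP | stat.S_IWGRP | stat.S_IROTH | stat.S_IWOTH


def is_too_open(path: str) -> bool:
    """True when group or others can read or write the file."""
    return bool(os.stat(path).st_mode & _OPEN_BITS)


def demo() -> tuple[bool, bool]:
    """Check a temp file at mode 0644, then again at 0600."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        os.chmod(path, 0o644)
        loose = is_too_open(path)
        os.chmod(path, 0o600)
        tight = is_too_open(path)
    finally:
        os.remove(path)
    return loose, tight
```

The mask leaves out the group/other execute bits. If those should also trigger the warning, `stat.S_IRWXG | stat.S_IRWXO` covers all six bits in one expression.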
def target_to_dict(t: SshTarget, include_warning: bool = False) -> dict[str, Any]:
"""Serialize a target for API responses. Never includes key contents."""
d: dict[str, Any] = {
"id": t.id,
"label": t.label,
"host": t.host,
"port": t.port,
"user": t.user,
"key_path": t.key_path,
"last_tested": t.last_tested,
"last_ok": t.last_ok,
"last_error": t.last_error,
"created_at": t.created_at,
"updated_at": t.updated_at,
}
if include_warning:
d["key_warning"] = key_path_warning(t.key_path)
return d


@@ -0,0 +1,213 @@
"""Incident ticket export — push Turnstone incidents to external trackers.
Supported targets: "notion", "jira"
Each exporter receives the incident dict and a list of log entry dicts,
and returns {"url": str, "ticket_id": str}.
"""
from __future__ import annotations
import json
from typing import Any
import httpx
# ---------------------------------------------------------------------------
# Notion exporter
# ---------------------------------------------------------------------------
def _notion_export(
incident: dict[str, Any],
entries: list[dict[str, Any]],
token: str,
database_id: str,
) -> dict[str, str]:
"""Create a Notion page in *database_id* from an incident.
Notion block types used: heading_2, bulleted_list_item, paragraph.
Rich text max length is 2000 chars per block.
"""
if not token or not database_id:
raise ValueError("Notion not configured — set notion_token and notion_database_id in Settings")
def _text(s: str, bold: bool = False) -> dict:
chunk: dict[str, Any] = {"type": "text", "text": {"content": s[:2000]}}
if bold:
chunk["annotations"] = {"bold": True}
return chunk
log_blocks: list[dict] = []
for e in entries[:50]: # Notion has page size limits
line = f"[{e.get('severity') or '?'}] {e.get('source_id', '')}: {e.get('text', '')[:160]}"
log_blocks.append({"object": "block", "type": "bulleted_list_item",
"bulleted_list_item": {"rich_text": [_text(line)]}})
sev = incident.get("severity", "medium").upper()
issue_type = incident.get("issue_type") or ""
window = f"{incident.get('started_at') or '?'} to {incident.get('ended_at') or 'ongoing'}"
children: list[dict] = [
{"object": "block", "type": "heading_2",
"heading_2": {"rich_text": [_text("Incident Details", bold=True)]}},
{"object": "block", "type": "paragraph",
"paragraph": {"rich_text": [
_text("Severity: ", bold=True), _text(sev),
_text(" Type: ", bold=True), _text(issue_type),
_text(" Window: ", bold=True), _text(window),
]}},
]
if incident.get("notes"):
children.append({"object": "block", "type": "paragraph",
"paragraph": {"rich_text": [_text("Notes: ", bold=True), _text(incident["notes"])]}})
children.append({"object": "block", "type": "heading_2",
"heading_2": {"rich_text": [_text("Log Evidence")]}})
children.extend(log_blocks)
payload = {
"parent": {"database_id": database_id},
"properties": {
"title": {"title": [_text(incident.get("label", "Unnamed Incident"))]},
},
"children": children,
}
resp = httpx.post(
"https://api.notion.com/v1/pages",
headers={
"Authorization": f"Bearer {token}",
"Notion-Version": "2022-06-28",
"Content-Type": "application/json",
},
json=payload,
timeout=15,
)
if not resp.is_success:
raise RuntimeError(f"Notion API error {resp.status_code}: {resp.text[:300]}")
page = resp.json()
page_id = page["id"]
url = page.get("url") or f"https://notion.so/{page_id.replace('-', '')}"
return {"url": url, "ticket_id": page_id}
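Note: `_text()` truncates anything past 2000 characters, which silently drops the tail of long notes. One alternative, sketched here under the assumption that losing text is undesirable, is to split a long string into several rich-text chunks that each stay within the limit. `rich_text_chunks` is a hypothetical helper, not part of this diff.

```python
from __future__ import annotations

from typing import Any

NOTION_TEXT_LIMIT = 2000  # per-chunk limit stated in the exporter docstring


def rich_text_chunks(s: str, limit: int = NOTION_TEXT_LIMIT) -> list[dict[str, Any]]:
    """Split s into consecutive rich-text objects of at most limit chars each."""
    return [
        {"type": "text", "text": {"content": s[i : i + limit]}}
        for i in range(0, len(s), limit)
    ]


long_notes = "x" * 4500
chunks = rich_text_chunks(long_notes)
paragraph_block = {
    "object": "block",
    "type": "paragraph",
    "paragraph": {"rich_text": chunks},
}
```

An empty string yields an empty list, so callers may want to skip the block entirely in that case.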
# ---------------------------------------------------------------------------
# Jira exporter
# ---------------------------------------------------------------------------
def _jira_export(
incident: dict[str, Any],
entries: list[dict[str, Any]],
jira_url: str,
email: str,
api_token: str,
project_key: str,
issue_type: str = "Bug",
) -> dict[str, str]:
"""Create a Jira issue via REST API v3 (cloud or Server 8.4+)."""
if not jira_url or not email or not api_token or not project_key:
raise ValueError("Jira not configured — set jira_url, jira_email, jira_api_token, and jira_project_key in Settings")
base = jira_url.rstrip("/")
sev = incident.get("severity", "medium").upper()
inc_type = incident.get("issue_type") or "incident"
window = f"{incident.get('started_at') or '?'} to {incident.get('ended_at') or 'ongoing'}"
log_lines = "\n".join(
f"[{e.get('severity') or '?'}] {e.get('source_id', '')}: {e.get('text', '')[:160]}"
for e in entries[:40]
)
description = (
f"*Severity:* {sev} | *Type:* {inc_type} | *Window:* {window}\n\n"
+ (f"*Notes:* {incident['notes']}\n\n" if incident.get("notes") else "")
# Plain (non-f) string: single braces produce the literal {code} markers.
+ "h2. Log Evidence\n\n{code}\n" + log_lines + "\n{code}"
)
# Jira REST v3 uses Atlassian Document Format for description
adf_body = {
"type": "doc",
"version": 1,
"content": [
{"type": "paragraph", "content": [{"type": "text", "text": description}]},
],
}
payload = {
"fields": {
"project": {"key": project_key},
"summary": incident.get("label", "Unnamed Incident"),
"issuetype": {"name": issue_type},
"description": adf_body,
}
}
import base64 as _b64
creds = _b64.b64encode(f"{email}:{api_token}".encode()).decode()
resp = httpx.post(
f"{base}/rest/api/3/issue",
headers={
"Authorization": f"Basic {creds}",
"Content-Type": "application/json",
"Accept": "application/json",
},
json=payload,
timeout=15,
)
if not resp.is_success:
raise RuntimeError(f"Jira API error {resp.status_code}: {resp.text[:300]}")
data = resp.json()
issue_key = data["key"]
url = f"{base}/browse/{issue_key}"
return {"url": url, "ticket_id": issue_key}
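Note: the description above is wiki markup (`*bold*`, `h2.`, `{code}`) placed inside a single ADF text node, where it will most likely be shown literally rather than rendered. A possible alternative, sketched here as an assumption about the intended output and not as the author's method, is to build the document from ADF `paragraph`, `heading`, and `codeBlock` nodes. `build_adf` is a hypothetical helper.

```python
from __future__ import annotations

from typing import Any


def _text(s: str) -> dict[str, Any]:
    """A plain ADF text node."""
    return {"type": "text", "text": s}


def build_adf(
    summary_line: str,
    log_lines: list[str],
    notes: str | None = None,
) -> dict[str, Any]:
    """Assemble an ADF document: summary, optional notes, log evidence."""
    content: list[dict[str, Any]] = [
        {"type": "paragraph", "content": [_text(summary_line)]},
    ]
    if notes:
        content.append({"type": "paragraph", "content": [_text(f"Notes: {notes}")]})
    content.append(
        {"type": "heading", "attrs": {"level": 2}, "content": [_text("Log Evidence")]}
    )
    if log_lines:
        # Skip the code block when there is nothing to show.
        content.append({"type": "codeBlock", "content": [_text("\n".join(log_lines))]})
    return {"type": "doc", "version": 1, "content": content}


doc = build_adf("Severity: HIGH | Type: incident", ["[ERROR] app: boom", "[WARN] app: slow"])
```

The resulting dict can be passed as the `description` field in place of `adf_body`; the rest of the request stays the same.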
# ---------------------------------------------------------------------------
# Public API
# ---------------------------------------------------------------------------
_EXPORTERS = {
"notion": _notion_export,
"jira": _jira_export,
}
def available_targets() -> list[str]:
return list(_EXPORTERS.keys())
def export_incident(
target: str,
incident: dict[str, Any],
entries: list[dict[str, Any]],
config: dict[str, str],
) -> dict[str, str]:
"""Dispatch to the appropriate exporter.
*config* is pulled from the settings pref dict; callers pass the relevant
subset so this service stays stateless and testable.
Returns {"url": str, "ticket_id": str}.
Raises ValueError for unknown target or missing config.
Raises RuntimeError on API-level failures.
"""
if target not in _EXPORTERS:
raise ValueError(f"Unknown ticket target: {target!r}. Supported: {list(_EXPORTERS)}")
if target == "notion":
return _notion_export(
incident, entries,
token=config.get("notion_token", ""),
database_id=config.get("notion_database_id", ""),
)
if target == "jira":
return _jira_export(
incident, entries,
jira_url=config.get("jira_url", ""),
email=config.get("jira_email", ""),
api_token=config.get("jira_api_token", ""),
project_key=config.get("jira_project_key", ""),
issue_type=config.get("jira_issue_type", "Bug"),
)
raise ValueError(f"Unhandled target: {target!r}")

0
app/tasks/__init__.py Normal file

114
app/tasks/anomaly_scorer.py Normal file

@@ -0,0 +1,114 @@
"""Background anomaly scoring task.
Runs score_unscored() after each glean cycle (triggered by glean_scheduler)
or on its own interval when TURNSTONE_ANOMALY_INTERVAL is set.
Set TURNSTONE_ANOMALY_MODEL to a HuggingFace model ID to activate.
When the env var is empty (default) the scorer is a no-op.
"""
from __future__ import annotations
import asyncio
import logging
import os
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from pathlib import Path
from app.services.anomaly import ScoringResult, score_unscored
logger = logging.getLogger(__name__)
_DEFAULT_INTERVAL = int(os.environ.get("TURNSTONE_ANOMALY_INTERVAL", "0"))
_lock = asyncio.Lock()
@dataclass
class ScorerState:
last_run_at: str | None = None
last_duration_s: float | None = None
last_scored: int = 0
last_detections: int = 0
last_error: str | None = None
run_count: int = 0
next_run_at: str | None = None
running: bool = False
total_scored: int = 0
total_detections: int = 0
_state = ScorerState()
def get_state() -> ScorerState:
return _state
async def run_once(
db_path: Path,
model_id: str = "",
device: str = "cpu",
batch_size: int = 256,
threshold: float = 0.75,
) -> ScoringResult:
"""Score unscored entries once. Skips if already running or model not configured."""
if _lock.locked():
return ScoringResult(skipped=True, error="scorer already running")
async with _lock:
_state.running = True
started = datetime.now(tz=timezone.utc)
try:
loop = asyncio.get_running_loop()
result: ScoringResult = await loop.run_in_executor(
None,
lambda: score_unscored(db_path, model_id, device, batch_size, threshold),
)
duration = (datetime.now(tz=timezone.utc) - started).total_seconds()
_state.last_run_at = started.isoformat()
_state.last_duration_s = round(duration, 2)
_state.last_scored = result.scored
_state.last_detections = result.detections
_state.last_error = result.error
_state.run_count += 1
_state.total_scored += result.scored
_state.total_detections += result.detections
if not result.skipped:
logger.info(
"Anomaly scorer: %d scored, %d detections in %.1fs",
result.scored, result.detections, duration,
)
return result
except Exception as exc:
duration = (datetime.now(tz=timezone.utc) - started).total_seconds()
_state.last_run_at = started.isoformat()
_state.last_duration_s = round(duration, 2)
_state.last_error = str(exc)
_state.run_count += 1
logger.error("Anomaly scorer failed: %s", exc)
return ScoringResult(error=str(exc))
finally:
_state.running = False
async def scorer_loop(
db_path: Path,
model_id: str,
device: str,
interval_s: int,
batch_size: int = 256,
threshold: float = 0.75,
) -> None:
"""Score unscored entries every interval_s seconds until cancelled."""
logger.info("Anomaly scorer loop started — interval %ds, model: %s", interval_s, model_id)
while True:
await run_once(db_path, model_id, device, batch_size, threshold)
next_run = datetime.now(tz=timezone.utc) + timedelta(seconds=interval_s)
_state.next_run_at = next_run.isoformat()
try:
await asyncio.sleep(interval_s)
except asyncio.CancelledError:
logger.info("Anomaly scorer loop cancelled")
_state.next_run_at = None
raise
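
The scorer's `run_once` combines two ideas: it refuses to queue behind a run that is already in progress, and it pushes the blocking work onto a thread so the event loop stays responsive. A minimal sketch of that pattern, assuming a hypothetical `_blocking_work` function in place of `score_unscored` and a lock passed in explicitly rather than held at module level:

```python
import asyncio
import time


def _blocking_work(n: int) -> int:
    # Hypothetical stand-in for a slow, synchronous scoring call.
    time.sleep(0.05)
    return n * 2


async def run_once(lock: asyncio.Lock, n: int) -> dict:
    # Skip instead of waiting when another run already holds the lock.
    if lock.locked():
        return {"skipped": True}
    async with lock:
        loop = asyncio.get_running_loop()
        # Run the blocking call in the default thread pool.
        value = await loop.run_in_executor(None, _blocking_work, n)
        return {"skipped": False, "value": value}


async def demo() -> list:
    lock = asyncio.Lock()
    # Both calls start together; the second sees the lock held and skips.
    return await asyncio.gather(run_once(lock, 1), run_once(lock, 2))
```

Calling `asyncio.run(demo())` shows the first call doing the work while the overlapping call returns a skip marker immediately.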

app/tasks/cybersec_scorer.py Normal file

@ -0,0 +1,84 @@
"""Background task wrapper for the cybersec zero-shot scoring pipeline."""
from __future__ import annotations
import asyncio
import logging
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from app.services.cybersec import score_security_entries
logger = logging.getLogger(__name__)
_lock = asyncio.Lock()
@dataclass
class CybersecState:
last_run_at: str | None = None
last_duration_s: float | None = None
last_scored: int = 0
last_detections: int = 0
last_error: str | None = None
run_count: int = 0
running: bool = False
total_scored: int = 0
total_detections: int = 0
_state = CybersecState()
def get_state() -> dict:
return {
"last_run_at": _state.last_run_at,
"last_duration_s":_state.last_duration_s,
"last_scored": _state.last_scored,
"last_detections":_state.last_detections,
"last_error": _state.last_error,
"run_count": _state.run_count,
"running": _state.running,
"total_scored": _state.total_scored,
"total_detections": _state.total_detections,
}
async def run_once(
db_path: Path,
model_id: str,
device: str = "cpu",
batch_size: int = 32,
threshold: float = 0.60,
) -> None:
"""Single cybersec scoring pass — no-op if already running or no model set."""
if not model_id or _lock.locked():
return
async with _lock:
_state.running = True
started = datetime.now(tz=timezone.utc)
try:
loop = asyncio.get_running_loop()
result = await loop.run_in_executor(
None,
lambda: score_security_entries(db_path, model_id, device, batch_size, threshold),
)
elapsed = (datetime.now(tz=timezone.utc) - started).total_seconds()
_state.last_run_at = started.isoformat()
_state.last_duration_s = round(elapsed, 2)
_state.last_scored = result.scored
_state.last_detections = result.detections
_state.last_error = result.error
_state.run_count += 1
_state.total_scored += result.scored
_state.total_detections += result.detections
if result.error:
logger.error("cybersec scorer error: %s", result.error)
elif not result.skipped:
logger.info(
"cybersec scorer: scored=%d detections=%d in %.1fs",
result.scored, result.detections, elapsed,
)
finally:
_state.running = False


@ -0,0 +1,237 @@
"""Periodic batch glean scheduler with optional CF submission.
Runs glean_sources on a configurable interval (TURNSTONE_GLEAN_INTERVAL env var,
default 900s / 15 min). Set to 0 to disable.
When TURNSTONE_SUBMIT_ENDPOINT is set, pushes pattern-matched entries to a remote
Turnstone instance (the CF receiving store) after each glean run.
"""
from __future__ import annotations
import asyncio
import logging
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Any

import httpx

from app.db import get_conn, resolve_tenant_id
from app.glean.pipeline import glean_sources
from app.tasks.anomaly_scorer import run_once as _run_scorer
from app.tasks.cybersec_scorer import run_once as _run_cybersec
from app.tasks.incident_detector import run_once as _run_incident_detector
logger = logging.getLogger(__name__)
_lock = asyncio.Lock()
@dataclass
class IngestState:
last_run_at: str | None = None
last_duration_s: float | None = None
last_stats: dict[str, int] = field(default_factory=dict)
last_error: str | None = None
run_count: int = 0
next_run_at: str | None = None
running: bool = False
last_submitted_at: str | None = None
last_submit_count: int = 0
last_submit_error: str | None = None
_state = IngestState()
def get_state() -> IngestState:
return _state
def _query_matched_since(db_path: Path, since: str | None) -> list[dict]:
"""Return entries with non-empty matched_patterns, optionally filtered by ingest_time."""
tid = resolve_tenant_id()
with get_conn(db_path) as conn:
if since:
rows = conn.execute(
"""
SELECT id, source_id, sequence, timestamp_raw, timestamp_iso,
ingest_time, severity, repeat_count, out_of_order,
matched_patterns, text
FROM log_entries
WHERE matched_patterns != '[]'
AND ingest_time > ?
AND (tenant_id = ? OR tenant_id = '')
ORDER BY ingest_time
LIMIT 5000
""",
(since, tid),
).fetchall()
else:
rows = conn.execute(
"""
SELECT id, source_id, sequence, timestamp_raw, timestamp_iso,
ingest_time, severity, repeat_count, out_of_order,
matched_patterns, text
FROM log_entries
WHERE matched_patterns != '[]'
AND (tenant_id = ? OR tenant_id = '')
ORDER BY ingest_time DESC
LIMIT 5000
""",
(tid,),
).fetchall()
return [dict(r) for r in rows]
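
The query above selects entries whose `matched_patterns` is not the empty JSON list and that belong to the current tenant or to no tenant. A minimal sketch against an in-memory SQLite database shows the filter in action. The table here is cut down to the columns the filter needs, the two branches are merged into one ascending query for brevity, and the sample rows are invented. It also assumes `ingest_time` values share one ISO format, so string comparison matches time order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets dict(row) work, as in the code above
conn.execute(
    "CREATE TABLE log_entries ("
    "id INTEGER PRIMARY KEY, ingest_time TEXT, matched_patterns TEXT, "
    "tenant_id TEXT, text TEXT)"
)
conn.executemany(
    "INSERT INTO log_entries VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2026-06-17T10:00:00+00:00", '["oom"]', "t1", "a"),
        (2, "2026-06-17T10:05:00+00:00", "[]", "t1", "b"),          # no match: excluded
        (3, "2026-06-17T10:10:00+00:00", '["disk_full"]', "", "c"),  # untenanted: included
        (4, "2026-06-17T10:15:00+00:00", '["oom"]', "t2", "d"),      # other tenant: excluded
    ],
)


def query_matched_since(conn: sqlite3.Connection, since, tid: str) -> list:
    sql = (
        "SELECT id, ingest_time, matched_patterns, text FROM log_entries "
        "WHERE matched_patterns != '[]' AND (tenant_id = ? OR tenant_id = '')"
    )
    params = [tid]
    if since:
        sql += " AND ingest_time > ?"
        params.append(since)
    sql += " ORDER BY ingest_time LIMIT 5000"
    return [dict(r) for r in conn.execute(sql, params).fetchall()]
```

Passing the timestamp of the previous submission as `since` returns only rows ingested after it, which is how the scheduler avoids resending entries.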
async def submit_matched(
db_path: Path,
submit_endpoint: str,
source_host: str,
since: str | None = None,
) -> dict[str, Any]:
"""Push pattern-matched entries to the remote CF receiving instance."""
loop = asyncio.get_running_loop()
entries = await loop.run_in_executor(
None, lambda: _query_matched_since(db_path, since)
)
if not entries:
return {"ok": True, "submitted": 0, "skipped": True}
url = f"{submit_endpoint.rstrip('/')}/turnstone/api/glean/batch"
payload = {"source_host": source_host, "entries": entries}
try:
async with httpx.AsyncClient(timeout=30.0) as client:
resp = await client.post(url, json=payload)
resp.raise_for_status()
result = resp.json()
submitted = result.get("gleaned", len(entries))
_state.last_submitted_at = datetime.now(tz=timezone.utc).isoformat()
_state.last_submit_count = submitted
_state.last_submit_error = None
logger.info("Submitted %d matched entries to %s", submitted, submit_endpoint)
return {"ok": True, "submitted": submitted}
except Exception as exc:
_state.last_submit_error = str(exc)
logger.warning("Submission to %s failed: %s", submit_endpoint, exc)
return {"ok": False, "error": str(exc)}
async def run_once(
sources_file: Path,
db_path: Path,
pattern_file: Path | None = None,
submit_endpoint: str | None = None,
source_host: str = "unknown",
force: bool = False,
anomaly_model: str = "",
anomaly_device: str = "cpu",
anomaly_threshold: float = 0.75,
cybersec_model: str = "",
cybersec_device: str = "cpu",
cybersec_threshold: float = 0.60,
incidents_db_path: Path | None = None,
auto_incident: bool = True,
) -> dict[str, Any]:
"""Ingest all sources once, then submit matched entries if configured.
Pass ``force=True`` to bypass fingerprint checks and re-glean all local
file sources regardless of whether they appear unchanged.
"""
if _lock.locked():
return {"ok": False, "error": "glean already running", "skipped": True}
async with _lock:
_state.running = True
started = datetime.now(tz=timezone.utc)
try:
loop = asyncio.get_running_loop()
stats: dict[str, int] = await loop.run_in_executor(
None,
lambda: glean_sources(sources_file, db_path, pattern_file, force=force),
)
duration = (datetime.now(tz=timezone.utc) - started).total_seconds()
_state.last_run_at = started.isoformat()
_state.last_duration_s = round(duration, 2)
_state.last_stats = stats
_state.last_error = None
_state.run_count += 1
logger.info("Batch glean complete in %.1fs — %s", duration, stats)
except Exception as exc:
duration = (datetime.now(tz=timezone.utc) - started).total_seconds()
_state.last_run_at = started.isoformat()
_state.last_duration_s = round(duration, 2)
_state.last_error = str(exc)
_state.run_count += 1
logger.error("Batch glean failed: %s", exc)
_state.running = False
return {"ok": False, "error": str(exc)}
finally:
_state.running = False
if submit_endpoint:
await submit_matched(db_path, submit_endpoint, source_host, since=_state.last_submitted_at)
if anomaly_model:
await _run_scorer(db_path, anomaly_model, anomaly_device, threshold=anomaly_threshold)
if cybersec_model:
await _run_cybersec(db_path, cybersec_model, cybersec_device, threshold=cybersec_threshold)
if auto_incident and incidents_db_path:
glean_started_iso = _state.last_run_at
result = await _run_incident_detector(db_path, incidents_db_path, since=glean_started_iso)
if result["created"]:
logger.info("Incident detector: %d incident(s) auto-created", result["created"])
return {"ok": True, "stats": _state.last_stats, "duration_s": _state.last_duration_s}
async def scheduler_loop(
sources_file: Path,
db_path: Path,
pattern_file: Path | None,
interval_s: int,
submit_endpoint: str | None = None,
source_host: str = "unknown",
anomaly_model: str = "",
anomaly_device: str = "cpu",
anomaly_threshold: float = 0.75,
cybersec_model: str = "",
cybersec_device: str = "cpu",
cybersec_threshold: float = 0.60,
incidents_db_path: Path | None = None,
auto_incident: bool = True,
) -> None:
"""Run glean + optional submission + optional anomaly/cybersec scoring every interval_s seconds."""
logger.info("Ingest scheduler started — interval %ds, sources: %s", interval_s, sources_file)
if submit_endpoint:
logger.info("Submission enabled — endpoint: %s", submit_endpoint)
if anomaly_model:
logger.info("Anomaly scoring enabled — model: %s", anomaly_model)
if cybersec_model:
logger.info("Cybersec scoring enabled — model: %s", cybersec_model)
if auto_incident and incidents_db_path:
logger.info("Auto-incident detection enabled")
while True:
await run_once(
sources_file, db_path, pattern_file, submit_endpoint, source_host,
anomaly_model=anomaly_model,
anomaly_device=anomaly_device,
anomaly_threshold=anomaly_threshold,
cybersec_model=cybersec_model,
cybersec_device=cybersec_device,
cybersec_threshold=cybersec_threshold,
incidents_db_path=incidents_db_path,
auto_incident=auto_incident,
)
next_run = datetime.now(tz=timezone.utc) + timedelta(seconds=interval_s)
_state.next_run_at = next_run.isoformat()
try:
await asyncio.sleep(interval_s)
except asyncio.CancelledError:
logger.info("Ingest scheduler cancelled")
_state.next_run_at = None
raise
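
The scheduler loop catches `asyncio.CancelledError` only to clear `next_run_at`, then re-raises so the task still ends as cancelled. A small sketch of that shutdown behaviour, using a plain dict in place of `IngestState` and a counter in place of the real glean call (both are assumptions made for the example):

```python
import asyncio


async def scheduler_loop(interval_s: int, state: dict) -> None:
    while True:
        state["runs"] += 1                   # stand-in for `await run_once(...)`
        state["next_run_at"] = "scheduled"   # stand-in for the ISO timestamp
        try:
            await asyncio.sleep(interval_s)
        except asyncio.CancelledError:
            # Clear the advertised next run, then propagate the cancellation.
            state["next_run_at"] = None
            raise


async def demo() -> dict:
    state = {"runs": 0, "next_run_at": None}
    task = asyncio.create_task(scheduler_loop(3600, state))
    await asyncio.sleep(0)  # let the loop complete its first iteration
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return state
```

Re-raising matters: swallowing the exception would leave the loop running after the application asked it to stop.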

app/tasks/incident_detector.py Normal file

@ -0,0 +1,188 @@
"""Post-glean automatic incident detection.
After each batch glean, scan entries ingested since the last run for
ERROR/CRITICAL clusters. If a source produces >= threshold errors within
window_s seconds, auto-create an incident unless one already exists for
that source in that time window.
Environment variables (all optional):
TURNSTONE_AUTO_INCIDENT_THRESHOLD integer, default 5
TURNSTONE_AUTO_INCIDENT_WINDOW seconds, default 600 (10 min)
"""
from __future__ import annotations
import asyncio
import logging
import os
from collections import defaultdict
from datetime import datetime, timezone
from pathlib import Path
from app.db import get_conn, resolve_tenant_id
from app.services.incidents import create_incident
logger = logging.getLogger(__name__)
_THRESHOLD = int(os.environ.get("TURNSTONE_AUTO_INCIDENT_THRESHOLD", "5"))
_WINDOW_S = int(os.environ.get("TURNSTONE_AUTO_INCIDENT_WINDOW", "600"))
# Severity rank — used to pick the cluster's worst severity
_SEV_RANK = {"CRITICAL": 3, "ERROR": 2, "WARN": 1, "INFO": 0, "DEBUG": 0}
def _query_recent_errors(db_path: Path, since: str | None) -> list[dict]:
tid = resolve_tenant_id()
with get_conn(db_path) as conn:
if since:
rows = conn.execute(
"""
SELECT source_id, timestamp_iso, severity
FROM log_entries
WHERE severity IN ('ERROR', 'CRITICAL')
AND ingest_time > ?
AND (tenant_id = ? OR tenant_id = '')
ORDER BY source_id, timestamp_iso ASC
""",
(since, tid),
).fetchall()
else:
rows = conn.execute(
"""
SELECT source_id, timestamp_iso, severity
FROM log_entries
WHERE severity IN ('ERROR', 'CRITICAL')
AND (tenant_id = ? OR tenant_id = '')
ORDER BY source_id, timestamp_iso ASC
LIMIT 10000
""",
(tid,),
).fetchall()
return [dict(r) for r in rows]
def _parse_ts(iso: str | None) -> float | None:
    """Parse ISO timestamp to epoch seconds; return None on failure."""
    if not iso:
        return None
    try:
        dt = datetime.fromisoformat(iso.replace("Z", "+00:00"))
        return dt.timestamp()
    except (ValueError, TypeError):
        return None


def _find_clusters(
    events: list[dict], window_s: int, threshold: int
) -> list[tuple[str, str, str]]:
    """Return (started_at_iso, ended_at_iso, worst_severity) for each cluster."""
    # Filter to events with parseable timestamps, sorted ascending
    timed = []
    for e in events:
        t = _parse_ts(e["timestamp_iso"])
        if t is not None:
            timed.append((t, e["timestamp_iso"], e["severity"]))
    timed.sort()
    clusters: list[tuple[str, str, str]] = []
    i = 0
    while i < len(timed):
        j = i
        while j < len(timed) and timed[j][0] - timed[i][0] <= window_s:
            j += 1
        count = j - i
        if count >= threshold:
            worst = max((timed[k][2] for k in range(i, j)), key=lambda s: _SEV_RANK.get(s, 0))
            clusters.append((timed[i][1], timed[j - 1][1], worst))
            i = j  # skip past the cluster to avoid overlap
        else:
            i += 1
    return clusters
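
The cluster search above anchors a window at event `i`, extends `j` while events fall within `window_s` of that anchor, and records a cluster when at least `threshold` events fit. A simplified sketch on bare epoch seconds makes the walk easier to trace. It returns the event count instead of the worst severity, and the sample timestamps are invented.

```python
def find_clusters(times: list, window_s: int, threshold: int) -> list:
    """Return (first_ts, last_ts, count) for each non-overlapping cluster."""
    times = sorted(times)
    clusters = []
    i = 0
    while i < len(times):
        j = i
        # Extend the window while events stay within window_s of the anchor.
        while j < len(times) and times[j] - times[i] <= window_s:
            j += 1
        count = j - i
        if count >= threshold:
            clusters.append((times[i], times[j - 1], count))
            i = j  # resume after the cluster so clusters never overlap
        else:
            i += 1  # slide the anchor forward by one event
    return clusters


sample = [0.0, 10.0, 20.0, 30.0, 40.0, 1000.0, 2000.0, 2010.0]
clusters = find_clusters(sample, window_s=600, threshold=5)
```

With the module defaults (5 errors in 600 seconds), the first five events form one cluster and the three later, sparser events form none.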
def _incident_exists_for_cluster(
incidents_db_path: Path, source_id: str, started_at: str, ended_at: str
) -> bool:
"""Return True if an auto-incident for this source already covers the window."""
issue_type = f"auto:{source_id}"
start_ts = _parse_ts(started_at)
end_ts = _parse_ts(ended_at)
if start_ts is None or end_ts is None:
return False
tid = resolve_tenant_id()
with get_conn(incidents_db_path) as conn:
rows = conn.execute(
"""
SELECT started_at, ended_at FROM incidents
WHERE issue_type = ?
AND (tenant_id = ? OR tenant_id = '')
""",
(issue_type, tid),
).fetchall()
for row in rows:
ex_start = _parse_ts(row["started_at"])
ex_end = _parse_ts(row["ended_at"])
if ex_start is None or ex_end is None:
continue
# Overlap check: two intervals [a,b] and [c,d] overlap when a<=d and b>=c
if ex_start <= end_ts and ex_end >= start_ts:
return True
return False
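
The duplicate check relies on the standard closed-interval overlap test named in the comment: `[a, b]` and `[c, d]` overlap when `a <= d` and `b >= c`. A short sketch with a hypothetical helper name shows the edge cases, including intervals that only touch at an endpoint:

```python
def intervals_overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> bool:
    """True when closed intervals [a_start, a_end] and [b_start, b_end] share a point."""
    return a_start <= b_end and a_end >= b_start


touching = intervals_overlap(0, 10, 10, 20)   # share the single point 10
disjoint = intervals_overlap(0, 10, 11, 20)   # gap between 10 and 11
nested = intervals_overlap(5, 6, 0, 100)      # one interval inside the other
```

Because the comparison is inclusive, a new cluster that starts exactly when an existing incident ends is treated as already covered.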
def detect_and_create(
    db_path: Path,
    incidents_db_path: Path,
    since: str | None,
    threshold: int = _THRESHOLD,
    window_s: int = _WINDOW_S,
) -> dict[str, int]:
    """Detect error clusters and create incidents. Returns {"created": N}."""
    entries = _query_recent_errors(db_path, since)
    if not entries:
        return {"created": 0}
    by_source: dict[str, list[dict]] = defaultdict(list)
    for e in entries:
        by_source[e["source_id"]].append(e)
    created = 0
    for source_id, events in by_source.items():
        clusters = _find_clusters(events, window_s, threshold)
        for started_at, ended_at, worst_sev in clusters:
            if _incident_exists_for_cluster(incidents_db_path, source_id, started_at, ended_at):
                continue
            # Counts every ERROR/CRITICAL event for this source in the glean
            # window, not only the events inside this cluster.
            n = len(events)
            sev_label = "critical" if worst_sev == "CRITICAL" else "high"
            create_incident(
                incidents_db_path,
                label=f"Auto: {source_id}: {n} errors",
                issue_type=f"auto:{source_id}",
                started_at=started_at,
                ended_at=ended_at,
                notes="Auto-detected error cluster. Review and label as needed.",
                severity=sev_label,
            )
            logger.info(
                "Auto-incident created: source=%s window=[%s, %s] severity=%s",
                source_id, started_at, ended_at, sev_label,
            )
            created += 1
    if created:
        logger.info("Incident detector: %d new incident(s) created", created)
    return {"created": created}
async def run_once(
db_path: Path,
incidents_db_path: Path,
since: str | None,
threshold: int = _THRESHOLD,
window_s: int = _WINDOW_S,
) -> dict[str, int]:
"""Async wrapper — runs detection in a thread to avoid blocking the event loop."""
loop = asyncio.get_running_loop()
return await loop.run_in_executor(
None,
lambda: detect_and_create(db_path, incidents_db_path, since, threshold, window_s),
)

0
app/watch/__init__.py Normal file

282
app/watch/watcher.py Normal file

@ -0,0 +1,282 @@
"""Live watch: tail active log sources and glean entries in near-real-time.
Each WatchSource runs a subprocess (journalctl -f, podman/docker logs -f)
in a daemon thread and pipes lines through the existing ingestors into SQLite.
FTS is synced incrementally after each flush.
"""
from __future__ import annotations
import json
import logging
import subprocess
import threading
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterator
import yaml
from app.glean import journald as journald_parser, syslog as syslog_parser
from app.glean import plaintext as plaintext_parser, servarr as servarr_parser, plex as plex_parser
from app.glean import qbittorrent as qbit_parser, caddy as caddy_parser
from app.db import get_conn
from app.db.schema import ensure_schema
from app.glean.pipeline import _detect_format, _write_batch
from app.glean.base import _compile, load_patterns, now_iso
from app.services.models import RetrievedEntry
logger = logging.getLogger(__name__)
FLUSH_INTERVAL_SEC = 10
FLUSH_BATCH_SIZE = 100
# ── Config ────────────────────────────────────────────────────────────────────
@dataclass(frozen=True)
class WatchConfig:
source_type: str # "journald" | "docker" | "podman" | "file"
source_id: str
args: list[str] = field(default_factory=list) # extra CLI args
def load_watch_config(path: Path) -> list[WatchConfig]:
"""Load watch.yaml; return empty list if file absent."""
if not path.exists():
return []
raw = yaml.safe_load(path.read_text()) or {}
sources = []
for src in raw.get("sources", []):
sources.append(WatchConfig(
source_type=src["type"],
source_id=src["id"],
args=src.get("args", []),
))
return sources
# ── Per-source runner ─────────────────────────────────────────────────────────
class WatchSource:
"""Tails a single log source in a background daemon thread."""
def __init__(
self,
config: WatchConfig,
db_path: Path,
pattern_file: Path,
) -> None:
self.config = config
self.db_path = db_path
self.pattern_file = pattern_file
self._stop = threading.Event()
self._thread: threading.Thread | None = None
self._proc: subprocess.Popen | None = None
self._last_event: str | None = None
self._entry_count: int = 0
self._error: str | None = None
@property
def status(self) -> dict:
return {
"source_id": self.config.source_id,
"type": self.config.source_type,
"running": self._thread is not None and self._thread.is_alive(),
"entries_gleaned": self._entry_count,
"last_event": self._last_event,
"error": self._error,
}
def start(self) -> None:
self._stop.clear()
self._thread = threading.Thread(target=self._run, daemon=True, name=f"watch:{self.config.source_id}")
self._thread.start()
logger.info("Watch source started: %s (%s)", self.config.source_id, self.config.source_type)
def stop(self) -> None:
self._stop.set()
if self._proc:
try:
self._proc.terminate()
except OSError:
pass
if self._thread:
self._thread.join(timeout=5)
logger.info("Watch source stopped: %s", self.config.source_id)
def _run(self) -> None:
patterns = load_patterns(self.pattern_file)
compiled = _compile(patterns)
ensure_schema(self.db_path)
try:
cmd = self._build_command()
if not cmd:
return
self._proc = subprocess.Popen(
cmd,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
bufsize=1,
)
self._drain(compiled)
except Exception as exc:
self._error = str(exc)
logger.error("Watch source %r crashed: %s", self.config.source_id, exc)
def _build_command(self) -> list[str] | None:
t = self.config.source_type
extra = self.config.args
if t == "journald":
return ["journalctl", "-f", "--output=json", "--no-pager"] + extra
if t == "docker":
if not extra:
logger.error("docker source %r requires args: [container_name]", self.config.source_id)
return None
return ["docker", "logs", "-f", "--timestamps", extra[0]] + extra[1:]
if t == "podman":
if not extra:
logger.error("podman source %r requires args: [container_name]", self.config.source_id)
return None
return ["podman", "logs", "-f", "--timestamps", extra[0]] + extra[1:]
if t == "file":
if not extra:
logger.error("file source %r requires args: [/path/to/log]", self.config.source_id)
return None
# -F: follow by name (handles rotation); -n 0: start from end, don't replay old data
return ["tail", "-F", "-n", "0", extra[0]]
logger.error("Unknown watch source type: %r", t)
return None
def _parse_lines(self, lines: Iterator[str], ingest_time: str, compiled) -> list[RetrievedEntry]:
t = self.config.source_type
sid = self.config.source_id
if t == "journald":
return list(journald_parser.parse(iter(lines), sid, compiled, ingest_time))
if t in ("docker", "podman"):
# Output: "2024-01-15T12:34:56.789012345Z log line text"
stripped = [_strip_docker_ts(ln) for ln in lines]
return list(plaintext_parser.parse(iter(stripped), sid, compiled, ingest_time))
if t == "file":
# Auto-detect format from the first non-empty line
non_empty = [ln for ln in lines if ln.strip()]
if not non_empty:
return []
fmt = _detect_format(non_empty[0])
it = iter(non_empty)
if fmt == "journald":
return list(journald_parser.parse(it, sid, compiled, ingest_time))
if fmt == "servarr":
return list(servarr_parser.parse(it, sid, compiled, ingest_time))
if fmt == "plex":
return list(plex_parser.parse(it, sid, compiled, ingest_time))
if fmt == "qbittorrent":
return list(qbit_parser.parse(it, sid, compiled, ingest_time))
if fmt == "caddy":
return list(caddy_parser.parse(it, sid, compiled, ingest_time))
if fmt == "syslog":
return list(syslog_parser.parse(it, sid, compiled, ingest_time))
return list(plaintext_parser.parse(it, sid, compiled, ingest_time))
return []
    def _drain(self, compiled) -> None:
        """Read lines from the subprocess and flush to DB periodically."""
        import select  # imported once here rather than on every loop iteration

        assert self._proc is not None
        buffer: list[str] = []
        flush_count = 0
        last_flush = datetime.now(tz=timezone.utc)
        while not self._stop.is_set():
            assert self._proc.stdout is not None
            # Wait up to 1s for output so the stop flag is re-checked regularly
            ready, _, _ = select.select([self._proc.stdout], [], [], 1.0)
            if ready:
                line = self._proc.stdout.readline()
                if not line:
                    # EOF: the child process exited. Nothing in this method
                    # restarts it, so the loop ends after a short pause.
                    if not self._stop.is_set():
                        logger.warning("Watch process exited for %r; stopping this source", self.config.source_id)
                        self._stop.wait(5)
                    break
                line = line.rstrip("\n")
                if line:
                    buffer.append(line)
            elapsed = (datetime.now(tz=timezone.utc) - last_flush).total_seconds()
            should_flush = len(buffer) >= FLUSH_BATCH_SIZE or elapsed >= FLUSH_INTERVAL_SEC
            if buffer and should_flush:
                flush_count = self._flush(buffer, compiled, flush_count)
                buffer.clear()
                last_flush = datetime.now(tz=timezone.utc)
        # Flush remainder
        if buffer:
            self._flush(buffer, compiled, flush_count)
def _flush(self, lines: list[str], compiled, flush_count: int) -> int:
ingest_time = now_iso()
try:
entries = self._parse_lines(lines, ingest_time, compiled)
if entries:
with get_conn(self.db_path) as conn:
_write_batch(conn, entries)
conn.commit()
self._entry_count += len(entries)
self._last_event = now_iso()
if entries:
self._last_event = entries[-1].timestamp_iso or self._last_event
flush_count += 1
except Exception as exc:
logger.warning("Flush error for %r: %s", self.config.source_id, exc)
return flush_count
def _strip_docker_ts(line: str) -> str:
"""Remove leading RFC3339 timestamp that docker/podman logs -f --timestamps adds."""
# Format: "2024-01-15T12:34:56.789012345Z actual log text"
parts = line.split(" ", 1)
if len(parts) == 2 and "T" in parts[0] and parts[0].endswith("Z"):
return parts[1]
return line
# ── Orchestrator ──────────────────────────────────────────────────────────────
class Watcher:
"""Manages all active WatchSource instances."""
def __init__(self, db_path: Path, pattern_file: Path) -> None:
self.db_path = db_path
self.pattern_file = pattern_file
self._sources: list[WatchSource] = []
def configure(self, configs: list[WatchConfig]) -> None:
self._sources = [
WatchSource(c, self.db_path, self.pattern_file)
for c in configs
]
def start(self) -> None:
for src in self._sources:
src.start()
def stop(self) -> None:
for src in self._sources:
src.stop()
@property
def status(self) -> list[dict]:
return [src.status for src in self._sources]
def is_active(self) -> bool:
return any(src._thread is not None and src._thread.is_alive() for src in self._sources)

docker-compose.submissions.yml Normal file

@ -0,0 +1,74 @@
# Turnstone — CF receiving instances for external node submissions.
#
# These are SEPARATE instances from the main Turnstone deployment. Each node
# that has TURNSTONE_SUBMIT_ENDPOINT configured pushes pattern-matched entries
# here. Each instance has its own isolated database. Avocet reads these
# databases for training data.
#
# Ports:
# 8536 → submissions-contrib1 (harvest.circuitforge.tech/contrib1/*)
# 8537 → submissions-contrib2 (harvest.circuitforge.tech/contrib2/*)
#
# Deploy on Heimdall:
# docker compose -f docker-compose.submissions.yml up -d
#
# Database locations:
# /devl/docker/turnstone-submissions/contrib1/turnstone.db
# /devl/docker/turnstone-submissions/contrib2/turnstone.db
#
# These instances have TURNSTONE_INGEST_INTERVAL=0 — they only receive POSTs,
# they do not run their own scheduled ingest.
services:
  submissions-contrib1:
    image: turnstone:latest
    container_name: turnstone-submissions-contrib1
    restart: unless-stopped
    ports:
      - "8536:8534"
    volumes:
      - /devl/docker/turnstone-submissions/contrib1:/data:z
      - /devl/docker/turnstone-submissions/contrib1/patterns:/patterns:ro
    environment:
      TURNSTONE_DB: /data/turnstone.db
      TURNSTONE_PATTERNS: /patterns
      TURNSTONE_SOURCE_HOST: submissions-contrib1
      TURNSTONE_INGEST_INTERVAL: "0"
      PYTHONUNBUFFERED: "1"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8534/turnstone/health"]
      interval: 30s
      timeout: 10s
      start_period: 20s
      retries: 3
    networks:
      - caddy-internal

  submissions-contrib2:
    image: turnstone:latest
    container_name: turnstone-submissions-contrib2
    restart: unless-stopped
    ports:
      - "8537:8534"
    volumes:
      - /devl/docker/turnstone-submissions/contrib2:/data:z
      - /devl/docker/turnstone-submissions/contrib2/patterns:/patterns:ro
    environment:
      TURNSTONE_DB: /data/turnstone.db
      TURNSTONE_PATTERNS: /patterns
      TURNSTONE_SOURCE_HOST: submissions-contrib2
      TURNSTONE_INGEST_INTERVAL: "0"
      PYTHONUNBUFFERED: "1"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8534/turnstone/health"]
      interval: 30s
      timeout: 10s
      start_period: 20s
      retries: 3
    networks:
      - caddy-internal

networks:
  caddy-internal:
    name: caddy-proxy_caddy-internal
    external: true

68
docker-compose.yml Normal file

@ -0,0 +1,68 @@
version: "3.9"
# Turnstone with external Postgres DB.
# Data lives in the named volume `turnstone_pgdata` — survives image rebuilds.
# To adopt an EXISTING Postgres install, set DATABASE_URL to point at it and
# remove the `db` service and `depends_on` blocks.
#
# Quick start:
# docker compose up -d
# # Then open http://localhost:8520
services:
  db:
    image: postgres:16-alpine
    restart: unless-stopped
    environment:
      POSTGRES_DB: turnstone
      POSTGRES_USER: turnstone
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-turnstone_dev}
    volumes:
      - turnstone_pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U turnstone -d turnstone"]
      interval: 5s
      timeout: 5s
      retries: 5

  turnstone:
    build: .
    restart: unless-stopped
    ports:
      - "${TURNSTONE_PORT:-8520}:8520"
    depends_on:
      db:
        condition: service_healthy
    environment:
      # Backend selection: comment out DATABASE_URL to fall back to SQLite
      DATABASE_URL: postgresql://turnstone:${POSTGRES_PASSWORD:-turnstone_dev}@db:5432/turnstone
      TURNSTONE_TENANT_ID: ${TURNSTONE_TENANT_ID:-}
      TURNSTONE_API_KEY: ${TURNSTONE_API_KEY:-}
      TURNSTONE_GLEAN_INTERVAL: ${TURNSTONE_GLEAN_INTERVAL:-900}
      TURNSTONE_SOURCE_HOST: ${TURNSTONE_SOURCE_HOST:-}
      TURNSTONE_SUBMIT_ENDPOINT: ${TURNSTONE_SUBMIT_ENDPOINT:-}
      # --- Multi-agent diagnose pipeline ---
      TURNSTONE_MULTI_AGENT_DIAGNOSE: ${TURNSTONE_MULTI_AGENT_DIAGNOSE:-false}
      TURNSTONE_CLASSIFIER_MODEL: ${TURNSTONE_CLASSIFIER_MODEL:-}
      TURNSTONE_EMBED_BACKEND: ${TURNSTONE_EMBED_BACKEND:-}
      TURNSTONE_EMBED_MODEL: ${TURNSTONE_EMBED_MODEL:-}
      TURNSTONE_EMBED_DEVICE: ${TURNSTONE_EMBED_DEVICE:-cpu}
      # --- Cybersec scoring pipeline ---
      TURNSTONE_CYBERSEC_MODEL: ${TURNSTONE_CYBERSEC_MODEL:-}
      TURNSTONE_CYBERSEC_DEVICE: ${TURNSTONE_CYBERSEC_DEVICE:-cpu}
      TURNSTONE_CYBERSEC_THRESHOLD: ${TURNSTONE_CYBERSEC_THRESHOLD:-0.60}
      # --- Anomaly scoring pipeline ---
      TURNSTONE_ANOMALY_MODEL: ${TURNSTONE_ANOMALY_MODEL:-}
      TURNSTONE_ANOMALY_DEVICE: ${TURNSTONE_ANOMALY_DEVICE:-cpu}
      TURNSTONE_ANOMALY_THRESHOLD: ${TURNSTONE_ANOMALY_THRESHOLD:-0.75}
      TURNSTONE_ANOMALY_INTERVAL: ${TURNSTONE_ANOMALY_INTERVAL:-0}
      # --- HuggingFace model cache ---
      HF_HOME: /hf_cache
    volumes:
      - ./patterns:/app/patterns:ro
      - ./data:/app/data  # optional: persists SQLite files if DATABASE_URL unset
      - ${HF_CACHE_PATH:-/Library/Assets/LLM}:/hf_cache:ro  # shared model cache

volumes:
  turnstone_pgdata:
    name: turnstone_pgdata

171
docker-standalone.sh Executable file

@ -0,0 +1,171 @@
#!/usr/bin/env bash
# docker-standalone.sh — Turnstone Docker setup (no Compose)
#
# For hosts running Docker (not Podman). The container restarts automatically
# on boot via Docker's built-in restart policy — no systemd unit needed.
# Turnstone is a diagnostic log intelligence layer — glean service logs,
# search by symptom, and view incidents in a lightweight web UI.
#
# ── Prerequisites ────────────────────────────────────────────────────────────
# 1. Clone the repo:
# git clone https://git.opensourcesolarpunk.com/Circuit-Forge/turnstone.git ~/turnstone
# (or wherever you prefer — update REPO_DIR below)
#
# 2. Build the image:
# cd ~/turnstone && docker build -t localhost/turnstone:latest .
#
# 3. Create data and patterns directories, then copy config files:
# mkdir -p ~/turnstone/{data,patterns}
# cp ~/turnstone/patterns/default.yaml ~/turnstone/patterns/
# cp ~/turnstone/patterns/sources.yaml ~/turnstone/patterns/
# # Edit sources.yaml — set log paths that exist on this host.
#
# 4. Set any env vars (see sections below), then run this script:
# bash ~/turnstone/docker-standalone.sh
#
# ── After setup ──────────────────────────────────────────────────────────────
# The container starts with --restart=unless-stopped so it survives reboots.
# To upgrade: git pull && bash ~/turnstone/docker-standalone.sh
#
# ── Gleaning logs ─────────────────────────────────────────────────────────────
# All service logs under /opt are accessible inside the container.
# Sources are configured in patterns/sources.yaml (bind-mounted at /patterns/).
#
# To glean all sources (run manually or via cron):
#
# docker exec turnstone python scripts/glean_corpus.py \
# --sources /patterns/sources.yaml --db /data/turnstone.db
#
# Example cron (every 15 minutes, add with: crontab -e):
# */15 * * * * docker exec turnstone python scripts/glean_corpus.py \
# --sources /patterns/sources.yaml --db /data/turnstone.db >> /var/log/turnstone-glean.log 2>&1
#
# To add a new log source: edit patterns/sources.yaml — no restart needed.
#
# ── Adding Caddy reverse proxy ────────────────────────────────────────────────
# Add to /etc/caddy/Caddyfile on this host:
#
# turnstone.yourdomain.tld {
# import protected
# reverse_proxy 127.0.0.1:8534
# import cloudflare
# }
#
# Then: sudo systemctl reload caddy
#
# ── Ports ────────────────────────────────────────────────────────────────────
# Turnstone UI → http://localhost:8534/turnstone/
#
set -euo pipefail
# ── Paths — update to match your clone location ──────────────────────────────
REPO_DIR="${HOME}/turnstone"
DATA_DIR="${REPO_DIR}/data"
PATTERNS_DIR="${REPO_DIR}/patterns"
# HF_CACHE_DIR: override to a shared cache directory to avoid re-downloading models.
# Example (Heimdall, where byviz/bylastic_classification_logs is already cached):
# export HF_CACHE_DIR=/Library/Assets/LLM
HF_CACHE_DIR="${HF_CACHE_DIR:-${REPO_DIR}/hf-cache}"
TZ="${TZ:-America/Los_Angeles}"
# ── Bundle push configuration ────────────────────────────────────────────────
# Set TURNSTONE_BUNDLE_ENDPOINT to enable the "Send Bundle" button in the
# Incidents UI:
#
# export TURNSTONE_BUNDLE_ENDPOINT=https://turnstone.circuitforge.tech/turnstone/api/bundles
# bash ~/turnstone/docker-standalone.sh
#
# ── Orchard submission (opt-in telemetry) ────────────────────────────────────
# Set TURNSTONE_SUBMIT_ENDPOINT to push pattern-matched log entries to a CF
# receiving instance after each glean run. Only matched entries are sent —
# no raw log content. Used to build Avocet training data.
#
# export TURNSTONE_SUBMIT_ENDPOINT=https://harvest.circuitforge.tech/your-node-id
# bash ~/turnstone/docker-standalone.sh
#
# ── Anomaly scoring pipeline (IDS / watchdog) ────────────────────────────────
# Set TURNSTONE_ANOMALY_MODEL to enable automatic anomaly scoring after each
# glean run. The byviz classifier (already used by the diagnose pipeline) is
# a good default — it's cached alongside the other models.
#
# export TURNSTONE_ANOMALY_MODEL=byviz/bylastic_classification_logs
# export TURNSTONE_ANOMALY_THRESHOLD=0.80 # confidence floor (default 0.75)
# bash ~/turnstone/docker-standalone.sh
#
# ── Multi-agent diagnose pipeline ────────────────────────────────────────────
# Enable the 5-stage ML pipeline to get smarter diagnose results.
#
# If your host has WireGuard to Heimdall's LAN:
# export GPU_SERVER_URL=http://<HEIMDALL_LAN_IP>:7700
# export TURNSTONE_MULTI_AGENT_DIAGNOSE=true
# bash ~/turnstone/docker-standalone.sh
#
# If your host has no WireGuard to Heimdall (use public cf-orch endpoint):
# export GPU_SERVER_URL=https://orch.circuitforge.tech
# export TURNSTONE_MULTI_AGENT_DIAGNOSE=true
# bash ~/turnstone/docker-standalone.sh
#
# ML models are downloaded on first diagnose run and cached in HF_CACHE_DIR.
# First run takes a few minutes (downloading ~400MB of CPU-only models).
# Subsequent runs are instant (models served from hf-cache/).
#
# ── Build image from current source ─────────────────────────────────────────
echo "Building Turnstone image..."
docker build -t localhost/turnstone:latest "${REPO_DIR}"
# Create HF model cache dir if not present (persists across container rebuilds)
mkdir -p "${HF_CACHE_DIR}"
mkdir -p "${DATA_DIR}" "${PATTERNS_DIR}"
# Remove existing container if present (safe re-run)
docker rm -f turnstone 2>/dev/null || true
docker run -d \
--name=turnstone \
--restart=unless-stopped \
-p 8534:8534 \
-v "${DATA_DIR}:/data" \
-v "${PATTERNS_DIR}:/patterns" \
-v "${HF_CACHE_DIR}:/hf-cache" \
-v /opt:/opt:ro \
-v /var/log:/var/log:ro \
-e TURNSTONE_DB=/data/turnstone.db \
-e TURNSTONE_SOURCE_HOST="$(hostname)" \
-e TURNSTONE_BUNDLE_ENDPOINT="${TURNSTONE_BUNDLE_ENDPOINT:-}" \
-e TURNSTONE_SUBMIT_ENDPOINT="${TURNSTONE_SUBMIT_ENDPOINT:-}" \
-e PYTHONUNBUFFERED=1 \
-e TZ="${TZ}" \
-e TURNSTONE_MULTI_AGENT_DIAGNOSE="${TURNSTONE_MULTI_AGENT_DIAGNOSE:-false}" \
-e GPU_SERVER_URL="${GPU_SERVER_URL:-}" \
-e HF_HOME=/hf-cache \
-e TURNSTONE_CLASSIFIER_MODEL="${TURNSTONE_CLASSIFIER_MODEL:-byviz/bylastic_classification_logs}" \
-e TURNSTONE_EMBED_BACKEND="${TURNSTONE_EMBED_BACKEND:-sentence_transformers}" \
-e TURNSTONE_EMBED_MODEL="${TURNSTONE_EMBED_MODEL:-sentence-transformers/all-MiniLM-L6-v2}" \
-e TURNSTONE_EMBED_DEVICE="${TURNSTONE_EMBED_DEVICE:-cpu}" \
-e TURNSTONE_CYBERSEC_MODEL="${TURNSTONE_CYBERSEC_MODEL:-}" \
-e TURNSTONE_CYBERSEC_DEVICE="${TURNSTONE_CYBERSEC_DEVICE:-cpu}" \
-e TURNSTONE_CYBERSEC_THRESHOLD="${TURNSTONE_CYBERSEC_THRESHOLD:-0.60}" \
-e TURNSTONE_ANOMALY_MODEL="${TURNSTONE_ANOMALY_MODEL:-}" \
-e TURNSTONE_ANOMALY_DEVICE="${TURNSTONE_ANOMALY_DEVICE:-cpu}" \
-e TURNSTONE_ANOMALY_THRESHOLD="${TURNSTONE_ANOMALY_THRESHOLD:-0.75}" \
-e TURNSTONE_ANOMALY_INTERVAL="${TURNSTONE_ANOMALY_INTERVAL:-0}" \
localhost/turnstone:latest
echo ""
echo "Turnstone is starting up."
echo " UI: http://localhost:8534/turnstone/"
echo ""
echo "Check container health with:"
echo " docker ps"
echo " docker logs turnstone"
echo ""
echo "To glean all sources now:"
echo " docker exec turnstone python scripts/glean_corpus.py \\"
echo " --sources /patterns/sources.yaml --db /data/turnstone.db"
echo ""
echo "To add a new source: edit ${PATTERNS_DIR}/sources.yaml — no restart needed."


@ -0,0 +1,129 @@
# Air-Gapped Deployment Guide
Turnstone can run entirely without internet access. This guide covers pre-downloading
all model weights, configuring offline mode, and verifying that no outbound connections
are made at runtime.
## What requires network access by default
| Component | When | What it downloads |
|-----------|------|------------------|
| Stage 2 ML classifier | First diagnose run (if `TURNSTONE_CLASSIFIER_MODEL` is set) | HuggingFace model weights (~300 MB) |
| Stage 4 sentence-transformers embedder | First diagnose run (if `TURNSTONE_EMBED_BACKEND=sentence_transformers`) | Embedding model (~130 MB) |
| LLM inference | Every diagnose run | Nothing — calls your configured `GPU_SERVER_URL` only |
| Log glean | Every glean run | Nothing — reads local files or SSH sources |
If neither the classifier nor the sentence-transformers embedder is enabled, Turnstone
makes no outbound network calls at runtime other than to your configured LLM endpoint
and any SSH log sources; everything else is local SQLite reads and writes.
## Step 1 — Pre-download models (on an internet-connected machine)
Run these commands in the `cf` conda environment before moving to the air-gapped host:
```bash
# Stage 2 ML classifier (only needed if TURNSTONE_CLASSIFIER_MODEL is set)
conda run -n cf python -c "
from transformers import pipeline
pipeline('text-classification', model='byviz/bylastic_classification_logs')
print('classifier cached')
"
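# Stage 4 sentence-transformers embedder (only if TURNSTONE_EMBED_BACKEND=sentence_transformers)
# Download the model named in TURNSTONE_EMBED_MODEL. docker-standalone.sh defaults to
# sentence-transformers/all-MiniLM-L6-v2, so substitute that name below if you keep the default.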
conda run -n cf python -c "
from sentence_transformers import SentenceTransformer
SentenceTransformer('BAAI/bge-small-en-v1.5')
print('embedder cached')
"
```
Models are cached to `~/.cache/huggingface/`. Copy that directory to the air-gapped host
at the same path before deployment.
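Before copying the cache, it can help to confirm that both models actually landed in it. The following is a minimal sketch, assuming the standard Hugging Face hub layout (`hub/models--<org>--<name>/snapshots/<revision>/`); the helper name `is_cached` is illustrative and not part of Turnstone.
```python
from pathlib import Path


def is_cached(repo_id: str, hf_home: Path) -> bool:
    """Return True if hf_home holds at least one snapshot of repo_id.

    Assumes the standard hub layout:
    hub/models--<org>--<name>/snapshots/<revision>/
    """
    repo_dir = hf_home / "hub" / ("models--" + repo_id.replace("/", "--"))
    snapshots = repo_dir / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())


if __name__ == "__main__":
    # Default cache location named in this guide; adjust if HF_HOME is set.
    home = Path.home() / ".cache" / "huggingface"
    for repo in ("byviz/bylastic_classification_logs", "BAAI/bge-small-en-v1.5"):
        print(repo, "cached" if is_cached(repo, home) else "MISSING")
```
Running the same check on the air-gapped host after the copy gives a quick signal that nothing was lost in transit.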
## Step 2 — Pre-ingest your documentation corpus
On the internet-connected machine, or before cutting the network:
```bash
# Write your manifest (see scripts/manifests/example.yaml)
# Then bulk-upload to the context DB:
conda run -n cf python scripts/harvest_docs.py --manifest scripts/manifests/your-site.yaml
```
The context DB (`turnstone-context.db`) is a plain SQLite file — copy it to the
air-gapped host alongside `turnstone.db`.
## Step 3 — Set offline environment variables
Add to your `.env` file (copy from `.env.example`):
```bash
# Block all HuggingFace hub network access
TURNSTONE_OFFLINE_MODE=1
# Point models at the pre-downloaded cache (usually the default)
# HF_HOME=/home/youruser/.cache/huggingface
```
`TURNSTONE_OFFLINE_MODE=1` sets both `HF_HUB_OFFLINE=1` and `TRANSFORMERS_OFFLINE=1`
before any model library loads. If the cache is missing or incomplete, the classifier
falls back to the pattern-tag / regex path and embedding is skipped — diagnose still
works, just without ML-assisted severity or suppression.
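The mapping described above can be sketched as follows. This is an illustration under stated assumptions, not Turnstone's actual startup code: the function name `apply_offline_mode` is hypothetical, and only the three environment variable names come from this guide.
```python
import os


def apply_offline_mode(env=os.environ) -> bool:
    """Export the Hugging Face offline flags when TURNSTONE_OFFLINE_MODE=1.

    Call this before importing transformers or sentence_transformers,
    since the guide requires the flags to be set before any model
    library loads.
    """
    if env.get("TURNSTONE_OFFLINE_MODE") != "1":
        return False
    env["HF_HUB_OFFLINE"] = "1"
    env["TRANSFORMERS_OFFLINE"] = "1"
    return True
```
Passing the environment as a parameter keeps the function testable with a plain dict; in real use the default `os.environ` applies.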
## Step 4 — Configure a local LLM endpoint
Turnstone's LLM reasoning calls your `GPU_SERVER_URL`. On an air-gapped host this
must be a local endpoint — either Ollama or a local cf-orch coordinator:
```bash
# Local Ollama
GPU_SERVER_URL=http://localhost:11434
# Local cf-orch coordinator
GPU_SERVER_URL=http://localhost:7700
```
Pull the Ollama model before cutting network access:
```bash
ollama pull llama3.1:8b
```
## Step 5 — Verify no outbound connections at runtime
Start Turnstone and run a diagnose query, then check for unexpected outbound connections:
```bash
# Watch for any connection to HuggingFace, PyPI, or other external hosts
ss -tp | grep python
# or
lsof -i -n -P | grep python | grep ESTABLISHED
```
Expected: only connections to your `GPU_SERVER_URL` and any SSH log sources.
There should be no connections to `huggingface.co`, `cdn-lfs.huggingface.co`, or `pypi.org`.
## Deployment checklist
- [ ] `~/.cache/huggingface/` copied to air-gapped host (if using ML classifier or embedder)
- [ ] `TURNSTONE_OFFLINE_MODE=1` set in `.env`
- [ ] `GPU_SERVER_URL` points to a local inference endpoint
- [ ] Ollama model pulled locally (if using Ollama)
- [ ] Context DB pre-populated with runbooks via `harvest_docs.py`
- [ ] No internet access verified with `ss -tp` during a diagnose run
- [ ] `TURNSTONE_API_KEY` set if the host is accessible over the network (see API auth docs)
## Troubleshooting
**"OSError: We couldn't connect to huggingface.co…"**
The model is not in the local cache. Either download it on a connected machine and copy
`~/.cache/huggingface/`, or unset `TURNSTONE_CLASSIFIER_MODEL` to fall back to the
pattern-based classifier.
**Diagnose still works but no ML severity in pipeline stages**
Expected when running offline without a pre-cached model. Stage 2 falls back to
`pattern_tags` → regex severity detection automatically.
**LLM reasoning missing from diagnose output**
Check that `GPU_SERVER_URL` is reachable from the air-gapped host and that your local
Ollama/vLLM has the configured model pulled.


@ -0,0 +1,154 @@
# Turnstone Compliance Checklist
**Last reviewed:** 2026-05-28
**Applies to:** All deployments handling log data in compliance-sensitive environments.
Symbols: ✅ satisfied by code, ⚙️ operator action required, ⚠️ known limitation, 🔲 not implemented.
---
## Data Isolation
### Source-level query isolation
✅ **`source_filter` enforced on all log-returning endpoints.**
Every endpoint that returns log entries accepts a `source` parameter. Both the FTS5 keyword search path and the time-window scan path apply `source_id LIKE ?` before returning results. No cross-source data leakage is possible through the API.
Relevant code: `app/services/search.py`, functions `search()` and `entries_in_window()`.
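The source filter can be illustrated with a small in-memory example. This is a sketch only: the table is reduced to three columns, and `entries_for_source` is a hypothetical stand-in for the real query functions. It qualifies each column with a table alias, which also avoids ambiguity once a JOIN is involved.
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log_entries (source_id TEXT, severity TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO log_entries VALUES (?, ?, ?)",
    [
        ("heimdall-journal", "ERROR", "disk full"),
        ("plex", "ERROR", "transcode failed"),
    ],
)


def entries_for_source(conn, source, severity):
    """Return rows for one source only; the source filter is always applied."""
    rows = conn.execute(
        "SELECT e.source_id, e.text FROM log_entries AS e "
        "WHERE e.source_id LIKE ? AND e.severity = ?",
        (source, severity),  # bound parameters, never string interpolation
    )
    return rows.fetchall()


print(entries_for_source(conn, "plex", "ERROR"))
# prints [('plex', 'transcode failed')]
```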
### FTS5 cross-source leakage
✅ **FTS5 index includes `source_id` as an UNINDEXED column; all queries filter on it.**
The virtual table schema stores `source_id` alongside each entry. Query functions always join back to the base table or filter the FTS result set by `source_id`. There is no full-corpus FTS path that ignores source.
### SQLite file permissions
⚙️ **Operator responsibility — not enforced by Turnstone.**
Turnstone does not set file permissions on the database. Recommended posture for multi-user hosts:
```bash
# Restrict DB to the Turnstone process user only
chmod 600 /devl/turnstone-cluster/data/turnstone.db
chmod 600 /devl/turnstone-cluster/data/turnstone-context.db
chown turnstone:turnstone /devl/turnstone-cluster/data/
```
Run Turnstone as a dedicated non-root user via systemd `User=turnstone`.
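Since Turnstone does not enforce these permissions itself, a deployment check can verify them. A minimal sketch, assuming a POSIX host; `is_private` is a hypothetical helper name:
```python
import os
import stat


def is_private(path: str) -> bool:
    """True if neither group nor others hold any permission bit on path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
```
A file with mode `600` passes; a file with mode `644` does not, because others can read it.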
---
## Audit Logging
### API query logging
✅ **Implemented as FastAPI middleware (`turnstone.audit` logger).**
Every request to `/turnstone/api/*` is logged at INFO level with:
- Timestamp (from the logging handler)
- HTTP method
- Path + query string
- Response status code
- Request duration (ms)
Body content is never logged. Example output:
```
2026-05-28 14:23:01 INFO turnstone.audit GET /turnstone/api/diagnose/stream?source=heimdall-journal 200 1843ms
```
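For readers who want to see how such a middleware can produce that line, here is a dependency-free ASGI sketch. It is not Turnstone's implementation: the class name `AuditMiddleware` and the `prefix` parameter are assumptions, and only the logger name and the logged fields come from this checklist.
```python
import logging
import time

audit_log = logging.getLogger("turnstone.audit")


class AuditMiddleware:
    """Pure-ASGI middleware: log method, path, status and duration."""

    def __init__(self, app, prefix="/turnstone/api/"):
        self.app = app
        self.prefix = prefix

    async def __call__(self, scope, receive, send):
        # Pass through anything that is not an HTTP request under the prefix.
        if scope["type"] != "http" or not scope["path"].startswith(self.prefix):
            await self.app(scope, receive, send)
            return

        start = time.perf_counter()
        status = 0  # stays 0 if the app fails before sending a response

        async def send_wrapper(message):
            nonlocal status
            if message["type"] == "http.response.start":
                status = message["status"]
            await send(message)

        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            query = scope.get("query_string", b"").decode("latin-1")
            target = scope["path"] + ("?" + query if query else "")
            elapsed_ms = (time.perf_counter() - start) * 1000
            # The request body is never read here, so it cannot be logged.
            audit_log.info("%s %s %d %.0fms", scope["method"], target, status, elapsed_ms)
```
With FastAPI, a class of this shape can be registered through `app.add_middleware(AuditMiddleware)`. The timestamp in the example output comes from the logging handler's formatter, not from the middleware.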
To capture audit logs to a separate file, configure the `turnstone.audit` logger in your logging config:
```python
# In your uvicorn startup or log config YAML:
logging.getLogger("turnstone.audit").addHandler(
logging.FileHandler("/var/log/turnstone/audit.log")
)
```
### Glean operation logging
✅ **Glean scheduler logs source ID, entry count, and duration at INFO level.**
Relevant logger: `app.tasks.glean_scheduler` — logs start, per-source stats, and errors.
Log example:
```
INFO app.tasks.glean_scheduler Batch glean complete in 12.4s — {'heimdall-journal': 847, 'plex': 12}
```
### Error logging
✅ **Errors logged with source context but without PII in message fields.**
Exception handlers in `rest.py` log at ERROR level with the endpoint path and error type. Raw log entry text is not included in error messages. Stack traces go to the `uvicorn.error` logger.
---
## LLM / PII Egress
### Multi-agent pipeline (recommended path, `TURNSTONE_MULTI_AGENT_DIAGNOSE=true`)
✅ **Raw log message text is NOT sent to the LLM.**
Stage 5 (synthesizer) sends only:
- The operator's query string
- Timeline statistics (cluster counts, burst counts, gap counts — no entry text)
- Hypothesis titles from Stage 3 (derived labels, not raw messages)
- Runbook context from the operator's own uploaded documents
No raw `MESSAGE` field content reaches the LLM in this path. Review: `app/services/diagnose/synthesizer.py`.
### Legacy single-call path (`TURNSTONE_MULTI_AGENT_DIAGNOSE` unset or `false`)
⚠️ **Raw log message text (truncated to 200 chars) IS sent to the LLM.**
The legacy `summarize()` function in `app/services/llm.py` builds a prompt that includes up to 25 log entries with their `text` field (truncated). If log entries contain hostnames, usernames, IP addresses, or other PII, those values are included in the LLM call.
**Operator action for PII-sensitive deployments:** Enable `TURNSTONE_MULTI_AGENT_DIAGNOSE=true` to use the pipeline path, which does not expose raw log text.
### Avocet harvester (corpus export)
✅ **Only pattern-tagged entries are exported; export can be disabled.**
The harvester (`harvester/harvester.py`) only POSTs entries that matched at least one named pattern. It does not export the full corpus. Disable by leaving `TURNSTONE_SUBMIT_ENDPOINT` unset (the default).
### External telemetry
**None.** Turnstone makes no calls to Sentry, Segment, Amplitude, or any analytics service. The only outbound network calls are:
- Your configured `GPU_SERVER_URL` (LLM inference, operator-controlled)
- HuggingFace Hub (model downloads — disable with `TURNSTONE_OFFLINE_MODE=1`)
- SSH connections to configured remote log sources (operator-defined)
---
## Configuration Hardening
For compliance deployments, set these in `.env`:
```bash
# Block HuggingFace network access (model weights pre-downloaded)
TURNSTONE_OFFLINE_MODE=1
# Require bearer token for all API calls
TURNSTONE_API_KEY=<strong-random-token>
# Use multi-agent pipeline (no raw log text to LLM)
TURNSTONE_MULTI_AGENT_DIAGNOSE=true
# Disable Avocet corpus push if not needed
# (leave TURNSTONE_SUBMIT_ENDPOINT unset)
```
---
## Outstanding Items
🔲 **Per-user access control** — all authenticated clients share the same API key. There is no per-user identity, role separation, or per-source ACL. Track as a future enhancement.
🔲 **Audit log retention policy** — Turnstone writes audit events to the logging system but does not manage log rotation or retention. Operator must configure log rotation (logrotate, systemd journal limits, etc.).
🔲 **Encrypted DB at rest** — Stock SQLite has no built-in transparent encryption. For encryption at rest, use full-disk encryption (LUKS) or an encrypted filesystem on the host.
🔲 **TLS between client and Turnstone** — Turnstone binds to HTTP by default. For production, place Caddy or nginx in front and terminate TLS there. Do not expose port 8534 directly over untrusted networks.
---
## Data Subject Rights (GDPR / CCPA)
### Right to erasure — anonymized records
⚠️ **Anonymized log data cannot be selectively deleted on a per-subject basis.**
When PII sanitization is applied to a bundle export (redacting IP addresses, usernames, hostnames), the resulting data is no longer linked to a specific data subject. As a consequence, Turnstone cannot identify which stored log entries relate to that subject and cannot fulfill a targeted deletion request for records that have already been anonymized.
**Operators must clearly disclose this limitation to data subjects before export:**
> "Anonymized log data exported or submitted from this system cannot be individually identified or selectively deleted. If data was exported in anonymized form, Turnstone cannot distinguish your records from others in the exported set. The right to erasure does not apply to data that is no longer personally identifiable."
This is consistent with GDPR Recital 26, which excludes anonymized data from the regulation's scope. However, the original (pre-anonymization) records in Turnstone's local SQLite database *can* be deleted by source ID via the Sources view (Delete all entries for source) or directly via the database.
**Recommended operator practice:**
- Maintain a log of which bundles were exported, when, and to whom — the audit log (`turnstone.audit`) covers this.
- Provide data subjects with the bundle export timestamp and source scope so they can verify what was shared.
- For full erasure of pre-anonymization records: use `DELETE /api/sources/{source_id}` to purge all entries for a given source from the local DB.

63
docs/tautulli-setup.md Normal file

@ -0,0 +1,63 @@
# Tautulli Webhook Setup
Tautulli is a Plex Media Server (PMS) monitoring application. This guide shows
how to configure Tautulli to send playback events to Turnstone.
## Triggers to enable
In your notification agent, enable these triggers:
- Playback Start
- Playback Stop
- Playback Pause
- Playback Resume
- Playback Error
- Playback Buffering
## JSON body template
Paste this into the **JSON Data** field of the Tautulli Custom Script / Webhook
notification agent:
```json
{
"action": "{action}",
"timestamp": "{timestamp}",
"user": "{user}",
"player": "{player}",
"media_type": "{media_type}",
"title": "{title}",
"grandparent_title": "{grandparent_title}",
"quality": "{quality}",
"video_decision": "{video_decision}",
"audio_decision": "{audio_decision}",
"error_message": "{error_message}",
"session_key": "{session_key}"
}
```
## Webhook URL
```
http://<turnstone-host>:8534/turnstone/api/glean/tautulli
```
Replace `<turnstone-host>` with the hostname or IP of the machine running
Turnstone. 8534 is the default port; adjust the URL if you use a reverse proxy or changed the port.
## Optional token authentication
If you set `tautulli_token` in Turnstone settings, every webhook request must
include a matching header:
```
X-Tautulli-Token: <your-token>
```
Add this header in the Tautulli notification agent's **Headers** section.
Requests with a missing or wrong token are rejected with HTTP 403.
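To check the endpoint and token without waiting for real playback, a test event can be built by hand. A minimal sketch using only the standard library; the function name `build_test_event` and the sample payload values are illustrative, while the URL path and header name are the ones given above.
```python
import json
import urllib.request
from typing import Optional


def build_test_event(base_url: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST that mimics a Tautulli webhook."""
    payload = {
        "action": "play",        # sample values, not real playback data
        "user": "test-user",
        "title": "Test Title",
        "session_key": "1",
    }
    headers = {"Content-Type": "application/json"}
    if token:
        headers["X-Tautulli-Token"] = token
    return urllib.request.Request(
        base_url.rstrip("/") + "/turnstone/api/glean/tautulli",
        data=json.dumps(payload).encode(),
        headers=headers,
        method="POST",
    )
```
Sending it is a single call: `urllib.request.urlopen(build_test_event("http://localhost:8534", "your-token"), timeout=10)`. A wrong token should raise `urllib.error.HTTPError` with code 403.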
## Searching events
All events are stored under source `tautulli` and are immediately searchable
in Turnstone after each webhook is received.

18
harvester/Dockerfile Normal file

@ -0,0 +1,18 @@
FROM python:3.12-slim
WORKDIR /harvester
RUN pip install --no-cache-dir pyyaml
COPY harvester.py .
# Default volume mounts expected at runtime:
# /var/log → host /var/log (read-only)
# /run/log/journal → host /run/log/journal (read-only)
# /patterns → sources.yaml directory (read-only)
ENV TURNSTONE_URL=http://turnstone:8534
ENV TURNSTONE_SOURCES=/patterns/sources.yaml
ENTRYPOINT ["python", "harvester.py"]
CMD ["push"]


@ -0,0 +1,23 @@
services:
  harvester:
    build: .
    image: turnstone-harvester:latest
    environment:
      TURNSTONE_URL: http://turnstone:8534 # or http://host.docker.internal:8534 for host-network Turnstone
      TURNSTONE_SOURCES: /patterns/sources.yaml
    volumes:
      - /var/log:/var/log:ro
      - /run/log/journal:/run/log/journal:ro
      - ../patterns:/patterns:ro # sources.yaml lives here
    networks:
      - turnstone-net
    restart: "no" # run on demand; use cron or systemd timer to repeat
    # To run on a schedule, keep restart: "no" and trigger the one-shot container
    # from cron or a systemd timer, for example:
    #   docker compose -f docker-compose.yml run --rm harvester
    # A bare "docker run --rm turnstone-harvester:latest push" also works, but it
    # does not read this file, so pass the same volume and network flags yourself.
networks:
  turnstone-net:
    external: true # join the same network as the main Turnstone container

201
harvester/harvester.py Normal file
View file

@ -0,0 +1,201 @@
#!/usr/bin/env python3
"""Turnstone Harvester — collect logs and ship them to a Turnstone instance.

Subcommands:
    push      Read sources.yaml, POST each log file to Turnstone /turnstone/api/glean/upload
    incident  Tag an incident on the remote Turnstone instance

Usage:
    # Push all configured sources
    python harvester.py push --url http://turnstone:8534 --sources /patterns/sources.yaml

    # Tag an incident
    python harvester.py incident "jellyseerr went down" \\
        --url http://turnstone:8534 \\
        --started "2026-05-19 10:00" --ended "2026-05-19 10:30" \\
        --type crash --severity HIGH

Environment variables (defaults; command-line flags override them):
    TURNSTONE_URL      Base URL of the Turnstone instance
    TURNSTONE_SOURCES  Path to sources.yaml
"""
from __future__ import annotations

import argparse
import json
import logging
import sys
import urllib.error
import urllib.parse
import urllib.request
from pathlib import Path

import yaml

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger("harvester")

# ---------------------------------------------------------------------------
# HTTP helpers
# ---------------------------------------------------------------------------
def _post_json(url: str, payload: dict) -> dict:
    data = json.dumps(payload).encode()
    req = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())


def _post_file(url: str, path: Path, source_id: str) -> dict:
    """POST a log file as multipart/form-data."""
    boundary = "----TurnstoneHarvesterBoundary"
    body_parts: list[bytes] = []
    content = path.read_bytes()
    body_parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{path.name}"\r\n'
        f"Content-Type: text/plain\r\n\r\n".encode()
    )
    body_parts.append(content)
    body_parts.append(b"\r\n")
    body_parts.append(f"--{boundary}--\r\n".encode())
    body = b"".join(body_parts)
    # source_id travels as a query parameter, not as a form field.
    params = urllib.parse.urlencode({"source_id": source_id})
    req = urllib.request.Request(
        f"{url}?{params}",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())

# ---------------------------------------------------------------------------
# push subcommand
# ---------------------------------------------------------------------------
def cmd_push(args: argparse.Namespace) -> int:
    sources_path = Path(args.sources)
    if not sources_path.exists():
        logger.error("sources file not found: %s", sources_path)
        return 1
    with open(sources_path) as f:
        config = yaml.safe_load(f) or {}
    sources = config.get("sources", [])
    if not sources:
        logger.warning("No sources defined in %s", sources_path)
        return 0
    upload_url = args.url.rstrip("/") + "/turnstone/api/glean/upload"
    total_gleaned = 0
    errors = 0
    for src in sources:
        src_id = src.get("id", "unknown")
        src_path = Path(src["path"])
        if not src_path.exists():
            logger.warning("Source %r not found, skipping: %s", src_id, src_path)
            continue
        logger.info("Pushing %s (%s) ...", src_id, src_path)
        try:
            result = _post_file(upload_url, src_path, src_id)
            count = result.get("gleaned", 0)
            total_gleaned += count
            logger.info(" %s: %d entries gleaned", src_id, count)
        except urllib.error.HTTPError as exc:
            # Separate the status code from the response body in the log line.
            logger.error(" %s: HTTP %d: %s", src_id, exc.code, exc.read().decode(errors="replace"))
            errors += 1
        except Exception as exc:
            logger.error(" %s: %s", src_id, exc)
            errors += 1
    logger.info("Done. Total gleaned: %d entries, errors: %d", total_gleaned, errors)
    return 1 if errors else 0

# ---------------------------------------------------------------------------
# incident subcommand
# ---------------------------------------------------------------------------
def cmd_incident(args: argparse.Namespace) -> int:
    payload = {
        "label": args.label,
        "issue_type": args.type or "",
        "started_at": args.started or "",
        "ended_at": args.ended or "",
        "notes": args.notes or "",
        "severity": args.severity or "MEDIUM",
    }
    url = args.url.rstrip("/") + "/turnstone/api/incidents"
    try:
        result = _post_json(url, payload)
        logger.info("Incident created: %s", result.get("id", result))
        return 0
    except urllib.error.HTTPError as exc:
        # Separate the status code from the response body in the log line.
        logger.error("HTTP %d: %s", exc.code, exc.read().decode(errors="replace"))
        return 1
    except Exception as exc:
        logger.error("%s", exc)
        return 1

# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
def build_parser() -> argparse.ArgumentParser:
    import os

    default_url = os.environ.get("TURNSTONE_URL", "http://localhost:8534")
    default_sources = os.environ.get("TURNSTONE_SOURCES", "/patterns/sources.yaml")
    parser = argparse.ArgumentParser(
        description="Turnstone Harvester — ship logs and tag incidents",
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    sub = parser.add_subparsers(dest="cmd", required=True)
    # push
    p_push = sub.add_parser("push", help="Push log files to Turnstone")
    p_push.add_argument("--url", default=default_url, help="Turnstone base URL (default: %(default)s)")
    p_push.add_argument("--sources", default=default_sources, help="Path to sources.yaml (default: %(default)s)")
    # incident
    p_inc = sub.add_parser("incident", help="Tag an incident on the Turnstone instance")
    p_inc.add_argument("label", help="Short description of the incident")
    p_inc.add_argument("--url", default=default_url, help="Turnstone base URL (default: %(default)s)")
    p_inc.add_argument("--started", help="Start time (ISO or natural language)")
    p_inc.add_argument("--ended", help="End time (ISO or natural language)")
    p_inc.add_argument("--type", dest="type", help="Issue type tag (e.g. crash, oom, auth_fail)")
    p_inc.add_argument("--severity", default="MEDIUM",
                       choices=["LOW", "MEDIUM", "HIGH", "CRITICAL"],
                       help="Incident severity (default: MEDIUM)")
    p_inc.add_argument("--notes", help="Additional notes")
    return parser


def main() -> int:
    parser = build_parser()
    args = parser.parse_args()
    if args.cmd == "push":
        return cmd_push(args)
    if args.cmd == "incident":
        return cmd_incident(args)
    parser.print_help()
    return 1


if __name__ == "__main__":
    sys.exit(main())

26
harvester/harvester.sh Executable file

@ -0,0 +1,26 @@
#!/usr/bin/env bash
# Turnstone Harvester — containerless wrapper
# Requires: python3, pip install pyyaml
#
# Usage:
# ./harvester.sh push
# ./harvester.sh incident "jellyseerr went down" --started "2026-05-19 10:00" --type crash
#
# Environment variables:
# TURNSTONE_URL Base URL of the Turnstone instance (default: http://localhost:8534)
# TURNSTONE_SOURCES Path to sources.yaml (default: /etc/turnstone/sources.yaml)
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
export TURNSTONE_URL="${TURNSTONE_URL:-http://localhost:8534}"
export TURNSTONE_SOURCES="${TURNSTONE_SOURCES:-/etc/turnstone/sources.yaml}"
# Install dependencies if not present
if ! python3 -c "import yaml" 2>/dev/null; then
echo "Installing pyyaml..."
pip3 install --quiet pyyaml
fi
exec python3 "$SCRIPT_DIR/harvester.py" "$@"


@ -0,0 +1,51 @@
# Turnstone Harvester — sources.example.yaml
# Copy to sources.yaml and adjust paths for your system.
# The harvester reads this file and POSTs each log file to Turnstone.
#
# Each source needs:
#   id:   Short identifier (used as source_id in Turnstone)
#   path: Absolute path to the log file on the host
sources:
  # System journal (export with: journalctl -o json > /var/log/journal-export.jsonl)
  # "-o json" writes one JSON object per line, which matches the .jsonl extension.
  # - id: system-journal
  #   path: /var/log/journal-export.jsonl
  # Syslog
  - id: syslog
    path: /var/log/syslog
  # Docker daemon log
  # - id: docker
  #   path: /var/log/docker.log
  # Podman events (rootful)
  # - id: podman
  #   path: /var/log/podman-events.log
  # Caddy access log
  # - id: caddy
  #   path: /var/log/caddy/access.log
  # Arr stack — adjust container paths to match your setup
  # - id: sonarr
  #   path: /opt/sonarr/config/logs/sonarr.0.txt
  # - id: radarr
  #   path: /opt/radarr/config/logs/radarr.0.txt
  # - id: prowlarr
  #   path: /opt/prowlarr/config/logs/prowlarr.0.txt
  # qBittorrent
  # - id: qbittorrent
  #   path: /opt/qbittorrent/config/data/logs/qbittorrent.log
  # Jellyfin
  # - id: jellyfin
  #   path: /opt/jellyfin/log/jellyfin.log
  # Wazuh SIEM — alerts.json on the Wazuh manager
  # Turnstone auto-detects this format; source_id is qualified per agent automatically.
  # For push-based ingestion from Wazuh custom integrations, use:
  # POST /api/glean/wazuh/alert (single alert JSON body)
  # - id: wazuh
  #   path: /var/ossec/logs/alerts/alerts.json


@ -23,11 +23,30 @@ VITE_PORT=5174 # Vite HMR port in dev mode (proxies /api → 8534)
LOG_DIR="log" LOG_DIR="log"
API_PID_FILE=".turnstone-api.pid" API_PID_FILE=".turnstone-api.pid"
DB="${TURNSTONE_DB:-${SCRIPT_DIR}/data/turnstone.db}" # Default to the live cluster DB when present; fall back to dev DB.
_CLUSTER_DB="/devl/turnstone-cluster/data/turnstone.db"
_DEV_DB="${SCRIPT_DIR}/data/turnstone.db"
if [[ -z "${TURNSTONE_DB:-}" ]]; then
DB="$([[ -d /devl/turnstone-cluster ]] && echo "${_CLUSTER_DB}" || echo "${_DEV_DB}")"
else
DB="${TURNSTONE_DB}"
fi
# Use cluster patterns (watch.yaml, default.yaml) when available.
PATTERN_DIR="${TURNSTONE_PATTERNS:-$([[ -d /devl/turnstone-cluster/patterns ]] && echo "/devl/turnstone-cluster/patterns" || echo "${SCRIPT_DIR}/patterns")}"
CONDA_BASE="${CONDA_BASE:-/devl/miniconda3}" CONDA_BASE="${CONDA_BASE:-/devl/miniconda3}"
PYTHON="${CONDA_BASE}/envs/cf/bin/python" PYTHON="${CONDA_BASE}/envs/cf/bin/python"
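# Source .env if present: loads TURNSTONE_MULTI_AGENT_DIAGNOSE, GPU_SERVER_URL, etc.
# set -a exports every variable assigned while sourcing. Plain VAR=value lines in
# .env overwrite values already set in the environment; write VAR="${VAR:-value}"
# in .env if the caller's value should take precedence.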
if [[ -f "${SCRIPT_DIR}/.env" ]]; then
set -a
# shellcheck source=/dev/null
source "${SCRIPT_DIR}/.env"
set +a
fi
# ── Helpers ─────────────────────────────────────────────────────────────────── # ── Helpers ───────────────────────────────────────────────────────────────────
_is_alive() { _is_alive() {
@ -35,6 +54,31 @@ _is_alive() {
[[ -f "$pid_file" ]] && kill -0 "$(<"$pid_file")" 2>/dev/null [[ -f "$pid_file" ]] && kill -0 "$(<"$pid_file")" 2>/dev/null
} }
# Kill any process currently holding a TCP port.
# Optional second argument: signal name to send (default: TERM).
_kill_port() {
    local port="$1" sig="${2:-TERM}"
    local pids pid
    # "|| true" keeps an empty match from aborting the script if it runs under set -e / pipefail.
    pids=$(ss -tlnp "sport = :${port}" 2>/dev/null | grep -oP '(?<=pid=)\d+' | sort -u || true)
    [[ -z "$pids" ]] && return 0
    for pid in $pids; do
        warn "Killing stray PID ${pid} on port ${port} (SIG${sig})"
        kill -s "$sig" "$pid" 2>/dev/null || true
    done
}
# Wait for a port to stop accepting connections (i.e. fully released).
_wait_for_port_free() {
    local port="$1" _i
    for _i in $(seq 1 30); do
        sleep 0.3
        (echo "" >/dev/tcp/127.0.0.1/"$port") 2>/dev/null || return 0
    done
    warn "Port ${port} still occupied after 9 s; sending SIGKILL"
    _kill_port "$port" KILL
    sleep 1
    (echo "" >/dev/tcp/127.0.0.1/"$port") 2>/dev/null && warn "Port ${port} still in use!" || true
}
_kill_pid_file() { _kill_pid_file() {
local pid_file="$1" label="$2" local pid_file="$1" label="$2"
if [[ -f "$pid_file" ]]; then if [[ -f "$pid_file" ]]; then
@ -48,7 +92,7 @@ _kill_pid_file() {
rm -f "$pid_file" rm -f "$pid_file"
fi fi
else else
warn "$label not running." warn "No PID file for $label."
fi fi
} }
@ -85,9 +129,9 @@ usage() {
echo -e " ${GREEN}dev${NC} uvicorn --reload (:${API_PORT}) + Vite HMR (:${VITE_PORT})" echo -e " ${GREEN}dev${NC} uvicorn --reload (:${API_PORT}) + Vite HMR (:${VITE_PORT})"
echo "" echo ""
echo " Data:" echo " Data:"
echo -e " ${GREEN}ingest PATH [DB]${NC} Ingest a log file or corpus directory" echo -e " ${GREEN}glean PATH [DB]${NC} Glean a log file or corpus directory"
echo -e " ${GREEN}ingest-plex [HOST]${NC} Pull Plex log from Cass (or HOST) and ingest" echo -e " ${GREEN}glean-plex [HOST]${NC} Pull Plex log from Cass (or HOST) and glean"
echo -e " ${GREEN}ingest-qbit [HOST]${NC} Pull qBittorrent log locally or from HOST via SSH" echo -e " ${GREEN}glean-qbit [HOST]${NC} Pull qBittorrent log locally or from HOST via SSH"
echo -e " ${GREEN}build-fts${NC} Rebuild the FTS search index" echo -e " ${GREEN}build-fts${NC} Rebuild the FTS search index"
echo "" echo ""
echo " Tests:" echo " Tests:"
@@ -99,8 +143,8 @@ usage() {
echo " Examples:" echo " Examples:"
echo " ./manage.sh start" echo " ./manage.sh start"
echo " ./manage.sh dev" echo " ./manage.sh dev"
echo " ./manage.sh ingest corpus/raw/" echo " ./manage.sh glean corpus/raw/"
echo " ./manage.sh ingest corpus/raw/ data/custom.db" echo " ./manage.sh glean corpus/raw/ data/custom.db"
echo "" echo ""
} }
@@ -123,7 +167,9 @@ case "$CMD" in
success "SPA built → web/dist/" success "SPA built → web/dist/"
info "Starting on port ${API_PORT}" info "Starting on port ${API_PORT}"
TURNSTONE_DB="$DB" nohup "$PYTHON" -m uvicorn app.rest:app \ info " DB: ${DB}"
info " Patterns: ${PATTERN_DIR}"
TURNSTONE_DB="$DB" TURNSTONE_PATTERNS="$PATTERN_DIR" nohup "$PYTHON" -m uvicorn app.rest:app \
--host 0.0.0.0 --port "$API_PORT" \ --host 0.0.0.0 --port "$API_PORT" \
>> "${LOG_DIR}/api.log" 2>&1 & >> "${LOG_DIR}/api.log" 2>&1 &
echo $! > "$API_PID_FILE" echo $! > "$API_PID_FILE"
@@ -133,6 +179,8 @@ case "$CMD" in
stop) stop)
_kill_pid_file "$API_PID_FILE" "Turnstone" _kill_pid_file "$API_PID_FILE" "Turnstone"
_kill_port "$API_PORT"
_wait_for_port_free "$API_PORT"
;; ;;
restart) restart)
@@ -192,15 +240,15 @@ case "$CMD" in
(cd web && npm run dev -- --port "$VITE_PORT") (cd web && npm run dev -- --port "$VITE_PORT")
;; ;;
ingest) glean)
if [[ $# -lt 1 ]]; then if [[ $# -lt 1 ]]; then
error "Usage: ./manage.sh ingest <file_or_dir> [DB_PATH]" error "Usage: ./manage.sh glean <file_or_dir> [DB_PATH]"
fi fi
info "Ingesting $1${2:-$DB}" info "Gleaning $1${2:-$DB}"
"$PYTHON" scripts/ingest_corpus.py "$1" "${2:-$DB}" "$PYTHON" scripts/glean_corpus.py "$1" "${2:-$DB}"
;; ;;
ingest-plex) glean-plex)
PLEX_HOST="${1:-cass}" PLEX_HOST="${1:-cass}"
PLEX_LOG_DIR="/var/lib/plexmediaserver/Library/Application Support/Plex Media Server/Logs" PLEX_LOG_DIR="/var/lib/plexmediaserver/Library/Application Support/Plex Media Server/Logs"
TMP_DIR="/tmp/turnstone-plex-$$" TMP_DIR="/tmp/turnstone-plex-$$"
@@ -225,16 +273,16 @@ case "$CMD" in
ssh "$PLEX_HOST" "cat '${remote_path}'" > "$local_path" ssh "$PLEX_HOST" "cat '${remote_path}'" > "$local_path"
done done
info "Ingesting ${#REMOTE_LOGS[@]} log file(s) into ${DB}" info "Gleaning ${#REMOTE_LOGS[@]} log file(s) into ${DB}"
for f in "$TMP_DIR"/*.log; do for f in "$TMP_DIR"/*.log; do
"$PYTHON" scripts/ingest_corpus.py "$f" "$DB" "$PYTHON" scripts/glean_corpus.py "$f" "$DB"
done done
rm -rf "$TMP_DIR" rm -rf "$TMP_DIR"
info "Done. Restarting server…" info "Done. Restarting server…"
exec bash "$0" restart exec bash "$0" restart
;; ;;
ingest-qbit) glean-qbit)
QBIT_HOST="${1:-}" QBIT_HOST="${1:-}"
# Default log locations in priority order # Default log locations in priority order
QBIT_LOG_PATHS=( QBIT_LOG_PATHS=(
@@ -277,8 +325,8 @@ case "$CMD" in
info "${LOCAL_LOG}" info "${LOCAL_LOG}"
fi fi
info "Ingesting into ${DB}" info "Gleaning into ${DB}"
"$PYTHON" scripts/ingest_corpus.py "${TMP_DIR}"/*.log "$DB" "$PYTHON" scripts/glean_corpus.py "${TMP_DIR}"/*.log "$DB"
rm -rf "$TMP_DIR" rm -rf "$TMP_DIR"
info "Done. Restarting server…" info "Done. Restarting server…"
exec bash "$0" restart exec bash "$0" restart
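
The `_wait_for_port_free` helper added above polls the port up to 30 times at 0.3 s intervals (about 9 s) before escalating. The same polling loop can be sketched in Python, the language the service itself runs in. This is a minimal illustration only: the function names, the injectable `probe` and `sleep` parameters, and the choice to return a boolean instead of retrying the kill are assumptions made for the example, not part of manage.sh.

```python
import socket
import time


def port_accepts_connections(port, host="127.0.0.1", timeout=0.2):
    """Return True if something is still accepting TCP connections on the port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port_free(port, attempts=30, delay=0.3,
                       probe=port_accepts_connections, sleep=time.sleep):
    """Poll until the port stops accepting connections.

    Mirrors the shell loop: sleep first, then probe, up to `attempts` times.
    Returns True once the port is free, False if it is still busy at the end.
    """
    for _ in range(attempts):
        sleep(delay)
        if not probe(port):
            return True
    return False
```

Passing `probe` and `sleep` as parameters keeps the loop testable without opening real sockets; a caller that gets `False` back would decide whether to escalate, as the shell version does.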


@@ -2,83 +2,101 @@
# Each matched pattern name is stored on RetrievedEntry.matched_patterns and # Each matched pattern name is stored on RetrievedEntry.matched_patterns and
# used to boost retrieval relevance for diagnostic queries. # used to boost retrieval relevance for diagnostic queries.
# #
# Add domain-specific patterns here. Patterns are applied in order; multiple # domain: groups patterns into service health domains for triage-level summaries.
# can match a single entry. # Valid domains: service_health | networking | auth | storage | memory |
# kernel | power | web_proxy | media | gpu | audio
#
# Patterns are applied in order; multiple can match a single entry.
patterns: patterns:
- name: service_restart - name: service_restart
pattern: "(restarting|restart requested|service.*start)" pattern: "(restarting|restart requested|service.*start)"
severity: WARN severity: WARN
domain: service_health
description: Service restart detected description: Service restart detected
- name: connection_lost - name: connection_lost
pattern: "(connection (lost|dropped|refused|timed? out)|disconnect(ed)?)" pattern: "(connection (lost|dropped|refused|timed? out)|disconnect(ed)?)"
severity: ERROR severity: ERROR
domain: networking
description: Network or device connection failure description: Network or device connection failure
- name: auth_failure - name: auth_failure
pattern: "(auth(entication)? (failed?|error|denied)|permission denied|unauthorized)" pattern: "(auth(entication)? (failed?|error|denied)|permission denied|unauthorized)"
severity: ERROR severity: ERROR
domain: auth
description: Authentication or authorization failure description: Authentication or authorization failure
- name: oom - name: oom
pattern: "(out of memory|OOM|killed process|cannot allocate)" pattern: "(out of memory|OOM|killed process|cannot allocate)"
severity: CRITICAL severity: CRITICAL
domain: memory
description: Out-of-memory condition description: Out-of-memory condition
- name: segfault - name: segfault
pattern: "(segmentation fault|segfault|SIGSEGV|core dump)" pattern: "(segmentation fault|segfault|SIGSEGV|core dump)"
severity: CRITICAL severity: CRITICAL
domain: kernel
description: Process crash or memory corruption description: Process crash or memory corruption
- name: disk_full - name: disk_full
pattern: "(no space left|disk full|filesystem.*full|ENOSPC)" pattern: "(no space left|disk full|filesystem.*full|ENOSPC)"
severity: ERROR severity: ERROR
domain: storage
description: Storage capacity exhausted description: Storage capacity exhausted
- name: timeout - name: timeout
pattern: "(timed? out|deadline exceeded|operation timed?)" pattern: "(timed? out|deadline exceeded|operation timed?)"
severity: WARN severity: WARN
domain: networking
description: Operation timeout description: Operation timeout
- name: caddy_tls_error - name: caddy_tls_error
pattern: "(acme|certificate|tls).*(error|fail|invalid|expired|renew)" pattern: "(acme|certificate|tls).*(error|fail|invalid|expired|renew)"
severity: ERROR severity: ERROR
domain: web_proxy
description: Caddy TLS or certificate error description: Caddy TLS or certificate error
- name: caddy_config_error - name: caddy_config_error
pattern: "(config|caddyfile|directive).*(error|invalid|unknown|unrecognized)" pattern: "(config|caddyfile|directive).*(error|invalid|unknown|unrecognized)"
severity: ERROR severity: ERROR
domain: web_proxy
description: Caddy configuration error description: Caddy configuration error
- name: caddy_auth_error - name: caddy_auth_error
pattern: "(forward_auth|basicauth|basic_auth).*(error|fail|denied|invalid|unreachable)" pattern: "(forward_auth|basicauth|basic_auth).*(error|fail|denied|invalid|unreachable)"
severity: ERROR severity: ERROR
domain: web_proxy
description: Caddy authentication middleware failure description: Caddy authentication middleware failure
- name: caddy_upstream_error - name: caddy_upstream_error
pattern: "(upstream|backend|reverse.proxy).*(error|fail|unreachable|refused|timeout)" pattern: "(upstream|backend|reverse.proxy).*(error|fail|unreachable|refused|timeout)"
severity: ERROR severity: ERROR
domain: web_proxy
description: Caddy upstream/backend failure description: Caddy upstream/backend failure
- name: service_update - name: service_update
pattern: "(upgraded?|updated?|installing|dpkg|apt|package).*(caddy|nginx|apache|proxy)" pattern: "(upgraded?|updated?|installing|dpkg|apt|package).*(caddy|nginx|apache|proxy)"
severity: INFO severity: INFO
domain: web_proxy
description: Web server package update detected description: Web server package update detected
- name: power_failure - name: power_failure
pattern: "(power (fail|loss|outage|cut)|ups|battery|shutdown.*power|lost power)" pattern: "(power (fail|loss|outage|cut)|ups|battery|shutdown.*power|lost power)"
severity: CRITICAL severity: CRITICAL
domain: power
description: Power failure or UPS event description: Power failure or UPS event
- name: network_interface - name: network_interface
pattern: "(eth[0-9]|ens[0-9]|enp[0-9]|wlan[0-9]).*(down|up|carrier|link)" pattern: "(eth[0-9]|ens[0-9]|enp[0-9]|wlan[0-9]).*(down|up|carrier|link)"
severity: WARN severity: WARN
domain: networking
description: Network interface state change description: Network interface state change
- name: ip_change - name: ip_change
pattern: "(new ip|ip.*(changed|assigned|address)|dhcp.*(ack|offer|bound|renew))" pattern: "(new ip|ip.*(changed|assigned|address)|dhcp.*(ack|offer|bound|renew))"
severity: INFO severity: INFO
domain: networking
description: IP address change or DHCP event description: IP address change or DHCP event
# ── System / journald patterns ───────────────────────────────────────────── # ── System / journald patterns ─────────────────────────────────────────────
@@ -86,46 +104,55 @@ patterns:
- name: systemd_fail - name: systemd_fail
pattern: "(Failed to start|failed with result|entered failed state|start request repeated too quickly|Main process exited)" pattern: "(Failed to start|failed with result|entered failed state|start request repeated too quickly|Main process exited)"
severity: ERROR severity: ERROR
domain: service_health
description: systemd service failed to start or crashed description: systemd service failed to start or crashed
- name: oom_kill - name: oom_kill
pattern: "(Killed process|oom.kill|oom_kill_process|Out of memory: Kill|memory cgroup out of memory)" pattern: "(Killed process|oom.kill|oom_kill_process|Out of memory: Kill|memory cgroup out of memory)"
severity: CRITICAL severity: CRITICAL
domain: memory
description: Kernel OOM killer terminated a process description: Kernel OOM killer terminated a process
- name: disk_hw_error - name: disk_hw_error
pattern: "(ata[0-9]|sd[a-z]|nvme[0-9]).*(error|failed|reset|timeout|exception|EH|FAILED COMMAND)" pattern: "(ata[0-9]|sd[a-z]|nvme[0-9]).*(error|failed|reset|timeout|exception|EH|FAILED COMMAND)"
severity: ERROR severity: ERROR
domain: storage
description: Storage device hardware error or reset description: Storage device hardware error or reset
- name: fs_error - name: fs_error
pattern: "(EXT4-fs error|XFS.*error|BTRFS.*error|I/O error|blk_update_request.*error|buffer I/O error)" pattern: "(EXT4-fs error|XFS.*error|BTRFS.*error|I/O error|blk_update_request.*error|buffer I/O error)"
severity: ERROR severity: ERROR
domain: storage
description: Filesystem or block I/O error description: Filesystem or block I/O error
- name: kernel_error - name: kernel_error
pattern: "(kernel: BUG|kernel panic|Oops:|general protection fault|Call Trace|RIP:.*[0-9a-f]{16})" pattern: "(kernel: BUG|kernel panic|Oops:|general protection fault|Call Trace|RIP:.*[0-9a-f]{16})"
severity: CRITICAL severity: CRITICAL
domain: kernel
description: Kernel bug, panic, or oops — system may be unstable description: Kernel bug, panic, or oops — system may be unstable
- name: ssh_brute - name: ssh_brute
pattern: "(Failed password|Invalid user|authentication failure|Connection closed by authenticating user).*(sshd|ssh)" pattern: "(Failed password|Invalid user|authentication failure|Connection closed by authenticating user).*(sshd|ssh)"
severity: WARN severity: WARN
domain: auth
description: SSH authentication failure — possible brute force description: SSH authentication failure — possible brute force
- name: container_crash - name: container_crash
pattern: "(container.*exited|oci runtime.*error|podman.*error|docker.*error|container.*killed|OCI.*failed)" pattern: "(container.*exited|oci runtime.*error|podman.*error|docker.*error|container.*killed|OCI.*failed)"
severity: ERROR severity: ERROR
domain: service_health
description: Container runtime error or unexpected exit description: Container runtime error or unexpected exit
- name: smart_error - name: smart_error
pattern: "(smartd|SMART.*error|reallocated sector|pending sector|uncorrectable sector|Current_Pending_Sector)" pattern: "(smartd|SMART.*error|reallocated sector|pending sector|uncorrectable sector|Current_Pending_Sector)"
severity: CRITICAL severity: CRITICAL
domain: storage
description: SMART disk health warning — potential drive failure description: SMART disk health warning — potential drive failure
- name: nfs_error - name: nfs_error
pattern: "(nfs.*error|nfs.*timeout|RPC.*timed out|nfs4.*server.*not responding|mount.*nfs.*failed)" pattern: "(nfs.*error|nfs.*timeout|RPC.*timed out|nfs4.*server.*not responding|mount.*nfs.*failed)"
severity: ERROR severity: ERROR
domain: networking
description: NFS mount or RPC timeout description: NFS mount or RPC timeout
# Add device/service-specific patterns below this line: # Add device/service-specific patterns below this line:
@@ -133,49 +160,156 @@ patterns:
- name: qbit_tracker_error - name: qbit_tracker_error
pattern: "(tracker|announce).*(not working|error|fail|unreachable|timeout|refused|invalid)" pattern: "(tracker|announce).*(not working|error|fail|unreachable|timeout|refused|invalid)"
severity: WARN severity: WARN
domain: media
description: qBittorrent tracker connection or announce failure description: qBittorrent tracker connection or announce failure
- name: qbit_port_bind - name: qbit_port_bind
pattern: "(couldn't? listen|bind.*fail|port.*in use|listening.*fail)" pattern: "(couldn't? listen|bind.*fail|port.*in use|listening.*fail)"
severity: CRITICAL severity: CRITICAL
domain: media
description: qBittorrent failed to bind listen port — firewall or port conflict description: qBittorrent failed to bind listen port — firewall or port conflict
- name: qbit_disk_error - name: qbit_disk_error
pattern: "(cannot (write|open|create)|disk.*error|i/o error|file.*fail|write.*fail)" pattern: "(cannot (write|open|create)|disk.*error|i/o error|file.*fail|write.*fail)"
severity: ERROR severity: ERROR
domain: media
description: qBittorrent disk write or file access failure description: qBittorrent disk write or file access failure
- name: qbit_hash_fail - name: qbit_hash_fail
pattern: "(hash.*(check|fail|mismatch)|recheck|piece.*fail)" pattern: "(hash.*(check|fail|mismatch)|recheck|piece.*fail)"
severity: WARN severity: WARN
domain: media
description: qBittorrent torrent hash verification failure — possible corrupt data description: qBittorrent torrent hash verification failure — possible corrupt data
- name: qbit_peer_ban - name: qbit_peer_ban
pattern: "(peer.*ban|banned.*peer|blocked.*peer)" pattern: "(peer.*ban|banned.*peer|blocked.*peer)"
severity: INFO severity: INFO
domain: media
description: qBittorrent peer banned (encryption enforcement or bad actor) description: qBittorrent peer banned (encryption enforcement or bad actor)
- name: qbit_download_complete - name: qbit_download_complete
pattern: "(download.*complet|torrent.*finish|has finished downloading)" pattern: "(download.*complet|torrent.*finish|has finished downloading)"
severity: INFO severity: INFO
domain: media
description: qBittorrent torrent download completed description: qBittorrent torrent download completed
- name: qbit_ratio_limit - name: qbit_ratio_limit
pattern: "(ratio.*reach|seeding.*limit|stop.*seeding|upload.*limit)" pattern: "(ratio.*reach|seeding.*limit|stop.*seeding|upload.*limit)"
severity: INFO severity: INFO
domain: media
description: qBittorrent seeding ratio or time limit reached description: qBittorrent seeding ratio or time limit reached
- name: qbit_session_error - name: qbit_session_error
pattern: "(session.*error|couldn't? resume|resume.*fail|torrent.*error)" pattern: "(session.*error|couldn't? resume|resume.*fail|torrent.*error)"
severity: ERROR severity: ERROR
domain: media
description: qBittorrent session or resume data error description: qBittorrent session or resume data error
- name: plex_eae_failure - name: plex_eae_failure
pattern: "(EAE timeout|EAE not running|eac3_eae.*error reading output|Error submitting packet to decoder.*I/O error)" pattern: "(EAE timeout|EAE not running|eac3_eae.*error reading output|Error submitting packet to decoder.*I/O error)"
severity: ERROR severity: ERROR
domain: media
description: Plex EasyAudioEncoder (EAC3 Dolby audio transcoder) crashed — service restart required description: Plex EasyAudioEncoder (EAC3 Dolby audio transcoder) crashed — service restart required
# - name: ext_device_device_error # - name: ext_device_error
# pattern: "ERR-\d{4}" # pattern: "ERR-\d{4}"
# severity: ERROR # severity: ERROR
# description: EXT_DEVICE device error code # description: vendor device structured error code
# ── VPN / tunnel patterns ──────────────────────────────────────────────────
- name: vpn_tunnel_fail
pattern: "(wg-quick@|wireguard|spirit-city-tunnel|cf-orch-tunnel|cf-tunnel|openvpn|vpn).*(failed|error|exit.code|timeout|connection reset)"
severity: ERROR
domain: networking
description: VPN or WireGuard tunnel service failed — remote node may be unreachable
- name: vpn_handshake
pattern: "(handshake|peer.*allowed|WireGuard|wg-quick).*(initiating|complete|timeout|fail|retrying)"
severity: WARN
domain: networking
description: WireGuard peer handshake event — track for timeout/retry patterns
- name: dns_degraded
pattern: "(degraded feature set|DNS.*fall.?back|resolver.*fail|NXDOMAIN|DNS.*timeout|SERVFAIL)"
severity: WARN
domain: networking
description: DNS resolver degradation or fallback — often precedes connectivity failures
# ── GPU / NVIDIA driver patterns ───────────────────────────────────────────
- name: nvidia_api_mismatch
pattern: "(NVRM: API mismatch|nvidia.*version mismatch|driver.*mismatch|kernel module.*mismatch)"
severity: ERROR
domain: gpu
description: NVIDIA kernel module version does not match userspace driver — GPU ops will fail until driver reinstalled
- name: nvidia_xid
pattern: "(NVRM: Xid|Xid.*(error|critical)|GPU.*Xid)"
severity: CRITICAL
domain: gpu
description: NVIDIA Xid error — GPU hardware fault or driver crash (check nvidia-smi error code)
- name: nvidia_gpu_reset
pattern: "(nvidia.*reset|GPU.*reset|NVRM.*reset|nvml.*error|NVLink.*fail)"
severity: ERROR
domain: gpu
description: NVIDIA GPU reset or NVLink fault — possible hardware instability
# ── Power / thermal patterns ───────────────────────────────────────────────
- name: acpi_error
pattern: "(ACPI.*failed|ACPI.*error|ACPI.*_DSM|acpi.*_PPC|ACPI BIOS Error)"
severity: WARN
domain: kernel
description: ACPI firmware evaluation failure — often harmless but can indicate BIOS/power management issues
- name: thermal_throttle
pattern: "(CPU.*throttl|thermal throttl|Package temp|TjMax|temperature.*critical|No RAPL|RAPL.*not available)"
severity: WARN
domain: power
description: CPU/GPU thermal throttling or thermal management subsystem unavailable
- name: undervoltage
pattern: "(under.?voltage|brownout|voltage.*(low|critical)|power supply.*insufficient)"
severity: ERROR
domain: power
description: Undervoltage event — instability risk, check PSU and cable connections
# ── Audio / PipeWire / ALSA ──────────────────────────────────────────────────
- name: pipewire_overflow
pattern: "(OVERFLOW channel|stream.*OVERFLOW|protocol.pulse.*OVERFLOW)"
severity: WARN
domain: audio
description: PipeWire-Pulse stream buffer overflow — client not draining audio fast enough; usually indicates a quantum/period-size mismatch or CPU scheduling issue
- name: pipewire_underrun
pattern: "(pw\\.node.*underrun|spa\\.alsa.*underrun|alsa.*underrun|UNDERRUN)"
severity: WARN
domain: audio
description: PipeWire/ALSA buffer underrun (xrun) — audio thread missed its deadline; increase quantum or period-size for the affected device
- name: alsa_xrun
pattern: "(ALSA.*[Xx][Rr][Uu][Nn]|alsa.*xrun|snd_pcm.*xrun|pcm.*underrun|pcm.*overrun)"
severity: WARN
domain: audio
description: ALSA xrun (hardware buffer overrun/underrun) — increase api.alsa.period-size via WirePlumber rule or raise clock.min-quantum
- name: pipewire_quantum_mismatch
pattern: "(quantum.*mismatch|rate.*mismatch|sample.rate.*mismatch|resampl.*fail|can.*t adapt quantum)"
severity: WARN
domain: audio
description: PipeWire quantum or sample-rate mismatch between nodes — check for mixed 44100/48000 streams; may need per-device WirePlumber rules
- name: pipewire_node_error
pattern: "(pw\\.node.*error|node.*ERROR|pipewire.*failed to set|spa\\.alsa.*error|alsa_sink.*error|alsa_source.*error)"
severity: ERROR
domain: audio
description: PipeWire node error — device may be unavailable or misconfigured
- name: pipewire_jackdbus_missing
pattern: "(jackdbus.*reply|jackaudio.*service.*not.*provided|org\\.jackaudio\\.service)"
severity: INFO
domain: audio
description: PipeWire JACK D-Bus probe — JACK not running; benign on non-JACK systems, fires once per PipeWire restart
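
The header comment of this file says that patterns are applied in order, that several can match one entry, and that `domain` groups them for triage-level summaries. A rough Python sketch of that flow follows. It inlines four rules from the file rather than parsing YAML; the function names and the case-insensitive matching are assumptions for illustration, and the real loader may behave differently.

```python
import re
from collections import Counter

# Four rules copied from the pattern file, kept in file order.
PATTERNS = [
    {"name": "connection_lost",
     "pattern": r"(connection (lost|dropped|refused|timed? out)|disconnect(ed)?)",
     "severity": "ERROR", "domain": "networking"},
    {"name": "oom",
     "pattern": r"(out of memory|OOM|killed process|cannot allocate)",
     "severity": "CRITICAL", "domain": "memory"},
    {"name": "disk_full",
     "pattern": r"(no space left|disk full|filesystem.*full|ENOSPC)",
     "severity": "ERROR", "domain": "storage"},
    {"name": "timeout",
     "pattern": r"(timed? out|deadline exceeded|operation timed?)",
     "severity": "WARN", "domain": "networking"},
]

# Assumption: matching ignores case; this is not confirmed by the diff.
COMPILED = [(rule, re.compile(rule["pattern"], re.IGNORECASE)) for rule in PATTERNS]


def match_patterns(line):
    """Return every rule whose regex is found in the line, in file order."""
    return [rule for rule, regex in COMPILED if regex.search(line)]


def domain_summary(lines):
    """Count rule matches per domain across a batch of log lines."""
    counts = Counter()
    for line in lines:
        for rule in match_patterns(line):
            counts[rule["domain"]] += 1
    return dict(counts)
```

A single line can contribute more than one count to a domain, since both `connection_lost` and `timeout` fire on text such as "connection timed out".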


@@ -0,0 +1,55 @@
# Turnstone log sources — Heimdall cluster glean.
# Covers: Heimdall (local), Navi, Sif, Cass, Strahl (SSH-collected),
# Docker services on Heimdall, and network device syslog.
#
# Collected by scripts/collect_cluster_logs.sh before each glean run.
# All paths are container-side (/data/ = bind-mount of /devl/turnstone-cluster/data/).
#
# Cron (collect + glean, every 15 min):
# */15 * * * * bash /Library/Development/CircuitForge/turnstone/scripts/collect_cluster_logs.sh && \
# docker exec turnstone-cluster python scripts/glean_corpus.py \
# --sources /patterns/sources-cluster.yaml --db /data/turnstone.db \
# >> /var/log/turnstone-cluster-glean.log 2>&1
sources:
# ── Heimdall (local) ─────────────────────────────────────────────────────────
- id: heimdall-journal
path: /data/heimdall-journal.jsonl
- id: heimdall-dmesg
path: /data/heimdall-dmesg.txt
# ── Remote cluster nodes (SSH-collected journals) ────────────────────────────
- id: navi-journal
path: /data/navi-journal.jsonl
- id: sif-journal
path: /data/sif-journal.jsonl
- id: cass-journal
path: /data/cass-journal.jsonl
- id: strahl-journal
path: /data/strahl-journal.jsonl
# ── Docker services on Heimdall ──────────────────────────────────────────────
- id: docker-cf-orch-coordinator
path: /data/docker-cf-orch-coordinator.jsonl
- id: docker-cf-web
path: /data/docker-cf-web.jsonl
- id: docker-cf-directus
path: /data/docker-cf-directus.jsonl
- id: docker-caddy-proxy
path: /data/docker-caddy-proxy.jsonl
# ── Network syslog (router, switches, UniFi APs) ─────────────────────────────
# Written by syslog-receiver.service (UDP 5140 → /devl/turnstone-cluster/data/network-syslog.txt).
# Configure devices to send syslog to Heimdall:5140.
# UniFi: Settings → System → Remote Logging → Syslog Host = <YOUR_HOST_IP>:5140
# Ubiquiti EdgeRouter: set system syslog host <YOUR_HOST_IP> facility all level debug
# Managed switches: varies by vendor — target <YOUR_HOST_IP> UDP 5140
- id: network-syslog
path: /data/network-syslog.txt


@@ -0,0 +1,50 @@
# Turnstone log sources — example node (Docker/Podman, self-hosted media stack)
#
# Copy this file to your patterns directory and edit for your setup.
# Container paths: /opt and /var/log are bind-mounted read-only.
# journal-export.jsonl is written to /data/ by export_journal.sh (run via cron before glean).
#
# Add or remove sources freely. Missing paths are skipped with a warning.
sources:
# ── System ────────────────────────────────────────────────────────────────
# Requires: cron job to run export_journal.sh before each glean.
# Example cron (every 15 min — edit paths for your install):
# */15 * * * * /opt/turnstone/scripts/export_journal.sh \
# /opt/turnstone-data/
- id: system-journal
path: /data/journal-export.jsonl
- id: dmesg
path: /data/dmesg-export.txt
# ── Servarr stack ─────────────────────────────────────────────────────────
- id: sonarr
path: /opt/sonarr/config/logs/sonarr.0.txt
- id: radarr
path: /opt/radarr/config/logs/radarr.0.txt
- id: bazarr
path: /opt/bazarr/config/log/bazarr.log
- id: prowlarr
path: /opt/prowlarr/config/logs/prowlarr.0.txt
# ── Media server / tracking ────────────────────────────────────────────────
- id: tautulli
path: /opt/tautulli/config/logs/plex_websocket.log
# ── Download automation ────────────────────────────────────────────────────
- id: autoscan
path: /opt/autoscan/config/autoscan.log
# ── Web / proxy ────────────────────────────────────────────────────────────
- id: organizr-nginx
path: /opt/organizr/log/nginx/error.log
- id: organizr-app
path: /opt/organizr/www/organizr/server.log
- id: nextcloud-nginx
path: /opt/nextcloud/config/log/nginx/error.log
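
The header of this example file states that missing paths are skipped with a warning. A minimal Python sketch of that filtering step is shown below, assuming the sources have already been parsed into a list of dicts with `id` and `path` keys. The function name and logger name are hypothetical; the actual glean script is not shown in this diff.

```python
import logging
import os

log = logging.getLogger("sources_sketch")


def usable_sources(sources):
    """Return the sources whose path exists; log a warning for the rest."""
    usable = []
    for src in sources:
        path = src.get("path")
        if path and os.path.exists(path):
            usable.append(src)
        else:
            log.warning("skipping source %s: path %r not found", src.get("id"), path)
    return usable
```

Filtering up front means one absent service log (for example, a stack that does not run Bazarr) does not abort the whole glean run.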


@@ -1,8 +1,8 @@
# Turnstone log sources — edit this file to add or remove services. # Turnstone log sources — edit this file to add or remove services.
# NOTE: the system-journal entry requires export_journal.sh to run on the HOST # NOTE: the system-journal entry requires export_journal.sh to run on the HOST
# before the container ingest step. See crontab setup instructions in the README. # before the container glean step. See crontab setup instructions in the README.
# Run ingest manually: # Run glean manually:
# sudo podman exec turnstone python scripts/ingest_corpus.py \ # sudo podman exec turnstone python scripts/glean_corpus.py \
# --sources /patterns/sources.yaml --db /data/turnstone.db # --sources /patterns/sources.yaml --db /data/turnstone.db
# #
# Paths here are container-side paths under the /opt bind mount. # Paths here are container-side paths under the /opt bind mount.
@@ -12,7 +12,7 @@
sources: sources:
# ── System (exported by export_journal.sh on the host) ─────────────────── # ── System (exported by export_journal.sh on the host) ───────────────────
# journal-export.jsonl and dmesg-export.txt are written to /opt/turnstone/data/ # journal-export.jsonl and dmesg-export.txt are written to /opt/turnstone/data/
# by the export script before each ingest run. # by the export script before each glean run.
- id: system-journal - id: system-journal
path: /data/journal-export.jsonl path: /data/journal-export.jsonl
@@ -70,3 +70,27 @@ sources:
- id: jellyseerr - id: jellyseerr
path: /opt/jellyseerr/config/logs/jellyseerr.log path: /opt/jellyseerr/config/logs/jellyseerr.log
# ── MQTT / IoT (live — subscribe mode, no path needed) ───────────────────
# Requires: pip install circuitforge-core[mqtt]
# These sources are handled by the live MQTT subscriber task (not batch glean).
# Uncomment and configure to enable.
#
# Meshtastic MQTT bridge (node must have MQTT uplink enabled):
# - id: meshtastic-home
# type: mqtt
# broker_host: 10.1.10.5 # IP of your local MQTT broker (e.g. Mosquitto on Huginn)
# broker_port: 1883
# topics:
# - msh/# # all Meshtastic regions; use msh/us-east/# to narrow
#
# Generic IoT sensors:
# - id: iot-home
# type: mqtt
# broker_host: localhost
# broker_port: 1883
# topics:
# - home/+/temperature
# - home/+/humidity
# - home/+/motion
# severity: INFO
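
The commented MQTT sources above use the two standard topic wildcards: `+` matches exactly one topic level and `#` matches all remaining levels. A small Python sketch of that matching rule is given below for readers choosing topic filters. The function name is hypothetical, and the sketch ignores the special handling of topics beginning with `$`; a real subscriber would leave matching to the broker.

```python
def topic_matches(topic_filter, topic):
    """Return True if an MQTT topic matches a filter using + and # wildcards."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard: matches the parent and everything below it.
            return True
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    # No '#' seen, so both must have the same number of levels.
    return len(filter_levels) == len(topic_levels)
```

This shows why `home/+/temperature` catches one sensor per room but not nested topics, while `msh/#` takes everything under the Meshtastic root.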

patterns/telemetry.yaml Normal file

@@ -0,0 +1,46 @@
version: 1
rules:
- name: samsung_ads
domains:
- samsungads.com
- samsungcloudsolution.com
- samsungrm.net
- samsungacr.com
category: samsung
description: Samsung Smart TV advertising and telemetry
- name: belkin_wemo
domains:
- api.xbcs.net
- wemo.belkin.com
- statistics.belkin.com
category: belkin
description: Belkin/WeMo smart device telemetry
- name: roku_telemetry
domains:
- logs.roku.com
- scribe.logs.roku.com
category: roku
description: Roku device telemetry
- name: lg_telemetry
domains:
- us.lgappstv.com
- lgtvcommon.com
- lgtvsdp.com
category: lg
description: LG Smart TV telemetry
- name: amazon_iot
domains:
- device-metrics-us.amazon.com
category: amazon
description: Amazon device telemetry
- name: ad_networks
domains:
- doubleclick.net
- googleads.g.doubleclick.net
category: advertising
description: Common advertising networks served to IoT devices
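
One plausible way to use these rules is to classify a hostname seen in DNS or proxy logs by checking whether it equals a listed domain or is a subdomain of one. The Python sketch below illustrates that with a few rules inlined from the file. Suffix matching on label boundaries is an assumption, since the diff does not show how Turnstone consumes telemetry.yaml, and `classify_host` is a hypothetical name.

```python
# A subset of the rules from telemetry.yaml, inlined to avoid a YAML dependency.
RULES = {
    "samsung_ads": ["samsungads.com", "samsungcloudsolution.com"],
    "roku_telemetry": ["logs.roku.com", "scribe.logs.roku.com"],
    "ad_networks": ["doubleclick.net", "googleads.g.doubleclick.net"],
}


def classify_host(hostname):
    """Return the first rule name whose domain list covers the hostname, else None."""
    host = hostname.lower().rstrip(".")
    for rule_name, domains in RULES.items():
        for domain in domains:
            # Match the domain itself or any subdomain, but not lookalikes
            # such as "notsamsungads.com".
            if host == domain or host.endswith("." + domain):
                return rule_name
    return None
```

Matching on a leading dot keeps unrelated registrations that merely end in the same characters from being flagged.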


@@ -2,7 +2,7 @@
# podman-standalone.sh — Turnstone rootful Podman setup (no Compose) # podman-standalone.sh — Turnstone rootful Podman setup (no Compose)
# #
# For hosts running system Podman (non-rootless) with systemd. # For hosts running system Podman (non-rootless) with systemd.
# Turnstone is a diagnostic log intelligence layer — ingest service logs, # Turnstone is a diagnostic log intelligence layer — glean service logs,
# search by symptom, and view incidents in a lightweight web UI. # search by symptom, and view incidents in a lightweight web UI.
# #
# ── Prerequisites ──────────────────────────────────────────────────────────── # ── Prerequisites ────────────────────────────────────────────────────────────
@@ -28,25 +28,25 @@
# sudo systemctl daemon-reload # sudo systemctl daemon-reload
# sudo systemctl enable --now turnstone # sudo systemctl enable --now turnstone
# #
# ── Ingesting logs ──────────────────────────────────────────────────────────── # ── Gleaning logs ─────────────────────────────────────────────────────────────
# All service logs under /opt are accessible inside the container. # All service logs under /opt are accessible inside the container.
# Sources are configured in patterns/sources.yaml (bind-mounted at /patterns/). # Sources are configured in patterns/sources.yaml (bind-mounted at /patterns/).
# #
# To ingest all sources (run manually or via cron): # To glean all sources (run manually or via cron):
# #
# sudo podman exec turnstone python scripts/ingest_corpus.py \ # sudo podman exec turnstone python scripts/glean_corpus.py \
# --sources /patterns/sources.yaml --db /data/turnstone.db # --sources /patterns/sources.yaml --db /data/turnstone.db
# #
# Example cron (every 15 minutes, add to root's crontab with: sudo crontab -e): # Example cron (every 15 minutes, add to root's crontab with: sudo crontab -e):
# */15 * * * * podman exec turnstone python scripts/ingest_corpus.py \ # */15 * * * * podman exec turnstone python scripts/glean_corpus.py \
# --sources /patterns/sources.yaml --db /data/turnstone.db >> /var/log/turnstone-ingest.log 2>&1 # --sources /patterns/sources.yaml --db /data/turnstone.db >> /var/log/turnstone-glean.log 2>&1
# #
# To add a new log source: edit /opt/turnstone/patterns/sources.yaml — no restart needed. # To add a new log source: edit /opt/turnstone/patterns/sources.yaml — no restart needed.
# #
# ── Adding Caddy reverse proxy ──────────────────────────────────────────────── # ── Adding Caddy reverse proxy ────────────────────────────────────────────────
# Add to /etc/caddy/Caddyfile: # Add to /etc/caddy/Caddyfile:
# #
# turnstone.example-node.tv { # turnstone.your-domain.example {
# import protected # import protected
# reverse_proxy 10.0.0.10:8534 # reverse_proxy 10.0.0.10:8534
# import cloudflare # import cloudflare
@@ -59,10 +59,14 @@
# #
set -euo pipefail set -euo pipefail
REPO_DIR=/opt/turnstone # Auto-detect repo from script location — works whether cloned to /opt/turnstone
DATA_DIR=/opt/turnstone/data # or to /Library/Development/CircuitForge/turnstone or any other path.
PATTERNS_DIR=/opt/turnstone/patterns REPO_DIR="${TURNSTONE_REPO_DIR:-$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)}"
TZ=America/Los_Angeles # Data and patterns live OUTSIDE the repo so they survive git pulls.
DATA_DIR="${TURNSTONE_DATA_DIR:-/opt/turnstone-data}"
PATTERNS_DIR="${TURNSTONE_PATTERNS_DIR:-${DATA_DIR}/patterns}"
HF_CACHE_DIR="${TURNSTONE_HF_CACHE:-${DATA_DIR}/hf-cache}"
TZ="${TZ:-America/Los_Angeles}"
# ── Bundle push configuration ──────────────────────────────────────────────── # ── Bundle push configuration ────────────────────────────────────────────────
# Set TURNSTONE_BUNDLE_ENDPOINT before running this script to enable the # Set TURNSTONE_BUNDLE_ENDPOINT before running this script to enable the
@ -71,12 +75,39 @@ TZ=America/Los_Angeles
# export TURNSTONE_BUNDLE_ENDPOINT=https://turnstone.circuitforge.tech/turnstone/api/bundles # export TURNSTONE_BUNDLE_ENDPOINT=https://turnstone.circuitforge.tech/turnstone/api/bundles
# bash /opt/turnstone/podman-standalone.sh # bash /opt/turnstone/podman-standalone.sh
# #
# ── Orchard submission (opt-in telemetry) ────────────────────────────────────
# Set TURNSTONE_SUBMIT_ENDPOINT to push pattern-matched log entries to a CF
# receiving instance after each glean run. Only matched entries are sent —
# no raw log content. Used to build Avocet training data.
#
# export TURNSTONE_SUBMIT_ENDPOINT=https://harvest.circuitforge.tech/your-node-id
# bash /opt/turnstone/podman-standalone.sh
#
# TURNSTONE_SOURCE_HOST is auto-detected from `hostname` — override if needed. # TURNSTONE_SOURCE_HOST is auto-detected from `hostname` — override if needed.
#
# ── Multi-agent diagnose pipeline ────────────────────────────────────────────
# The 5-stage ML pipeline requires the env vars below and a writable HF cache dir:
#
# TURNSTONE_MULTI_AGENT_DIAGNOSE=true — enable the pipeline
# GPU_SERVER_URL=http://<orch-host>:7700 — cf-orch coordinator or Ollama base URL
#
# ML models are downloaded on first diagnose run and cached in HF_CACHE_DIR.
# On a CPU-only host (no GPU) set TURNSTONE_EMBED_DEVICE=cpu (default).
#
# If your host has no WireGuard link to Heimdall, use the public cf-orch endpoint:
# export GPU_SERVER_URL=https://orch.circuitforge.tech
# export TURNSTONE_MULTI_AGENT_DIAGNOSE=true
# sudo bash /opt/turnstone/podman-standalone.sh
#
# For WireGuard-connected Docker hosts, WireGuard reaches the Heimdall LAN directly;
# use docker-standalone.sh instead of this script:
# export GPU_SERVER_URL=http://<YOUR_HOST_IP>:7700
# export TURNSTONE_MULTI_AGENT_DIAGNOSE=true
# bash ~/turnstone/docker-standalone.sh
# ── Turnstone container ─────────────────────────────────────────────────────── # ── Turnstone container ───────────────────────────────────────────────────────
# Image is built locally — no registry auto-update label. # Image is built locally — no registry auto-update label.
# To update: sudo podman build -t localhost/turnstone:latest /opt/turnstone # Run this script after every `git pull` to rebuild and redeploy.
# sudo podman restart turnstone
# #
# /opt is mounted read-only so all service logs under /opt/*/config/logs/ are # /opt is mounted read-only so all service logs under /opt/*/config/logs/ are
# accessible without per-service mounts. Add new sources to patterns/sources.yaml # accessible without per-service mounts. Add new sources to patterns/sources.yaml
@ -84,6 +115,27 @@ TZ=America/Los_Angeles
# #
# Must be run as root (sudo bash podman-standalone.sh) — rootful Podman only. # Must be run as root (sudo bash podman-standalone.sh) — rootful Podman only.
# #
# Bootstrap data and patterns dirs if this is a first run
mkdir -p "${DATA_DIR}" "${PATTERNS_DIR}" "${HF_CACHE_DIR}"
# Copy default patterns if the dir is empty (first run only)
if [ -z "$(ls -A "${PATTERNS_DIR}")" ]; then
cp "${REPO_DIR}/patterns/default.yaml" "${PATTERNS_DIR}/"
# Copy host-specific sources if present, otherwise copy the generic template
HOST_SOURCES="${REPO_DIR}/patterns/sources-$(hostname).yaml"
if [ -f "${HOST_SOURCES}" ]; then
cp "${HOST_SOURCES}" "${PATTERNS_DIR}/sources.yaml"
echo "==> Installed host-specific sources: ${HOST_SOURCES}"
else
cp "${REPO_DIR}/patterns/sources.yaml" "${PATTERNS_DIR}/"
echo "==> Installed default sources.yaml — edit ${PATTERNS_DIR}/sources.yaml for this host"
fi
fi
# Build image from current source (bakes app/ code into the image)
echo "Building Turnstone image..."
podman build -t localhost/turnstone:latest "${REPO_DIR}"
# Remove existing container if present (safe re-run) # Remove existing container if present (safe re-run)
podman rm -f turnstone 2>/dev/null || true podman rm -f turnstone 2>/dev/null || true
@ -93,13 +145,25 @@ podman run -d \
--net=host \ --net=host \
-v "${DATA_DIR}:/data:Z" \ -v "${DATA_DIR}:/data:Z" \
-v "${PATTERNS_DIR}:/patterns:Z" \ -v "${PATTERNS_DIR}:/patterns:Z" \
-v "${HF_CACHE_DIR}:/hf-cache:Z" \
-v /opt:/opt:ro \ -v /opt:/opt:ro \
-v /var/log:/var/log:ro \ -v /var/log:/var/log:ro \
-e TURNSTONE_DB=/data/turnstone.db \ -e TURNSTONE_DB=/data/turnstone.db \
-e TURNSTONE_SOURCE_HOST="$(hostname)" \ -e TURNSTONE_SOURCE_HOST="$(hostname)" \
-e TURNSTONE_BUNDLE_ENDPOINT="${TURNSTONE_BUNDLE_ENDPOINT:-}" \ -e TURNSTONE_BUNDLE_ENDPOINT="${TURNSTONE_BUNDLE_ENDPOINT:-}" \
-e TURNSTONE_SUBMIT_ENDPOINT="${TURNSTONE_SUBMIT_ENDPOINT:-}" \
-e PYTHONUNBUFFERED=1 \ -e PYTHONUNBUFFERED=1 \
-e TZ="${TZ}" \ -e TZ="${TZ}" \
-e TURNSTONE_MULTI_AGENT_DIAGNOSE="${TURNSTONE_MULTI_AGENT_DIAGNOSE:-false}" \
-e GPU_SERVER_URL="${GPU_SERVER_URL:-}" \
-e HF_HOME=/hf-cache \
-e TURNSTONE_AUTO_INCIDENT="${TURNSTONE_AUTO_INCIDENT:-true}" \
-e TURNSTONE_AUTO_INCIDENT_THRESHOLD="${TURNSTONE_AUTO_INCIDENT_THRESHOLD:-5}" \
-e TURNSTONE_AUTO_INCIDENT_WINDOW="${TURNSTONE_AUTO_INCIDENT_WINDOW:-600}" \
-e TURNSTONE_CLASSIFIER_MODEL="${TURNSTONE_CLASSIFIER_MODEL:-byviz/bylastic_classification_logs}" \
-e TURNSTONE_EMBED_BACKEND="${TURNSTONE_EMBED_BACKEND:-sentence_transformers}" \
-e TURNSTONE_EMBED_MODEL="${TURNSTONE_EMBED_MODEL:-sentence-transformers/all-MiniLM-L6-v2}" \
-e TURNSTONE_EMBED_DEVICE="${TURNSTONE_EMBED_DEVICE:-cpu}" \
--health-cmd="curl -f http://localhost:8534/turnstone/health || exit 1" \ --health-cmd="curl -f http://localhost:8534/turnstone/health || exit 1" \
--health-interval=30s \ --health-interval=30s \
--health-timeout=10s \ --health-timeout=10s \
@ -111,18 +175,26 @@ echo ""
echo "Turnstone is starting up." echo "Turnstone is starting up."
echo " UI: http://localhost:8534/turnstone/" echo " UI: http://localhost:8534/turnstone/"
echo "" echo ""
# Regenerate systemd unit so it references the freshly-built image.
# The --new flag means systemd re-creates the container on each start
# rather than binding to a specific container ID.
if [ -d /etc/systemd/system ]; then
echo "Regenerating systemd unit..."
podman generate systemd --new --name turnstone \
| tee /etc/systemd/system/turnstone.service > /dev/null
systemctl daemon-reload
systemctl enable turnstone.service 2>/dev/null || true
echo " systemd unit updated — run: sudo systemctl restart turnstone.service"
echo ""
fi
echo "Check container health with:" echo "Check container health with:"
echo " sudo podman ps" echo " sudo podman ps"
echo " sudo podman logs turnstone" echo " sudo podman logs turnstone"
echo "" echo ""
echo "To register as a systemd service:" echo "To glean all sources now:"
echo " sudo podman generate systemd --new --name turnstone \\" echo " sudo podman exec turnstone python scripts/glean_corpus.py \\"
echo " | sudo tee /etc/systemd/system/turnstone.service"
echo " sudo systemctl daemon-reload"
echo " sudo systemctl enable --now turnstone"
echo ""
echo "To ingest all sources now:"
echo " sudo podman exec turnstone python scripts/ingest_corpus.py \\"
echo " --sources /patterns/sources.yaml --db /data/turnstone.db" echo " --sources /patterns/sources.yaml --db /data/turnstone.db"
echo "" echo ""
echo "To add a new source: edit /opt/turnstone/patterns/sources.yaml — no restart needed." echo "To add a new source: edit /opt/turnstone/patterns/sources.yaml — no restart needed."
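The `${VAR:-default}` chain at the top of podman-standalone.sh (repo, data, patterns, and HF cache directories) can be mirrored in Python if a helper script ever needs to locate the same directories. This is a minimal sketch under that assumption; `resolve_dirs` is a hypothetical helper and is not part of Turnstone. The defaults are copied from the script above.

```python
import os
from typing import Mapping


def resolve_dirs(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Resolve Turnstone directories the way the shell `${VAR:-default}` chain does.

    Hypothetical helper for illustration. Like `:-` in bash, an empty
    value is treated the same as an unset one.
    """
    data_dir = env.get("TURNSTONE_DATA_DIR") or "/opt/turnstone-data"
    # Patterns and HF cache default to subdirectories of the data dir,
    # so overriding TURNSTONE_DATA_DIR alone moves all three.
    patterns_dir = env.get("TURNSTONE_PATTERNS_DIR") or f"{data_dir}/patterns"
    hf_cache_dir = env.get("TURNSTONE_HF_CACHE") or f"{data_dir}/hf-cache"
    return {"data": data_dir, "patterns": patterns_dir, "hf_cache": hf_cache_dir}
```

Calling `resolve_dirs({"TURNSTONE_DATA_DIR": "/srv/ts"})` moves the patterns and cache directories along with the data directory, which is the same effect the derived shell defaults have.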


@ -1,7 +1,20 @@
fastapi>=0.110.0 fastapi>=0.110.0
uvicorn[standard]>=0.27.0 uvicorn[standard]>=0.27.0
# Postgres backend — optional; SQLite is used when DATABASE_URL is unset
psycopg[binary,pool]>=3.1.0
pydantic>=2.0.0 pydantic>=2.0.0
pyyaml>=6.0 pyyaml>=6.0
aiofiles>=23.0.0 aiofiles>=23.0.0
python-multipart>=0.0.9 python-multipart>=0.0.9
dateparser>=1.2.0 dateparser>=1.2.0
httpx>=0.27.0
paramiko
# Multi-agent diagnose pipeline — ML deps
# classifier.py and suppressor.py have ImportError guards and fall back gracefully,
# but these are included unconditionally so container images are fully capable.
# Install CPU-only torch to avoid pulling the ~2GB CUDA wheel into the image.
--extra-index-url https://download.pytorch.org/whl/cpu
torch>=2.2.0
transformers>=4.40.0
sentence-transformers>=3.0.0
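The requirements comment above mentions that classifier.py and suppressor.py guard their ML imports and fall back gracefully. Those files are not shown in this diff, so the following is only a sketch of what such a guard commonly looks like; `load_optional` and `pick_backend` are hypothetical names, and the `"fallback"` label is an assumption.

```python
import importlib
from types import ModuleType
from typing import Optional


def load_optional(name: str) -> Optional[ModuleType]:
    """Import `name` if it is installed; return None instead of raising."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def pick_backend(module_name: str = "sentence_transformers") -> str:
    """Return the backend name if its module imports, else a fallback label."""
    # Assumption for illustration: a non-ML code path exists under "fallback".
    return module_name if load_optional(module_name) is not None else "fallback"
```

With the dependencies installed unconditionally, as this requirements file does, the fallback branch would only be taken in environments built from a trimmed dependency set.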


@ -1,4 +1,4 @@
"""CLI: build (or update) the FTS5 full-text search index after ingest.""" """CLI: build (or update) the FTS5 full-text search index after glean."""
from __future__ import annotations from __future__ import annotations
import sys import sys
@ -13,7 +13,7 @@ if __name__ == "__main__":
if not db_path.exists(): if not db_path.exists():
print(f"ERROR: database not found: {db_path}", file=sys.stderr) print(f"ERROR: database not found: {db_path}", file=sys.stderr)
print("Run ingest first: python scripts/ingest_corpus.py", file=sys.stderr) print("Run glean first: python scripts/glean_corpus.py", file=sys.stderr)
sys.exit(1) sys.exit(1)
print(f"Building FTS index for {db_path} ...") print(f"Building FTS index for {db_path} ...")


@ -0,0 +1,167 @@
#!/usr/bin/env bash
# Collect logs from all CircuitForge cluster nodes into Turnstone.
#
# Local Heimdall sources (journal, live-watched Docker containers, network syslog)
# are handled by the Turnstone live watcher — no collection needed for those.
#
# This script handles:
# - Remote node SSH journals (navi, sif, cass, strahl, muninn)
# - Docker container logs from all nodes (auto-discovered)
# - Plex logs from Cass (native install, no Docker)
#
# Triggered by: turnstone-cluster-collect.timer (every 15 min)
# Manual run: bash /Library/Development/CircuitForge/turnstone/scripts/collect_cluster_logs.sh
set -euo pipefail
DATA_DIR=/devl/turnstone-cluster/data
WINDOW="20 minutes ago"
SSH_OPTS="-o ConnectTimeout=5 -o BatchMode=yes -o StrictHostKeyChecking=no"
PYTHON=/devl/miniconda3/envs/cf/bin/python
INGEST="${PYTHON} /Library/Development/CircuitForge/turnstone/scripts/glean_corpus.py"
DB=/devl/turnstone-cluster/data/turnstone.db
LOG=/devl/turnstone-cluster/data/glean.log
mkdir -p "${DATA_DIR}"
# ── Helpers ───────────────────────────────────────────────────────────────────
# Collect docker logs from a container into a JSONL file.
# Usage: _docker_logs <container> <outfile> [docker_cmd_prefix...]
_docker_logs() {
    local cname="$1" outfile="$2"
    shift 2
    "$@" docker logs --since 20m "${cname}" 2>&1 | \
        python3 -c "
import sys, json
src = '${cname}'
for line in sys.stdin:
    line = line.rstrip()
    if not line: continue
    print(json.dumps({'MESSAGE': line, 'SYSLOG_IDENTIFIER': src, '_TRANSPORT': 'docker', 'PRIORITY': '6'}))
" > "${outfile}" 2>/dev/null || : > "${outfile}"
}
# ── Remote cluster node journals ──────────────────────────────────────────────
declare -A NODES=(
[navi]="${DATA_DIR}/navi-journal.jsonl"
[sif]="${DATA_DIR}/sif-journal.jsonl"
[cass]="${DATA_DIR}/cass-journal.jsonl"
[strahl]="${DATA_DIR}/strahl-journal.jsonl"
[muninn]="${DATA_DIR}/muninn-journal.jsonl"
)
for node in "${!NODES[@]}"; do
outfile="${NODES[$node]}"
if ssh ${SSH_OPTS} "${node}" true 2>/dev/null; then
ssh ${SSH_OPTS} "${node}" \
"journalctl --output=json --priority=0..5 --since '${WINDOW}' --no-pager 2>/dev/null || true" \
> "${outfile}" 2>/dev/null || { echo "${node}: ssh failed, skipping"; : > "${outfile}"; }
echo "${node}: $(wc -l < "${outfile}") journal entries"
else
echo "${node}: unreachable, skipping"
: > "${outfile}"
fi
done
# ── Heimdall Docker containers (non-live-watched) ─────────────────────────────
# The live watcher already tails: cf-orch-coordinator, cf-web, cf-directus, caddy-proxy
LIVE_WATCHED="cf-orch-coordinator cf-web cf-directus caddy-proxy"
HEIMDALL_DIR="${DATA_DIR}/docker-heimdall"
mkdir -p "${HEIMDALL_DIR}"
while IFS= read -r cname; do
[[ " ${LIVE_WATCHED} " == *" ${cname} "* ]] && continue
_docker_logs "${cname}" "${HEIMDALL_DIR}/${cname}.jsonl"
done < <(docker ps --format '{{.Names}}')
echo "heimdall docker: $(ls "${HEIMDALL_DIR}"/*.jsonl 2>/dev/null | wc -l) containers"
# ── Navi Docker containers ────────────────────────────────────────────────────
NAVI_DIR="${DATA_DIR}/docker-navi"
mkdir -p "${NAVI_DIR}"
if ssh ${SSH_OPTS} navi true 2>/dev/null; then
    while IFS= read -r cname; do
        [[ -z "${cname}" ]] && continue
        ssh ${SSH_OPTS} navi "docker logs --since 20m '${cname}' 2>&1" | \
            python3 -c "
import sys, json
src = 'navi/${cname}'
for line in sys.stdin:
    line = line.rstrip()
    if not line: continue
    print(json.dumps({'MESSAGE': line, 'SYSLOG_IDENTIFIER': src, '_TRANSPORT': 'docker', 'PRIORITY': '6'}))
" > "${NAVI_DIR}/${cname}.jsonl" 2>/dev/null || : > "${NAVI_DIR}/${cname}.jsonl"
    done < <(ssh ${SSH_OPTS} navi "docker ps --format '{{.Names}}'" 2>/dev/null)
    echo "navi docker: $(ls "${NAVI_DIR}"/*.jsonl 2>/dev/null | wc -l) containers"
else
    echo "navi: unreachable, skipping docker logs"
fi
# ── Navi qBittorrent app logs (volume-mounted files, not in docker logs) ──────
# qBit writes rich per-torrent events to a file inside the compose volume.
# These are NOT captured by `docker logs` — must be pulled directly.
QBIT_LOG_BASE="/opt/containers/arr"
for instance in qbit-tb0 qbit-tb1 qbit-tb2; do
remote_log="${QBIT_LOG_BASE}/${instance}/qBittorrent/logs/qbittorrent.log"
local_out="${NAVI_DIR}/${instance}-app.log"
if ssh ${SSH_OPTS} navi "test -f '${remote_log}'" 2>/dev/null; then
ssh ${SSH_OPTS} navi "cat '${remote_log}'" > "${local_out}" 2>/dev/null || : > "${local_out}"
else
: > "${local_out}"
fi
done
echo "navi qbit app logs: $(cat "${NAVI_DIR}"/qbit-tb*.log 2>/dev/null | wc -l) lines"
# ── Strahl Docker containers ──────────────────────────────────────────────────
STRAHL_DIR="${DATA_DIR}/docker-strahl"
mkdir -p "${STRAHL_DIR}"
if ssh ${SSH_OPTS} strahl true 2>/dev/null; then
    while IFS= read -r cname; do
        [[ -z "${cname}" ]] && continue
        ssh ${SSH_OPTS} strahl "docker logs --since 20m '${cname}' 2>&1" | \
            python3 -c "
import sys, json
src = 'strahl/${cname}'
for line in sys.stdin:
    line = line.rstrip()
    if not line: continue
    print(json.dumps({'MESSAGE': line, 'SYSLOG_IDENTIFIER': src, '_TRANSPORT': 'docker', 'PRIORITY': '6'}))
" > "${STRAHL_DIR}/${cname}.jsonl" 2>/dev/null || : > "${STRAHL_DIR}/${cname}.jsonl"
    done < <(ssh ${SSH_OPTS} strahl "docker ps --format '{{.Names}}'" 2>/dev/null)
    echo "strahl docker: $(ls "${STRAHL_DIR}"/*.jsonl 2>/dev/null | wc -l) containers"
else
    echo "strahl: unreachable, skipping docker logs"
fi
# ── Cass — Plex logs (native install, no Docker) ─────────────────────────────
PLEX_DIR="${DATA_DIR}/plex-cass"
PLEX_LOG_DIR="/var/lib/plexmediaserver/Library/Application Support/Plex Media Server/Logs"
mkdir -p "${PLEX_DIR}"
if ssh ${SSH_OPTS} cass true 2>/dev/null; then
while IFS= read -r remote_path; do
[[ -z "${remote_path}" ]] && continue
local_name="$(basename "${remote_path}" | tr ' ' '_' | tr '[:upper:]' '[:lower:]')"
ssh ${SSH_OPTS} cass "cat '${remote_path}'" > "${PLEX_DIR}/${local_name}" 2>/dev/null || true
done < <(ssh ${SSH_OPTS} cass "ls '${PLEX_LOG_DIR}'/*.log 2>/dev/null" 2>/dev/null)
echo "cass plex: $(ls "${PLEX_DIR}"/*.log 2>/dev/null | wc -l) log files"
else
echo "cass: unreachable, skipping plex logs"
fi
# ── Ingest everything ─────────────────────────────────────────────────────────
{
# Remote journals (explicit source IDs via YAML)
${INGEST} --sources /devl/turnstone-cluster/patterns/sources-cluster.yaml --db "${DB}"
# Docker and Plex logs (source IDs derived from filenames by directory glean)
for dir in "${HEIMDALL_DIR}" "${NAVI_DIR}" "${STRAHL_DIR}" "${PLEX_DIR}"; do
[[ -d "${dir}" ]] && ls "${dir}"/*.jsonl "${dir}"/*.log 2>/dev/null | grep -q . && \
${INGEST} "${dir}" "${DB}" || true
done
} >> "${LOG}" 2>&1
echo "collect_cluster_logs: done"
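The inline `python3 -c` snippets in collect_cluster_logs.sh wrap each raw `docker logs` line in a journald-style JSON record. The same transformation can be sketched as a standalone Python function; `wrap_lines` is a hypothetical name, and the record fields are copied from the script above. Passing the source name as an argument, rather than interpolating `${cname}` into Python source text, also avoids quoting problems if a container name ever contains a quote character.

```python
import json
from typing import Iterable, Iterator


def wrap_lines(lines: Iterable[str], source: str) -> Iterator[str]:
    """Yield one journald-style JSON record per non-empty input line."""
    for line in lines:
        line = line.rstrip()
        if not line:
            continue  # skip blank lines, as the shell version does
        yield json.dumps({
            "MESSAGE": line,
            "SYSLOG_IDENTIFIER": source,
            "_TRANSPORT": "docker",
            "PRIORITY": "6",  # syslog "info", the fixed priority used by the script
        })
```

A caller would typically write the yielded strings one per line to a `.jsonl` file, for example by iterating over `wrap_lines(sys.stdin, "navi/caddy")`.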

107
scripts/docker-cluster.sh Normal file

@ -0,0 +1,107 @@
#!/usr/bin/env bash
# docker-cluster.sh — Turnstone cluster monitoring instance on Heimdall.
#
# Local sources (Heimdall journal, Docker containers, network syslog) are
# tailed live by the built-in watcher (watch.yaml) — no periodic collection needed.
#
# Remote node journals (navi, sif, cass, strahl) are collected by a
# systemd timer every 15 minutes and ingested via glean_corpus.py.
# Install the timer:
# sudo cp scripts/turnstone-cluster-collect.{service,timer} /etc/systemd/system/
# sudo systemctl daemon-reload && sudo systemctl enable --now turnstone-cluster-collect.timer
#
# ── Prerequisites ────────────────────────────────────────────────────────────
# SSH key access to navi, sif, cass, strahl (test: ssh <node> hostname)
#
# ── Run ───────────────────────────────────────────────────────────────────────
# bash /Library/Development/CircuitForge/turnstone/scripts/docker-cluster.sh
#
# ── Caddy reverse proxy (add to /devl/caddy-proxy/Caddyfile) ─────────────────
# turnstone.heimdall.lan {
# reverse_proxy 127.0.0.1:8534
# }
# Then: docker restart caddy-proxy
#
# ── Ports ────────────────────────────────────────────────────────────────────
# Turnstone UI → http://heimdall:8534/turnstone/
#
set -euo pipefail
REPO_DIR=/Library/Development/CircuitForge/turnstone
DATA_DIR=/devl/turnstone-cluster/data
PATTERNS_DIR=/devl/turnstone-cluster/patterns
PORT=8534
TZ=America/Los_Angeles
# LLM: route to local cf-orch coordinator (same host, host network).
LLM_URL="${TURNSTONE_LLM_URL:-http://127.0.0.1:7700}"
LLM_MODEL="${TURNSTONE_LLM_MODEL:-llama3.1:8b}"
LLM_API_KEY="${TURNSTONE_LLM_API_KEY:-}"
mkdir -p "${DATA_DIR}" "${PATTERNS_DIR}"
# Keep default.yaml in cluster patterns dir up to date with the repo copy.
cp "${REPO_DIR}/patterns/default.yaml" "${PATTERNS_DIR}/default.yaml"
# ── Seed LLM preferences (only if not already configured) ────────────────────
PREFS_FILE="${DATA_DIR}/preferences.json"
if [ ! -f "${PREFS_FILE}" ]; then
python3 -c "
import json
prefs = {
'llm_url': '${LLM_URL}',
'llm_model': '${LLM_MODEL}',
'llm_api_key': '${LLM_API_KEY}',
}
print(json.dumps(prefs))
" > "${PREFS_FILE}"
echo "Seeded ${PREFS_FILE} (llm_url=${LLM_URL}, model=${LLM_MODEL})"
else
echo "Preferences already exist at ${PREFS_FILE} — skipping seed"
fi
# Touch network-syslog.txt so the file watcher has something to tail
# before the syslog receiver writes to it.
touch "${DATA_DIR}/network-syslog.txt"
# ── Build image ───────────────────────────────────────────────────────────────
echo "Building Turnstone image..."
docker build -t circuitforge/turnstone:latest "${REPO_DIR}"
# ── Deploy container ──────────────────────────────────────────────────────────
docker rm -f turnstone-cluster 2>/dev/null || true
docker run -d \
--name=turnstone-cluster \
--restart=unless-stopped \
--net=host \
-v "${DATA_DIR}:/data" \
-v "${PATTERNS_DIR}:/patterns:ro" \
-v /var/run/docker.sock:/var/run/docker.sock \
-v /run/systemd/journal:/run/systemd/journal:ro \
-e TURNSTONE_DB=/data/turnstone.db \
-e TURNSTONE_PATTERNS=/patterns \
-e TURNSTONE_SOURCE_HOST="heimdall-cluster" \
-e TURNSTONE_BUNDLE_ENDPOINT="${TURNSTONE_BUNDLE_ENDPOINT:-}" \
-e PYTHONUNBUFFERED=1 \
-e TZ="${TZ}" \
--health-cmd="curl -f http://localhost:${PORT}/turnstone/health || exit 1" \
--health-interval=30s \
--health-timeout=10s \
--health-start-period=20s \
--health-retries=3 \
circuitforge/turnstone:latest
echo ""
echo "Turnstone cluster is starting up."
echo " UI: http://heimdall:${PORT}/turnstone/"
echo " Live watching: Heimdall journal + Docker containers + network syslog"
echo " Remote nodes: install the systemd timer for periodic SSH collection"
echo ""
echo " sudo cp ${REPO_DIR}/scripts/turnstone-cluster-collect.{service,timer} /etc/systemd/system/"
echo " sudo systemctl daemon-reload && sudo systemctl enable --now turnstone-cluster-collect.timer"
echo ""
echo "Check container:"
echo " docker ps --filter name=turnstone-cluster"
echo " docker logs turnstone-cluster"
echo " curl http://localhost:${PORT}/turnstone/api/watch/status"
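The preferences seeding step in docker-cluster.sh writes `preferences.json` only when it does not already exist. A rough Python equivalent is sketched below; `seed_preferences` and `demo` are hypothetical names, and the defaults are the ones the script uses. Reading the values from the environment inside Python, instead of interpolating shell variables into a Python string literal, keeps the JSON valid even if a value such as an API key contains a quote.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Mapping


def seed_preferences(prefs_file: Path, env: Mapping[str, str] = os.environ) -> bool:
    """Write LLM preferences once; return True if the file was created."""
    if prefs_file.exists():
        return False  # never overwrite settings a user may have edited
    prefs = {
        "llm_url": env.get("TURNSTONE_LLM_URL") or "http://127.0.0.1:7700",
        "llm_model": env.get("TURNSTONE_LLM_MODEL") or "llama3.1:8b",
        "llm_api_key": env.get("TURNSTONE_LLM_API_KEY", ""),
    }
    prefs_file.write_text(json.dumps(prefs))
    return True


def demo() -> tuple[bool, bool, str]:
    """Seed twice in a temp dir; only the first call should write the file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "preferences.json"
        first = seed_preferences(path, {"TURNSTONE_LLM_API_KEY": "it's-a-key"})
        second = seed_preferences(path, {})
        return first, second, json.loads(path.read_text())["llm_api_key"]
```

In `demo`, the second call leaves the file untouched, and the key containing an apostrophe survives the round trip through JSON.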

194
scripts/export_corpus.py Normal file

@ -0,0 +1,194 @@
"""Export ERROR/CRITICAL log entries and labeled incidents to Avocet corpus endpoint.
Run periodically alongside glean_corpus.py (same cron schedule recommended).
Watermarks are stored as plain text files next to the DB:
corpus_watermark.txt last exported log_entries rowid
incident_watermark.txt last exported incident created_at timestamp
Required env vars:
AVOCET_CORPUS_ENDPOINT URL to POST batches to
AVOCET_CONSENT_TOKEN Per-node consent token (issued by CF)
Optional env vars:
TURNSTONE_DB Path to turnstone.db (default: /data/turnstone.db)
TURNSTONE_SOURCE_HOST Node identifier (default: system hostname)
Exit codes:
0 success (including no-op when endpoint not configured)
1 configuration error or failed POST
"""
from __future__ import annotations
import json
import logging
import os
import socket
import sqlite3
import sys
import uuid
from datetime import datetime, timezone
from pathlib import Path
import httpx
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger(__name__)
sys.path.insert(0, str(Path(__file__).parent.parent))
BATCH_LIMIT = 500
BATCH_VERSION = 1
DB_PATH = Path(os.environ.get("TURNSTONE_DB", "/data/turnstone.db"))
CORPUS_ENDPOINT = os.environ.get("AVOCET_CORPUS_ENDPOINT", "")
CONSENT_TOKEN = os.environ.get("AVOCET_CONSENT_TOKEN", "")
SOURCE_HOST = os.environ.get("TURNSTONE_SOURCE_HOST", socket.gethostname())

def _watermark_path(db_path: Path, name: str) -> Path:
    return db_path.parent / name


def read_rowid_watermark(db_path: Path) -> int:
    path = _watermark_path(db_path, "corpus_watermark.txt")
    if path.exists():
        try:
            return int(path.read_text().strip())
        except (ValueError, OSError):
            pass
    return 0


def write_rowid_watermark(db_path: Path, rowid: int) -> None:
    _watermark_path(db_path, "corpus_watermark.txt").write_text(str(rowid))


def read_ts_watermark(db_path: Path) -> str:
    path = _watermark_path(db_path, "incident_watermark.txt")
    if path.exists():
        return path.read_text().strip() or "1970-01-01T00:00:00"
    return "1970-01-01T00:00:00"


def write_ts_watermark(db_path: Path, ts: str) -> None:
    _watermark_path(db_path, "incident_watermark.txt").write_text(ts)

def post_batch(endpoint: str, token: str, payload: dict) -> None:
    resp = httpx.post(
        endpoint,
        json=payload,
        headers={"Authorization": f"Bearer {token}"},
        timeout=30.0,
    )
    resp.raise_for_status()

def export_raw_entries(conn: sqlite3.Connection, db_path: Path, endpoint: str, token: str) -> int:
    last_rowid = read_rowid_watermark(db_path)
    rows = conn.execute(
        "SELECT rowid, id, source_id, timestamp_iso, ingest_time, severity, matched_patterns, text "
        "FROM log_entries "
        "WHERE severity IN ('ERROR', 'CRITICAL') AND rowid > ? "
        "ORDER BY rowid LIMIT ?",
        (last_rowid, BATCH_LIMIT),
    ).fetchall()
    if not rows:
        log.info("No new ERROR/CRITICAL entries since rowid %d", last_rowid)
        return 0
    entries = [
        {
            "entry_id": row["id"],
            "source_id": row["source_id"],
            "timestamp_iso": row["timestamp_iso"],
            "ingest_time": row["ingest_time"],
            "severity": row["severity"],
            "matched_patterns": json.loads(row["matched_patterns"] or "[]"),
            "text": row["text"],
        }
        for row in rows
    ]
    new_watermark = rows[-1]["rowid"]
    payload = {
        "batch_version": BATCH_VERSION,
        "batch_id": str(uuid.uuid4()),
        "pushed_at": datetime.now(timezone.utc).isoformat(),
        "source_host": SOURCE_HOST,
        "batch_type": "raw_entries",
        "watermark_from": last_rowid,
        "watermark_to": new_watermark,
        "entries": entries,
    }
    post_batch(endpoint, token, payload)
    # Advance the watermark only after the POST succeeded.
    write_rowid_watermark(db_path, new_watermark)
    log.info("Exported %d entries (rowid %d → %d)", len(rows), last_rowid, new_watermark)
    return len(rows)

def export_incidents(conn: sqlite3.Connection, db_path: Path, endpoint: str, token: str) -> int:
    last_ts = read_ts_watermark(db_path)
    rows = conn.execute(
        "SELECT id, label, issue_type, started_at, ended_at, notes, created_at, severity "
        "FROM incidents WHERE created_at > ? ORDER BY created_at LIMIT ?",
        (last_ts, BATCH_LIMIT),
    ).fetchall()
    if not rows:
        log.info("No new incidents since %s", last_ts)
        return 0
    incidents = [dict(row) for row in rows]
    new_watermark = rows[-1]["created_at"]
    payload = {
        "batch_version": BATCH_VERSION,
        "batch_id": str(uuid.uuid4()),
        "pushed_at": datetime.now(timezone.utc).isoformat(),
        "source_host": SOURCE_HOST,
        "batch_type": "incident_bundles",
        "watermark_from": last_ts,
        "watermark_to": new_watermark,
        "entries": incidents,
    }
    post_batch(endpoint, token, payload)
    write_ts_watermark(db_path, new_watermark)
    log.info("Exported %d incidents (up to %s)", len(rows), new_watermark)
    return len(rows)

def main() -> int:
    if not CORPUS_ENDPOINT:
        log.info("AVOCET_CORPUS_ENDPOINT not set — skipping corpus export")
        return 0
    if not CONSENT_TOKEN:
        log.error("AVOCET_CONSENT_TOKEN not set")
        return 1
    if not DB_PATH.exists():
        log.error("DB not found: %s", DB_PATH)
        return 1
    conn = sqlite3.connect(str(DB_PATH))
    conn.row_factory = sqlite3.Row
    try:
        entry_count = export_raw_entries(conn, DB_PATH, CORPUS_ENDPOINT, CONSENT_TOKEN)
        incident_count = export_incidents(conn, DB_PATH, CORPUS_ENDPOINT, CONSENT_TOKEN)
        log.info("Done — %d entries, %d incidents exported", entry_count, incident_count)
        return 0
    except httpx.HTTPStatusError as exc:
        log.error("HTTP %s from Avocet: %s", exc.response.status_code, exc.response.text[:200])
        return 1
    except Exception as exc:
        log.error("Export failed: %s", exc)
        return 1
    finally:
        conn.close()


if __name__ == "__main__":
    sys.exit(main())
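export_corpus.py pages through `log_entries` with a rowid watermark: each run selects rows above the stored rowid, and the last rowid of the batch becomes the new watermark. The sketch below illustrates that paging pattern against an in-memory SQLite database. The one-column table is a toy schema, and `fetch_batch`, `drain`, and `demo` are hypothetical names. Unlike the real script, which exports a single batch per run, `drain` loops until nothing is left so the watermark movement is visible in one call.

```python
import sqlite3


def fetch_batch(conn: sqlite3.Connection, last_rowid: int, limit: int) -> list[tuple[int, str]]:
    """Return up to `limit` ERROR/CRITICAL rows with rowid above the watermark."""
    return conn.execute(
        "SELECT rowid, severity FROM log_entries "
        "WHERE severity IN ('ERROR', 'CRITICAL') AND rowid > ? "
        "ORDER BY rowid LIMIT ?",
        (last_rowid, limit),
    ).fetchall()


def drain(conn: sqlite3.Connection, limit: int) -> list[int]:
    """Page through all matching rows, advancing the watermark after each batch."""
    watermark: int = 0
    seen: list[int] = []
    while True:
        rows = fetch_batch(conn, watermark, limit)
        if not rows:
            return seen
        seen.extend(rowid for rowid, _ in rows)
        watermark = rows[-1][0]  # last rowid in the batch becomes the new watermark


def demo() -> list[int]:
    """Build a five-row toy table and return the rowids that would be exported."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE log_entries (severity TEXT)")  # toy schema (assumption)
    conn.executemany(
        "INSERT INTO log_entries (severity) VALUES (?)",
        [("INFO",), ("ERROR",), ("WARN",), ("CRITICAL",), ("ERROR",)],
    )
    try:
        return drain(conn, limit=2)
    finally:
        conn.close()
```

Because the filter uses `rowid > ?` together with `ORDER BY rowid`, no row is exported twice and none is skipped, provided rowids only ever grow.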


@ -1,5 +1,5 @@
#!/usr/bin/env bash #!/usr/bin/env bash
# Export recent system messages to files the Turnstone container can ingest. # Export recent system messages to files the Turnstone container can glean.
# #
# Exports: # Exports:
# journal-export.jsonl — journald (if journalctl is available) # journal-export.jsonl — journald (if journalctl is available)
@ -11,11 +11,11 @@
# Usage (standalone): # Usage (standalone):
# sudo bash /opt/turnstone/scripts/export_journal.sh # sudo bash /opt/turnstone/scripts/export_journal.sh
# #
# Cron (combined with ingest): # Cron (combined with glean):
# */15 * * * * bash /opt/turnstone/scripts/export_journal.sh && \ # */15 * * * * bash /opt/turnstone/scripts/export_journal.sh && \
# podman exec turnstone python scripts/ingest_corpus.py \ # podman exec turnstone python scripts/ingest_corpus.py \
# --sources /patterns/sources.yaml --db /data/turnstone.db \ # --sources /patterns/sources.yaml --db /data/turnstone.db \
# >> /var/log/turnstone-ingest.log 2>&1 # >> /var/log/turnstone-glean.log 2>&1
set -euo pipefail set -euo pipefail

383
scripts/gen_corpus.py Normal file

@ -0,0 +1,383 @@
"""Synthetic log corpus generator.
Produces realistic-but-entirely-artificial log files for demos, load tests,
and parser regression suites; no production data required.
Usage:
python scripts/gen_corpus.py --days 7 --out /tmp/demo-corpus/
python scripts/gen_corpus.py --days 1 --out /tmp/test-run/ --seed 42 --error-rate 0.15
python scripts/gen_corpus.py --help
Output tree:
<out>/journald/system.jsonl systemd/kernel journald JSON
<out>/docker/services.jsonl containerised app stdout
<out>/qbittorrent/qbt.log hotio-format qBittorrent log
<out>/ext_device/device.log vendor device plaintext log
"""
from __future__ import annotations
import argparse
import json
import random
import sys
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Callable
# ── Severity distribution ──────────────────────────────────────────────────────
_SYSLOG_PRIORITY = {
"CRITICAL": "2",
"ERROR": "3",
"WARN": "4",
"INFO": "6",
"DEBUG": "7",
}
_SEVERITY_WEIGHTS = {
"INFO": 0.70,
"DEBUG": 0.10,
"WARN": 0.12,
"ERROR": 0.06,
"CRITICAL": 0.02,
}
def _pick_severity(rng: random.Random, error_rate: float) -> str:
    """Return a severity string, boosting ERROR/CRITICAL by error_rate."""
    weights = dict(_SEVERITY_WEIGHTS)
    boost = error_rate * 0.08  # distribute extra weight to error tiers
    weights["ERROR"] += boost
    weights["CRITICAL"] += boost / 2
    weights["INFO"] -= boost * 1.2
    weights["DEBUG"] -= boost * 0.3
    choices = list(weights.keys())
    probs = [max(0.0, weights[k]) for k in choices]
    return rng.choices(choices, weights=probs, k=1)[0]
# ── Timestamp helpers ──────────────────────────────────────────────────────────
def _ts_seq(start: datetime, end: datetime, rng: random.Random) -> list[datetime]:
    """Return a sorted list of random timestamps between start and end."""
    total_seconds = (end - start).total_seconds()
    # Roughly 1 event every ~4 seconds on average across all sources
    count = int(total_seconds / 4)
    offsets = sorted(rng.uniform(0, total_seconds) for _ in range(count))
    return [start + timedelta(seconds=o) for o in offsets]


def _micros(dt: datetime) -> str:
    """Journald __REALTIME_TIMESTAMP: microseconds since epoch, as string."""
    return str(int(dt.timestamp() * 1_000_000))
# ── Message libraries ──────────────────────────────────────────────────────────
_JOURNALD_UNITS = [
"sshd.service", "nginx.service", "docker.service", "systemd-resolved.service",
"cron.service", "systemd-journald.service", "NetworkManager.service",
"turnstone.service", "podman.service", "fail2ban.service",
]
_JOURNALD_MESSAGES: dict[str, list[str]] = {
"INFO": [
"Started {unit}.",
"Listening on {port}/tcp.",
"Reloaded configuration for {unit}.",
"New connection from {ip}:{port}",
"Session opened for user {user} by (uid=0)",
"Accepted publickey for {user} from {ip} port {port}",
"System time synchronized from NTP server {ip}",
"Unit {unit} entered active state.",
"Loaded kernel module {module}.",
"DNS query resolved: {host} -> {ip}",
],
"DEBUG": [
"Polling interval set to {n}ms",
"Cache hit for key '{key}'",
"Heartbeat OK from {host}",
"Timer {n} fired",
"Worker {n} idle",
],
"WARN": [
"High memory usage on {unit}: {pct}% used",
"Slow DNS response ({ms}ms) for {host}",
"Deprecated option '{key}' in config — will be removed in next release",
"Retrying connection to {host} (attempt {n}/5)",
"Journal size limit reached, rotating",
"Disk usage at {pct}% on /dev/sda1",
],
"ERROR": [
"Failed to start {unit}: exit code {n}",
"Connection refused to {host}:{port}",
"Segmentation fault in {unit} (core dumped)",
"Authentication failure for user {user} from {ip}",
"Timeout waiting for {unit} to become ready",
"Failed to bind {port}/tcp: address already in use",
],
"CRITICAL": [
"Kernel panic — not syncing: {msg}",
"Out of memory: killed process {n} ({unit})",
"Hardware error on /dev/sda1: I/O error",
"Disk quota exceeded on /home for user {user}",
"Critical service {unit} failed; system may be unstable",
],
}
_DOCKER_SERVICES = [
"caddy", "postgres", "redis", "turnstone", "avocet",
"prometheus", "grafana", "loki", "minio", "vllm",
]
_DOCKER_MESSAGES: dict[str, list[str]] = {
"INFO": [
"level=info msg=\"Server listening on 0.0.0.0:{port}\"",
"level=info msg=\"Connected to database at {host}:5432\"",
'level=info msg="GET /api/health 200 {ms}ms" user={user}',
'level=info msg="POST /api/v1/jobs 201 {ms}ms"',
"INFO: Worker pool size: {n}",
"INFO: Cache warmed — {n} entries loaded",
"INFO: Startup complete in {ms}ms",
"INFO: Scheduled job '{key}' executed successfully",
],
"DEBUG": [
"DEBUG: SQL query took {ms}ms: SELECT * FROM {key}",
"DEBUG: Redis HIT for key {key}",
"level=debug msg=\"span {key} completed\" duration={ms}ms",
"DEBUG: Trace ID {key}: handler returned 200",
],
"WARN": [
"level=warn msg=\"Slow query ({ms}ms) on table {key}\"",
"WARN: Connection pool at {pct}% capacity",
"WARN: Rate limit approaching for client {ip}",
"WARN: Deprecated endpoint /v1/{key} called by {ip}",
"level=warn msg=\"GC pause {ms}ms — possible memory pressure\"",
],
"ERROR": [
"level=error msg=\"Unhandled exception in handler '{key}'\" err={msg}",
"ERROR: Database connection lost: {msg}",
"level=error msg=\"Failed to acquire lock on {key} after {ms}ms\"",
"ERROR: HTTP 500 POST /api/v1/{key}: internal server error",
"ERROR: Redis NOAUTH: authentication required",
],
"CRITICAL": [
"level=critical msg=\"Panic: nil pointer dereference in {key}\"",
"CRITICAL: Fatal: cannot open database: {msg}",
"CRITICAL: OOM killer invoked — process {n} terminated",
],
}
_QBT_MESSAGES: dict[str, list[str]] = {
"INFO": [
"Successfully listening on IP: 0.0.0.0; port: {port}",
"Torrent '{key}' added to download queue",
"Download of '{key}' complete ({n} MB)",
"Seeding '{key}' at {n} KB/s",
"Tracker '{host}' working, {n} seeds",
"Peer {ip} connected to torrent '{key}'",
"Free disk space: {n} GB",
],
"WARN": [
"Tracker '{host}' is not working (retrying)",
"Slow download speed ({n} KB/s) for '{key}'",
"Too many open files — reducing connection limit",
"DHT bootstrap failed, retrying in {n}s",
],
"CRITICAL": [
"Not enough space on disk to download '{key}'",
"File I/O error for torrent '{key}': {msg}",
"Unable to bind listen port {port}",
],
}
_EXT_DEVICE_CODES: dict[str, list[str]] = {
"INFO": [
"SYS-0100 Device boot complete, firmware v{n}.{n}.{n}",
"SYS-0101 Sensor array calibration OK",
"NET-0200 Link established on interface eth{n}",
"CFG-0300 Configuration loaded from flash",
"HW-0400 Fan speed nominal: {n} RPM",
],
"WARN": [
"NET-0210 Link quality degraded: RSSI -{n} dBm",
"HW-0410 Fan speed elevated: {n} RPM (threshold: {n} RPM)",
"CFG-0310 Unknown config key '{key}' ignored",
"SYS-0110 Watchdog near timeout — {n}ms remaining",
],
"ERROR": [
"ERR-1001 Sensor read failure on channel {n}: timeout",
"ERR-1002 I2C bus {n} NACK from address 0x{key}",
"ERR-2001 Network tx queue overflow — dropped {n} packets",
"ERR-3001 Flash write error at sector {n}",
],
"CRITICAL": [
"ERR-9001 Thermal runaway detected — initiating shutdown",
"ERR-9002 Supply voltage out of range: {n}mV",
"ERR-9003 Memory parity error at address 0x{key}",
],
}
# ── Template substitution ──────────────────────────────────────────────────────
_HOSTS = ["node1", "node2", "node3", "node4", "gateway", "remotehost"]
_USERS = ["alan", "root", "deployer", "backup", "nobody"]
_MODULES = ["btrfs", "xfs", "nf_conntrack", "ip6table_filter", "overlay"]
def _fill(template: str, rng: random.Random) -> str:
"""Replace {placeholder} tokens with plausible random values."""
import re  # imported once, before _sub is defined, so its re.Match annotation resolves
def _sub(m: re.Match) -> str:
key = m.group(1)
if key == "ip": return f"10.{rng.randint(0,255)}.{rng.randint(0,255)}.{rng.randint(1,254)}"
if key == "port": return str(rng.randint(1024, 65535))
if key == "n": return str(rng.randint(1, 9999))
if key == "pct": return str(rng.randint(50, 99))
if key == "ms": return str(rng.randint(1, 5000))
if key == "unit": return rng.choice(_JOURNALD_UNITS)
if key == "user": return rng.choice(_USERS)
if key == "host": return rng.choice(_HOSTS)
if key == "module": return rng.choice(_MODULES)
if key == "msg": return rng.choice(["unexpected EOF", "connection reset", "no such file"])
if key == "key": return rng.choice(["auth", "jobs", "cache", "index", "sessions", "queue"])
# Unknown placeholders are left in the output untouched
return m.group(0)
return re.sub(r"\{(\w+)\}", _sub, template)
def _pick_msg(library: dict[str, list[str]], severity: str, rng: random.Random) -> str:
candidates = library.get(severity) or library.get("INFO", ["log entry"])
return _fill(rng.choice(candidates), rng)
# ── Per-format generators ──────────────────────────────────────────────────────
def gen_journald(path: Path, start: datetime, end: datetime, rng: random.Random, error_rate: float) -> int:
"""Emit journald JSON lines (-o json format)."""
lines = 0
hostname = rng.choice(_HOSTS)
with path.open("w") as fh:
for dt in _ts_seq(start, end, rng):
severity = _pick_severity(rng, error_rate)
unit = rng.choice(_JOURNALD_UNITS)
msg = _pick_msg(_JOURNALD_MESSAGES, severity, rng)
entry = {
"__REALTIME_TIMESTAMP": _micros(dt),
"MESSAGE": msg,
"PRIORITY": _SYSLOG_PRIORITY.get(severity, "6"),
"_HOSTNAME": hostname,
"_SYSTEMD_UNIT": unit,
"SYSLOG_IDENTIFIER": unit.replace(".service", ""),
}
fh.write(json.dumps(entry) + "\n")
lines += 1
return lines
def gen_docker(path: Path, start: datetime, end: datetime, rng: random.Random, error_rate: float) -> int:
"""Emit Docker-format JSON lines (SOURCE + MESSAGE envelope)."""
lines = 0
with path.open("w") as fh:
for dt in _ts_seq(start, end, rng):
severity = _pick_severity(rng, error_rate)
service = rng.choice(_DOCKER_SERVICES)
msg = _pick_msg(_DOCKER_MESSAGES, severity, rng)
entry = {
"SOURCE": f"docker:{service}",
"MESSAGE": msg,
}
fh.write(json.dumps(entry) + "\n")
lines += 1
return lines
def gen_qbittorrent(path: Path, start: datetime, end: datetime, rng: random.Random, error_rate: float) -> int:
"""Emit hotio-format qBittorrent plaintext log."""
_CODE = {"INFO": "N", "WARN": "W", "CRITICAL": "C", "ERROR": "C", "DEBUG": "N"}
lines = 0
with path.open("w") as fh:
for dt in _ts_seq(start, end, rng):
severity = _pick_severity(rng, error_rate)
msg = _pick_msg(_QBT_MESSAGES, severity, rng)
code = _CODE.get(severity, "N")
ts_str = dt.strftime("%Y-%m-%dT%H:%M:%S")
fh.write(f"({code}) {ts_str} - {msg}\n")
lines += 1
return lines
def gen_ext_device(path: Path, start: datetime, end: datetime, rng: random.Random, error_rate: float) -> int:
"""Emit vendor device plaintext log (ISO timestamp + level + ERR/SYS/NET code + message)."""
lines = 0
with path.open("w") as fh:
for dt in _ts_seq(start, end, rng):
severity = _pick_severity(rng, error_rate)
msg = _pick_msg(_EXT_DEVICE_CODES, severity, rng)
ts_str = dt.strftime("%Y-%m-%dT%H:%M:%S")
fh.write(f"{ts_str} [{severity}] {msg}\n")
lines += 1
return lines
# ── Orchestration ──────────────────────────────────────────────────────────────
_GENERATORS: list[tuple[str, str, Callable]] = [
("journald", "system.jsonl", gen_journald),
("docker", "services.jsonl", gen_docker),
("qbittorrent", "qbt.log", gen_qbittorrent),
("ext_device", "device.log", gen_ext_device),
]
def generate(
out: Path,
days: int,
seed: int | None,
error_rate: float,
reference_time: datetime | None = None,
) -> dict[str, int]:
rng = random.Random(seed)
end = reference_time or datetime.now(tz=timezone.utc)
start = end - timedelta(days=days)
totals: dict[str, int] = {}
for subdir, filename, gen_fn in _GENERATORS:
dest = out / subdir / filename
dest.parent.mkdir(parents=True, exist_ok=True)
# Each source gets its own seeded sub-RNG so streams are independent
sub_rng = random.Random(rng.randint(0, 2**31))
count = gen_fn(dest, start, end, sub_rng, error_rate)
totals[str(dest.relative_to(out))] = count
print(f" {dest.relative_to(out)}: {count:,} lines")
return totals
# ── CLI ────────────────────────────────────────────────────────────────────────
def main(argv: list[str] | None = None) -> int:
parser = argparse.ArgumentParser(
description="Generate a synthetic Turnstone log corpus for demos and testing."
)
parser.add_argument("--days", type=int, default=7, help="Days of history to generate (default: 7)")
parser.add_argument("--out", type=Path, required=True, help="Output directory")
parser.add_argument("--seed", type=int, default=None, help="RNG seed for reproducibility")
parser.add_argument("--error-rate", type=float, default=0.05, help="Error injection rate 0.0-1.0 (default: 0.05)")
args = parser.parse_args(argv)
if not 0.0 <= args.error_rate <= 1.0:
print("ERROR: --error-rate must be between 0.0 and 1.0", file=sys.stderr)
return 1
args.out.mkdir(parents=True, exist_ok=True)
print(f"Generating {args.days}-day corpus → {args.out} (seed={args.seed}, error_rate={args.error_rate})")
totals = generate(args.out, args.days, args.seed, args.error_rate)
total_lines = sum(totals.values())
print(f"Done — {total_lines:,} total log lines across {len(totals)} files")
return 0
if __name__ == "__main__":
sys.exit(main())
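
The placeholder substitution in `_fill` is what makes `--seed` runs reproducible: every random choice flows through the `random.Random` instance passed in. A minimal standalone sketch of the same idea, assuming only two placeholders; the name `fill` and the sample template are illustrative and not part of the script:

```python
import random
import re

_PLACEHOLDER = re.compile(r"\{(\w+)\}")


def fill(template: str, rng: random.Random) -> str:
    """Replace known {placeholder} tokens; leave unknown ones untouched."""
    def _sub(m: re.Match) -> str:
        key = m.group(1)
        if key == "port":
            return str(rng.randint(1024, 65535))
        if key == "ms":
            return str(rng.randint(1, 5000))
        return m.group(0)  # unknown token: keep it verbatim
    return _PLACEHOLDER.sub(_sub, template)


# Hypothetical template; two RNGs built from the same seed give the same line.
template = "Failed to bind {port}/tcp after {ms}ms ({unknown})"
first = fill(template, random.Random(42))
second = fill(template, random.Random(42))
print(first == second)
```

Because each generator in `generate()` receives its own sub-RNG, the same property should hold per output file as long as the seed and the reference time are fixed.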

@ -1,11 +1,15 @@
"""CLI: ingest a log file or corpus directory into the Turnstone SQLite database. """CLI: glean a log file or corpus directory into the Turnstone SQLite database.
Usage: Usage:
# Single file or directory (legacy) # Single file or directory (legacy)
python scripts/ingest_corpus.py <file_or_dir> [db_path] python scripts/glean_corpus.py <file_or_dir> [db_path] [--force]
# Sources config (multi-service) # Sources config (multi-service)
python scripts/ingest_corpus.py --sources <sources.yaml> [--db <db_path>] python scripts/glean_corpus.py --sources <sources.yaml> [--db <db_path>] [--force]
Options:
--force Bypass fingerprint checks and re-glean all files, re-applying
all patterns. Use after updating patterns/default.yaml.
""" """
from __future__ import annotations from __future__ import annotations
@ -17,7 +21,7 @@ logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
sys.path.insert(0, str(Path(__file__).parent.parent)) sys.path.insert(0, str(Path(__file__).parent.parent))
from app.ingest.pipeline import ingest, ingest_file, ingest_sources from app.glean.pipeline import glean_dir, glean_file, glean_sources
def _print_stats(stats: dict[str, int]) -> None: def _print_stats(stats: dict[str, int]) -> None:
@ -33,33 +37,36 @@ if __name__ == "__main__":
if not args: if not args:
print( print(
"Usage:\n" "Usage:\n"
" ingest_corpus.py <file_or_dir> [db_path]\n" " glean_corpus.py <file_or_dir> [db_path] [--force]\n"
" ingest_corpus.py --sources <sources.yaml> [--db <db_path>]", " glean_corpus.py --sources <sources.yaml> [--db <db_path>] [--force]",
file=sys.stderr, file=sys.stderr,
) )
sys.exit(1) sys.exit(1)
force = "--force" in args
args = [a for a in args if a != "--force"]
if args[0] == "--sources": if args[0] == "--sources":
if len(args) < 2: if len(args) < 2:
print("Usage: ingest_corpus.py --sources <sources.yaml> [--db <db_path>]", file=sys.stderr) print("Usage: glean_corpus.py --sources <sources.yaml> [--db <db_path>] [--force]", file=sys.stderr)
sys.exit(1) sys.exit(1)
sources_file = Path(args[1]) sources_file = Path(args[1])
db_path = Path("data/turnstone.db") db_path = Path("data/turnstone.db")
if "--db" in args: if "--db" in args:
db_path = Path(args[args.index("--db") + 1]) db_path = Path(args[args.index("--db") + 1])
db_path.parent.mkdir(parents=True, exist_ok=True) db_path.parent.mkdir(parents=True, exist_ok=True)
print(f"Ingesting sources from {sources_file} → {db_path}") print(f"Gleaning sources from {sources_file} → {db_path}")
stats = ingest_sources(sources_file, db_path) stats = glean_sources(sources_file, db_path, force=force)
_print_stats(stats) _print_stats(stats)
else: else:
target = Path(args[0]) target = Path(args[0])
db_path = Path(args[1]) if len(args) > 1 else Path("data/turnstone.db") db_path = Path(args[1]) if len(args) > 1 else Path("data/turnstone.db")
db_path.parent.mkdir(parents=True, exist_ok=True) db_path.parent.mkdir(parents=True, exist_ok=True)
print(f"Ingesting {target} → {db_path}") print(f"Gleaning {target} → {db_path}")
if target.is_file(): if target.is_file():
stats = ingest_file(target, db_path) stats = glean_file(target, db_path, force=force)
elif target.is_dir(): elif target.is_dir():
stats = ingest(target, db_path) stats = glean_dir(target, db_path, force=force)
else: else:
print(f"Error: {target} is not a file or directory", file=sys.stderr) print(f"Error: {target} is not a file or directory", file=sys.stderr)
sys.exit(1) sys.exit(1)
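
The script above parses `sys.argv` by hand, so a trailing `--db` with no value would raise `IndexError` at `args.index("--db") + 1`. A sketch of how the same flags might be declared with `argparse`; `build_parser` and `DEFAULT_DB` are hypothetical names, and the flag set is assumed from the usage text only:

```python
import argparse
from pathlib import Path

DEFAULT_DB = Path("data/turnstone.db")


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical argparse equivalent of the manual sys.argv handling."""
    parser = argparse.ArgumentParser(prog="glean_corpus.py")
    parser.add_argument("target", nargs="?", type=Path, help="file or directory (legacy form)")
    parser.add_argument("db_path", nargs="?", type=Path, help="database path (legacy form)")
    parser.add_argument("--sources", type=Path, help="sources.yaml for multi-service gleaning")
    parser.add_argument("--db", type=Path, default=DEFAULT_DB, help="database path used with --sources")
    parser.add_argument("--force", action="store_true", help="bypass fingerprint checks")
    return parser


# Sources-config form and legacy positional form, parsed from explicit argv lists.
sources_args = build_parser().parse_args(["--sources", "sources.yaml", "--force"])
legacy_args = build_parser().parse_args(["logs", "custom.db"])
print(sources_args.force, legacy_args.target)
```

With this shape, a missing value after `--db` is reported by `argparse` as a usage error instead of surfacing as a traceback.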

266
scripts/harvest_docs.py Normal file
@ -0,0 +1,266 @@
#!/usr/bin/env python3
"""harvest_docs.py — Bulk-upload documentation into Turnstone's context RAG.
Reads a YAML manifest that describes which files or directories to upload,
then POSTs each file to Turnstone's context docs endpoint (UPLOAD_PATH, appended to the base URL).
Usage:
# From a manifest file
python harvest_docs.py --manifest manifests/my-cluster.yaml
# Explicit files (no manifest needed)
python harvest_docs.py --base-url http://localhost:8534 file1.md dir/file2.yaml
# Dry run — show what would be uploaded without sending
python harvest_docs.py --manifest manifests/my-cluster.yaml --dry-run
Manifest format (YAML):
base_url: http://localhost:8534 # optional; overridden by --base-url
sources:
- path: /absolute/path/to/file.md
label: friendly-name # optional; overrides filename in DB
- path: /absolute/path/to/dir/
include: ["*.md", "*.yaml"] # glob patterns; default: see INCLUDE_EXTS
exclude: ["CLAUDE*", "SESSION_*", "*_keys*"]
recursive: false # default false
"""
from __future__ import annotations
import argparse
import fnmatch
import sys
import urllib.request
import urllib.error
from pathlib import Path
try:
import yaml
_HAS_YAML = True
except ImportError:
_HAS_YAML = False
# File extensions included when walking a directory with no explicit `include`.
INCLUDE_EXTS = {".md", ".yaml", ".yml", ".txt", ".conf", ".rst"}
# Default exclude patterns applied to every directory source (unless overridden).
DEFAULT_EXCLUDES = [
"CLAUDE*",
"SESSION_*",
"HANDOFF_*",
"*.key",
"*.pem",
"*.crt",
"node_modules",
".git",
"__pycache__",
]
UPLOAD_PATH = "/turnstone/api/context/docs"
# ---------------------------------------------------------------------------
# File collection
# ---------------------------------------------------------------------------
def _matches_any(name: str, patterns: list[str]) -> bool:
return any(fnmatch.fnmatch(name, p) for p in patterns)
def _collect_from_dir(
root: Path,
include: list[str],
exclude: list[str],
recursive: bool,
) -> list[Path]:
pattern = "**/*" if recursive else "*"
candidates: list[Path] = []
for p in root.glob(pattern):
if not p.is_file():
continue
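# Exclude any path component below root that matches an exclude pattern.
# Components of root itself are not tested, so a root that lives under a
# directory such as ".git" or "node_modules" is not silently emptied.
if any(_matches_any(part, exclude) for part in p.relative_to(root).parts):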
continue
if include:
if not _matches_any(p.name, include):
continue
else:
if p.suffix.lower() not in INCLUDE_EXTS:
continue
candidates.append(p)
return sorted(candidates)
def resolve_sources(sources: list[dict]) -> list[tuple[Path, str]]:
"""Return list of (path, label) pairs from a manifest sources list."""
results: list[tuple[Path, str]] = []
for entry in sources:
raw_path = entry.get("path", "")
p = Path(raw_path).expanduser().resolve()
label: str = entry.get("label", "")
include: list[str] = entry.get("include", [])
exclude: list[str] = entry.get("exclude", DEFAULT_EXCLUDES)
recursive: bool = entry.get("recursive", False)
if not p.exists():
print(f" [WARN] path not found, skipping: {p}", file=sys.stderr)
continue
if p.is_file():
results.append((p, label or p.name))
elif p.is_dir():
found = _collect_from_dir(p, include, exclude, recursive)
for f in found:
results.append((f, f.name))
else:
print(f" [WARN] not a file or directory, skipping: {p}", file=sys.stderr)
return results
# ---------------------------------------------------------------------------
# Upload
# ---------------------------------------------------------------------------
def _build_multipart(boundary: bytes, filename: str, content: bytes) -> bytes:
"""Build a minimal multipart/form-data body for a single file field."""
lines: list[bytes] = [
b"--" + boundary,
f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
b"Content-Type: application/octet-stream",
b"",
content,
b"--" + boundary + b"--",
b"",
]
return b"\r\n".join(lines)
def upload_file(base_url: str, path: Path, label: str) -> dict:
"""POST a file to Turnstone's context doc endpoint. Returns response dict."""
url = base_url.rstrip("/") + UPLOAD_PATH
content = path.read_bytes()
filename = label or path.name
boundary = b"----TurnstoneHarvest"
body = _build_multipart(boundary, filename, content)
content_type = f"multipart/form-data; boundary={boundary.decode()}"
req = urllib.request.Request(
url,
data=body,
headers={"Content-Type": content_type},
method="POST",
)
try:
with urllib.request.urlopen(req, timeout=30) as resp:
import json
return json.loads(resp.read())
except urllib.error.HTTPError as e:
body_text = e.read().decode(errors="replace")
return {"error": f"HTTP {e.code}: {body_text[:200]}"}
except Exception as exc:
return {"error": str(exc)}
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
def main() -> None:
parser = argparse.ArgumentParser(
description="Bulk-upload docs into Turnstone context RAG.",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=__doc__,
)
parser.add_argument(
"--manifest", "-m",
metavar="FILE",
help="YAML manifest describing sources to upload",
)
parser.add_argument(
"--base-url", "-u",
default="http://localhost:8534",
metavar="URL",
help="Turnstone base URL (default: http://localhost:8534)",
)
parser.add_argument(
"--dry-run", "-n",
action="store_true",
help="Show files that would be uploaded without actually uploading",
)
parser.add_argument(
"files",
nargs="*",
metavar="FILE",
help="Explicit files to upload (alternative to --manifest)",
)
args = parser.parse_args()
base_url = args.base_url
sources: list[tuple[Path, str]] = []
if args.manifest:
if not _HAS_YAML:
print("ERROR: PyYAML is required for --manifest. Run: pip install pyyaml", file=sys.stderr)
sys.exit(1)
manifest_path = Path(args.manifest).expanduser().resolve()
if not manifest_path.exists():
print(f"ERROR: manifest not found: {manifest_path}", file=sys.stderr)
sys.exit(1)
data = yaml.safe_load(manifest_path.read_text()) or {}  # an empty manifest parses to None
base_url = args.base_url if args.base_url != "http://localhost:8534" else data.get("base_url", base_url)
sources = resolve_sources(data.get("sources", []))
for raw in args.files:
p = Path(raw).expanduser().resolve()
if not p.exists():
print(f" [WARN] not found, skipping: {p}", file=sys.stderr)
continue
if p.is_file():
sources.append((p, p.name))
else:
print(f" [WARN] {p} is a directory; use a manifest with 'recursive: true' for directory sources", file=sys.stderr)
if not sources:
print("No files to upload. Pass --manifest or explicit file paths.")
sys.exit(0)
print(f"Turnstone: {base_url}")
print(f"Files to upload: {len(sources)}")
if args.dry_run:
print("\n[DRY RUN] Would upload:")
print()
ok = 0
failed = 0
for path, label in sources:
size_kb = path.stat().st_size / 1024
if args.dry_run:
print(f" {label} ({size_kb:.1f} KB) ← {path}")
ok += 1
continue
print(f" Uploading {label} ({size_kb:.1f} KB)…", end=" ", flush=True)
result = upload_file(base_url, path, label)
if "error" in result:
print(f"FAILED — {result['error']}")
failed += 1
else:
chunks = result.get("chunks_written", result.get("chunks_created", "?"))
facts = result.get("facts_written", 0)
extra = f", {facts} facts" if facts else ""
print(f"OK ({chunks} chunks{extra})")
ok += 1
print()
if args.dry_run:
print(f"Dry run complete. {ok} file(s) would be uploaded.")
else:
print(f"Done. {ok} uploaded, {failed} failed.")
if failed:
sys.exit(1)
if __name__ == "__main__":
main()
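
`upload_file` uses a fixed boundary string and interpolates the filename as-is, so a file that happens to contain the boundary, or a label containing a double quote, could yield a malformed body. A hedged sketch of a variant with a per-request boundary and the percent-escaping browsers apply to form field filenames; `build_multipart` here is a hypothetical replacement, not the script's function:

```python
from __future__ import annotations

import uuid


def build_multipart(filename: str, content: bytes) -> tuple[bytes, str]:
    """Return (body, content_type) for a single-file multipart/form-data upload."""
    # A random suffix makes a collision with the file content very unlikely.
    boundary = f"----TurnstoneHarvest{uuid.uuid4().hex}".encode()
    # Escape CR, LF and double quotes so the header line stays well-formed.
    safe_name = filename.replace("\r", "%0D").replace("\n", "%0A").replace('"', "%22")
    parts = [
        b"--" + boundary,
        f'Content-Disposition: form-data; name="file"; filename="{safe_name}"'.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        content,
        b"--" + boundary + b"--",
        b"",
    ]
    return b"\r\n".join(parts), f"multipart/form-data; boundary={boundary.decode()}"


body, content_type = build_multipart('notes "v2".md', b"hello")
print(content_type.startswith("multipart/form-data; boundary="))
```

The returned content type would go into the `Content-Type` header of the `urllib.request.Request`, exactly where the fixed boundary is used today.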

@ -0,0 +1,38 @@
# Turnstone context doc manifest — example / template
# Run: python scripts/harvest_docs.py --manifest scripts/manifests/example.yaml
#
# Copy this file, adjust paths and patterns for your environment.
# Keep manifests in version control alongside your docs so ingestion config
# is auditable and reproducible.
# Turnstone URL (can be overridden with --base-url on the command line)
base_url: http://localhost:8534
sources:
# ── Single file ────────────────────────────────────────────────────────────
- path: /path/to/runbooks/service-restart.md
label: runbook-service-restart.md # name stored in context DB (optional)
# ── Directory — include specific extensions, exclude sensitive patterns ─────
- path: /path/to/runbooks/
include: ["*.md", "*.yaml"] # only these extensions
exclude: # skip these filename patterns
- "CLAUDE*" # Claude session prompts
- "SESSION_*" # session summaries
- "HANDOFF_*" # handoff notes
- "*.key" # private keys
- "*.pem"
recursive: false # set true to walk subdirectories
# ── Recursive directory walk ───────────────────────────────────────────────
- path: /path/to/docs/
include: ["*.md"]
exclude:
- "CLAUDE*"
- "*.key"
- "node_modules"
- ".git"
recursive: true
# ── Minimal entry (single file: uploaded as-is under its own filename; the
#    INCLUDE_EXTS filter and DEFAULT_EXCLUDES only apply to directory sources)
- path: /path/to/infrastructure.md
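
The `include` and `exclude` lists in a directory entry are shell-style globs matched against file names with `fnmatch`. A small sketch of how the patterns from the second entry above would filter a directory listing; the file names are made up for illustration:

```python
import fnmatch


def matches_any(name: str, patterns: list) -> bool:
    """True if the name matches at least one glob pattern."""
    return any(fnmatch.fnmatch(name, p) for p in patterns)


include = ["*.md", "*.yaml"]
exclude = ["CLAUDE*", "SESSION_*", "HANDOFF_*", "*.key", "*.pem"]

# Hypothetical directory contents.
names = ["README.md", "CLAUDE.md", "SESSION_2024.md", "server.key", "values.yaml"]

# A file is kept only if it matches an include pattern and no exclude pattern.
kept = [n for n in names if matches_any(n, include) and not matches_any(n, exclude)]
print(kept)
```

Note that an explicit `exclude` list replaces the script's defaults rather than extending them, so sensitive patterns need to be repeated in each entry that sets its own list.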

@ -0,0 +1,53 @@
# Turnstone context doc manifest — Heimdall home lab cluster
# Run: python scripts/harvest_docs.py --manifest scripts/manifests/heimdall-devops.yaml
#
# Sections:
# infrastructure/ — network topology, machine specs, service ports
# runbooks/ — incident postmortems and operational procedures
# tdarr/ — media transcoding failure modes and recovery
#
# Files intentionally excluded from this manifest:
# - WireGuard .conf files and KEYS.txt (contain private keys)
# - SESSION_* and HANDOFF_* files (Claude session prompts, not operational docs)
# - CLAUDE.md files (Claude context prompts, not operational docs)
# - Raw tdarr scan data (tdarr/data/*.txt — scan output, not prose)
# - projects/helmet-3d, projects/mycroft-precise (unrelated to cluster ops)
# - collapse-stack/ (resilience planning, not daily log triage material)
# - bastion/sdcard-config, bastion/rpi-config (one-time setup artifacts)
base_url: http://localhost:8534
sources:
# ── Service inventory (most immediately useful for log attribution) ────────
- path: /Library/Development/CircuitForge/circuitforge-infra/inventory/services.md
label: service-inventory.md
# ── Infrastructure topology (partially outdated — note added at top of file)
- path: /Library/Development/CircuitForge/circuitforge-infra/infrastructure/docs/INFRASTRUCTURE.md
label: infrastructure-topology.md
- path: /Library/Development/CircuitForge/circuitforge-infra/infrastructure/docs/GPU_CLUSTERING.md
label: gpu-clustering.md
- path: /Library/Development/CircuitForge/circuitforge-infra/infrastructure/ssh_configs/PROXYJUMP_CONFIG.md
label: ssh-proxyjump-config.md
# ── Runbooks ───────────────────────────────────────────────────────────────
- path: /Library/Development/CircuitForge/circuitforge-infra/runbooks/cf-orch-coordinator.md
label: runbook-cf-orch-coordinator.md
- path: /Library/Development/CircuitForge/circuitforge-infra/runbooks/docker-nfs-boot-race-and-image-security.md
label: runbook-docker-nfs-boot-race.md
- path: /Library/Development/CircuitForge/circuitforge-infra/runbooks/PIHOLE_DNS_HANDOFF.md
label: runbook-pihole-dns.md
# ── Media server / Tdarr ───────────────────────────────────────────────────
- path: /Library/Development/devl/Devops/tdarr/docs/TDARR_RECOVERY_README.md
label: tdarr-recovery.md
- path: /Library/Development/devl/Devops/tdarr/docs/NVENC_CORRUPTION_DETECTION.md
label: tdarr-nvenc-corruption.md
- path: /Library/Development/devl/Devops/tdarr/docs/TDARR_ROBUST_WORKFLOW.md
label: tdarr-robust-workflow.md

@ -0,0 +1,204 @@
#!/usr/bin/env python3
"""One-shot migration: copy data from existing SQLite DBs into Postgres.
Usage:
DATABASE_URL=postgresql://... python scripts/migrate_sqlite_to_postgres.py \
--main-db data/turnstone.db \
--context-db data/turnstone-context.db \
--incidents-db data/turnstone-incidents.db \
[--tenant-id heimdall]
The script is idempotent: rows already present in Postgres (same id) are skipped.
It must be run ONCE per node after deploying the shared Postgres backend.
Prerequisites:
pip install 'psycopg[binary,pool]'
Set DATABASE_URL to the target Postgres connection string.
"""
from __future__ import annotations
import argparse
import os
import sqlite3
import sys
from pathlib import Path
# Allow running from the project root without installing the package
sys.path.insert(0, str(Path(__file__).parent.parent))
def _pg_connect():
import psycopg # type: ignore[import]
url = os.environ.get("DATABASE_URL")
if not url:
print("ERROR: DATABASE_URL not set", file=sys.stderr)
sys.exit(1)
return psycopg.connect(url, autocommit=False)
def _ensure_schema_pg() -> None:
from app.db.schema import ensure_schema, ensure_context_schema, ensure_incidents_schema
from pathlib import Path
ensure_schema(Path("/dev/null")) # db_path ignored for Postgres
ensure_context_schema(Path("/dev/null"))
ensure_incidents_schema(Path("/dev/null"))
print("Postgres schema verified")
def _migrate_table(
src_conn: sqlite3.Connection,
dst_conn,
table: str,
tenant_id: str,
columns: list[str],
conflict_cols: list[str],
) -> int:
"""Copy rows from SQLite table to Postgres. Returns rows inserted."""
# Check if source table exists
try:
rows = src_conn.execute(f"SELECT * FROM {table} LIMIT 0").fetchall() # noqa: S608
except sqlite3.OperationalError:
print(f" {table}: not found in SQLite — skipping")
return 0
# Fetch all rows
src_conn.row_factory = sqlite3.Row
rows = src_conn.execute(f"SELECT * FROM {table}").fetchall() # noqa: S608
if not rows:
print(f" {table}: empty — skipping")
return 0
# Build INSERT ... ON CONFLICT DO NOTHING
col_list = ", ".join(columns)
placeholders = ", ".join("%s" for _ in columns)
conflict = ", ".join(conflict_cols)
sql = (
f"INSERT INTO {table} ({col_list}) VALUES ({placeholders}) " # noqa: S608
f"ON CONFLICT ({conflict}) DO NOTHING"
)
inserted = 0
with dst_conn.cursor() as cur:
for row in rows:
# Build values: inject tenant_id if not present in source row
vals = []
for col in columns:
if col == "tenant_id":
try:
val = row["tenant_id"] or tenant_id
except (IndexError, KeyError):
val = tenant_id
else:
try:
vals.append(row[col])
except (IndexError, KeyError):
vals.append(None)
continue
vals.append(val)
cur.execute(sql, vals)
inserted += cur.rowcount
dst_conn.commit()
print(f" {table}: {inserted}/{len(rows)} rows inserted ({len(rows) - inserted} skipped)")
return inserted
def main() -> None:
parser = argparse.ArgumentParser(description="Migrate Turnstone SQLite → Postgres")
parser.add_argument("--main-db", default="data/turnstone.db")
parser.add_argument("--context-db", default="data/turnstone-context.db")
parser.add_argument("--incidents-db", default="data/turnstone-incidents.db")
parser.add_argument("--tenant-id", default=None, help="Override tenant ID (default: $TURNSTONE_TENANT_ID if set, else socket.gethostname())")
args = parser.parse_args()
if args.tenant_id:
os.environ["TURNSTONE_TENANT_ID"] = args.tenant_id
import socket
tenant_id = os.environ.get("TURNSTONE_TENANT_ID") or socket.gethostname()
print(f"Migrating as tenant_id={tenant_id!r}")
# Connect first so a missing DATABASE_URL fails fast, before any schema function runs
pg = _pg_connect()
# Ensure the Postgres schema exists before copying rows
_ensure_schema_pg()
total = 0
# ── Main DB ───────────────────────────────────────────────────────────────
main_path = Path(args.main_db)
if main_path.exists():
print(f"\nMigrating main DB: {main_path}")
src = sqlite3.connect(str(main_path))
src.row_factory = sqlite3.Row
total += _migrate_table(src, pg, "log_entries", tenant_id,
columns=["tenant_id", "id", "source_id", "sequence", "timestamp_raw",
"timestamp_iso", "ingest_time", "severity", "repeat_count",
"out_of_order", "matched_patterns", "text"],
conflict_cols=["tenant_id", "id"])
total += _migrate_table(src, pg, "glean_fingerprints", tenant_id,
columns=["tenant_id", "path", "mtime", "size", "gleaned_at"],
conflict_cols=["tenant_id", "path"])
total += _migrate_table(src, pg, "blocklist_candidates", tenant_id,
columns=["id", "tenant_id", "domain_or_ip", "source_device_ip", "source_device_name",
"first_seen", "last_seen", "hit_count", "status", "pushed_at",
"log_evidence", "matched_rule", "llm_score", "llm_reason"],
conflict_cols=["id"])
src.close()
else:
print(f"Main DB not found at {main_path} — skipping")
# ── Context DB ────────────────────────────────────────────────────────────
ctx_path = Path(args.context_db)
if ctx_path.exists():
print(f"\nMigrating context DB: {ctx_path}")
src = sqlite3.connect(str(ctx_path))
total += _migrate_table(src, pg, "context_facts", tenant_id,
columns=["id", "tenant_id", "category", "key", "value", "source", "created_at"],
conflict_cols=["id"])
total += _migrate_table(src, pg, "context_documents", tenant_id,
columns=["id", "tenant_id", "filename", "doc_type", "full_text", "file_size", "uploaded_at"],
conflict_cols=["id"])
total += _migrate_table(src, pg, "context_chunks", tenant_id,
columns=["id", "tenant_id", "document_id", "chunk_index", "text"],
conflict_cols=["id"])
src.close()
else:
print(f"Context DB not found at {ctx_path} — skipping")
# ── Incidents DB ──────────────────────────────────────────────────────────
inc_path = Path(args.incidents_db)
if inc_path.exists():
print(f"\nMigrating incidents DB: {inc_path}")
src = sqlite3.connect(str(inc_path))
total += _migrate_table(src, pg, "incidents", tenant_id,
columns=["id", "tenant_id", "label", "issue_type", "started_at", "ended_at",
"notes", "created_at", "severity"],
conflict_cols=["id"])
total += _migrate_table(src, pg, "received_bundles", tenant_id,
columns=["id", "tenant_id", "source_host", "issue_type", "label", "severity",
"started_at", "bundled_at", "entry_count", "bundle_json"],
conflict_cols=["id"])
total += _migrate_table(src, pg, "sent_bundles", tenant_id,
columns=["id", "tenant_id", "incident_id", "exported_at", "sanitized",
"entry_count", "bundle_json"],
conflict_cols=["id"])
src.close()
else:
print(f"Incidents DB not found at {inc_path} — skipping")
pg.close()
print(f"\nDone. Total rows inserted: {total}")
if __name__ == "__main__":
main()
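
The migration relies on `INSERT ... ON CONFLICT (...) DO NOTHING` for idempotency: a second run finds every key already present and inserts nothing. A minimal sketch of that behaviour, using an in-memory SQLite database as a stand-in for Postgres (an assumption made so the example runs anywhere; SQLite uses `?` placeholders where psycopg uses `%s`, and the two-column `incidents` table is simplified):

```python
import sqlite3

INSERT_SQL = (
    "INSERT INTO incidents (id, label) VALUES (?, ?) "
    "ON CONFLICT (id) DO NOTHING"
)


def copy_rows(dst: sqlite3.Connection, rows: list) -> int:
    """Insert rows, skipping ids that already exist. Returns rows actually written."""
    before = dst.total_changes
    with dst:  # commits on success, rolls back on error
        dst.executemany(INSERT_SQL, rows)
    return dst.total_changes - before


dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE incidents (id TEXT PRIMARY KEY, label TEXT)")

rows = [("a1", "disk full"), ("b2", "dns outage")]  # hypothetical source rows
first_run = copy_rows(dst, rows)
second_run = copy_rows(dst, rows)
total = dst.execute("SELECT COUNT(*) FROM incidents").fetchone()[0]
print(first_run, second_run, total)
```

The script's per-row `cur.rowcount` accumulation serves the same purpose as the `total_changes` difference here: it separates rows written from rows skipped.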

@ -0,0 +1,96 @@
#!/usr/bin/env python3
"""UDP syslog receiver for Turnstone cluster monitoring.
Listens on UDP port 5140 (non-privileged) and appends received messages
to /devl/turnstone-cluster/data/network-syslog.txt for the Turnstone
live watcher to tail.
Each line written is the raw syslog message, unchanged:
<raw_syslog_message>
RFC 3164 messages already include the sending hostname, so the sender IP is
not prepended; Turnstone's syslog ingestor can use that hostname to tag
entries by device.
Usage:
python3 syslog_receiver.py [--port 5140] [--output /path/to/network-syslog.txt]
Installed as: turnstone-syslog-receiver.service (see adjacent .service file)
"""
from __future__ import annotations
import argparse
import asyncio
import logging
import signal
import sys
from pathlib import Path
logger = logging.getLogger("syslog-receiver")
DEFAULT_PORT = 5140
DEFAULT_OUTPUT = "/devl/turnstone-cluster/data/network-syslog.txt"
class SyslogReceiverProtocol(asyncio.DatagramProtocol):
def __init__(self, output_path: Path) -> None:
self._output_path = output_path
self._fh = output_path.open("a", buffering=1) # line-buffered
def datagram_received(self, data: bytes, addr: tuple[str, int]) -> None:
try:
message = data.decode("utf-8", errors="replace").rstrip("\r\n")
except Exception:
return
if not message:
return
# RFC 3164 messages already include the sending hostname — write raw.
try:
self._fh.write(f"{message}\n")
except OSError as exc:
logger.error("Write failed: %s", exc)
def error_received(self, exc: Exception) -> None:
logger.warning("Socket error: %s", exc)
def connection_lost(self, exc: Exception | None) -> None:
self._fh.flush()
self._fh.close()
async def run(port: int, output_path: Path) -> None:
output_path.parent.mkdir(parents=True, exist_ok=True)
loop = asyncio.get_running_loop()
transport, protocol = await loop.create_datagram_endpoint(
lambda: SyslogReceiverProtocol(output_path),
local_addr=("0.0.0.0", port),
)
logger.info("Listening on UDP :%d → %s", port, output_path)
stop = loop.create_future()
for sig in (signal.SIGINT, signal.SIGTERM):
loop.add_signal_handler(sig, stop.set_result, None)
await stop
transport.close()
logger.info("Syslog receiver stopped.")
def main() -> None:
parser = argparse.ArgumentParser(description="UDP syslog receiver for Turnstone")
parser.add_argument("--port", type=int, default=DEFAULT_PORT)
parser.add_argument("--output", default=DEFAULT_OUTPUT)
args = parser.parse_args()
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s %(levelname)s %(message)s",
stream=sys.stdout,
)
asyncio.run(run(args.port, Path(args.output)))
if __name__ == "__main__":
main()
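
`datagram_received` is handed the sender address as `addr` but writes only the decoded message. If tagging by source IP were ever wanted, for devices that omit or misreport their hostname, the line could be built by a small pure function. This is a sketch under that assumption; `format_line` is a hypothetical helper and not part of the receiver:

```python
from __future__ import annotations


def format_line(data: bytes, addr: tuple[str, int]) -> str | None:
    """Decode one datagram and prefix it with the sender IP; None if empty."""
    message = data.decode("utf-8", errors="replace").rstrip("\r\n")
    if not message:
        return None
    return f"{addr[0]} {message}\n"


# Example datagram from a documentation-range address (RFC 5737).
line = format_line(b"<34>Oct 11 22:14:15 router su: auth failed\r\n", ("192.0.2.7", 514))
empty = format_line(b"\r\n", ("192.0.2.7", 514))
print(line, end="")
```

Keeping the formatting separate from the protocol class also makes it testable without opening a socket.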

Some files were not shown because too many files have changed in this diff