feat: context-aware dynamic log discovery — scan /var/log and /opt ranked by problem context and reference corpus #23

Closed
opened 2026-05-17 14:24:04 -07:00 by pyr0ball · 0 comments
Owner

Summary

When connecting to an unfamiliar system (especially over SSH), Turnstone should not require the user to know where log files live. It should scan candidate paths, rank them against the active problem context and reference corpus, and pre-select the most relevant ones — with the user able to confirm, adjust, or add manually.

Problem

Hard-coded or manually specified source paths break down on unfamiliar systems. A tech connecting to a remote host for the first time should not need to know the layout of that system to start pulling useful logs.

Approach: context-aware path ranking

User describes problem  +  Active reference corpus
          │                          │
          └──────────┬───────────────┘
                     ▼
          Context keywords / embedding
                     │
                     ▼
          Remote (or local) filesystem scan
            /var/log/**
            /opt/**/logs/
            /opt/**/*.log
                     │
                     ▼
          Score each candidate path
            - Corpus cite: reference doc mentions this path → highest priority
            - Name match: path tokens overlap with problem context keywords
            - Recency: modified within the ingest window → boost
            - Size: non-empty file → boost; empty → demote
                     │
                     ▼
          Ranked list with pre-selections shown to user
          User accepts, deselects, or adds paths manually
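
The scoring step in the diagram can be sketched as a weighted sum. This is a minimal illustration, not Turnstone's implementation: the weight values, the Candidate type, and the path_tokens helper are hypothetical, and the issue fixes only the relative ordering (corpus cite highest, empty files demoted).

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Hypothetical weights; only their relative order reflects the issue text.
W_CITE, W_NAME, W_RECENT, W_NONEMPTY, W_EMPTY = 100.0, 10.0, 3.0, 1.0, -5.0


@dataclass
class Candidate:
    path: str
    size_bytes: int
    recently_modified: bool  # modified within the ingest window


def path_tokens(path: str) -> set[str]:
    """Split a path into lowercase alphanumeric tokens."""
    return {t for t in re.split(r"[^a-z0-9]+", path.lower()) if t}


def score(c: Candidate, cited_paths: set[str], keywords: set[str]) -> float:
    """Combine the four signals from the diagram into one number."""
    s = 0.0
    if c.path in cited_paths:                            # corpus cite
        s += W_CITE
    s += W_NAME * len(path_tokens(c.path) & keywords)    # name match
    if c.recently_modified:                              # recency boost
        s += W_RECENT
    s += W_NONEMPTY if c.size_bytes > 0 else W_EMPTY     # size boost or demotion
    return s


def rank(cands: list[Candidate], cited_paths: set[str], keywords: set[str]) -> list[Candidate]:
    """Return candidates sorted from most to least relevant."""
    return sorted(cands, key=lambda c: score(c, cited_paths, keywords), reverse=True)
```

With these weights a single corpus citation outranks any realistic number of keyword hits, which matches the "highest priority" wording. A real implementation would probably drop generic tokens such as "var" and "log" before matching.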

Confidence tiers

Tier Source Default behavior
High Path explicitly cited in reference corpus Auto-selected, shown first
Medium Path name or parent dir matches context keywords Pre-selected, shown with match reason
Low Recently modified, no context match Listed, not pre-selected
Skip Empty, binary, or not modified in >30 days Hidden unless user expands
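
One way to read the table as code is shown below. It is a sketch under stated assumptions: the Signals type and the default_ui_state helper are hypothetical, and the precedence between rows (Skip conditions checked before corpus citation) is a guess, since the table does not say what happens to a cited file that is empty or stale.

```python
from dataclasses import dataclass
from enum import Enum

STALE_AFTER_DAYS = 30  # threshold taken from the Skip row


class Tier(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    SKIP = "skip"


@dataclass
class Signals:
    cited_in_corpus: bool
    keyword_match: bool
    age_days: float   # days since last modification
    is_empty: bool
    is_binary: bool


def assign_tier(s: Signals) -> Tier:
    # Assumption: Skip conditions win, so an empty, binary, or stale file is
    # hidden even when the corpus cites it. The issue does not specify this.
    if s.is_empty or s.is_binary or s.age_days > STALE_AFTER_DAYS:
        return Tier.SKIP
    if s.cited_in_corpus:
        return Tier.HIGH
    if s.keyword_match:
        return Tier.MEDIUM
    # Assumption: anything left over counts as "recently modified".
    return Tier.LOW


def default_ui_state(tier: Tier) -> dict:
    """Map a tier to the default behavior column of the table."""
    return {
        "preselected": tier in (Tier.HIGH, Tier.MEDIUM),
        "hidden": tier is Tier.SKIP,
    }
```

If cited paths should stay visible even when stale, the first two checks in assign_tier would swap places.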

Corpus as path oracle

At reference corpus ingest time, extract filesystem path patterns from chunk text and store them as chunk metadata. At discovery time this becomes a fast metadata lookup — no LLM inference required. Reference docs the user already carries for diagnosis serve double duty as a site map for log collection, with no extra configuration.
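
A minimal sketch of the ingest-time extraction follows, using the .log pattern quoted in the implementation notes. The chunk layout (a dict with "text" and "metadata" keys) and the "log_paths" metadata key are assumptions for illustration, not Turnstone's actual schema.

```python
import re

# Pattern from the implementation notes; directory patterns are omitted here.
LOG_PATH_RE = re.compile(r'/[a-z][^\s"]+\.log')


def extract_log_paths(chunk_text: str) -> list:
    """Return unique log paths mentioned in the text, in first-seen order."""
    return list(dict.fromkeys(LOG_PATH_RE.findall(chunk_text)))


def annotate_chunk(chunk: dict) -> dict:
    """Return a copy of the chunk with extracted paths stored as metadata."""
    meta = dict(chunk.get("metadata", {}))
    meta["log_paths"] = extract_log_paths(chunk["text"])
    return {**chunk, "metadata": meta}


chunk = {
    "text": 'Errors land in /var/log/nginx/error.log; the app also writes '
            '"/opt/acme/logs/app.log". See /var/log/nginx/error.log.'
}
print(annotate_chunk(chunk)["metadata"]["log_paths"])
# → ['/var/log/nginx/error.log', '/opt/acme/logs/app.log']
```

Discovery can then collect the union of log_paths across the active corpus and test each scanned path for membership, which keeps the High tier lookup free of any model call.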

Implementation notes

  • Remote scan command (SSH or local subprocess):
    find /var/log /opt -type f -name "*.log" -newer /tmp/.ts_turnstone 2>/dev/null
    
    Create the timestamp file marking the start of the recency window before scanning; find then filters on modification time itself, so no separate stat call per file is needed.
  • Path extraction from corpus: scan indexed chunks for filesystem path patterns (/[a-z][^\s"]+\.log and directory patterns) at ingest time; store as metadata on the chunk
  • Keyword matching for Medium tier: bag-of-words match against filename and parent directory tokens — no embedding needed at the discovery step
  • Semantic embedding fallback: only if keyword match returns zero Medium-tier candidates
  • Works for both local and SSH-remote sources — the scan command is the same either way
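
The "same command either way" point can be illustrated with a small wrapper that builds one find invocation and either runs it directly or hands it to ssh. This is a sketch, not the project's code: the function names and the ssh_target parameter are hypothetical, and it assumes the timestamp file already exists on the target host.

```python
import shlex
import subprocess
from typing import List, Optional

SCAN_ROOTS = ["/var/log", "/opt"]
STAMP = "/tmp/.ts_turnstone"  # timestamp file name from the note above


def build_scan_argv(ssh_target: Optional[str] = None) -> List[str]:
    """Build the argv for a local scan, or wrap the same command for ssh."""
    find_cmd = ["find", *SCAN_ROOTS, "-type", "f",
                "-name", "*.log", "-newer", STAMP]
    if ssh_target is None:
        return find_cmd
    # ssh passes one command string to the remote shell, so quote it here.
    return ["ssh", ssh_target, shlex.join(find_cmd)]


def parse_find_output(stdout: str) -> List[str]:
    """One path per line; blank lines are ignored."""
    return [line for line in stdout.splitlines() if line.strip()]


def scan(ssh_target: Optional[str] = None) -> List[str]:
    # stderr is captured and discarded, mirroring 2>/dev/null; find exits
    # non-zero on permission errors, so the return code is not checked.
    proc = subprocess.run(build_scan_argv(ssh_target),
                          capture_output=True, text=True, check=False)
    return parse_find_output(proc.stdout)
```

Quoting with shlex.join keeps the "*.log" glob from being expanded by the remote shell before find sees it.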

Acceptance Criteria

  • GET /api/sources/{id}/discover — trigger a scan and return ranked candidate paths
  • Corpus ingest pipeline extracts and stores filesystem path mentions as chunk metadata
  • Discovery ranks paths using the three ranked confidence tiers: High (corpus cite), Medium (keyword match), Low (recency only)
  • Discovery results shown in the UI as a selectable list with pre-selections and match reasons
  • User can accept defaults, deselect, or manually add paths not in the list
  • Works for both local sources and SSH remote sources (see Turnstone#22)
  • Empty, binary, and stale files hidden by default with an "expand" option
  • No regression on manually configured sources — discovery is opt-in per source

Related

  • Turnstone#21 — reference doc layer (corpus that feeds the path oracle)
  • Turnstone#22 — SSH remote ingest (primary use case driving this feature)
pyr0ball added this to the beta milestone 2026-06-01 15:09:59 -07:00
pyr0ball modified the milestone from beta to (deleted) 2026-06-05 11:40:09 -07:00
pyr0ball modified the milestone from beta to beta 2026-06-14 12:15:45 -07:00
Reference: Circuit-Forge/turnstone#23