Circuit-Forge/turnstone

feat: reference doc layer — ingest local structured docs (Markdown, PDF, DOCX) for context-aware diagnosis #21

Closed

opened 2026-05-17 11:33:28 -07:00 by pyr0ball · 2 comments

pyr0ball commented

2026-05-17 11:33:28 -07:00

Owner

Summary

Add a reference document ingestion layer to Turnstone, analogous to how Pagepiper does RAG on local PDFs. When Turnstone identifies a log pattern or anomaly, it should be able to cross-reference an indexed corpus of local docs (runbooks, MkDocs sites, Markdown reference material, PDF/DOCX technical documents) to surface specific recommended actions rather than generic responses.

Motivation

Currently Turnstone diagnoses based on log content alone. Many recurring patterns have known resolutions documented somewhere — a MkDocs runbook, a service CLAUDE.md, a troubleshooting guide — but that knowledge is not connected to the diagnosis pipeline. This feature closes that gap: known patterns map to specific runbook sections, and unknown patterns at least get relevant context injected into the LLM prompt.

PDF and DOCX support is a hard requirement for enterprise use cases where reference material arrives as proprietary formatted documents rather than plain text.

How it should work

Log event / anomaly detected
  → Pattern matcher checks event against known signatures
  → Vector search over reference corpus for relevant sections
  → Ranked doc chunks injected as context for diagnosis
  → Recommendation cites source doc and section

The reference layer is read-only and separate from the log index. Docs are reference material, not event streams — they should not trigger anomaly detection.
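
A minimal sketch of that flow, to make the hand-offs concrete. Everything here is hypothetical: `DocChunk`, `Diagnosis`, `diagnose`, the substring-based signature match, and the `search` callable are illustrative names, not Turnstone's actual API. An empty search result leaves `context` empty, which is the generic-fallback case.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DocChunk:
    source: str   # path of the reference doc
    section: str  # heading the chunk sits under
    text: str


@dataclass
class Diagnosis:
    signature: Optional[str]                       # matched known pattern, if any
    context: list = field(default_factory=list)    # chunks to inject into the prompt
    citations: list = field(default_factory=list)  # "doc#section" strings


def diagnose(event, signatures, search, top_k=3):
    """Hypothetical orchestration: match, retrieve, cite."""
    # 1. Pattern matcher: first known signature whose substring appears in the event.
    matched = next(
        (name for name, needle in signatures.items() if needle in event), None
    )
    # 2. Vector search over the reference corpus. The corpus is only read here;
    #    nothing is written back to it or to the log index.
    chunks = list(search(event, top_k))
    # 3. Ranked chunks become prompt context, each cited by source doc and section.
    citations = [f"{c.source}#{c.section}" for c in chunks]
    return Diagnosis(matched, chunks, citations)
```

Passing `search` in as a callable keeps the reference layer swappable and makes the no-docs path easy to test.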

Supported corpus formats

Format	Extension(s)	Library	Notes
Markdown / plain text	`.md`, `.txt`, `.rst`	stdlib	Primary format — MkDocs sites, READMEs, runbooks
PDF	`.pdf`	`pymupdf` (preferred) or `pdfminer.six`	Preserve section structure and heading hierarchy where possible
DOCX	`.docx`	`python-docx`	Use heading styles as chunk boundaries
ODT	`.odt`	`odfpy` or stdlib `zipfile` + `ElementTree`	OpenDocument Text — LibreOffice/OpenOffice; same XML-in-zip structure as DOCX
RTF	`.rtf`	`striprtf`	Rich Text Format — common on macOS (TextEdit default, Mail attachments)
Apple Pages	`.pages`	stdlib `zipfile` + `ElementTree`	Older Pages files are zip archives containing `index.xml`; extract body text from `sf:p` elements (newer files use the Protobuf-based layout, see the Pages note below)

Format detection is automatic from file extension. Unknown formats are skipped with a warning logged.
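
Extension-based detection could look roughly like this. The mapping mirrors the table above; the function name, the logger name, and the format labels are assumptions made for illustration.

```python
import logging
from pathlib import Path
from typing import Optional

log = logging.getLogger("turnstone.corpus")  # hypothetical logger name

# Extension -> parser family, taken from the supported-formats table.
EXTENSION_FORMATS = {
    ".md": "text", ".txt": "text", ".rst": "text",
    ".pdf": "pdf",
    ".docx": "docx",
    ".odt": "odt",
    ".rtf": "rtf",
    ".pages": "pages",
}


def detect_format(path) -> Optional[str]:
    """Return the parser family for a file, or None if it should be skipped."""
    suffix = Path(path).suffix.lower()  # case-insensitive: README.MD counts
    fmt = EXTENSION_FORMATS.get(suffix)
    if fmt is None:
        log.warning("skipping %s: unsupported format %r", path, suffix)
    return fmt
```

Returning `None` instead of raising lets the ingest loop continue past unknown files, which matches the "skipped with a warning" behavior described above.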

Implementation note — ODT: Peregrine already has a working ODT parser using stdlib zipfile + ElementTree only (no odfpy dependency). Port that approach before reaching for odfpy.
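
For orientation, a stdlib-only ODT reader can be quite small. This is an independent sketch, not the Peregrine parser: it assumes the document body lives in `content.xml`, treats every `text:h` element as a section boundary, and ignores outline levels, lists, and tables. `odt_sections` is a hypothetical name.

```python
import zipfile
import xml.etree.ElementTree as ET

# OpenDocument "text" namespace, used by both headings and paragraphs.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
HEADING = f"{{{TEXT_NS}}}h"
PARAGRAPH = f"{{{TEXT_NS}}}p"


def odt_sections(path):
    """Return [(heading, body_text)], splitting at each text:h element.

    Text before the first heading is returned under an empty heading.
    """
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("content.xml"))

    sections, heading, body = [], "", []
    for el in root.iter():  # document order
        if el.tag == HEADING:
            if heading or body:
                sections.append((heading, "\n".join(body)))
            heading, body = "".join(el.itertext()).strip(), []
        elif el.tag == PARAGRAPH:
            text = "".join(el.itertext()).strip()  # flattens nested spans
            if text:
                body.append(text)
    if heading or body:
        sections.append((heading, "\n".join(body)))
    return sections
```

Each `(heading, body_text)` pair maps naturally onto one chunk, with the heading doubling as the section name used in citations.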

Implementation note — Pages: The .pages format is a zip containing index.xml (older format) or Index/Document.iwa (newer Protobuf-based format, 2013+). The XML path is simpler to implement first; the Protobuf path is a stretch goal. Fallback: if neither parse succeeds, log a warning and skip.
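
Choosing between the two Pages paths can be done by inspecting the archive listing before any parsing. A sketch under the assumptions stated in the note (member names `index.xml` and `Index/Document.iwa`); `classify_pages` and the logger name are hypothetical.

```python
import logging
import zipfile
from typing import Optional

log = logging.getLogger("turnstone.corpus")  # hypothetical logger name


def classify_pages(path) -> Optional[str]:
    """Return 'xml' (older layout), 'iwa' (Protobuf layout), or None to skip."""
    try:
        with zipfile.ZipFile(path) as zf:
            names = set(zf.namelist())
    except (zipfile.BadZipFile, OSError):
        # Not a zip at all, or unreadable: follow the fallback and skip.
        log.warning("skipping %s: not a readable zip archive", path)
        return None

    if "index.xml" in names:
        return "xml"   # parse with ElementTree
    if "Index/Document.iwa" in names:
        return "iwa"   # stretch goal: Protobuf-based parser
    log.warning("skipping %s: no recognizable Pages payload", path)
    return None
```

With this split, the XML parser can ship first and the `'iwa'` branch can log and skip until the Protobuf path exists.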

Corpus sources (extensible, not hardcoded)

Turnstone should support any local directory as a corpus. First target for validation:

circuitforge-ops/docs/ops/ — service inventory, runbooks, restart procedures, log locations

Other sources this pattern applies to:

Any MkDocs or static doc site on the local filesystem
Product README trees
PDF/DOCX/ODT technical reference material
Apple Pages documents from macOS users
RTF files exported from any word processor

Implementation reference

Pagepiper solves the same problem for PDF rulebooks — same architecture applies here:

Chunk and embed docs at ingest time
Store vectors in a local vector DB alongside the chunk text
At diagnosis time: embed the log context, retrieve top-k chunks, inject as prompt context
Incremental re-index on file change (watch or manual trigger)
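
The ingest and retrieval steps above can be sketched end to end with stand-ins. Two loud assumptions: `embed` is a toy bag-of-words vector, not a real embedding model, and `ChunkStore` is an in-memory list, not a vector DB. Neither reflects how Pagepiper or Turnstone actually implement this; only the shape of the flow carries over.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase token counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ChunkStore:
    """In-memory stand-in for the local vector DB."""

    def __init__(self):
        self._rows = []  # (vector, source, chunk_text) stored side by side

    def add(self, source, text):
        # Ingest time: embed once and keep the vector next to the chunk text.
        self._rows.append((embed(text), source, text))

    def top_k(self, query, k=3):
        # Diagnosis time: embed the log context and rank stored chunks.
        q = embed(query)
        scored = [(cosine(q, vec), src, text) for vec, src, text in self._rows]
        scored = [row for row in scored if row[0] > 0]  # drop unrelated chunks
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:k]
```

Usage: call `add` per chunk during ingest, then `top_k(log_context)` at diagnosis time and inject the returned texts into the prompt. An empty result means no relevant docs were found.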

Acceptance Criteria

POST /api/corpus/sources — register a local directory as a reference corpus
Ingest pipeline: chunk Markdown, PDF, DOCX, ODT, RTF, and Pages docs; embed; store in local vector DB
PDF extraction: text + section headers preserved as chunk boundaries
DOCX extraction: heading hierarchy used as chunk boundaries
ODT extraction: heading styles used as chunk boundaries (port from Peregrine ODT parser)
RTF extraction: strip formatting, extract plain text
Pages extraction: XML path implemented; Protobuf path (newer format) as stretch goal
Diagnosis pipeline queries reference corpus when analyzing log events
Recommendations cite source doc and section
Corpus entries are filterable/browsable separately from log entries
Incremental re-index on file change (inotify or poll)
Generic fallback when no relevant docs found (no regression on current behavior)
Unknown/unsupported formats logged as warnings and skipped cleanly
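
For the incremental re-index criterion, the poll variant needs little more than a snapshot and a comparison. A sketch with hypothetical names (`scan`, `diff_states`); it treats a file as changed when its modification time or size differs and does not hash contents.

```python
import os


def scan(root):
    """Snapshot every file under root as path -> (mtime_ns, size)."""
    state = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            state[path] = (st.st_mtime_ns, st.st_size)
    return state


def diff_states(old, new):
    """Return (changed_or_added, removed) paths between two snapshots."""
    changed = sorted(p for p in new if old.get(p) != new[p])
    removed = sorted(p for p in old if p not in new)
    return changed, removed
```

On each poll, re-ingest the `changed` paths and delete chunks for the `removed` ones, then keep the new snapshot for the next comparison.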

Out of scope for this ticket

Live web scraping of remote sites (post-launch backlog)
Automatic corpus discovery (operator registers sources explicitly)
Format-specific parsing beyond text extraction (table extraction, embedded images)
Newer Apple Pages Protobuf format (stretch goal, not blocking)

pyr0ball changed title from ~~feat: ingest circuitforge-ops data as a corpus source~~ to feat: reference doc layer — ingest local structured docs for context-aware diagnosis

2026-05-17 12:05:55 -07:00

pyr0ball changed title from ~~feat: reference doc layer — ingest local structured docs for context-aware diagnosis~~ to feat: reference doc layer — ingest local structured docs (Markdown, PDF, DOCX) for context-aware diagnosis

2026-05-17 12:21:56 -07:00

pyr0ball commented

2026-05-26 23:05:53 -07:00

Author

Owner

This issue is also a prerequisite for the Enterprise POC Deliverable milestone (#90). PDF/DOCX ingestion is specifically needed for the site pilot reference document layer. See #43 (AVCX parser) for the companion format work.

pyr0ball commented

2026-05-28 08:20:31 -07:00

Author

Owner

Shipped as part of the context RAG system:

app/context/store.py — document + chunk CRUD
app/context/chunker.py — format detection, fact extraction, text chunking
app/glean/doc_upload.py — upload adapter
POST /turnstone/api/context/docs — REST endpoint
scripts/harvest_docs.py — generalized bulk-upload script with manifest support
scripts/manifests/heimdall-devops.yaml — cluster-specific manifest (10 docs, 44 chunks ingested)
scripts/manifests/example.yaml — template for other deployments

Docs are chunked at ingest and retrieved by vector search at diagnosis time. The RAG is live and load-bearing.

pyr0ball closed this issue

2026-05-28 08:20:31 -07:00

pyr0ball referenced this issue from a commit

2026-05-29 14:18:48 -07:00

feat(diagnose): tech-level post-processor, offline mode, API auth, context harvest

pyr0ball referenced this issue from a commit

2026-06-13 21:54:47 -07:00

feat(diagnose): tech-level post-processor, offline mode, API auth, context harvest

pyr0ball referenced this issue from a commit

2026-06-13 22:18:24 -07:00

feat(diagnose): tech-level post-processor, offline mode, API auth, context harvest