feat: reference doc layer — ingest local structured docs (Markdown, PDF, DOCX) for context-aware diagnosis #21

Closed
opened 2026-05-17 11:33:28 -07:00 by pyr0ball · 2 comments
Owner

Summary

Add a reference document ingestion layer to Turnstone, analogous to how Pagepiper does RAG on local PDFs. When Turnstone identifies a log pattern or anomaly, it should be able to cross-reference an indexed corpus of local docs (runbooks, MkDocs sites, Markdown reference material, PDF/DOCX technical documents) to surface specific recommended actions rather than generic responses.

Motivation

Currently Turnstone diagnoses based on log content alone. Many recurring patterns have known resolutions documented somewhere — a MkDocs runbook, a service CLAUDE.md, a troubleshooting guide — but that knowledge is not connected to the diagnosis pipeline. This feature closes that gap: known patterns map to specific runbook sections, and unknown patterns at least get relevant context injected into the LLM prompt.

PDF and DOCX support is a hard requirement for enterprise use cases where reference material arrives as proprietary formatted documents rather than plain text.

How it should work

Log event / anomaly detected
  → Pattern matcher checks event against known signatures
  → Vector search over reference corpus for relevant sections
  → Ranked doc chunks injected as context for diagnosis
  → Recommendation cites source doc and section

The reference layer is read-only and separate from the log index. Docs are reference material, not event streams — they should not trigger anomaly detection.
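
The retrieval half of the flow above could be sketched roughly as follows. This is a minimal illustration, not Turnstone's implementation: `Chunk`, `embed`, `retrieve`, and `build_prompt_context` are hypothetical names, and the bag-of-words cosine similarity is only a stand-in for a real embedding model and vector DB. The pattern-matcher step is left out.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    doc: str       # source document path
    section: str   # heading the chunk was found under
    text: str


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(log_context: str, corpus: list[Chunk], k: int = 3) -> list[tuple[float, Chunk]]:
    """Rank chunks against the log context and keep the top k with a non-zero score."""
    query = embed(log_context)
    scored = [(cosine(query, embed(c.text)), c) for c in corpus]
    scored = [(s, c) for s, c in scored if s > 0]
    return sorted(scored, key=lambda sc: sc[0], reverse=True)[:k]


def build_prompt_context(hits: list[tuple[float, Chunk]]) -> str:
    """Format retrieved chunks so the diagnosis can cite source doc and section."""
    if not hits:
        return ""  # caller falls back to the generic, log-only diagnosis
    return "\n\n".join(f"[{c.doc} § {c.section}]\n{c.text}" for _, c in hits)
```

Returning an empty context when nothing scores above zero keeps the no-docs case identical to the current log-only behavior.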

Supported corpus formats

| Format | Extension(s) | Library | Notes |
|--------|--------------|---------|-------|
| Markdown / plain text | .md, .txt, .rst | stdlib | Primary format: MkDocs sites, READMEs, runbooks |
| PDF | .pdf | pymupdf (preferred) or pdfminer | Preserve section structure and heading hierarchy where possible |
| DOCX | .docx | python-docx | Use heading styles as chunk boundaries |
| ODT | .odt | odfpy or stdlib zipfile + ElementTree | OpenDocument Text (LibreOffice/OpenOffice); same XML-in-zip structure as DOCX |
| RTF | .rtf | striprtf | Rich Text Format; common on macOS (TextEdit default, Mail attachments) |
| Apple Pages | .pages | stdlib zipfile + ElementTree | Pages files are zip archives containing index.xml; extract body text from sf:p elements |

Format detection is automatic from file extension. Unknown formats are skipped with a warning logged.
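
Extension-based detection could look something like the sketch below. The extension list comes from the table above; the function names, the parser-family labels, and the logger name are assumptions made for illustration.

```python
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger("turnstone.corpus")  # logger name is an assumption

# Extension -> parser family, taken from the supported-formats table.
FORMATS = {
    ".md": "text", ".txt": "text", ".rst": "text",
    ".pdf": "pdf",
    ".docx": "docx",
    ".odt": "odt",
    ".rtf": "rtf",
    ".pages": "pages",
}


def detect_format(path: Path) -> Optional[str]:
    """Return the parser family for a file, or None if the extension is unsupported."""
    fmt = FORMATS.get(path.suffix.lower())
    if fmt is None:
        logger.warning("skipping unsupported format: %s", path)
    return fmt


def plan_ingest(root: Path) -> dict[Path, str]:
    """Walk a corpus directory and map each supported file to its parser family."""
    plan = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            fmt = detect_format(path)
            if fmt is not None:
                plan[path] = fmt
    return plan
```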

Implementation note — ODT: Peregrine already has a working ODT parser using stdlib zipfile + ElementTree only (no odfpy dependency). Port that approach before reaching for odfpy.
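
For reference, a stdlib-only ODT reader along these lines might look like the sketch below. It is not the Peregrine parser, which is not shown in this issue; `odt_sections` is a hypothetical name, and the sketch assumes body text lives in `text:h` and `text:p` elements of `content.xml`.

```python
import zipfile
import xml.etree.ElementTree as ET

# OpenDocument "text" namespace, used by headings (text:h) and paragraphs (text:p).
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
HEADING = f"{{{TEXT_NS}}}h"
PARAGRAPH = f"{{{TEXT_NS}}}p"


def odt_sections(source) -> list[tuple[str, str]]:
    """Return (heading, body) pairs from an ODT file path or file-like object."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    # The first entry collects any text that appears before the first heading.
    sections: list[tuple[str, list[str]]] = [("", [])]
    for element in root.iter():
        if element.tag == HEADING:
            # Each heading starts a new chunk.
            sections.append(("".join(element.itertext()).strip(), []))
        elif element.tag == PARAGRAPH:
            text = "".join(element.itertext()).strip()
            if text:
                sections[-1][1].append(text)

    # Drop the leading placeholder if the document starts with a heading.
    return [(h, "\n".join(body)) for h, body in sections if h or body]
```

Paragraphs nested inside other paragraphs (for example text boxes) would be emitted twice by this simple walk; a production parser would need to handle that case.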

Implementation note — Pages: The .pages format is a zip containing index.xml (older format) or Index/Document.iwa (newer Protobuf-based format, 2013+). The XML path is simpler to implement first; the Protobuf path is a stretch goal. Fallback: if neither parse succeeds, log a warning and skip.
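
The XML-first strategy with a clean fallback might be sketched as below. `pages_text` and the logger name are hypothetical. The sketch matches paragraph elements by local name `p` rather than hardcoding the namespace URI behind the `sf:` prefix, which is a simplification that could also pick up `p` elements from other namespaces.

```python
import logging
import zipfile
import xml.etree.ElementTree as ET
from typing import Optional

logger = logging.getLogger("turnstone.corpus")  # logger name is an assumption


def pages_text(source) -> Optional[str]:
    """Extract body text from a .pages archive, or return None if it cannot be parsed."""
    try:
        with zipfile.ZipFile(source) as archive:
            names = set(archive.namelist())
            if "index.xml" in names:
                # Older XML-based format: collect text from paragraph elements.
                root = ET.fromstring(archive.read("index.xml"))
                paragraphs = [
                    "".join(el.itertext()).strip()
                    for el in root.iter()
                    if el.tag.rsplit("}", 1)[-1] == "p"
                ]
                return "\n".join(p for p in paragraphs if p)
            if "Index/Document.iwa" in names:
                # Newer Protobuf-based format: stretch goal, skipped for now.
                logger.warning("Protobuf-based Pages file not supported yet: %s", source)
                return None
    except (zipfile.BadZipFile, ET.ParseError) as exc:
        logger.warning("could not parse Pages file %s: %s", source, exc)
        return None
    logger.warning("unrecognised Pages layout, skipping: %s", source)
    return None
```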

Corpus sources (extensible, not hardcoded)

Turnstone should support any local directory as a corpus. First target for validation:

  • circuitforge-ops/docs/ops/ — service inventory, runbooks, restart procedures, log locations

Other sources this pattern applies to:

  • Any MkDocs or static doc site on the local filesystem
  • Product README trees
  • PDF/DOCX/ODT technical reference material
  • Apple Pages documents from macOS users
  • RTF files exported from any word processor

Implementation reference

Pagepiper solves the same problem for PDF rulebooks, and the same architecture applies here:

  • Chunk and embed docs at ingest time
  • Store vectors in a local vector DB alongside the chunk text
  • At diagnosis time: embed the log context, retrieve top-k chunks, inject as prompt context
  • Incremental re-index on file change (watch or manual trigger)
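
One way to approach the incremental re-index step is to poll the corpus directory and compare content hashes, so that only added or changed files are re-chunked and re-embedded. The sketch below is an assumption about how that could work, not Pagepiper's or Turnstone's actual mechanism; `snapshot` and `diff_snapshots` are hypothetical names.

```python
import hashlib
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to a SHA-256 of its contents."""
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_snapshots(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify files so only new or modified documents are re-indexed."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(p for p in new.keys() & old.keys() if new[p] != old[p]),
    }
```

Hashing contents rather than comparing modification times avoids re-embedding files that were touched but not edited, at the cost of reading every file on each poll.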

Acceptance Criteria

  • POST /api/corpus/sources — register a local directory as a reference corpus
  • Ingest pipeline: chunk Markdown, PDF, DOCX, ODT, RTF, and Pages docs; embed; store in local vector DB
  • PDF extraction: text + section headers preserved as chunk boundaries
  • DOCX extraction: heading hierarchy used as chunk boundaries
  • ODT extraction: heading styles used as chunk boundaries (port from Peregrine ODT parser)
  • RTF extraction: strip formatting, extract plain text
  • Pages extraction: XML path implemented; Protobuf path (newer format) as stretch goal
  • Diagnosis pipeline queries reference corpus when analyzing log events
  • Recommendations cite source doc and section
  • Corpus entries are filterable/browsable separately from log entries
  • Incremental re-index on file change (inotify or poll)
  • Generic fallback when no relevant docs found (no regression on current behavior)
  • Unknown/unsupported formats logged as warnings and skipped cleanly
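
The heading-as-chunk-boundary criteria above can be illustrated with Markdown, the simplest of the formats. The sketch below is a hypothetical helper (`chunk_markdown`, `Section`), not the shipped chunker; it handles only ATX (`#`) headings and keeps the heading trail so a recommendation can cite document and section.

```python
import re
from dataclasses import dataclass

HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
FENCE = "`" * 3  # code-fence marker


@dataclass
class Section:
    path: list[str]  # heading trail, e.g. ["Ops", "Restart"]
    text: str


def chunk_markdown(source: str) -> list[Section]:
    """Split Markdown on ATX headings, keeping the heading trail for citations."""
    sections: list[Section] = []
    trail: list[tuple[int, str]] = []  # (level, title) of the enclosing headings
    body: list[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            sections.append(Section(path=[title for _, title in trail], text=text))
        body.clear()

    for line in source.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # '#' inside a code fence is not a heading
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Leave any sections at this level or deeper before entering the new one.
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return sections
```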

Out of scope for this ticket

  • Live web scraping of remote sites (post-launch backlog)
  • Automatic corpus discovery (operator registers sources explicitly)
  • Format-specific parsing beyond text extraction (table extraction, embedded images)
  • Newer Apple Pages Protobuf format (stretch goal, not blocking)
pyr0ball changed title from feat: ingest circuitforge-ops data as a corpus source to feat: reference doc layer — ingest local structured docs for context-aware diagnosis 2026-05-17 12:05:55 -07:00
pyr0ball changed title from feat: reference doc layer — ingest local structured docs for context-aware diagnosis to feat: reference doc layer — ingest local structured docs (Markdown, PDF, DOCX) for context-aware diagnosis 2026-05-17 12:21:56 -07:00
Author
Owner

This issue is also a prerequisite for the Enterprise POC Deliverable milestone (#90). PDF/DOCX ingestion is specifically needed for the site pilot reference document layer. See #43 (AVCX parser) for the companion format work.

Author
Owner

Shipped as part of the context RAG system:

  • app/context/store.py — document + chunk CRUD
  • app/context/chunker.py — format detection, fact extraction, text chunking
  • app/glean/doc_upload.py — upload adapter
  • POST /turnstone/api/context/docs — REST endpoint
  • scripts/harvest_docs.py — generalized bulk-upload script with manifest support
  • scripts/manifests/heimdall-devops.yaml — cluster-specific manifest (10 docs, 44 chunks ingested)
  • scripts/manifests/example.yaml — template for other deployments

Docs are chunked and vector-retrieved at diagnose time. The RAG is live and load-bearing.

Reference: Circuit-Forge/turnstone#21