feat: reference doc layer — ingest local structured docs (Markdown, PDF, DOCX) for context-aware diagnosis #21
Reference: Circuit-Forge/turnstone#21
Summary
Add a reference document ingestion layer to Turnstone, analogous to how Pagepiper does RAG on local PDFs. When Turnstone identifies a log pattern or anomaly, it should be able to cross-reference an indexed corpus of local docs (runbooks, MkDocs sites, Markdown reference material, PDF/DOCX technical documents) to surface specific recommended actions rather than generic responses.
Motivation
Currently Turnstone diagnoses based on log content alone. Many recurring patterns have known resolutions documented somewhere — a MkDocs runbook, a service CLAUDE.md, a troubleshooting guide — but that knowledge is not connected to the diagnosis pipeline. This feature closes that gap: known patterns map to specific runbook sections, and unknown patterns at least get relevant context injected into the LLM prompt.
PDF and DOCX support is a hard requirement for enterprise use cases where reference material arrives as proprietary formatted documents rather than plain text.
How it should work
The reference layer is read-only and separate from the log index. Docs are reference material, not event streams — they should not trigger anomaly detection.
Supported corpus formats
.md, .txt, .rst
.pdf: pymupdf (preferred) or pdfminer
.docx: python-docx
.odt: odfpy, or stdlib zipfile + ElementTree
.rtf: striprtf
.pages: zipfile + ElementTree (index.xml; extract body text from sf:p elements)
Format detection is automatic from file extension. Unknown formats are skipped with a warning logged.
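A minimal sketch of extension-based dispatch, assuming a simple lookup table from suffix to extractor function. The names `extract_text`, `EXTRACTORS`, `read_plain`, `read_odt`, and the logger name are hypothetical and not taken from the Turnstone codebase. Only the plain-text and stdlib ODT handlers are filled in; the other formats would be registered in the same table.

```python
import logging
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Callable, Optional

log = logging.getLogger("turnstone.corpus")  # hypothetical logger name

# OpenDocument text namespace, used for <text:p> paragraphs in content.xml.
ODT_TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def read_plain(path: Path) -> str:
    """Read .md/.txt/.rst as UTF-8 text, replacing undecodable bytes."""
    return path.read_text(encoding="utf-8", errors="replace")


def read_odt(path: Path) -> str:
    """Extract paragraph text from an ODT file using only the stdlib."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("content.xml"))
    paragraphs = ("".join(p.itertext()) for p in root.iter(f"{{{ODT_TEXT_NS}}}p"))
    return "\n".join(text for text in paragraphs if text.strip())


# Extension -> extractor. PDF, DOCX, RTF and Pages handlers would be
# registered here as well; they are omitted to keep the sketch short.
EXTRACTORS: dict[str, Callable[[Path], str]] = {
    ".md": read_plain,
    ".txt": read_plain,
    ".rst": read_plain,
    ".odt": read_odt,
}


def extract_text(path: Path) -> Optional[str]:
    """Return extracted text, or None (with a warning) if the format is unknown."""
    extractor = EXTRACTORS.get(path.suffix.lower())
    if extractor is None:
        log.warning("skipping %s: unsupported format %r", path, path.suffix)
        return None
    return extractor(path)
```

A table keyed on the lowercased suffix keeps the skip-with-warning behaviour in one place, so adding a format means adding one entry rather than another branch.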
Implementation note (ODT): Peregrine already has a working ODT parser using stdlib zipfile + ElementTree only (no odfpy dependency). Port that approach before reaching for odfpy.
Implementation note (Pages): The .pages format is a zip containing index.xml (older format) or Index/Document.iwa (newer Protobuf-based format, 2013+). The XML path is simpler to implement first; the Protobuf path is a stretch goal. Fallback: if neither parse succeeds, log a warning and skip.
Corpus sources (extensible, not hardcoded)
Turnstone should support any local directory as a corpus. First target for validation:
circuitforge-ops/docs/ops/: service inventory, runbooks, restart procedures, log locations
Other sites this pattern applies to:
Implementation reference
Pagepiper solves the same problem for PDF rulebooks — same architecture applies here:
Acceptance Criteria
POST /api/corpus/sources: register a local directory as a reference corpus
Out of scope for this ticket
Title changed from "feat: ingest circuitforge-ops data as a corpus source" to "feat: reference doc layer — ingest local structured docs for context-aware diagnosis".
Title changed from "feat: reference doc layer — ingest local structured docs for context-aware diagnosis" to "feat: reference doc layer — ingest local structured docs (Markdown, PDF, DOCX) for context-aware diagnosis".
This issue is also a prerequisite for the Enterprise POC Deliverable milestone (#90). PDF/DOCX ingestion is specifically needed for the site pilot reference document layer. See #43 (AVCX parser) for the companion format work.
Shipped as part of the context RAG system:
app/context/store.py: document + chunk CRUD
app/context/chunker.py: format detection, fact extraction, text chunking
app/glean/doc_upload.py: upload adapter
POST /turnstone/api/context/docs: REST endpoint
scripts/harvest_docs.py: generalized bulk-upload script with manifest support
scripts/manifests/heimdall-devops.yaml: cluster-specific manifest (10 docs, 44 chunks ingested)
scripts/manifests/example.yaml: template for other deployments
Docs are chunked and vector-retrieved at diagnose time. The RAG is live and load-bearing.
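To illustrate what "chunked and vector-retrieved" can mean in practice, here is a toy sketch. It is not the code in app/context/chunker.py, which is not shown in this issue. All function names are hypothetical, and a term-count vector stands in for whatever embedding model the real system uses.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_chars: int = 400) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in re.split(r"\n\s*\n", text)):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def vectorize(text: str) -> Counter:
    """Stand-in for an embedding model (assumption): lowercase term counts."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return up to k chunks ranked by similarity to the query, dropping zero-score chunks."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:k] if score > 0]
```

At diagnose time, the query would be derived from the detected log pattern and the returned chunks injected into the LLM prompt as context. Dropping zero-score chunks avoids padding the prompt with unrelated material when nothing in the corpus matches.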