Ingest pipeline scrape logs from shared dir into log corpus #67

Closed

opened 2026-05-17 11:23:21 -07:00 by pyr0ball · 0 comments

pyr0ball commented

2026-05-17 11:23:21 -07:00

Owner

Summary

Avocet should be able to ingest structured pipeline log files from the shared log directory /Library/Assets/logs/pipeline/ into the log corpus for Turnstone logreading model training. This is the pull-side companion to kiwi#141.

Shared log directory

/Library/Assets/logs/pipeline/

Files are JSONL, one per scrape run, named <script>_<ts>.jsonl. Each line is a structured log record:

{"ts": "...", "level": "INFO", "logger": "scripts.pipeline...", "msg": "...", "extra": {...}}
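As a rough illustration of the record format above, here is a minimal parsing sketch. The helper names `split_run_filename` and `parse_record` are hypothetical, and two things are assumed rather than taken from this issue: that `<ts>` contains no underscore (so the file name can be split at the last `_`), and that `extra` may be absent.

```python
import json
from pathlib import Path

# Keys every record is expected to carry. Treating "extra" as optional
# is an assumption; the issue only shows it present.
REQUIRED_KEYS = {"ts", "level", "logger", "msg"}


def split_run_filename(path):
    """Split '<script>_<ts>.jsonl' into (script, ts) at the last underscore.

    Assumes <ts> itself contains no underscore.
    """
    stem = Path(path).stem
    script, sep, ts = stem.rpartition("_")
    if not sep or not script or not ts:
        raise ValueError(f"unexpected pipeline log file name: {path}")
    return script, ts


def parse_record(line):
    """Parse one JSONL line; return None for blank or malformed lines."""
    line = line.strip()
    if not line:
        return None
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict) or not REQUIRED_KEYS.issubset(record):
        return None
    record.setdefault("extra", {})
    return record
```

Returning `None` instead of raising lets a caller skip a bad line without abandoning the rest of the file.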

Implementation options

1. Manual import: CLI or admin endpoint that walks the dir and POSTs batches to the existing POST /api/log-corpus endpoint
2. Watched ingest task: Background task that polls /Library/Assets/logs/pipeline/ for new files and auto-ingests (similar to how the email corpus works)

Option 1 is simpler for now; Option 2 is better long-term if scrape runs happen frequently.
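A minimal sketch of the manual import path might look like the following. Everything here is illustrative: `iter_batches`, `import_dir` and `make_http_poster` are hypothetical names, the batch size is arbitrary, and the request body shape (`{"records": [...]}`) is an assumption, since this issue does not describe the payload that POST /api/log-corpus expects.

```python
import json
import urllib.request
from pathlib import Path


def iter_batches(log_dir, batch_size=500):
    """Yield (path, batch) pairs; a batch never spans two files."""
    for path in sorted(Path(log_dir).glob("*.jsonl")):
        batch = []
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                try:
                    batch.append(json.loads(line))
                except json.JSONDecodeError:
                    continue  # skip malformed lines instead of aborting the run
                if len(batch) >= batch_size:
                    yield path, batch
                    batch = []
        if batch:
            yield path, batch


def import_dir(log_dir, post_batch, batch_size=500):
    """Send every record under log_dir through post_batch; return the count."""
    total = 0
    for _path, batch in iter_batches(log_dir, batch_size):
        post_batch(batch)
        total += len(batch)
    return total


def make_http_poster(base_url):
    """Build a post_batch callable targeting POST /api/log-corpus.

    The JSON body shape below is an assumption, not the documented contract.
    """
    url = base_url.rstrip("/") + "/api/log-corpus"

    def post_batch(records):
        body = json.dumps({"records": records}).encode("utf-8")
        req = urllib.request.Request(
            url,
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            resp.read()

    return post_batch
```

Passing `post_batch` in as a callable keeps the directory walk testable without a running server. A watched ingest task could call `import_dir` on a timer.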

Label assignment

Pipeline log lines should get a default label (e.g. pipeline_scrape) so they are kept separate from app/service logs in the Turnstone training split. The label schema in app/data/log_corpus.py may need a new source type.
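One way the default label could be attached is sketched below. The actual schema in app/data/log_corpus.py is not shown in this issue, so the `LogSource` enum, the `source` field name, and the `APP` member are all hypothetical stand-ins.

```python
from collections import defaultdict
from enum import Enum


class LogSource(str, Enum):
    """Hypothetical source types; the real schema may differ."""

    APP = "app"
    PIPELINE_SCRAPE = "pipeline_scrape"


def with_default_label(record, source=LogSource.PIPELINE_SCRAPE):
    """Return a copy of record with a source label, keeping any existing one."""
    labelled = dict(record)
    labelled.setdefault("source", source.value)
    return labelled


def group_by_source(records):
    """Group records by source so pipeline logs can be split out separately."""
    groups = defaultdict(list)
    for record in records:
        groups[record.get("source", LogSource.APP.value)].append(record)
    return dict(groups)
```

Labelling at ingest time, rather than inferring the source later from the `logger` field, keeps the training split independent of logger naming.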

Notes

- Kiwi side tracked in kiwi#141
- The shared dir is NFS-mounted on both Heimdall and Sif at /Library/Assets/
- Keep ingestion idempotent: track which files have already been ingested (by filename or content hash)
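The idempotency note could be implemented roughly as follows, using the content-hash variant. This is a sketch under stated assumptions: `IngestLedger` and `file_digest` are hypothetical names, and storing the seen hashes in a small JSON file is just one option (a database table would serve equally well).

```python
import hashlib
import json
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


class IngestLedger:
    """Tracks content hashes of files that were already ingested."""

    def __init__(self, ledger_path):
        self.ledger_path = Path(ledger_path)
        if self.ledger_path.exists():
            text = self.ledger_path.read_text(encoding="utf-8")
            self.seen = set(json.loads(text))
        else:
            self.seen = set()

    def pending(self, log_dir):
        """Yield (path, digest) for each .jsonl file not yet ingested."""
        for path in sorted(Path(log_dir).glob("*.jsonl")):
            digest = file_digest(path)
            if digest not in self.seen:
                yield path, digest

    def mark_done(self, digest):
        """Record a digest and persist the ledger to disk."""
        self.seen.add(digest)
        self.ledger_path.write_text(
            json.dumps(sorted(self.seen)), encoding="utf-8"
        )
```

Hashing content rather than trusting the file name means a renamed file is not ingested twice. The trade-off is that a file hashed while a scrape run is still writing it will hash differently once complete and be picked up again, so partially written files may need to be skipped or de-duplicated per record.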
