Pipeline scripts: write structured logs to shared dir for Turnstone training #141

Open
opened 2026-05-17 11:23:20 -07:00 by pyr0ball · 0 comments
Owner

Summary

Pipeline scrape scripts (discover_wayback.py, scrape_recipes.py, and future scrapers) should emit structured log lines to a shared directory so Avocet can ingest them as Turnstone logreading training data.

Shared log directory

/Library/Assets/logs/pipeline/

One JSONL file per run, named by script + timestamp:

discover_wayback_20260517T1100.jsonl
scrape_recipes_20260517T1102.jsonl

Log line schema

Each line should be a JSON object matching Turnstone's expected format:

{"ts": "2026-05-17T11:02:34Z", "level": "INFO", "logger": "scripts.pipeline.purple_carrot.scrape_recipes", "msg": "Scraping slug thai-basil-fried-rice (12/158)", "extra": {"slug": "thai-basil-fried-rice", "idx": 12, "total": 158}}
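As a rough illustration of the schema above, the sketch below parses one line and checks its shape. The issue does not say which fields are mandatory, so treating `ts`, `level`, `logger`, and `msg` as required strings and `extra` as an optional object is an assumption, and `parse_log_line` is a hypothetical name rather than anything from Turnstone or Avocet.

```python
import json

# Assumption: these four fields are required strings. The issue only shows
# one example line and does not spell out which fields are mandatory.
REQUIRED_FIELDS = ("ts", "level", "logger", "msg")


def parse_log_line(line):
    """Parse one JSONL line and check it has the shape shown in the issue."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("log line must be a JSON object")
    for field in REQUIRED_FIELDS:
        if not isinstance(record.get(field), str):
            raise ValueError(f"missing or non-string field: {field}")
    # Assumption: "extra" may be absent, but must be an object when present.
    if not isinstance(record.get("extra", {}), dict):
        raise ValueError("extra must be a JSON object")
    return record


# The example line from this issue.
sample = (
    '{"ts": "2026-05-17T11:02:34Z", "level": "INFO", '
    '"logger": "scripts.pipeline.purple_carrot.scrape_recipes", '
    '"msg": "Scraping slug thai-basil-fried-rice (12/158)", '
    '"extra": {"slug": "thai-basil-fried-rice", "idx": 12, "total": 158}}'
)
record = parse_log_line(sample)
```

A check like this could run in a test for each scraper so that schema drift is caught before Avocet ingests the files.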

Implementation

Add a _setup_pipeline_log(script_name) helper in scripts/pipeline/utils.py (create if needed):

  • Adds a logging.FileHandler pointing at /Library/Assets/logs/pipeline/<script_name>_<ts>.jsonl
  • Uses a JSON formatter so each line is parseable
  • Is called at the top of main() in each pipeline script, alongside the existing logging.basicConfig call

Notes

  • Existing logging.basicConfig to stderr stays unchanged (human-readable dev output)
  • The shared dir is the long-term datastore, on the same mount as /Library/Assets/kiwi/pipeline/
  • Avocet ingestion tracked in avocet#67
pyr0ball added the enhancement and backlog labels 2026-06-01 12:11:31 -07:00
pyr0ball added this to the Post-Launch milestone 2026-06-01 12:11:31 -07:00
Reference: Circuit-Forge/kiwi#141