avocet/app/data
pyr0ball 9bb88b168f feat(corpus): pipeline log ingest from shared dir (closes #67)
Pull-side companion to kiwi#141. Ingests structured JSONL pipeline logs
from /Library/Assets/logs/pipeline/ into the log corpus used to train the
Turnstone log-reading model.

- app/data/log_corpus.py: add ingested_pipeline_files tracking table,
  _pipeline_ingest_dir() config helper, _ingest_one_file() parser, and
  POST /api/corpus/pipeline-ingest endpoint
- Row mapping: source_host = "pipeline_scrape"; source_id taken from the
  logger field; extra dict stored as matched_patterns;
  batch_type = "pipeline_log"
- Idempotent by filename: skips files already in ingested_pipeline_files
- config/label_tool.yaml.example: add corpus section with pipeline_ingest_dir
  and push sources comment block
- tests/test_log_corpus.py: 8 new tests covering ingest, idempotency,
  non-JSONL filtering, malformed line resilience, incremental runs
2026-05-17 11:28:33 -07:00
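
A minimal sketch of the filename-idempotent ingest described in the commit message above, not the actual contents of log_corpus.py. Only the ingested_pipeline_files table name and the source_host, source_id, batch_type and matched_patterns values come from the commit message. The function name ingest_pipeline_dir, the log_entries table and its columns, and the "message" field are hypothetical stand-ins for illustration.

```python
import json
import sqlite3
from pathlib import Path


def ingest_pipeline_dir(conn: sqlite3.Connection, ingest_dir: Path) -> dict:
    """Ingest each new *.jsonl file in ingest_dir once, keyed by filename."""
    # Tracking table named in the commit message; a single column is assumed.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ingested_pipeline_files "
        "(filename TEXT PRIMARY KEY)"
    )
    # Hypothetical corpus table: the real schema is not shown in the commit.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS log_entries ("
        "source_host TEXT, source_id TEXT, batch_type TEXT, "
        "message TEXT, matched_patterns TEXT)"
    )
    seen = {
        row[0]
        for row in conn.execute("SELECT filename FROM ingested_pipeline_files")
    }
    stats = {"files": 0, "rows": 0, "skipped_lines": 0}

    # Only *.jsonl files are considered; anything else in the dir is ignored.
    for path in sorted(ingest_dir.glob("*.jsonl")):
        if path.name in seen:
            continue  # idempotent by filename: already ingested on a prior run
        for line in path.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue  # blank lines are skipped without being counted
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                stats["skipped_lines"] += 1  # malformed line: keep going
                continue
            if not isinstance(record, dict):
                stats["skipped_lines"] += 1  # valid JSON but not an object
                continue
            conn.execute(
                "INSERT INTO log_entries VALUES (?, ?, ?, ?, ?)",
                (
                    "pipeline_scrape",                     # source_host
                    str(record.get("logger", "")),         # source_id
                    "pipeline_log",                        # batch_type
                    str(record.get("message", "")),        # assumed field name
                    json.dumps(record.get("extra", {})),   # matched_patterns
                ),
            )
            stats["rows"] += 1
        conn.execute(
            "INSERT INTO ingested_pipeline_files VALUES (?)", (path.name,)
        )
        stats["files"] += 1

    conn.commit()
    return stats
```

Calling the function a second time on an unchanged directory ingests nothing, because every filename is already recorded in ingested_pipeline_files. Keying on filename alone is simple, but it also means a file that is appended to after ingest would not be re-read under this sketch.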
__init__.py feat: extract label queue API into app/data/label.py 2026-05-01 18:48:14 -07:00
corrections.py fix: align train job/results API envelope, config_json key, progress SSE, dashboard model_key 2026-05-02 21:22:18 -07:00
fetch.py feat: extract fetch routes and IMAP helpers into app/data/fetch.py 2026-05-01 21:57:31 -07:00
imitate.py feat(imitate): task-model assignment routing via cf-orch 2026-05-17 11:23:55 -07:00
label.py feat: extract label queue API into app/data/label.py 2026-05-01 18:48:14 -07:00
log_corpus.py feat(corpus): pipeline log ingest from shared dir (closes #67) 2026-05-17 11:28:33 -07:00