kiwi/scripts
pyr0ball 56f942b3fd feat(pipeline): Purple Carrot scraper hardening + shared pipeline logging
scrape_recipes.py:
- Switch CDX to HTTPS (avoids HTTP 503 rate-limit bucket)
- Restrict product API CDX to 2019–2021 window (pre-HelloFresh instruction stripping)
- Replace inline CDX requests with _cdx_get() helper: retries on 429/503 with
  exponential backoff (15s, 30s, 60s, 120s)
- Increase HTML fallback CDX limit from 5 to 10 timestamps
- Bump CDX_DELAY 0.5s → 3.0s and REPLAY_DELAY 1.2s → 2.0s (polite scraping)
- Fix KeyError: 0 on hero_images dict (normalise dict to list before indexing)

discover_wayback.py:
- Switch CDX to HTTPS

scripts/pipeline/log_utils.py (new):
- attach_pipeline_log(script_name): adds a JSON FileHandler to the root logger
  writing to /Library/Assets/logs/pipeline/<script>_<ts>.jsonl for Avocet
  Turnstone training data ingestion (kiwi#141 / avocet#67)
2026-05-17 13:35:35 -07:00
..
pipeline feat(pipeline): Purple Carrot scraper hardening + shared pipeline logging 2026-05-17 13:35:35 -07:00
__init__.py feat: data pipeline -- USDA FDC ingredient index builder 2026-03-30 22:44:25 -07:00
backfill_keywords.py chore: commit in-progress work -- tag inferrer, imitate endpoint, hall-of-chaos easter egg, migration files, Dockerfile .env defense 2026-04-14 13:23:15 -07:00
backfill_texture_profiles.py feat: recipe engine — assembly templates, prep notes, FTS fixes, texture backfill 2026-04-02 22:12:35 -07:00
tag_sensory_profiles.py feat: sensory profile filter — texture/smell/noise filtering for Browse and Find (kiwi#51) 2026-04-24 09:47:48 -07:00