kiwi/scripts/pipeline
pyr0ball 56f942b3fd feat(pipeline): Purple Carrot scraper hardening + shared pipeline logging
scrape_recipes.py:
- Switch CDX to HTTPS (avoids HTTP 503 rate-limit bucket)
- Restrict product API CDX to 2019–2021 window (pre-HelloFresh instruction stripping)
- Replace inline CDX requests with _cdx_get() helper: retries on 429/503 with
  exponential backoff (15s, 30s, 60s, 120s)
- Increase HTML fallback CDX limit from 5 to 10 timestamps
- Bump CDX_DELAY 0.5s → 3.0s and REPLAY_DELAY 1.2s → 2.0s (polite scraping)
- Fix KeyError: 0 on hero_images dict (normalise dict to list before indexing)
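
The retry behaviour described for _cdx_get() could look roughly like the sketch below. This is not the code from scrape_recipes.py: the names get_with_backoff and fetch_status_and_body, the injectable fetch/sleep parameters, and the choice to raise RuntimeError after the final attempt are all assumptions made for illustration. Only the retryable statuses (429, 503) and the delay schedule (15s, 30s, 60s, 120s) come from the commit message.

```python
import time
import urllib.error
import urllib.request

RETRY_STATUSES = {429, 503}          # statuses named in the commit message
BACKOFF_DELAYS = (15, 30, 60, 120)   # seconds, as listed in the commit message


def fetch_status_and_body(url, timeout=30):
    """Hypothetical fetcher: return (status, body) without raising on HTTP errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as exc:
        return exc.code, exc.read()


def get_with_backoff(url, fetch=fetch_status_and_body,
                     delays=BACKOFF_DELAYS, sleep=time.sleep):
    """Call fetch(url); on 429/503 wait for the next delay and try again.

    Makes len(delays) + 1 attempts in total. Any other status (including
    404 or 500) is returned to the caller unchanged.
    """
    status = None
    for delay in (*delays, None):    # None marks the final attempt
        status, body = fetch(url)
        if status not in RETRY_STATUSES:
            return status, body
        if delay is None:
            break
        sleep(delay)
    raise RuntimeError(
        f"gave up on {url} after {len(delays)} retries (last status {status})"
    )
```

Passing fetch and sleep as parameters is only a convenience for testing the schedule without touching the network or waiting for real delays.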

discover_wayback.py:
- Switch CDX to HTTPS

scripts/pipeline/log_utils.py (new):
- attach_pipeline_log(script_name): adds a JSON FileHandler to the root logger
  writing to /Library/Assets/logs/pipeline/<script>_<ts>.jsonl for Avocet
  Turnstone training data ingestion (kiwi#141 / avocet#67)
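
A JSON-lines file handler on the root logger, as described for attach_pipeline_log, might be sketched as follows. This is an illustration under assumptions, not the contents of log_utils.py: the function name attach_jsonl_log, the explicit log_dir parameter, the timestamp format, and the JSON field names are all hypothetical. The commit message only states that the real helper takes script_name and writes <script>_<ts>.jsonl files.

```python
import json
import logging
import time
from pathlib import Path


class JsonLineFormatter(logging.Formatter):
    """Render each log record as one JSON object per line (field names assumed)."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exc_info"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def attach_jsonl_log(script_name, log_dir):
    """Add a JSON-lines FileHandler to the root logger and return the log path.

    log_dir is a parameter here only to keep the sketch portable; the
    timestamp format below is likewise an assumption.
    """
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    path = log_dir / f"{script_name}_{stamp}.jsonl"
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(JsonLineFormatter())
    logging.getLogger().addHandler(handler)
    return path
```

Because the handler sits on the root logger, records from any module logger that propagates will land in the file. The root logger's level still applies, so INFO records appear only if the calling script lowers the level from the default WARNING.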
2026-05-17 13:35:35 -07:00
purple_carrot feat(pipeline): Purple Carrot scraper hardening + shared pipeline logging 2026-05-17 13:35:35 -07:00
__init__.py feat: data pipeline -- USDA FDC ingredient index builder 2026-03-30 22:44:25 -07:00
backfill_meal_tags.py chore(pipeline): add fast targeted meal-tag backfill script 2026-04-27 13:00:58 -07:00
build_flavorgraph_index.py fix: data pipeline — R-vector parser, allrecipes dataset, unique recipe index 2026-03-31 21:36:13 -07:00
build_ingredient_index.py feat: recipe engine — assembly templates, prep notes, FTS fixes, texture backfill 2026-04-02 22:12:35 -07:00
build_recipe_index.py chore: commit in-progress work -- tag inferrer, imitate endpoint, hall-of-chaos easter egg, migration files, Dockerfile .env defense 2026-04-14 13:23:15 -07:00
derive_substitutions.py feat: cloud auth bypass, VRAM leasing, barcode EXIF fix, pipeline improvements 2026-04-01 16:06:23 -07:00
download_datasets.py feat: cloud auth bypass, VRAM leasing, barcode EXIF fix, pipeline improvements 2026-04-01 16:06:23 -07:00
estimate_recipe_nutrition.py feat: recipe engine — assembly templates, prep notes, FTS fixes, texture backfill 2026-04-02 22:12:35 -07:00
infer_recipe_tags.py feat(browse-counts): add pre-computed FTS counts cache with nightly refresh 2026-04-21 15:04:23 -07:00
log_utils.py feat(pipeline): Purple Carrot scraper hardening + shared pipeline logging 2026-05-17 13:35:35 -07:00