scrape_recipes.py:
- Switch CDX to HTTPS (avoids HTTP 503 rate-limit bucket)
- Restrict product API CDX to 2019–2021 window (pre-HelloFresh instruction stripping)
- Replace inline CDX requests with _cdx_get() helper: retries on 429/503 with
  exponential backoff (15s, 30s, 60s, 120s)
- Increase HTML fallback CDX limit from 5 to 10 timestamps
- Bump CDX_DELAY 0.5s → 3.0s and REPLAY_DELAY 1.2s → 2.0s (polite scraping)
- Fix KeyError: 0 on hero_images dict (normalise dict to list before indexing)
discover_wayback.py:
- Switch CDX to HTTPS
scripts/pipeline/log_utils.py (new):
- attach_pipeline_log(script_name): adds a JSON FileHandler to the root logger
  writing to /Library/Assets/logs/pipeline/<script>_<ts>.jsonl for Avocet
  Turnstone training data ingestion (kiwi#141 / avocet#67)
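A minimal sketch of what attach_pipeline_log() could look like, assuming one JSON object per log record. The `JsonLineFormatter` class, the record field names, the timestamp format, and the `log_dir` override are hypothetical. Only the function name, the root-logger FileHandler, and the output path pattern come from the notes above.

```python
import json
import logging
import time
from pathlib import Path

LOG_DIR = Path("/Library/Assets/logs/pipeline")  # directory named in the commit


class JsonLineFormatter(logging.Formatter):
    """Hypothetical formatter: serialise each record as one JSON line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def attach_pipeline_log(script_name, log_dir=LOG_DIR):
    """Add a JSONL FileHandler to the root logger and return the log path."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")  # assumed <ts> format
    path = log_dir / f"{script_name}_{stamp}.jsonl"
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(JsonLineFormatter())
    logging.getLogger().addHandler(handler)
    return path
```

A script would call `attach_pipeline_log("scrape_recipes")` once at startup; existing `logging` calls then land in the JSONL file with no further changes.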
discover_wayback.py — enumerates recipe slugs from archived menu API
(/api/v2/menus/<id>) and product API (/api/v1/products/*) plus
recipe-category HTML pages. Writes incremental JSONL manifest to
/Library/Assets/kiwi/pipeline/pc_slugs.jsonl.
scrape_recipes.py — fetches full recipe data per slug using three-tier
fallback: product API JSON (oldest captures first), HTML inline state
(__NEXT_DATA__ / __INITIAL_STATE__), and JSON-LD structured data.
Outputs recipes_purplecarrot.parquet in food.com columnar format so
build_recipe_index.py imports it unchanged. Includes SourceURL column
for recipe attribution UI (kiwi#139). Checkpoints every 50 recipes.
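The HTML tiers of the fallback can be sketched as an ordered list of extractors where the first hit wins. This is an illustration only: the function names, the regexes, and the `@type == "Recipe"` check are assumptions. The product API tier and the __INITIAL_STATE__ variant are left out for brevity.

```python
import json
import re

# Assumed markup: Next.js inline state and JSON-LD both live in <script> tags.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)
JSON_LD_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>', re.DOTALL
)


def from_inline_state(html):
    """Return the parsed __NEXT_DATA__ payload, or None if absent or invalid."""
    match = NEXT_DATA_RE.search(html)
    if not match:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


def from_json_ld(html):
    """Return the first JSON-LD object typed as a Recipe, or None."""
    for match in JSON_LD_RE.finditer(html):
        try:
            data = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and data.get("@type") == "Recipe":
            return data
    return None


def extract_recipe(html, extractors=(from_inline_state, from_json_ld)):
    """Try each tier in order; return (tier_name, data) or None if all miss."""
    for extractor in extractors:
        result = extractor(html)
        if result is not None:
            return extractor.__name__, result
    return None
```

Returning the tier name alongside the data makes it easy to log which fallback produced each recipe.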
Initial discovery: 158 slugs from menu 1536 + product_api pass.
Re-run discover_wayback.py after archive.org stabilizes to pick up
older slugs from recipe-category pages.
Backlog: live Playwright scraper for post-Wayback recipes (kiwi#137).