Add weekly_harvest.sh wrapper that:
- Runs discover_current_menu.py to fetch this week's 23 active menu slugs
- Runs scrape_live.py with --resume to scrape only new slugs
- Appends timestamped output to /Library/Assets/kiwi/pipeline/logs/
Cron entry added to system crontab:
0 23 * * 0 (every Sunday 23:00)
Logs: /Library/Assets/kiwi/pipeline/logs/purple_carrot_harvest.log
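
For illustration only, the wrapper's control flow can be rendered in Python
(the wrapper itself is a shell script and is not reproduced here). The script
names and the --resume flag come from the list above; run_harvest, the
runner parameter, and the log line format are hypothetical, and any extra
arguments the real scripts take are not shown.

```python
import subprocess
import sys
from datetime import datetime, timezone
from pathlib import Path

# Steps as described in the commit message; real invocations may differ.
STEPS = [
    [sys.executable, "discover_current_menu.py"],
    [sys.executable, "scrape_live.py", "--resume"],
]


def run_harvest(log_path, steps=STEPS, runner=subprocess.run):
    """Run each step in order, appending timestamped output to log_path.

    Stops at the first step that exits non-zero and returns its exit code;
    returns 0 when every step succeeds.
    """
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as log:
        for cmd in steps:
            result = runner(cmd, capture_output=True, text=True)
            stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
            log.write(f"[{stamp}] {' '.join(cmd)} exited {result.returncode}\n")
            for line in (result.stdout or "").splitlines():
                log.write(f"[{stamp}] {line}\n")
            if result.returncode != 0:
                return result.returncode
    return 0
```

Stopping at the first failure keeps scrape_live.py from running against a
stale slug list when discovery fails.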
Add three new scripts for Purple Carrot recipe pipeline:
- discover_current_menu.py: fetches this week's active menu slugs from
/plant-based-recipes using requests (server-rendered HTML, no JS needed).
Accumulates slugs across weekly runs for building a recipe corpus over time.
- discover_slugs_categories.py: crawls recipe-category listing pages with
?page=N pagination to discover the historical slug inventory. Note: archived
category slugs (past menu items) return 404 when scraped live, so use this
script only to identify currently featured recipes per category.
- scrape_live.py: updated with a --slugs-from flag (load the slug inventory
from any parquet file, not just the default Wayback one) and a
fresh-context-per-slug pattern to bypass Cloudflare's session-level bot
detection (which fires from the second request onward in a shared browser
context).
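
A minimal sketch of the fresh-context-per-slug pattern, assuming the
Playwright sync API. The function name, the base_url parameter, and the
URL layout are hypothetical; only new_context(), new_page(), goto(),
content(), and close() are relied on, so any object with that interface
works.

```python
def scrape_with_fresh_contexts(browser, slugs, base_url):
    """Fetch each slug's page in its own browser context.

    `browser` is a Playwright sync-API Browser (or a stand-in with the same
    interface). Returns a dict mapping slug to page HTML.
    """
    pages = {}
    for slug in slugs:
        # A new context starts with empty cookies and storage, so no
        # session state carries over from the previous slug.
        context = browser.new_context()
        try:
            page = context.new_page()
            page.goto(f"{base_url.rstrip('/')}/{slug}")
            pages[slug] = page.content()
        finally:
            # Always discard the context, even if navigation failed.
            context.close()
    return pages
```

With real Playwright the browser would come from sync_playwright(), for
example p.chromium.launch(). Creating a context per slug costs more than
reusing one, which is the trade-off for avoiding the shared-session
detection described above.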
Finding: the live site renders full ingredient/instruction content only for
recipes on the current weekly menu. All 23 current menu recipes scraped
successfully (100% hit rate vs ~1% for archived slugs).
scrape_recipes.py:
- Switch CDX to HTTPS (avoids HTTP 503 rate-limit bucket)
- Restrict product API CDX to 2019–2021 window (pre-HelloFresh instruction stripping)
- Replace inline CDX requests with _cdx_get() helper: retries on 429/503 with
exponential backoff (15s, 30s, 60s, 120s)
- Increase HTML fallback CDX limit from 5 to 10 timestamps
- Bump CDX_DELAY 0.5s → 3.0s and REPLAY_DELAY 1.2s → 2.0s (polite scraping)
- Fix KeyError: 0 on hero_images dict (normalise dict to list before indexing)
discover_wayback.py:
- Switch CDX to HTTPS
scripts/pipeline/log_utils.py (new):
- attach_pipeline_log(script_name): adds a JSON FileHandler to the root
logger that writes to /Library/Assets/logs/pipeline/<script>_<ts>.jsonl
for Avocet Turnstone training data ingestion (kiwi#141 / avocet#67)
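
A rough sketch of what attach_pipeline_log() might look like, using only
the standard logging module. The JSON field names, the timestamp format,
the JsonLineFormatter class, and the log_dir parameter are assumptions
made for illustration; only the function name, the root logger target,
and the default directory come from the description above.

```python
import json
import logging
from datetime import datetime, timezone
from pathlib import Path


class JsonLineFormatter(logging.Formatter):
    """Render each record as one JSON object per line (hypothetical schema)."""

    def format(self, record):
        return json.dumps({
            "ts": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def attach_pipeline_log(script_name, log_dir="/Library/Assets/logs/pipeline"):
    """Attach a JSONL file handler to the root logger and return it."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    handler = logging.FileHandler(log_dir / f"{script_name}_{ts}.jsonl",
                                  encoding="utf-8")
    handler.setFormatter(JsonLineFormatter())
    # The root logger's level still applies: records below it never reach
    # this handler, so callers may need logging.getLogger().setLevel(...).
    logging.getLogger().addHandler(handler)
    return handler
```

Returning the handler lets a caller remove it again, which is mainly
useful in tests.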
discover_wayback.py: enumerates recipe slugs from archived menu API
(/api/v2/menus/<id>) and product API (/api/v1/products/*) plus
recipe-category HTML pages. Writes incremental JSONL manifest to
/Library/Assets/kiwi/pipeline/pc_slugs.jsonl.
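
One way an incremental JSONL manifest can work is to append only records
whose slug has not been seen before. The sketch below assumes each line is
a JSON object with a "slug" key; that schema and the function name are
hypothetical, since the manifest format is not spelled out here.

```python
import json
from pathlib import Path


def append_new_slugs(manifest_path, records):
    """Append records whose 'slug' is not already in the JSONL manifest.

    Returns the number of records written. Existing lines are never
    rewritten, so an interrupted run loses at most the line in progress.
    """
    manifest_path = Path(manifest_path)
    seen = set()
    if manifest_path.exists():
        with manifest_path.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    seen.add(json.loads(line)["slug"])
    written = 0
    with manifest_path.open("a", encoding="utf-8") as fh:
        for rec in records:
            if rec["slug"] in seen:
                continue  # already recorded in this or an earlier run
            fh.write(json.dumps(rec) + "\n")
            seen.add(rec["slug"])
            written += 1
    return written
```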
scrape_recipes.py: fetches full recipe data per slug using three-tier
fallback: product API JSON (oldest captures first), HTML inline state
(__NEXT_DATA__ / __INITIAL_STATE__), and JSON-LD structured data.
Outputs recipes_purplecarrot.parquet in food.com columnar format so
build_recipe_index.py imports it unchanged. Includes SourceURL column
for recipe attribution UI (kiwi#139). Checkpoints every 50 recipes.
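
The three-tier fallback can be pictured as an ordered list of fetchers
tried until one returns data. The sketch below is an assumption about the
shape of that logic, not the script's code: fetch_recipe, the tier tuple
format, and extract_json_ld_recipe are hypothetical names. The JSON-LD
extractor shows what the last tier could do with an HTML capture.

```python
import json
import re

_LD_JSON = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def extract_json_ld_recipe(html):
    """Return the first JSON-LD object typed 'Recipe' in html, or None."""
    for match in _LD_JSON.finditer(html):
        try:
            data = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed blocks and keep looking
        for item in data if isinstance(data, list) else [data]:
            if not isinstance(item, dict):
                continue
            types = item.get("@type")
            types = types if isinstance(types, list) else [types]
            if "Recipe" in types:
                return item
    return None


def fetch_recipe(slug, tiers):
    """Try each (name, fetcher) tier in order and return the first hit.

    A fetcher returns a dict on success or None when its source has nothing
    usable. An exception in one tier does not block the next. Returns
    (tier_name, recipe), or (None, None) if every tier comes up empty.
    """
    for name, fetcher in tiers:
        try:
            recipe = fetcher(slug)
        except Exception:
            continue  # treat any failure as a miss for this tier
        if recipe:
            return name, recipe
    return None, None
```

Returning the tier name alongside the recipe makes it possible to record
which source each row came from.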
Initial discovery: 158 slugs from menu 1536 + product_api pass.
Re-run discover_wayback.py after archive.org stabilizes to pick up
older slugs from recipe-category pages.
Backlog: live Playwright scraper for post-Wayback recipes (kiwi#137).