kiwi/scripts/pipeline
pyr0ball 7cad503b35 feat(pipeline): Purple Carrot recipe corpus scraper via Wayback Machine
discover_wayback.py — enumerates recipe slugs from archived menu API
  (/api/v2/menus/<id>) and product API (/api/v1/products/*) plus
  recipe-category HTML pages. Writes incremental JSONL manifest to
  /Library/Assets/kiwi/pipeline/pc_slugs.jsonl.
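
The incremental manifest write described above could look roughly like the sketch below. It is not the code in discover_wayback.py: the function names and the record fields (`slug`, `source`) are assumptions made for illustration. It only shows the append-and-deduplicate pattern that lets a re-run add new slugs without rewriting earlier ones.

```python
import json
from pathlib import Path


def load_seen_slugs(manifest: Path) -> set[str]:
    """Return the slugs already recorded in the JSONL manifest (empty if absent)."""
    if not manifest.exists():
        return set()
    seen: set[str] = set()
    with manifest.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                # Hypothetical schema: one JSON object per line with a "slug" key.
                seen.add(json.loads(line)["slug"])
    return seen


def append_new_slugs(manifest: Path, records: list[dict]) -> int:
    """Append records whose slug is not yet in the manifest; return the count written."""
    seen = load_seen_slugs(manifest)
    written = 0
    manifest.parent.mkdir(parents=True, exist_ok=True)
    with manifest.open("a", encoding="utf-8") as fh:
        for rec in records:
            if rec["slug"] in seen:
                continue  # already discovered in an earlier pass
            fh.write(json.dumps(rec) + "\n")
            seen.add(rec["slug"])
            written += 1
    return written
```

Because the file is opened in append mode and each record is a single line, an interrupted discovery pass leaves the slugs it already wrote in place, and a later re-run only adds slugs that are missing.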

scrape_recipes.py — fetches full recipe data per slug using three-tier
  fallback: product API JSON (oldest captures first), HTML inline state
  (__NEXT_DATA__ / __INITIAL_STATE__), and JSON-LD structured data.
  Outputs recipes_purplecarrot.parquet in food.com columnar format so
  build_recipe_index.py imports it unchanged. Includes SourceURL column
  for recipe attribution UI (kiwi#139). Checkpoints every 50 recipes.
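
As a rough illustration of the tiered fallback described above, the sketch below tries a list of extractors in order and returns the first one that yields data. It is not taken from scrape_recipes.py: `first_successful`, `recipe_from_json_ld`, and the tier names are hypothetical, and only the last tier (JSON-LD) is actually implemented here, using a simple regex rather than a full HTML parser.

```python
import json
import re
from typing import Callable, Optional

# Matches <script type="application/ld+json"> ... </script> blocks.
JSON_LD_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def recipe_from_json_ld(html: str) -> Optional[dict]:
    """Return the first JSON-LD object typed as Recipe, or None if there is none."""
    for match in JSON_LD_RE.finditer(html):
        try:
            data = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # malformed block; try the next one
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if not isinstance(item, dict):
                continue
            kind = item.get("@type")
            if kind == "Recipe" or (isinstance(kind, list) and "Recipe" in kind):
                return item
    return None


Tier = tuple[str, Callable[[str], Optional[dict]]]


def first_successful(slug: str, tiers: list[Tier]) -> Optional[tuple[str, dict]]:
    """Run each tier in order; return (tier_name, data) for the first non-None result."""
    for name, fetch in tiers:
        result = fetch(slug)
        if result is not None:
            return name, result
    return None
```

Returning the tier name along with the data makes it possible to record which source produced each recipe, which may help when auditing capture quality across tiers.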

Initial discovery: 158 slugs from menu 1536 + product_api pass.
Re-run discover_wayback.py after archive.org stabilizes to pick up
older slugs from recipe-category pages.

Backlog: live Playwright scraper for post-Wayback recipes (kiwi#137).
2026-05-17 09:16:35 -07:00
purple_carrot feat(pipeline): Purple Carrot recipe corpus scraper via Wayback Machine 2026-05-17 09:16:35 -07:00
__init__.py feat: data pipeline -- USDA FDC ingredient index builder 2026-03-30 22:44:25 -07:00
backfill_meal_tags.py chore(pipeline): add fast targeted meal-tag backfill script 2026-04-27 13:00:58 -07:00
build_flavorgraph_index.py fix: data pipeline — R-vector parser, allrecipes dataset, unique recipe index 2026-03-31 21:36:13 -07:00
build_ingredient_index.py feat: recipe engine — assembly templates, prep notes, FTS fixes, texture backfill 2026-04-02 22:12:35 -07:00
build_recipe_index.py chore: commit in-progress work -- tag inferrer, imitate endpoint, hall-of-chaos easter egg, migration files, Dockerfile .env defense 2026-04-14 13:23:15 -07:00
derive_substitutions.py feat: cloud auth bypass, VRAM leasing, barcode EXIF fix, pipeline improvements 2026-04-01 16:06:23 -07:00
download_datasets.py feat: cloud auth bypass, VRAM leasing, barcode EXIF fix, pipeline improvements 2026-04-01 16:06:23 -07:00
estimate_recipe_nutrition.py feat: recipe engine — assembly templates, prep notes, FTS fixes, texture backfill 2026-04-02 22:12:35 -07:00
infer_recipe_tags.py feat(browse-counts): add pre-computed FTS counts cache with nightly refresh 2026-04-21 15:04:23 -07:00