kiwi/scripts/pipeline
pyr0ball a9ab996bcc
feat(pipeline): purple carrot weekly menu scraper with CF bypass
Add two new scripts and update one for the Purple Carrot recipe pipeline:

- discover_current_menu.py: fetches this week's active menu slugs from
  /plant-based-recipes using requests (server-rendered HTML, no JS needed).
  Accumulates slugs across weekly runs for building a recipe corpus over time.

- discover_slugs_categories.py: crawls recipe-category listing pages with
  ?page=N pagination to discover the historical slug inventory. Note: category
  archive slugs (past menu items) return 404 when scraped live, so use this
  script only to identify currently featured recipes per category.

- scrape_live.py: updated with --slugs-from flag (load slug inventory from
  any parquet, not just the default Wayback one) and fresh-context-per-slug
  pattern to bypass Cloudflare session-level bot detection (which fires on
  the 2nd+ request in a shared browser context).
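
The slug discovery described for the two discover_* scripts can be sketched roughly as follows. This is a minimal illustration, not the scripts' actual code: the `/recipe/<slug>` link shape, the function names, and the "stop when a page adds nothing new" rule are all assumptions, and the page fetcher is passed in so the logic can be exercised without network access.

```python
import re
from typing import Callable, Iterable

# ASSUMPTION: recipe links look like href="/recipe/<slug>"; the commit does not
# state the real URL pattern.
RECIPE_HREF = re.compile(r'href="/recipe/([a-z0-9-]+)"')


def extract_slugs(html: str) -> set[str]:
    """Return the unique recipe slugs linked from one page of server-rendered HTML."""
    return set(RECIPE_HREF.findall(html))


def crawl_category(fetch: Callable[[int], str], max_pages: int = 50) -> set[str]:
    """Walk ?page=1, ?page=2, ... and stop at the first page that adds no new slugs."""
    seen: set[str] = set()
    for page in range(1, max_pages + 1):
        new = extract_slugs(fetch(page)) - seen
        if not new:
            break
        seen |= new
    return seen


def accumulate(previous: Iterable[str], current: Iterable[str]) -> list[str]:
    """Merge this week's slugs into the running inventory, sorted and de-duplicated."""
    return sorted(set(previous) | set(current))
```

With requests, the fetcher could be something like `lambda n: requests.get(category_url, params={"page": n}, timeout=30).text`, where `category_url` is a placeholder for a listing page URL.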

Discovery: the live site only renders full ingredient/instruction content for
recipes currently on the active weekly menu. 23/23 current menu recipes
scraped successfully (100% hit rate vs ~1% for archived slugs).
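
The fresh-context-per-slug pattern and the hit-rate figure above can be illustrated with a short sketch. The commit does not name the browser automation library, so this assumes an object exposing Playwright's sync `Browser` interface (`new_context`, `new_page`, `goto`, `content`, `close`). The helper names and the `looks_complete` check are hypothetical.

```python
from typing import Any, Callable, Iterable


def scrape_with_fresh_contexts(
    browser: Any,
    slugs: Iterable[str],
    url_for: Callable[[str], str],
    looks_complete: Callable[[str], bool],
) -> dict[str, str]:
    """Fetch each slug in its own browser context and keep pages that rendered fully.

    ASSUMPTION: `browser` behaves like a Playwright sync Browser.
    """
    scraped: dict[str, str] = {}
    for slug in slugs:
        # A new context starts with empty cookies and storage, so every request
        # is the first one of its session.
        context = browser.new_context()
        try:
            page = context.new_page()
            page.goto(url_for(slug))
            html = page.content()
            if looks_complete(html):
                scraped[slug] = html
        finally:
            # Always discard the context so no state carries over to the next slug.
            context.close()
    return scraped


def hit_rate(scraped: dict[str, str], total: int) -> float:
    """Fraction of attempted slugs that produced full content."""
    return len(scraped) / total if total else 0.0
```

Creating a context per slug costs some startup time on every request, but it keeps each request isolated from the session state that the shared-context approach accumulates.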
2026-05-21 16:16:32 -07:00
purple_carrot feat(pipeline): purple carrot weekly menu scraper with CF bypass 2026-05-21 16:16:32 -07:00
__init__.py feat: data pipeline -- USDA FDC ingredient index builder 2026-03-30 22:44:25 -07:00
backfill_meal_tags.py chore(pipeline): add fast targeted meal-tag backfill script 2026-04-27 13:00:58 -07:00
build_flavorgraph_index.py fix: data pipeline — R-vector parser, allrecipes dataset, unique recipe index 2026-03-31 21:36:13 -07:00
build_ingredient_index.py feat: recipe engine — assembly templates, prep notes, FTS fixes, texture backfill 2026-04-02 22:12:35 -07:00
build_recipe_index.py chore: commit in-progress work -- tag inferrer, imitate endpoint, hall-of-chaos easter egg, migration files, Dockerfile .env defense 2026-04-14 13:23:15 -07:00
derive_substitutions.py feat: cloud auth bypass, VRAM leasing, barcode EXIF fix, pipeline improvements 2026-04-01 16:06:23 -07:00
download_datasets.py feat: cloud auth bypass, VRAM leasing, barcode EXIF fix, pipeline improvements 2026-04-01 16:06:23 -07:00
estimate_recipe_nutrition.py feat: recipe engine — assembly templates, prep notes, FTS fixes, texture backfill 2026-04-02 22:12:35 -07:00
infer_recipe_tags.py feat(browse-counts): add pre-computed FTS counts cache with nightly refresh 2026-04-21 15:04:23 -07:00
log_utils.py feat(pipeline): Purple Carrot scraper hardening + shared pipeline logging 2026-05-17 13:35:35 -07:00