discover_wayback.py — enumerates recipe slugs from archived menu API
(/api/v2/menus/<id>) and product API (/api/v1/products/*) plus
recipe-category HTML pages. Writes incremental JSONL manifest to
/Library/Assets/kiwi/pipeline/pc_slugs.jsonl.
scrape_recipes.py — fetches full recipe data per slug using three-tier
fallback: product API JSON (oldest captures first), HTML inline state
(__NEXT_DATA__ / __INITIAL_STATE__), and JSON-LD structured data.
Outputs recipes_purplecarrot.parquet in food.com columnar format so
build_recipe_index.py imports it unchanged. Includes SourceURL column
for recipe attribution UI (kiwi#139). Checkpoints every 50 recipes.
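The tiered fallback and periodic checkpointing described above can be sketched roughly as follows. This is a minimal illustration, not the code in scrape_recipes.py: the names fetch_with_fallback, scrape, and checkpoint are hypothetical, and each tier is assumed to be a callable that returns a dict or None.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Hypothetical type: a tier takes a slug and returns recipe data or None.
Fetcher = Callable[[str], Optional[dict]]


def fetch_with_fallback(slug: str, tiers: Sequence[Tuple[str, Fetcher]]) -> Optional[dict]:
    """Try each tier in order and return the first non-empty result."""
    for name, fetch in tiers:
        try:
            data = fetch(slug)
        except Exception:
            data = None  # a failing tier falls through to the next one
        if data:
            return {**data, "tier": name}  # record which tier produced the row
    return None


def scrape(
    slugs: Sequence[str],
    tiers: Sequence[Tuple[str, Fetcher]],
    checkpoint: Callable[[List[Dict]], None],
    every: int = 50,
) -> List[Dict]:
    """Fetch every slug, calling checkpoint(rows) after each `every` successes."""
    rows: List[Dict] = []
    for slug in slugs:
        row = fetch_with_fallback(slug, tiers)
        if row is not None:
            rows.append(row)
            if len(rows) % every == 0:
                checkpoint(rows)
    checkpoint(rows)  # final flush so a partial last batch is not lost
    return rows
```

In this sketch the tiers would be passed in priority order (product API JSON, then HTML inline state, then JSON-LD), and checkpoint would write the accumulated rows to disk.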
Initial discovery: 158 slugs from menu 1536 + product_api pass.
Re-run discover_wayback.py after archive.org stabilizes to pick up
older slugs from recipe-category pages.
Backlog: live Playwright scraper for post-Wayback recipes (kiwi#137).
backfill_meal_tags.py merges meal: tags from title-only matching
into existing inferred_tags without re-deriving all other signals.
Roughly 15-30x faster than infer_recipe_tags.py --force for meal-tag-only
updates: 3.19M recipes in ~5-10 min vs ~2.5 h for full re-derivation.
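A minimal sketch of the merge step, under stated assumptions: tags are stored as a list of strings, meal tags carry a "meal:" prefix, and title matching is a simple keyword lookup. The keyword table and both function names are hypothetical. The actual matching rules in backfill_meal_tags.py are not shown here.

```python
from typing import Dict, List, Tuple

# Hypothetical keyword table; the real title-matching rules may differ.
MEAL_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "meal:breakfast": ("pancake", "omelet", "granola"),
    "meal:dinner": ("casserole", "roast"),
}


def meal_tags_from_title(title: str) -> List[str]:
    """Derive meal: tags from the title alone, in a stable sorted order."""
    lowered = title.lower()
    return sorted(
        tag for tag, words in MEAL_KEYWORDS.items()
        if any(word in lowered for word in words)
    )


def merge_meal_tags(existing: List[str], title: str) -> List[str]:
    """Append any new meal: tags; all other inferred tags are left untouched."""
    merged = list(existing)
    for tag in meal_tags_from_title(title):
        if tag not in merged:
            merged.append(tag)
    return merged
```

Because only the title is inspected and the existing tags are copied through, none of the other inference signals need to be recomputed, which is where the time saving would come from.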
Multiple concurrent users browsing the 3.2M recipe corpus would cause FTS5 page
cache contention and slow per-request queries. Solution: pre-compute counts for
all category/subcategory keyword sets into a small SQLite cache.
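One way such a pre-computed count cache could look, as a hedged sketch using the standard sqlite3 module. The table layout, the function signatures, and the count_fn callback are assumptions for illustration. In the real pipeline the counts would presumably come from FTS5 queries against the recipe corpus.

```python
import sqlite3
import time
from typing import Callable, Dict, Optional, Sequence

# Hypothetical schema: one row per category/subcategory key.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS browse_counts ("
    "key TEXT PRIMARY KEY, n INTEGER NOT NULL, refreshed_at REAL NOT NULL)"
)


def refresh(
    cache: sqlite3.Connection,
    keyword_sets: Dict[str, Sequence[str]],
    count_fn: Callable[[Sequence[str]], int],
    now: Optional[float] = None,
) -> None:
    """Recompute the count for every keyword set and store it in the cache."""
    now = time.time() if now is None else now
    cache.execute(SCHEMA)
    rows = [(key, count_fn(words), now) for key, words in keyword_sets.items()]
    with cache:  # commit all rows in one transaction
        cache.executemany(
            "INSERT OR REPLACE INTO browse_counts (key, n, refreshed_at) VALUES (?, ?, ?)",
            rows,
        )


def load_into_memory(cache: sqlite3.Connection) -> Dict[str, int]:
    """Read the whole cache into a dict so requests never touch the big index."""
    cache.execute(SCHEMA)
    return dict(cache.execute("SELECT key, n FROM browse_counts"))


def is_stale(cache: sqlite3.Connection, max_age_s: float = 86400.0,
             now: Optional[float] = None) -> bool:
    """True if the cache is empty or its oldest row exceeds max_age_s."""
    now = time.time() if now is None else now
    cache.execute(SCHEMA)
    (oldest,) = cache.execute("SELECT MIN(refreshed_at) FROM browse_counts").fetchone()
    return oldest is None or now - oldest > max_age_s
```

Request handlers would then read counts from the in-memory dict, so concurrent browsing does not issue a full-text query per request.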
- browse_counts_cache.py: refresh(), load_into_memory(), is_stale() helpers
- config.py: BROWSE_COUNTS_PATH setting (default DATA_DIR/browse_counts.db)
- main.py: warms in-memory cache on startup; schedules a refresh task every 24h
- infer_recipe_tags.py: auto-refreshes cache after a successful tag run so the
app picks up updated FTS counts without a restart
- build_recipe_index.py: add _parse_r_vector() for food.com R format, add
_parse_allrecipes_text() for corbt/all-recipes text format, _row_to_fields()
dispatcher handles both columnar (food.com) and single-text (all-recipes) formats
- build_flavorgraph_index.py: switch from graph.json to nodes/edges CSVs
matching actual FlavorGraph repo structure
- download_datasets.py: switch recipe source to corbt/all-recipes (2.1M
recipes, 807MB) replacing near-empty AkashPS11/recipes_data_food.com
- 007_recipe_corpus.sql: add UNIQUE constraint on external_id to prevent
duplicate inserts on pipeline reruns
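The effect of the UNIQUE constraint on reruns can be illustrated with a small sketch. It assumes SQLite and a hypothetical recipes table; the real schema in 007_recipe_corpus.sql and the insert statement used by the pipeline are not shown in this log. Pairing the constraint with INSERT OR IGNORE is one possible way to make reruns idempotent, not necessarily what the pipeline does.

```python
import sqlite3
from typing import Iterable, Tuple


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical recipes table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE recipes ("
        "id INTEGER PRIMARY KEY, "
        "external_id TEXT NOT NULL UNIQUE, "  # the constraint under discussion
        "title TEXT NOT NULL)"
    )
    return conn


def import_rows(conn: sqlite3.Connection, rows: Iterable[Tuple[str, str]]) -> int:
    """Insert rows, skipping any external_id already present; return rows added."""
    before = conn.total_changes
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO recipes (external_id, title) VALUES (?, ?)",
            list(rows),
        )
    return conn.total_changes - before
```

Running import_rows twice with the same input adds rows only the first time; without the constraint the second run would duplicate every recipe.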