Circuit-Forge/kiwi - Forgejo: Beyond coding. We Forge.

Commit graph

Author

SHA1

Message

Date

pyr0ball

56f942b3fd

feat(pipeline): Purple Carrot scraper hardening + shared pipeline logging

scrape_recipes.py:
- Switch CDX to HTTPS (avoids HTTP 503 rate-limit bucket)
- Restrict product API CDX to 2019–2021 window (pre-HelloFresh instruction stripping)
- Replace inline CDX requests with _cdx_get() helper: retries on 429/503 with
  exponential backoff (15s, 30s, 60s, 120s)
- Increase HTML fallback CDX limit from 5 to 10 timestamps
- Bump CDX_DELAY 0.5s → 3.0s and REPLAY_DELAY 1.2s → 2.0s (polite scraping)
- Fix KeyError: 0 on hero_images dict (normalise dict to list before indexing)
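
The retry behaviour described for _cdx_get() could look roughly like the sketch below. It is not the committed code: the function name, the injectable `fetch` and `sleep` callables, and the RuntimeError on exhaustion are assumptions made for illustration. Only the retried status codes (429/503) and the delay schedule (15s, 30s, 60s, 120s) come from the commit message.

```python
import time
from typing import Callable, Sequence, Tuple

RETRY_STATUSES = {429, 503}          # statuses named in the commit message
BACKOFF_DELAYS = (15, 30, 60, 120)   # seconds, as listed in the commit message


def cdx_get_with_backoff(
    fetch: Callable[[], Tuple[int, str]],
    delays: Sequence[float] = BACKOFF_DELAYS,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch() until it stops returning a rate-limit status.

    fetch is a hypothetical callable returning (status_code, body); in a
    real script it would wrap an HTTPS request to the CDX API.
    """
    # One initial attempt plus one retry per delay; None marks the last attempt.
    for delay in (*delays, None):
        status, body = fetch()
        if status not in RETRY_STATUSES:
            if status != 200:
                raise RuntimeError(f"CDX request failed with HTTP {status}")
            return body
        if delay is None:
            break
        sleep(delay)  # wait before the next attempt
    raise RuntimeError(f"CDX still rate-limited after {len(delays)} retries")
```

Passing `sleep` as a parameter keeps the schedule testable without waiting for real time to elapse.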

discover_wayback.py:
- Switch CDX to HTTPS

scripts/pipeline/log_utils.py (new):
- attach_pipeline_log(script_name): adds a JSON FileHandler to the root logger
  writing to /Library/Assets/logs/pipeline/<script>_<ts>.jsonl for Avocet
  Turnstone training data ingestion (kiwi#141 / avocet#67)
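
A helper of this shape could be sketched as follows. This is an illustration under stated assumptions, not the contents of log_utils.py: the `log_dir` parameter, the JSON field names, the timestamp format, and the returned path are all hypothetical. The root-logger FileHandler and the `<script>_<ts>.jsonl` naming follow the commit message.

```python
import json
import logging
import time
from pathlib import Path


class JsonLineFormatter(logging.Formatter):
    """Render each record as one JSON object per line (field names are assumed)."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def attach_pipeline_log(script_name: str,
                        log_dir: str = "/Library/Assets/logs/pipeline") -> Path:
    """Attach a JSONL FileHandler to the root logger and return the log path."""
    directory = Path(log_dir)
    directory.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")  # timestamp format is an assumption
    path = directory / f"{script_name}_{stamp}.jsonl"
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(JsonLineFormatter())
    logging.getLogger().addHandler(handler)
    return path
```

Because the handler sits on the root logger, records from any module logger in the script reach the file, provided the logger levels allow them through.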

2026-05-17 13:35:35 -07:00

pyr0ball

7cad503b35

feat(pipeline): Purple Carrot recipe corpus scraper via Wayback Machine

discover_wayback.py — enumerates recipe slugs from archived menu API
  (/api/v2/menus/<id>) and product API (/api/v1/products/*) plus
  recipe-category HTML pages. Writes incremental JSONL manifest to
  /Library/Assets/kiwi/pipeline/pc_slugs.jsonl.
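
One way an incremental JSONL manifest can work is sketched below. The record fields (`slug`, `source`), the function names, and the de-duplication on re-run are assumptions for illustration; the commit message only states that the manifest is JSONL and written incrementally.

```python
import json
from pathlib import Path
from typing import Iterable, Set


def load_known_slugs(manifest: Path) -> Set[str]:
    """Read slugs already recorded so a re-run does not duplicate them."""
    if not manifest.exists():
        return set()
    with manifest.open(encoding="utf-8") as fh:
        return {json.loads(line)["slug"] for line in fh if line.strip()}


def append_slugs(manifest: Path, slugs: Iterable[str], source: str) -> int:
    """Append unseen slugs, one JSON object per line; return how many were added."""
    known = load_known_slugs(manifest)
    added = 0
    manifest.parent.mkdir(parents=True, exist_ok=True)
    with manifest.open("a", encoding="utf-8") as fh:  # append mode keeps earlier runs
        for slug in slugs:
            if slug in known:
                continue
            fh.write(json.dumps({"slug": slug, "source": source}) + "\n")
            known.add(slug)
            added += 1
    return added
```

Appending line by line means an interrupted run leaves a valid manifest containing everything discovered up to that point.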

scrape_recipes.py — fetches full recipe data per slug using three-tier
  fallback: product API JSON (oldest captures first), HTML inline state
  (__NEXT_DATA__ / __INITIAL_STATE__), and JSON-LD structured data.
  Outputs recipes_purplecarrot.parquet in food.com columnar format so
  build_recipe_index.py imports it unchanged. Includes SourceURL column
  for recipe attribution UI (kiwi#139). Checkpoints every 50 recipes.
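
The three-tier fallback can be pictured as an ordered list of extractors tried until one yields data. The sketch below is illustrative only: the function name, the `(name, extractor)` pairing, and the choice to swallow per-tier exceptions are assumptions, and the extractors themselves are left as hypothetical callables.

```python
from typing import Callable, Iterable, Optional, Tuple

Recipe = dict
Extractor = Callable[[str], Optional[Recipe]]


def fetch_with_fallback(
    slug: str,
    tiers: Iterable[Tuple[str, Extractor]],
) -> Optional[Tuple[str, Recipe]]:
    """Try each tier in order and return (tier_name, recipe) for the first hit."""
    for name, extract in tiers:
        try:
            recipe = extract(slug)
        except Exception:
            # A failing tier should not abort the slug; fall through to the next one.
            continue
        if recipe:
            return name, recipe
    return None  # no tier produced data for this slug
```

Under these assumptions the caller would pass the tiers in the order given in the commit message: product API JSON, then HTML inline state, then JSON-LD.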

Initial discovery: 158 slugs from menu 1536 + product_api pass.
Re-run discover_wayback.py after archive.org stabilizes to pick up
older slugs from recipe-category pages.

Backlog: live Playwright scraper for post-Wayback recipes (kiwi#137).

2026-05-17 09:16:35 -07:00

2 commits