Circuit-Forge/kiwi - Forgejo: Beyond coding. We Forge.

Commit graph

Author	SHA1	Message	Date

pyr0ball

a9ab996bcc

feat(pipeline): purple carrot weekly menu scraper with CF bypass

Add two new scripts and update a third for the Purple Carrot recipe pipeline:

- discover_current_menu.py: fetches this week's active menu slugs from
  /plant-based-recipes using requests (server-rendered HTML, no JS needed).
  Accumulates slugs across weekly runs for building a recipe corpus over time.

- discover_slugs_categories.py: crawls recipe-category listing pages with
  ?page=N pagination to discover the historical slug inventory. Note: category
  archive slugs (past menu items) return 404 when scraped live, so use this
  script only to identify currently featured recipes per category.

- scrape_live.py: updated with --slugs-from flag (load slug inventory from
  any parquet, not just the default Wayback one) and fresh-context-per-slug
  pattern to bypass Cloudflare session-level bot detection (which fires on
  the 2nd+ request in a shared browser context).
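
The fresh-context-per-slug pattern described for scrape_live.py could look
roughly like the sketch below. This is not the script's actual code, which the
commit does not show. It assumes a browser object that behaves like a
Playwright sync-API Browser (new_context, new_page, goto, content, close); the
function name and the url_template parameter are hypothetical.

```python
from typing import Dict, Iterable


def scrape_with_fresh_contexts(
    browser, slugs: Iterable[str], url_template: str
) -> Dict[str, str]:
    """Fetch each slug in its own browser context and return {slug: html}.

    Assumption: `browser` exposes the Playwright sync-API surface used here.
    `url_template` is a hypothetical format string containing "{slug}".
    """
    pages: Dict[str, str] = {}
    for slug in slugs:
        # A new context per slug shares no cookies or storage with earlier
        # requests, so every fetch is the first request of its session.
        context = browser.new_context()
        try:
            page = context.new_page()
            page.goto(url_template.format(slug=slug))
            pages[slug] = page.content()
        finally:
            # Always discard the context, even if navigation raised.
            context.close()
    return pages
```

With Playwright installed, the browser argument would typically come from
p.chromium.launch() inside a "with sync_playwright() as p:" block. The cost of
this design is one context setup per recipe, traded for never sending a second
request within a session.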

Discovery: the live site only renders full ingredient/instruction content for
recipes currently on the active weekly menu. 23/23 current menu recipes
scraped successfully (100% hit rate vs ~1% for archived slugs).
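
Since only active-menu recipes render fully, building a corpus depends on
collecting each week's slugs and merging them with those already seen, as
discover_current_menu.py is described as doing. The sketch below illustrates
that idea with the standard library only. It is not the script's code: the
helper names and the link_prefix argument are hypothetical, and the real
recipe link format is not given in the commit.

```python
from html.parser import HTMLParser
from typing import Iterable, List, Set


class _HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_slugs(html: str, link_prefix: str) -> Set[str]:
    """Return slugs from relative links that start with `link_prefix`.

    `link_prefix` is a hypothetical path prefix; absolute URLs are ignored.
    """
    parser = _HrefCollector()
    parser.feed(html)
    slugs: Set[str] = set()
    for href in parser.hrefs:
        # Drop query string and fragment before matching the path.
        path = href.split("?", 1)[0].split("#", 1)[0]
        if path.startswith(link_prefix):
            slug = path[len(link_prefix):].strip("/")
            if slug and "/" not in slug:
                slugs.add(slug)
    return slugs


def accumulate(known: Iterable[str], this_week: Iterable[str]) -> List[str]:
    """Merge this week's slugs into the known inventory, de-duplicated."""
    return sorted(set(known) | set(this_week))
```

In a weekly run, the HTML would come from a requests call against the menu
page, and the merged list would be written back to whatever store holds the
slug inventory.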

2026-05-21 16:16:32 -07:00

1 commit