feat(pipeline): live Purple Carrot scraper via Playwright virtual desktop #137

Open
opened 2026-05-17 09:00:09 -07:00 by pyr0ball · 0 comments
Owner

Context

Wayback Machine covers the bulk of archived PC recipes (see scripts/pipeline/purple_carrot/discover_wayback.py + scrape_recipes.py). Newer/unarchived recipes require the live site.

Problem

www.purplecarrot.com sits behind Cloudflare challenge pages, so plain HTTP clients (curl, requests) get a 403. Playwright driving a real browser (the virtual desktop pattern already used in Magpie) handles the JS challenge.

Task

Build scripts/pipeline/purple_carrot/scrape_live.py:

  • Reuse the Magpie Playwright vdesktop launch pattern
  • Accept a URL list (recipes not found in Wayback manifest)
  • Parse the same recipe HTML structure as scrape_recipes.py
  • Append output to recipes_purplecarrot.parquet
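
A rough sketch of the first two bullets, offered as a starting point rather than the final design. The Magpie launch code is not reproduced here, so `headless=False` under an already-running virtual display is an assumption standing in for it. `read_url_list`, `fetch_rendered_html`, and the `settle_ms` fixed wait are hypothetical names and choices. Parsing and the Parquet append are left out, since they would reuse whatever `scrape_recipes.py` already does.

```python
from pathlib import Path


def read_url_list(path: Path) -> list[str]:
    """Read one URL per line, skipping blanks, '#' comments and duplicates."""
    seen: set[str] = set()
    urls: list[str] = []
    for raw in path.read_text(encoding="utf-8").splitlines():
        url = raw.strip()
        if not url or url.startswith("#") or url in seen:
            continue
        seen.add(url)
        urls.append(url)
    return urls


def fetch_rendered_html(urls: list[str], settle_ms: int = 5000) -> dict[str, str]:
    """Load each URL in a headed Chromium and return its rendered HTML."""
    # Imported lazily so read_url_list works without Playwright installed.
    from playwright.sync_api import sync_playwright

    pages: dict[str, str] = {}
    with sync_playwright() as p:
        # Assumption: a display is available (e.g. the virtual desktop),
        # so the browser can run headed instead of headless.
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        for url in urls:
            page.goto(url, wait_until="domcontentloaded")
            # Crude fixed wait for the JS challenge to clear; a selector
            # wait on the recipe markup would likely be more robust.
            page.wait_for_timeout(settle_ms)
            pages[url] = page.content()
        browser.close()
    return pages
```

The returned HTML strings could then be handed to the same parsing code path that the Wayback scraper uses, keeping one parser for both sources.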

Blocked on

The Wayback scraper must finish first, so we know which URLs are missing from the archive. This is post-launch work and is not needed for the initial corpus build.
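
Once the Wayback run is done, the missing-URL list could be derived with a simple set difference. This is a minimal sketch: the manifest format is not specified here, so it assumes the archived original URLs are available as a plain iterable of strings. `normalize` and `missing_from_wayback` are hypothetical helpers, and the recipe paths in the usage example are made up.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Reduce a URL to host + path so scheme, 'www.', query and trailing slash don't matter."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower().removeprefix("www.")
    return host + parts.path.rstrip("/")


def missing_from_wayback(candidates, archived):
    """Return candidate URLs with no archived counterpart, in input order, deduplicated."""
    archived_keys = {normalize(u) for u in archived}
    missing, seen = [], set()
    for url in candidates:
        key = normalize(url)
        if key not in archived_keys and key not in seen:
            seen.add(key)
            missing.append(url)
    return missing


# Usage with made-up recipe paths:
candidates = [
    "https://www.purplecarrot.com/recipes/a",
    "https://www.purplecarrot.com/recipes/b/",
    "https://www.purplecarrot.com/recipes/b",
]
archived = ["http://purplecarrot.com/recipes/a/"]
todo = missing_from_wayback(candidates, archived)
```

Normalizing before comparing avoids re-scraping a recipe just because the archived copy was captured over `http` or without the `www.` prefix.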

References

  • scripts/pipeline/purple_carrot/discover_wayback.py
  • scripts/pipeline/purple_carrot/scrape_recipes.py
  • Magpie Playwright vdesktop pattern
pyr0ball added the
enhancement
label 2026-06-01 12:11:32 -07:00
pyr0ball added this to the Post-Launch milestone 2026-06-01 12:11:32 -07:00
Reference: Circuit-Forge/kiwi#137