feat(pipeline): live Purple Carrot scraper via Playwright virtual desktop #137
Reference: Circuit-Forge/kiwi#137
Context
Wayback Machine covers the bulk of archived PC recipes (see scripts/pipeline/purple_carrot/discover_wayback.py + scrape_recipes.py). Newer/unarchived recipes require the live site.
Problem
www.purplecarrot.com is behind Cloudflare challenge pages, so curl/requests return 403. Playwright with a real browser (virtual desktop pattern already used in Magpie) handles the JS challenge.
Task
Build scripts/pipeline/purple_carrot/scrape_live.py:
scrape_recipes.py
recipes_purplecarrot.parquet
Blocked on
Wayback scraper completion first (to know which URLs are missing from archive). Post-launch: not needed for initial corpus build.
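A minimal sketch of the "which URLs are missing from archive" step, assuming both the live URL list and the archived URL list are available as plain original-site URLs (not web.archive.org wrapper URLs). The function names and the /recipe/... paths in the usage note are hypothetical and not taken from the existing scripts.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Reduce a URL to host + path so scheme, case of host, and trailing slash do not matter."""
    parts = urlsplit(url.strip())
    return f"{parts.netloc.lower()}{parts.path.rstrip('/')}"


def missing_from_archive(live_urls, archived_urls):
    """Return live URLs with no archived counterpart, in input order, without duplicates."""
    archived = {normalize(u) for u in archived_urls}
    seen = set()
    missing = []
    for url in live_urls:
        key = normalize(url)
        if key not in archived and key not in seen:
            seen.add(key)
            missing.append(url)
    return missing
```

For example, with live URLs https://www.purplecarrot.com/recipe/a and https://www.purplecarrot.com/recipe/b/ and a single archived URL http://WWW.purplecarrot.com/recipe/a/, only the second live URL would be passed on to the live scraper.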
References
scripts/pipeline/purple_carrot/discover_wayback.py
scripts/pipeline/purple_carrot/scrape_recipes.py
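Relating to the Problem section: a rough sketch of fetching one page through Playwright's sync API, not the Magpie implementation. It assumes the "virtual desktop pattern" means a headed Chromium running under a virtual display such as Xvfb, and that a challenge page can be recognised by a "Just a moment" marker in the HTML; both are assumptions, and the helper names are hypothetical. The first response may still be a 403 even when the challenge later clears, so the sketch inspects the final rendered HTML rather than the initial status code.

```python
def looks_like_challenge(html: str) -> bool:
    """Heuristic (assumption): challenge interstitials contain a 'Just a moment' marker."""
    return "just a moment" in html.lower()


def fetch_rendered_html(url: str, timeout_ms: int = 60_000) -> str:
    """Load a page in a real browser and return the HTML after scripts have run."""
    # Imported lazily so the heuristic above is usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # headless=False assumes a display is available (e.g. a virtual one).
        browser = p.chromium.launch(headless=False)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
            # Give any JS challenge time to finish and redirect to the real page.
            page.wait_for_load_state("networkidle", timeout=timeout_ms)
            html = page.content()
            if looks_like_challenge(html):
                raise RuntimeError(f"still on a challenge page: {url}")
            return html
        finally:
            browser.close()
```

Parsing the returned HTML and writing rows to the parquet file are left out, since the output schema is defined by the existing scraper.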