From 56f942b3fd849b96a931ed5bd6a78319c914cdbb Mon Sep 17 00:00:00 2001
From: pyr0ball
Date: Sun, 17 May 2026 13:35:35 -0700
Subject: [PATCH] feat(pipeline): Purple Carrot scraper hardening + shared pipeline logging
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

scrape_recipes.py:
- Switch CDX to HTTPS (avoids HTTP 503 rate-limit bucket)
- Restrict product API CDX to 2019–2021 window (pre-HelloFresh instruction stripping)
- Replace inline CDX requests with _cdx_get() helper: retries on 429/503 with exponential backoff (15s, 30s, 60s, 120s)
- Increase HTML fallback CDX limit from 5 to 10 timestamps
- Bump CDX_DELAY 0.5s → 3.0s and REPLAY_DELAY 1.2s → 2.0s (polite scraping)
- Fix KeyError: 0 on hero_images dict (normalise dict to list before indexing)

discover_wayback.py:
- Switch CDX to HTTPS

scripts/pipeline/log_utils.py (new):
- attach_pipeline_log(script_name): adds a JSON FileHandler to the root logger writing to /Library/Assets/logs/pipeline/
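
For reviewers, the retry behaviour described for _cdx_get() under scrape_recipes.py can be sketched roughly as below. This is an illustration only, not the code in the patch: the name cdx_get_sketch, the injected fetch and sleep callables, and the choice to return None on failure are all assumptions made to keep the example self-contained. It also assumes the four listed delays are waits between attempts, giving up to five attempts in total.

```python
import time
from typing import Callable, Optional, Tuple

# Statuses and delays taken from the commit message.
RETRY_STATUSES = {429, 503}
BACKOFF_SECONDS = (15, 30, 60, 120)


def cdx_get_sketch(
    fetch: Callable[[], Tuple[int, str]],
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Hypothetical stand-in for _cdx_get().

    fetch() performs one request and returns (status_code, body).
    On 429/503 the call is retried after the next backoff delay.
    Returns the body on HTTP 200, or None on any other outcome
    (assumed behaviour; the real helper may raise instead).
    """
    # The trailing None marks the final attempt, after which we stop.
    for delay in (*BACKOFF_SECONDS, None):
        status, body = fetch()
        if status not in RETRY_STATUSES:
            return body if status == 200 else None
        if delay is None:
            break  # retries exhausted
        sleep(delay)
    return None
```

Injecting fetch and sleep keeps the sketch independent of any HTTP library and lets the backoff schedule be checked without actually waiting.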