Circuit-Forge/kiwi - Forgejo: Beyond coding. We Forge.

Author	SHA1	Message	Date
pyr0ball	21a0664961	feat(pipeline): weekly Purple Carrot harvest script + cron Some checks are pending CI / Backend (Python) (push) Waiting to run Details CI / Frontend (Vue) (push) Waiting to run Details Mirror / mirror (push) Waiting to run Details Add weekly_harvest.sh wrapper that: - Runs discover_current_menu.py to fetch this week's 23 active menu slugs - Runs scrape_live.py with --resume to scrape only new slugs - Appends timestamped output to /Library/Assets/kiwi/pipeline/logs/ Cron entry added to system crontab: 0 23 * * 0 (every Sunday 23:00) Logs: /Library/Assets/kiwi/pipeline/logs/purple_carrot_harvest.log	2026-05-21 16:22:26 -07:00
pyr0ball	a9ab996bcc	feat(pipeline): purple carrot weekly menu scraper with CF bypass Some checks are pending CI / Backend (Python) (push) Waiting to run Details CI / Frontend (Vue) (push) Waiting to run Details Mirror / mirror (push) Waiting to run Details Add three new scripts for Purple Carrot recipe pipeline: - discover_current_menu.py: fetches this week's active menu slugs from /plant-based-recipes using requests (server-rendered HTML, no JS needed). Accumulates slugs across weekly runs for building a recipe corpus over time. - discover_slugs_categories.py: crawls recipe-category listing pages with ?page=N pagination to discover historical slug inventory. Note: category archive slugs (past menu items) 404 when scraped live; only use for identifying currently-featured recipes per category. - scrape_live.py: updated with --slugs-from flag (load slug inventory from any parquet, not just the default Wayback one) and fresh-context-per-slug pattern to bypass Cloudflare session-level bot detection (which fires on the 2nd+ request in a shared browser context). Discovery: the live site only renders full ingredient/instruction content for recipes currently on the active weekly menu. 23/23 current menu recipes scraped successfully (100% hit rate vs ~1% for archived slugs).	2026-05-21 16:16:32 -07:00
pyr0ball	56f942b3fd	feat(pipeline): Purple Carrot scraper hardening + shared pipeline logging scrape_recipes.py: - Switch CDX to HTTPS (avoids HTTP 503 rate-limit bucket) - Restrict product API CDX to 2019–2021 window (pre-HelloFresh instruction stripping) - Replace inline CDX requests with _cdx_get() helper: retries on 429/503 with exponential backoff (15s, 30s, 60s, 120s) - Increase HTML fallback CDX limit from 5 to 10 timestamps - Bump CDX_DELAY 0.5s → 3.0s and REPLAY_DELAY 1.2s → 2.0s (polite scraping) - Fix KeyError: 0 on hero_images dict (normalise dict to list before indexing) discover_wayback.py: - Switch CDX to HTTPS scripts/pipeline/log_utils.py (new): - attach_pipeline_log(script_name): adds a JSON FileHandler to the root logger writing to /Library/Assets/logs/pipeline/<script>_<ts>.jsonl for Avocet Turnstone training data ingestion (kiwi#141 / avocet#67)	2026-05-17 13:35:35 -07:00
pyr0ball	7cad503b35	feat(pipeline): Purple Carrot recipe corpus scraper via Wayback Machine discover_wayback.py — enumerates recipe slugs from archived menu API (/api/v2/menus/<id>) and product API (/api/v1/products/*) plus recipe-category HTML pages. Writes incremental JSONL manifest to /Library/Assets/kiwi/pipeline/pc_slugs.jsonl. scrape_recipes.py — fetches full recipe data per slug using three-tier fallback: product API JSON (oldest captures first), HTML inline state (__NEXT_DATA__ / __INITIAL_STATE__), and JSON-LD structured data. Outputs recipes_purplecarrot.parquet in food.com columnar format so build_recipe_index.py imports it unchanged. Includes SourceURL column for recipe attribution UI (kiwi#139). Checkpoints every 50 recipes. Initial discovery: 158 slugs from menu 1536 + product_api pass. Re-run discover_wayback.py after archive.org stabilizes to pick up older slugs from recipe-category pages. Backlog: live Playwright scraper for post-Wayback recipes (kiwi#137).	2026-05-17 09:16:35 -07:00
pyr0ball	d5a4b14400	chore(pipeline): add fast targeted meal-tag backfill script Some checks failed CI / Backend (Python) (push) Waiting to run Details CI / Frontend (Vue) (push) Waiting to run Details Mirror / mirror (push) Has been cancelled Details Release / release (push) Has been cancelled Details backfill_meal_tags.py merges meal: tags from title-only matching into existing inferred_tags without re-deriving all other signals. ~10x faster than infer_recipe_tags.py --force for meal-tag-only updates: 3.19M recipes in ~5-10min vs ~2.5h for full re-derivation.	2026-04-27 13:00:58 -07:00
pyr0ball	1a7a94a344	feat(browse-counts): add pre-computed FTS counts cache with nightly refresh Multiple concurrent users browsing the 3.2M recipe corpus would cause FTS5 page cache contention and slow per-request queries. Solution: pre-compute counts for all category/subcategory keyword sets into a small SQLite cache. - browse_counts_cache.py: refresh(), load_into_memory(), is_stale() helpers - config.py: BROWSE_COUNTS_PATH setting (default DATA_DIR/browse_counts.db) - main.py: warms in-memory cache on startup; runs nightly refresh task every 24h - infer_recipe_tags.py: auto-refreshes cache after a successful tag run so the app picks up updated FTS counts without a restart	2026-04-21 15:04:23 -07:00
pyr0ball	144d1dc6c4	chore: commit in-progress work -- tag inferrer, imitate endpoint, hall-of-chaos easter egg, migration files, Dockerfile .env defense - app/services/recipe/tag_inferrer.py: infer tags from recipe ingredient text - app/db/migrations/022_recipe_generic_flag.sql, 029_inferred_tags.sql: schema migrations - app/api/endpoints/imitate.py: recipe imitation endpoint stub - app/api/endpoints/community.py: hall-of-chaos easter egg endpoint - scripts/pipeline/infer_recipe_tags.py, backfill_keywords.py: pipeline scripts - scripts/pipeline/build_recipe_index.py: extended index builder - Dockerfile: explicit .env removal as defense-in-depth - frontend/src/components/FeedbackButton.vue: feedback UX improvements - frontend/src/style.css: minor style tweaks - app/cloud_session.py: cloud session improvements - tests/api/test_community_endpoints.py: additional test coverage	2026-04-14 13:23:15 -07:00
pyr0ball	1a493e0ad9	feat: recipe engine — assembly templates, prep notes, FTS fixes, texture backfill - Assembly template system (13 templates: burrito, fried rice, omelette, stir fry, pasta, sandwich, grain bowl, soup/stew, casserole, pancakes, porridge, pie, pudding) with role-based matching, whole-word single-keyword guard, deterministic titles via MD5 pantry hash - Prep-state stripping: strips 'melted butter' → 'butter' for coverage checks; reconstructs actionable states as 'Before you start:' cooking instructions (NutritionPanel prep_notes field + RecipesView.vue display block) - FTS5 fixes: always double-quote all terms; strip apostrophes to prevent syntax errors on brands like "Stouffer's"; 'plant-based' → bare 'based' crash - Bidirectional synonym expansion: alt-meat, alt-chicken, alt-beef, alt-pork mapped to canonical texture class; pantry expansion covers 'hamburger' from 'burger patties' etc. - Texture profile backfill script (378K ingredient_profiles rows) with macro-derived classification in priority order (fatty → creamy → starchy → firm → fibrous → tender → liquid → neutral); oats/legumes starchy-first fix - LLM prompt: ban flavoured/sweetened ingredients (vanilla yoghurt) from savoury - Migrations 014 (nutrition macros) + 015 (recipe FTS index) - Nutrition estimation pipeline script - gitignore MagicMock sqlite test artifacts	2026-04-02 22:12:35 -07:00
pyr0ball	33a5cdec37	feat: cloud auth bypass, VRAM leasing, barcode EXIF fix, pipeline improvements - cloud_session.py: CLOUD_AUTH_BYPASS_IPS with CIDR support; X-Real-IP for Docker bridge NAT-aware client IP resolution; local-dev DB path under CLOUD_DATA_ROOT for bypass sessions - compose.cloud.yml: thread CLOUD_AUTH_BYPASS_IPS from shell env; document Docker bridge CIDR requirement in .env.example - nginx.cloud.conf + nginx.conf: client_max_body_size 20m for barcode uploads - barcode_scanner.py: EXIF orientation correction (PIL ImageOps.exif_transpose) before cv2 decode; rotation coverage extended to [90, 180, 270, 45, 135] to catch sideways barcodes the 270° case was missing - llm_recipe.py: CF-core VRAM lease acquire/release wrapping LLMRouter calls - tasks/runner.py + config.py: COORDINATOR_URL + recipe_llm VRAM budget (4GB) - recipes.py: per-request Store creation inside asyncio.to_thread worker to avoid SQLite check_same_thread violations - download_datasets.py: HF_PARQUET_FILES strategy for repos without dataset builders (lishuyang/recipepairs direct parquet download) - derive_substitutions.py: use recipepairs_recipes.parquet for ingredient lookup; numpy array detection; JSON category parsing - test_build_flavorgraph_index.py: rewritten for CSV-based index format - pyproject.toml: add Pillow>=10.0 for EXIF rotation support	2026-04-01 16:06:23 -07:00
pyr0ball	77627cec23	fix: data pipeline — R-vector parser, allrecipes dataset, unique recipe index - build_recipe_index.py: add _parse_r_vector() for food.com R format, add _parse_allrecipes_text() for corbt/all-recipes text format, _row_to_fields() dispatcher handles both columnar (food.com) and single-text (all-recipes) - build_flavorgraph_index.py: switch from graph.json to nodes/edges CSVs matching actual FlavorGraph repo structure - download_datasets.py: switch recipe source to corbt/all-recipes (2.1M recipes, 807MB) replacing near-empty AkashPS11/recipes_data_food.com - 007_recipe_corpus.sql: add UNIQUE constraint on external_id to prevent duplicate inserts on pipeline reruns	2026-03-31 21:36:13 -07:00
pyr0ball	e44d36e32f	fix: pipeline scripts — connection safety, remove unused recipes_path arg, fix inserted counter, pre-load profile index	2026-03-30 23:10:52 -07:00
pyr0ball	bad6dd175c	feat: data pipeline -- recipe corpus + substitution pair derivation	2026-03-30 22:55:41 -07:00
pyr0ball	59b6a8265f	feat: data pipeline -- FlavorGraph molecule index builder	2026-03-30 22:46:53 -07:00
pyr0ball	97203313c1	feat: data pipeline -- USDA FDC ingredient index builder	2026-03-30 22:44:25 -07:00

14 commits