recipe_browser_fts: only 1.2K of 3.2M corpus recipes have category/keywords — browser returns sparse results #108

Closed
opened 2026-04-18 14:12:42 -07:00 by pyr0ball · 0 comments
Owner

Problem

The recipe_browser_fts table indexes category, keywords, and inferred_tags. Only ~1,215 of 3.19M corpus recipes have these columns populated (they come from the food.com subset loaded with category metadata). The remaining ~3.19M recipes have empty category and keywords.

Result: browsing by domain/category finds at most ~40 recipes even for common categories like Breakfast.
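The coverage gap can be measured directly with a query along these lines. This is a minimal sketch, not the project's code: it assumes a hypothetical base table named `recipes` with nullable text columns `category` and `keywords`, which may not match the real schema.

```python
import sqlite3


def metadata_coverage(conn: sqlite3.Connection) -> tuple[int, int]:
    """Return (rows with a non-empty category or keywords, total rows).

    Assumes a hypothetical `recipes` table; adjust names to the real schema.
    """
    row = conn.execute(
        """
        SELECT
            COALESCE(SUM(
                (category IS NOT NULL AND TRIM(category) <> '')
                OR (keywords IS NOT NULL AND TRIM(keywords) <> '')
            ), 0),
            COUNT(*)
        FROM recipes
        """
    ).fetchone()
    return row[0], row[1]
```

Calling `metadata_coverage(sqlite3.connect(path))` against the corpus database should show whether the populated share is in the ~1,215 range reported above.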

Root cause

The corpus pipeline only populated category/keywords for a small slice of the dataset. The recipe_browser_fts FTS index is effectively sparse.

Fix options

  1. Re-run pipeline to populate category and keywords for all 3.19M recipes from their source metadata
  2. Expand FTS index to include title + ingredient_names so the full corpus is browsable by ingredient/title search
  3. Use inferred_tags — run the tag_inferrer pipeline on all 3.19M recipes to generate browsable tags from ingredients

Option 3 is the most principled: inferred tags are derived deterministically from nutrition/ingredient data, so they don't require the original source metadata.
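To illustrate what "deterministic from ingredient data" could mean for option 3, here is a rough sketch of a rule-based inferrer. It is not the actual `tag_inferrer` pipeline: the `TAG_RULES` table, the tag names, and the `infer_tags` function are all hypothetical and only show the shape of the idea.

```python
# Hypothetical rules: a tag applies when any trigger ingredient is present.
# The real tag_inferrer may use different tags and also use nutrition data.
TAG_RULES: dict[str, set[str]] = {
    "breakfast": {"egg", "oats", "bacon", "maple syrup"},
    "dessert": {"sugar", "chocolate", "vanilla extract"},
    "soup": {"chicken broth", "vegetable stock"},
}


def infer_tags(ingredient_names: list[str]) -> list[str]:
    """Return a sorted list of tags implied by the ingredient names.

    Same input always yields the same output, so the tags can be
    regenerated for the whole corpus without any source metadata.
    """
    names = {name.strip().lower() for name in ingredient_names}
    return sorted(tag for tag, triggers in TAG_RULES.items() if names & triggers)
```

For example, `infer_tags(["Egg", "Bacon", "flour"])` yields `["breakfast"]` under these made-up rules. The resulting tags would then be written to `inferred_tags` and picked up by `recipe_browser_fts`.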

Impact

recipes_fts (ingredient-based search) is fully populated (3.19M rows). Core recipe suggestion works. Only the browse-by-category feature is affected.

Notes

  • RECIPE_DB_PATH ATTACH fix (#102) is working correctly — this is a data quality issue, not a code bug
  • browser_domains.py has a comment noting that keyword lists need validation against the corpus before production deploy
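The keyword validation mentioned in the last note could be approached as below. This is a sketch under assumptions: it supposes `recipe_browser_fts` is an SQLite FTS5 table and that keyword lists are plain strings; `keyword_hit_counts` is a hypothetical helper, not something that exists in `browser_domains.py`.

```python
import sqlite3


def keyword_hit_counts(conn: sqlite3.Connection, keywords: list[str]) -> dict[str, int]:
    """Count how many indexed rows match each keyword in recipe_browser_fts.

    Assumes an FTS5 table. Each keyword is quoted as an FTS5 phrase so that
    multi-word keywords and punctuation are not parsed as query syntax.
    """
    counts: dict[str, int] = {}
    for kw in keywords:
        phrase = '"' + kw.replace('"', '""') + '"'
        (n,) = conn.execute(
            "SELECT COUNT(*) FROM recipe_browser_fts "
            "WHERE recipe_browser_fts MATCH ?",
            (phrase,),
        ).fetchone()
        counts[kw] = n
    return counts
```

Keywords that come back with a count of zero (or a very low count) are candidates to drop or replace before the lists are used in production.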
Reference: Circuit-Forge/kiwi#108