Browse: audit and enrich domain keyword lists against actual corpus distribution #123

Closed
opened 2026-04-26 19:02:59 -07:00 by pyr0ball · 0 comments

Problem

app/services/recipe/browser_domains.py contains keyword lists for all four browse domains (cuisine, meal_type, dietary, main_ingredient). These were written as best-guesses before the corpus was fully analyzed and carry a prominent warning in the file header:

These are starter mappings based on the food.com dataset structure. Run SELECT category, count(*) FROM recipes GROUP BY category ORDER BY count(*) DESC LIMIT 50; against the corpus to verify coverage and refine keyword lists before the first production deploy.

This audit has not yet happened. The meal_type domain (tracked separately in #122) has already demonstrated the problem: categories appear empty in the UI because their keywords do not match the corpus vocabulary.
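For reference, the distribution query quoted from the file header could be run from Python roughly as follows. This is a minimal sketch that assumes the corpus lives in a SQLite database; the issue does not say which engine is used. The `recipes` table and `category` column come from the quoted query, while `category_distribution` is a hypothetical helper name.

```python
import sqlite3


def category_distribution(conn, limit=50):
    """Return (category, count) rows, most frequent first.

    Assumes a SQLite connection and the `recipes.category` column named
    in the file-header query; adjust for the real engine and schema.
    """
    cur = conn.execute(
        "SELECT category, count(*) AS n FROM recipes "
        "GROUP BY category ORDER BY n DESC, category LIMIT ?",
        (limit,),
    )
    return cur.fetchall()
```

The secondary sort on `category` only makes the ordering deterministic when two categories have the same count.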

Scope

All four domains need to be audited:

cuisine

Verify that italian, mexican, asian, american, mediterranean, indian, european, latin american category keywords match how the corpus actually categorizes those recipes. The corpus may use different spellings or compound tags.
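One way to surface alternate spellings and compound tags is to normalize both sides before comparing. The sketch below is an assumption about how that check might look, not the matching logic in browser_domains.py; `normalize` and `variants_in_corpus` are hypothetical names, and the corpus tags are passed in as a plain list.

```python
import re


def normalize(tag):
    """Lowercase and collapse runs of punctuation/whitespace to one space."""
    return re.sub(r"[^a-z0-9]+", " ", tag.lower()).strip()


def variants_in_corpus(keyword, corpus_tags):
    """Corpus tags whose normalized form contains the keyword as whole words.

    "latin american" therefore matches "Latin-American", and "italian"
    matches "Southern Italian" but not "Italianate".
    """
    pattern = re.compile(rf"\b{re.escape(normalize(keyword))}\b")
    return sorted(t for t in corpus_tags if pattern.search(normalize(t)))
```

Any tag returned here that is not already in the keyword list is a candidate for extension.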

meal_type

Tracked in #122 — near-empty categories confirmed. Fix required urgently.

dietary

Keywords like vegetarian, vegan, and gluten-free may match if the corpus uses dietary tags, but coverage may be low if the corpus does not tag dietary restrictions explicitly. This domain may need inference from ingredient lists rather than keyword matching.
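If ingredient-based inference turns out to be necessary, a first pass could be as crude as the sketch below. Everything here is illustrative: `MEAT_TERMS` is a deliberately tiny, hypothetical term list, `infer_vegetarian` is a made-up name, and the ingredient format (one free-text string per ingredient) is an assumption about the corpus.

```python
# Hypothetical, incomplete term list; a real one would come from the corpus.
MEAT_TERMS = frozenset({"chicken", "beef", "pork", "bacon", "fish", "shrimp"})


def infer_vegetarian(ingredients):
    """True when no ingredient mentions a word from MEAT_TERMS.

    Crude heuristic: whole-word match on lowercased text, commas ignored.
    An empty ingredient list also returns True, so callers may want to
    skip recipes without ingredient data.
    """
    for ingredient in ingredients:
        words = set(ingredient.lower().replace(",", " ").split())
        if words & MEAT_TERMS:
            return False
    return True
```

A heuristic like this will misclassify items such as stocks or gelatin, so its output would need spot-checking before it feeds a browse category.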

main_ingredient

Keywords like chicken, beef, pork, pasta, and vegetables are likely to overlap reasonably well with corpus category values, but this should be verified.

Tasks

  • Run corpus distribution queries (category + keywords columns)
  • Cross-reference current keyword lists against corpus vocabulary
  • Update keyword lists for each domain based on findings
  • For domains where keyword matching is insufficient (e.g. dietary), evaluate whether a secondary inference strategy is needed (e.g. ingredient-based tagging)
  • Verify each category returns at least 50 recipes after the update
  • Consider adding a browser telemetry query to identify consistently-empty categories in production (migration 020 already captures result_count per browse)
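The cross-reference and the 50-recipe check from the task list could be sketched like this. The shape of the keyword lists (a dict of category name to keywords) and the `coverage_report` helper are assumptions for illustration; the real structure in browser_domains.py may differ.

```python
from collections import Counter

MIN_RECIPES = 50  # threshold taken from the task list


def coverage_report(domain_keywords, corpus_counts, minimum=MIN_RECIPES):
    """Map each category to (matched count, meets_minimum).

    domain_keywords: {category: [keyword, ...]}  (assumed shape)
    corpus_counts:   {corpus tag: recipe count}, e.g. from a GROUP BY query
    Matching is exact and case-insensitive. Summing tag counts is only an
    upper bound if one recipe can carry several matching tags.
    """
    normalized = Counter()
    for tag, n in corpus_counts.items():
        normalized[tag.lower()] += n

    report = {}
    for category, keywords in domain_keywords.items():
        total = sum(normalized[k] for k in {kw.lower() for kw in keywords})
        report[category] = (total, total >= minimum)
    return report
```

Categories reported with `False` are the ones to extend, or to hand to a secondary inference strategy.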

Notes

  • Do NOT renumber or remove existing keywords — only extend. Existing community tags reference these keywords.
  • The browser_telemetry table (migration 020) captures result_count per domain/category/page — query it to see which categories users are hitting that return nothing.
  • Changes to keyword lists do not require a migration.
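A telemetry query for consistently-empty categories might look like the sketch below. It assumes SQLite and assumes the `browser_telemetry` columns are literally named `domain`, `category`, `page`, and `result_count`; only `result_count` and the table name are confirmed by this issue, so check migration 020 before using it.

```python
import sqlite3

# A category counts as "consistently empty" when no recorded browse,
# on any page, ever returned a result.
EMPTY_CATEGORIES_SQL = """
SELECT domain, category, COUNT(*) AS browses
FROM browser_telemetry
GROUP BY domain, category
HAVING MAX(result_count) = 0
ORDER BY browses DESC, domain, category
"""


def consistently_empty(conn):
    """Return (domain, category, browses) rows, most-browsed first."""
    return conn.execute(EMPTY_CATEGORIES_SQL).fetchall()
```

Using `MAX(result_count) = 0` across all pages avoids flagging a category just because someone paged past its last result.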
Reference: Circuit-Forge/kiwi#123