Recipe corpus scalability: pre-compute browse counts + long-term search service migration #116

Closed
opened 2026-04-21 11:22:58 -07:00 by pyr0ball · 0 comments
Owner

Problem

The recipe corpus is a 3.8 GB SQLite file (/Library/Assets/kiwi/kiwi.db) mounted read-only into the cloud container. Every user request that touches the browse endpoints opens a connection to this file and runs an FTS5 MATCH query.

SQLite WAL mode supports unlimited concurrent readers, so there is no write contention — but there are real pressure points:

  1. Memory pressure — FTS5 loads index pages into the SQLite page cache. Many concurrent browse queries fight for OS page cache and cause repeated disk reads at scale.
  2. Per-process count cache_COUNT_CACHE in store.py caches browse counts per keyword set, but it is in-process. Every API restart wipes it. Multiple API replicas each build their own cache independently.
  3. No connection pooling — every request opens a new SQLite connection to the 3.8 GB file.
  4. ATTACH overhead — per-user kiwi.db files ATTACH to the corpus DB on every request, meaning N concurrent users = N open file handles on the same large file.

Phase 1 (now): Pre-computed browse counts table

Complete. app/services/recipe/browse_counts_cache.py implements the full caching layer:

  • browse_counts SQLite file separate from corpus and per-user DBs
  • Warm from cache at startup; async background refresh if stale (>7 days)
  • Nightly refresh loop in main.py lifespan
  • infer_recipe_tags.py triggers refresh at end of pipeline run
  • _COUNT_CACHE in store.py pre-warmed on startup — FTS never hit for known keyword sets

Phase 2 (medium term): Dedicated search sidecar

Replace the FTS5 index with Meilisearch or Typesense running as a sidecar container. Both handle concurrent search correctly, support faceted browse natively, and are purpose-built for this use case. Track separately.

Phase 3 (long term, only if needed): PostgreSQL corpus

Only if Phase 1 and 2 are insufficient at scale.

Acceptance criteria — Phase 1

  • browse_counts table added
  • Populated at startup and after pipeline run
  • Browse category/subcategory endpoints read from table, not live FTS
  • _COUNT_CACHE pre-warmed from persistent cache file
  • Count refresh triggered automatically at end of infer_recipe_tags.py
## Problem The recipe corpus is a 3.8 GB SQLite file (`/Library/Assets/kiwi/kiwi.db`) mounted read-only into the cloud container. Every user request that touches the browse endpoints opens a connection to this file and runs an FTS5 MATCH query. SQLite WAL mode supports unlimited concurrent readers, so there is no write contention — but there are real pressure points: 1. **Memory pressure** — FTS5 loads index pages into the SQLite page cache. Many concurrent browse queries fight for OS page cache and cause repeated disk reads at scale. 2. **Per-process count cache** — `_COUNT_CACHE` in `store.py` caches browse counts per keyword set, but it is in-process. Every API restart wipes it. Multiple API replicas each build their own cache independently. 3. **No connection pooling** — every request opens a new SQLite connection to the 3.8 GB file. 4. **ATTACH overhead** — per-user `kiwi.db` files ATTACH to the corpus DB on every request, meaning N concurrent users = N open file handles on the same large file. ## Recommended fix — three phases ### Phase 1 (now): Pre-computed browse counts table **Complete.** `app/services/recipe/browse_counts_cache.py` implements the full caching layer: - `browse_counts` SQLite file separate from corpus and per-user DBs - Warm from cache at startup; async background refresh if stale (>7 days) - Nightly refresh loop in `main.py` lifespan - `infer_recipe_tags.py` triggers refresh at end of pipeline run - `_COUNT_CACHE` in `store.py` pre-warmed on startup — FTS never hit for known keyword sets ### Phase 2 (medium term): Dedicated search sidecar Replace the FTS5 index with **Meilisearch** or **Typesense** running as a sidecar container. Both handle concurrent search correctly, support faceted browse natively, and are purpose-built for this use case. Track separately. ### Phase 3 (long term, only if needed): PostgreSQL corpus Only if Phase 1 and 2 are insufficient at scale. ## Acceptance criteria — Phase 1 - [x] `browse_counts` table added - [x] Populated at startup and after pipeline run - [x] Browse category/subcategory endpoints read from table, not live FTS - [x] `_COUNT_CACHE` pre-warmed from persistent cache file - [x] Count refresh triggered automatically at end of `infer_recipe_tags.py`
pyr0ball added the
backlog
enhancement
labels 2026-04-21 11:22:58 -07:00
pyr0ball added this to the Beta — Recipe Suggestions milestone 2026-04-24 16:09:31 -07:00
Sign in to join this conversation.
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: Circuit-Forge/kiwi#116
No description provided.