Recipe corpus scalability: pre-compute browse counts + long-term search service migration #116

New issue

Closed

opened 2026-04-21 11:22:58 -07:00 by pyr0ball · 0 comments

pyr0ball commented

2026-04-21 11:22:58 -07:00

Owner

Problem

The recipe corpus is a 3.8 GB SQLite file (/Library/Assets/kiwi/kiwi.db) mounted read-only into the cloud container. Every user request that touches the browse endpoints opens a connection to this file and runs an FTS5 MATCH query.

SQLite WAL mode supports unlimited concurrent readers, so there is no write contention — but there are real pressure points:

Memory pressure — FTS5 loads index pages into the SQLite page cache. Many concurrent browse queries fight for OS page cache and cause repeated disk reads at scale.
Per-process count cache — _COUNT_CACHE in store.py caches browse counts per keyword set, but it is in-process. Every API restart wipes it. Multiple API replicas each build their own cache independently.
No connection pooling — every request opens a new SQLite connection to the 3.8 GB file.
ATTACH overhead — per-user kiwi.db files ATTACH to the corpus DB on every request, meaning N concurrent users = N open file handles on the same large file.

Recommended fix — three phases

Phase 1 (now): Pre-computed browse counts table

Complete. app/services/recipe/browse_counts_cache.py implements the full caching layer:

browse_counts SQLite file separate from corpus and per-user DBs
Warm from cache at startup; async background refresh if stale (>7 days)
Nightly refresh loop in main.py lifespan
infer_recipe_tags.py triggers refresh at end of pipeline run
_COUNT_CACHE in store.py pre-warmed on startup — FTS never hit for known keyword sets

Phase 2 (medium term): Dedicated search sidecar

Replace the FTS5 index with Meilisearch or Typesense running as a sidecar container. Both handle concurrent search correctly, support faceted browse natively, and are purpose-built for this use case. Track separately.

Phase 3 (long term, only if needed): PostgreSQL corpus

Only if Phase 1 and 2 are insufficient at scale.

Acceptance criteria — Phase 1

browse_counts table added
Populated at startup and after pipeline run
Browse category/subcategory endpoints read from table, not live FTS
_COUNT_CACHE pre-warmed from persistent cache file
Count refresh triggered automatically at end of infer_recipe_tags.py

## Problem The recipe corpus is a 3.8 GB SQLite file (`/Library/Assets/kiwi/kiwi.db`) mounted read-only into the cloud container. Every user request that touches the browse endpoints opens a connection to this file and runs an FTS5 MATCH query. SQLite WAL mode supports unlimited concurrent readers, so there is no write contention — but there are real pressure points: 1. **Memory pressure** — FTS5 loads index pages into the SQLite page cache. Many concurrent browse queries fight for OS page cache and cause repeated disk reads at scale. 2. **Per-process count cache** — `_COUNT_CACHE` in `store.py` caches browse counts per keyword set, but it is in-process. Every API restart wipes it. Multiple API replicas each build their own cache independently. 3. **No connection pooling** — every request opens a new SQLite connection to the 3.8 GB file. 4. **ATTACH overhead** — per-user `kiwi.db` files ATTACH to the corpus DB on every request, meaning N concurrent users = N open file handles on the same large file. ## Recommended fix — three phases ### Phase 1 (now): Pre-computed browse counts table **Complete.** `app/services/recipe/browse_counts_cache.py` implements the full caching layer: - `browse_counts` SQLite file separate from corpus and per-user DBs - Warm from cache at startup; async background refresh if stale (>7 days) - Nightly refresh loop in `main.py` lifespan - `infer_recipe_tags.py` triggers refresh at end of pipeline run - `_COUNT_CACHE` in `store.py` pre-warmed on startup — FTS never hit for known keyword sets ### Phase 2 (medium term): Dedicated search sidecar Replace the FTS5 index with **Meilisearch** or **Typesense** running as a sidecar container. Both handle concurrent search correctly, support faceted browse natively, and are purpose-built for this use case. Track separately. ### Phase 3 (long term, only if needed): PostgreSQL corpus Only if Phase 1 and 2 are insufficient at scale. ## Acceptance criteria — Phase 1 - [x] `browse_counts` table added - [x] Populated at startup and after pipeline run - [x] Browse category/subcategory endpoints read from table, not live FTS - [x] `_COUNT_CACHE` pre-warmed from persistent cache file - [x] Count refresh triggered automatically at end of `infer_recipe_tags.py`