Startup vec DB schema validation: detect dimension mismatch and auto-rebuild #3

Closed
opened 2026-05-05 21:33:53 -07:00 by pyr0ball · 0 comments
Owner

Problem

When PAGEPIPER_EMBED_DIMS changes (e.g. when swapping embedding models), the sqlite-vec virtual table still has the old dimension baked into its DDL (float[768] vs float[1024]). The mismatch is not caught at startup; it surfaces only at runtime, as sqlite3.OperationalError: Dimension mismatch, when a user tries to chat or search.

This is especially disruptive for cloud instances where the vec DB is root-owned (created by the container process) and can only be deleted from inside the container.

Fix

At application startup (lifespan), read the actual dimension from the page_vecs_vecs virtual table schema and compare to PAGEPIPER_EMBED_DIMS:

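import logging
import os
import re
import sqlite3

logger = logging.getLogger(__name__)  # or reuse the module's existing logger


def _check_vec_schema(vec_db_path: str, expected_dims: int) -> None:
    """Delete the vec DB if its stored dimension differs from the configured one."""
    try:
        conn = sqlite3.connect(vec_db_path)
        try:
            row = conn.execute(
                "SELECT sql FROM sqlite_master WHERE name='page_vecs_vecs'"
            ).fetchone()
        finally:
            conn.close()  # release the file handle even if the query fails
    except sqlite3.Error:
        return  # missing or corrupt: let LocalSQLiteVecStore recreate

    # The declared dimension appears in the virtual table DDL, e.g. float[768].
    m = re.search(r'float\[(\d+)\]', row[0]) if row and row[0] else None
    if m and int(m.group(1)) != expected_dims:
        logger.warning(
            "Vec DB dimension mismatch (%s vs config %s); dropping and rebuilding",
            m.group(1), expected_dims,
        )
        # Not wrapped in try/except: a failed delete (e.g. root-owned file)
        # should surface at startup rather than be silently ignored.
        os.remove(vec_db_path)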

Call this from lifespan() before any ingest or search runs. After dropping, trigger re-ingest for all ready documents so vectors are rebuilt against the new schema.
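The startup wiring described above could look roughly like the sketch below. It assumes an async context manager style lifespan, as used by ASGI frameworks such as Starlette or FastAPI (the issue does not say which framework is in use). `make_lifespan`, `check_schema` and `reingest_ready` are hypothetical names introduced for illustration: `check_schema` stands in for `_check_vec_schema`, and `reingest_ready` stands in for whatever re-ingests the `ready` documents. Because the check returns `None`, the sketch infers a drop from the file disappearing.

```python
import os
from contextlib import asynccontextmanager
from typing import Awaitable, Callable


def make_lifespan(
    vec_db_path: str,
    expected_dims: int,
    check_schema: Callable[[str, int], None],        # e.g. _check_vec_schema
    reingest_ready: Callable[[], Awaitable[None]],   # hypothetical re-ingest hook
):
    """Build a lifespan that validates the vec DB before serving requests."""

    @asynccontextmanager
    async def lifespan(app):
        existed_before = os.path.exists(vec_db_path)
        check_schema(vec_db_path, expected_dims)
        # The check has no return value, so detect a drop by comparing
        # the file's existence before and after the call.
        if existed_before and not os.path.exists(vec_db_path):
            await reingest_ready()
        yield  # application runs here; no shutdown work in this sketch

    return lifespan
```

Passing the check and the re-ingest hook in as parameters keeps the sketch independent of the real store and makes the startup path easy to exercise in a test with stand-in functions.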

Additional items

  • Log current embed model + dims at startup so operators can confirm configuration
  • Add fast-fail validation: if PAGEPIPER_EMBED_DIMS is not a positive int, exit with a clear error message
  • Run the api container as a non-root user so the vec DB is not root-owned on the host (avoids needing docker exec to delete it)
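The first two items above could be sketched as follows. This is a minimal illustration, not the project's actual configuration code: `parse_embed_dims` and `load_embed_config` are hypothetical helpers, `PAGEPIPER_EMBED_MODEL` is an assumed variable name for the embedding model, and treating an unset `PAGEPIPER_EMBED_DIMS` as an error (rather than falling back to a default) is also an assumption.

```python
import logging
import os
import sys
from typing import Mapping, Optional, Tuple

logger = logging.getLogger("pagepiper.startup")  # logger name is an assumption


def parse_embed_dims(raw: Optional[str]) -> int:
    """Return raw as a positive int, or raise ValueError with a clear message."""
    if raw is None or not raw.strip():
        raise ValueError("PAGEPIPER_EMBED_DIMS is not set")
    try:
        dims = int(raw.strip())
    except ValueError:
        raise ValueError(
            f"PAGEPIPER_EMBED_DIMS must be an integer, got {raw!r}"
        ) from None
    if dims <= 0:
        raise ValueError(f"PAGEPIPER_EMBED_DIMS must be positive, got {dims}")
    return dims


def load_embed_config(env: Mapping[str, str] = os.environ) -> Tuple[str, int]:
    """Validate the embedding settings, log them, and exit on bad config."""
    try:
        dims = parse_embed_dims(env.get("PAGEPIPER_EMBED_DIMS"))
    except ValueError as exc:
        # Fast-fail: stop before any ingest or search can hit a runtime error.
        sys.exit(f"Configuration error: {exc}")
    # PAGEPIPER_EMBED_MODEL is a hypothetical variable name.
    model = env.get("PAGEPIPER_EMBED_MODEL", "<unset>")
    logger.info("Embedding model=%s dims=%d", model, dims)
    return model, dims
```

Running this before the schema check means a bad value stops startup with a readable message, and the validated integer can then be passed on as `expected_dims`.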
pyr0ball added this to the Alpha milestone 2026-05-06 09:03:30 -07:00
Reference: Circuit-Forge/pagepiper#3