DB write contention: FTS5 batch inserts starve incident API and watcher flushes #60

New issue

Closed

opened 2026-05-31 22:52:17 -07:00 by pyr0ball · 0 comments

pyr0ball commented

2026-05-31 22:52:17 -07:00

Owner

Problem

When the log watcher flushes a large burst of entries (e.g. sonarr/radarr at high volume), the FTS5 index update runs inside the same transaction as the log_entries insert. This holds an exclusive SQLite write lock for the full duration of that FTS rebuild — long enough to starve:

The /api/incidents POST endpoint (returns 500 after the 30s SQLite timeout)
The watcher's own next flush (also times out, logged as Flush error for 'docker:sonarr': database is locked)

Observed in production

turnstone-cluster[1671653]: sqlite3.OperationalError: database is locked
turnstone-cluster[1671653]: Flush error for 'docker:radarr': database is locked
turnstone-cluster[1671653]: Flush error for 'docker:sonarr': database is locked
turnstone-cluster[1671653]: Flush error for 'network-syslog': database is locked

The POST /api/incidents endpoint required 4 retries with a 60s HTTP timeout to finally land. Earlier attempts with a 15s timeout all failed.

Root cause

SQLite FTS5 index updates are synchronous within the inserting transaction. A batch of e.g. 500 log entries from a chatty source will hold the write lock for the entire FTS rebuild — potentially several seconds. During that window, every other write (incident creation, concurrent flush) hits database is locked and either waits or times out.

This is the same problem that motivated splitting context_facts into a dedicated DB file (turnstone-context.db). The comment in pipeline.py line 198 even documents it:

"can hold the main DB write lock for seconds at a time when flushing large journal batches"

Options

Option A: Separate incidents into their own DB file (recommended)

Same pattern already used for context_facts. Incidents are low-volume and low-latency-sensitive — isolating them means the API is never blocked by log ingestion. Requires updating DB_PATH reference in create_incident / list_incidents / get_incident.

Option B: Decouple FTS indexing into a background pass

Insert into log_entries without updating FTS, then run a periodic FTS sync job. Reduces per-flush lock hold time to the raw INSERT cost. More complex — requires a fts_sync_cursor or similar to track what's indexed.

Option C: Use `content_rowid` FTS5 external content table

Store content in log_entries, point FTS at it as an external content table. FTS index updates still happen at insert time but don't duplicate the content, reducing write volume. Partial improvement only.

Option D: Raise SQLite busy timeout

Band-aid only. Doesn't fix the lock; just makes callers wait longer before failing.

Recommended fix

Option A for incidents (isolated DB) is the lowest-risk change and directly mirrors the existing context_facts pattern. Option B is the right long-term fix for FTS but is more invasive.

Impact on training data

The incident tagging workflow (used to label real-world patterns for future model training) is blocked by this when the cluster is under load. This needs to be reliable before the training data pipeline can depend on it.

## Problem When the log watcher flushes a large burst of entries (e.g. sonarr/radarr at high volume), the FTS5 index update runs inside the same transaction as the `log_entries` insert. This holds an exclusive SQLite write lock for the full duration of that FTS rebuild — long enough to starve: - The `/api/incidents` POST endpoint (returns 500 after the 30s SQLite timeout) - The watcher's own next flush (also times out, logged as `Flush error for 'docker:sonarr': database is locked`) ### Observed in production ``` turnstone-cluster[1671653]: sqlite3.OperationalError: database is locked turnstone-cluster[1671653]: Flush error for 'docker:radarr': database is locked turnstone-cluster[1671653]: Flush error for 'docker:sonarr': database is locked turnstone-cluster[1671653]: Flush error for 'network-syslog': database is locked ``` The `POST /api/incidents` endpoint required 4 retries with a 60s HTTP timeout to finally land. Earlier attempts with a 15s timeout all failed. ## Root cause SQLite FTS5 index updates are synchronous within the inserting transaction. A batch of e.g. 500 log entries from a chatty source will hold the write lock for the entire FTS rebuild — potentially several seconds. During that window, every other write (incident creation, concurrent flush) hits `database is locked` and either waits or times out. This is the same problem that motivated splitting `context_facts` into a dedicated DB file (`turnstone-context.db`). The comment in `pipeline.py` line 198 even documents it: > "can hold the main DB write lock for seconds at a time when flushing large journal batches" ## Options ### Option A: Separate incidents into their own DB file (recommended) Same pattern already used for `context_facts`. Incidents are low-volume and low-latency-sensitive — isolating them means the API is never blocked by log ingestion. Requires updating `DB_PATH` reference in `create_incident` / `list_incidents` / `get_incident`. ### Option B: Decouple FTS indexing into a background pass Insert into `log_entries` without updating FTS, then run a periodic FTS sync job. Reduces per-flush lock hold time to the raw INSERT cost. More complex — requires a `fts_sync_cursor` or similar to track what's indexed. ### Option C: Use `content_rowid` FTS5 external content table Store content in `log_entries`, point FTS at it as an external content table. FTS index updates still happen at insert time but don't duplicate the content, reducing write volume. Partial improvement only. ### Option D: Raise SQLite busy timeout Band-aid only. Doesn't fix the lock; just makes callers wait longer before failing. ## Recommended fix Option A for incidents (isolated DB) is the lowest-risk change and directly mirrors the existing `context_facts` pattern. Option B is the right long-term fix for FTS but is more invasive. ## Impact on training data The incident tagging workflow (used to label real-world patterns for future model training) is blocked by this when the cluster is under load. This needs to be reliable before the training data pipeline can depend on it.