DB write contention: FTS5 batch inserts starve incident API and watcher flushes #60
Labels
No labels
compliance
demo
deployment
docs
enhancement
parser
patterns
performance
security
ux
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference: Circuit-Forge/turnstone#60
Loading…
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Problem
When the log watcher flushes a large burst of entries (e.g. sonarr/radarr at high volume), the FTS5 index update runs inside the same transaction as the
log_entriesinsert. This holds an exclusive SQLite write lock for the full duration of that FTS rebuild — long enough to starve:/api/incidentsPOST endpoint (returns 500 after the 30s SQLite timeout)Flush error for 'docker:sonarr': database is locked)Observed in production
The
POST /api/incidentsendpoint required 4 retries with a 60s HTTP timeout to finally land. Earlier attempts with a 15s timeout all failed.Root cause
SQLite FTS5 index updates are synchronous within the inserting transaction. A batch of e.g. 500 log entries from a chatty source will hold the write lock for the entire FTS rebuild — potentially several seconds. During that window, every other write (incident creation, concurrent flush) hits
database is lockedand either waits or times out.This is the same problem that motivated splitting
context_factsinto a dedicated DB file (turnstone-context.db). The comment inpipeline.pyline 198 even documents it:Options
Option A: Separate incidents into their own DB file (recommended)
Same pattern already used for
context_facts. Incidents are low-volume and low-latency-sensitive — isolating them means the API is never blocked by log ingestion. Requires updatingDB_PATHreference increate_incident/list_incidents/get_incident.Option B: Decouple FTS indexing into a background pass
Insert into
log_entrieswithout updating FTS, then run a periodic FTS sync job. Reduces per-flush lock hold time to the raw INSERT cost. More complex — requires afts_sync_cursoror similar to track what's indexed.Option C: Use
content_rowidFTS5 external content tableStore content in
log_entries, point FTS at it as an external content table. FTS index updates still happen at insert time but don't duplicate the content, reducing write volume. Partial improvement only.Option D: Raise SQLite busy timeout
Band-aid only. Doesn't fix the lock; just makes callers wait longer before failing.
Recommended fix
Option A for incidents (isolated DB) is the lowest-risk change and directly mirrors the existing
context_factspattern. Option B is the right long-term fix for FTS but is more invasive.Impact on training data
The incident tagging workflow (used to label real-world patterns for future model training) is blocked by this when the cluster is under load. This needs to be reliable before the training data pipeline can depend on it.