Fingerprint-based incremental glean: skip unchanged log files on batch re-glean #30

Closed
opened 2026-05-24 22:02:57 -07:00 by pyr0ball · 0 comments
Owner

Turnstone currently re-gleans every configured log source on each batch interval, even when the files have not changed. For large static archives or slow-rotating logs this wastes I/O and CPU.

Proposed approach:

  • At glean time, compute a fingerprint (mtime + size, or a BLAKE2b hash of the last N bytes) for each source file
  • Persist fingerprints in SQLite alongside the last-gleaned offset
  • On next glean cycle, skip files whose fingerprint has not changed
  • Provide a forced re-glean via manage.sh glean --force
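
The steps above could be sketched roughly as follows. This is a minimal illustration, not Turnstone's actual code: the table name glean_state, the function names, and the 4096-byte tail size are all hypothetical, and the sketch combines mtime, size, and a BLAKE2b tail hash into one fingerprint rather than choosing between them.

```python
import hashlib
import os
import sqlite3

TAIL_BYTES = 4096  # hypothetical value for "N"; not specified in the proposal


def fingerprint(path, tail_bytes=TAIL_BYTES):
    """Combine mtime, size, and a BLAKE2b hash of the file's last bytes."""
    st = os.stat(path)
    with open(path, "rb") as f:
        # Seek to the start of the tail (or the start of the file if it is short).
        f.seek(max(0, st.st_size - tail_bytes))
        tail = f.read(tail_bytes)
    digest = hashlib.blake2b(tail, digest_size=16).hexdigest()
    return f"{st.st_mtime_ns}:{st.st_size}:{digest}"


def open_state(db_path=":memory:"):
    """Open (or create) the hypothetical state table holding fingerprints and offsets."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS glean_state ("
        " path TEXT PRIMARY KEY,"
        " fingerprint TEXT NOT NULL,"
        " last_offset INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def should_glean(conn, path, force=False):
    """Return (needs_glean, current_fingerprint) for one source file."""
    fp = fingerprint(path)
    row = conn.execute(
        "SELECT fingerprint FROM glean_state WHERE path = ?", (path,)
    ).fetchone()
    # Glean when forced, when the file is new, or when its fingerprint changed.
    changed = force or row is None or row[0] != fp
    return changed, fp


def record_glean(conn, path, fp, offset):
    """Persist the fingerprint alongside the last-gleaned offset."""
    conn.execute(
        "INSERT OR REPLACE INTO glean_state (path, fingerprint, last_offset)"
        " VALUES (?, ?, ?)",
        (path, fp, offset),
    )
    conn.commit()
```

A glean loop would call should_glean for each source, skip the file when it returns False, and call record_glean after a successful pass. The force argument stands in for the proposed --force flag.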

Impact: Makes the 15-minute glean interval viable even for sources with hundreds of MB of static history. Particularly useful for corpus import workflows.

Reference: https://github.com/Lum1104/Understand-Anything (fingerprint-based incremental update pattern)

Reference: Circuit-Forge/turnstone#30