turnstone/app/glean
pyr0ball 99b44ddb81 feat(corpus): synthetic log corpus generator for demos and testing
Adds scripts/gen_corpus.py that produces realistic-but-artificial log
files across all four supported formats (journald JSON, docker envelope,
qBittorrent hotio, AVCX plaintext). Output feeds directly into
glean_corpus.py for demo environments and parser regression tests with
no production data required.

- Seed-based RNG with independent per-source sub-streams (the same seed
  yields the same sequence for each file, even when the number of
  sources changes)
- Controllable time range, event density, and error injection rate
- Severity distribution mirrors real infrastructure (70% INFO, ~6% ERROR,
  ~2% CRITICAL) with adjustable boost via --error-rate
- 17 tests covering output structure, reproducibility, format correctness,
  parser round-trip, and CLI acceptance criteria
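
One way to get the per-source independence described in the first bullet is to derive each source's RNG from a hash of the seed and the source name. This is a minimal sketch under that assumption, not the code in scripts/gen_corpus.py; the names `source_rng` and `draw` are hypothetical.

```python
import hashlib
import random


def source_rng(seed: int, source: str) -> random.Random:
    """Return an RNG whose stream depends only on (seed, source).

    Hypothetical helper: the sub-seed is taken from a SHA-256 digest so it
    does not depend on how many other sources exist or on their order.
    """
    digest = hashlib.sha256(f"{seed}:{source}".encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def draw(seed: int, sources: list[str], n: int = 3) -> dict[str, list[int]]:
    """Draw n integers per source, each from its own sub-stream."""
    out: dict[str, list[int]] = {}
    for name in sources:
        rng = source_rng(seed, name)
        out[name] = [rng.randrange(1000) for _ in range(n)]
    return out


if __name__ == "__main__":
    one = draw(42, ["journald"])
    two = draw(42, ["docker", "journald"])
    # The journald stream is unchanged by adding another source.
    print(one["journald"] == two["journald"])
```

Sharing a single `random.Random` across sources would not have this property, because every value drawn for one source shifts the sequence seen by the next.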

Also fixes a latent bug in app/glean/plaintext.py: ISO 8601 timestamps
were silently failing to parse because the T separator was normalised to
space in the input string but the strptime format string still contained T.
Fix: apply the same normalisation to the format before calling strptime.
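
The failure mode can be reproduced in isolation. The sketch below is illustrative only and is not taken from app/glean/plaintext.py; `ISO_FORMAT`, `parse_before_fix`, and `parse_after_fix` are hypothetical names, and returning None on ValueError is an assumption standing in for the "silent" failure.

```python
from datetime import datetime
from typing import Optional

# Hypothetical format constant; the real parser's format list may differ.
ISO_FORMAT = "%Y-%m-%dT%H:%M:%S"


def parse_before_fix(raw: str, fmt: str = ISO_FORMAT) -> Optional[datetime]:
    """Normalise only the input: the format still expects a literal T."""
    text = raw.replace("T", " ")
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        return None  # swallowed error, so the timestamp is silently dropped


def parse_after_fix(raw: str, fmt: str = ISO_FORMAT) -> Optional[datetime]:
    """Apply the same T-to-space normalisation to input and format."""
    text = raw.replace("T", " ")
    try:
        return datetime.strptime(text, fmt.replace("T", " "))
    except ValueError:
        return None


if __name__ == "__main__":
    stamp = "2026-06-11T10:57:20"
    print(parse_before_fix(stamp))  # None
    print(parse_after_fix(stamp))   # 2026-06-11 10:57:20
```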

Closes: #46
2026-06-11 10:57:20 -07:00
__init__.py refactor: rename ingest → glean throughout codebase 2026-05-20 23:02:55 -07:00
base.py feat: domain-view mapping for patterns and diagnose output (#32) 2026-06-01 19:57:16 -07:00
caddy.py refactor: rename ingest → glean throughout codebase 2026-05-20 23:02:55 -07:00
dmesg_log.py refactor: rename ingest → glean throughout codebase 2026-05-20 23:02:55 -07:00
doc_upload.py feat: dual-backend SQLite/Postgres + multi-tenant source namespacing 2026-06-08 08:37:54 -07:00
docker_log.py refactor: rename ingest → glean throughout codebase 2026-05-20 23:02:55 -07:00
journald.py refactor: rename ingest → glean throughout codebase 2026-05-20 23:02:55 -07:00
mqtt_subscriber.py fix(db): add timeout=30s to all sqlite3.connect() calls across app 2026-05-26 23:12:48 -07:00
pipeline.py feat: dual-backend SQLite/Postgres + multi-tenant source namespacing 2026-06-08 08:37:54 -07:00
plaintext.py feat(corpus): synthetic log corpus generator for demos and testing 2026-06-11 10:57:20 -07:00
plex.py refactor: rename ingest → glean throughout codebase 2026-05-20 23:02:55 -07:00
qbittorrent.py refactor: rename ingest → glean throughout codebase 2026-05-20 23:02:55 -07:00
servarr.py refactor: rename ingest → glean throughout codebase 2026-05-20 23:02:55 -07:00
ssh.py feat: SSH remote host glean — transport layer and pipeline integration (closes #22, backend) 2026-05-20 23:03:13 -07:00
syslog.py refactor: rename ingest → glean throughout codebase 2026-05-20 23:02:55 -07:00
tautulli.py refactor: rename ingest → glean throughout codebase 2026-05-20 23:02:55 -07:00
wazuh.py refactor: rename ingest → glean throughout codebase 2026-05-20 23:02:55 -07:00