feat: synthetic log corpus generator for demo and testing #46

Closed
opened 2026-05-26 23:05:25 -07:00 by pyr0ball · 0 comments
Owner

Turnstone demos and integration tests currently rely on either real infrastructure logs (not shareable) or manually crafted fixtures (limited volume and variety).

Add a synthetic corpus generator, scripts/gen_corpus.py, that produces realistic but entirely artificial log data:

Generators needed:

  • Journald JSON output (system services, systemd units, kernel messages)
  • Docker stdout (container lifecycle, application logs)
  • AVCX device logs (format per parser issue)
  • qBittorrent app logs (tracker events, download completions, errors)
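As a rough illustration of the first generator, here is a minimal sketch of a Journald-style JSON record builder. The field names (`__REALTIME_TIMESTAMP`, `_SYSTEMD_UNIT`, `PRIORITY`, `MESSAGE`, and so on) are standard journal fields, and `journalctl -o json` emits their values as strings. Everything else is an assumption made for illustration: the function name `journald_record`, the host name, the unit list, and the message templates are hypothetical and not part of Turnstone.

```python
import json
import random
from datetime import datetime, timezone

# Hypothetical units and message templates, for illustration only.
UNITS = {
    "sshd.service": [
        "Accepted publickey for demo from 10.0.0.{n} port 22",
        "Connection closed by 10.0.0.{n} port 22",
    ],
    "nginx.service": [
        "worker process {n} exited with code 0",
        "upstream timed out while reading response header",
    ],
}

# Syslog priorities: 3 = err, 4 = warning, 6 = info. Info is weighted heavier.
PRIORITIES = [3, 4, 6, 6, 6]


def journald_record(rng, ts):
    """Build one synthetic journal entry shaped like `journalctl -o json` output."""
    unit = rng.choice(sorted(UNITS))
    template = rng.choice(UNITS[unit])
    return {
        # Microseconds since the Unix epoch, serialized as a string.
        "__REALTIME_TIMESTAMP": str(int(ts.timestamp() * 1_000_000)),
        "_HOSTNAME": "demo-host",  # assumed placeholder host name
        "_SYSTEMD_UNIT": unit,
        "SYSLOG_IDENTIFIER": unit.split(".")[0],
        "_PID": str(rng.randint(100, 65535)),
        "PRIORITY": str(rng.choice(PRIORITIES)),
        "MESSAGE": template.format(n=rng.randint(1, 254)),
    }


rng = random.Random(42)
record = journald_record(rng, datetime(2026, 1, 1, tzinfo=timezone.utc))
line = json.dumps(record)  # one JSONL line
```

Kernel messages and the other three formats would need their own builders; this sketch only covers unit-scoped service entries.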

Configuration:

  • Controllable time range, event density, error injection rate
  • Seed-based for reproducible test corpora
  • Output to JSONL / plain log files ready for glean_corpus.py
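The configuration knobs above could fit together roughly as follows. This is a sketch under stated assumptions, not a proposed implementation: `event_stream` and its parameter names are hypothetical, event density is modeled as a Poisson process (exponential gaps between events), and error injection is a simple per-event coin flip. A dedicated `random.Random(seed)` instance keeps the output reproducible without touching global random state.

```python
import random
from datetime import datetime, timedelta, timezone


def event_stream(start, days, events_per_hour, error_rate, seed):
    """Yield (timestamp, severity) pairs covering `days` days from `start`.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)  # private RNG so the same seed gives the same corpus
    end = start + timedelta(days=days)
    mean_gap_seconds = 3600.0 / events_per_hour
    ts = start
    while True:
        # Exponential inter-arrival times give a bursty, realistic-looking timeline.
        ts += timedelta(seconds=rng.expovariate(1.0 / mean_gap_seconds))
        if ts >= end:
            break
        severity = "error" if rng.random() < error_rate else "info"
        yield ts, severity


start = datetime(2026, 1, 1, tzinfo=timezone.utc)
events = list(event_stream(start, days=1, events_per_hour=60, error_rate=0.05, seed=7))
```

Running the generator twice with the same arguments yields identical sequences, which is the property regression tests would depend on.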

Use cases:

  • Demo environment that looks real without exposing any production data
  • Benchmark / load testing the glean pipeline
  • Regression tests for parsers without live infrastructure

Acceptance criteria: python scripts/gen_corpus.py --days 7 --out /tmp/demo-corpus/ produces a gleanable corpus with varied severity distribution.
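A minimal sketch of what the command-line entry point might look like. Only `--days` and `--out` come from the acceptance criteria; the `--seed` flag, the output file name `journald.jsonl`, the fixed ten events per day, and the record shape are assumptions made for illustration.

```python
import argparse
import json
import random
from pathlib import Path


def build_parser():
    parser = argparse.ArgumentParser(description="Generate a synthetic log corpus.")
    parser.add_argument("--days", type=int, default=7, help="days of logs to generate")
    parser.add_argument("--out", type=Path, required=True, help="output directory")
    # Hypothetical flag, not named in the issue: fixes the RNG for reproducibility.
    parser.add_argument("--seed", type=int, default=0, help="random seed")
    return parser


def main(argv=None):
    """Parse arguments and write a small JSONL file; returns the path written."""
    args = build_parser().parse_args(argv)
    args.out.mkdir(parents=True, exist_ok=True)
    rng = random.Random(args.seed)
    path = args.out / "journald.jsonl"  # assumed file name
    with path.open("w", encoding="utf-8") as fh:
        for day in range(args.days):
            for _ in range(10):  # placeholder density: ten events per day
                record = {
                    "day": day,
                    # Mixed priorities so the corpus has a varied severity spread.
                    "PRIORITY": str(rng.choice([3, 4, 6, 6, 6])),
                    "MESSAGE": "synthetic event",
                }
                fh.write(json.dumps(record) + "\n")
    return path
```

Calling `main(["--days", "7", "--out", "/tmp/demo-corpus/"])` mirrors the command in the acceptance criteria; a real script would call `main()` under an `if __name__ == "__main__":` guard so it reads `sys.argv`.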

pyr0ball added this to the Enterprise POC Deliverable milestone 2026-05-26 23:05:25 -07:00
pyr0ball added the demo, parser, enhancement labels 2026-05-26 23:05:25 -07:00
Reference: Circuit-Forge/turnstone#46