- Rename patterns/sources-example-node.yaml → patterns/sources-example.yaml
and update header/comments to be host-agnostic
- Replace internal node names in gen_corpus.py _HOSTS with generic names
- Replace example-node hostname in syslog test fixtures with testhost
- Replace example-node example in mcp_server.py doc with myserver
- Replace private LAN IP (<YOUR_HOST_IP>) in docker-standalone.sh with
<HEIMDALL_LAN_IP> placeholder
- Replace private IPs in sources-cluster.yaml comments with <YOUR_HOST_IP>
- Remove instance-specific hostname from llm.py fallback comment
- Replace Caddy example domain in podman-standalone.sh with placeholder
- Changed glob → rglob in glean_dir so corpus directories with format
subfolders (journald/, docker/, etc.) are fully ingested
- Fixed gen_corpus.py docker SOURCE to emit "docker:<service>" prefix
so the pipeline correctly detects format as 'docker' not 'plaintext'
- 17/17 gen_corpus tests passing
Closes: #46
Adds scripts/gen_corpus.py that produces realistic-but-artificial log
files across all four supported formats (journald JSON, docker envelope,
qBittorrent hotio, EXT_DEVICE plaintext). Output feeds directly into
glean_corpus.py for demo environments and parser regression tests with
no production data required.
- Seed-based RNG with independent per-source sub-streams (same seed =
same sequence for each file regardless of source count changes)
- Controllable time range, event density, and error injection rate
- Severity distribution mirrors real infrastructure (70% INFO, ~6% ERROR,
~2% CRITICAL) with adjustable boost via --error-rate
- 17 tests covering output structure, reproducibility, format correctness,
parser round-trip, and CLI acceptance criteria
Also fixes a latent bug in app/glean/plaintext.py: ISO 8601 timestamps
were silently failing to parse because the T separator was normalised to
space in the input string but the strptime format string still contained T.
Fix: apply the same normalisation to the format before calling strptime.
Closes: #46