turnstone/docs/compliance/checklist.md

# Turnstone Compliance Checklist

**Last reviewed:** 2026-05-28
**Applies to:** All deployments handling log data in compliance-sensitive environments.

Symbols: ✅ satisfied by code, ⚙️ operator action required, ⚠️ known limitation, 🔲 not implemented.

---

## Data Isolation

### Source-level query isolation
✅ **`source_filter` enforced on all log-returning endpoints.**
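Every endpoint that returns log entries accepts a `source` parameter. Both the FTS5 keyword search path and the time-window scan path apply `source_id LIKE ?` before returning results, so a request scoped to one source does not return entries from another. Because `LIKE` treats `%` and `_` as wildcards, this guarantee depends on the `source` value being validated or escaped before it is bound.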

Relevant code: `app/services/search.py` — `search()` and `entries_in_window()`.
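
As an illustration of a source-scoped query, here is a minimal sketch that assumes a hypothetical `entries(source_id, text)` table rather than Turnstone's actual schema. The `escape_like` and `entries_for_source` helpers are illustrative names, not functions from `app/services/search.py`. The sketch shows the filter bound as a parameter and the `LIKE` wildcards neutralized with an `ESCAPE` clause.

```python
import sqlite3


def escape_like(value: str, escape: str = "\\") -> str:
    """Escape LIKE wildcards so the bound value only matches itself."""
    return (
        value.replace(escape, escape + escape)
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )


def entries_for_source(conn: sqlite3.Connection, source: str) -> list:
    # Hypothetical table and column names; the source is always a bound parameter.
    sql = "SELECT source_id, text FROM entries WHERE source_id LIKE ? ESCAPE '\\'"
    return conn.execute(sql, (escape_like(source),)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (source_id TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO entries VALUES (?, ?)",
    [("heimdall-journal", "disk full"), ("plex", "transcode failed")],
)

scoped = entries_for_source(conn, "plex")   # only rows for the "plex" source
wildcard = entries_for_source(conn, "%")    # "%" is matched literally after escaping
```

Without the escaping step, a `source` value of `%` would match every source, which is the case the escape is meant to rule out.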

### FTS5 cross-source leakage
✅ **FTS5 index includes `source_id` as an UNINDEXED column; all queries filter on it.**
The virtual table schema stores `source_id` alongside each entry. Query functions always join back to the base table or filter the FTS result set by `source_id`. There is no full-corpus FTS path that ignores source.
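
The schema idea can be sketched as follows. This is an assumption-laden example: the table name `entries_fts`, the column `text`, and the helper `search_source` are hypothetical, and the sketch filters with equality rather than `LIKE` for simplicity. It requires a Python build whose SQLite includes FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# source_id is stored with each row but is not added to the full-text index.
conn.execute(
    "CREATE VIRTUAL TABLE entries_fts USING fts5(text, source_id UNINDEXED)"
)
conn.executemany(
    "INSERT INTO entries_fts (text, source_id) VALUES (?, ?)",
    [
        ("disk full on var", "heimdall-journal"),
        ("disk cache rebuilt", "plex"),
    ],
)


def search_source(conn: sqlite3.Connection, query: str, source: str) -> list:
    # Keyword match and source restriction are applied in the same statement.
    sql = (
        "SELECT text, source_id FROM entries_fts "
        "WHERE entries_fts MATCH ? AND source_id = ? ORDER BY rank"
    )
    return conn.execute(sql, (query, source)).fetchall()


plex_hits = search_source(conn, "disk", "plex")
# The source name itself is not searchable because the column is UNINDEXED.
name_hits = search_source(conn, "heimdall", "heimdall-journal")
```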

### SQLite file permissions
⚙️ **Operator responsibility — not enforced by Turnstone.**
Turnstone does not set file permissions on the database. Recommended posture for multi-user hosts:

```bash
# Restrict the data directory and DB files to the Turnstone process user only
chown -R turnstone:turnstone /devl/turnstone-cluster/data/
chmod 700 /devl/turnstone-cluster/data/
chmod 600 /devl/turnstone-cluster/data/turnstone.db
chmod 600 /devl/turnstone-cluster/data/turnstone-context.db
```

Run Turnstone as a dedicated non-root user via systemd `User=turnstone`.
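
Since Turnstone does not enforce these permissions itself, an operator may want a small preflight check. The sketch below is one possible approach for POSIX hosts; `is_owner_only` is a hypothetical helper, and the demonstration runs against a temporary file standing in for the database.

```python
import os
import stat
import tempfile


def is_owner_only(path: str) -> bool:
    """Return True if group and other have no permission bits set on path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


# Demonstration on a temporary file (a stand-in for turnstone.db).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_path = tmp.name

os.chmod(demo_path, 0o644)
before = is_owner_only(demo_path)  # group/other can still read

os.chmod(demo_path, 0o600)
after = is_owner_only(demo_path)   # owner read/write only

os.remove(demo_path)
```

In a deployment, the same function could be pointed at the real database paths before the service starts.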

---

## Audit Logging

### API query logging
✅ **Implemented as FastAPI middleware (`turnstone.audit` logger).**
Every request to `/turnstone/api/*` is logged at INFO level with:
- Timestamp (from the logging handler)
- HTTP method
- Path + query string
- Response status code
- Request duration (ms)

Body content is never logged. Example output:
```
2026-05-28 14:23:01 INFO turnstone.audit  GET /turnstone/api/diagnose/stream?source=heimdall-journal 200 1843ms
```
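
For readers unfamiliar with how such middleware captures these fields, here is a minimal sketch written as plain ASGI middleware. It is not Turnstone's implementation: the class name `AuditMiddleware` and the `prefix` parameter are assumptions, and only the logger name and logged fields come from the description above.

```python
import logging
import time

audit_log = logging.getLogger("turnstone.audit")


class AuditMiddleware:
    """Plain ASGI middleware that logs method, path, query, status and duration."""

    def __init__(self, app, prefix: str = "/turnstone/api/"):
        self.app = app
        self.prefix = prefix

    async def __call__(self, scope, receive, send):
        # Pass through anything that is not an HTTP request under the prefix.
        if scope["type"] != "http" or not scope["path"].startswith(self.prefix):
            await self.app(scope, receive, send)
            return

        status = {"code": 0}
        start = time.perf_counter()

        async def send_wrapper(message):
            # The status code arrives in the first response message.
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            elapsed_ms = round((time.perf_counter() - start) * 1000)
            query = scope.get("query_string", b"").decode("latin-1")
            target = scope["path"] + ("?" + query if query else "")
            # Only request metadata is logged; the body is never read here.
            audit_log.info(
                "%s %s %d %dms", scope["method"], target, status["code"], elapsed_ms
            )
```

Starlette-based frameworks such as FastAPI accept plain ASGI middleware classes of this shape through `app.add_middleware(AuditMiddleware)`.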

To capture audit logs to a separate file, configure the `turnstone.audit` logger in your logging config:
```python
# In your uvicorn startup code (or the equivalent handler entry in a logging config):
import logging

logging.getLogger("turnstone.audit").addHandler(
    logging.FileHandler("/var/log/turnstone/audit.log")
)
```

### Glean operation logging
✅ **Glean scheduler logs source ID, entry count, and duration at INFO level.**
Relevant logger: `app.tasks.glean_scheduler` — logs start, per-source stats, and errors.
Log example:
```
INFO app.tasks.glean_scheduler  Batch glean complete in 12.4s — {'heimdall-journal': 847, 'plex': 12}
```

### Error logging
✅ **Errors logged with source context but without PII in message fields.**
Exception handlers in `rest.py` log at ERROR level with the endpoint path and error type. Raw log entry text is not included in error messages. Stack traces go to the `uvicorn.error` logger.

---

## LLM / PII Egress

### Multi-agent pipeline (recommended path, `TURNSTONE_MULTI_AGENT_DIAGNOSE=true`)
✅ **Raw log message text is NOT sent to the LLM.**
Stage 5 (synthesizer) sends only:
- The operator's query string
- Timeline statistics (cluster counts, burst counts, gap counts — no entry text)
- Hypothesis titles from Stage 3 (derived labels, not raw messages)
- Runbook context from the operator's own uploaded documents

No raw `MESSAGE` field content reaches the LLM in this path. Review: `app/services/diagnose/synthesizer.py`.
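
The four inputs listed above amount to an allow-list. A rough sketch of that shape, using hypothetical class and field names that are not taken from `synthesizer.py`, might look like this:

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TimelineStats:
    # Counts only; no entry text is carried here.
    cluster_count: int
    burst_count: int
    gap_count: int


@dataclass(frozen=True)
class SynthesizerInput:
    query: str                           # the operator's query string
    stats: TimelineStats                 # aggregate timeline statistics
    hypothesis_titles: tuple[str, ...]   # derived labels from Stage 3
    runbook_context: str                 # operator-uploaded documents


def build_payload(data: SynthesizerInput) -> dict:
    """Serialize only the allow-listed fields; there is no slot for raw messages."""
    return asdict(data)


payload = build_payload(
    SynthesizerInput(
        query="why did plex stop transcoding?",
        stats=TimelineStats(cluster_count=4, burst_count=1, gap_count=2),
        hypothesis_titles=("Transcoder crash loop",),
        runbook_context="Restart procedure for media services.",
    )
)
```

Modeling the input as a fixed structure means raw entry text cannot be attached by accident, because the structure has no field for it.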

### Legacy single-call path (`TURNSTONE_MULTI_AGENT_DIAGNOSE` unset or `false`)
⚠️ **Raw log message text (truncated to 200 chars) IS sent to the LLM.**
The legacy `summarize()` function in `app/services/llm.py` builds a prompt that includes up to 25 log entries with their `text` field (truncated). If log entries contain hostnames, usernames, IP addresses, or other PII, those values are included in the LLM call.

**Operator action for PII-sensitive deployments:** Enable `TURNSTONE_MULTI_AGENT_DIAGNOSE=true` to use the pipeline path, which does not expose raw log text.
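
To make the exposure concrete, the following sketch imitates the limits described above (25 entries, 200 characters each). The function name `build_legacy_prompt` and the prompt wording are assumptions, not the actual contents of `summarize()`. It shows that truncation shortens entries but does not remove identifiers near the start of a line.

```python
MAX_ENTRIES = 25   # entry limit described for the legacy path
MAX_CHARS = 200    # per-entry truncation described for the legacy path


def build_legacy_prompt(query: str, entries: list[dict]) -> str:
    """Assemble a prompt from truncated raw entry text (illustrative wording)."""
    lines = [entry["text"][:MAX_CHARS] for entry in entries[:MAX_ENTRIES]]
    return f"Question: {query}\nLog entries:\n" + "\n".join(lines)


sample_entries = [
    {"text": "Failed password for alice from 10.0.0.5 port 22 ssh2"},
    {"text": "x" * 500},
]
prompt = build_legacy_prompt("why are logins failing?", sample_entries)
```

In this example the username and IP address in the first entry pass through unchanged, which is why the multi-agent path is the recommended setting for PII-sensitive deployments.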

### Avocet harvester (corpus export)
✅ **Only pattern-tagged entries are exported; export can be disabled.**
The harvester (`harvester/harvester.py`) only POSTs entries that matched at least one named pattern. It does not export the full corpus. Disable by leaving `TURNSTONE_SUBMIT_ENDPOINT` unset (the default).

### External telemetry
✅ **None.** Turnstone makes no calls to Sentry, Segment, Amplitude, or any analytics service. The only outbound network calls are:
- Your configured `GPU_SERVER_URL` (LLM inference, operator-controlled)
- HuggingFace Hub (model downloads — disable with `TURNSTONE_OFFLINE_MODE=1`)
- SSH connections to configured remote log sources (operator-defined)
- Your configured `TURNSTONE_SUBMIT_ENDPOINT`, if set (Avocet corpus export, unset by default)

---

## Configuration Hardening

For compliance deployments, set these in `.env`:

```bash
# Block HuggingFace network access (model weights pre-downloaded)
TURNSTONE_OFFLINE_MODE=1

# Require bearer token for all API calls
TURNSTONE_API_KEY=<strong-random-token>

# Use multi-agent pipeline (no raw log text to LLM)
TURNSTONE_MULTI_AGENT_DIAGNOSE=true

# Disable Avocet corpus push if not needed
# (leave TURNSTONE_SUBMIT_ENDPOINT unset)
```
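
A deployment script could verify these settings before starting the service. The sketch below is one way to do that; `check_hardening` is a hypothetical helper, and the accepted values (`1`, `true`) mirror the example above rather than Turnstone's actual parsing rules.

```python
import os


def check_hardening(env: dict) -> list[str]:
    """Return a list of hardening problems found in the given environment."""
    problems = []
    if env.get("TURNSTONE_OFFLINE_MODE") != "1":
        problems.append("TURNSTONE_OFFLINE_MODE is not 1")
    if not env.get("TURNSTONE_API_KEY"):
        problems.append("TURNSTONE_API_KEY is not set")
    if env.get("TURNSTONE_MULTI_AGENT_DIAGNOSE", "").lower() != "true":
        problems.append("TURNSTONE_MULTI_AGENT_DIAGNOSE is not true")
    if env.get("TURNSTONE_SUBMIT_ENDPOINT"):
        problems.append("TURNSTONE_SUBMIT_ENDPOINT is set (corpus export enabled)")
    return problems


hardened = {
    "TURNSTONE_OFFLINE_MODE": "1",
    "TURNSTONE_API_KEY": "example-token",
    "TURNSTONE_MULTI_AGENT_DIAGNOSE": "true",
}
no_problems = check_hardening(hardened)
default_problems = check_hardening({})

# Against the live process environment: check_hardening(dict(os.environ))
```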

---

## Outstanding Items

🔲 **Per-user access control** — all authenticated clients share the same API key. There is no per-user identity, role separation, or per-source ACL. Track as a future enhancement.

🔲 **Audit log retention policy** — Turnstone writes audit events to the logging system but does not manage log rotation or retention. Operator must configure log rotation (logrotate, systemd journal limits, etc.).

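🔲 **Encrypted DB at rest** — Stock SQLite has no built-in transparent encryption; it is available only through add-ons such as the SQLite Encryption Extension (SEE) or SQLCipher. For encryption at rest, use full-disk encryption (LUKS) or an encrypted filesystem on the host.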

🔲 **TLS between client and Turnstone** — Turnstone binds to HTTP by default. For production, place Caddy or nginx in front and terminate TLS there. Do not expose port 8534 directly over untrusted networks.
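
For the audit log retention item above, one option when rotation is handled inside the Python process instead of by logrotate is the standard library's `TimedRotatingFileHandler`. This is a sketch only: the helper name, the log format, and the 90-day default are assumptions that should be replaced by the operator's own retention policy.

```python
import logging
from logging.handlers import TimedRotatingFileHandler


def attach_rotating_audit_handler(
    path: str, keep_days: int = 90
) -> TimedRotatingFileHandler:
    """Rotate the audit log at midnight and keep keep_days old files."""
    handler = TimedRotatingFileHandler(
        path, when="midnight", backupCount=keep_days, encoding="utf-8"
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s  %(message)s")
    )
    logger = logging.getLogger("turnstone.audit")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return handler
```

A call such as `attach_rotating_audit_handler("/var/log/turnstone/audit.log")` at startup would then bound how many rotated files remain on disk.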

---

## Data Subject Rights (GDPR / CCPA)

### Right to erasure — anonymized records

⚠️ **Anonymized log data cannot be selectively deleted on a per-subject basis.**

When PII sanitization is applied to a bundle export (redacting IP addresses, usernames, hostnames), the resulting data is no longer linked to a specific data subject. As a consequence, Turnstone cannot identify which stored log entries relate to that subject and cannot fulfill a targeted deletion request for records that have already been anonymized.

**Operators must clearly disclose this limitation to data subjects before export:**

> "Anonymized log data exported or submitted from this system cannot be individually identified or selectively deleted. If data was exported in anonymized form, Turnstone cannot distinguish your records from others in the exported set. The right to erasure does not apply to data that is no longer personally identifiable."

This is consistent with GDPR Recital 26, which excludes anonymized data from the regulation's scope. However, the original (pre-anonymization) records in Turnstone's local SQLite database *can* be deleted by source ID via the Sources view (Delete all entries for source) or directly via the database.

**Recommended operator practice:**
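- Maintain a log of which bundles were exported, when, and to whom. The audit log (`turnstone.audit`) records the time and path of each API request, but because all clients share one API key it does not identify the recipient, so track recipients separately.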
- Provide data subjects with the bundle export timestamp and source scope so they can verify what was shared.
- For full erasure of pre-anonymization records: use `DELETE /api/sources/{source_id}` to purge all entries for a given source from the local DB.
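
The purge call in the last item can be scripted. The sketch below only builds the request, using Python's standard library. The base URL, the `Authorization: Bearer` header format, and the helper name `build_purge_request` are assumptions; the path should be adjusted if the deployment serves the API under a prefix.

```python
import urllib.parse
import urllib.request


def build_purge_request(
    base_url: str, source_id: str, api_key: str
) -> urllib.request.Request:
    """Build a DELETE request for all entries of one source."""
    # Percent-encode the source ID so it cannot alter the URL path.
    path = "/api/sources/" + urllib.parse.quote(source_id, safe="")
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        method="DELETE",
        headers={"Authorization": f"Bearer {api_key}"},
    )


request = build_purge_request(
    "http://localhost:8534", "heimdall-journal", "example-token"
)
# Sending is left to the operator: urllib.request.urlopen(request)
```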