feat: acoustic environment fingerprinting + privacy risk scoring #20

Closed · opened 2026-04-11 09:58:12 -07:00 by pyr0ball (Owner) · 0 comments

Extends `cf_voice.acoustic` and `cf_voice.events` to classify the acoustic scene around the primary speaker, enabling privacy-aware session behaviour in Linnet and downstream products (Osprey AMD, Egret DSAR recording detection).

## Why this matters

The acoustic environment is identifiable. Specific birdsong species + regional accent + traffic pattern can narrow a location to a neighbourhood. The fingerprinting system must score privacy risk **locally** before deciding what to log, transmit, or surface — including to cloud inference. Local-first is load-bearing here, not aspirational.

## Scope: four new signal types

All flow through the existing `AcousticResult` / `AudioEvent` pipeline in `cf_voice.acoustic`.

### 1. Scene classification — `event_type: "scene"`

Broad acoustic scene category. Primary input to privacy risk scoring.

Proposed `SCENE_LABELS`:

- `indoor_quiet`, `indoor_crowd`, `outdoor_urban`, `outdoor_nature`, `vehicle`, `public_transit`

Backend: AST/YAMNet acoustic scene model (AudioSet "acoustic scene" subset). New `SceneBackend` protocol in `acoustic.py` alongside `AcousticBackend`.
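A minimal sketch of what the protocol split could look like. Only the names `SceneBackend`, `MockSceneBackend`, and the `SCENE_LABELS` values come from this issue; the method name `classify_scene` and its signature are assumptions.

```python
from typing import Protocol, Sequence

# Labels proposed in this issue.
SCENE_LABELS = (
    "indoor_quiet", "indoor_crowd", "outdoor_urban",
    "outdoor_nature", "vehicle", "public_transit",
)

class SceneBackend(Protocol):
    """Per-window scene classifier, mirroring the existing AcousticBackend shape."""

    def classify_scene(self, pcm: Sequence[float], sample_rate: int) -> tuple[str, float]:
        """Return (scene_label, confidence) for one audio window."""
        ...

class MockSceneBackend:
    """Deterministic stand-in for tests; always reports a quiet indoor scene."""

    def classify_scene(self, pcm: Sequence[float], sample_rate: int) -> tuple[str, float]:
        return "indoor_quiet", 1.0
```

The mock keeps the pipeline testable before the AST stub lands.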

### 2. Extended environ labels — `event_type: "environ"` expansion

Expand the current telephony-only set to cover general-purpose acoustic events:

- **Nature**: `birdsong`, `wind`, `rain`, `water`
- **Urban**: `traffic`, `crowd_chatter`, `street_crossing_signal`, `construction`
- **Indoor**: `hvac`, `keyboard_typing`, `restaurant_ambience`

Backend: expand `_YAMNET_MAP` / `_AST_MAP` in `acoustic.py`. Builds on #5 (YAMNet).

### 3. Accent / language identification — `event_type: "accent"`

Regional accent of primary speaker. Accent alone is not high-risk, but combined with specific birdsong or quiet rural background it becomes location-identifying — the privacy scorer accounts for this compound signal.

Fields: `language: str`, `region: str`, `confidence: float`

Backend: `facebook/mms-lid-126` for language, wav2vec2 accent fine-tune for region. New `cf_voice/accent.py`. Lazy-loaded, gated by `CF_VOICE_ACCENT=1` (off by default — GPU cost + privacy sensitivity).
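The lazy-load plus env-gate pattern might look like this. The model load is stubbed so the gating is the only real logic; everything except the names `AccentClassifier` and `CF_VOICE_ACCENT` is an assumption.

```python
import os

class AccentClassifier:
    """Lazy-loaded, env-gated accent/language classifier (illustrative stub)."""

    def __init__(self) -> None:
        self._model = None  # heavyweight model: loaded on first use, never at import

    @property
    def enabled(self) -> bool:
        # Off unless CF_VOICE_ACCENT=1, mirroring the gate proposed above.
        return os.environ.get("CF_VOICE_ACCENT", "0") == "1"

    def _load_model(self):
        # Placeholder for loading mms-lid-126 + the wav2vec2 accent fine-tune.
        return lambda pcm, sr: {"language": "eng", "region": "unknown", "confidence": 0.0}

    def classify(self, pcm, sample_rate):
        if not self.enabled:
            return None  # gated off by default: GPU cost + privacy sensitivity
        if self._model is None:
            self._model = self._load_model()
        return self._model(pcm, sample_rate)
```

Keeping the gate on the property (rather than cached at construction) lets tests and operators flip it at runtime.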

### 4. Background speaker presence — `SPEAKER_LABELS` expansion

Add `background_voices` label: detectable via VAD + speaker count from pyannote or silero-vad. Distinct from primary speaker classification (#8).

## Privacy risk scoring (`cf_voice/privacy.py`)

A `privacy_risk` value (`low` / `moderate` / `high`) derived locally per audio window from the combined fingerprint. **Never sent to cloud. Never logged server-side.**

| Signal combination | Risk |
|--------------------|------|
| `outdoor_urban` + `crowd_chatter` + `traffic` | low — clearly public |
| `indoor_quiet` + `background_voices` | moderate — conversation overheard |
| `outdoor_nature` + `birdsong` + regional accent | moderate-high — location-identifying compound |
| `indoor_quiet` + no background voices | low |
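A rule-based sketch of `score_privacy_risk` that encodes the table above. The real scorer will weigh more signals; collapsing the table's "moderate-high" row to `"high"` is a conservative assumption, and the boolean parameters are simplifications of the richer speaker/accent inputs.

```python
def score_privacy_risk(scene: str, environ: set[str],
                       background_voices: bool, accent_detected: bool) -> str:
    """Return "low" / "moderate" / "high" from one window's fingerprint."""
    if scene == "outdoor_nature" and "birdsong" in environ and accent_detected:
        return "high"      # location-identifying compound signal (table row 3)
    if scene == "indoor_quiet" and background_voices:
        return "moderate"  # a conversation could be overheard (table row 2)
    # outdoor_urban + crowd_chatter/traffic is clearly public (table row 1),
    # and indoor_quiet with no background voices is private (table row 4).
    return "low"
```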

### Risk gates (Linnet)

- `high`: warn user before sending audio chunk to cloud STT/inference; offer local-only fallback
- `moderate`: attach `privacy_flags` to session state, no blocking action by default
- `low`: proceed normally; no annotation
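The three gates could reduce to a small dispatch on the Linnet side; the flag names here are assumptions:

```python
def gate_for_risk(risk: str) -> dict[str, bool]:
    """Translate a privacy_risk level into Linnet-side actions (illustrative)."""
    if risk == "high":
        # Warn first; cloud is blocked until the user explicitly consents.
        return {"warn_user": True, "cloud_allowed": False, "flag_session": True}
    if risk == "moderate":
        # Annotate the session but take no blocking action by default.
        return {"warn_user": False, "cloud_allowed": True, "flag_session": True}
    # low: proceed normally, no annotation.
    return {"warn_user": False, "cloud_allowed": True, "flag_session": False}
```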

## Implementation order

1. Expand `ENVIRON_LABELS` + label maps (builds on #5)
2. `SceneBackend` protocol + `MockSceneBackend` + `ASTSceneBackend` stub in `acoustic.py`
3. `cf_voice/accent.py`: `AccentClassifier` with lazy-load + `CF_VOICE_ACCENT` gate
4. `cf_voice/privacy.py`: `score_privacy_risk(scene, environ, speaker, accent) -> PrivacyRisk`
5. Linnet: expose `privacy_risk` in `GET /session/{id}`, add `scene-event` + `accent-event` SSE types
6. Linnet frontend: subtle informational indicator in NowPanel (not alarming)

## cf-voice changes required

- `cf_voice/events.py`: `SCENE_LABELS`, `ACCENT_LABELS`, expanded `ENVIRON_LABELS`
- `cf_voice/acoustic.py`: `SceneBackend` protocol, mock, AST stub
- `cf_voice/accent.py`: `AccentClassifier` (new module)
- `cf_voice/privacy.py`: `PrivacyRisk` dataclass + scoring function
- `cf_voice/context.py`: wire scene + accent classifiers into `_classify_real_async`

## Linnet changes

- `app/api/sessions.py`: `privacy_risk` field in `GET /session/{id}` response
- `app/api/events.py`: `scene-event`, `accent-event` SSE event types
- `app/models/`: `SceneEvent`, `AccentEvent` models
- Frontend: `useToneStream` listener for new event types; NowPanel subtle indicator
- `compose.cloud.yml`: `CF_VOICE_ACCENT=0` default (opt-in, expensive)
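As a sketch of the wire format, the new SSE event types could serialize like this. Field names beyond the label/confidence pair are assumptions, and a real Linnet model would likely be Pydantic rather than a plain dataclass:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class SceneEvent:
    ts: float          # window timestamp, seconds into session (assumed field)
    label: str         # one of SCENE_LABELS
    confidence: float

def to_sse(event_name: str, payload) -> str:
    """Serialize a dataclass payload as a Server-Sent Events frame."""
    return f"event: {event_name}\ndata: {json.dumps(asdict(payload))}\n\n"
```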

## Non-goals

- Do not store accent or scene labels server-side in the corrections DB
- Do not transmit `privacy_risk=high` audio chunks to cloud STT without explicit user consent
- BirdNET species-level identification is out of scope (too identifying; use the generic "birdsong" label only)
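The second non-goal is easy to make mechanical; a check like this (names assumed) would sit in front of any cloud STT call:

```python
def may_transmit_to_cloud(privacy_risk: str, user_consented: bool) -> bool:
    """Enforce the non-goal: high-risk audio never leaves without explicit consent."""
    if privacy_risk == "high":
        return user_consented  # explicit opt-in required for this chunk
    return True
```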
pyr0ball added the privacy, enhancement, and backlog labels 2026-04-11 09:58:12 -07:00
pyr0ball added this to the Navigation — v0.2.x milestone 2026-04-12 11:04:26 -07:00
Reference: Circuit-Forge/linnet#20