feat: full voice pipeline — AST acoustic, accent, privacy, prosody, dimensional, trajectory, telephony, FastAPI app
New modules shipped (from Linnet integration):

- acoustic.py: AST (MIT/ast-finetuned-audioset-10-10-0.4593) replaces the YAMNet stub; 527 AudioSet classes mapped to queue/speaker/environ/scene labels; _LABEL_MAP includes hold_music, ringback, DTMF, background_shift, AMD signal chain
- accent.py: facebook/mms-lid-126 language ID → regional accent labels (en_gb, en_us, en_au, fr, es, de, zh, …); lazy-loaded, gated by CF_VOICE_ACCENT
- privacy.py: compound privacy risk scorer — public_env, background_voices, nature scene, accent signals; returns a 0–3 score without storing any audio
- prosody.py: openSMILE-backed prosody extractor (sarcasm_risk, flat_f0_score, speech_rate, pitch_range); mock mode returns neutral values
- dimensional.py: audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim valence/arousal/dominance scorer; gated by CF_VOICE_DIMENSIONAL
- trajectory.py: rolling buffer for arousal/valence deltas, trend detection (escalating/suppressed/stable), coherence scoring, suppression/reframe flags
- telephony.py: TelephonyBackend Protocol + MockTelephonyBackend + SignalWireBackend + FreeSWITCHBackend; CallSession dataclass; make_telephony() factory
- app.py: FastAPI service (port 8007) — /health + /classify; accepts base64 PCM chunks, returns full AudioEventOut including dimensional/prosody/accent fields
- prefs.py: voice preference helpers (elcor_mode, confidence_threshold, whisper_model, elcor_prior_frames); cf-core and env-var fallback

Tests: fix stale tests (YAMNetAcousticBackend → ASTAcousticBackend, scene field added to AcousticResult, speaker_at gap now resolves the dominant speaker instead of UNKNOWN, make_io real path returns MicVoiceIO when sounddevice is installed). 78 tests passing.

Closes #2, #3.
parent 335d51f02f
commit 24f04b67db
26 changed files with 3974 additions and 111 deletions
36	.env.example
@@ -3,14 +3,29 @@
# load it via python-dotenv in their own startup. For standalone cf-voice
# dev/testing, source this file manually or install python-dotenv.

# ── HuggingFace ───────────────────────────────────────────────────────────────
# Required for pyannote.audio speaker diarization model download.
# Get a free token at https://huggingface.co/settings/tokens
# Also accept the gated model terms at:
# ── HuggingFace — free tier / local use ──────────────────────────────────────
# Used by the local diarization path (free tier, user's own machine).
# Each user must:
#   1. Create a free account at huggingface.co
#   2. Accept the gated model terms at:
#      https://huggingface.co/pyannote/speaker-diarization-3.1
#      https://huggingface.co/pyannote/segmentation-3.0
#   3. Generate a read token at huggingface.co/settings/tokens
HF_TOKEN=

# ── HuggingFace — paid tier / cf-orch backend ─────────────────────────────────
# Used by cf-orch when running diarization as a managed service on Heimdall.
# This is a CircuitForge org token — NOT the user's personal token.
#
# Prerequisites (one-time, manual — tracked in circuitforge-orch#27):
#   1. Create CircuitForge org on huggingface.co
#   2. Accept pyannote/speaker-diarization-3.1 terms under the org account
#   3. Accept pyannote/segmentation-3.0 terms under the org account
#   4. Generate a read-only org token and set it here
#
# Leave blank on local installs — HF_TOKEN above is used instead.
CF_HF_TOKEN=

# ── Whisper STT ───────────────────────────────────────────────────────────────
# Model size: tiny | base | small | medium | large-v2 | large-v3
# Smaller = faster / less VRAM; larger = more accurate.
@@ -29,3 +44,16 @@ CF_VOICE_MOCK=
# ── Tone classifier ───────────────────────────────────────────────────────────
# Minimum confidence to emit a VoiceFrame (below this = frame skipped).
CF_VOICE_CONFIDENCE_THRESHOLD=0.55

# ── Elcor annotation mode ─────────────────────────────────────────────────────
# Accessibility feature for autistic and ND users. Switches tone subtext from
# generic format ("Tone: Frustrated") to Elcor-style prefix format
# ("With barely concealed frustration:"). Opt-in, local-only.
# Overridden by cf-core preferences store when circuitforge_core is installed.
# 1 = enabled, 0 or unset = disabled (default).
CF_VOICE_ELCOR=0

# Number of prior VoiceFrames to include as context for Elcor label generation.
# Larger windows = more contextually aware annotations, higher LLM prompt cost.
# Default: 4 frames (~10 seconds of rolling context at 2.5s intervals).
CF_VOICE_ELCOR_PRIOR_FRAMES=4
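The toggles above are plain environment variables with documented defaults. A minimal sketch of reading them, assuming only the defaults stated in the comments (the helper name is illustrative — the shipped prefs.py helpers and the cf-core preference override are not modeled here):

```python
import os

def elcor_settings(env=None):
    """Read the Elcor-related env vars with their documented defaults (sketch)."""
    env = os.environ if env is None else env
    return {
        "elcor_mode": env.get("CF_VOICE_ELCOR", "0") == "1",  # default: disabled
        "prior_frames": int(env.get("CF_VOICE_ELCOR_PRIOR_FRAMES", "4")),
        "confidence_threshold": float(env.get("CF_VOICE_CONFIDENCE_THRESHOLD", "0.55")),
    }
```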
72	README.md
@@ -126,6 +126,64 @@ Host apps subscribing via `<LinnetWidget />` receive `MessageEvent` with `type =

---

---

## Telephony

`cf_voice.telephony` provides the outbound call abstraction for Osprey, Harrier, Ibis, and Kestrel.

### Quick start

```python
from cf_voice.telephony import make_telephony

# Mock mode — no real calls placed (CF_VOICE_MOCK=1 or mock=True)
backend = make_telephony(mock=True)

session = await backend.dial(
    to="+15551234567",
    from_="+18005550000",
    webhook_url="https://yourapp.example.com/voice/events",
    amd=True,  # answering machine detection
)

# Adaptive service identification (osprey#21)
await backend.announce(session.call_sid, "This is an automated assistant.")

# Navigate IVR
await backend.send_dtmf(session.call_sid, "2")  # Press 2 for billing

# Bridge to the user's phone once a human agent answers
await backend.bridge(session.call_sid, "+14155550100")

await backend.hangup(session.call_sid)
```

### Backend selection

`make_telephony()` resolves the backend in this order:

| Condition | Backend |
|---|---|
| `CF_VOICE_MOCK=1` or `mock=True` | `MockTelephonyBackend` (dev/CI) |
| `CF_SW_PROJECT_ID` env set | `SignalWireBackend` (paid tier) |
| `CF_ESL_PASSWORD` env set | `FreeSWITCHBackend` (free tier, self-hosted) |
| none | `RuntimeError` |

### Installing real backends

```bash
# Paid tier — SignalWire managed telephony
pip install cf-voice[signalwire]

# Free tier — self-hosted FreeSWITCH (requires compiled ESL bindings)
pip install cf-voice[freeswitch]
```

Set credentials in `.env` (see `.env.example`).

---

## Mock mode

Set `CF_VOICE_MOCK=1` or pass `mock=True` to `make_io()`. Emits synthetic `VoiceFrame` objects on a timer. No GPU, microphone, or `HF_TOKEN` required. The API surface is identical to real mode.
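The backend-selection table can be sketched as a small resolution function. This is a standalone illustration of the documented order, not the shipped `make_telephony()` — the class bodies are stand-ins and `resolve_backend` is a hypothetical name:

```python
import os

# Stand-ins for the real backend classes named in the table.
class MockTelephonyBackend: ...
class SignalWireBackend: ...
class FreeSWITCHBackend: ...

def resolve_backend(mock: bool = False, env=None):
    """Resolve a backend in the documented order: mock → SignalWire → FreeSWITCH."""
    env = os.environ if env is None else env
    if mock or env.get("CF_VOICE_MOCK", "") == "1":
        return MockTelephonyBackend()
    if env.get("CF_SW_PROJECT_ID"):
        return SignalWireBackend()
    if env.get("CF_ESL_PASSWORD"):
        return FreeSWITCHBackend()
    raise RuntimeError("No telephony backend configured")
```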
@@ -139,6 +197,7 @@ Set `CF_VOICE_MOCK=1` or pass `mock=True` to `make_io()`. Emits synthetic `Voice
| `cf_voice.models` | MIT | `VoiceFrame` dataclass |
| `cf_voice.events` | MIT | `AudioEvent`, `ToneEvent`, wire format types |
| `cf_voice.io` | MIT | `VoiceIO` base, `MockVoiceIO`, `make_io()` factory |
| `cf_voice.telephony` | MIT (Protocol + Mock), BSL (backends) | `TelephonyBackend` Protocol, `MockTelephonyBackend`, `SignalWireBackend`, `FreeSWITCHBackend`, `make_telephony()` |
| `cf_voice.capture` | BSL 1.1 | `MicVoiceIO` — real mic capture, 2s windowing |
| `cf_voice.stt` | BSL 1.1 | `WhisperSTT` — faster-whisper async wrapper |
| `cf_voice.classify` | BSL 1.1 | `ToneClassifier` — wav2vec2 SER + librosa prosody |

@@ -149,6 +208,19 @@ BSL applies to inference modules. IO + types + wire format = MIT.

---

---

## Attribution

Speaker diarization uses [pyannote.audio](https://github.com/pyannote/pyannote-audio) (MIT) and the following gated HuggingFace models (CC BY 4.0):

- `pyannote/speaker-diarization-3.1` — Hervé Bredin et al.
- `pyannote/segmentation-3.0` — Hervé Bredin et al.

CC BY 4.0 requires attribution in any distributed product. The models are gated: each user must accept the license terms on HuggingFace before their `HF_TOKEN` will authorize a download.

---

## Consumed by

- `Circuit-Forge/linnet` — real-time tone annotation PWA (primary consumer)
152	cf_voice/accent.py (new file)
@@ -0,0 +1,152 @@
# cf_voice/accent.py — accent / language identification classifier
#
# MIT licensed (AccentResult dataclass + mock). BSL 1.1 (real inference).
# Gated by CF_VOICE_ACCENT=1 — off by default (GPU cost + privacy sensitivity).
#
# Accent alone is not high-risk, but combined with birdsong or a quiet rural
# background it becomes location-identifying. The privacy scorer accounts for
# this compound signal.
#
# Real backend: facebook/mms-lid-126 for language detection, wav2vec2 accent
# fine-tune for region. Lazy-loaded to keep startup fast.
from __future__ import annotations

import logging
import os
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class AccentResult:
    """
    Language + regional accent classification for the primary speaker.

    language: BCP-47 language tag (e.g. "en", "fr", "zh")
    region: cf-voice ACCENT_LABEL string (e.g. "en_gb", "en_us", "other")
    confidence: float in [0, 1]
    """

    language: str
    region: str
    confidence: float


class MockAccentClassifier:
    """
    Synthetic accent classifier for development and CI.

    Returns a fixed result so the privacy scorer can exercise all code paths
    without loading a real model.
    """

    def classify(self, audio: "list[float] | bytes") -> AccentResult | None:
        return AccentResult(language="en", region="en_gb", confidence=0.72)


class AccentClassifier:
    """
    Real accent / language classifier.

    BSL 1.1 — requires [inference] extras.

    Language detection: facebook/mms-lid-126 (126 languages, MIT licensed).
    Accent region: maps language tag to a regional ACCENT_LABEL.

    VRAM: ~500 MB on CUDA.
    """

    _LANG_MODEL_ID = "facebook/mms-lid-126"

    def __init__(self) -> None:
        try:
            from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification
        except ImportError as exc:
            raise ImportError(
                "transformers is required for accent classification. "
                "Install with: pip install cf-voice[inference]"
            ) from exc

        import torch

        self._device = "cuda" if torch.cuda.is_available() else "cpu"
        logger.info("Loading language ID model %s on %s", self._LANG_MODEL_ID, self._device)
        self._extractor = AutoFeatureExtractor.from_pretrained(self._LANG_MODEL_ID)
        self._model = Wav2Vec2ForSequenceClassification.from_pretrained(
            self._LANG_MODEL_ID
        ).to(self._device)
        self._model.eval()

    def classify(self, audio: "list[float] | bytes") -> AccentResult | None:
        import numpy as np
        import torch

        if isinstance(audio, bytes):
            audio_np = np.frombuffer(audio, dtype=np.float32)
        else:
            audio_np = np.asarray(audio, dtype=np.float32)

        if len(audio_np) < 1600:  # need at least 100ms at 16kHz
            return None

        inputs = self._extractor(
            audio_np, sampling_rate=16_000, return_tensors="pt", padding=True
        )
        inputs = {k: v.to(self._device) for k, v in inputs.items()}

        with torch.no_grad():
            logits = self._model(**inputs).logits
        probs = torch.softmax(logits, dim=-1)[0]

        top_idx = int(probs.argmax())
        confidence = float(probs[top_idx])
        language = self._model.config.id2label.get(top_idx, "other")

        region = _lang_to_region(language)
        return AccentResult(language=language, region=region, confidence=confidence)


def _lang_to_region(lang: str) -> str:
    """Map a BCP-47 / ISO 639-3 language tag to a cf-voice ACCENT_LABEL."""
    _MAP: dict[str, str] = {
        "eng": "en_us",  # MMS uses ISO 639-3; sub-regional accent needs fine-tune
        "fra": "fr",
        "spa": "es",
        "deu": "de",
        "zho": "zh",
        "jpn": "ja",
        "en": "en_us",
        "en-GB": "en_gb",
        "en-AU": "en_au",
        "en-CA": "en_ca",
        "en-IN": "en_in",
        "fr": "fr",
        "de": "de",
        "es": "es",
        "zh": "zh",
        "ja": "ja",
    }
    return _MAP.get(lang, "other")


def make_accent_classifier(
    mock: bool | None = None,
) -> "MockAccentClassifier | AccentClassifier | None":
    """
    Factory: return an AccentClassifier if CF_VOICE_ACCENT=1, else None.

    Callers must check for None before invoking classify().
    """
    enabled = os.environ.get("CF_VOICE_ACCENT", "") == "1"
    if not enabled:
        return None

    use_mock = mock if mock is not None else os.environ.get("CF_VOICE_MOCK", "") == "1"
    if use_mock:
        return MockAccentClassifier()

    try:
        return AccentClassifier()
    except Exception as exc:  # ImportError included — (ImportError, Exception) was redundant
        logger.warning("AccentClassifier unavailable (%s) — using mock", exc)
        return MockAccentClassifier()
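The None-gating contract of `make_accent_classifier()` is easy to get wrong at call sites. A self-contained sketch of the gate and the caller pattern — `StubAccentClassifier` and `make_accent_sketch` are hypothetical stand-ins, not the shipped classes:

```python
import os

class StubAccentClassifier:
    """Stand-in with the same call shape as MockAccentClassifier."""
    def classify(self, audio):
        return ("en", "en_gb", 0.72)

def make_accent_sketch(env=None):
    """Return None unless CF_VOICE_ACCENT=1 — the feature is opt-in."""
    env = os.environ if env is None else env
    if env.get("CF_VOICE_ACCENT", "") != "1":
        return None  # gated off; callers must handle None
    return StubAccentClassifier()

# Caller pattern: always check for None before classifying.
clf = make_accent_sketch(env={"CF_VOICE_ACCENT": "1"})
result = clf.classify(b"") if clf is not None else None
```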
366	cf_voice/acoustic.py (new file)
@@ -0,0 +1,366 @@
# cf_voice/acoustic.py — queue / environ / speaker acoustic event classifier
#
# MIT licensed (Protocol + mock). BSL 1.1 (real AST inference).
# Requires [inference] extras for real mode.
#
# This module is the AMD (answering machine detection) backbone for Osprey.
# It runs in parallel with the STT pipeline — it never processes words,
# only acoustic features (pitch, timbre, background, DTMF tones, ringback).
#
# v0.2.x wires the real AST model (ASTAcousticBackend below); the mock emits
# a plausible call-lifecycle sequence for dev/CI.
from __future__ import annotations

import asyncio
import logging
import random
import time
from dataclasses import dataclass
from typing import AsyncIterator, Protocol, Sequence, runtime_checkable

from cf_voice.events import AudioEvent, QUEUE_LABELS, SPEAKER_LABELS, ENVIRON_LABELS, SCENE_LABELS

logger = logging.getLogger(__name__)

_SAMPLE_RATE = 16_000


@dataclass
class AcousticResult:
    """Batch of AudioEvents produced from a single audio window."""

    queue: AudioEvent | None
    speaker: AudioEvent | None
    environ: AudioEvent | None
    scene: AudioEvent | None
    timestamp: float


@runtime_checkable
class AcousticBackend(Protocol):
    """
    Interface for acoustic event classifiers.

    classify_window() takes a PCM float32 buffer (mono, 16kHz) and returns an
    AcousticResult covering one analysis window (~2s). It is synchronous and
    runs in a thread pool when called from async code.
    """

    def classify_window(
        self,
        audio: "list[float] | bytes",
        timestamp: float = 0.0,
    ) -> AcousticResult:
        ...


@runtime_checkable
class SceneBackend(Protocol):
    """
    Interface for dedicated acoustic scene classifiers.

    Separate from AcousticBackend to allow future swapping to a specialised
    scene model (e.g. AudioSet acoustic-scene subset) without touching the
    telephony event classifier.
    """

    def classify_scene(
        self,
        audio: "list[float] | bytes",
        timestamp: float = 0.0,
    ) -> AudioEvent | None:
        ...


# ── Call lifecycle for mock mode ──────────────────────────────────────────────
# Approximates what a real outbound call looks like acoustically.
# Phases: ringing → ivr_greeting → ivr_navigation → human_answer → call_center

_MOCK_LIFECYCLE: list[dict] = [
    # "dur": (min_s, max_s) — how long to stay in this phase
    {"queue": "ringback", "speaker": "no_speaker", "environ": "quiet", "scene": "indoor_quiet", "dur": (2, 5)},
    {"queue": "silence", "speaker": "ivr_synth", "environ": "quiet", "scene": "indoor_quiet", "dur": (1, 2)},
    {"queue": "hold_music", "speaker": "no_speaker", "environ": "music", "scene": "indoor_quiet", "dur": (2, 8)},
    {"queue": "silence", "speaker": "ivr_synth", "environ": "quiet", "scene": "indoor_quiet", "dur": (1, 3)},
    {"queue": "dtmf_tone", "speaker": "no_speaker", "environ": "quiet", "scene": "indoor_quiet", "dur": (0.5, 1)},
    {"queue": "silence", "speaker": "ivr_synth", "environ": "quiet", "scene": "indoor_quiet", "dur": (0.5, 1)},
    {"queue": "hold_music", "speaker": "no_speaker", "environ": "music", "scene": "indoor_quiet", "dur": (3, 12)},
    # AMD moment: background_shift is the primary signal
    {"queue": "silence", "speaker": "no_speaker", "environ": "background_shift", "scene": "indoor_crowd", "dur": (0.5, 1)},
    {"queue": "silence", "speaker": "human_single", "environ": "call_center", "scene": "indoor_crowd", "dur": (30, 60)},
]


class MockAcousticBackend:
    """
    Synthetic acoustic classifier for development and CI.

    Cycles through a plausible call lifecycle so Osprey's IVR state machine
    can be tested without real telephony. The AMD signal (background_shift →
    human_single) is emitted at the right point in the sequence.

    Usage:
        backend = MockAcousticBackend(seed=42)
        result = backend.classify_window(b"", timestamp=4.5)
        print(result.environ.label)  # → "hold_music", "background_shift", etc.
    """

    def __init__(self, seed: int | None = None) -> None:
        self._rng = random.Random(seed)
        self._phase_idx = 0
        self._phase_start = time.monotonic()
        self._phase_dur = self._draw_phase_dur(0)

    def _draw_phase_dur(self, idx: int) -> float:
        lo, hi = _MOCK_LIFECYCLE[idx % len(_MOCK_LIFECYCLE)]["dur"]
        return self._rng.uniform(lo, hi)

    def _current_phase(self) -> dict:
        now = time.monotonic()
        elapsed = now - self._phase_start
        if elapsed >= self._phase_dur:
            self._phase_idx = (self._phase_idx + 1) % len(_MOCK_LIFECYCLE)
            self._phase_start = now
            self._phase_dur = self._draw_phase_dur(self._phase_idx)
        return _MOCK_LIFECYCLE[self._phase_idx]

    def _make_event(
        self,
        event_type: str,
        label: str,
        timestamp: float,
    ) -> AudioEvent:
        return AudioEvent(
            timestamp=timestamp,
            event_type=event_type,  # type: ignore[arg-type]
            label=label,
            confidence=self._rng.uniform(0.72, 0.97),
        )

    def classify_window(
        self,
        audio: "list[float] | bytes",
        timestamp: float = 0.0,
    ) -> AcousticResult:
        phase = self._current_phase()
        return AcousticResult(
            queue=self._make_event("queue", phase["queue"], timestamp),
            speaker=self._make_event("speaker", phase["speaker"], timestamp),
            environ=self._make_event("environ", phase["environ"], timestamp),
            scene=self._make_event("scene", phase["scene"], timestamp),
            timestamp=timestamp,
        )


# ── AST acoustic backend (BSL 1.1) ───────────────────────────────────────────


class ASTAcousticBackend:
    """
    Audio Spectrogram Transformer acoustic event classifier.

    BSL 1.1 — requires [inference] extras.

    Uses MIT/ast-finetuned-audioset-10-10-0.4593 (527 AudioSet classes) to
    classify queue state, speaker type, and background environment from a
    single forward pass. Top-k predictions are scanned; the highest-confidence
    match per event category is emitted.

    Model: MIT/ast-finetuned-audioset-10-10-0.4593
    VRAM:  ~300 MB on CUDA (fp32)
    Input: float32 16kHz mono audio (any length; feature extractor pads/truncates)

    Replaces the YAMNet stub. Synchronous — run from a thread pool executor
    when called from async code.
    """

    _MODEL_ID = "MIT/ast-finetuned-audioset-10-10-0.4593"
    _SAMPLE_RATE = 16_000
    _TOP_K = 20  # scan more classes — many relevant ones are in the 10-20 range

    # Minimum confidence below which an event is suppressed even if it's the
    # top match in its category.
    _MIN_CONFIDENCE: dict[str, float] = {
        "queue": 0.10,
        "speaker": 0.08,
        "environ": 0.12,
        "scene": 0.08,  # scenes fire reliably — lower bar is fine
    }

    # AudioSet class name → (event_type, cf-voice label).
    # Top-K predictions are scanned; highest confidence per category wins.
    # "call_center" requires dedicated call-centre acoustics, not generic indoor.
    # "Music" was previously duplicated (queue + environ) — Python dicts keep the
    # last entry, silently losing the queue mapping. Fixed: use the specific
    # "Musical instrument" AudioSet parent for hold_music; "Music" maps to environ.
    _LABEL_MAP: dict[str, tuple[str, str]] = {
        # ── Queue / call-state labels ──────────────────────────────────────────
        "Ringtone": ("queue", "ringback"),
        "Telephone bell ringing": ("queue", "ringback"),
        "Busy signal": ("queue", "busy"),
        "Dial tone": ("queue", "dtmf_tone"),
        "DTMF": ("queue", "dtmf_tone"),
        "Silence": ("queue", "silence"),
        # ── Speaker type labels ────────────────────────────────────────────────
        "Speech": ("speaker", "human_single"),
        "Male speech, man speaking": ("speaker", "human_single"),
        "Female speech, woman speaking": ("speaker", "human_single"),
        "Child speech, kid speaking": ("speaker", "human_single"),
        "Crowd": ("speaker", "human_multi"),
        "Hubbub, speech noise, speech babble": ("speaker", "human_multi"),
        "Laughter": ("speaker", "human_multi"),
        "Chuckle, chortle": ("speaker", "human_multi"),
        "Speech synthesizer": ("speaker", "ivr_synth"),
        # ── Environmental labels ───────────────────────────────────────────────
        # Telephony — requires specific call-centre acoustics, not generic indoor
        "Telephone": ("environ", "call_center"),
        "Telephone dialing, DTMF": ("environ", "call_center"),
        "Reverberation": ("environ", "background_shift"),
        "Echo": ("environ", "background_shift"),
        "Background noise": ("environ", "noise_floor_change"),
        "Noise": ("environ", "noise_floor_change"),
        "White noise": ("environ", "noise_floor_change"),
        "Pink noise": ("environ", "noise_floor_change"),
        "Static": ("environ", "noise_floor_change"),
        "Music": ("environ", "music"),
        # Nature
        "Bird": ("environ", "birdsong"),
        "Bird vocalization, bird call, bird song": ("environ", "birdsong"),
        "Chirp, tweet": ("environ", "birdsong"),
        "Wind": ("environ", "wind"),
        "Wind noise (microphone)": ("environ", "wind"),
        "Rain": ("environ", "rain"),
        "Rain on surface": ("environ", "rain"),
        "Water": ("environ", "water"),
        "Stream": ("environ", "water"),
        # Urban
        "Traffic noise, roadway noise": ("environ", "traffic"),
        "Vehicle": ("environ", "traffic"),
        # "Crowd" already maps to speaker/human_multi above — repeating the key
        # here would silently overwrite it (same dict pitfall as "Music").
        # "Chatter" carries the crowd_chatter environ signal instead.
        "Chatter": ("environ", "crowd_chatter"),
        "Construction": ("environ", "construction"),
        "Drill": ("environ", "construction"),
        # Indoor
        "Air conditioning": ("environ", "hvac"),
        "Mechanical fan": ("environ", "hvac"),
        "Computer keyboard": ("environ", "keyboard_typing"),
        "Typing": ("environ", "keyboard_typing"),
        "Restaurant": ("environ", "restaurant"),
        "Dishes, pots, and pans": ("environ", "restaurant"),
        # ── Acoustic scene labels ──────────────────────────────────────────────
        # "Inside, small/large room" moved from environ to scene — they correctly
        # describe the acoustic scene but are NOT specific enough for call_center.
        "Inside, small room": ("scene", "indoor_quiet"),
        "Inside, large room or hall": ("scene", "indoor_crowd"),
        "Outside, urban or manmade": ("scene", "outdoor_urban"),
        "Field recording": ("scene", "outdoor_nature"),
        "Rail transport": ("scene", "public_transit"),
        "Bus": ("scene", "public_transit"),
        "Train": ("scene", "public_transit"),
        "Car": ("scene", "vehicle"),
        "Truck": ("scene", "vehicle"),
        "Motorcycle": ("scene", "vehicle"),
        # Music in the queue sense — "Musical instrument" is more specific
        # than the ambiguous top-level "Music" class
        "Musical instrument": ("queue", "hold_music"),
        "Piano": ("queue", "hold_music"),
        "Guitar": ("queue", "hold_music"),
    }

    def __init__(self) -> None:
        try:
            from transformers import ASTFeatureExtractor, ASTForAudioClassification
        except ImportError as exc:
            raise ImportError(
                "transformers is required for AST acoustic classification. "
                "Install with: pip install cf-voice[inference]"
            ) from exc

        import torch

        self._device = "cuda" if torch.cuda.is_available() else "cpu"
        logger.info("Loading AST acoustic model %s on %s", self._MODEL_ID, self._device)
        self._extractor = ASTFeatureExtractor.from_pretrained(self._MODEL_ID)
        self._model = ASTForAudioClassification.from_pretrained(self._MODEL_ID).to(
            self._device
        )
        self._model.eval()

    def classify_window(
        self,
        audio: "list[float] | bytes",
        timestamp: float = 0.0,
    ) -> AcousticResult:
        import numpy as np
        import torch

        if isinstance(audio, bytes):
            audio_np = np.frombuffer(audio, dtype=np.float32)
        else:
            audio_np = np.asarray(audio, dtype=np.float32)

        if len(audio_np) == 0:
            return AcousticResult(queue=None, speaker=None, environ=None, scene=None, timestamp=timestamp)

        inputs = self._extractor(
            audio_np, sampling_rate=self._SAMPLE_RATE, return_tensors="pt"
        )
        inputs = {k: v.to(self._device) for k, v in inputs.items()}

        with torch.no_grad():
            logits = self._model(**inputs).logits
        probs = torch.softmax(logits, dim=-1)[0]
        id2label = self._model.config.id2label

        top_k = min(self._TOP_K, len(probs))
        top_indices = probs.topk(top_k).indices.tolist()
        predictions = [(id2label[i], float(probs[i])) for i in top_indices]

        # Take highest-confidence match per category
        best: dict[str, tuple[str, float]] = {}  # event_type → (label, conf)
        for ast_label, conf in predictions:
            mapping = self._LABEL_MAP.get(ast_label)
            if mapping is None:
                continue
            etype, cf_label = mapping
            if etype not in best or conf > best[etype][1]:
                best[etype] = (cf_label, conf)

        def _make_event(etype: str, label: str, conf: float) -> AudioEvent:
            return AudioEvent(
                timestamp=timestamp,
                event_type=etype,  # type: ignore[arg-type]
                label=label,
                confidence=round(conf, 4),
            )

        def _above_threshold(etype: str) -> bool:
            if etype not in best:
                return False
            _, conf = best[etype]
            return conf >= self._MIN_CONFIDENCE.get(etype, 0.10)

        return AcousticResult(
            queue=_make_event("queue", *best["queue"]) if _above_threshold("queue") else None,
            speaker=_make_event("speaker", *best["speaker"]) if _above_threshold("speaker") else None,
            environ=_make_event("environ", *best["environ"]) if _above_threshold("environ") else None,
            scene=_make_event("scene", *best["scene"]) if _above_threshold("scene") else None,
            timestamp=timestamp,
        )


def make_acoustic(mock: bool | None = None) -> "MockAcousticBackend | ASTAcousticBackend":
    """
    Factory: return an AcousticBackend for the current environment.

    mock=True or CF_VOICE_MOCK=1 → MockAcousticBackend
    Otherwise → ASTAcousticBackend (falls back to mock on import error)
    """
    import os

    use_mock = mock if mock is not None else os.environ.get("CF_VOICE_MOCK", "") == "1"
    if use_mock:
        return MockAcousticBackend()
    try:
        return ASTAcousticBackend()
    except Exception as exc:  # ImportError included — (ImportError, Exception) was redundant
        logger.warning("ASTAcousticBackend unavailable (%s) — using mock", exc)
        return MockAcousticBackend()
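The top-k scan in `classify_window` reduces to a small pure function: map each AudioSet label through the table, keep the highest-confidence hit per category, then drop winners below that category's floor. A standalone sketch with a reduced excerpt of the map and thresholds:

```python
# Excerpt of the module's _LABEL_MAP and _MIN_CONFIDENCE tables.
LABEL_MAP = {
    "Ringtone": ("queue", "ringback"),
    "Speech": ("speaker", "human_single"),
    "Music": ("environ", "music"),
    "Inside, small room": ("scene", "indoor_quiet"),
}
MIN_CONFIDENCE = {"queue": 0.10, "speaker": 0.08, "environ": 0.12, "scene": 0.08}

def best_per_category(predictions):
    """predictions: [(audioset_label, confidence), ...] from the top-k scan."""
    best = {}
    for ast_label, conf in predictions:
        mapping = LABEL_MAP.get(ast_label)
        if mapping is None:
            continue  # AudioSet class with no cf-voice mapping
        etype, cf_label = mapping
        if etype not in best or conf > best[etype][1]:
            best[etype] = (cf_label, conf)
    # Suppress low-confidence winners per category
    return {
        etype: (label, conf)
        for etype, (label, conf) in best.items()
        if conf >= MIN_CONFIDENCE.get(etype, 0.10)
    }
```

Note that a single AudioSet class can only ever feed one category — the map is a plain dict, which is why duplicate keys like the old "Music" entry silently lose one mapping.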
197	cf_voice/app.py (new file)
@@ -0,0 +1,197 @@
"""
|
||||
cf-voice FastAPI service — managed by cf-orch.
|
||||
|
||||
Tone/affect classification sidecar for Linnet and any product that needs
|
||||
real-time audio context annotation. Wraps ContextClassifier so it runs as an
|
||||
independent managed process rather than embedded in the consumer's process.
|
||||
|
||||
Endpoints:

    GET  /health    → {"status": "ok", "mode": "mock"|"real"}
    POST /classify  → ClassifyResponse

Usage:

    python -m cf_voice.app --port 8007 --gpu-id 0

Mock mode (no GPU, no audio hardware required):

    CF_VOICE_MOCK=1 python -m cf_voice.app --port 8007
"""
from __future__ import annotations

import argparse
import logging
import os

import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

from cf_voice.context import ContextClassifier, model_status

logger = logging.getLogger(__name__)

_classifier: ContextClassifier | None = None
_mock_mode: bool = False


# ── Request / response models ─────────────────────────────────────────────────


class ClassifyRequest(BaseModel):
    audio_chunk: str | None = None  # base64-encoded PCM int16 mono 16kHz; None in mock mode
    timestamp: float = 0.0
    elcor: bool | None = None
    prior_frames: int | None = None
    session_id: str = ""
    language: str | None = None  # BCP-47 hint for Whisper ("en", "es", …); None = auto-detect
    num_speakers: int | None = None  # pyannote hint: None = auto; 1–8 = fixed min+max


class AudioEventOut(BaseModel):
    event_type: str
    label: str
    confidence: float
    timestamp: float
    speaker_id: str = "speaker_a"
    subtext: str | None = None
    affect: str | None = None
    shift_magnitude: float | None = None
    shift_direction: str | None = None
    prosody_flags: list[str] = []
    # Dimensional emotion (audeering model) — None when classifier disabled
    valence: float | None = None
    arousal: float | None = None
    dominance: float | None = None
    # Prosodic signals (openSMILE) — None when extractor disabled
    sarcasm_risk: float | None = None
    flat_f0_score: float | None = None
    # Trajectory signals — None until BASELINE_MIN frames buffered per speaker
    arousal_delta: float | None = None
    valence_delta: float | None = None
    trend: str | None = None
    # Coherence signals (SER vs VAD)
    coherence_score: float | None = None
    suppression_flag: bool | None = None
    reframe_type: str | None = None
    affect_divergence: float | None = None


class ClassifyResponse(BaseModel):
    events: list[AudioEventOut]


# ── App factory ───────────────────────────────────────────────────────────────


def create_app(gpu_id: int = 0, mock: bool = False) -> FastAPI:
    global _classifier, _mock_mode

    # Signal GPU to the inference backends (wav2vec2 loads via transformers pipeline)
    if not mock:
        os.environ.setdefault("CUDA_VISIBLE_DEVICES", str(gpu_id))

    _mock_mode = mock or os.environ.get("CF_VOICE_MOCK", "") == "1"
    _classifier = ContextClassifier.mock() if _mock_mode else ContextClassifier.from_env()
    logger.info("cf-voice ready: mode=%s", "mock" if _mock_mode else "real")

    app = FastAPI(title="cf-voice", version="0.1.0")

    @app.on_event("startup")
    async def _startup_prewarm() -> None:
        """Pre-warm all configured models so downloads happen at startup, not
        on the first classify call (which has a hard 9-second timeout)."""
        if _classifier is not None:
            import asyncio as _asyncio
            _asyncio.create_task(_classifier.prewarm())

    @app.get("/health")
    def health() -> dict:
        result: dict = {
            "status": "ok",
            "mode": "mock" if _mock_mode else "real",
            "models": dict(model_status),
        }
        # Surface misconfigured-but-silent diarizer so Linnet can warn the user.
        # Check env vars only — no model loading needed at health-check time.
        warnings: list[str] = []
        if os.environ.get("CF_VOICE_DIARIZE", "0") == "1":
            token = os.environ.get("HF_TOKEN", "").strip()
            if not token:
                warnings.append(
                    "Diarization is enabled (CF_VOICE_DIARIZE=1) but HF_TOKEN is not set. "
                    "Speaker identity badges will not appear. "
                    "Set HF_TOKEN in your .env and accept pyannote model terms at huggingface.co."
                )
        if warnings:
            result["warnings"] = warnings
        return result

    @app.post("/classify")
    async def classify(req: ClassifyRequest) -> ClassifyResponse:
        if _classifier is None:
            raise HTTPException(503, detail="classifier not initialised")
        try:
            events = await _classifier.classify_chunk_async(
                audio_b64=req.audio_chunk,
                timestamp=req.timestamp,
                prior_frames=req.prior_frames,
                elcor=req.elcor,
                session_id=req.session_id,
                language=req.language,
                num_speakers=req.num_speakers,
            )
        except NotImplementedError as exc:
            raise HTTPException(501, detail=str(exc))

        from cf_voice.events import ToneEvent

        out: list[AudioEventOut] = []
        for e in events:
            is_tone = isinstance(e, ToneEvent)
            out.append(AudioEventOut(
                event_type=e.event_type,
                label=e.label,
                confidence=round(e.confidence, 4),
                timestamp=e.timestamp,
                speaker_id=getattr(e, "speaker_id", "speaker_a") or "speaker_a",
                subtext=getattr(e, "subtext", None),
                affect=getattr(e, "affect", None) if is_tone else None,
                shift_magnitude=getattr(e, "shift_magnitude", None) if is_tone else None,
                shift_direction=getattr(e, "shift_direction", None) if is_tone else None,
                prosody_flags=getattr(e, "prosody_flags", []) if is_tone else [],
                valence=getattr(e, "valence", None) if is_tone else None,
                arousal=getattr(e, "arousal", None) if is_tone else None,
                dominance=getattr(e, "dominance", None) if is_tone else None,
                sarcasm_risk=getattr(e, "sarcasm_risk", None) if is_tone else None,
                flat_f0_score=getattr(e, "flat_f0_score", None) if is_tone else None,
                arousal_delta=getattr(e, "arousal_delta", None) if is_tone else None,
                valence_delta=getattr(e, "valence_delta", None) if is_tone else None,
                trend=getattr(e, "trend", None) if is_tone else None,
                coherence_score=getattr(e, "coherence_score", None) if is_tone else None,
                suppression_flag=getattr(e, "suppression_flag", None) if is_tone else None,
                reframe_type=getattr(e, "reframe_type", None) if is_tone else None,
                affect_divergence=getattr(e, "affect_divergence", None) if is_tone else None,
            ))
        return ClassifyResponse(events=out)

    return app


# ── CLI entrypoint ────────────────────────────────────────────────────────────


def _parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="cf-voice tone classification server")
    parser.add_argument("--port", type=int, default=8007)
    parser.add_argument("--host", default="0.0.0.0")
    parser.add_argument("--gpu-id", type=int, default=0)
    parser.add_argument("--mock", action="store_true",
                        help="Run in mock mode (no GPU, no audio hardware needed)")
    return parser.parse_args()


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(name)s — %(message)s")
    args = _parse_args()
    app = create_app(gpu_id=args.gpu_id, mock=args.mock)
    uvicorn.run(app, host=args.host, port=args.port, log_level="info")
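For reference, a minimal client-side sketch of building the payload /classify expects (this helper is illustrative, not part of the commit; only the `audio_chunk` encoding, base64 of int16 mono 16 kHz PCM, comes from the request model above):

```python
import base64

import numpy as np


def encode_chunk(samples: np.ndarray) -> str:
    """Encode int16 mono 16 kHz PCM samples as the base64 string /classify expects."""
    assert samples.dtype == np.int16
    return base64.b64encode(samples.tobytes()).decode("ascii")


# 2.5 s of silence at 16 kHz stands in for a real capture window.
chunk = encode_chunk(np.zeros(40_000, dtype=np.int16))
payload = {"audio_chunk": chunk, "timestamp": 4.5, "session_id": "demo"}

# The server reverses the encoding before inference:
decoded = np.frombuffer(base64.b64decode(chunk), dtype=np.int16)
```

POSTing `payload` as JSON to `/classify` would then yield a `ClassifyResponse` body.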
@@ -82,13 +82,21 @@ class ToneClassifier:
     Tone/affect classifier: wav2vec2 SER + librosa prosody.
 
     Loads the model lazily on first call to avoid import-time GPU allocation.
-    Thread-safe for concurrent classify() calls — the pipeline is stateless
+    Thread-safe for concurrent classify() calls — the model is stateless
     per-call; session state lives in the caller (ContextClassifier).
+
+    Uses AutoFeatureExtractor + AutoModelForAudioClassification directly
+    rather than hf_pipeline to avoid torchcodec audio backend initialization.
+    torchcodec 0.11.0 requires libnvrtc.so.13, which is absent on CUDA 12.x
+    systems. Calling the model directly bypasses the pipeline's audio backend
+    selection entirely since we already have float32 at 16kHz.
     """
 
     def __init__(self, threshold: float = _DEFAULT_THRESHOLD) -> None:
         self._threshold = threshold
-        self._pipeline = None  # lazy-loaded
+        self._feature_extractor = None  # lazy-loaded
+        self._model = None  # lazy-loaded
+        self._device: str = "cpu"
 
     @classmethod
     def from_env(cls) -> "ToneClassifier":
@@ -96,23 +104,41 @@ class ToneClassifier:
         return cls(threshold=threshold)
 
     def _load_pipeline(self) -> None:
-        if self._pipeline is not None:
+        if self._model is not None:
             return
         try:
-            from transformers import pipeline as hf_pipeline
+            from transformers import (
+                AutoFeatureExtractor,
+                AutoModelForAudioClassification,
+            )
         except ImportError as exc:
             raise ImportError(
                 "transformers is required for tone classification. "
                 "Install with: pip install cf-voice[inference]"
             ) from exc
 
-        device = 0 if _cuda_available() else -1
-        logger.info("Loading SER model %s on device %s", _SER_MODEL_ID, device)
-        self._pipeline = hf_pipeline(
-            "audio-classification",
-            model=_SER_MODEL_ID,
-            device=device,
-        )
+        import torch
+
+        if _cuda_available():
+            self._device = "cuda:0"
+            # fp16 halves VRAM from ~6.7 GB to ~3.3 GB on RTX 4000.
+            # Only supported on CUDA — CPU must stay float32.
+            torch_dtype = torch.float16
+        else:
+            self._device = "cpu"
+            torch_dtype = torch.float32
+
+        logger.info(
+            "Loading SER model %s on device=%s dtype=%s",
+            _SER_MODEL_ID, self._device, torch_dtype,
+        )
+        self._feature_extractor = AutoFeatureExtractor.from_pretrained(_SER_MODEL_ID)
+        self._model = AutoModelForAudioClassification.from_pretrained(
+            _SER_MODEL_ID,
+            torch_dtype=torch_dtype,
+        ).to(self._device)
+        # Switch to inference mode — disables dropout, batch norm tracks running stats
+        self._model.train(False)
 
     def classify(self, audio_float32: np.ndarray, transcript: str = "") -> ToneResult:
         """
@@ -121,13 +147,33 @@ class ToneClassifier:
         transcript is used as a weak signal for ambiguous cases (e.g. words
         like "unfortunately" bias toward apologetic even on a neutral voice).
         """
+        import torch
+
         self._load_pipeline()
 
         # Ensure the model sees float32 at the right rate
         assert audio_float32.dtype == np.float32, "audio must be float32"
 
-        # Run SER
-        preds = self._pipeline({"raw": audio_float32, "sampling_rate": _SAMPLE_RATE})
+        # Run SER — call feature extractor + model directly to bypass the
+        # hf_pipeline audio backend (avoids torchcodec / libnvrtc dependency).
+        inputs = self._feature_extractor(
+            audio_float32,
+            sampling_rate=_SAMPLE_RATE,
+            return_tensors="pt",
+        )
+        inputs = {k: v.to(self._device) for k, v in inputs.items()}
+        if self._model.dtype == torch.float16:
+            inputs = {k: v.to(torch.float16) for k, v in inputs.items()}
+
+        with torch.no_grad():
+            logits = self._model(**inputs).logits
+        probs = torch.softmax(logits, dim=-1)[0]
+        id2label = self._model.config.id2label
+        preds = [
+            {"label": id2label[i], "score": float(probs[i])}
+            for i in range(len(probs))
+        ]
 
         best = max(preds, key=lambda p: p["score"])
         emotion = best["label"].lower()
         confidence = float(best["score"])
@@ -158,7 +204,7 @@ class ToneClassifier:
         self, audio_float32: np.ndarray, transcript: str = ""
     ) -> ToneResult:
         """classify() without blocking the event loop."""
-        loop = asyncio.get_event_loop()
+        loop = asyncio.get_running_loop()
        fn = partial(self.classify, audio_float32, transcript)
        return await loop.run_in_executor(None, fn)
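The logits-to-preds conversion in the new direct-model path is plain softmax bookkeeping; a self-contained sketch of the same mapping without torch (the label names here are stand-ins, not the real model's classes):

```python
import math


def logits_to_preds(logits: list[float], id2label: dict[int, str]) -> list[dict]:
    """Mirror of the diff's softmax + id2label mapping, in pure Python."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [
        {"label": id2label[i], "score": exps[i] / total}
        for i in range(len(logits))
    ]


preds = logits_to_preds([2.0, 0.5, 0.1], {0: "neutral", 1: "angry", 2: "happy"})
best = max(preds, key=lambda p: p["score"])
```

The `best = max(...)` selection afterwards is exactly the line the hunk keeps unchanged.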
@@ -1,99 +1,289 @@
-# cf_voice/context.py — tone classification and context enrichment
+# cf_voice/context.py — parallel audio context classifier (orchestrator)
 #
 # BSL 1.1 when real inference models are integrated.
-# Currently a passthrough stub: wraps a VoiceIO source and forwards frames.
+# Mock mode: MIT licensed (no real inference).
 #
-# Real implementation (Notation v0.1.x) will:
-#   - Run YAMNet acoustic event detection on the audio buffer
-#   - Run wav2vec2-based SER (speech emotion recognition)
-#   - Run librosa prosody extraction (pitch, energy, rate)
-#   - Combine into enriched VoiceFrame label + confidence
-#   - Support pyannote.audio speaker diarization (Navigation v0.2.x)
+# Runs three classifiers in parallel against the same audio window:
+#   1. Tone/affect (classify.py) — wav2vec2 SER + librosa prosody
+#   2. Queue/environ (acoustic.py) — YAMNet acoustic event detection
+#   3. Speaker type/VAD (diarize.py) — pyannote.audio (Navigation v0.2.x)
+#
+# Combined output is a list[AudioEvent] per window, merged into VoiceFrame
+# for the streaming path.
+#
+# Elcor mode reads from cf-core preferences (cf_voice.prefs) so that the
+# annotation format is user-configurable without per-request flags.
 from __future__ import annotations
 
+import asyncio
 import logging
 import os
 from typing import AsyncIterator
 
+from cf_voice.acoustic import MockAcousticBackend, make_acoustic
 from cf_voice.events import AudioEvent, ToneEvent, tone_event_from_voice_frame
 from cf_voice.io import MockVoiceIO, VoiceIO, make_io
 from cf_voice.models import VoiceFrame
+from cf_voice.prefs import get_elcor_prior_frames, is_elcor_enabled
 
 logger = logging.getLogger(__name__)
 
+# ── Per-model download/load status registry ───────────────────────────────────
+# Written by _load_* methods; read by the /health endpoint in app.py.
+# Values: "disabled" | "loading" | "ready" | "error"
+# Thread-safe: individual str assignment is atomic in CPython.
+model_status: dict[str, str] = {}
+
+
+# ── No-op coroutines for disabled/unavailable classifiers ─────────────────────
+
+async def _noop_stt() -> None:
+    """Placeholder when STT is disabled or unavailable."""
+    return None
+
+
+async def _noop_diarize() -> list:
+    """Placeholder when diarization is disabled or unavailable."""
+    return []
+
+
+# ─────────────────────────────────────────────────────────────────────────────
 
 class ContextClassifier:
     """
     High-level voice context classifier.
 
-    Wraps a VoiceIO source and enriches each VoiceFrame with tone annotation.
-    In stub mode the frames pass through unchanged — the enrichment pipeline
-    (YAMNet + wav2vec2 + librosa) is filled in incrementally.
+    Wraps a VoiceIO source and runs three parallel classifiers on each audio
+    window: tone (SER), queue/environ (YAMNet), and speaker (pyannote).
+
+    In mock mode all classifiers produce synthetic events — no GPU, microphone,
+    or HuggingFace token required.
 
     Usage
     -----
         classifier = ContextClassifier.from_env()
         async for frame in classifier.stream():
             print(frame.label, frame.confidence)
+
+    For the full multi-class event list (queue + speaker + tone):
+        events = classifier.classify_chunk(audio_b64, timestamp=4.5)
     """
 
-    def __init__(self, io: VoiceIO) -> None:
+    def __init__(
+        self,
+        io: VoiceIO,
+        user_id: str | None = None,
+        store=None,
+    ) -> None:
         self._io = io
+        self._user_id = user_id
+        self._store = store
+        self._acoustic = make_acoustic(
+            mock=isinstance(io, MockVoiceIO)
+            or os.environ.get("CF_VOICE_MOCK", "") == "1"
+        )
+        # Lazy — loaded on first real classify call, then reused.
+        self._tone: "ToneClassifier | None" = None
+        # STT: loaded if faster-whisper is installed. Controlled by CF_VOICE_STT (default: 1).
+        self._stt: "WhisperSTT | None" = None
+        self._stt_loaded: bool = False  # False = not yet attempted
+        # Diarizer: optional — requires HF_TOKEN and CF_VOICE_DIARIZE=1.
+        self._diarizer: "Diarizer | None" = None
+        self._diarizer_loaded: bool = False
+        # Per-session speaker label tracker — maps pyannote IDs → "Speaker A/B/..."
+        # Reset at session end (when the ContextClassifier is stopped).
+        from cf_voice.diarize import SpeakerTracker
+        self._speaker_tracker: SpeakerTracker = SpeakerTracker()
+        # One-at-a-time GPU classify gate. All three models share the same GPU;
+        # running them "in parallel" just serializes at the CUDA level while
+        # filling the thread pool. Drop incoming frames when a classify is
+        # already in flight — freshness beats completeness for real-time audio.
+        self._classify_lock: asyncio.Lock = asyncio.Lock()
+        # Dimensional classifier (audeering) — lazy, CF_VOICE_DIMENSIONAL=1
+        self._dimensional: "DimensionalClassifier | None" = None
+        self._dimensional_loaded: bool = False
+        # Prosodic extractor (openSMILE) — lazy, CF_VOICE_PROSODY=1
+        self._prosodic: "ProsodicExtractor | None" = None
+        self._prosodic_loaded: bool = False
+        # Per-speaker rolling dimensional buffers for trajectory/coherence signals.
+        # Keys are speaker_id strings; values are deques of DimensionalResult.
+        # Reset at session end alongside SpeakerTracker.
+        from collections import deque as _deque
+        from cf_voice.trajectory import BUFFER_WINDOW
+        self._dim_buffer: dict[str, "_deque"] = {}
+        self._last_ser_affect: dict[str, str] = {}
+        self._buffer_window = BUFFER_WINDOW
+        # Accent classifier — lazy, gated by CF_VOICE_ACCENT=1
+        self._accent: "MockAccentClassifier | AccentClassifier | None" = None
+        self._accent_loaded: bool = False
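The per-speaker `_dim_buffer` set up above is a bounded deque per speaker; a minimal sketch of the pattern (the `BUFFER_WINDOW` value and the tuple payload are assumptions for illustration, cf_voice.trajectory defines the real ones):

```python
from collections import deque

BUFFER_WINDOW = 8  # assumed window size; the real constant lives in cf_voice.trajectory

dim_buffer: dict[str, deque] = {}


def push(speaker_id: str, arousal: float, valence: float) -> None:
    """Append one frame's dimensional scores, evicting the oldest past the window."""
    buf = dim_buffer.setdefault(speaker_id, deque(maxlen=BUFFER_WINDOW))
    buf.append((arousal, valence))


# 12 frames from one speaker: only the most recent 8 survive in the buffer.
for i in range(12):
    push("speaker_a", arousal=i / 10, valence=0.5)
```

Trend and delta signals can then be computed over `dim_buffer[speaker_id]` without ever persisting audio.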
     @classmethod
-    def from_env(cls, interval_s: float = 2.5) -> "ContextClassifier":
+    def from_env(
+        cls,
+        interval_s: float = 2.5,
+        user_id: str | None = None,
+        store=None,
+    ) -> "ContextClassifier":
         """
         Create a ContextClassifier from environment.
-        CF_VOICE_MOCK=1 activates mock mode (no GPU, no audio hardware needed).
+
+        CF_VOICE_MOCK=1 activates full mock mode (no GPU, no audio hardware).
         If real audio hardware is unavailable (faster-whisper not installed),
         falls back to mock mode automatically.
+        user_id + store are forwarded to cf-core preferences for Elcor/threshold
+        lookups.
         """
         if os.environ.get("CF_VOICE_MOCK", "") == "1":
+            return cls.mock(interval_s=interval_s, user_id=user_id, store=store)
         try:
             io = make_io(interval_s=interval_s)
-            return cls(io=io)
         except (NotImplementedError, ImportError):
             # Real audio hardware or inference extras unavailable — fall back to
             # mock mode so the coordinator starts cleanly on headless nodes.
+            return cls.mock(interval_s=interval_s, user_id=user_id, store=store)
+        return cls(io=io, user_id=user_id, store=store)
 
     @classmethod
-    def mock(cls, interval_s: float = 2.5, seed: int | None = None) -> "ContextClassifier":
+    def mock(
+        cls,
+        interval_s: float = 2.5,
+        seed: int | None = None,
+        user_id: str | None = None,
+        store=None,
+    ) -> "ContextClassifier":
         """Create a ContextClassifier backed by MockVoiceIO. Useful in tests."""
         from cf_voice.io import MockVoiceIO
-        return cls(io=MockVoiceIO(interval_s=interval_s, seed=seed))
+        return cls(
+            io=MockVoiceIO(interval_s=interval_s, seed=seed),
+            user_id=user_id,
+            store=store,
+        )
 
     async def stream(self) -> AsyncIterator[VoiceFrame]:
         """
         Yield enriched VoiceFrames continuously.
 
         Stub: frames from the IO layer pass through unchanged.
-        Real: enrichment pipeline runs here before yield.
+        Real (Navigation v0.2.x): acoustic + diarization enrichment runs here.
         """
         async for frame in self._io.stream():
             yield self._enrich(frame)
 
     async def stop(self) -> None:
         await self._io.stop()
+        self._speaker_tracker.reset()
+        self._dim_buffer.clear()
+        self._last_ser_affect.clear()
 
     def classify_chunk(
         self,
-        audio_b64: str,
+        audio_b64: str | None = None,
         timestamp: float = 0.0,
-        prior_frames: int = 0,
-        elcor: bool = False,
+        prior_frames: int | None = None,
+        elcor: bool | None = None,
+        session_id: str = "",
     ) -> list[AudioEvent]:
         """
-        Classify a single audio chunk and return AudioEvents.
+        Classify a single audio window and return all AudioEvents.
 
-        This is the request-response path used by the cf-orch endpoint.
+        Returns a heterogeneous list containing zero or one of each:
+          - ToneEvent  (event_type="tone")
+          - AudioEvent (event_type="queue")
+          - AudioEvent (event_type="speaker")
+          - AudioEvent (event_type="environ")
+
+        This is the request-response path used by the cf-orch SSE endpoint.
         The streaming path (async generator) is for continuous consumers.
 
-        elcor=True switches subtext format to Mass Effect Elcor prefix style.
-        Generic tone annotation is always present regardless of elcor flag.
+        audio_b64      Base64-encoded PCM int16 mono 16kHz bytes.
+                       Pass None in mock mode (ignored).
+        timestamp      Session-relative seconds since capture started.
+        prior_frames   Rolling context window size for Elcor LLM.
+                       Defaults to user preference (PREF_ELCOR_PRIOR_FRAMES).
+        elcor          Override Elcor mode for this request.
+                       None = read from user preference (PREF_ELCOR_MODE).
+        session_id     Caller-assigned correlation ID for the session.
         """
-        if isinstance(self._io, MockVoiceIO):
-            return self._classify_chunk_mock(timestamp, prior_frames, elcor)
+        use_elcor = elcor if elcor is not None else is_elcor_enabled(
+            user_id=self._user_id, store=self._store
+        )
+        context_frames = prior_frames if prior_frames is not None else get_elcor_prior_frames(
+            user_id=self._user_id, store=self._store
+        )
 
-        return self._classify_chunk_real(audio_b64, timestamp, elcor)
+        if isinstance(self._io, MockVoiceIO) or os.environ.get("CF_VOICE_MOCK", "") == "1":
+            return self._classify_mock(timestamp, context_frames, use_elcor, session_id)
 
-    def _classify_chunk_mock(
-        self, timestamp: float, prior_frames: int, elcor: bool
-    ) -> list[AudioEvent]:
-        """Synthetic path — used in mock mode and CI."""
+        if not audio_b64:
+            return []
+
+        return self._classify_real(audio_b64, timestamp, use_elcor, session_id)
+
+    async def classify_chunk_async(
+        self,
+        audio_b64: str | None = None,
+        timestamp: float = 0.0,
+        prior_frames: int | None = None,
+        elcor: bool | None = None,
+        session_id: str = "",
+        language: str | None = None,
+        num_speakers: int | None = None,
+    ) -> list[AudioEvent]:
+        """
+        Async variant of classify_chunk.
+
+        Runs tone, STT, diarization, and acoustic classification in parallel
+        using asyncio.gather(). Use this from async contexts (FastAPI routes)
+        to get true concurrency across all four inference paths.
+        """
+        use_elcor = elcor if elcor is not None else is_elcor_enabled(
+            user_id=self._user_id, store=self._store
+        )
+        context_frames = prior_frames if prior_frames is not None else get_elcor_prior_frames(
+            user_id=self._user_id, store=self._store
+        )
+
+        if isinstance(self._io, MockVoiceIO) or os.environ.get("CF_VOICE_MOCK", "") == "1":
+            return self._classify_mock(timestamp, context_frames, use_elcor, session_id)
+
+        if not audio_b64:
+            return []
+
+        # Drop frame if a classify is already in flight — GPU models serialize
+        # anyway, so queuing just adds latency without improving output.
+        if self._classify_lock.locked():
+            logger.debug("classify busy — dropping frame at t=%.2f", timestamp)
+            return []
+
+        async with self._classify_lock:
+            # Diarization (pyannote) can take 3–8 s on first invocations even with GPU.
+            # 25 s gives enough headroom without stalling the stream for too long.
+            try:
+                return await asyncio.wait_for(
+                    self._classify_real_async(audio_b64, timestamp, use_elcor, session_id, language, num_speakers),
+                    timeout=25.0,
+                )
+            except asyncio.TimeoutError:
+                logger.warning("classify_real_async timed out at t=%.2f — dropping frame", timestamp)
+                return []
+
+    def _classify_mock(
+        self,
+        timestamp: float,
+        prior_frames: int,
+        elcor: bool,
+        session_id: str,
+    ) -> list[AudioEvent]:
+        """
+        Synthetic multi-class event batch.
+
+        Tone event comes from the MockVoiceIO RNG (consistent seed behaviour).
+        Queue/speaker/environ come from MockAcousticBackend (call lifecycle simulation).
+        """
         rng = self._io._rng  # type: ignore[attr-defined]
-        import time as _time
         label = rng.choice(self._io._labels)  # type: ignore[attr-defined]
         shift = rng.uniform(0.1, 0.7) if prior_frames > 0 else 0.0
 
         frame = VoiceFrame(
             label=label,
             confidence=rng.uniform(0.6, 0.97),
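The busy-drop gate in classify_chunk_async is a generic pattern worth isolating; a runnable sketch with the class and timings invented for illustration (the real code checks `Lock.locked()` the same way before entering `async with`):

```python
import asyncio


class FrameGate:
    """Drop work when a previous item is still in flight, instead of queuing it."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self.processed = 0
        self.dropped = 0

    async def submit(self, duration: float) -> None:
        if self._lock.locked():  # a classify is already running — drop this frame
            self.dropped += 1
            return
        async with self._lock:
            await asyncio.sleep(duration)  # stands in for GPU inference
            self.processed += 1


async def main() -> tuple[int, int]:
    gate = FrameGate()
    # Fire 5 frames at once: one wins the lock, the rest are dropped.
    await asyncio.gather(*(gate.submit(0.01) for _ in range(5)))
    return gate.processed, gate.dropped


processed, dropped = asyncio.run(main())
```

Dropping rather than queuing is the right call for real-time audio, where a stale classification is worth less than no classification.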
@@ -101,30 +291,54 @@ class ContextClassifier:
             shift_magnitude=round(shift, 3),
             timestamp=timestamp,
         )
-        tone = tone_event_from_voice_frame(
+        tone: ToneEvent = tone_event_from_voice_frame(
             frame_label=frame.label,
             frame_confidence=frame.confidence,
             shift_magnitude=frame.shift_magnitude,
             timestamp=frame.timestamp,
             elcor=elcor,
         )
-        return [tone]
+        tone.session_id = session_id
 
-    def _classify_chunk_real(
-        self, audio_b64: str, timestamp: float, elcor: bool
+        acoustic = self._acoustic.classify_window(b"", timestamp=timestamp)
+
+        events: list[AudioEvent] = [tone]
+        if acoustic.queue:
+            events.append(acoustic.queue)
+        if acoustic.speaker:
+            events.append(acoustic.speaker)
+        if acoustic.environ:
+            events.append(acoustic.environ)
+        if acoustic.scene:
+            events.append(acoustic.scene)
+        return events
+
+    def _classify_real(
+        self,
+        audio_b64: str,
+        timestamp: float,
+        elcor: bool,
+        session_id: str,
     ) -> list[AudioEvent]:
-        """Real inference path — used when CF_VOICE_MOCK is unset."""
-        import asyncio
+        """
+        Real inference path — used when CF_VOICE_MOCK is unset.
+
+        Tone: wav2vec2 SER via ToneClassifier (classify.py).
+        Acoustic: YAMNet via YAMNetAcousticBackend (Navigation v0.2.x stub).
+        Speaker: pyannote VAD (diarize.py) — merged in ContextClassifier, not here.
+        """
         import base64
 
         import numpy as np
 
         from cf_voice.classify import ToneClassifier
 
         pcm = base64.b64decode(audio_b64)
         audio = np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32_768.0
 
-        # ToneClassifier is stateless per-call, safe to instantiate inline
-        classifier = ToneClassifier.from_env()
-        tone_result = classifier.classify(audio)
+        if self._tone is None:
+            self._tone = ToneClassifier.from_env()
+        tone_result = self._tone.classify(audio)
 
         frame = VoiceFrame(
             label=tone_result.label,
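The int16-to-float32 normalisation used in _classify_real is worth pinning down with concrete values; a standalone sketch of the same decode path:

```python
import base64

import numpy as np

# A tiny synthetic chunk: full-scale negative, zero, and near-full-scale positive.
pcm = np.array([-32768, 0, 32767], dtype=np.int16)
b64 = base64.b64encode(pcm.tobytes()).decode("ascii")

# Server-side decode, as in _classify_real: bytes → int16 → float32 in [-1.0, 1.0).
decoded = base64.b64decode(b64)
audio = np.frombuffer(decoded, dtype=np.int16).astype(np.float32) / 32_768.0
```

Dividing by 32,768 (not 32,767) maps the int16 minimum exactly to -1.0, at the cost of the maximum landing just below +1.0.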
@@ -133,20 +347,398 @@ class ContextClassifier:
             shift_magnitude=0.0,
             timestamp=timestamp,
         )
-        event = tone_event_from_voice_frame(
+        tone: ToneEvent = tone_event_from_voice_frame(
             frame_label=frame.label,
             frame_confidence=frame.confidence,
             shift_magnitude=frame.shift_magnitude,
             timestamp=frame.timestamp,
             elcor=elcor,
         )
-        return [event]
+        tone.session_id = session_id
+
+        events: list[AudioEvent] = [tone]
+
+        # Acoustic events: Navigation v0.2.x (YAMNet not yet implemented)
+        # YAMNetAcousticBackend raises NotImplementedError at construction —
+        # we catch and log rather than failing the entire classify call.
+        try:
+            acoustic = self._acoustic.classify_window(audio.tobytes(), timestamp=timestamp)
+            if acoustic.queue:
+                events.append(acoustic.queue)
+            if acoustic.speaker:
+                events.append(acoustic.speaker)
+            if acoustic.environ:
+                events.append(acoustic.environ)
+            if acoustic.scene:
+                events.append(acoustic.scene)
+        except NotImplementedError:
+            pass
+
+        return events
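Each _load_* method that follows has the same load-once, record-status shape; a distilled sketch of that pattern (all names here are illustrative, not the module's real API):

```python
import os

model_status: dict[str, str] = {}

_loaded = False
_model = None


def load_once(factory, name: str, env_flag: str):
    """Attempt a model load exactly once; later calls return the cached result."""
    global _loaded, _model
    if _loaded:
        return _model
    _loaded = True  # mark attempted even on failure — never retry in a hot path
    if os.environ.get(env_flag, "0") != "1":
        model_status[name] = "disabled"
        return None
    model_status[name] = "loading"
    try:
        _model = factory()
        model_status[name] = "ready"
    except Exception:
        model_status[name] = "error"
    return _model


os.environ.pop("DEMO_FLAG", None)  # ensure the gate is closed for this demo
first = load_once(lambda: object(), "demo", "DEMO_FLAG")  # flag unset → disabled
second = load_once(lambda: 1 / 0, "demo", "DEMO_FLAG")    # cached — factory never runs
```

Marking the attempt before the try block is what keeps a failing model from being re-downloaded on every classify call; the status dict is what /health reads.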
+    def _load_stt(self) -> "WhisperSTT | None":
+        """Lazy-load WhisperSTT once. Returns None if unavailable or disabled."""
+        if self._stt_loaded:
+            return self._stt
+        self._stt_loaded = True
+        if os.environ.get("CF_VOICE_STT", "1") != "1":
+            model_status["stt"] = "disabled"
+            return None
+        model_status["stt"] = "loading"
+        try:
+            from cf_voice.stt import WhisperSTT
+            self._stt = WhisperSTT.from_env()
+            model_status["stt"] = "ready"
+            logger.info("WhisperSTT loaded (model=%s)", os.environ.get("CF_VOICE_WHISPER_MODEL", "small"))
+        except Exception as exc:
+            model_status["stt"] = "error"
+            logger.warning("WhisperSTT unavailable: %s", exc)
+        return self._stt
+
+    def _load_diarizer(self) -> "Diarizer | None":
+        """Lazy-load Diarizer once. Returns None if HF_TOKEN absent or CF_VOICE_DIARIZE!=1."""
+        if self._diarizer_loaded:
+            return self._diarizer
+        self._diarizer_loaded = True
+        if os.environ.get("CF_VOICE_DIARIZE", "0") != "1":
+            model_status["diarizer"] = "disabled"
+            return None
+        model_status["diarizer"] = "loading"
+        try:
+            from cf_voice.diarize import Diarizer
+            self._diarizer = Diarizer.from_env()
+            model_status["diarizer"] = "ready"
+            logger.info("Diarizer loaded")
+        except Exception as exc:
+            model_status["diarizer"] = "error"
+            logger.warning("Diarizer unavailable: %s", exc)
+        return self._diarizer
+
+    def _load_dimensional(self) -> "DimensionalClassifier | None":
+        """Lazy-load DimensionalClassifier once. Returns None if CF_VOICE_DIMENSIONAL!=1."""
+        if self._dimensional_loaded:
+            return self._dimensional
+        self._dimensional_loaded = True
+        if os.environ.get("CF_VOICE_DIMENSIONAL", "0") != "1":
+            model_status["dimensional"] = "disabled"
+            return None
+        model_status["dimensional"] = "loading"
+        try:
+            from cf_voice.dimensional import DimensionalClassifier
+            self._dimensional = DimensionalClassifier()
+            model_status["dimensional"] = "ready"
+            logger.info("DimensionalClassifier loaded (audeering VAD model)")
+        except Exception as exc:
+            model_status["dimensional"] = "error"
+            logger.warning("DimensionalClassifier unavailable: %s", exc)
+        return self._dimensional
+
+    def _load_accent(self) -> "MockAccentClassifier | AccentClassifier | None":
+        """Lazy-load AccentClassifier once. Returns None if CF_VOICE_ACCENT!=1."""
+        if self._accent_loaded:
+            return self._accent
+        self._accent_loaded = True
+        from cf_voice.accent import make_accent_classifier
+        result = make_accent_classifier(
+            mock=isinstance(self._io, MockVoiceIO) or os.environ.get("CF_VOICE_MOCK", "") == "1"
+        )
+        self._accent = result
+        if result is None:
+            model_status["accent"] = "disabled"
+        else:
+            model_status["accent"] = "ready"
+            logger.info("AccentClassifier loaded (%s)", type(result).__name__)
+        return self._accent
+
+    def _load_prosodic(self) -> "ProsodicExtractor | None":
+        """Lazy-load ProsodicExtractor once. Returns None if CF_VOICE_PROSODY!=1."""
+        if self._prosodic_loaded:
+            return self._prosodic
+        self._prosodic_loaded = True
+        if os.environ.get("CF_VOICE_PROSODY", "0") != "1":
+            model_status["prosody"] = "disabled"
+            return None
+        model_status["prosody"] = "loading"
+        try:
+            from cf_voice.prosody import ProsodicExtractor
+            self._prosodic = ProsodicExtractor()
+            model_status["prosody"] = "ready"
+            logger.info("ProsodicExtractor loaded (openSMILE eGeMAPS)")
+        except Exception as exc:
+            model_status["prosody"] = "error"
+            logger.warning("ProsodicExtractor unavailable: %s", exc)
+        return self._prosodic
+    async def prewarm(self) -> None:
+        """Pre-load all configured models in a thread-pool so downloads happen at
+        startup rather than on the first classify call. Safe to call multiple times
+        (each _load_* method is idempotent after the first call)."""
+        if isinstance(self._io, MockVoiceIO) or os.environ.get("CF_VOICE_MOCK", "") == "1":
+            return
+        loop = asyncio.get_running_loop()
+        # Load each model in its own executor slot so status updates are visible
+        # as each one completes rather than all at once.
+        await loop.run_in_executor(None, self._load_stt)
+        await loop.run_in_executor(None, self._load_diarizer)
+        await loop.run_in_executor(None, self._load_dimensional)
+        await loop.run_in_executor(None, self._load_prosodic)
+        logger.info("cf-voice prewarm complete: %s", model_status)
|
||||
|
||||
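The prewarm shape — blocking loaders awaited one at a time on the default thread pool so a shared status dict updates as each model becomes ready — reduces to a few lines. A sketch with toy names (not the real loaders):

```python
# Minimal sketch of prewarm: each blocking loader runs in the default
# executor; awaiting them sequentially means the status dict reflects
# per-model progress rather than flipping all at once.
import asyncio

status: dict[str, str] = {}

def load(name: str) -> None:   # stands in for a blocking model download
    status[name] = "ready"

async def prewarm() -> None:
    loop = asyncio.get_running_loop()
    for name in ("stt", "diarizer", "dimensional", "prosody"):
        await loop.run_in_executor(None, load, name)

asyncio.run(prewarm())
assert status == {"stt": "ready", "diarizer": "ready",
                  "dimensional": "ready", "prosody": "ready"}
```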
    async def _classify_real_async(
        self,
        audio_b64: str,
        timestamp: float,
        elcor: bool,
        session_id: str,
        language: str | None = None,
        num_speakers: int | None = None,
    ) -> list[AudioEvent]:
        """
        Real inference path running all classifiers in parallel.

        Tone (wav2vec2) + STT (Whisper) + Diarization (pyannote, optional) +
        Acoustic (AST) all run concurrently via asyncio.gather(). Each result
        is type-checked after gather — a single classifier failure does not
        abort the call.

        Transcript text is fed back to ToneClassifier as a weak signal (e.g.
        "unfortunately" biases toward apologetic). Diarizer output sets the
        speaker_id on the VoiceFrame.
        """
        import base64
        from functools import partial

        import numpy as np

        from cf_voice.classify import ToneClassifier, _apply_transcript_hints, _AFFECT_TO_LABEL

        pcm = base64.b64decode(audio_b64)
        audio = np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32_768.0

        # Lazy-load models on first real call
        if self._tone is None:
            self._tone = ToneClassifier.from_env()
        stt = self._load_stt()
        diarizer = self._load_diarizer()
        dimensional = self._load_dimensional()
        prosodic = self._load_prosodic()
        accent_clf = self._load_accent()

        # Build coroutines — all run in thread pool executors internally.
        # Dimensional, prosodic, and accent run in parallel with SER/STT/diarization.
        # _noop_stt() doubles as a generic "no result" coroutine for disabled classifiers.
        tone_coro = self._tone.classify_async(audio)
        stt_coro = stt.transcribe_chunk_async(pcm, language=language) if stt else _noop_stt()
        diarize_coro = diarizer.diarize_async(audio, num_speakers=num_speakers) if diarizer else _noop_diarize()
        loop = asyncio.get_running_loop()
        acoustic_coro = loop.run_in_executor(
            None, partial(self._acoustic.classify_window, audio.tobytes(), timestamp)
        )
        dimensional_coro = dimensional.classify_async(audio) if dimensional else _noop_stt()
        prosodic_coro = prosodic.extract_async(audio) if prosodic else _noop_stt()
        accent_coro = loop.run_in_executor(
            None, partial(accent_clf.classify, audio.tobytes())
        ) if accent_clf else _noop_stt()

        (
            tone_result, stt_result, diarize_segs, acoustic,
            dimensional_result, prosodic_result, accent_result,
        ) = await asyncio.gather(
            tone_coro, stt_coro, diarize_coro, acoustic_coro,
            dimensional_coro, prosodic_coro, accent_coro,
            return_exceptions=True,
        )

        # Extract transcript text (STT optional)
        transcript = ""
        if stt_result and not isinstance(stt_result, BaseException):
            transcript = stt_result.text  # type: ignore[union-attr]

        # Apply transcript weak signal to affect if STT produced text
        if transcript and not isinstance(tone_result, BaseException):
            new_affect = _apply_transcript_hints(tone_result.affect, transcript)  # type: ignore[union-attr]
            if new_affect != tone_result.affect:  # type: ignore[union-attr]
                from cf_voice.classify import ToneResult
                tone_result = ToneResult(  # type: ignore[assignment]
                    label=_AFFECT_TO_LABEL.get(new_affect, tone_result.label),  # type: ignore[union-attr]
                    affect=new_affect,
                    confidence=tone_result.confidence,  # type: ignore[union-attr]
                    prosody_flags=tone_result.prosody_flags,  # type: ignore[union-attr]
                )

        # Get speaker_id from diarization (falls back to "speaker_a")
        speaker_id = "speaker_a"
        if isinstance(diarize_segs, BaseException):
            logger.warning("Diarizer failed in gather: %s", diarize_segs)
        elif diarizer and diarize_segs is not None:
            window_mid = len(audio) / 2.0 / 16_000.0
            speaker_id = diarizer.speaker_at(  # type: ignore[arg-type]
                diarize_segs, window_mid, tracker=self._speaker_tracker
            )
            logger.debug("diarize: segs=%d speaker=%s mid=%.3f", len(diarize_segs), speaker_id, window_mid)

        if isinstance(tone_result, BaseException):
            logger.warning("Tone classifier failed: %s", tone_result)
            return []

        # Unpack dimensional result (None when classifier is disabled or failed)
        dim = None
        if dimensional_result and not isinstance(dimensional_result, BaseException):
            dim = dimensional_result

        # Unpack prosodic result. If dimensional is also available, pass the
        # calm-positive score so sarcasm_risk benefits from both signals.
        pros = None
        if prosodic_result and not isinstance(prosodic_result, BaseException):
            if dim is not None:
                # Re-compute sarcasm_risk with dimensional context
                from cf_voice.prosody import ProsodicSignal, _compute_sarcasm_risk
                calm_pos = dim.calm_positive_score()
                updated_risk = _compute_sarcasm_risk(
                    flat_f0=prosodic_result.flat_f0_score,  # type: ignore[union-attr]
                    calm_positive=calm_pos,
                )
                pros = ProsodicSignal(
                    f0_mean=prosodic_result.f0_mean,  # type: ignore[union-attr]
                    f0_std=prosodic_result.f0_std,  # type: ignore[union-attr]
                    jitter=prosodic_result.jitter,  # type: ignore[union-attr]
                    shimmer=prosodic_result.shimmer,  # type: ignore[union-attr]
                    loudness=prosodic_result.loudness,  # type: ignore[union-attr]
                    flat_f0_score=prosodic_result.flat_f0_score,  # type: ignore[union-attr]
                    sarcasm_risk=updated_risk,
                )
            else:
                pros = prosodic_result

        frame = VoiceFrame(
            label=tone_result.label,  # type: ignore[union-attr]
            confidence=tone_result.confidence,  # type: ignore[union-attr]
            speaker_id=speaker_id,
            shift_magnitude=0.0,
            timestamp=timestamp,
            valence=dim.valence if dim else None,
            arousal=dim.arousal if dim else None,
            dominance=dim.dominance if dim else None,
            sarcasm_risk=pros.sarcasm_risk if pros else None,
            flat_f0_score=pros.flat_f0_score if pros else None,
        )
        tone_event: ToneEvent = tone_event_from_voice_frame(
            frame_label=frame.label,
            frame_confidence=frame.confidence,
            shift_magnitude=frame.shift_magnitude,
            timestamp=frame.timestamp,
            elcor=elcor,
        )
        tone_event.session_id = session_id
        tone_event.speaker_id = speaker_id
        # Attach dimensional and prosodic results to the wire event
        tone_event.valence = frame.valence
        tone_event.arousal = frame.arousal
        tone_event.dominance = frame.dominance
        tone_event.sarcasm_risk = frame.sarcasm_risk
        tone_event.flat_f0_score = frame.flat_f0_score

        # Trajectory and coherence signals — only when dimensional is running
        if dim:
            from collections import deque as _deque
            from cf_voice.trajectory import compute_trajectory

            spk_buffer = self._dim_buffer.setdefault(
                speaker_id, _deque(maxlen=self._buffer_window)
            )
            prior_affect = self._last_ser_affect.get(speaker_id)
            traj, coher = compute_trajectory(
                spk_buffer, dim, tone_result.affect, prior_affect  # type: ignore[union-attr]
            )
            # Update buffer and affect history after computing (not before)
            spk_buffer.append(dim)
            self._last_ser_affect[speaker_id] = tone_result.affect  # type: ignore[union-attr]

            tone_event.arousal_delta = traj.arousal_delta if traj.baseline_established else None
            tone_event.valence_delta = traj.valence_delta if traj.baseline_established else None
            tone_event.trend = traj.trend if traj.baseline_established else None
            tone_event.coherence_score = coher.coherence_score
            tone_event.suppression_flag = coher.suppression_flag
            tone_event.reframe_type = coher.reframe_type if coher.reframe_type != "none" else None
            tone_event.affect_divergence = coher.affect_divergence

            logger.debug(
                "Dimensional: valence=%.3f arousal=%.3f dominance=%.3f quadrant=%s "
                "trend=%s coherence=%.2f suppressed=%s reframe=%s",
                dim.valence, dim.arousal, dim.dominance, dim.affect_quadrant(),
                traj.trend, coher.coherence_score, coher.suppression_flag, coher.reframe_type,
            )

        if pros:
            logger.debug(
                "Prosodic: flat_f0=%.3f sarcasm_risk=%.3f",
                pros.flat_f0_score, pros.sarcasm_risk,
            )

        events: list[AudioEvent] = [tone_event]

        # Emit transcript event so consumers can display live STT
        if transcript:
            events.append(AudioEvent(
                timestamp=timestamp,
                event_type="transcript",  # type: ignore[arg-type]
                label=transcript,
                confidence=1.0,
                speaker_id=speaker_id,
            ))

        # Acoustic events (queue / speaker type / environ / scene)
        scene_label: str | None = None
        environ_labels: list[str] = []
        speaker_label: str | None = None
        if not isinstance(acoustic, BaseException):
            if acoustic.queue:  # type: ignore[union-attr]
                events.append(acoustic.queue)  # type: ignore[union-attr]
            if acoustic.speaker:  # type: ignore[union-attr]
                events.append(acoustic.speaker)  # type: ignore[union-attr]
                speaker_label = acoustic.speaker.label  # type: ignore[union-attr]
            if acoustic.environ:  # type: ignore[union-attr]
                events.append(acoustic.environ)  # type: ignore[union-attr]
                environ_labels = [acoustic.environ.label]  # type: ignore[union-attr]
            if acoustic.scene:  # type: ignore[union-attr]
                events.append(acoustic.scene)  # type: ignore[union-attr]
                scene_label = acoustic.scene.label  # type: ignore[union-attr]

        # Accent event (optional — gated by CF_VOICE_ACCENT=1)
        accent_region: str | None = None
        if accent_result and not isinstance(accent_result, BaseException):
            accent_region = accent_result.region  # type: ignore[union-attr]
            events.append(AudioEvent(
                timestamp=timestamp,
                event_type="accent",  # type: ignore[arg-type]
                label=accent_region,
                confidence=accent_result.confidence,  # type: ignore[union-attr]
                speaker_id=speaker_id,
            ))

        # Privacy risk scoring — local only, never transmitted
        from cf_voice.privacy import score_privacy_risk
        risk = score_privacy_risk(
            scene=scene_label,
            environ_labels=environ_labels,
            speaker=speaker_label,
            accent=accent_region,
        )
        if risk.level != "low":
            logger.info(
                "privacy_risk=%s flags=%s session=%s",
                risk.level, risk.flags, session_id,
            )
            # Attach risk to the tone event so Linnet can surface the gate
            tone_event.prosody_flags = list(tone_event.prosody_flags) + [f"privacy:{risk.level}"]

        return events

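The failure-isolation shape used above — gather with `return_exceptions=True`, then type-check each slot so one failing classifier degrades its own signal instead of aborting the whole classify call — reduces to a few lines. `ok`/`broken` are toy stand-ins for the classifier coroutines:

```python
# Sketch of gather-then-type-check: exceptions come back as values,
# so each result slot is filtered independently.
import asyncio

async def ok() -> str:
    return "tone"

async def broken() -> str:
    raise RuntimeError("model not loaded")

async def main() -> list[str]:
    tone, stt = await asyncio.gather(ok(), broken(), return_exceptions=True)
    results = []
    if not isinstance(tone, BaseException):
        results.append(tone)
    if not isinstance(stt, BaseException):
        results.append(stt)   # skipped: stt is the RuntimeError instance
    return results

assert asyncio.run(main()) == ["tone"]
```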
    def _enrich(self, frame: VoiceFrame) -> VoiceFrame:
        """
-        Apply tone classification to a raw frame.
+        Apply tone classification to a raw frame (streaming path).

        Stub: identity transform — returns frame unchanged.
-        Real: replace label + confidence with classifier output.
+        Real (Navigation v0.2.x): replace label + confidence with classifier output.
        """
        return frame

@@ -7,12 +7,16 @@
# Requires accepting gated model terms at:
# https://huggingface.co/pyannote/speaker-diarization-3.1
# https://huggingface.co/pyannote/segmentation-3.0
#
# Enable with: CF_VOICE_DIARIZE=1 (default off)
# Requires: HF_TOKEN set in environment
from __future__ import annotations

import asyncio
import logging
import os
-from dataclasses import dataclass
+import string
+from dataclasses import dataclass, field

import numpy as np

@@ -21,11 +25,16 @@ logger = logging.getLogger(__name__)
_DIARIZATION_MODEL = "pyannote/speaker-diarization-3.1"
_SAMPLE_RATE = 16_000

# Label returned when two speakers overlap in the same window
SPEAKER_MULTIPLE = "Multiple"
# Label returned when no speaker segment covers the timestamp (silence / VAD miss)
SPEAKER_UNKNOWN = "speaker_a"


@dataclass
class SpeakerSegment:
    """A speaker-labelled time range within an audio window."""
-    speaker_id: str  # ephemeral local label, e.g. "SPEAKER_00"
+    speaker_id: str  # raw pyannote label, e.g. "SPEAKER_00"
    start_s: float
    end_s: float

@@ -34,6 +43,51 @@ class SpeakerSegment:
        return self.end_s - self.start_s


class SpeakerTracker:
    """
    Maps ephemeral pyannote speaker IDs to stable per-session friendly labels.

    pyannote returns IDs like "SPEAKER_00", "SPEAKER_01" which are opaque and
    may differ across audio windows. SpeakerTracker assigns a consistent
    friendly label ("Speaker A", "Speaker B", ...) for the lifetime of one
    session, based on first-seen order.

    Speaker embeddings are never stored — only the raw_id → label string map,
    which contains no biometric information. Call reset() at session end to
    discard the map.

    For sessions with more than 26 speakers, labels wrap to "Speaker AA",
    "Speaker AB", etc. (unlikely in practice but handled gracefully).
    """

    def __init__(self) -> None:
        self._map: dict[str, str] = {}
        self._counter: int = 0

    def label(self, raw_id: str) -> str:
        """Return the friendly label for a pyannote speaker ID."""
        if raw_id not in self._map:
            self._map[raw_id] = self._next_label()
        return self._map[raw_id]

    def reset(self) -> None:
        """Discard all label mappings. Call at session end."""
        self._map.clear()
        self._counter = 0

    def _next_label(self) -> str:
        idx = self._counter
        self._counter += 1
        letters = string.ascii_uppercase
        n = len(letters)
        if idx < n:
            return f"Speaker {letters[idx]}"
        # Two-letter suffix for >26 speakers
        outer = idx // n
        inner = idx % n
        return f"Speaker {letters[outer - 1]}{letters[inner]}"

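SpeakerTracker's label sequence can be checked in isolation — the same arithmetic as `_next_label`, reproduced here as a standalone function to show the first-seen ordering and the >26-speaker wrap:

```python
# Reproduction of SpeakerTracker._next_label's arithmetic: single letters
# A–Z for the first 26 speakers, then a two-letter suffix (AA, AB, …).
import string

def nth_label(idx: int) -> str:
    letters = string.ascii_uppercase
    n = len(letters)
    if idx < n:
        return f"Speaker {letters[idx]}"
    outer, inner = idx // n, idx % n
    return f"Speaker {letters[outer - 1]}{letters[inner]}"

assert nth_label(0) == "Speaker A"
assert nth_label(25) == "Speaker Z"
assert nth_label(26) == "Speaker AA"   # wrap starts here
assert nth_label(27) == "Speaker AB"
```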
class Diarizer:
    """
    Async wrapper around pyannote.audio speaker diarization pipeline.

@@ -47,9 +101,9 @@ class Diarizer:
    Usage
    -----
        diarizer = Diarizer.from_env()
        tracker = SpeakerTracker()
        segments = await diarizer.diarize_async(audio_float32)
        for seg in segments:
            print(seg.speaker_id, seg.start_s, seg.end_s)
        label = diarizer.speaker_at(segments, timestamp_s=1.0, tracker=tracker)

    Navigation v0.2.x wires this into ContextClassifier so that each
    VoiceFrame carries the correct speaker_id from diarization output.

@@ -67,7 +121,7 @@ class Diarizer:
        logger.info("Loading diarization pipeline %s", _DIARIZATION_MODEL)
        self._pipeline = Pipeline.from_pretrained(
            _DIARIZATION_MODEL,
-            use_auth_token=hf_token,
+            token=hf_token,
        )

        # Move to GPU if available

@@ -92,16 +146,29 @@ class Diarizer:
        return cls(hf_token=token)

    def _diarize_sync(
-        self, audio_float32: np.ndarray, sample_rate: int = _SAMPLE_RATE
+        self,
+        audio_float32: np.ndarray,
+        sample_rate: int = _SAMPLE_RATE,
+        num_speakers: int | None = None,
    ) -> list[SpeakerSegment]:
-        """Synchronous diarization — always call via diarize_async."""
+        """Synchronous diarization — always call via diarize_async.
+
+        num_speakers: when set, passed as min_speakers=max_speakers to pyannote,
+        which skips the agglomeration heuristic and improves boundary accuracy
+        for known-size conversations (e.g. 2-person call).
+        """
        import torch

        # pyannote expects (channels, samples) float32 tensor
        waveform = torch.from_numpy(audio_float32[np.newaxis, :].astype(np.float32))
-        diarization = self._pipeline(
-            {"waveform": waveform, "sample_rate": sample_rate}
-        )
+        file: dict = {"waveform": waveform, "sample_rate": sample_rate}
+        pipeline_kwargs: dict = {}
+        if num_speakers and num_speakers > 0:
+            # min_speakers/max_speakers are call kwargs to the pipeline,
+            # not entries in the file dict (where they would be ignored).
+            pipeline_kwargs["min_speakers"] = num_speakers
+            pipeline_kwargs["max_speakers"] = num_speakers
+        output = self._pipeline(file, **pipeline_kwargs)
+        # pyannote >= 3.3 wraps results in DiarizeOutput; earlier versions return
+        # Annotation directly. Normalise to Annotation before iterating.
+        diarization = getattr(output, "speaker_diarization", output)

        segments: list[SpeakerSegment] = []
        for turn, _, speaker in diarization.itertracks(yield_label=True):

@@ -118,6 +185,7 @@ class Diarizer:
        self,
        audio_float32: np.ndarray,
        sample_rate: int = _SAMPLE_RATE,
+        num_speakers: int | None = None,
    ) -> list[SpeakerSegment]:
        """
        Diarize an audio window without blocking the event loop.

@@ -125,22 +193,58 @@ class Diarizer:
        audio_float32 should be 16kHz mono float32.
        Typical input is a 2-second window from MicVoiceIO (32000 samples).
        Returns segments ordered by start_s.
+
+        num_speakers: passed through to pyannote as min_speakers=max_speakers
+        when set and > 0. Improves accuracy for known speaker counts.
        """
-        loop = asyncio.get_event_loop()
+        from functools import partial
+        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(
-            None, self._diarize_sync, audio_float32, sample_rate
+            None,
+            partial(self._diarize_sync, audio_float32, sample_rate, num_speakers),
        )

    def speaker_at(
-        self, segments: list[SpeakerSegment], timestamp_s: float
+        self,
+        segments: list[SpeakerSegment],
+        timestamp_s: float,
+        tracker: SpeakerTracker | None = None,
+        window_s: float = 1.0,
    ) -> str:
        """
-        Return the speaker_id active at a given timestamp within the window.
+        Return the friendly speaker label dominating a window around timestamp_s.

-        Falls back to "speaker_a" if no segment covers the timestamp
-        (e.g. during silence or at window boundaries).
+        Strategy (in order):
+        1. If segments directly cover timestamp_s: return that speaker, or
+           SPEAKER_MULTIPLE when several overlapping segments cover it.
+        2. If timestamp_s falls in a silence gap: use the speaker with the most
+           total speaking time across a window_s-wide window centred on
+           timestamp_s. This handles pauses between pyannote segments without
+           falling back to "speaker_a".
+        3. No segments at all: SPEAKER_UNKNOWN.
+
+        tracker is optional; if omitted, raw pyannote IDs are returned as-is.
        """
        if not segments:
            return SPEAKER_UNKNOWN

        covering = [seg for seg in segments if seg.start_s <= timestamp_s <= seg.end_s]

        if len(covering) >= 2:
            return SPEAKER_MULTIPLE

        if len(covering) == 1:
            raw_id = covering[0].speaker_id
            return tracker.label(raw_id) if tracker else raw_id

        # Midpoint fell in a silence gap — find dominant speaker over the window.
        from collections import defaultdict
        duration_by_speaker: dict[str, float] = defaultdict(float)
        win_start = max(0.0, timestamp_s - window_s / 2)
        win_end = timestamp_s + window_s / 2
        for seg in segments:
-            if seg.start_s <= timestamp_s <= seg.end_s:
-                return seg.speaker_id
-        return "speaker_a"
+            overlap = min(seg.end_s, win_end) - max(seg.start_s, win_start)
+            if overlap > 0:
+                duration_by_speaker[seg.speaker_id] += overlap
        if not duration_by_speaker:
            return SPEAKER_UNKNOWN
        raw_id = max(duration_by_speaker, key=lambda k: duration_by_speaker[k])
        return tracker.label(raw_id) if tracker else raw_id
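The silence-gap branch of `speaker_at` — pick the speaker with the most overlap with a window around the midpoint rather than defaulting to "speaker_a" — can be exercised standalone. A toy re-implementation over `(speaker, start, end)` tuples:

```python
# Toy version of the dominant-speaker fallback: when no segment covers
# the query time, sum each speaker's overlap with a window centred on it.
from collections import defaultdict

def dominant(segments, t, window_s=1.0):
    # segments: list of (speaker_id, start_s, end_s)
    win_start, win_end = max(0.0, t - window_s / 2), t + window_s / 2
    by_speaker: dict[str, float] = defaultdict(float)
    for spk, start, end in segments:
        overlap = min(end, win_end) - max(start, win_start)
        if overlap > 0:
            by_speaker[spk] += overlap
    return max(by_speaker, key=by_speaker.get) if by_speaker else "UNKNOWN"

# Midpoint 1.0s falls in the pause between turns; SPEAKER_01 spoke more
# of the surrounding 1s window (0.65s vs 0.1s), so it wins.
segs = [("SPEAKER_00", 0.0, 0.6), ("SPEAKER_01", 0.7, 0.95), ("SPEAKER_01", 1.1, 1.5)]
assert dominant(segs, 1.0) == "SPEAKER_01"
```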
190  cf_voice/dimensional.py  Normal file
@@ -0,0 +1,190 @@
# cf_voice/dimensional.py — audeering dimensional emotion model
#
# BSL 1.1: real inference. Requires [inference] extras.
#
# Model: audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim
# Outputs three continuous 0-1 scores:
#   valence:   negative (0) to positive (1)
#   arousal:   low energy (0) to high energy (1)
#   dominance: submissive (0) to dominant (1)
#
# Trained on MSP-Podcast (in-the-wild conversational speech), not acted speech.
# This is the key differentiator from SER models trained on RAVDESS/IEMOCAP.
#
# Enable with: CF_VOICE_DIMENSIONAL=1 (default off until the audeering model is
# downloaded — ~1.5GB, adds ~800MB GPU VRAM)
#
# HuggingFace model page:
# https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim
from __future__ import annotations

import asyncio
import logging
import os
from dataclasses import dataclass
from functools import partial

import numpy as np

logger = logging.getLogger(__name__)

_SAMPLE_RATE = 16_000
_DIMENSIONAL_MODEL_ID = "audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim"


@dataclass
class DimensionalResult:
    """
    Output of the audeering dimensional emotion model.

    All scores are 0.0-1.0 continuous values:
      valence:   negative affect (0) to positive affect (1)
      arousal:   low energy / calm (0) to high energy / excited (1)
      dominance: submissive / uncertain (0) to dominant / assertive (1)

    Sarcasm signal: low arousal + higher valence = "calm-positive" profile.
    Combined with flat F0 (prosody.py) and text divergence (linnet#22) for
    the full multi-signal sarcasm heuristic.
    """
    valence: float
    arousal: float
    dominance: float

    def affect_quadrant(self) -> str:
        """
        Map VAD position to a descriptive quadrant label.

        These are reference labels for logging and debugging, not user-facing.
        The annotation layer (Elcor) handles user-facing interpretation.
        """
        v_high = self.valence >= 0.5
        a_high = self.arousal >= 0.5
        if v_high and a_high:
            return "enthusiastic"
        if v_high and not a_high:
            return "calm_positive"  # sarcasm candidate when paired with flat F0
        if not v_high and a_high:
            return "frustrated_urgent"
        return "sad_resigned"

    def calm_positive_score(self) -> float:
        """
        0-1 score indicating how strongly the VAD position matches the
        calm-positive sarcasm candidate profile (low arousal, higher valence).

        Used as one component of the combined sarcasm heuristic.
        """
        valence_pos = max(0.0, self.valence - 0.5) * 2.0  # how positive
        arousal_low = 1.0 - self.arousal                  # how calm
        return valence_pos * 0.5 + arousal_low * 0.5

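A worked example of the two `DimensionalResult` heuristics for a flat, overly pleasant reading (high valence, low arousal) — the sarcasm-candidate profile — using the same arithmetic inlined:

```python
# Same formulas as affect_quadrant() / calm_positive_score(), inlined:
# valence 0.8, arousal 0.2 lands in the calm_positive quadrant.
valence, arousal = 0.8, 0.2

v_high, a_high = valence >= 0.5, arousal >= 0.5
quadrant = "calm_positive" if v_high and not a_high else "other"

valence_pos = max(0.0, valence - 0.5) * 2.0   # 0.6 — how positive
arousal_low = 1.0 - arousal                   # 0.8 — how calm
score = valence_pos * 0.5 + arousal_low * 0.5

assert quadrant == "calm_positive"
assert abs(score - 0.7) < 1e-9
```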
class DimensionalClassifier:
    """
    Async wrapper around the audeering wav2vec2 dimensional emotion model.

    The model runs in a thread pool executor to avoid blocking asyncio.
    Loaded once on first call and reused; the underlying wav2vec2 model
    lands on CUDA when available (same device as the SER model in classify.py).

    Usage
    -----
        clf = DimensionalClassifier.from_env()
        result = await clf.classify_async(audio_float32)
        print(result.valence, result.arousal, result.dominance)
    """

    def __init__(self) -> None:
        self._model = None
        self._processor = None
        self._loaded = False

    def _ensure_loaded(self) -> None:
        """Load model and processor on first inference call (not at construction)."""
        if self._loaded:
            return
        self._loaded = True

        try:
            from transformers import Wav2Vec2Processor, Wav2Vec2ForSequenceClassification
        except ImportError as exc:
            raise ImportError(
                "transformers is required for dimensional emotion classification. "
                "Install with: pip install cf-voice[inference]"
            ) from exc

        logger.info("Loading dimensional emotion model %s", _DIMENSIONAL_MODEL_ID)
        self._processor = Wav2Vec2Processor.from_pretrained(_DIMENSIONAL_MODEL_ID)
        self._model = Wav2Vec2ForSequenceClassification.from_pretrained(_DIMENSIONAL_MODEL_ID)

        try:
            import torch
            if torch.cuda.is_available():
                self._model = self._model.to(torch.device("cuda"))
                logger.info("Dimensional model on CUDA")
        except ImportError:
            pass

        self._model.eval()

    def _classify_sync(self, audio_float32: np.ndarray) -> DimensionalResult:
        """
        Synchronous inference. Always call via classify_async.

        The audeering model outputs [valence, arousal, dominance] as logits
        in the range 0-1 (sigmoid regression heads, not softmax). The model was
        fine-tuned on MSP-Podcast with per-dimension regression, not classification.
        """
        self._ensure_loaded()

        try:
            import torch
        except ImportError as exc:
            raise ImportError("torch is required for dimensional inference") from exc

        inputs = self._processor(
            audio_float32,
            sampling_rate=_SAMPLE_RATE,
            return_tensors="pt",
            padding=True,
        )

        if torch.cuda.is_available():
            inputs = {k: v.to("cuda") for k, v in inputs.items()}

        with torch.no_grad():
            logits = self._model(**inputs).logits

        # Model outputs [valence, arousal, dominance] in a single (1, 3) tensor
        scores = logits[0].cpu().float().numpy()
        valence = float(np.clip(scores[0], 0.0, 1.0))
        arousal = float(np.clip(scores[1], 0.0, 1.0))
        dominance = float(np.clip(scores[2], 0.0, 1.0))

        return DimensionalResult(
            valence=round(valence, 4),
            arousal=round(arousal, 4),
            dominance=round(dominance, 4),
        )

    async def classify_async(self, audio_float32: np.ndarray) -> DimensionalResult:
        """
        Classify audio without blocking the event loop.

        Runs in a thread pool executor. Designed to be gathered alongside
        the SER and diarization coroutines in context._classify_real_async().
        """
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(
            None, partial(self._classify_sync, audio_float32)
        )

    @classmethod
    def from_env(cls) -> "DimensionalClassifier":
        """Construct from environment. Raises if CF_VOICE_DIMENSIONAL is not set."""
        if os.environ.get("CF_VOICE_DIMENSIONAL", "0") != "1":
            raise EnvironmentError(
                "CF_VOICE_DIMENSIONAL=1 is required to enable the audeering dimensional model. "
                "Add it to your .env file. The model requires ~800MB GPU VRAM."
            )
        return cls()

@@ -10,10 +10,10 @@ from __future__ import annotations
from dataclasses import dataclass, field
from typing import Literal

-EventType = Literal["queue", "speaker", "environ", "tone"]
+EventType = Literal["queue", "speaker", "environ", "tone", "transcript", "scene", "accent"]

# ── Queue state labels ────────────────────────────────────────────────────────
-# Detected from YAMNet acoustic event classification
+# Detected from AST acoustic event classification
QUEUE_LABELS = Literal[
    "hold_music", "silence", "ringback", "busy", "dead_air", "dtmf_tone"
]

@@ -21,13 +21,36 @@ QUEUE_LABELS = Literal[
# ── Speaker type labels ───────────────────────────────────────────────────────
# Detected from pyannote VAD + custom IVR-vs-human head
SPEAKER_LABELS = Literal[
-    "ivr_synth", "human_single", "human_multi", "transfer", "no_speaker"
+    "ivr_synth", "human_single", "human_multi", "transfer", "no_speaker",
+    "background_voices",
]

# ── Environmental labels ──────────────────────────────────────────────────────
-# Background shift is the primary AMD (answering machine detection) signal
+# Background shift is the primary AMD (answering machine detection) signal.
+# Telephony labels + general-purpose acoustic scene labels.
ENVIRON_LABELS = Literal[
-    "call_center", "music", "background_shift", "noise_floor_change", "quiet"
+    # Telephony
+    "call_center", "music", "background_shift", "noise_floor_change", "quiet",
+    # Nature
+    "birdsong", "wind", "rain", "water",
+    # Urban
+    "traffic", "crowd_chatter", "street_signal", "construction",
+    # Indoor
+    "hvac", "keyboard_typing", "restaurant",
]

# ── Acoustic scene labels ─────────────────────────────────────────────────────
# Broad scene category — primary input to privacy risk scoring.
SCENE_LABELS = Literal[
    "indoor_quiet", "indoor_crowd", "outdoor_urban", "outdoor_nature",
    "vehicle", "public_transit",
]

# ── Accent / language labels ──────────────────────────────────────────────────
# Regional accent of primary speaker. Gated by CF_VOICE_ACCENT=1.
ACCENT_LABELS = Literal[
    "en_gb", "en_us", "en_au", "en_ca", "en_in",
    "fr", "es", "de", "zh", "ja", "other",
]

# ── Tone / affect labels ──────────────────────────────────────────────────────

@ -86,12 +109,35 @@ class ToneEvent(AudioEvent):
|
|||
|
||||
The subtext field carries the human-readable annotation.
|
||||
Format is controlled by the caller (elcor flag in the classify request).
|
||||
|
||||
Dimensional emotion (Navigation v0.2.x — audeering model):
|
||||
valence / arousal / dominance are None when the dimensional classifier
|
||||
is not enabled (CF_VOICE_DIMENSIONAL != "1").
|
||||
|
||||
Prosodic signals (Navigation v0.2.x — openSMILE):
|
||||
sarcasm_risk / flat_f0_score are None when extractor is not enabled.
|
||||
"""
|
||||
affect: str = "neutral"
|
||||
shift_magnitude: float = 0.0
|
||||
shift_direction: str = "stable" # "warmer" | "colder" | "more_urgent" | "stable"
|
||||
prosody_flags: list[str] = field(default_factory=list)
|
||||
session_id: str = "" # caller-assigned; correlates events to a session
|
||||
# Dimensional emotion scores (audeering, optional)
|
||||
valence: float | None = None
|
||||
arousal: float | None = None
|
||||
dominance: float | None = None
|
||||
# Prosodic signals (openSMILE, optional)
|
||||
sarcasm_risk: float | None = None
|
||||
flat_f0_score: float | None = None
|
||||
# Trajectory signals (rolling buffer — activates after BASELINE_MIN frames)
|
||||
arousal_delta: float | None = None
|
||||
valence_delta: float | None = None
|
||||
trend: str | None = None # "stable"|"escalating"|"suppressed"|…
|
||||
# Coherence signals (SER vs VAD cross-comparison)
|
||||
coherence_score: float | None = None
|
||||
suppression_flag: bool | None = None
|
||||
reframe_type: str | None = None # "none"|"genuine"|"surface"
|
||||
affect_divergence: float | None = None
|
||||
|
||||
def __post_init__(self) -> None:
|
||||
# Force event_type to "tone" regardless of what the caller passed.
|
||||
@@ -118,5 +118,12 @@ def make_io(

    if use_mock:
        return MockVoiceIO(interval_s=interval_s)

    try:
        from cf_voice.capture import MicVoiceIO
        return MicVoiceIO(device_index=device_index)
    except ImportError as exc:
        raise NotImplementedError(
            "Real audio capture requires [inference] extras. "
            "Install with: pip install cf-voice[inference]\n"
            f"Missing: {exc}"
        ) from exc
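The real-capture path above is a gated-import fallback: try the optional dependency, and raise a clear install hint when it is missing. A standalone sketch of the same pattern (the return values here are placeholder strings, not the shipped classes):

```python
def make_io(use_mock: bool = False) -> str:
    """Select a capture backend; falls back to a clear error without extras."""
    if use_mock:
        return "MockVoiceIO"
    try:
        import sounddevice  # optional [inference] extra  # noqa: F401
    except ImportError as exc:
        raise NotImplementedError(
            "Real audio capture requires [inference] extras."
        ) from exc
    return "MicVoiceIO"

print(make_io(use_mock=True))  # MockVoiceIO
```

The advantage over importing at module top level is that the error surfaces only when real capture is actually requested, so mock-mode CI never needs the audio stack installed.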
@@ -13,19 +13,30 @@ class VoiceFrame:

    A single annotated moment in a voice stream.

    Produced by cf_voice.io (audio capture) and enriched by cf_voice.context
    (tone classification, speaker diarization).
    (tone classification, speaker diarization, dimensional emotion).

    Fields
    ------
    label            Tone annotation, e.g. "Warmly impatient" or "Deflecting".
                     Generic by default; Elcor-style prefix format is an
                     easter egg surfaced by the product UI, not set here.
    confidence       0.0–1.0. Below ~0.5 the annotation is speculative.
    confidence       0.0-1.0. Below ~0.5 the annotation is speculative.
    speaker_id       Ephemeral local label ("speaker_a", "speaker_b").
                     Not tied to identity — resets each session.
    shift_magnitude  Delta from the previous frame's tone, 0.0–1.0.
    shift_magnitude  Delta from the previous frame's tone, 0.0-1.0.
                     High values indicate a meaningful register shift.
    timestamp        Session-relative seconds since capture started.

    Dimensional emotion (audeering model — Navigation v0.2.x, optional):
    valence          0.0-1.0. Negative affect (0) to positive affect (1).
    arousal          0.0-1.0. Low energy / calm (0) to high energy / excited (1).
    dominance        0.0-1.0. Submissive / uncertain (0) to assertive / dominant (1).

    Prosodic features (openSMILE eGeMAPS — Navigation v0.2.x, optional):
    sarcasm_risk     0.0-1.0 heuristic score: flat F0 + calm-positive VAD +
                     text divergence (linnet#22). All three signals required for
                     high confidence — audio-only signals are weak priors.
    flat_f0_score    Normalised F0 flatness: 1.0 = maximally flat pitch.
    """

    label: str

@@ -34,6 +45,15 @@ class VoiceFrame:

    shift_magnitude: float
    timestamp: float

    # Dimensional emotion scores — None when dimensional classifier is disabled
    valence: float | None = None
    arousal: float | None = None
    dominance: float | None = None

    # Prosodic signals — None when prosodic extractor is disabled
    sarcasm_risk: float | None = None
    flat_f0_score: float | None = None

    def is_reliable(self, threshold: float = 0.6) -> bool:
        """Return True when confidence meets the given threshold."""
        return self.confidence >= threshold
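The `is_reliable` gate is the consumer-facing contract of the docstring's "below ~0.5 the annotation is speculative" note. A quick check with a minimal `VoiceFrame` stand-in (only the field exercised here, not the shipped dataclass):

```python
from dataclasses import dataclass

@dataclass
class Frame:
    """Trimmed stand-in for VoiceFrame: just the confidence gate."""
    confidence: float

    def is_reliable(self, threshold: float = 0.6) -> bool:
        return self.confidence >= threshold

print(Frame(0.72).is_reliable())     # True
print(Frame(0.45).is_reliable())     # False — speculative annotation
print(Frame(0.45).is_reliable(0.4))  # True — caller-tunable threshold
```

Callers that prefer a looser or stricter cut (e.g. the `voice.confidence_threshold` preference) pass their own threshold rather than mutating the frame.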
181
cf_voice/prefs.py
Normal file
@@ -0,0 +1,181 @@

# cf_voice/prefs.py — user preference hooks for cf-core preferences module
#
# MIT licensed. Provides voice-specific preference keys and helpers.
#
# When circuitforge_core is installed, reads/writes from the shared preference
# store (LocalFileStore or cloud backend). When it is not installed (standalone
# cf-voice use), falls back to environment variables only.
#
# Preference paths use dot-separated notation (cf-core convention):
#   "voice.elcor_mode"            bool  — Elcor-style tone annotations
#   "voice.confidence_threshold"  float — minimum confidence to emit a frame
#   "voice.whisper_model"         str   — faster-whisper model size
#   "voice.elcor_prior_frames"    int   — rolling context window for Elcor LLM
from __future__ import annotations

import logging
import os
from typing import Any

logger = logging.getLogger(__name__)

# ── Preference key constants ──────────────────────────────────────────────────

PREF_ELCOR_MODE = "voice.elcor_mode"
PREF_CONFIDENCE_THRESHOLD = "voice.confidence_threshold"
PREF_WHISPER_MODEL = "voice.whisper_model"
PREF_ELCOR_PRIOR_FRAMES = "voice.elcor_prior_frames"

# Defaults used when neither preference store nor environment has a value
_DEFAULTS: dict[str, Any] = {
    PREF_ELCOR_MODE: False,
    PREF_CONFIDENCE_THRESHOLD: 0.55,
    PREF_WHISPER_MODEL: "small",
    PREF_ELCOR_PRIOR_FRAMES: 4,
}

# ── Environment variable fallbacks ────────────────────────────────────────────

_ENV_KEYS: dict[str, str] = {
    PREF_ELCOR_MODE: "CF_VOICE_ELCOR",
    PREF_CONFIDENCE_THRESHOLD: "CF_VOICE_CONFIDENCE_THRESHOLD",
    PREF_WHISPER_MODEL: "CF_VOICE_WHISPER_MODEL",
    PREF_ELCOR_PRIOR_FRAMES: "CF_VOICE_ELCOR_PRIOR_FRAMES",
}

_COERCE: dict[str, type] = {
    PREF_ELCOR_MODE: bool,
    PREF_CONFIDENCE_THRESHOLD: float,
    PREF_WHISPER_MODEL: str,
    PREF_ELCOR_PRIOR_FRAMES: int,
}


def _from_env(pref_path: str) -> Any:
    """Read a preference from its environment variable fallback."""
    env_key = _ENV_KEYS.get(pref_path)
    if env_key is None:
        return None
    raw = os.environ.get(env_key)
    if raw is None:
        return None
    coerce = _COERCE.get(pref_path, str)
    try:
        if coerce is bool:
            return raw.strip().lower() in ("1", "true", "yes")
        return coerce(raw)
    except (ValueError, TypeError):
        logger.warning("prefs: could not parse env %s=%r as %s", env_key, raw, coerce)
        return None


def _cf_core_store():
    """Return the cf-core default preference store, or None if not available."""
    try:
        from circuitforge_core.preferences import store as _store_mod
        return _store_mod._DEFAULT_STORE
    except ImportError:
        return None


# ── Public API ────────────────────────────────────────────────────────────────


def get_voice_pref(
    pref_path: str,
    user_id: str | None = None,
    store=None,
) -> Any:
    """
    Read a voice preference value.

    Resolution order:
      1. Explicit store (passed in by caller — used for testing or cloud backends)
      2. cf-core LocalFileStore (if circuitforge_core is installed)
      3. Environment variable fallback
      4. Built-in default

    pref_path  One of the PREF_* constants, e.g. PREF_ELCOR_MODE.
    user_id    Passed to the store for cloud backends; local store ignores it.
    """
    # 1. Explicit store
    if store is not None:
        val = store.get(user_id=user_id, path=pref_path, default=None)
        if val is not None:
            return val

    # 2. cf-core default store
    cf_store = _cf_core_store()
    if cf_store is not None:
        val = cf_store.get(user_id=user_id, path=pref_path, default=None)
        if val is not None:
            return val

    # 3. Environment variable
    env_val = _from_env(pref_path)
    if env_val is not None:
        return env_val

    # 4. Built-in default
    return _DEFAULTS.get(pref_path)


def set_voice_pref(
    pref_path: str,
    value: Any,
    user_id: str | None = None,
    store=None,
) -> None:
    """
    Write a voice preference value.

    Writes to the explicit store if provided, otherwise to the cf-core default
    store. Raises RuntimeError if neither is available (env-only mode has no
    writable persistence).
    """
    target = store or _cf_core_store()
    if target is None:
        raise RuntimeError(
            "No writable preference store available. "
            "Install circuitforge_core or pass a store explicitly."
        )
    target.set(user_id=user_id, path=pref_path, value=value)


def is_elcor_enabled(user_id: str | None = None, store=None) -> bool:
    """
    Convenience: return True if the user has Elcor annotation mode enabled.

    Elcor mode switches tone subtext from generic format ("Tone: Frustrated")
    to the Mass Effect Elcor prefix format ("With barely concealed frustration:").
    It is an accessibility feature for autistic and ND users who benefit from
    explicit tonal annotation. Opt-in, local-only — no data leaves the device.

    Defaults to False.
    """
    return bool(get_voice_pref(PREF_ELCOR_MODE, user_id=user_id, store=store))


def get_confidence_threshold(user_id: str | None = None, store=None) -> float:
    """Return the minimum confidence threshold for emitting VoiceFrames (0.0–1.0)."""
    return float(
        get_voice_pref(PREF_CONFIDENCE_THRESHOLD, user_id=user_id, store=store)
    )


def get_whisper_model(user_id: str | None = None, store=None) -> str:
    """Return the faster-whisper model name to use (e.g. "small", "medium")."""
    return str(get_voice_pref(PREF_WHISPER_MODEL, user_id=user_id, store=store))


def get_elcor_prior_frames(user_id: str | None = None, store=None) -> int:
    """
    Return the number of prior VoiceFrames to include as context for Elcor
    label generation. Larger windows produce more contextually aware annotations
    but increase LLM prompt length and latency.

    Default: 4 frames (~8–10 seconds of rolling context at 2s intervals).
    """
    return int(
        get_voice_pref(PREF_ELCOR_PRIOR_FRAMES, user_id=user_id, store=store)
    )
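Steps 3 and 4 of the resolution order (env fallback with type coercion, then built-in default) can be exercised with a self-contained mirror of the `_from_env` logic, trimmed to two keys. This restates the pattern for illustration and is not the shipped module:

```python
import os

_ENV_KEYS = {"voice.elcor_mode": "CF_VOICE_ELCOR",
             "voice.confidence_threshold": "CF_VOICE_CONFIDENCE_THRESHOLD"}
_COERCE = {"voice.elcor_mode": bool, "voice.confidence_threshold": float}
_DEFAULTS = {"voice.elcor_mode": False, "voice.confidence_threshold": 0.55}

def get_pref(path: str):
    raw = os.environ.get(_ENV_KEYS[path])
    if raw is None:
        return _DEFAULTS[path]                      # step 4: built-in default
    if _COERCE[path] is bool:
        return raw.strip().lower() in ("1", "true", "yes")
    return _COERCE[path](raw)                       # step 3: env var, coerced

os.environ["CF_VOICE_ELCOR"] = "true"
os.environ.pop("CF_VOICE_CONFIDENCE_THRESHOLD", None)
print(get_pref("voice.elcor_mode"))            # True — env beats default
print(get_pref("voice.confidence_threshold"))  # 0.55 — falls through to default
```

Note the bool case: `bool("false")` would be truthy, which is why booleans get a string whitelist instead of a plain constructor call.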
115
cf_voice/privacy.py
Normal file
@@ -0,0 +1,115 @@

# cf_voice/privacy.py — local acoustic privacy risk scoring
#
# MIT licensed. Never transmitted to cloud. Never logged server-side.
#
# Derives a privacy_risk level (low / moderate / high) from the combined
# acoustic fingerprint: scene + environ labels + speaker type + accent.
#
# Design rationale (#20):
#   - "outdoor_urban" + "crowd_chatter" + "traffic" → low: clearly public
#   - "indoor_quiet" + "background_voices" → moderate: conversation overheard
#   - "outdoor_nature" + "birdsong" + regional accent → moderate-high: location-identifying compound
#   - "indoor_quiet" + no background voices → low
#
# Risk gates (Linnet):
#   high:     warn before sending audio chunk to cloud STT; offer local-only fallback
#   moderate: attach privacy_flags to session state, no blocking action
#   low:      proceed normally
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Literal

PrivacyLevel = Literal["low", "moderate", "high"]


@dataclass
class PrivacyRisk:
    """
    Locally-computed privacy risk for a single audio window.

    level: aggregate risk level
    flags: ordered list of contributing signal descriptions
    """
    level: PrivacyLevel
    flags: list[str] = field(default_factory=list)


# ── Signal sets ───────────────────────────────────────────────────────────────

_PUBLIC_SCENES = {"outdoor_urban", "public_transit"}
_NATURE_SCENES = {"outdoor_nature"}
_QUIET_SCENES = {"indoor_quiet"}

_LOCATION_ENVIRON = {"birdsong", "wind", "rain", "water"}
_URBAN_ENVIRON = {"traffic", "crowd_chatter", "street_signal", "construction"}


def score_privacy_risk(
    scene: str | None,
    environ_labels: list[str],
    speaker: str | None,
    accent: str | None,
) -> PrivacyRisk:
    """
    Derive a PrivacyRisk from the current acoustic fingerprint.

    All inputs are nullable — this function handles partial signals gracefully.
    Called per audio window; results are never persisted or transmitted.

    Args:
        scene: SCENE_LABEL string or None
        environ_labels: list of ENVIRON_LABEL strings active in this window
        speaker: SPEAKER_LABEL string or None
        accent: ACCENT_LABEL string or None (None when CF_VOICE_ACCENT disabled)
    """
    flags: list[str] = []
    score = 0  # internal accumulator; maps to level at the end

    environ_set = set(environ_labels)

    # ── Clearly public environments → reduce risk ─────────────────────────────
    if scene in _PUBLIC_SCENES or environ_set & _URBAN_ENVIRON:
        flags.append("public_environment")
        score -= 1

    # ── Background voices: conversation may be overheard ─────────────────────
    if speaker == "background_voices":
        flags.append("background_voices_detected")
        score += 2

    # ── Quiet indoor: no background noise reduces identifiability ────────────
    if scene in _QUIET_SCENES and speaker not in ("background_voices", "human_multi"):
        flags.append("controlled_environment")
        # No score change — neutral

    # ── Nature sounds: alone they suggest a quiet, potentially identifiable location
    nature_match = environ_set & _LOCATION_ENVIRON
    if nature_match:
        flags.append(f"location_signal: {', '.join(sorted(nature_match))}")
        score += 1

    # ── Nature scene + nature sounds: compound location-identifying signal ────
    if scene in _NATURE_SCENES and nature_match:
        flags.append("compound_location_signal")
        score += 1

    # ── Regional accent + nature: narrows location to region + environment ────
    if accent and accent not in ("en_us", "other") and nature_match:
        flags.append(f"accent_plus_location: {accent}")
        score += 1

    # ── Quiet indoor + background voices: overheard conversation ─────────────
    if scene in _QUIET_SCENES and speaker == "background_voices":
        flags.append("overheard_conversation")
        score += 1

    # ── Map score to level ────────────────────────────────────────────────────
    if score <= 0:
        level: PrivacyLevel = "low"
    elif score <= 2:
        level = "moderate"
    else:
        level = "high"

    return PrivacyRisk(level=level, flags=flags)
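The compound scoring can be sanity-checked end to end with a trimmed standalone mirror of the rules above (the function restates the same signal sets and thresholds for illustration; it is not the shipped module):

```python
def risk_level(scene, environ, speaker, accent) -> str:
    """Mirror of score_privacy_risk's accumulator, returning only the level."""
    score = 0
    env = set(environ)
    if scene in {"outdoor_urban", "public_transit"} or env & {"traffic", "crowd_chatter"}:
        score -= 1                                   # clearly public
    if speaker == "background_voices":
        score += 2                                   # conversation overheard
    nature = env & {"birdsong", "wind", "rain", "water"}
    if nature:
        score += 1                                   # location signal
    if scene == "outdoor_nature" and nature:
        score += 1                                   # compound location signal
    if accent and accent not in ("en_us", "other") and nature:
        score += 1                                   # accent narrows region
    if scene == "indoor_quiet" and speaker == "background_voices":
        score += 1                                   # overheard conversation
    return "low" if score <= 0 else "moderate" if score <= 2 else "high"

print(risk_level("outdoor_urban", ["traffic", "crowd_chatter"], "human_single", None))  # low
print(risk_level(None, ["rain"], None, None))                   # moderate — single signal
print(risk_level("outdoor_nature", ["birdsong"], None, "en_gb"))  # high — compound
```

The additive accumulator keeps each rule independently auditable: every contributing flag maps to exactly one score adjustment.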
208
cf_voice/prosody.py
Normal file
@@ -0,0 +1,208 @@

# cf_voice/prosody.py — openSMILE eGeMAPS prosodic feature extraction
#
# MIT licensed (opensmile-python package is MIT).
#
# Extracts 88 hand-crafted acoustic features from the eGeMAPS v02 feature set:
#   F0 mean / std / percentiles (pitch)
#   Jitter / Shimmer (cycle-to-cycle variation — vocal tension)
#   Energy / loudness envelope
#   MFCCs, spectral centroid
#   Speaking rate, pause ratio
#
# Runs on CPU in a thread pool executor — no GPU required. Designed to run
# in parallel with the GPU classifiers in context._classify_real_async() via
# asyncio.gather().
#
# Enable with: CF_VOICE_PROSODY=1 (default off)
# Install:     pip install opensmile
#
# openSMILE docs: https://audeering.github.io/opensmile-python/
from __future__ import annotations

import asyncio
import logging
import os
from dataclasses import dataclass
from functools import partial

import numpy as np

logger = logging.getLogger(__name__)

_SAMPLE_RATE = 16_000

# F0 std normalisation constant: values below this threshold indicate flat prosody.
# Derived from eGeMAPS feature "F0semitoneFrom27.5Hz_sma3nz_stddevNorm".
# A typical conversational F0 std is ~0.3-0.5 semitones. Values under 0.2 are flat.
_F0_STD_NORM_FEATURE = "F0semitoneFrom27.5Hz_sma3nz_stddevNorm"
_F0_MEAN_FEATURE = "F0semitoneFrom27.5Hz_sma3nz_amean"
_LOUDNESS_FEATURE = "loudness_sma3_amean"
_JITTER_FEATURE = "jitterLocal_sma3nz_amean"
_SHIMMER_FEATURE = "shimmerLocaldB_sma3nz_amean"
_SPEECH_RATE_FEATURE = "VoicedSegmentsPerSec"


@dataclass
class ProsodicSignal:
    """
    Summary prosodic features for a single audio window.

    These are derived from the openSMILE eGeMAPS v02 feature set.
    All values are raw feature magnitudes unless noted otherwise.

    f0_mean:        Mean F0 in semitones from 27.5Hz reference
    f0_std:         Normalised F0 standard deviation (flatness indicator)
    jitter:         Cycle-to-cycle pitch variation (vocal tension)
    shimmer:        Cycle-to-cycle amplitude variation (vocal stress)
    loudness:       Mean loudness (energy proxy)
    sarcasm_risk:   0-1 heuristic score combining flat F0, calm-positive
                    audio (from DimensionalResult if available), and optional
                    text-audio divergence (linnet#22 signal, not yet wired).
    flat_f0_score:  Normalised flatness: 1.0 = maximally flat, 0.0 = varied.
    """
    f0_mean: float
    f0_std: float
    jitter: float
    shimmer: float
    loudness: float
    flat_f0_score: float
    sarcasm_risk: float


def _compute_sarcasm_risk(
    flat_f0: float,
    calm_positive: float = 0.0,
    text_divergence: float = 0.0,
) -> float:
    """
    Heuristic sarcasm indicator. Not a trained model — a signal to combine
    with text divergence (linnet#22) for the final confidence score.

    flat_f0:         Normalised F0 flatness (1.0 = flat, 0.0 = varied).
    calm_positive:   DimensionalResult.calm_positive_score() when available.
    text_divergence: abs(transcript_sentiment - audio_valence) from linnet#22.
                     Pass 0.0 until the parallel text classifier is wired.

    Weights: flat_f0 (40%), calm_positive (30%), text_divergence (30%).
    """
    return min(1.0, flat_f0 * 0.4 + calm_positive * 0.3 + text_divergence * 0.3)


class ProsodicExtractor:
    """
    openSMILE eGeMAPS feature extractor for a single audio window.

    CPU-bound inference — uses thread pool executor to avoid blocking asyncio.
    Lazy-loads opensmile on first call so import cost is deferred.

    Usage
    -----
        extractor = ProsodicExtractor()
        signal = await extractor.extract_async(audio_float32)
        print(signal.flat_f0_score, signal.sarcasm_risk)
    """

    def __init__(self) -> None:
        self._smile = None

    def _ensure_loaded(self) -> None:
        """Lazy-load opensmile on first extract call."""
        if self._smile is not None:
            return

        try:
            import opensmile
        except ImportError as exc:
            raise ImportError(
                "opensmile is required for prosodic feature extraction. "
                "Install with: pip install opensmile"
            ) from exc

        self._smile = opensmile.Smile(
            feature_set=opensmile.FeatureSet.eGeMAPSv02,
            feature_level=opensmile.FeatureLevel.Functionals,
        )
        logger.info("openSMILE eGeMAPS loaded")

    def _extract_sync(
        self,
        audio_float32: np.ndarray,
        calm_positive: float = 0.0,
        text_divergence: float = 0.0,
    ) -> ProsodicSignal:
        """
        Synchronous feature extraction. Always call via extract_async.

        Returns a ProsodicSignal with eGeMAPS features and a sarcasm risk score.
        If opensmile raises (e.g. audio too short, no voiced frames), returns a
        zero-filled ProsodicSignal so the caller does not need to handle exceptions.
        """
        self._ensure_loaded()

        try:
            feats = self._smile.process_signal(audio_float32, _SAMPLE_RATE)
            row = feats.iloc[0]

            f0_mean = float(row.get(_F0_MEAN_FEATURE, 0.0))
            f0_std = float(row.get(_F0_STD_NORM_FEATURE, 0.0))
            jitter = float(row.get(_JITTER_FEATURE, 0.0))
            shimmer = float(row.get(_SHIMMER_FEATURE, 0.0))
            loudness = float(row.get(_LOUDNESS_FEATURE, 0.0))

        except Exception as exc:
            logger.debug("openSMILE extraction failed (likely silent window): %s", exc)
            return ProsodicSignal(
                f0_mean=0.0, f0_std=0.0, jitter=0.0,
                shimmer=0.0, loudness=0.0, flat_f0_score=0.0, sarcasm_risk=0.0,
            )

        # Normalise F0 variance to a flatness score.
        #   f0_std of 0.4 semitones = neutral baseline → flat_f0 = 0.0
        #   f0_std of 0.0 = maximally flat → flat_f0 = 1.0
        flat_f0 = 1.0 - min(f0_std / 0.4, 1.0)

        sarcasm = _compute_sarcasm_risk(
            flat_f0=flat_f0,
            calm_positive=calm_positive,
            text_divergence=text_divergence,
        )

        return ProsodicSignal(
            f0_mean=round(f0_mean, 4),
            f0_std=round(f0_std, 4),
            jitter=round(jitter, 6),
            shimmer=round(shimmer, 6),
            loudness=round(loudness, 4),
            flat_f0_score=round(flat_f0, 4),
            sarcasm_risk=round(sarcasm, 4),
        )

    async def extract_async(
        self,
        audio_float32: np.ndarray,
        calm_positive: float = 0.0,
        text_divergence: float = 0.0,
    ) -> ProsodicSignal:
        """
        Extract prosodic features without blocking the event loop.

        calm_positive:   Pass DimensionalResult.calm_positive_score() when
                         dimensional classification has already run.
        text_divergence: Pass abs(transcript_sentiment - valence) when the
                         parallel text classifier (linnet#22) is wired.
        """
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(
            None,
            partial(self._extract_sync, audio_float32, calm_positive, text_divergence),
        )

    @classmethod
    def from_env(cls) -> "ProsodicExtractor":
        """Construct from environment. Raises if CF_VOICE_PROSODY is not set."""
        if os.environ.get("CF_VOICE_PROSODY", "0") != "1":
            raise EnvironmentError(
                "CF_VOICE_PROSODY=1 is required to enable openSMILE eGeMAPS extraction. "
                "Add it to your .env and install opensmile: pip install opensmile"
            )
        return cls()
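The two derived scores above are simple closed-form maps, so they can be sanity-checked standalone. This restates the same formulas as `flat_f0` and `_compute_sarcasm_risk` (illustration only, not the shipped module):

```python
def flat_f0_score(f0_std: float) -> float:
    # 0.4 semitones of F0 std = neutral baseline → 0.0; fully flat → 1.0
    return 1.0 - min(f0_std / 0.4, 1.0)

def sarcasm_risk(flat_f0: float, calm_positive: float = 0.0,
                 text_divergence: float = 0.0) -> float:
    # Fixed 40/30/30 weighting, capped at 1.0
    return min(1.0, flat_f0 * 0.4 + calm_positive * 0.3 + text_divergence * 0.3)

print(flat_f0_score(0.0))           # 1.0 — maximally flat pitch
print(flat_f0_score(0.4))           # 0.0 — neutral variance
print(sarcasm_risk(1.0))            # 0.4 — flat pitch alone is a weak prior
print(sarcasm_risk(1.0, 1.0, 1.0))  # 1.0 — all three signals agree
```

The 0.4 cap on the audio-only term is the point of the heuristic: no amount of flat pitch alone can push sarcasm risk past a weak prior without agreement from the dimensional and text signals.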
@@ -46,6 +46,17 @@ class WhisperSTT:

        print(result.text)
    """

    # Known single-token hallucinations that Whisper emits on music/noise with
    # low no_speech_prob (i.e. Whisper thinks it heard speech). These are too
    # short to be real utterances in any supported language context.
    _HALLUCINATION_TOKENS: frozenset[str] = frozenset({
        "ty", "t y", "bye", "hmm", "mm", "mhm", "uh", "um",
    })

    # Suppress a transcript if it repeats unchanged across this many consecutive
    # windows — indicates Whisper is locked into a hallucination loop.
    _MAX_REPEATS = 2

    def __init__(
        self,
        model_name: str = "small",

@@ -77,6 +88,8 @@ class WhisperSTT:

        self._device = device
        self._model_name = model_name
        self._session_prompt: str = ""
        self._last_text: str = ""
        self._repeat_count: int = 0

    @classmethod
    def from_env(cls) -> "WhisperSTT":

@@ -91,7 +104,14 @@ class WhisperSTT:

        """Estimated VRAM usage in MB for this model/compute_type combination."""
        return _VRAM_ESTIMATES_MB.get(self._model_name, 1500)

    def _transcribe_sync(self, audio_float32: np.ndarray) -> STTResult:
    # Segments above this no_speech_prob are hallucinations (silence/music/noise).
    # faster-whisper sets this per-segment; 0.6 catches the "thank you" / "thanks
    # for watching" family without cutting off genuine low-energy speech.
    _NO_SPEECH_THRESHOLD = 0.6

    def _transcribe_sync(
        self, audio_float32: np.ndarray, language: str | None = None
    ) -> STTResult:
        """Synchronous transcription — always call via transcribe_chunk_async."""
        duration = len(audio_float32) / 16_000.0

@@ -100,22 +120,49 @@ class WhisperSTT:

                text="", language="en", duration_s=duration, is_final=False
            )

        # Energy gate: skip Whisper entirely on silent/near-silent audio.
        # In the sidecar path there is no upstream MicVoiceIO silence gate,
        # so we must check here. RMS < 0.005 is inaudible; Whisper will
        # hallucinate "thank you" or "thanks for watching" on silence.
        rms = float(np.sqrt(np.mean(audio_float32 ** 2)))
        if rms < 0.005:
            return STTResult(text="", language="en", duration_s=duration, is_final=False)

        segments, info = self._model.transcribe(
            audio_float32,
            language=None,
            initial_prompt=self._session_prompt or None,
            vad_filter=False,  # silence gating happens upstream in MicVoiceIO
            language=language or None,  # None = Whisper auto-detect
            initial_prompt=None,  # No session prompt — on 1s windows it causes
                                  # phrase lock-in (model anchors on prior text
                                  # rather than fresh audio). Reset via reset_session()
                                  # at conversation boundaries instead.
            vad_filter=True,  # Silero VAD — skips non-speech frames
            word_timestamps=False,
            beam_size=3,
            temperature=0.0,
        )

        text = " ".join(s.text.strip() for s in segments).strip()
        # Filter hallucinated segments: discard any segment where Whisper itself
        # says there is likely no speech (no_speech_prob > threshold). This is
        # the correct defense against "thank you" / music hallucinations — VAD
        # alone is insufficient because music harmonics look speech-like to Silero.
        text = " ".join(
            s.text.strip()
            for s in segments
            if s.no_speech_prob <= self._NO_SPEECH_THRESHOLD
        ).strip()

        # Rolling context: keep last ~50 words so the next chunk has prior text
        if text:
            words = (self._session_prompt + " " + text).split()
            self._session_prompt = " ".join(words[-50:])
        # Gate 1: single-token hallucinations that slip past no_speech_prob.
        if text.lower().rstrip(".,!?") in self._HALLUCINATION_TOKENS:
            text = ""

        # Gate 2: repetition lock — same non-empty text N windows in a row.
        if text and text == self._last_text:
            self._repeat_count += 1
            if self._repeat_count >= self._MAX_REPEATS:
                text = ""
        else:
            self._last_text = text
            self._repeat_count = 0

        return STTResult(
            text=text,

@@ -124,19 +171,29 @@ class WhisperSTT:

            is_final=duration >= 1.0 and info.language_probability > 0.5,
        )

    async def transcribe_chunk_async(self, pcm_int16: bytes) -> STTResult:
    async def transcribe_chunk_async(
        self, pcm_int16: bytes, language: str | None = None
    ) -> STTResult:
        """
        Transcribe a raw PCM Int16 chunk, non-blocking.

        pcm_int16 should be 16kHz mono bytes. Typical input is 20 × 100ms
        chunks accumulated by MicVoiceIO (2-second window = 64000 bytes).

        language: BCP-47 hint (e.g. "en", "es"). None = Whisper auto-detects,
                  which is slower and more hallucination-prone on short clips.
        """
        from functools import partial
        audio = (
            np.frombuffer(pcm_int16, dtype=np.int16).astype(np.float32) / 32768.0
        )
        loop = asyncio.get_event_loop()
        return await loop.run_in_executor(None, self._transcribe_sync, audio)
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(
            None, partial(self._transcribe_sync, audio, language)
        )

    def reset_session(self) -> None:
        """Clear the rolling prompt. Call at the start of each new conversation."""
        """Clear rolling state. Call at the start of each new conversation."""
        self._session_prompt = ""
        self._last_text = ""
        self._repeat_count = 0
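The RMS energy gate is the cheapest of the hallucination defenses, so it runs first. A standalone sketch with a synthetic signal (the 0.005 threshold comes from the diff; the test tone is illustrative):

```python
import numpy as np

def passes_energy_gate(audio: np.ndarray, threshold: float = 0.005) -> bool:
    """True when the window is loud enough to be worth transcribing."""
    rms = float(np.sqrt(np.mean(audio ** 2)))
    return rms >= threshold

silence = np.zeros(16_000, dtype=np.float32)  # 1s of digital silence at 16kHz
tone = (0.1 * np.sin(2 * np.pi * 220 * np.arange(16_000) / 16_000)).astype(np.float32)

print(passes_energy_gate(silence))  # False — Whisper never runs on this window
print(passes_energy_gate(tone))     # True  — RMS ≈ 0.07, well above the gate
```

Skipping silent windows before any model call both avoids the "thank you" hallucination family and saves a full Whisper forward pass per silent second.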
500
cf_voice/telephony.py
Normal file
|
|
@ -0,0 +1,500 @@
|
|||
# cf_voice/telephony.py — outbound telephony abstraction
|
||||
#
|
||||
# Protocol + mock backend: MIT licensed.
|
||||
# SignalWireBackend, FreeSWITCHBackend: BSL 1.1 (real telephony, cloud credentials).
|
||||
#
|
||||
# Consumers (Osprey, Harrier, Ibis, Kestrel) depend only on TelephonyBackend
|
||||
# and CallSession — both MIT. The concrete backends are selected by make_telephony()
|
||||
# based on the tier and available credentials.
|
||||
#
|
||||
# Requires optional extras for real backends:
|
||||
# pip install cf-voice[signalwire] — SignalWire (paid tier, CF-provisioned)
|
||||
# pip install cf-voice[freeswitch] — FreeSWITCH ESL (free tier, self-hosted)
|
||||
from __future__ import annotations
|
||||
|
||||
import asyncio
|
||||
import logging
|
||||
import os
|
||||
from dataclasses import dataclass, field
|
||||
from typing import Literal, Protocol, runtime_checkable
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
CallState = Literal[
|
||||
"dialing",
|
||||
"ringing",
|
||||
"in_progress",
|
||||
"hold",
|
||||
"bridged",
|
||||
"completed",
|
||||
"failed",
|
||||
"no_answer",
|
||||
"busy",
|
||||
]
|
||||
|
||||
|
||||
@dataclass
|
||||
class CallSession:
|
||||
"""
|
||||
Represents an active or completed outbound call.
|
||||
|
||||
call_sid is the backend-assigned identifier — for SignalWire this is a
|
||||
Twilio-compatible SID string; for FreeSWITCH it is the UUID.
|
||||
|
||||
state is updated by the backend as the call progresses. Consumers should
|
||||
poll via backend.get_state() or subscribe to webhook events.
|
||||
"""
|
||||
call_sid: str
|
||||
to: str
|
||||
from_: str
|
||||
state: CallState = "dialing"
|
||||
duration_s: float = 0.0
|
||||
# AMD result: "human" | "machine" | "unknown"
|
||||
    # Populated once the backend resolves answering machine detection.
    amd_result: str = "unknown"
    error: str | None = None


@runtime_checkable
class TelephonyBackend(Protocol):
    """
    Abstract telephony backend interface.

    All methods are async. Implementations must be safe to call from an
    asyncio event loop. Long-running network operations run in a thread pool
    (not the caller's responsibility).

    Field names are stable as of cf-voice v0.1.0.
    """

    async def dial(
        self,
        to: str,
        from_: str,
        webhook_url: str,
        *,
        amd: bool = False,
    ) -> CallSession:
        """
        Initiate an outbound call.

        to / from_     E.164 numbers ("+15551234567").
        webhook_url    URL the backend will POST call events to (SignalWire/TwiML style).
        amd            If True, request answering machine detection. Result lands in
                       CallSession.amd_result once the backend resolves it.

        Returns a CallSession with state="dialing".
        """
        ...

    async def send_dtmf(self, call_sid: str, digits: str) -> None:
        """
        Send DTMF (dual-tone multi-frequency) tones mid-call.

        digits    String of 0-9, *, #, A-D. Each character is one tone.
                  Pauses may be represented as 'w' (0.5s) or 'W' (1s) if the backend
                  supports them.
        """
        ...

    async def bridge(self, call_sid: str, target: str) -> None:
        """
        Bridge the active call to a second E.164 number or SIP URI.

        Used to connect the user directly to a human agent after Osprey has
        navigated the IVR. The original call leg remains connected.
        """
        ...

    async def hangup(self, call_sid: str) -> None:
        """Terminate the call. Idempotent — safe to call on already-ended calls."""
        ...

    async def announce(
        self,
        call_sid: str,
        text: str,
        voice: str = "default",
    ) -> None:
        """
        Play synthesised speech into the call.

        Implements the adaptive service identification requirement (osprey#21):
        Osprey announces its identity before navigating an IVR so that the
        other party can consent to automated interaction.

        voice    Backend-specific voice identifier. "default" uses the backend's
                 default TTS voice.
        """
        ...

    async def get_state(self, call_sid: str) -> CallState:
        """Fetch the current state of a call from the backend."""
        ...

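As a side note on the `@runtime_checkable` decorator used on the Protocol above: it enables structural `isinstance()` checks that verify method *presence* only, not signatures. A minimal sketch with hypothetical local names (`Backend`, `Stub`, `Empty` are illustrative stand-ins, not cf_voice identifiers):

```python
# Sketch of the @runtime_checkable Protocol mechanism (local stand-in names).
from typing import Protocol, runtime_checkable


@runtime_checkable
class Backend(Protocol):
    async def hangup(self, call_sid: str) -> None: ...


class Stub:
    async def hangup(self, call_sid: str) -> None:
        pass


class Empty:
    pass


# isinstance() checks only that the method names exist — it does not
# validate signatures or that the method is actually async.
print(isinstance(Stub(), Backend))   # True
print(isinstance(Empty(), Backend))  # False
```

This is why the test suite's `test_isinstance_protocol` check passes for the mock backend without inheritance.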
# ── Mock backend (MIT) ────────────────────────────────────────────────────────


class MockTelephonyBackend:
    """
    Synthetic telephony backend for development and CI.

    No real calls are placed. Operations log to cf_voice.telephony and update
    in-memory CallSession objects. AMD resolves to "human" after a simulated
    delay.

    Usage:
        backend = MockTelephonyBackend()
        session = await backend.dial("+15551234567", "+18005550000", "https://...")
        await backend.send_dtmf(session.call_sid, "1")
        await backend.hangup(session.call_sid)
    """

    def __init__(self, amd_delay_s: float = 0.5) -> None:
        self._sessions: dict[str, CallSession] = {}
        self._amd_delay_s = amd_delay_s
        self._call_counter = 0

    def _next_sid(self) -> str:
        self._call_counter += 1
        return f"mock_sid_{self._call_counter:04d}"

    async def dial(
        self,
        to: str,
        from_: str,
        webhook_url: str,
        *,
        amd: bool = False,
    ) -> CallSession:
        sid = self._next_sid()
        session = CallSession(call_sid=sid, to=to, from_=from_, state="ringing")
        self._sessions[sid] = session
        logger.debug("MockTelephony: dial %s → %s (sid=%s)", from_, to, sid)

        async def _progress() -> None:
            await asyncio.sleep(0.05)
            session.state = "in_progress"
            if amd:
                await asyncio.sleep(self._amd_delay_s)
                session.amd_result = "human"
                logger.debug("MockTelephony: AMD resolved human (sid=%s)", sid)

        asyncio.create_task(_progress())
        return session

    async def send_dtmf(self, call_sid: str, digits: str) -> None:
        self._sessions[call_sid]  # KeyError if unknown — intentional
        logger.debug("MockTelephony: DTMF %r (sid=%s)", digits, call_sid)

    async def bridge(self, call_sid: str, target: str) -> None:
        session = self._sessions[call_sid]
        session.state = "bridged"
        logger.debug("MockTelephony: bridge → %s (sid=%s)", target, call_sid)

    async def hangup(self, call_sid: str) -> None:
        session = self._sessions.get(call_sid)
        if session:
            session.state = "completed"
        logger.debug("MockTelephony: hangup (sid=%s)", call_sid)

    async def announce(
        self,
        call_sid: str,
        text: str,
        voice: str = "default",
    ) -> None:
        self._sessions[call_sid]  # KeyError if unknown — intentional
        logger.debug(
            "MockTelephony: announce voice=%s text=%r (sid=%s)", voice, text, call_sid
        )

    async def get_state(self, call_sid: str) -> CallState:
        return self._sessions[call_sid].state

# ── SignalWire backend (BSL 1.1) ──────────────────────────────────────────────


class SignalWireBackend:
    """
    SignalWire outbound telephony (Twilio-compatible REST API).

    BSL 1.1 — requires paid tier or self-hosted CF SignalWire project.

    Credentials sourced from environment:
      CF_SW_PROJECT_ID — SignalWire project ID
      CF_SW_AUTH_TOKEN — SignalWire auth token
      CF_SW_SPACE_URL  — space URL, e.g. "yourspace.signalwire.com"

    Requires: pip install cf-voice[signalwire]
    """

    def __init__(
        self,
        project_id: str | None = None,
        auth_token: str | None = None,
        space_url: str | None = None,
    ) -> None:
        try:
            from signalwire.rest import Client as SWClient  # type: ignore[import]
        except ImportError as exc:
            raise ImportError(
                "SignalWire SDK is required for SignalWireBackend. "
                "Install with: pip install cf-voice[signalwire]"
            ) from exc

        self._project_id = project_id or os.environ["CF_SW_PROJECT_ID"]
        self._auth_token = auth_token or os.environ["CF_SW_AUTH_TOKEN"]
        self._space_url = space_url or os.environ["CF_SW_SPACE_URL"]
        self._client = SWClient(
            self._project_id,
            self._auth_token,
            signalwire_space_url=self._space_url,
        )

    async def dial(
        self,
        to: str,
        from_: str,
        webhook_url: str,
        *,
        amd: bool = False,
    ) -> CallSession:
        call_kwargs: dict = dict(
            to=to,
            from_=from_,
            url=webhook_url,
            status_callback=webhook_url,
        )
        if amd:
            call_kwargs["machine_detection"] = "Enable"
            call_kwargs["async_amd"] = True

        call = await asyncio.get_event_loop().run_in_executor(
            None,
            lambda: self._client.calls.create(**call_kwargs),
        )
        return CallSession(
            call_sid=call.sid,
            to=to,
            from_=from_,
            state="dialing",
        )

    async def send_dtmf(self, call_sid: str, digits: str) -> None:
        await asyncio.get_event_loop().run_in_executor(
            None,
            lambda: self._client.calls(call_sid).update(
                twiml=f"<Response><Play digits='{digits}'/></Response>"
            ),
        )

    async def bridge(self, call_sid: str, target: str) -> None:
        await asyncio.get_event_loop().run_in_executor(
            None,
            lambda: self._client.calls(call_sid).update(
                twiml=(
                    f"<Response><Dial><Number>{target}</Number></Dial></Response>"
                )
            ),
        )

    async def hangup(self, call_sid: str) -> None:
        await asyncio.get_event_loop().run_in_executor(
            None,
            lambda: self._client.calls(call_sid).update(status="completed"),
        )

    async def announce(
        self,
        call_sid: str,
        text: str,
        voice: str = "alice",
    ) -> None:
        await asyncio.get_event_loop().run_in_executor(
            None,
            lambda: self._client.calls(call_sid).update(
                twiml=f"<Response><Say voice='{voice}'>{text}</Say></Response>"
            ),
        )

    async def get_state(self, call_sid: str) -> CallState:
        call = await asyncio.get_event_loop().run_in_executor(
            None,
            lambda: self._client.calls(call_sid).fetch(),
        )
        _sw_map: dict[str, CallState] = {
            "queued": "dialing", "ringing": "ringing", "in-progress": "in_progress",
            "completed": "completed", "failed": "failed", "busy": "busy",
            "no-answer": "no_answer",
        }
        return _sw_map.get(call.status, "failed")

# ── FreeSWITCH backend (BSL 1.1) ─────────────────────────────────────────────


class FreeSWITCHBackend:
    """
    Self-hosted FreeSWITCH outbound telephony via ESL (event socket layer).

    BSL 1.1 — requires free tier + user-provisioned FreeSWITCH + VoIP.ms SIP trunk.

    Credentials sourced from environment:
      CF_ESL_HOST     — FreeSWITCH ESL host (default: 127.0.0.1)
      CF_ESL_PORT     — FreeSWITCH ESL port (default: 8021)
      CF_ESL_PASSWORD — FreeSWITCH ESL password

    Requires: pip install cf-voice[freeswitch]

    Note: FreeSWITCH AMD (mod_vad + custom heuristic or Whisper pipe) is not
    yet implemented. The amd parameter is accepted but amd_result stays "unknown".
    """

    def __init__(
        self,
        host: str | None = None,
        port: int | None = None,
        password: str | None = None,
    ) -> None:
        try:
            import ESL  # type: ignore[import]
        except ImportError as exc:
            raise ImportError(
                "FreeSWITCH ESL bindings are required for FreeSWITCHBackend. "
                "Install with: pip install cf-voice[freeswitch]"
            ) from exc

        self._host = host or os.environ.get("CF_ESL_HOST", "127.0.0.1")
        self._port = int(port or os.environ.get("CF_ESL_PORT", 8021))
        self._password = password or os.environ["CF_ESL_PASSWORD"]
        self._esl = ESL

    def _connect(self):
        conn = self._esl.ESLconnection(self._host, str(self._port), self._password)
        if not conn.connected():
            raise RuntimeError(
                f"Could not connect to FreeSWITCH ESL at {self._host}:{self._port}"
            )
        return conn

    async def dial(
        self,
        to: str,
        from_: str,
        webhook_url: str,
        *,
        amd: bool = False,
    ) -> CallSession:
        def _originate() -> str:
            conn = self._connect()
            # ESL originate: sofia/gateway/voipms/{to} &park()
            # conn.api() already sends the "originate" verb, so the args
            # string must not repeat it.
            args = (
                f"{{origination_caller_id_number={from_},"
                f"origination_caller_id_name=CircuitForge}}"
                f"sofia/gateway/voipms/{to.lstrip('+')} &park()"
            )
            result = conn.api("originate", args)
            return result.getBody().strip()

        body = await asyncio.get_event_loop().run_in_executor(None, _originate)
        # FreeSWITCH returns "+OK <uuid>" on success
        if not body.startswith("+OK"):
            raise RuntimeError(f"FreeSWITCH originate failed: {body}")
        uuid = body.removeprefix("+OK").strip()
        return CallSession(call_sid=uuid, to=to, from_=from_, state="dialing")

    async def send_dtmf(self, call_sid: str, digits: str) -> None:
        def _dtmf() -> None:
            conn = self._connect()
            conn.api("uuid_send_dtmf", f"{call_sid} {digits}")

        await asyncio.get_event_loop().run_in_executor(None, _dtmf)

    async def bridge(self, call_sid: str, target: str) -> None:
        def _bridge() -> None:
            conn = self._connect()
            conn.api(
                "uuid_bridge",
                f"{call_sid} sofia/gateway/voipms/{target.lstrip('+')}",
            )

        await asyncio.get_event_loop().run_in_executor(None, _bridge)

    async def hangup(self, call_sid: str) -> None:
        def _hangup() -> None:
            conn = self._connect()
            conn.api("uuid_kill", call_sid)

        await asyncio.get_event_loop().run_in_executor(None, _hangup)

    async def announce(
        self,
        call_sid: str,
        text: str,
        voice: str = "default",
    ) -> None:
        # FreeSWITCH TTS via mod_tts_commandline or Piper pipe
        def _say() -> None:
            conn = self._connect()
            conn.api("uuid_broadcast", f"{call_sid} say::en CHAT SPOKEN {text}")

        await asyncio.get_event_loop().run_in_executor(None, _say)

    async def get_state(self, call_sid: str) -> CallState:
        def _fetch() -> str:
            conn = self._connect()
            return conn.api("uuid_getvar", f"{call_sid} call_state").getBody().strip()

        raw = await asyncio.get_event_loop().run_in_executor(None, _fetch)
        _fs_map: dict[str, CallState] = {
            "CS_INIT": "dialing", "CS_ROUTING": "ringing",
            "CS_EXECUTE": "in_progress", "CS_HANGUP": "completed",
            "CS_DESTROY": "completed",
        }
        return _fs_map.get(raw, "failed")

# ── Factory ───────────────────────────────────────────────────────────────────


def make_telephony(
    mock: bool | None = None,
    backend: str | None = None,
) -> MockTelephonyBackend | SignalWireBackend | FreeSWITCHBackend:
    """
    Factory: return a TelephonyBackend appropriate for the current environment.

    Resolution order:
      1. mock=True or CF_VOICE_MOCK=1                      → MockTelephonyBackend
      2. backend="signalwire" or CF_SW_PROJECT_ID present  → SignalWireBackend
      3. backend="freeswitch" or CF_ESL_PASSWORD present   → FreeSWITCHBackend
      4. Raises RuntimeError — no usable backend configured

    In production, backend selection is driven by the tier system:
      Free tier → FreeSWITCHBackend (BYOK VoIP)
      Paid tier → SignalWireBackend (CF-provisioned)
    """
    use_mock = mock if mock is not None else os.environ.get("CF_VOICE_MOCK", "") == "1"
    if use_mock:
        return MockTelephonyBackend()

    resolved_backend = backend or (
        "signalwire" if os.environ.get("CF_SW_PROJECT_ID") else
        "freeswitch" if os.environ.get("CF_ESL_PASSWORD") else
        None
    )

    if resolved_backend == "signalwire":
        return SignalWireBackend()

    if resolved_backend == "freeswitch":
        return FreeSWITCHBackend()

    raise RuntimeError(
        "No telephony backend configured. "
        "Set CF_VOICE_MOCK=1 for mock mode, or provide SignalWire / FreeSWITCH credentials."
    )
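The mock backend's dial → background AMD → hangup lifecycle can be exercised in a few lines. The sketch below is self-contained (it defines a local `CallSession` stand-in and a `MockBackend` that mirrors the semantics described above) so it runs without cf-voice installed; with the package present, `make_telephony(mock=True)` replaces the stub:

```python
# Hedged sketch of the MockTelephonyBackend contract: dial returns
# immediately in state "ringing", then a background task progresses the
# call and resolves AMD. Names here are local stand-ins, not cf_voice's.
import asyncio
from dataclasses import dataclass


@dataclass
class CallSession:  # minimal stand-in for cf_voice.telephony.CallSession
    call_sid: str
    to: str
    from_: str
    state: str = "dialing"
    amd_result: str = "unknown"


class MockBackend:
    def __init__(self, amd_delay_s: float = 0.01) -> None:
        self._sessions: dict[str, CallSession] = {}
        self._amd_delay_s = amd_delay_s
        self._n = 0

    async def dial(self, to: str, from_: str, webhook_url: str, *,
                   amd: bool = False) -> CallSession:
        self._n += 1
        session = CallSession(f"mock_sid_{self._n:04d}", to, from_, state="ringing")
        self._sessions[session.call_sid] = session

        async def _progress() -> None:
            await asyncio.sleep(0.005)
            session.state = "in_progress"
            if amd:
                await asyncio.sleep(self._amd_delay_s)
                session.amd_result = "human"

        asyncio.create_task(_progress())  # runs in the background
        return session

    async def hangup(self, call_sid: str) -> None:
        s = self._sessions.get(call_sid)
        if s:
            s.state = "completed"


async def main() -> None:
    backend = MockBackend()
    session = await backend.dial("+15551234567", "+18005550000",
                                 "https://example.invalid/hook", amd=True)
    assert session.state == "ringing"     # dial returns before progress
    await asyncio.sleep(0.05)             # let the simulated call advance
    assert session.amd_result == "human"  # AMD resolved in the background
    await backend.hangup(session.call_sid)
    assert session.state == "completed"


asyncio.run(main())
print("mock telephony lifecycle ok")
```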
288	cf_voice/trajectory.py	Normal file

@@ -0,0 +1,288 @@
# cf_voice/trajectory.py — affect trajectory and SER/VAD coherence signals
#
# MIT licensed — derived computation only, no inference models.
#
# Two signal families:
#
# 1. TrajectorySignal — rolling arousal/valence trend across the last N windows.
#    Detects escalation, de-escalation, suppression, worsening, improving.
#
# 2. CoherenceSignal — cross-model comparison between SER (categorical affect)
#    and VAD (continuous dimensional valence). Disagreement indicates affect
#    suppression, controlled presentation, or surface-only semantic reframe.
#
# Both signals activate only after BASELINE_MIN windows per speaker are buffered.
# All thresholds are relative to the per-speaker rolling mean, not absolute —
# this is required for ND/neurodivergent speaker safety (see design doc).
#
# Safety note: these signals must never be labelled "deception" in any
# user-facing context. Use: "affect divergence", "controlled presentation",
# "framing shift". The user interprets; the system observes.
from __future__ import annotations

from collections import deque
from dataclasses import dataclass

from cf_voice.dimensional import DimensionalResult

# Rolling window depth per speaker
BUFFER_WINDOW = 5

# Minimum frames before signals activate (relative baseline requirement)
BASELINE_MIN = 3

# Minimum arousal/valence delta per window to count as directional movement
_DELTA_THRESHOLD = 0.05

# Arousal threshold above which "neutral SER + high arousal" = suppression candidate
_SUPPRESSION_AROUSAL_MIN = 0.65

# SER affects that imply low arousal presentation (used for suppression detection)
_LOW_PRESENTATION_AFFECTS = frozenset({"neutral", "scripted", "tired", "apologetic"})

# Expected valence ranges derived from MSP-Podcast emotion distribution.
# Used to determine whether SER affect label and dimensional valence agree.
_AFFECT_VALENCE_PRIOR: dict[str, tuple[float, float]] = {
    "warm": (0.60, 1.00),
    "genuine": (0.55, 1.00),
    "optimistic": (0.55, 0.90),
    "neutral": (0.35, 0.65),
    "confused": (0.30, 0.60),
    "scripted": (0.30, 0.65),
    "apologetic": (0.20, 0.55),
    "tired": (0.10, 0.50),
    "frustrated": (0.10, 0.45),
    "dismissive": (0.15, 0.50),
    "condescending": (0.10, 0.45),
    "urgent": (0.15, 0.55),
}

# Ordinal positivity for reframe direction detection.
# Higher = more positive presentation.
_AFFECT_POSITIVITY: dict[str, int] = {
    "urgent": 1,
    "frustrated": 1,
    "condescending": 1,
    "dismissive": 2,
    "tired": 2,
    "apologetic": 3,
    "confused": 3,
    "scripted": 4,
    "neutral": 4,
    "optimistic": 5,
    "genuine": 5,
    "warm": 6,
}

@dataclass
class TrajectorySignal:
    """
    Rolling trend across recent dimensional frames for one speaker.

    All delta values: current_frame_value - mean(buffer_values).
    Positive arousal_delta = current frame is more activated than baseline.
    Negative valence_delta = current frame is more negative than baseline.

    trend values:
      "calibrating"    not enough frames yet (< BASELINE_MIN)
      "stable"         no significant directional movement
      "escalating"     arousal rising: current > mean by DELTA_THRESHOLD, consecutive
      "de-escalating"  arousal falling after elevated period
      "worsening"      valence falling: current < mean, consecutive
      "improving"      valence rising after depressed period
      "suppressed"     SER affect is calm/neutral, VAD arousal is elevated
    """
    arousal_delta: float
    valence_delta: float
    dominance_delta: float
    arousal_trend: str  # "rising" | "falling" | "flat"
    valence_trend: str  # "rising" | "falling" | "flat"
    trend: str
    frames_in_buffer: int
    baseline_established: bool


@dataclass
class CoherenceSignal:
    """
    Cross-signal comparison: SER categorical affect vs. VAD dimensional valence.

    coherence_score:
      1.0 = SER label and VAD valence are fully consistent.
      0.0 = maximum disagreement.

    suppression_flag:
      True when the speaker is presenting as calm/neutral (SER) but VAD arousal
      is elevated. Indicates controlled presentation with activation underneath.
      This is relative to a per-session threshold — not a universal claim.

    reframe_type:
      "none"     no SER category shift this window
      "genuine"  SER shifted toward more positive AND dimensional valence also
                 improved (>= DELTA_THRESHOLD in this window)
      "surface"  SER shifted toward more positive BUT dimensional valence
                 continued its prior trajectory unchanged or worsening

    affect_divergence:
      Signed: VAD-implied valence minus SER-implied valence midpoint.
      Negative = VAD more negative than SER label implies (masking candidate).
      Positive = VAD more positive than SER label implies (unusual).
    """
    coherence_score: float
    suppression_flag: bool
    reframe_type: str  # "none" | "genuine" | "surface"
    affect_divergence: float

# ── Public helpers ─────────────────────────────────────────────────────────────


def affect_coherence(affect: str, valence: float) -> float:
    """
    Compute coherence between a SER affect category and a VAD valence score.

    Returns 1.0 when valence falls inside the expected range for the affect.
    Returns 0.0 when the gap between valence and the nearest range boundary
    exceeds 0.40 (the full range of a typical incoherence gap).
    """
    lo, hi = _AFFECT_VALENCE_PRIOR.get(affect, (0.30, 0.70))
    if lo <= valence <= hi:
        return 1.0
    gap = min(abs(valence - lo), abs(valence - hi))
    return round(max(0.0, 1.0 - (gap / 0.40)), 3)


def affect_divergence_score(affect: str, valence: float) -> float:
    """
    Signed divergence: actual VAD valence minus the midpoint of the expected range.

    Negative = VAD more negative than SER label implies.
    Positive = VAD more positive than SER label implies.
    """
    lo, hi = _AFFECT_VALENCE_PRIOR.get(affect, (0.30, 0.70))
    midpoint = (lo + hi) / 2.0
    return round(valence - midpoint, 3)

def compute_trajectory(
    buffer: deque,
    current: DimensionalResult,
    ser_affect: str,
    prior_ser_affect: str | None,
) -> tuple[TrajectorySignal, CoherenceSignal]:
    """
    Compute trajectory and coherence signals for one speaker at one window.

    buffer            Rolling deque of prior DimensionalResult for this speaker.
                      Must be updated AFTER this call (append current to buffer).
    current           DimensionalResult for the window being classified.
    ser_affect        SER affect label for this window (from ToneClassifier).
    prior_ser_affect  SER affect label from the previous window, for reframe detection.
                      Pass None on the first window or when not tracking.

    Returns (TrajectorySignal, CoherenceSignal). Both have baseline_established=False
    and trend="calibrating" when buffer has fewer than BASELINE_MIN entries.
    """
    n = len(buffer)

    # Coherence can be computed without a buffer
    coh_score = affect_coherence(ser_affect, current.valence)
    div_score = affect_divergence_score(ser_affect, current.valence)

    suppression = (
        ser_affect in _LOW_PRESENTATION_AFFECTS
        and current.arousal > _SUPPRESSION_AROUSAL_MIN
        and current.valence < 0.50
    )

    reframe = "none"
    if prior_ser_affect and prior_ser_affect != ser_affect:
        if _is_more_positive(ser_affect, prior_ser_affect):
            # Valence actually improved in this window vs. single prior frame
            if n >= 1:
                prev_valence = list(buffer)[-1].valence
                dim_improved = (current.valence - prev_valence) >= _DELTA_THRESHOLD
            else:
                dim_improved = False
            reframe = "genuine" if dim_improved else "surface"

    coher = CoherenceSignal(
        coherence_score=coh_score,
        suppression_flag=suppression,
        reframe_type=reframe,
        affect_divergence=div_score,
    )

    if n < BASELINE_MIN:
        traj = TrajectorySignal(
            arousal_delta=0.0,
            valence_delta=0.0,
            dominance_delta=0.0,
            arousal_trend="flat",
            valence_trend="flat",
            trend="calibrating",
            frames_in_buffer=n,
            baseline_established=False,
        )
        return traj, coher

    mean_arousal = sum(f.arousal for f in buffer) / n
    mean_valence = sum(f.valence for f in buffer) / n
    mean_dominance = sum(f.dominance for f in buffer) / n

    a_delta = current.arousal - mean_arousal
    v_delta = current.valence - mean_valence
    d_delta = current.dominance - mean_dominance

    a_trend = (
        "rising" if a_delta > _DELTA_THRESHOLD else
        "falling" if a_delta < -_DELTA_THRESHOLD else
        "flat"
    )
    v_trend = (
        "rising" if v_delta > _DELTA_THRESHOLD else
        "falling" if v_delta < -_DELTA_THRESHOLD else
        "flat"
    )

    # Consecutive movement: check whether the most recent buffered frame
    # was already moving in the same direction as the current frame.
    buf_list = list(buffer)
    prev = buf_list[-1]
    a_consecutive = a_trend == "rising" and (current.arousal - prev.arousal) > 0.03
    v_consecutive = v_trend == "falling" and (current.valence - prev.valence) < -0.03

    # Composite trend label
    if suppression:
        trend = "suppressed"
    elif a_trend == "rising" and a_consecutive:
        trend = "escalating"
    elif a_trend == "falling" and mean_arousal > 0.55:
        trend = "de-escalating"
    elif v_trend == "falling" and v_consecutive:
        trend = "worsening"
    elif v_trend == "rising" and mean_valence < 0.45:
        trend = "improving"
    else:
        trend = "stable"

    traj = TrajectorySignal(
        arousal_delta=round(a_delta, 3),
        valence_delta=round(v_delta, 3),
        dominance_delta=round(d_delta, 3),
        arousal_trend=a_trend,
        valence_trend=v_trend,
        trend=trend,
        frames_in_buffer=n,
        baseline_established=True,
    )
    return traj, coher

# ── Internal helpers ───────────────────────────────────────────────────────────


def _is_more_positive(current: str, prior: str) -> bool:
    """True when the current SER affect is ranked more positive than prior."""
    return _AFFECT_POSITIVITY.get(current, 4) > _AFFECT_POSITIVITY.get(prior, 4)
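The coherence arithmetic above is easy to check by hand. A self-contained worked example (the prior table is abbreviated to three labels for illustration; the helper body is copied from the module above):

```python
# Worked example of affect_coherence: 1.0 inside the expected valence range,
# then a linear falloff over a 0.40 gap outside it. Abbreviated prior table.
_AFFECT_VALENCE_PRIOR = {
    "warm": (0.60, 1.00),
    "neutral": (0.35, 0.65),
    "frustrated": (0.10, 0.45),
}


def affect_coherence(affect: str, valence: float) -> float:
    lo, hi = _AFFECT_VALENCE_PRIOR.get(affect, (0.30, 0.70))
    if lo <= valence <= hi:
        return 1.0
    gap = min(abs(valence - lo), abs(valence - hi))
    return round(max(0.0, 1.0 - (gap / 0.40)), 3)


# "warm" with valence 0.80 sits inside (0.60, 1.00) → fully coherent:
print(affect_coherence("warm", 0.80))  # 1.0

# "warm" with valence 0.30 is 0.30 below lo=0.60:
# 1.0 - 0.30/0.40 = 0.25 — a strong divergence (masking candidate).
print(affect_coherence("warm", 0.30))  # 0.25
```

The 0.40 falloff constant means any valence more than 0.40 outside the expected range clamps to a coherence of 0.0.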
@@ -11,6 +11,8 @@ requires-python = ">=3.11"
license = {text = "MIT"}
dependencies = [
    "pydantic>=2.0",
    "fastapi>=0.111",
    "uvicorn[standard]>=0.29",
]

[project.optional-dependencies]
@@ -26,6 +28,14 @@ inference = [
    "pyannote.audio>=3.1",
    "python-dotenv>=1.0",
]
signalwire = [
    "signalwire>=2.0",
]
freeswitch = [
    # ESL Python bindings are compiled from FreeSWITCH source.
    # See: https://developer.signalwire.com/freeswitch/FreeSWITCH-Explained/Client-and-Developer-Interfaces/Event-Socket-Library/
    "python-ESL",
]
dev = [
    "pytest>=8.0",
    "pytest-asyncio>=0.23",
69	scripts/test_classify_e2e.py	Normal file

@@ -0,0 +1,69 @@
"""
End-to-end integration test for the cf-voice /classify endpoint.

Extracts a 2-second window from a local media file, base64-encodes the
raw PCM, and POSTs it to the running cf-voice service at localhost:8009.
Prints each returned AudioEvent for quick inspection.

Requires:
  - cf-voice running at localhost:8009 (CF_VOICE_DIARIZE=1 for speaker labels)
  - ffmpeg on PATH
  - A local audio/video file (edit MEDIA_FILE below)

Run:
    python scripts/test_classify_e2e.py
"""
from __future__ import annotations

import base64
import json
import subprocess
import urllib.request

import numpy as np

MEDIA_FILE = "/Library/Series/Hogan's Heroes/Season 3/Hogan's Heroes - S03E19 - Hogan, Go Home.mkv"
START_S = 120
DURATION_S = 2
SAMPLE_RATE = 16_000
CF_VOICE_URL = "http://localhost:8009"

proc = subprocess.run(
    [
        "ffmpeg", "-i", MEDIA_FILE,
        "-ss", str(START_S),
        "-t", str(DURATION_S),
        "-ar", str(SAMPLE_RATE),
        "-ac", "1",
        "-f", "s16le",
        "-",
    ],
    capture_output=True,
    check=True,
)

pcm = proc.stdout
audio = np.frombuffer(pcm, dtype=np.int16)
print(f"audio samples: {len(audio)}, duration: {len(audio) / SAMPLE_RATE:.2f}s")

payload = json.dumps({
    "audio_chunk": base64.b64encode(pcm).decode(),
    "timestamp": float(START_S),
    "session_id": "test",
}).encode()

req = urllib.request.Request(
    f"{CF_VOICE_URL}/classify",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req, timeout=30) as resp:
    result = json.loads(resp.read())

for ev in result["events"]:
    print(
        f"  {ev['event_type']:10}"
        f"  speaker_id={ev.get('speaker_id', 'N/A'):14}"
        f"  label={ev.get('label', '')}"
    )
65	scripts/test_diarize_real.py	Normal file

@@ -0,0 +1,65 @@
"""
Manual integration test for speaker diarization via pyannote.

Requires:
  - HF_TOKEN env var (or set below)
  - CF_VOICE_DIARIZE=1
  - ffmpeg on PATH
  - A local audio/video file (edit MEDIA_FILE below)
  - pip install cf-voice[inference]

Run:
    HF_TOKEN=hf_... CF_VOICE_DIARIZE=1 python scripts/test_diarize_real.py
"""
from __future__ import annotations

import asyncio
import os
import subprocess

import numpy as np

# Fail fast if HF_TOKEN is missing; default CF_VOICE_DIARIZE on for this script.
if not os.environ.get("HF_TOKEN"):
    raise SystemExit("Set HF_TOKEN in env before running this script.")
os.environ.setdefault("CF_VOICE_DIARIZE", "1")

MEDIA_FILE = "/Library/Series/Hogan's Heroes/Season 3/Hogan's Heroes - S03E19 - Hogan, Go Home.mkv"
START_S = 120
DURATION_S = 2
SAMPLE_RATE = 16_000

from cf_voice.diarize import Diarizer, SpeakerTracker  # noqa: E402


async def main() -> None:
    d = Diarizer.from_env()
    tracker = SpeakerTracker()

    proc = subprocess.run(
        [
            "ffmpeg", "-i", MEDIA_FILE,
            "-ss", str(START_S),
            "-t", str(DURATION_S),
            "-ar", str(SAMPLE_RATE),
            "-ac", "1",
            "-f", "s16le",
            "-",
        ],
        capture_output=True,
        check=True,
    )
    audio = np.frombuffer(proc.stdout, dtype=np.int16).astype(np.float32) / 32768.0
    rms = float(np.sqrt(np.mean(audio**2)))
    print(f"audio: {len(audio)} samples, {len(audio) / SAMPLE_RATE:.2f}s, rms={rms:.4f}")

    segs = await d.diarize_async(audio)
    print(f"segments ({len(segs)}): {segs}")

    mid = len(audio) / 2.0 / SAMPLE_RATE
    label = d.speaker_at(segs, mid, tracker)
    print(f"speaker_at({mid:.2f}s): {label}")


if __name__ == "__main__":
    asyncio.run(main())
119	tests/test_acoustic.py	Normal file

@@ -0,0 +1,119 @@
import pytest

from cf_voice.acoustic import (
    AcousticBackend,
    AcousticResult,
    ASTAcousticBackend,
    MockAcousticBackend,
    make_acoustic,
)
from cf_voice.events import AudioEvent


class TestAcousticResult:
    def test_fields(self):
        evt = AudioEvent(timestamp=1.0, event_type="queue", label="ringback", confidence=0.9)
        result = AcousticResult(queue=evt, speaker=None, environ=None, scene=None, timestamp=1.0)
        assert result.queue.label == "ringback"
        assert result.speaker is None
        assert result.environ is None
        assert result.scene is None


class TestMockAcousticBackend:
    def test_classify_returns_result(self):
        backend = MockAcousticBackend(seed=0)
        result = backend.classify_window(b"", timestamp=0.0)
        assert isinstance(result, AcousticResult)
        assert result.timestamp == 0.0

    def test_all_events_present(self):
        backend = MockAcousticBackend(seed=1)
        result = backend.classify_window(b"", timestamp=1.0)
        assert result.queue is not None
        assert result.speaker is not None
        assert result.environ is not None
        assert result.scene is not None

    def test_event_types_correct(self):
        backend = MockAcousticBackend(seed=2)
        result = backend.classify_window(b"", timestamp=2.0)
        assert result.queue.event_type == "queue"
        assert result.speaker.event_type == "speaker"
        assert result.environ.event_type == "environ"
        assert result.scene.event_type == "scene"

    def test_confidence_in_range(self):
        backend = MockAcousticBackend(seed=3)
        for _ in range(5):
            result = backend.classify_window(b"", timestamp=0.0)
            assert 0.0 <= result.queue.confidence <= 1.0
            assert 0.0 <= result.speaker.confidence <= 1.0
            assert 0.0 <= result.environ.confidence <= 1.0
            assert 0.0 <= result.scene.confidence <= 1.0

    def test_lifecycle_advances(self):
        """Phases should change after their duration elapses."""
        backend = MockAcousticBackend(seed=42)
        # Force phase to advance by manipulating phase_start
        backend._phase_start -= 1000  # pretend 1000s elapsed
        result = backend.classify_window(b"", timestamp=0.0)
        # Should have advanced — just verify it doesn't crash and returns valid
        assert result.queue.label in (
            "hold_music", "silence", "ringback", "busy", "dead_air", "dtmf_tone"
        )

    def test_isinstance_protocol(self):
        backend = MockAcousticBackend()
        assert isinstance(backend, AcousticBackend)

    def test_deterministic_with_seed(self):
        b1 = MockAcousticBackend(seed=99)
        b2 = MockAcousticBackend(seed=99)
        r1 = b1.classify_window(b"", timestamp=0.0)
        r2 = b2.classify_window(b"", timestamp=0.0)
        assert r1.queue.label == r2.queue.label
        assert r1.queue.confidence == r2.queue.confidence


class TestASTAcousticBackend:
    def test_raises_import_error_without_deps(self, monkeypatch):
        """ASTAcousticBackend should raise ImportError when transformers is unavailable."""
        import builtins
        real_import = builtins.__import__

        def mock_import(name, *args, **kwargs):
            if name in ("transformers",):
                raise ImportError(f"Mocked: {name} not available")
            return real_import(name, *args, **kwargs)

        monkeypatch.setattr(builtins, "__import__", mock_import)
        with pytest.raises(ImportError, match="transformers"):
            ASTAcousticBackend()


class TestMakeAcoustic:
    def test_mock_flag(self):
        backend = make_acoustic(mock=True)
        assert isinstance(backend, MockAcousticBackend)

    def test_mock_env(self, monkeypatch):
        monkeypatch.setenv("CF_VOICE_MOCK", "1")
        backend = make_acoustic()
        assert isinstance(backend, MockAcousticBackend)

    def test_real_falls_back_to_mock_without_deps(self, monkeypatch, capsys):
        """make_acoustic(mock=False) falls back to mock when deps are missing."""
        import builtins
        real_import = builtins.__import__

        def mock_import(name, *args, **kwargs):
            if name in ("transformers",):
                raise ImportError(f"Mocked: {name} not available")
            return real_import(name, *args, **kwargs)
|
||||
|
||||
monkeypatch.delenv("CF_VOICE_MOCK", raising=False)
|
||||
monkeypatch.setattr(builtins, "__import__", mock_import)
|
||||
backend = make_acoustic(mock=False)
|
||||
# Should fall back gracefully, never raise
|
||||
assert isinstance(backend, MockAcousticBackend)
|
||||
131 tests/test_diarize.py Normal file
@@ -0,0 +1,131 @@
# tests/test_diarize.py — SpeakerTracker and speaker_at() diarization logic
#
# All tests are pure Python — no GPU, no pyannote, no HF_TOKEN required.
# The Diarizer class itself is only tested for its from_env() guard and the
# speaker_at() method, both of which run without loading the model.
from __future__ import annotations

import pytest

from cf_voice.diarize import (
    Diarizer,
    SpeakerSegment,
    SpeakerTracker,
    SPEAKER_MULTIPLE,
    SPEAKER_UNKNOWN,
)


# ── SpeakerTracker ────────────────────────────────────────────────────────────

def test_tracker_first_speaker_is_a():
    t = SpeakerTracker()
    assert t.label("SPEAKER_00") == "Speaker A"


def test_tracker_second_speaker_is_b():
    t = SpeakerTracker()
    t.label("SPEAKER_00")
    assert t.label("SPEAKER_01") == "Speaker B"


def test_tracker_same_id_returns_same_label():
    t = SpeakerTracker()
    first = t.label("SPEAKER_00")
    second = t.label("SPEAKER_00")
    assert first == second == "Speaker A"


def test_tracker_26_speakers():
    t = SpeakerTracker()
    labels = [t.label(f"SPEAKER_{i:02d}") for i in range(26)]
    assert labels[0] == "Speaker A"
    assert labels[25] == "Speaker Z"


def test_tracker_27th_speaker_wraps():
    t = SpeakerTracker()
    for i in range(26):
        t.label(f"SPEAKER_{i:02d}")
    label_27 = t.label("SPEAKER_26")
    assert label_27 == "Speaker AA"


def test_tracker_reset_clears_map():
    t = SpeakerTracker()
    t.label("SPEAKER_00")
    t.label("SPEAKER_01")
    t.reset()
    # After reset, SPEAKER_01 is seen as new and maps to "Speaker A" again
    assert t.label("SPEAKER_01") == "Speaker A"


# ── Diarizer.speaker_at() ─────────────────────────────────────────────────────

def _segs(*items: tuple[str, float, float]) -> list[SpeakerSegment]:
    return [SpeakerSegment(speaker_id=s, start_s=st, end_s=en) for s, st, en in items]


def test_speaker_at_single_speaker():
    d = object.__new__(Diarizer)  # bypass __init__ (no GPU needed)
    segs = _segs(("SPEAKER_00", 0.0, 2.0))
    t = SpeakerTracker()
    assert d.speaker_at(segs, 1.0, tracker=t) == "Speaker A"


def test_speaker_at_no_coverage_returns_unknown():
    d = object.__new__(Diarizer)
    segs = _segs(("SPEAKER_00", 0.0, 1.0))
    assert d.speaker_at(segs, 1.5) == SPEAKER_UNKNOWN


def test_speaker_at_empty_segments_returns_unknown():
    d = object.__new__(Diarizer)
    assert d.speaker_at([], 1.0) == SPEAKER_UNKNOWN


def test_speaker_at_overlap_returns_multiple():
    d = object.__new__(Diarizer)
    segs = _segs(
        ("SPEAKER_00", 0.0, 2.0),
        ("SPEAKER_01", 0.5, 2.0),  # overlaps SPEAKER_00 from 0.5s
    )
    assert d.speaker_at(segs, 1.0) == SPEAKER_MULTIPLE


def test_speaker_at_boundary_inclusive():
    d = object.__new__(Diarizer)
    segs = _segs(("SPEAKER_00", 1.0, 2.0))
    t = SpeakerTracker()
    # Exact boundary timestamps are included
    assert d.speaker_at(segs, 1.0, tracker=t) == "Speaker A"
    assert d.speaker_at(segs, 2.0, tracker=t) == "Speaker A"


def test_speaker_at_without_tracker_returns_raw_id():
    d = object.__new__(Diarizer)
    segs = _segs(("SPEAKER_00", 0.0, 2.0))
    assert d.speaker_at(segs, 1.0) == "SPEAKER_00"


def test_speaker_at_two_speakers_no_overlap():
    d = object.__new__(Diarizer)
    t = SpeakerTracker()
    segs = _segs(
        ("SPEAKER_00", 0.0, 1.0),
        ("SPEAKER_01", 1.5, 2.5),
    )
    assert d.speaker_at(segs, 0.5, tracker=t) == "Speaker A"
    assert d.speaker_at(segs, 2.0, tracker=t) == "Speaker B"
    # Gap at 1.2s: window [0.7, 1.7] → SPEAKER_00 has 0.3s, SPEAKER_01 has 0.2s
    # Dominant speaker (SPEAKER_00 = "Speaker A") is returned, not SPEAKER_UNKNOWN.
    assert d.speaker_at(segs, 1.2, tracker=t) == "Speaker A"


# ── Diarizer.from_env() guard ─────────────────────────────────────────────────

def test_from_env_raises_without_hf_token(monkeypatch):
    monkeypatch.delenv("HF_TOKEN", raising=False)
    with pytest.raises(EnvironmentError, match="HF_TOKEN"):
        Diarizer.from_env()
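The two behaviors these tests pin down are the tracker's label sequence (A..Z, then AA) and the gap rule from the commit message: an uncovered timestamp resolves to the speaker with the most coverage in a surrounding window. A self-contained sketch, assuming a ±0.5 s window (implied by the gap comment above) and bijective base-26 labeling; the real module may differ in detail:

```python
from dataclasses import dataclass

SPEAKER_UNKNOWN = "Unknown"
SPEAKER_MULTIPLE = "Multiple"


@dataclass
class SpeakerSegment:
    speaker_id: str
    start_s: float
    end_s: float


class SpeakerTracker:
    """Maps raw diarizer IDs to stable display labels: A..Z, then AA, AB, ..."""

    def __init__(self) -> None:
        self._map: dict[str, str] = {}

    def reset(self) -> None:
        self._map.clear()

    def label(self, speaker_id: str) -> str:
        if speaker_id not in self._map:
            n = len(self._map) + 1  # bijective base-26: 1 -> A, 26 -> Z, 27 -> AA
            letters = ""
            while n:
                n, rem = divmod(n - 1, 26)
                letters = chr(ord("A") + rem) + letters
            self._map[speaker_id] = f"Speaker {letters}"
        return self._map[speaker_id]


def speaker_at(segs, t, tracker=None, window_s=0.5):
    """Speaker at time t; boundaries inclusive. Overlap -> MULTIPLE.
    In a gap, pick the speaker with the most coverage in [t - w, t + w]."""
    ids = {s.speaker_id for s in segs if s.start_s <= t <= s.end_s}
    if len(ids) > 1:
        return SPEAKER_MULTIPLE
    if len(ids) == 1:
        sid = next(iter(ids))
        return tracker.label(sid) if tracker else sid
    lo, hi = t - window_s, t + window_s
    best_sid, best_dur = None, 0.0
    for s in segs:
        dur = min(hi, s.end_s) - max(lo, s.start_s)
        if dur > best_dur:
            best_sid, best_dur = s.speaker_id, dur
    if best_sid is None:
        return SPEAKER_UNKNOWN
    return tracker.label(best_sid) if tracker else best_sid
```

Running the gap case from the last test against this sketch: window [0.7, 1.7] gives SPEAKER_00 0.3 s and SPEAKER_01 0.2 s, so the dominant speaker wins.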
@@ -75,10 +75,60 @@ class TestMockVoiceIO:
         io = make_io()
         assert isinstance(io, MockVoiceIO)

-    def test_make_io_real_raises(self, monkeypatch):
-        monkeypatch.delenv("CF_VOICE_MOCK", raising=False)
-        with pytest.raises(NotImplementedError):
-            make_io(mock=False)
+    def test_make_io_real_returns_mic_io(self, monkeypatch):
+        """make_io(mock=False) returns MicVoiceIO when sounddevice/numpy are installed."""
+        from cf_voice.capture import MicVoiceIO
+
+        monkeypatch.delenv("CF_VOICE_MOCK", raising=False)
+        io = make_io(mock=False)
+        assert isinstance(io, MicVoiceIO)


class TestContextClassifierChunk:
    """Tests for classify_chunk() — multi-class event output."""

    def test_mock_returns_four_event_types(self):
        classifier = ContextClassifier.mock(interval_s=0.05, seed=10)
        events = classifier.classify_chunk(timestamp=1.0)
        types = {e.event_type for e in events}
        # In mock mode all four event types should be present
        assert "tone" in types
        assert "queue" in types
        assert "speaker" in types
        assert "environ" in types

    def test_mock_tone_event_has_subtext(self):
        classifier = ContextClassifier.mock(interval_s=0.05, seed=11)
        events = classifier.classify_chunk(timestamp=0.0)
        tone_events = [e for e in events if e.event_type == "tone"]
        assert len(tone_events) == 1
        assert tone_events[0].subtext is not None

    def test_elcor_override_flag(self):
        classifier = ContextClassifier.mock(interval_s=0.05, seed=12)
        events_generic = classifier.classify_chunk(timestamp=0.0, elcor=False)
        events_elcor = classifier.classify_chunk(timestamp=0.0, elcor=True)

        def subtext(evs):
            return next(e.subtext for e in evs if e.event_type == "tone")

        generic_sub = subtext(events_generic)
        elcor_sub = subtext(events_elcor)
        # Generic format: "Tone: X". Elcor format: "With X:" or "Warmly:" etc.
        assert generic_sub.startswith("Tone:") or not generic_sub.endswith(":")
        # Elcor format ends with ":"
        assert elcor_sub.endswith(":")

    def test_session_id_propagates(self):
        classifier = ContextClassifier.mock(interval_s=0.05, seed=13)
        events = classifier.classify_chunk(timestamp=0.0, session_id="ses_test")
        tone_events = [e for e in events if e.event_type == "tone"]
        assert tone_events[0].session_id == "ses_test"

    def test_prior_frames_zero_means_no_shift(self):
        classifier = ContextClassifier.mock(interval_s=0.05, seed=14)
        events = classifier.classify_chunk(timestamp=0.0, prior_frames=0)
        tone_events = [e for e in events if e.event_type == "tone"]
        assert tone_events[0].shift_magnitude == 0.0


class TestContextClassifier:
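The replaced test above changes the contract of the `make_io` factory: the real path now constructs `MicVoiceIO` instead of raising `NotImplementedError`. A minimal sketch of the selection logic implied by the hunk; the class bodies are stand-ins, and the fall-back-to-mock branch on missing `sounddevice`/`numpy` is an assumption this hunk does not confirm:

```python
import os


class MockVoiceIO:
    """Stand-in for the mock I/O class under test."""


class MicVoiceIO:
    """Stand-in for the sounddevice-backed class under test."""


def make_io(mock: bool = False):
    """Mock wins when requested by flag or forced by CF_VOICE_MOCK=1;
    otherwise return the microphone-backed MicVoiceIO. This sketch falls
    back to mock when the audio stack is absent rather than raising."""
    if mock or os.environ.get("CF_VOICE_MOCK") == "1":
        return MockVoiceIO()
    try:
        import sounddevice  # noqa: F401 -- optional audio dependency
        import numpy  # noqa: F401
    except ImportError:
        return MockVoiceIO()
    return MicVoiceIO()
```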
109 tests/test_prefs.py Normal file
@@ -0,0 +1,109 @@
import pytest

from cf_voice.prefs import (
    PREF_CONFIDENCE_THRESHOLD,
    PREF_ELCOR_MODE,
    PREF_ELCOR_PRIOR_FRAMES,
    PREF_WHISPER_MODEL,
    get_confidence_threshold,
    get_elcor_prior_frames,
    get_voice_pref,
    get_whisper_model,
    is_elcor_enabled,
    set_voice_pref,
)


class _DictStore:
    """In-memory preference store for testing."""

    def __init__(self, data: dict | None = None) -> None:
        self._data: dict = data or {}

    def get(self, user_id, path, default=None):
        return self._data.get(path, default)

    def set(self, user_id, path, value):
        self._data[path] = value


class TestGetVoicePref:
    def test_returns_default_when_nothing_set(self, monkeypatch):
        monkeypatch.delenv("CF_VOICE_ELCOR", raising=False)
        val = get_voice_pref(PREF_ELCOR_MODE, store=_DictStore())
        assert val is False

    def test_explicit_store_takes_priority(self):
        store = _DictStore({PREF_ELCOR_MODE: True})
        assert get_voice_pref(PREF_ELCOR_MODE, store=store) is True

    def test_env_fallback_bool(self, monkeypatch):
        monkeypatch.setenv("CF_VOICE_ELCOR", "1")
        assert get_voice_pref(PREF_ELCOR_MODE, store=_DictStore()) is True

    def test_env_fallback_false(self, monkeypatch):
        monkeypatch.setenv("CF_VOICE_ELCOR", "0")
        assert get_voice_pref(PREF_ELCOR_MODE, store=_DictStore()) is False

    def test_env_fallback_float(self, monkeypatch):
        monkeypatch.setenv("CF_VOICE_CONFIDENCE_THRESHOLD", "0.7")
        val = get_voice_pref(PREF_CONFIDENCE_THRESHOLD, store=_DictStore())
        assert abs(val - 0.7) < 1e-9

    def test_env_fallback_int(self, monkeypatch):
        monkeypatch.setenv("CF_VOICE_ELCOR_PRIOR_FRAMES", "6")
        val = get_voice_pref(PREF_ELCOR_PRIOR_FRAMES, store=_DictStore())
        assert val == 6

    def test_env_fallback_str(self, monkeypatch):
        monkeypatch.setenv("CF_VOICE_WHISPER_MODEL", "medium")
        val = get_voice_pref(PREF_WHISPER_MODEL, store=_DictStore())
        assert val == "medium"

    def test_store_beats_env(self, monkeypatch):
        monkeypatch.setenv("CF_VOICE_ELCOR", "0")
        store = _DictStore({PREF_ELCOR_MODE: True})
        # The key is present in the store, so its value wins over the env var.
        # Only an absent key falls through to the env/default chain.
        assert get_voice_pref(PREF_ELCOR_MODE, store=store) is True

    def test_unknown_key_returns_none(self):
        val = get_voice_pref("voice.nonexistent", store=_DictStore())
        assert val is None


class TestSetVoicePref:
    def test_sets_in_store(self):
        store = _DictStore()
        set_voice_pref(PREF_ELCOR_MODE, True, store=store)
        assert store._data[PREF_ELCOR_MODE] is True

    def test_no_store_raises(self, monkeypatch):
        # Patch _cf_core_store to return None (simulates no cf-core installed)
        import cf_voice.prefs as prefs_mod

        monkeypatch.setattr(prefs_mod, "_cf_core_store", lambda: None)
        with pytest.raises(RuntimeError, match="No writable preference store"):
            set_voice_pref(PREF_ELCOR_MODE, True)


class TestConvenienceHelpers:
    def test_is_elcor_enabled_false_default(self, monkeypatch):
        monkeypatch.delenv("CF_VOICE_ELCOR", raising=False)
        assert is_elcor_enabled(store=_DictStore()) is False

    def test_is_elcor_enabled_true_from_store(self):
        store = _DictStore({PREF_ELCOR_MODE: True})
        assert is_elcor_enabled(store=store) is True

    def test_get_confidence_threshold_default(self, monkeypatch):
        monkeypatch.delenv("CF_VOICE_CONFIDENCE_THRESHOLD", raising=False)
        assert get_confidence_threshold(store=_DictStore()) == pytest.approx(0.55)

    def test_get_whisper_model_default(self, monkeypatch):
        monkeypatch.delenv("CF_VOICE_WHISPER_MODEL", raising=False)
        assert get_whisper_model(store=_DictStore()) == "small"

    def test_get_elcor_prior_frames_default(self, monkeypatch):
        monkeypatch.delenv("CF_VOICE_ELCOR_PRIOR_FRAMES", raising=False)
        assert get_elcor_prior_frames(store=_DictStore()) == 4
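Taken together, the prefs tests define a three-level resolution order: explicit store, then env var with per-key type coercion, then a registered default. A sketch of that chain, consistent with the defaults the tests assert (elcor False, threshold 0.55, whisper "small", prior frames 4); the key strings and registry shape are assumptions, since only the `PREF_*` constant names appear above:

```python
import os

# Hypothetical registry: pref key -> (env var, type, default). The real
# module exports PREF_* constants whose string values are not shown here.
_PREFS = {
    "voice.elcor_mode": ("CF_VOICE_ELCOR", bool, False),
    "voice.confidence_threshold": ("CF_VOICE_CONFIDENCE_THRESHOLD", float, 0.55),
    "voice.whisper_model": ("CF_VOICE_WHISPER_MODEL", str, "small"),
    "voice.elcor_prior_frames": ("CF_VOICE_ELCOR_PRIOR_FRAMES", int, 4),
}

_MISSING = object()  # sentinel so a stored False/0 still counts as "set"


def get_voice_pref(key, store=None, user_id=None):
    """Resolution order: explicit store -> env var -> registered default."""
    spec = _PREFS.get(key)
    if spec is None:
        return None  # unknown keys resolve to None
    env_name, typ, default = spec
    if store is not None:
        val = store.get(user_id, key, _MISSING)
        if val is not _MISSING and val is not None:
            return val
    raw = os.environ.get(env_name)
    if raw is not None:
        return raw in ("1", "true", "yes") if typ is bool else typ(raw)
    return default
```

Note the sentinel: a store value of `False` must win over `CF_VOICE_ELCOR=1`, so presence is tested with an object identity check rather than truthiness.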
141 tests/test_telephony.py Normal file
@@ -0,0 +1,141 @@
import asyncio

import pytest

from cf_voice.telephony import (
    CallSession,
    MockTelephonyBackend,
    TelephonyBackend,
    make_telephony,
)


class TestCallSession:
    def test_defaults(self):
        s = CallSession(call_sid="sid_1", to="+15551234567", from_="+18005550000")
        assert s.state == "dialing"
        assert s.amd_result == "unknown"
        assert s.duration_s == 0.0
        assert s.error is None

    def test_state_mutation(self):
        s = CallSession(call_sid="sid_2", to="+1", from_="+2", state="in_progress")
        s.state = "completed"
        assert s.state == "completed"


class TestMockTelephonyBackend:
    @pytest.mark.asyncio
    async def test_dial_returns_session(self):
        backend = MockTelephonyBackend()
        session = await backend.dial("+15551234567", "+18005550000", "https://example.com/wh")
        assert isinstance(session, CallSession)
        assert session.call_sid.startswith("mock_sid_")
        assert session.to == "+15551234567"
        assert session.from_ == "+18005550000"

    @pytest.mark.asyncio
    async def test_dial_transitions_to_in_progress(self):
        backend = MockTelephonyBackend()
        session = await backend.dial("+15551234567", "+18005550000", "https://x.com")
        # give the background task a moment to transition
        await asyncio.sleep(0.1)
        assert session.state == "in_progress"

    @pytest.mark.asyncio
    async def test_amd_resolves_human(self):
        backend = MockTelephonyBackend(amd_delay_s=0.05)
        session = await backend.dial("+1555", "+1800", "https://x.com", amd=True)
        await asyncio.sleep(0.2)
        assert session.amd_result == "human"

    @pytest.mark.asyncio
    async def test_send_dtmf(self):
        backend = MockTelephonyBackend()
        session = await backend.dial("+1", "+2", "https://x.com")
        # should not raise
        await backend.send_dtmf(session.call_sid, "1234#")

    @pytest.mark.asyncio
    async def test_send_dtmf_unknown_sid_raises(self):
        backend = MockTelephonyBackend()
        with pytest.raises(KeyError):
            await backend.send_dtmf("nonexistent_sid", "1")

    @pytest.mark.asyncio
    async def test_bridge_updates_state(self):
        backend = MockTelephonyBackend()
        session = await backend.dial("+1", "+2", "https://x.com")
        await backend.bridge(session.call_sid, "+15559999999")
        assert session.state == "bridged"

    @pytest.mark.asyncio
    async def test_hangup_sets_completed(self):
        backend = MockTelephonyBackend()
        session = await backend.dial("+1", "+2", "https://x.com")
        await backend.hangup(session.call_sid)
        assert session.state == "completed"

    @pytest.mark.asyncio
    async def test_hangup_idempotent(self):
        backend = MockTelephonyBackend()
        session = await backend.dial("+1", "+2", "https://x.com")
        await backend.hangup(session.call_sid)
        await backend.hangup(session.call_sid)
        assert session.state == "completed"

    @pytest.mark.asyncio
    async def test_announce_does_not_raise(self):
        backend = MockTelephonyBackend()
        session = await backend.dial("+1", "+2", "https://x.com")
        await backend.announce(session.call_sid, "Hello, this is an automated assistant.")

    @pytest.mark.asyncio
    async def test_get_state(self):
        backend = MockTelephonyBackend()
        session = await backend.dial("+1", "+2", "https://x.com")
        state = await backend.get_state(session.call_sid)
        assert state in ("ringing", "in_progress", "dialing")

    @pytest.mark.asyncio
    async def test_multiple_calls_unique_sids(self):
        backend = MockTelephonyBackend()
        s1 = await backend.dial("+1", "+2", "https://x.com")
        s2 = await backend.dial("+3", "+4", "https://x.com")
        assert s1.call_sid != s2.call_sid

    def test_isinstance_protocol(self):
        backend = MockTelephonyBackend()
        assert isinstance(backend, TelephonyBackend)


class TestMakeTelephony:
    def test_mock_flag(self):
        backend = make_telephony(mock=True)
        assert isinstance(backend, MockTelephonyBackend)

    def test_mock_env(self, monkeypatch):
        monkeypatch.setenv("CF_VOICE_MOCK", "1")
        backend = make_telephony()
        assert isinstance(backend, MockTelephonyBackend)

    def test_no_config_raises(self, monkeypatch):
        monkeypatch.delenv("CF_VOICE_MOCK", raising=False)
        monkeypatch.delenv("CF_SW_PROJECT_ID", raising=False)
        monkeypatch.delenv("CF_ESL_PASSWORD", raising=False)
        with pytest.raises(RuntimeError, match="No telephony backend configured"):
            make_telephony()

    def test_signalwire_selected_by_env(self, monkeypatch):
        monkeypatch.delenv("CF_VOICE_MOCK", raising=False)
        monkeypatch.setenv("CF_SW_PROJECT_ID", "proj_123")
        # SignalWireBackend will raise ImportError (signalwire SDK not installed)
        # but only at instantiation — make_telephony should call the constructor
        with pytest.raises((ImportError, RuntimeError)):
            make_telephony()

    def test_freeswitch_selected_by_env(self, monkeypatch):
        monkeypatch.delenv("CF_VOICE_MOCK", raising=False)
        monkeypatch.delenv("CF_SW_PROJECT_ID", raising=False)
        monkeypatch.setenv("CF_ESL_PASSWORD", "s3cret")
        # FreeSWITCHBackend will raise ImportError (ESL not installed)
        with pytest.raises((ImportError, RuntimeError)):
            make_telephony()
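The `TestMakeTelephony` cases imply a fixed precedence for backend selection: mock flag or CF_VOICE_MOCK, then SignalWire via CF_SW_PROJECT_ID, then FreeSWITCH via CF_ESL_PASSWORD, else an error. A sketch of that dispatch under those assumptions; the real backend constructors are elided, with `ImportError` standing in for their missing-SDK failure mode:

```python
import os


class MockTelephonyBackend:
    """Stand-in for the real mock backend."""


def make_telephony(mock: bool = False):
    """Backend selection as the tests exercise it:
    mock flag / CF_VOICE_MOCK=1 -> mock;
    CF_SW_PROJECT_ID set        -> SignalWire (may fail at construction);
    CF_ESL_PASSWORD set         -> FreeSWITCH (may fail at construction);
    otherwise                   -> RuntimeError."""
    if mock or os.environ.get("CF_VOICE_MOCK") == "1":
        return MockTelephonyBackend()
    if os.environ.get("CF_SW_PROJECT_ID"):
        # Real path: return SignalWireBackend(...); raises if SDK absent.
        raise ImportError("signalwire SDK not installed (sketch)")
    if os.environ.get("CF_ESL_PASSWORD"):
        # Real path: return FreeSWITCHBackend(...); raises if ESL absent.
        raise ImportError("ESL not installed (sketch)")
    raise RuntimeError("No telephony backend configured")
```

Checking env vars in this order means a host with both SignalWire and FreeSWITCH credentials deterministically prefers SignalWire, which is what `test_freeswitch_selected_by_env` relies on when it deletes CF_SW_PROJECT_ID first.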