cf-voice/README.md
pyr0ball 24f04b67db feat: full voice pipeline — AST acoustic, accent, privacy, prosody, dimensional, trajectory, telephony, FastAPI app
New modules shipped (from Linnet integration):
- acoustic.py: AST (MIT/ast-finetuned-audioset-10-10-0.4593) replaces YAMNet stub;
  527 AudioSet classes mapped to queue/speaker/environ/scene labels; _LABEL_MAP
  includes hold_music, ringback, DTMF, background_shift, AMD signal chain
- accent.py: facebook/mms-lid-126 language ID → regional accent labels
  (en_gb, en_us, en_au, fr, es, de, zh, …); lazy-loaded, gated by CF_VOICE_ACCENT
- privacy.py: compound privacy risk scorer — public_env, background_voices,
  nature scene, accent signals; returns 0–3 score without storing any audio
- prosody.py: openSMILE-backed prosody extractor (sarcasm_risk, flat_f0_score,
  speech_rate, pitch_range); mock mode returns neutral values
- dimensional.py: audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim
  valence/arousal/dominance scorer; gated by CF_VOICE_DIMENSIONAL
- trajectory.py: rolling buffer for arousal/valence deltas, trend detection
  (escalating/suppressed/stable), coherence scoring, suppression/reframe flags
- telephony.py: TelephonyBackend Protocol + MockTelephonyBackend + SignalWireBackend
  + FreeSWITCHBackend; CallSession dataclass; make_telephony() factory
- app.py: FastAPI service (port 8007) — /health + /classify; accepts base64 PCM
  chunks, returns full AudioEventOut including dimensional/prosody/accent fields
- prefs.py: voice preference helpers (elcor_mode, confidence_threshold,
  whisper_model, elcor_prior_frames); cf-core and env-var fallback

Tests: fix stale tests (YAMNetAcousticBackend → ASTAcousticBackend, scene field
added to AcousticResult, speaker_at gap now resolves dominant speaker not UNKNOWN,
make_io real path returns MicVoiceIO when sounddevice installed). 78 tests passing.

Closes #2, #3.
2026-04-18 22:36:58 -07:00


# cf-voice
CircuitForge voice annotation pipeline. Produces `VoiceFrame` objects from a live audio stream — tone label, confidence, speaker identity, and shift magnitude — and exposes `ToneEvent` as the stable SSE wire type for downstream consumers (Linnet, Osprey, Falcon).
**Status:** Notation v0.1.x — real inference pipeline live (faster-whisper STT, wav2vec2 SER, librosa prosody, pyannote diarization). Mock mode available for dev/CI without GPU or mic.
## Install
```bash
# Mock mode only (no GPU required)
pip install -e ../cf-voice
# Real inference (STT + tone classifier + diarization)
pip install -e "../cf-voice[inference]"
```
Copy `.env.example` to `.env` and fill in `HF_TOKEN` for diarization.
## Quick start
```python
from cf_voice.context import ContextClassifier

# Mock mode (no hardware needed)
classifier = ContextClassifier.mock()
async for frame in classifier.stream():
    print(frame.label, frame.confidence)

# Real mic capture (requires [inference] extras + CF_VOICE_MOCK unset)
classifier = ContextClassifier.from_env()
async for frame in classifier.stream():
    print(frame.label, frame.confidence)
```
CLI smoke-test:
```bash
CF_VOICE_MOCK=1 cf-voice-demo
```
---
## VoiceFrame
Produced by `cf_voice.io` (audio capture layer). MIT licensed.
```python
@dataclass
class VoiceFrame:
    label: str              # tone descriptor, e.g. "Warmly impatient"
    confidence: float       # 0.0–1.0
    speaker_id: str         # ephemeral local label, e.g. "speaker_a"
    shift_magnitude: float  # delta from previous frame, 0.0–1.0
    timestamp: float        # session-relative seconds

    def is_reliable(self, threshold=0.6) -> bool: ...
    def is_shift(self, threshold=0.3) -> bool: ...
```
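For illustration, here is a self-contained sketch of how the two threshold helpers might behave — the implementations below are plausible assumptions, not the shipped `cf_voice.models` code:

```python
from dataclasses import dataclass

# Minimal stand-in for cf_voice.models.VoiceFrame; the helper bodies
# are assumed (simple threshold checks), not copied from cf-voice.
@dataclass
class VoiceFrame:
    label: str
    confidence: float
    speaker_id: str
    shift_magnitude: float
    timestamp: float

    def is_reliable(self, threshold: float = 0.6) -> bool:
        # Frame is confident enough to act on.
        return self.confidence >= threshold

    def is_shift(self, threshold: float = 0.3) -> bool:
        # Register changed meaningfully since the previous frame.
        return self.shift_magnitude >= threshold

frame = VoiceFrame("Warmly impatient", 0.79, "speaker_a", 0.74, 4.82)
if frame.is_reliable() and frame.is_shift():
    print(f"{frame.speaker_id} shifted: {frame.label}")
    # → speaker_a shifted: Warmly impatient
```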
---
## ToneEvent — SSE wire format
`ToneEvent` is the stable SSE wire type emitted by Linnet's annotation stream
and consumed by `<LinnetWidget />` embeds in Osprey, Falcon, and other products.
**Field names are locked as of cf-voice v0.1.0** (cf-core#40).
### JSON shape
```json
{
  "event_type": "tone",
  "timestamp": 4.82,
  "label": "Warmly impatient",
  "confidence": 0.79,
  "speaker_id": "speaker_a",
  "subtext": "Tone: Frustrated",
  "affect": "frustrated",
  "shift_magnitude": 0.74,
  "shift_direction": "more_urgent",
  "prosody_flags": ["fast_rate", "rising"],
  "session_id": "ses_abc123"
}
```
### Field reference
| Field | Type | Stable | Description |
|---|---|---|---|
| `event_type` | `"tone"` | yes | Always `"tone"` for ToneEvent |
| `timestamp` | `float` | yes | Seconds since session start |
| `label` | `str` | yes | Human-readable tone descriptor ("Warmly impatient") |
| `confidence` | `float` | yes | 0.0–1.0. Below ~0.55 = speculative |
| `speaker_id` | `str` | yes | Ephemeral diarization label ("speaker_a"). Resets per session |
| `subtext` | `str \| null` | yes | Annotation text. Generic: `"Tone: Frustrated"`. Elcor: `"With barely concealed frustration:"` |
| `affect` | `str` | yes | AFFECT_LABELS key ("frustrated"). See `cf_voice.events.AFFECT_LABELS` |
| `shift_magnitude` | `float` | yes | 0.0–1.0. High = meaningful register change from previous frame |
| `shift_direction` | `str` | yes | `"warmer"` \| `"colder"` \| `"more_urgent"` \| `"stable"` |
| `prosody_flags` | `str[]` | no | Raw prosody signals ("fast_rate", "rising", "flat_pitch", "low_energy"). Subject to change |
| `session_id` | `str` | yes | Caller-assigned. Correlates events to a conversation session |
### SSE envelope
Linnet emits events in standard SSE format:
```
event: tone-event
data: {"event_type":"tone","timestamp":4.82,...}
```
Host apps subscribing via `<LinnetWidget />` receive `MessageEvent` with `type === "tone-event"`.
### Elcor mode
`subtext` switches format when the session is in Elcor mode (easter egg, unlocked by cumulative session time). Generic is always available; Elcor is opt-in via the session flag:
| Affect | Generic | Elcor |
|---|---|---|
| frustrated | `Tone: Frustrated` | `With barely concealed frustration:` |
| warm | `Tone: Warm` | `Warmly:` |
| scripted | `Tone: Scripted` | `Reading from a script:` |
| dismissive | `Tone: Dismissive` | `With polite dismissiveness:` |
| tired | `Tone: Tired` | `With audible fatigue:` |
---
## Telephony
`cf_voice.telephony` provides the outbound call abstraction for Osprey, Harrier, Ibis, and Kestrel.
### Quick start
```python
from cf_voice.telephony import make_telephony
# Mock mode — no real calls placed (CF_VOICE_MOCK=1 or mock=True)
backend = make_telephony(mock=True)
session = await backend.dial(
    to="+15551234567",
    from_="+18005550000",
    webhook_url="https://yourapp.example.com/voice/events",
    amd=True,  # answering machine detection
)
# Adaptive service identification (osprey#21)
await backend.announce(session.call_sid, "This is an automated assistant.")
# Navigate IVR
await backend.send_dtmf(session.call_sid, "2") # Press 2 for billing
# Bridge to user's phone once human agent answers
await backend.bridge(session.call_sid, "+14155550100")
await backend.hangup(session.call_sid)
```
### Backend selection
`make_telephony()` resolves the backend in this order:
| Condition | Backend |
|---|---|
| `CF_VOICE_MOCK=1` or `mock=True` | `MockTelephonyBackend` (dev/CI) |
| `CF_SW_PROJECT_ID` env set | `SignalWireBackend` (paid tier) |
| `CF_ESL_PASSWORD` env set | `FreeSWITCHBackend` (free tier, self-hosted) |
| none | `RuntimeError` |
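The resolution order in the table can be sketched as a simple chain of environment checks — illustrative only; `make_telephony()` in `cf_voice.telephony` is the real factory:

```python
import os

def resolve_backend_name(mock: bool = False) -> str:
    """Return which backend the documented resolution order would pick.
    (Sketch of the table above, not the cf-voice implementation.)"""
    if mock or os.getenv("CF_VOICE_MOCK") == "1":
        return "MockTelephonyBackend"   # dev/CI, no real calls
    if os.getenv("CF_SW_PROJECT_ID"):
        return "SignalWireBackend"      # paid tier
    if os.getenv("CF_ESL_PASSWORD"):
        return "FreeSWITCHBackend"      # free tier, self-hosted
    raise RuntimeError("No telephony backend configured")

print(resolve_backend_name(mock=True))  # → MockTelephonyBackend
```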
### Installing real backends
```bash
# Paid tier — SignalWire managed telephony
pip install cf-voice[signalwire]
# Free tier — self-hosted FreeSWITCH (requires compiled ESL bindings)
pip install cf-voice[freeswitch]
```
Set credentials in `.env` (see `.env.example`).
---
## Mock mode
Set `CF_VOICE_MOCK=1` or pass `mock=True` to `make_io()`. Emits synthetic `VoiceFrame` objects on a timer. No GPU, microphone, or `HF_TOKEN` required. All API surface is identical to real mode.
---
## Module structure
| Module | License | Purpose |
|--------|---------|---------|
| `cf_voice.models` | MIT | `VoiceFrame` dataclass |
| `cf_voice.events` | MIT | `AudioEvent`, `ToneEvent`, wire format types |
| `cf_voice.io` | MIT | `VoiceIO` base, `MockVoiceIO`, `make_io()` factory |
| `cf_voice.telephony` | MIT (Protocol + Mock), BSL (backends) | `TelephonyBackend` Protocol, `MockTelephonyBackend`, `SignalWireBackend`, `FreeSWITCHBackend`, `make_telephony()` |
| `cf_voice.capture` | BSL 1.1 | `MicVoiceIO` — real mic capture, 2s windowing |
| `cf_voice.stt` | BSL 1.1 | `WhisperSTT` — faster-whisper async wrapper |
| `cf_voice.classify` | BSL 1.1 | `ToneClassifier` — wav2vec2 SER + librosa prosody |
| `cf_voice.diarize` | BSL 1.1 | `Diarizer` — pyannote.audio async wrapper |
| `cf_voice.context` | BSL 1.1 | `ContextClassifier` — high-level consumer API |
BSL applies to inference modules. IO + types + wire format = MIT.
---
## Attribution
Speaker diarization uses [pyannote.audio](https://github.com/pyannote/pyannote-audio) (MIT) and the following gated HuggingFace models (CC BY 4.0):
- `pyannote/speaker-diarization-3.1` — Hervé Bredin et al.
- `pyannote/segmentation-3.0` — Hervé Bredin et al.
CC BY 4.0 requires attribution in any distributed product. The models are gated: each user must accept the license terms on HuggingFace before their `HF_TOKEN` will authorize a download.
---
## Consumed by
- `Circuit-Forge/linnet` — real-time tone annotation PWA (primary consumer)
- `Circuit-Forge/osprey` — telephony bridge voice context (Navigation v0.2.x)
- `Circuit-Forge/falcon` (planned) — phone form-filling, IVR navigation