# cf-voice

CircuitForge voice annotation pipeline. Produces `VoiceFrame` objects from a live audio stream — tone label, confidence, speaker identity, and shift magnitude — and exposes `ToneEvent` as the stable SSE wire type for downstream consumers (Linnet, Osprey, Falcon).

**Status:** Notation v0.1.x — real inference pipeline live (faster-whisper STT, wav2vec2 SER, librosa prosody, pyannote diarization). Mock mode available for dev/CI without GPU or mic.

## Install

```bash
# Mock mode only (no GPU required)
pip install -e ../cf-voice

# Real inference (STT + tone classifier + diarization)
pip install -e "../cf-voice[inference]"
```

Copy `.env.example` to `.env` and fill in `HF_TOKEN` for diarization.

## Quick start

```python
from cf_voice.context import ContextClassifier

# Mock mode (no hardware needed)
classifier = ContextClassifier.mock()
async for frame in classifier.stream():
    print(frame.label, frame.confidence)

# Real mic capture (requires [inference] extras + CF_VOICE_MOCK unset)
classifier = ContextClassifier.from_env()
async for frame in classifier.stream():
    print(frame.label, frame.confidence)
```

CLI smoke-test:

```bash
CF_VOICE_MOCK=1 cf-voice-demo
```

---

## VoiceFrame

Produced by `cf_voice.io` (audio capture layer). MIT licensed.

```python
@dataclass
class VoiceFrame:
    label: str              # tone descriptor, e.g. "Warmly impatient"
    confidence: float       # 0.0–1.0
    speaker_id: str         # ephemeral local label, e.g. "speaker_a"
    shift_magnitude: float  # delta from previous frame, 0.0–1.0
    timestamp: float        # session-relative seconds

    def is_reliable(self, threshold=0.6) -> bool: ...
    def is_shift(self, threshold=0.3) -> bool: ...
```
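
As a self-contained sketch, the two helpers reduce to threshold checks. The method bodies below are illustrative assumptions, not the shipped implementation:

```python
from dataclasses import dataclass

# Sketch of VoiceFrame with the helper bodies filled in; the real
# implementations may differ, but the thresholds match the signatures above.
@dataclass
class VoiceFrame:
    label: str
    confidence: float
    speaker_id: str
    shift_magnitude: float
    timestamp: float

    def is_reliable(self, threshold=0.6) -> bool:
        # confident enough to act on
        return self.confidence >= threshold

    def is_shift(self, threshold=0.3) -> bool:
        # register changed meaningfully since the previous frame
        return self.shift_magnitude >= threshold

frame = VoiceFrame("Warmly impatient", 0.79, "speaker_a", 0.74, 4.82)
print(frame.is_reliable(), frame.is_shift())  # True True
```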

---

## ToneEvent — SSE wire format

`ToneEvent` is the stable SSE wire type emitted by Linnet's annotation stream and consumed by `<LinnetWidget />` embeds in Osprey, Falcon, and other products.

**Field names are locked as of cf-voice v0.1.0** (cf-core#40).

### JSON shape

```json
{
  "event_type": "tone",
  "timestamp": 4.82,
  "label": "Warmly impatient",
  "confidence": 0.79,
  "speaker_id": "speaker_a",
  "subtext": "Tone: Frustrated",
  "affect": "frustrated",
  "shift_magnitude": 0.74,
  "shift_direction": "more_urgent",
  "prosody_flags": ["fast_rate", "rising"],
  "session_id": "ses_abc123"
}
```

### Field reference

| Field | Type | Stable | Description |
|---|---|---|---|
| `event_type` | `"tone"` | yes | Always `"tone"` for ToneEvent |
| `timestamp` | `float` | yes | Seconds since session start |
| `label` | `str` | yes | Human-readable tone descriptor ("Warmly impatient") |
| `confidence` | `float` | yes | 0.0–1.0. Below ~0.55 = speculative |
| `speaker_id` | `str` | yes | Ephemeral diarization label ("speaker_a"). Resets per session |
| `subtext` | `str \| null` | yes | Annotation text. Generic: `"Tone: Frustrated"`. Elcor: `"With barely concealed frustration:"` |
| `affect` | `str` | yes | AFFECT_LABELS key ("frustrated"). See `cf_voice.events.AFFECT_LABELS` |
| `shift_magnitude` | `float` | yes | 0.0–1.0. High = meaningful register change from previous frame |
| `shift_direction` | `str` | yes | `"warmer"` \| `"colder"` \| `"more_urgent"` \| `"stable"` |
| `prosody_flags` | `str[]` | no | Raw prosody signals ("fast_rate", "rising", "flat_pitch", "low_energy"). Subject to change |
| `session_id` | `str` | yes | Caller-assigned. Correlates events to a conversation session |
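
Consumers that want to fail fast on malformed payloads can validate the locked fields before use. The helper and `REQUIRED` set below are illustrative, not part of `cf_voice`:

```python
import json

# Hypothetical validator for the ToneEvent wire shape documented above.
# Field names come from the locked v0.1.0 schema; the helper itself is a sketch.
REQUIRED = {
    "event_type", "timestamp", "label", "confidence", "speaker_id",
    "affect", "shift_magnitude", "shift_direction", "session_id",
}

def parse_tone_event(raw: str) -> dict:
    event = json.loads(raw)
    missing = REQUIRED - event.keys()
    if missing or event.get("event_type") != "tone":
        raise ValueError(f"not a ToneEvent (missing: {sorted(missing)})")
    return event

evt = parse_tone_event(
    '{"event_type": "tone", "timestamp": 4.82, "label": "Warmly impatient",'
    ' "confidence": 0.79, "speaker_id": "speaker_a", "affect": "frustrated",'
    ' "shift_magnitude": 0.74, "shift_direction": "more_urgent",'
    ' "session_id": "ses_abc123"}'
)
```

Optional fields (`subtext`, `prosody_flags`) are deliberately left out of the required set, since `prosody_flags` is not yet stable.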

### SSE envelope

Linnet emits events in standard SSE format:

```
event: tone-event
data: {"event_type":"tone","timestamp":4.82,...}

```

Host apps subscribing via `<LinnetWidget />` receive `MessageEvent` with `type === "tone-event"`.
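
The framing reduces to an `event:` line, a compact `data:` payload, and the blank line that terminates each SSE event. The helper below is a sketch; Linnet's actual emitter may differ:

```python
import json

# Sketch of the SSE framing shown above; the function name is illustrative.
def to_sse(event: dict) -> str:
    payload = json.dumps(event, separators=(",", ":"))  # compact JSON, no spaces
    # trailing blank line terminates the event per the SSE spec
    return f"event: tone-event\ndata: {payload}\n\n"

chunk = to_sse({"event_type": "tone", "timestamp": 4.82})
```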

### Elcor mode

`subtext` switches format when the session is in Elcor mode (easter egg, unlocked by cumulative session time). Generic is always available; Elcor is opt-in via the session flag:

| Affect | Generic | Elcor |
|---|---|---|
| frustrated | `Tone: Frustrated` | `With barely concealed frustration:` |
| warm | `Tone: Warm` | `Warmly:` |
| scripted | `Tone: Scripted` | `Reading from a script:` |
| dismissive | `Tone: Dismissive` | `With polite dismissiveness:` |
| tired | `Tone: Tired` | `With audible fatigue:` |
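
A subtext formatter mirroring the table could look like the sketch below; the dict and function names are illustrative (the shipped lookup lives elsewhere in `cf_voice`):

```python
# Illustrative lookup tables, transcribed from the table above.
GENERIC = {
    "frustrated": "Tone: Frustrated",
    "warm": "Tone: Warm",
    "scripted": "Tone: Scripted",
    "dismissive": "Tone: Dismissive",
    "tired": "Tone: Tired",
}
ELCOR = {
    "frustrated": "With barely concealed frustration:",
    "warm": "Warmly:",
    "scripted": "Reading from a script:",
    "dismissive": "With polite dismissiveness:",
    "tired": "With audible fatigue:",
}

def subtext(affect: str, elcor_mode: bool = False) -> str:
    # Elcor phrasing only when the session flag is set; generic otherwise.
    table = ELCOR if elcor_mode else GENERIC
    return table[affect]
```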

---

## Telephony

`cf_voice.telephony` provides the outbound call abstraction for Osprey, Harrier, Ibis, and Kestrel.

### Quick start

```python
from cf_voice.telephony import make_telephony

# Mock mode — no real calls placed (CF_VOICE_MOCK=1 or mock=True)
backend = make_telephony(mock=True)

session = await backend.dial(
    to="+15551234567",
    from_="+18005550000",
    webhook_url="https://yourapp.example.com/voice/events",
    amd=True,  # answering machine detection
)

# Adaptive service identification (osprey#21)
await backend.announce(session.call_sid, "This is an automated assistant.")

# Navigate IVR
await backend.send_dtmf(session.call_sid, "2")  # Press 2 for billing

# Bridge to user's phone once a human agent answers
await backend.bridge(session.call_sid, "+14155550100")

await backend.hangup(session.call_sid)
```

### Backend selection

`make_telephony()` resolves the backend in this order:

| Condition | Backend |
|---|---|
| `CF_VOICE_MOCK=1` or `mock=True` | `MockTelephonyBackend` (dev/CI) |
| `CF_SW_PROJECT_ID` env set | `SignalWireBackend` (paid tier) |
| `CF_ESL_PASSWORD` env set | `FreeSWITCHBackend` (free tier, self-hosted) |
| none | `RuntimeError` |
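
The precedence in the table can be sketched in plain Python. Backend classes are stood in by strings here, and the real factory's signature may differ:

```python
import os

# Sketch of the resolution order above; returns a backend name rather than
# constructing a backend so the precedence logic stands alone.
def resolve_backend(mock: bool = False, env=None) -> str:
    env = os.environ if env is None else env
    if mock or env.get("CF_VOICE_MOCK") == "1":
        return "MockTelephonyBackend"
    if env.get("CF_SW_PROJECT_ID"):
        return "SignalWireBackend"
    if env.get("CF_ESL_PASSWORD"):
        return "FreeSWITCHBackend"
    raise RuntimeError("no telephony backend configured")
```

Note that mock mode wins even when real credentials are present, which keeps CI runs from placing calls.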

### Installing real backends

```bash
# Paid tier — SignalWire managed telephony
pip install "cf-voice[signalwire]"

# Free tier — self-hosted FreeSWITCH (requires compiled ESL bindings)
pip install "cf-voice[freeswitch]"
```

Set credentials in `.env` (see `.env.example`).

---

## Mock mode

Set `CF_VOICE_MOCK=1` or pass `mock=True` to `make_io()`. Emits synthetic `VoiceFrame` objects on a timer. No GPU, microphone, or `HF_TOKEN` required. The API surface is identical to real mode.

---

## Module structure

| Module | License | Purpose |
|--------|---------|---------|
| `cf_voice.models` | MIT | `VoiceFrame` dataclass |
| `cf_voice.events` | MIT | `AudioEvent`, `ToneEvent`, wire format types |
| `cf_voice.io` | MIT | `VoiceIO` base, `MockVoiceIO`, `make_io()` factory |
| `cf_voice.telephony` | MIT (Protocol + Mock), BSL (backends) | `TelephonyBackend` Protocol, `MockTelephonyBackend`, `SignalWireBackend`, `FreeSWITCHBackend`, `make_telephony()` |
| `cf_voice.capture` | BSL 1.1 | `MicVoiceIO` — real mic capture, 2s windowing |
| `cf_voice.stt` | BSL 1.1 | `WhisperSTT` — faster-whisper async wrapper |
| `cf_voice.classify` | BSL 1.1 | `ToneClassifier` — wav2vec2 SER + librosa prosody |
| `cf_voice.diarize` | BSL 1.1 | `Diarizer` — pyannote.audio async wrapper |
| `cf_voice.context` | BSL 1.1 | `ContextClassifier` — high-level consumer API |

BSL 1.1 applies to the inference modules; the IO layer, type definitions, and wire format are MIT.

---

## Attribution

Speaker diarization uses [pyannote.audio](https://github.com/pyannote/pyannote-audio) (MIT) and the following gated HuggingFace models (CC BY 4.0):

- `pyannote/speaker-diarization-3.1` — Hervé Bredin et al.
- `pyannote/segmentation-3.0` — Hervé Bredin et al.

CC BY 4.0 requires attribution in any distributed product. The models are gated: each user must accept the license terms on HuggingFace before their `HF_TOKEN` will authorize a download.

---

## Consumed by

- `Circuit-Forge/linnet` — real-time tone annotation PWA (primary consumer)
- `Circuit-Forge/osprey` — telephony bridge voice context (Navigation v0.2.x)
- `Circuit-Forge/falcon` (planned) — phone form-filling, IVR navigation