# cf-voice

CircuitForge voice annotation pipeline. Produces `VoiceFrame` objects from a live audio stream — tone label, confidence, speaker identity, and shift magnitude — and exposes `ToneEvent` as the stable SSE wire type for downstream consumers (Linnet, Osprey, Falcon).

**Status:** Notation v0.1.x — real inference pipeline live (faster-whisper STT, wav2vec2 SER, librosa prosody, pyannote diarization). Mock mode is available for dev/CI without a GPU or mic.
## Install

```bash
# Mock mode only (no GPU required)
pip install -e ../cf-voice

# Real inference (STT + tone classifier + diarization)
pip install -e "../cf-voice[inference]"
```

Copy `.env.example` to `.env` and fill in `HF_TOKEN` for diarization.
## Quick start

```python
from cf_voice.context import ContextClassifier

# Mock mode (no hardware needed)
classifier = ContextClassifier.mock()
async for frame in classifier.stream():
    print(frame.label, frame.confidence)

# Real mic capture (requires [inference] extras + CF_VOICE_MOCK unset)
classifier = ContextClassifier.from_env()
async for frame in classifier.stream():
    print(frame.label, frame.confidence)
```

CLI smoke-test:

```bash
CF_VOICE_MOCK=1 cf-voice-demo
```
## VoiceFrame

Produced by `cf_voice.io` (audio capture layer). MIT licensed.

```python
@dataclass
class VoiceFrame:
    label: str              # tone descriptor, e.g. "Warmly impatient"
    confidence: float       # 0.0–1.0
    speaker_id: str         # ephemeral local label, e.g. "speaker_a"
    shift_magnitude: float  # delta from previous frame, 0.0–1.0
    timestamp: float        # session-relative seconds

    def is_reliable(self, threshold=0.6) -> bool: ...
    def is_shift(self, threshold=0.3) -> bool: ...
```
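The two predicates are threshold comparisons over the frame's own fields. A minimal self-contained sketch — the canonical class lives in `cf_voice.models`, and these method bodies are plausible implementations, not the shipped ones:

```python
from dataclasses import dataclass


@dataclass
class VoiceFrame:
    label: str
    confidence: float
    speaker_id: str
    shift_magnitude: float
    timestamp: float

    def is_reliable(self, threshold: float = 0.6) -> bool:
        # Confident enough for downstream consumers to act on.
        return self.confidence >= threshold

    def is_shift(self, threshold: float = 0.3) -> bool:
        # Did the vocal register change meaningfully since the last frame?
        return self.shift_magnitude >= threshold


frame = VoiceFrame("Warmly impatient", 0.79, "speaker_a", 0.74, 4.82)
print(frame.is_reliable(), frame.is_shift())  # True True
```

Consumers typically gate display on `is_reliable()` and only re-annotate on `is_shift()`, so a steady tone does not spam the stream.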
## ToneEvent — SSE wire format

`ToneEvent` is the stable SSE wire type emitted by Linnet's annotation stream and consumed by `<LinnetWidget />` embeds in Osprey, Falcon, and other products. Field names are locked as of cf-voice v0.1.0 (cf-core#40).
### JSON shape

```json
{
  "event_type": "tone",
  "timestamp": 4.82,
  "label": "Warmly impatient",
  "confidence": 0.79,
  "speaker_id": "speaker_a",
  "subtext": "Tone: Frustrated",
  "affect": "frustrated",
  "shift_magnitude": 0.74,
  "shift_direction": "more_urgent",
  "prosody_flags": ["fast_rate", "rising"],
  "session_id": "ses_abc123"
}
```
### Field reference

| Field | Type | Stable | Description |
|---|---|---|---|
| `event_type` | `"tone"` | yes | Always `"tone"` for `ToneEvent` |
| `timestamp` | `float` | yes | Seconds since session start |
| `label` | `str` | yes | Human-readable tone descriptor (`"Warmly impatient"`) |
| `confidence` | `float` | yes | 0.0–1.0. Below ~0.55 = speculative |
| `speaker_id` | `str` | yes | Ephemeral diarization label (`"speaker_a"`). Resets per session |
| `subtext` | `str \| null` | yes | Annotation text. Generic: `"Tone: Frustrated"`. Elcor: `"With barely concealed frustration:"` |
| `affect` | `str` | yes | `AFFECT_LABELS` key (`"frustrated"`). See `cf_voice.events.AFFECT_LABELS` |
| `shift_magnitude` | `float` | yes | 0.0–1.0. High = meaningful register change from previous frame |
| `shift_direction` | `str` | yes | `"warmer"` \| `"colder"` \| `"more_urgent"` \| `"stable"` |
| `prosody_flags` | `str[]` | no | Raw prosody signals (`"fast_rate"`, `"rising"`, `"flat_pitch"`, `"low_energy"`). Subject to change |
| `session_id` | `str` | yes | Caller-assigned. Correlates events to a conversation session |
### SSE envelope

Linnet emits events in standard SSE format:

```
event: tone-event
data: {"event_type":"tone","timestamp":4.82,...}
```

Host apps subscribing via `<LinnetWidget />` receive a `MessageEvent` with `type === "tone-event"`.
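For non-browser consumers, the envelope is easy to parse by hand. A minimal sketch of an SSE line parser handling only the `event:`/`data:` fields used here (illustrative; production consumers should prefer an SSE client library or the browser's `EventSource`):

```python
def parse_sse(stream: str):
    """Yield (event, data) pairs from decoded SSE text.

    Per the SSE format, a blank line terminates each message; multiple
    data: lines within one message are joined with newlines.
    """
    event, data_lines = None, []
    for line in stream.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "" and data_lines:
            yield event, "\n".join(data_lines)
            event, data_lines = None, []
```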
## Elcor mode

`subtext` switches format when the session is in Elcor mode (an easter egg unlocked by cumulative session time). Generic is always available; Elcor is opt-in via the session flag:

| Affect | Generic | Elcor |
|---|---|---|
| frustrated | Tone: Frustrated | With barely concealed frustration: |
| warm | Tone: Warm | Warmly: |
| scripted | Tone: Scripted | Reading from a script: |
| dismissive | Tone: Dismissive | With polite dismissiveness: |
| tired | Tone: Tired | With audible fatigue: |
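The mapping above can be reimplemented in a few lines. A hypothetical sketch — `ELCOR_SUBTEXT` and `format_subtext` are illustrative names, not cf-voice's actual internals:

```python
# Elcor-mode phrasings keyed by affect, copied from the table above.
ELCOR_SUBTEXT = {
    "frustrated": "With barely concealed frustration:",
    "warm": "Warmly:",
    "scripted": "Reading from a script:",
    "dismissive": "With polite dismissiveness:",
    "tired": "With audible fatigue:",
}


def format_subtext(affect: str, elcor: bool = False) -> str:
    """Return the subtext string for an affect; generic unless Elcor is on."""
    if elcor and affect in ELCOR_SUBTEXT:
        return ELCOR_SUBTEXT[affect]
    # Generic format falls back gracefully for unknown affects too.
    return f"Tone: {affect.capitalize()}"
```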
## Telephony

`cf_voice.telephony` provides the outbound call abstraction for Osprey, Harrier, Ibis, and Kestrel.

### Quick start

```python
from cf_voice.telephony import make_telephony

# Mock mode — no real calls placed (CF_VOICE_MOCK=1 or mock=True)
backend = make_telephony(mock=True)

session = await backend.dial(
    to="+15551234567",
    from_="+18005550000",
    webhook_url="https://yourapp.example.com/voice/events",
    amd=True,  # answering machine detection
)

# Adaptive service identification (osprey#21)
await backend.announce(session.call_sid, "This is an automated assistant.")

# Navigate IVR
await backend.send_dtmf(session.call_sid, "2")  # Press 2 for billing

# Bridge to the user's phone once a human agent answers
await backend.bridge(session.call_sid, "+14155550100")

await backend.hangup(session.call_sid)
```
### Backend selection

`make_telephony()` resolves the backend in this order:

| Condition | Backend |
|---|---|
| `CF_VOICE_MOCK=1` or `mock=True` | `MockTelephonyBackend` (dev/CI) |
| `CF_SW_PROJECT_ID` env set | `SignalWireBackend` (paid tier) |
| `CF_ESL_PASSWORD` env set | `FreeSWITCHBackend` (free tier, self-hosted) |
| none | `RuntimeError` |
### Installing real backends

```bash
# Paid tier — SignalWire managed telephony
pip install "cf-voice[signalwire]"

# Free tier — self-hosted FreeSWITCH (requires compiled ESL bindings)
pip install "cf-voice[freeswitch]"
```

Set credentials in `.env` (see `.env.example`).
## Mock mode

Set `CF_VOICE_MOCK=1` or pass `mock=True` to `make_io()`. Mock mode emits synthetic `VoiceFrame` objects on a timer. No GPU, microphone, or `HF_TOKEN` required. The API surface is identical to real mode.
## Module structure

| Module | License | Purpose |
|---|---|---|
| `cf_voice.models` | MIT | `VoiceFrame` dataclass |
| `cf_voice.events` | MIT | `AudioEvent`, `ToneEvent`, wire format types |
| `cf_voice.io` | MIT | `VoiceIO` base, `MockVoiceIO`, `make_io()` factory |
| `cf_voice.telephony` | MIT (Protocol + Mock), BSL (backends) | `TelephonyBackend` Protocol, `MockTelephonyBackend`, `SignalWireBackend`, `FreeSWITCHBackend`, `make_telephony()` |
| `cf_voice.capture` | BSL 1.1 | `MicVoiceIO` — real mic capture, 2 s windowing |
| `cf_voice.stt` | BSL 1.1 | `WhisperSTT` — faster-whisper async wrapper |
| `cf_voice.classify` | BSL 1.1 | `ToneClassifier` — wav2vec2 SER + librosa prosody |
| `cf_voice.diarize` | BSL 1.1 | `Diarizer` — pyannote.audio async wrapper |
| `cf_voice.context` | BSL 1.1 | `ContextClassifier` — high-level consumer API |

BSL applies to the inference modules; IO, types, and wire format are MIT.
## Attribution

Speaker diarization uses pyannote.audio (MIT) and the following gated HuggingFace models (CC BY 4.0):

- `pyannote/speaker-diarization-3.1` — Hervé Bredin et al.
- `pyannote/segmentation-3.0` — Hervé Bredin et al.

CC BY 4.0 requires attribution in any distributed product. The models are gated: each user must accept the license terms on HuggingFace before their `HF_TOKEN` will authorize a download.
## Consumed by

- `Circuit-Forge/linnet` — real-time tone annotation PWA (primary consumer)
- `Circuit-Forge/osprey` — telephony bridge voice context (Navigation v0.2.x)
- `Circuit-Forge/falcon` (planned) — phone form-filling, IVR navigation