cf-voice
CircuitForge voice annotation pipeline. Produces VoiceFrame objects from a live audio stream — tone label, confidence, speaker identity, and shift magnitude — and exposes ToneEvent as the stable SSE wire type for downstream consumers (Linnet, Osprey, Falcon).
Status: Notation v0.1.x — real inference pipeline live (faster-whisper STT, wav2vec2 SER, librosa prosody, pyannote diarization). Mock mode available for dev/CI without GPU or mic.
Prerequisites
Start here: mock mode (no GPU, no HuggingFace account)
If you are integrating against the cf-voice API or running CI, start with mock mode. No hardware, no model download, no accounts required:
pip install -e ../cf-voice
CF_VOICE_MOCK=1 python -m cf_voice.app --port 8007
Mock mode emits synthetic VoiceFrame objects on a timer. All API surface is identical to real inference.
Moving to real inference: HuggingFace gated models
Speaker diarization uses two gated HuggingFace models. Before your HF_TOKEN will authorise a download, you must individually accept the licence terms for each:
- pyannote/speaker-diarization-3.1 — click "Agree and access repository"
- pyannote/segmentation-3.0 — click "Agree and access repository"
This is a one-time step per HuggingFace account. If you skip it, the service will fail at startup with a 401 Unauthorized error from HuggingFace.
Licence note: The pyannote models are CC BY 4.0. Attribution is required in any distributed product. Set HF_TOKEN to a token belonging to an account that has accepted the above terms. Using a shared or third-party token that has not individually accepted the terms violates the licence.
Hardware
| Component | Minimum |
|---|---|
| GPU | Any CUDA-capable GPU |
| VRAM | 4GB+ recommended |
| Microphone | Required for live capture (not needed for file processing or mock mode) |
Install
# Mock mode only (no GPU required)
pip install -e ../cf-voice
# Real inference (STT + tone classifier + diarization)
pip install -e "../cf-voice[inference]"
Copy .env.example to .env and fill in HF_TOKEN for diarization.
Quick start
from cf_voice.context import ContextClassifier
# Run these loops inside an async function (for example via asyncio.run)
# Mock mode (no hardware needed)
classifier = ContextClassifier.mock()
async for frame in classifier.stream():
    print(frame.label, frame.confidence)
# Real mic capture (requires [inference] extras + CF_VOICE_MOCK unset)
classifier = ContextClassifier.from_env()
async for frame in classifier.stream():
    print(frame.label, frame.confidence)
CLI smoke-test:
CF_VOICE_MOCK=1 cf-voice-demo
VoiceFrame
Produced by cf_voice.io (audio capture layer). MIT licensed.
@dataclass
class VoiceFrame:
    label: str               # tone descriptor, e.g. "Warmly impatient"
    confidence: float        # 0.0–1.0
    speaker_id: str          # ephemeral local label, e.g. "speaker_a"
    shift_magnitude: float   # delta from previous frame, 0.0–1.0
    timestamp: float         # session-relative seconds
    def is_reliable(self, threshold=0.6) -> bool: ...
    def is_shift(self, threshold=0.3) -> bool: ...
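The two helper methods are only declared above. The following is a minimal sketch of what they might do, assuming each is an inclusive threshold comparison on `confidence` and `shift_magnitude` respectively. `VoiceFrameSketch` is a hypothetical stand-in, and the real cf_voice implementation may differ.

```python
from dataclasses import dataclass


@dataclass
class VoiceFrameSketch:
    """Illustrative stand-in for VoiceFrame; not the cf_voice class."""

    label: str
    confidence: float
    speaker_id: str
    shift_magnitude: float
    timestamp: float

    def is_reliable(self, threshold: float = 0.6) -> bool:
        # Assumption: reliable means confidence at or above the threshold.
        return self.confidence >= threshold

    def is_shift(self, threshold: float = 0.3) -> bool:
        # Assumption: a shift is any magnitude at or above the threshold.
        return self.shift_magnitude >= threshold


frame = VoiceFrameSketch("Warmly impatient", 0.79, "speaker_a", 0.74, 4.82)
quiet = VoiceFrameSketch("Warmly impatient", 0.50, "speaker_a", 0.10, 6.82)
print(frame.is_reliable(), frame.is_shift())
print(quiet.is_reliable(), quiet.is_shift())
```

A consumer would typically call `is_reliable()` before acting on `label`, and `is_shift()` to decide whether a change is worth surfacing.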
ToneEvent — SSE wire format
ToneEvent is the stable SSE wire type emitted by Linnet's annotation stream
and consumed by <LinnetWidget /> embeds in Osprey, Falcon, and other products.
Field names are locked as of cf-voice v0.1.0 (cf-core#40).
JSON shape
{
  "event_type": "tone",
  "timestamp": 4.82,
  "label": "Warmly impatient",
  "confidence": 0.79,
  "speaker_id": "speaker_a",
  "subtext": "Tone: Frustrated",
  "affect": "frustrated",
  "shift_magnitude": 0.74,
  "shift_direction": "more_urgent",
  "prosody_flags": ["fast_rate", "rising"],
  "session_id": "ses_abc123"
}
Field reference
| Field | Type | Stable | Description |
|---|---|---|---|
| `event_type` | `"tone"` | yes | Always "tone" for ToneEvent |
| `timestamp` | `float` | yes | Seconds since session start |
| `label` | `str` | yes | Human-readable tone descriptor ("Warmly impatient") |
| `confidence` | `float` | yes | 0.0–1.0. Below ~0.55 = speculative |
| `speaker_id` | `str` | yes | Ephemeral diarization label ("speaker_a"). Resets per session |
| `subtext` | `str \| null` | yes | Annotation text. Generic: "Tone: Frustrated". Elcor: "With barely concealed frustration:" |
| `affect` | `str` | yes | AFFECT_LABELS key ("frustrated"). See cf_voice.events.AFFECT_LABELS |
| `shift_magnitude` | `float` | yes | 0.0–1.0. High = meaningful register change from previous frame |
| `shift_direction` | `str` | yes | "warmer" \| "colder" \| "more_urgent" \| "stable" |
| `prosody_flags` | `str[]` | no | Raw prosody signals ("fast_rate", "rising", "flat_pitch", "low_energy"). Subject to change |
| `session_id` | `str` | yes | Caller-assigned. Correlates events to a conversation session |
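On the consumer side, the field reference can be turned into a small typed parser. This is a sketch only: `ToneEventSketch` and `parse_tone_event` are hypothetical names, not part of cf_voice, and treating the unstable `prosody_flags` field as optional is an assumption.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional

SHIFT_DIRECTIONS = {"warmer", "colder", "more_urgent", "stable"}


@dataclass
class ToneEventSketch:
    """Illustrative consumer-side model; not the cf_voice.events class."""

    timestamp: float
    label: str
    confidence: float
    speaker_id: str
    subtext: Optional[str]
    affect: str
    shift_magnitude: float
    shift_direction: str
    session_id: str
    # Not a stable field, so default to an empty list when it is absent.
    prosody_flags: List[str] = field(default_factory=list)


def parse_tone_event(raw: str) -> ToneEventSketch:
    payload = json.loads(raw)
    if payload.get("event_type") != "tone":
        raise ValueError("not a ToneEvent payload")
    if payload["shift_direction"] not in SHIFT_DIRECTIONS:
        raise ValueError("unknown shift_direction")
    return ToneEventSketch(
        timestamp=float(payload["timestamp"]),
        label=payload["label"],
        confidence=float(payload["confidence"]),
        speaker_id=payload["speaker_id"],
        subtext=payload.get("subtext"),
        affect=payload["affect"],
        shift_magnitude=float(payload["shift_magnitude"]),
        shift_direction=payload["shift_direction"],
        session_id=payload["session_id"],
        prosody_flags=list(payload.get("prosody_flags", [])),
    )


example = json.dumps({
    "event_type": "tone", "timestamp": 4.82, "label": "Warmly impatient",
    "confidence": 0.79, "speaker_id": "speaker_a",
    "subtext": "Tone: Frustrated", "affect": "frustrated",
    "shift_magnitude": 0.74, "shift_direction": "more_urgent",
    "prosody_flags": ["fast_rate", "rising"], "session_id": "ses_abc123",
})
event = parse_tone_event(example)
print(event.label, event.shift_direction)
```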
SSE envelope
Linnet emits events in standard SSE format:
event: tone-event
data: {"event_type":"tone","timestamp":4.82,...}
Host apps subscribing via <LinnetWidget /> receive MessageEvent with type === "tone-event".
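For consumers that read the stream outside a browser, the envelope can be decoded with a few lines of standard-library Python. The sketch below assumes the whole stream text is already in memory and that every `data:` payload is JSON. `iter_sse_events` is a hypothetical helper, not a cf_voice function.

```python
import json
from typing import Dict, Iterator, List, Tuple


def iter_sse_events(stream_text: str) -> Iterator[Tuple[str, Dict]]:
    """Yield (event name, decoded JSON data) for each SSE event."""
    event_name = "message"  # SSE default when no "event:" line is present
    data_lines: List[str] = []
    # A trailing blank line guarantees the last event is flushed.
    for line in stream_text.splitlines() + [""]:
        if line == "":
            if data_lines:
                yield event_name, json.loads("\n".join(data_lines))
            event_name, data_lines = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        else:
            name, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # SSE strips one leading space
            if name == "event":
                event_name = value
            elif name == "data":
                data_lines.append(value)


sample = (
    "event: tone-event\n"
    'data: {"event_type":"tone","timestamp":4.82}\n'
    "\n"
)
events = list(iter_sse_events(sample))
print(events)
```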
Elcor mode
subtext switches format when the session is in Elcor mode (easter egg, unlocked by cumulative session time). Generic is always available; Elcor is opt-in via the session flag:
| Affect | Generic | Elcor |
|---|---|---|
| frustrated | Tone: Frustrated | With barely concealed frustration: |
| warm | Tone: Warm | Warmly: |
| scripted | Tone: Scripted | Reading from a script: |
| dismissive | Tone: Dismissive | With polite dismissiveness: |
| tired | Tone: Tired | With audible fatigue: |
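The mapping in the table can be sketched as a lookup with a generic fallback. `format_subtext` and its `elcor` flag are hypothetical names, and falling back to the generic form for affects without an Elcor entry is an assumption based on "Generic is always available".

```python
# Elcor phrasings copied from the table above.
ELCOR_SUBTEXT = {
    "frustrated": "With barely concealed frustration:",
    "warm": "Warmly:",
    "scripted": "Reading from a script:",
    "dismissive": "With polite dismissiveness:",
    "tired": "With audible fatigue:",
}


def format_subtext(affect: str, elcor: bool = False) -> str:
    """Return the annotation text for an affect key."""
    if elcor and affect in ELCOR_SUBTEXT:
        return ELCOR_SUBTEXT[affect]
    # Assumption: the generic form is "Tone: " plus the capitalised key.
    return f"Tone: {affect.capitalize()}"


print(format_subtext("frustrated"))
print(format_subtext("frustrated", elcor=True))
```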
Telephony
cf_voice.telephony provides the outbound call abstraction for Osprey, Harrier, Ibis, and Kestrel.
Quick start
from cf_voice.telephony import make_telephony
# Mock mode — no real calls placed (CF_VOICE_MOCK=1 or mock=True)
backend = make_telephony(mock=True)
session = await backend.dial(
    to="+15551234567",
    from_="+18005550000",
    webhook_url="https://yourapp.example.com/voice/events",
    amd=True,  # answering machine detection
)
# Adaptive service identification (osprey#21)
await backend.announce(session.call_sid, "This is an automated assistant.")
# Navigate IVR
await backend.send_dtmf(session.call_sid, "2") # Press 2 for billing
# Bridge to user's phone once human agent answers
await backend.bridge(session.call_sid, "+14155550100")
await backend.hangup(session.call_sid)
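The call flow above implies a small backend interface. The following sketch shows one way such an interface and a recording mock could look. Every name and signature here (`TelephonyBackendSketch`, `CallSessionSketch`, `RecordingMockBackend`) is hypothetical and inferred only from the calls shown. The actual TelephonyBackend Protocol in cf_voice.telephony may differ.

```python
import asyncio
from dataclasses import dataclass, field
from typing import List, Protocol, Tuple


@dataclass
class CallSessionSketch:
    call_sid: str


class TelephonyBackendSketch(Protocol):
    """Hypothetical interface inferred from the quick start calls."""

    async def dial(self, to: str, from_: str, webhook_url: str,
                   amd: bool = False) -> CallSessionSketch: ...
    async def announce(self, call_sid: str, text: str) -> None: ...
    async def send_dtmf(self, call_sid: str, digits: str) -> None: ...
    async def bridge(self, call_sid: str, target: str) -> None: ...
    async def hangup(self, call_sid: str) -> None: ...


@dataclass
class RecordingMockBackend:
    """Records each call instead of placing it."""

    log: List[Tuple[str, ...]] = field(default_factory=list)

    async def dial(self, to, from_, webhook_url, amd=False):
        self.log.append(("dial", to))
        return CallSessionSketch(call_sid="mock-call-1")

    async def announce(self, call_sid, text):
        self.log.append(("announce", call_sid, text))

    async def send_dtmf(self, call_sid, digits):
        self.log.append(("send_dtmf", call_sid, digits))

    async def bridge(self, call_sid, target):
        self.log.append(("bridge", call_sid, target))

    async def hangup(self, call_sid):
        self.log.append(("hangup", call_sid))


async def run_flow(backend: TelephonyBackendSketch) -> None:
    session = await backend.dial(
        to="+15551234567",
        from_="+18005550000",
        webhook_url="https://yourapp.example.com/voice/events",
        amd=True,
    )
    await backend.announce(session.call_sid, "This is an automated assistant.")
    await backend.send_dtmf(session.call_sid, "2")
    await backend.bridge(session.call_sid, "+14155550100")
    await backend.hangup(session.call_sid)


backend = RecordingMockBackend()
asyncio.run(run_flow(backend))
print([entry[0] for entry in backend.log])
```

Because the mock only records calls, the same flow can be asserted in CI without placing a real call.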
Backend selection
make_telephony() resolves the backend in this order:
| Condition | Backend |
|---|---|
| `CF_VOICE_MOCK=1` or `mock=True` | `MockTelephonyBackend` (dev/CI) |
| `CF_SW_PROJECT_ID` env set | `SignalWireBackend` (paid tier) |
| `CF_ESL_PASSWORD` env set | `FreeSWITCHBackend` (free tier, self-hosted) |
| none | RuntimeError |
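The resolution order can be sketched as a chain of checks where the first match wins. `resolve_backend_name` is a hypothetical helper that returns the backend class name as a string rather than constructing anything. It assumes "env set" means a non-empty value.

```python
import os
from typing import Mapping, Optional


def resolve_backend_name(
    mock: bool = False, env: Optional[Mapping[str, str]] = None
) -> str:
    """Return the name of the backend the table above would select."""
    env = os.environ if env is None else env
    if mock or env.get("CF_VOICE_MOCK") == "1":
        return "MockTelephonyBackend"
    if env.get("CF_SW_PROJECT_ID"):  # assumption: non-empty counts as set
        return "SignalWireBackend"
    if env.get("CF_ESL_PASSWORD"):
        return "FreeSWITCHBackend"
    raise RuntimeError("no telephony backend configured")


print(resolve_backend_name(env={"CF_VOICE_MOCK": "1", "CF_SW_PROJECT_ID": "p"}))
print(resolve_backend_name(env={"CF_ESL_PASSWORD": "secret"}))
```

Mock mode is checked first, so a CI environment with `CF_VOICE_MOCK=1` never reaches a real backend even when credentials are also present.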
Installing real backends
# Paid tier: SignalWire managed telephony
pip install "cf-voice[signalwire]"
# Free tier: self-hosted FreeSWITCH (requires compiled ESL bindings)
pip install "cf-voice[freeswitch]"
Set credentials in .env (see .env.example).
Mock mode
Set CF_VOICE_MOCK=1 or pass mock=True to make_io(). Emits synthetic VoiceFrame objects on a timer. No GPU, microphone, or HF_TOKEN required. All API surface is identical to real mode.
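To make "synthetic frames on a timer" concrete, here is a minimal sketch of such an emitter. It is not the MockVoiceIO implementation. The `mock_frames` generator, the sample labels, the dictionary shape, and the seeded random values are all assumptions for illustration.

```python
import asyncio
import random
from typing import AsyncIterator, Dict, List

# Hypothetical sample labels; the real mock may emit different ones.
SAMPLE_LABELS = ["Warmly impatient", "Calm", "Tired"]


async def mock_frames(
    count: int = 3, interval_s: float = 0.01, seed: int = 0
) -> AsyncIterator[Dict[str, object]]:
    """Yield `count` synthetic frames, one every `interval_s` seconds."""
    rng = random.Random(seed)  # seeded so CI runs are repeatable
    for index in range(count):
        await asyncio.sleep(interval_s)
        yield {
            "label": rng.choice(SAMPLE_LABELS),
            "confidence": round(rng.uniform(0.0, 1.0), 2),
            "speaker_id": "speaker_a",
            "shift_magnitude": round(rng.uniform(0.0, 1.0), 2),
            "timestamp": round((index + 1) * interval_s, 3),
        }


async def collect_frames() -> List[Dict[str, object]]:
    return [frame async for frame in mock_frames()]


frames = asyncio.run(collect_frames())
print(len(frames))
```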
Module structure
| Module | License | Purpose |
|---|---|---|
| `cf_voice.models` | MIT | VoiceFrame dataclass |
| `cf_voice.events` | MIT | AudioEvent, ToneEvent, wire format types |
| `cf_voice.io` | MIT | VoiceIO base, MockVoiceIO, make_io() factory |
| `cf_voice.telephony` | MIT (Protocol + Mock), BSL (backends) | TelephonyBackend Protocol, MockTelephonyBackend, SignalWireBackend, FreeSWITCHBackend, make_telephony() |
| `cf_voice.capture` | BSL 1.1 | MicVoiceIO: real mic capture, 2s windowing |
| `cf_voice.stt` | BSL 1.1 | WhisperSTT: faster-whisper async wrapper |
| `cf_voice.classify` | BSL 1.1 | ToneClassifier: wav2vec2 SER + librosa prosody |
| `cf_voice.diarize` | BSL 1.1 | Diarizer: pyannote.audio async wrapper |
| `cf_voice.context` | BSL 1.1 | ContextClassifier: high-level consumer API |
BSL applies to inference modules. IO + types + wire format = MIT.
Attribution
Speaker diarization uses pyannote.audio (MIT) and the following gated HuggingFace models (CC BY 4.0):
- pyannote/speaker-diarization-3.1 (Hervé Bredin et al.)
- pyannote/segmentation-3.0 (Hervé Bredin et al.)
CC BY 4.0 requires attribution in any distributed product. The models are gated: each user must accept the license terms on HuggingFace before their HF_TOKEN will authorize a download.
Consumed by
- Circuit-Forge/linnet: real-time tone annotation PWA (primary consumer)
- Circuit-Forge/osprey: telephony bridge voice context (Navigation v0.2.x)
- Circuit-Forge/falcon (planned): phone form-filling, IVR navigation