# cf-voice

CircuitForge voice annotation pipeline. Produces `VoiceFrame` objects from a live audio stream (tone label, confidence, speaker identity, and shift magnitude) and exposes `ToneEvent` as the stable SSE wire type for downstream consumers (Linnet, Osprey, Falcon).

**Status:** Notation v0.1.x. The real inference pipeline is live (faster-whisper STT, wav2vec2 SER, librosa prosody, pyannote diarization). Mock mode is available for dev/CI without a GPU or mic.

## Install

```bash
# Mock mode only (no GPU required)
pip install -e ../cf-voice

# Real inference (STT + tone classifier + diarization)
pip install -e "../cf-voice[inference]"
```

Copy `.env.example` to `.env` and fill in `HF_TOKEN` for diarization.

## Quick start

```python
import asyncio

from cf_voice.context import ContextClassifier


async def run(classifier: ContextClassifier) -> None:
    # `async for` has to run inside a coroutine
    async for frame in classifier.stream():
        print(frame.label, frame.confidence)


# Mock mode (no hardware needed)
asyncio.run(run(ContextClassifier.mock()))

# Real mic capture (requires [inference] extras + CF_VOICE_MOCK unset)
asyncio.run(run(ContextClassifier.from_env()))
```

CLI smoke-test:

```bash
CF_VOICE_MOCK=1 cf-voice-demo
```

---

## VoiceFrame

Produced by `cf_voice.io` (audio capture layer). MIT licensed.

```python
from dataclasses import dataclass


@dataclass
class VoiceFrame:
    label: str              # tone descriptor, e.g. "Warmly impatient"
    confidence: float       # 0.0–1.0
    speaker_id: str         # ephemeral local label, e.g. "speaker_a"
    shift_magnitude: float  # delta from previous frame, 0.0–1.0
    timestamp: float        # session-relative seconds

    def is_reliable(self, threshold=0.6) -> bool: ...
    def is_shift(self, threshold=0.3) -> bool: ...
```

---

## ToneEvent: SSE wire format

`ToneEvent` is the stable SSE wire type emitted by Linnet's annotation stream and consumed by `` embeds in Osprey, Falcon, and other products. **Field names are locked as of cf-voice v0.1.0** (cf-core#40).
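Because the field names are locked, a consumer can decode each event into a fixed record. The following is a minimal sketch under stated assumptions, not part of `cf_voice`: `ToneEventRecord` and `parse_tone_event` are hypothetical names, and the fields mirror the JSON shape and field reference in this section. The unstable `prosody_flags` field is treated as optional so that a change to it would not break parsing.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical consumer-side record; NOT the library's own ToneEvent class.
@dataclass
class ToneEventRecord:
    event_type: str
    timestamp: float
    label: str
    confidence: float
    speaker_id: str
    subtext: Optional[str]
    affect: str
    shift_magnitude: float
    shift_direction: str
    session_id: str
    prosody_flags: List[str] = field(default_factory=list)  # marked unstable in the README


# Fields the README marks as stable; all of them must be present.
STABLE_FIELDS = (
    "event_type", "timestamp", "label", "confidence", "speaker_id",
    "subtext", "affect", "shift_magnitude", "shift_direction", "session_id",
)


def parse_tone_event(payload: str) -> ToneEventRecord:
    """Decode one JSON payload into a ToneEventRecord, rejecting other event types."""
    raw = json.loads(payload)
    if raw.get("event_type") != "tone":
        raise ValueError(f"not a tone event: {raw.get('event_type')!r}")
    missing = [name for name in STABLE_FIELDS if name not in raw]
    if missing:
        raise ValueError(f"missing stable fields: {missing}")
    return ToneEventRecord(
        **{name: raw[name] for name in STABLE_FIELDS},
        prosody_flags=list(raw.get("prosody_flags") or []),
    )


# Sample payload copied from the JSON shape in this README.
sample = json.dumps({
    "event_type": "tone", "timestamp": 4.82, "label": "Warmly impatient",
    "confidence": 0.79, "speaker_id": "speaker_a", "subtext": "Tone: Frustrated",
    "affect": "frustrated", "shift_magnitude": 0.74, "shift_direction": "more_urgent",
    "prosody_flags": ["fast_rate", "rising"], "session_id": "ses_abc123",
})
event = parse_tone_event(sample)
print(event.label, event.confidence)
```

Failing loudly on a missing stable field, while tolerating a missing `prosody_flags`, follows the stability column of the field reference. A real consumer might prefer to log and skip malformed events instead of raising.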

### JSON shape

```json
{
  "event_type": "tone",
  "timestamp": 4.82,
  "label": "Warmly impatient",
  "confidence": 0.79,
  "speaker_id": "speaker_a",
  "subtext": "Tone: Frustrated",
  "affect": "frustrated",
  "shift_magnitude": 0.74,
  "shift_direction": "more_urgent",
  "prosody_flags": ["fast_rate", "rising"],
  "session_id": "ses_abc123"
}
```

### Field reference

| Field | Type | Stable | Description |
|---|---|---|---|
| `event_type` | `"tone"` | yes | Always `"tone"` for ToneEvent |
| `timestamp` | `float` | yes | Seconds since session start |
| `label` | `str` | yes | Human-readable tone descriptor ("Warmly impatient") |
| `confidence` | `float` | yes | 0.0–1.0. Below ~0.55 = speculative |
| `speaker_id` | `str` | yes | Ephemeral diarization label ("speaker_a"). Resets per session |
| `subtext` | `str \| null` | yes | Annotation text. Generic: `"Tone: Frustrated"`. Elcor: `"With barely concealed frustration:"` |
| `affect` | `str` | yes | AFFECT_LABELS key ("frustrated"). See `cf_voice.events.AFFECT_LABELS` |
| `shift_magnitude` | `float` | yes | 0.0–1.0. High = meaningful register change from previous frame |
| `shift_direction` | `str` | yes | `"warmer"` \| `"colder"` \| `"more_urgent"` \| `"stable"` |
| `prosody_flags` | `str[]` | no | Raw prosody signals ("fast_rate", "rising", "flat_pitch", "low_energy"). Subject to change |
| `session_id` | `str` | yes | Caller-assigned. Correlates events to a conversation session |

### SSE envelope

Linnet emits events in standard SSE format:

```
event: tone-event
data: {"event_type":"tone","timestamp":4.82,...}
```

Host apps subscribing via `` receive `MessageEvent` with `type === "tone-event"`.

### Elcor mode

`subtext` switches format when the session is in Elcor mode (easter egg, unlocked by cumulative session time).
Generic is always available; Elcor is opt-in via the session flag:

| Affect | Generic | Elcor |
|---|---|---|
| frustrated | `Tone: Frustrated` | `With barely concealed frustration:` |
| warm | `Tone: Warm` | `Warmly:` |
| scripted | `Tone: Scripted` | `Reading from a script:` |
| dismissive | `Tone: Dismissive` | `With polite dismissiveness:` |
| tired | `Tone: Tired` | `With audible fatigue:` |

---

## Telephony

`cf_voice.telephony` provides the outbound call abstraction for Osprey, Harrier, Ibis, and Kestrel.

### Quick start

```python
import asyncio

from cf_voice.telephony import make_telephony


async def main() -> None:
    # Mock mode: no real calls placed (CF_VOICE_MOCK=1 or mock=True)
    backend = make_telephony(mock=True)

    session = await backend.dial(
        to="+15551234567",
        from_="+18005550000",
        webhook_url="https://yourapp.example.com/voice/events",
        amd=True,  # answering machine detection
    )

    # Adaptive service identification (osprey#21)
    await backend.announce(session.call_sid, "This is an automated assistant.")

    # Navigate IVR
    await backend.send_dtmf(session.call_sid, "2")  # Press 2 for billing

    # Bridge to user's phone once human agent answers
    await backend.bridge(session.call_sid, "+14155550100")

    await backend.hangup(session.call_sid)


asyncio.run(main())
```

### Backend selection

`make_telephony()` resolves the backend in this order:

| Condition | Backend |
|---|---|
| `CF_VOICE_MOCK=1` or `mock=True` | `MockTelephonyBackend` (dev/CI) |
| `CF_SW_PROJECT_ID` env set | `SignalWireBackend` (paid tier) |
| `CF_ESL_PASSWORD` env set | `FreeSWITCHBackend` (free tier, self-hosted) |
| none | `RuntimeError` |

### Installing real backends

```bash
# Paid tier: SignalWire managed telephony
pip install "cf-voice[signalwire]"

# Free tier: self-hosted FreeSWITCH (requires compiled ESL bindings)
pip install "cf-voice[freeswitch]"
```

Set credentials in `.env` (see `.env.example`).

---

## Mock mode

Set `CF_VOICE_MOCK=1` or pass `mock=True` to `make_io()`. Mock mode emits synthetic `VoiceFrame` objects on a timer.
No GPU, microphone, or `HF_TOKEN` is required, and the API surface is identical to real mode.

---

## Module structure

| Module | License | Purpose |
|--------|---------|---------|
| `cf_voice.models` | MIT | `VoiceFrame` dataclass |
| `cf_voice.events` | MIT | `AudioEvent`, `ToneEvent`, wire format types |
| `cf_voice.io` | MIT | `VoiceIO` base, `MockVoiceIO`, `make_io()` factory |
| `cf_voice.telephony` | MIT (Protocol + Mock), BSL (backends) | `TelephonyBackend` Protocol, `MockTelephonyBackend`, `SignalWireBackend`, `FreeSWITCHBackend`, `make_telephony()` |
| `cf_voice.capture` | BSL 1.1 | `MicVoiceIO`: real mic capture, 2s windowing |
| `cf_voice.stt` | BSL 1.1 | `WhisperSTT`: faster-whisper async wrapper |
| `cf_voice.classify` | BSL 1.1 | `ToneClassifier`: wav2vec2 SER + librosa prosody |
| `cf_voice.diarize` | BSL 1.1 | `Diarizer`: pyannote.audio async wrapper |
| `cf_voice.context` | BSL 1.1 | `ContextClassifier`: high-level consumer API |

BSL applies to the inference modules. IO, types, and the wire format are MIT.

---

## Attribution

Speaker diarization uses [pyannote.audio](https://github.com/pyannote/pyannote-audio) (MIT) and the following gated HuggingFace models (CC BY 4.0):

- `pyannote/speaker-diarization-3.1` (Hervé Bredin et al.)
- `pyannote/segmentation-3.0` (Hervé Bredin et al.)

CC BY 4.0 requires attribution in any distributed product. The models are gated: each user must accept the license terms on HuggingFace before their `HF_TOKEN` will authorize a download.

---

## Consumed by

- `Circuit-Forge/linnet`: real-time tone annotation PWA (primary consumer)
- `Circuit-Forge/osprey`: telephony bridge voice context (Navigation v0.2.x)
- `Circuit-Forge/falcon` (planned): phone form-filling, IVR navigation
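
A consumer that reads the annotation stream outside a browser has to split the SSE envelope itself. Whether any of the products listed above do this is an assumption; they may all rely on browser embeds. The sketch below is simplified and handles only the `event:` and `data:` fields of the envelope described under "SSE envelope". `iter_sse_events` is a hypothetical helper, not a `cf_voice` API, and the 0.55 cutoff is the "speculative" threshold from the field reference.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs from an iterable of SSE text lines."""
    event_name = "message"  # SSE default when no `event:` field is sent
    data_parts: List[str] = []
    for raw_line in lines:
        line = raw_line.rstrip("\r\n")
        if line == "":
            # A blank line terminates one event; dispatch only if it carried data.
            if data_parts:
                yield event_name, "\n".join(data_parts)
            event_name, data_parts = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # Drop the single optional space after the colon.
            data_parts.append(value[1:] if value.startswith(" ") else value)


# Abbreviated payload for illustration; real events carry every stable field.
stream = [
    "event: tone-event\n",
    'data: {"event_type": "tone", "label": "Warmly impatient", "confidence": 0.79}\n',
    "\n",
    ": keep-alive\n",
    "\n",
]

tone_events = [
    json.loads(data)
    for name, data in iter_sse_events(stream)
    if name == "tone-event"
]
for ev in tone_events:
    tag = "speculative" if ev["confidence"] < 0.55 else "ok"
    print(ev["label"], tag)
```

Filtering on the event name keeps the consumer working if the stream later carries other event types alongside `tone-event`.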