# cf-voice

CircuitForge voice annotation pipeline. Produces `VoiceFrame` objects from a live audio stream — tone label, confidence, speaker identity, and shift magnitude — and exposes `ToneEvent` as the stable SSE wire type for downstream consumers (Linnet, Osprey, Falcon).

**Status:** Notation v0.1.x — real inference pipeline live (faster-whisper STT, wav2vec2 SER, librosa prosody, pyannote diarization). Mock mode available for dev/CI without a GPU or mic.

## Install

```sh
# Mock mode only (no GPU required)
pip install -e ../cf-voice

# Real inference (STT + tone classifier + diarization)
pip install -e "../cf-voice[inference]"
```

Copy `.env.example` to `.env` and fill in `HF_TOKEN` for diarization.

## Quick start

```python
from cf_voice.context import ContextClassifier

# Mock mode (no hardware needed)
classifier = ContextClassifier.mock()
async for frame in classifier.stream():
    print(frame.label, frame.confidence)

# Real mic capture (requires [inference] extras + CF_VOICE_MOCK unset)
classifier = ContextClassifier.from_env()
async for frame in classifier.stream():
    print(frame.label, frame.confidence)
```

CLI smoke-test:

```sh
CF_VOICE_MOCK=1 cf-voice-demo
```

## VoiceFrame

Produced by `cf_voice.io` (audio capture layer). MIT licensed.

```python
@dataclass
class VoiceFrame:
    label: str            # tone descriptor, e.g. "Warmly impatient"
    confidence: float     # 0.0–1.0
    speaker_id: str       # ephemeral local label, e.g. "speaker_a"
    shift_magnitude: float  # delta from previous frame, 0.0–1.0
    timestamp: float      # session-relative seconds

    def is_reliable(self, threshold=0.6) -> bool: ...
    def is_shift(self, threshold=0.3) -> bool: ...
```
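The elided method bodies are presumably simple threshold checks; a minimal self-contained sketch of the intended semantics (defaults taken from the signatures above, internals assumed):

```python
from dataclasses import dataclass

@dataclass
class VoiceFrame:
    label: str
    confidence: float
    speaker_id: str
    shift_magnitude: float
    timestamp: float

    def is_reliable(self, threshold: float = 0.6) -> bool:
        # Confidence gate: frames below the threshold are speculative.
        return self.confidence >= threshold

    def is_shift(self, threshold: float = 0.3) -> bool:
        # Register-change gate relative to the previous frame.
        return self.shift_magnitude >= threshold

frame = VoiceFrame("Warmly impatient", 0.79, "speaker_a", 0.74, 4.82)
```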

## ToneEvent — SSE wire format

`ToneEvent` is the stable SSE wire type emitted by Linnet's annotation stream and consumed by `<LinnetWidget />` embeds in Osprey, Falcon, and other products.

Field names are locked as of cf-voice v0.1.0 (cf-core#40).

### JSON shape

```json
{
  "event_type": "tone",
  "timestamp": 4.82,
  "label": "Warmly impatient",
  "confidence": 0.79,
  "speaker_id": "speaker_a",
  "subtext": "Tone: Frustrated",
  "affect": "frustrated",
  "shift_magnitude": 0.74,
  "shift_direction": "more_urgent",
  "prosody_flags": ["fast_rate", "rising"],
  "session_id": "ses_abc123"
}
```

### Field reference

| Field | Type | Stable | Description |
|---|---|---|---|
| `event_type` | `"tone"` | yes | Always `"tone"` for ToneEvent |
| `timestamp` | float | yes | Seconds since session start |
| `label` | str | yes | Human-readable tone descriptor (`"Warmly impatient"`) |
| `confidence` | float | yes | 0.0–1.0. Below ~0.55 = speculative |
| `speaker_id` | str | yes | Ephemeral diarization label (`"speaker_a"`). Resets per session |
| `subtext` | str \| null | yes | Annotation text. Generic: `"Tone: Frustrated"`. Elcor: `"With barely concealed frustration:"` |
| `affect` | str | yes | AFFECT_LABELS key (`"frustrated"`). See `cf_voice.events.AFFECT_LABELS` |
| `shift_magnitude` | float | yes | 0.0–1.0. High = meaningful register change from previous frame |
| `shift_direction` | str | yes | `"warmer"` \| `"colder"` \| `"more_urgent"` \| `"stable"` |
| `prosody_flags` | str[] | no | Raw prosody signals (`"fast_rate"`, `"rising"`, `"flat_pitch"`, `"low_energy"`). Subject to change |
| `session_id` | str | yes | Caller-assigned. Correlates events to a conversation session |
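For consumers that want a typed view of the payload, here is a hedged sketch of deserializing a ToneEvent; the dataclass and `parse_tone_event` helper are illustrative, not the library's own types:

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ToneEvent:
    event_type: str
    timestamp: float
    label: str
    confidence: float
    speaker_id: str
    affect: str
    shift_magnitude: float
    shift_direction: str
    session_id: str
    subtext: Optional[str] = None
    # Marked unstable upstream; tolerate values you don't recognize.
    prosody_flags: List[str] = field(default_factory=list)

def parse_tone_event(raw: str) -> ToneEvent:
    payload = json.loads(raw)
    if payload.get("event_type") != "tone":
        raise ValueError("not a ToneEvent payload")
    return ToneEvent(**payload)

evt = parse_tone_event(
    '{"event_type":"tone","timestamp":4.82,"label":"Warmly impatient",'
    '"confidence":0.79,"speaker_id":"speaker_a","subtext":"Tone: Frustrated",'
    '"affect":"frustrated","shift_magnitude":0.74,"shift_direction":"more_urgent",'
    '"prosody_flags":["fast_rate","rising"],"session_id":"ses_abc123"}'
)
```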

### SSE envelope

Linnet emits events in standard SSE format:

```
event: tone-event
data: {"event_type":"tone","timestamp":4.82,...}
```

Host apps subscribing via `<LinnetWidget />` receive a MessageEvent with `type === "tone-event"`.
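A Python consumer without a browser can split the envelope with a minimal line-based SSE parser; this is an illustrative sketch of the wire format above, not cf-voice API:

```python
def parse_sse(stream_text: str):
    """Yield (event, data) pairs from an SSE text stream (minimal sketch)."""
    event, data_lines = None, []
    for line in stream_text.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "" and data_lines:
            # Blank line terminates one SSE event.
            yield event, "\n".join(data_lines)
            event, data_lines = None, []

sample = 'event: tone-event\ndata: {"event_type":"tone","timestamp":4.82}\n\n'
events = list(parse_sse(sample))
```

A production consumer would also handle `id:` and `retry:` fields and multi-line `data:` payloads per the SSE spec; this sketch covers only what Linnet's envelope uses.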

## Elcor mode

`subtext` switches format when the session is in Elcor mode (an easter egg, unlocked by cumulative session time). Generic is always available; Elcor is opt-in via the session flag:

| Affect | Generic | Elcor |
|---|---|---|
| frustrated | Tone: Frustrated | With barely concealed frustration: |
| warm | Tone: Warm | Warmly: |
| scripted | Tone: Scripted | Reading from a script: |
| dismissive | Tone: Dismissive | With polite dismissiveness: |
| tired | Tone: Tired | With audible fatigue: |
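The table reduces to a plain lookup keyed by affect and mode; a sketch (the real formatting logic in `cf_voice.events` may differ):

```python
# (generic, elcor) subtext pairs, copied from the table above.
SUBTEXT = {
    "frustrated": ("Tone: Frustrated", "With barely concealed frustration:"),
    "warm": ("Tone: Warm", "Warmly:"),
    "scripted": ("Tone: Scripted", "Reading from a script:"),
    "dismissive": ("Tone: Dismissive", "With polite dismissiveness:"),
    "tired": ("Tone: Tired", "With audible fatigue:"),
}

def subtext_for(affect: str, elcor: bool = False) -> str:
    generic, elcor_text = SUBTEXT[affect]
    return elcor_text if elcor else generic
```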


## Telephony

`cf_voice.telephony` provides the outbound call abstraction for Osprey, Harrier, Ibis, and Kestrel.

### Quick start

```python
from cf_voice.telephony import make_telephony

# Mock mode — no real calls placed (CF_VOICE_MOCK=1 or mock=True)
backend = make_telephony(mock=True)

session = await backend.dial(
    to="+15551234567",
    from_="+18005550000",
    webhook_url="https://yourapp.example.com/voice/events",
    amd=True,   # answering machine detection
)

# Adaptive service identification (osprey#21)
await backend.announce(session.call_sid, "This is an automated assistant.")

# Navigate IVR
await backend.send_dtmf(session.call_sid, "2")   # Press 2 for billing

# Bridge to the user's phone once a human agent answers
await backend.bridge(session.call_sid, "+14155550100")

await backend.hangup(session.call_sid)
```

### Backend selection

`make_telephony()` resolves the backend in this order:

| Condition | Backend |
|---|---|
| `CF_VOICE_MOCK=1` or `mock=True` | `MockTelephonyBackend` (dev/CI) |
| `CF_SW_PROJECT_ID` env set | `SignalWireBackend` (paid tier) |
| `CF_ESL_PASSWORD` env set | `FreeSWITCHBackend` (free tier, self-hosted) |
| none | `RuntimeError` |
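The resolution order can be pictured as a small factory; the class bodies here are stubs and the internals are assumed, not copied from `cf_voice.telephony`:

```python
import os

# Stub backends standing in for the real classes.
class MockTelephonyBackend: ...
class SignalWireBackend: ...
class FreeSWITCHBackend: ...

def resolve_backend(mock: bool = False):
    """Sketch of the documented resolution order for make_telephony()."""
    if mock or os.environ.get("CF_VOICE_MOCK") == "1":
        return MockTelephonyBackend()
    if os.environ.get("CF_SW_PROJECT_ID"):
        return SignalWireBackend()
    if os.environ.get("CF_ESL_PASSWORD"):
        return FreeSWITCHBackend()
    raise RuntimeError("no telephony backend configured")
```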

### Installing real backends

```sh
# Paid tier — SignalWire managed telephony
pip install "cf-voice[signalwire]"

# Free tier — self-hosted FreeSWITCH (requires compiled ESL bindings)
pip install "cf-voice[freeswitch]"
```

Set credentials in `.env` (see `.env.example`).


## Mock mode

Set `CF_VOICE_MOCK=1` or pass `mock=True` to `make_io()`. Mock mode emits synthetic `VoiceFrame` objects on a timer; no GPU, microphone, or `HF_TOKEN` is required. The API surface is identical to real mode.
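The behavior is easy to emulate: mock mode is essentially an async generator producing frames on a timer. A hypothetical stand-in, not the real `MockVoiceIO`:

```python
import asyncio

async def mock_stream(n: int = 3, interval: float = 0.0):
    # Emit n synthetic frames, one per timer tick (dicts stand in for VoiceFrame).
    for i in range(n):
        await asyncio.sleep(interval)
        yield {"label": "Neutral", "confidence": 0.9, "timestamp": float(i)}

async def collect():
    return [frame async for frame in mock_stream()]

frames = asyncio.run(collect())
```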


## Module structure

| Module | License | Purpose |
|---|---|---|
| `cf_voice.models` | MIT | `VoiceFrame` dataclass |
| `cf_voice.events` | MIT | `AudioEvent`, `ToneEvent`, wire format types |
| `cf_voice.io` | MIT | `VoiceIO` base, `MockVoiceIO`, `make_io()` factory |
| `cf_voice.telephony` | MIT (Protocol + Mock), BSL (backends) | `TelephonyBackend` Protocol, `MockTelephonyBackend`, `SignalWireBackend`, `FreeSWITCHBackend`, `make_telephony()` |
| `cf_voice.capture` | BSL 1.1 | `MicVoiceIO` — real mic capture, 2 s windowing |
| `cf_voice.stt` | BSL 1.1 | `WhisperSTT` — faster-whisper async wrapper |
| `cf_voice.classify` | BSL 1.1 | `ToneClassifier` — wav2vec2 SER + librosa prosody |
| `cf_voice.diarize` | BSL 1.1 | `Diarizer` — pyannote.audio async wrapper |
| `cf_voice.context` | BSL 1.1 | `ContextClassifier` — high-level consumer API |

BSL applies to the inference modules; IO, types, and the wire format are MIT.



## Attribution

Speaker diarization uses pyannote.audio (MIT) and the following gated HuggingFace models (CC BY 4.0):

- pyannote/speaker-diarization-3.1 — Hervé Bredin et al.
- pyannote/segmentation-3.0 — Hervé Bredin et al.

CC BY 4.0 requires attribution in any distributed product. The models are gated: each user must accept the license terms on HuggingFace before their `HF_TOKEN` will authorize a download.


## Consumed by

- Circuit-Forge/linnet — real-time tone annotation PWA (primary consumer)
- Circuit-Forge/osprey — telephony bridge voice context (Navigation v0.2.x)
- Circuit-Forge/falcon (planned) — phone form-filling, IVR navigation