tracking: cf-core integration points for cf-voice (SSE wire format, preferences hooks) #34

Closed
opened 2026-04-06 10:35:39 -07:00 by pyr0ball · 3 comments
Owner

Summary

Graduate the voice stub to a full cf_voice module. Two active consumers: Osprey (telephony + IVR) and Peregrine (voice I/O for nonverbal users). Design doc: circuitforge-plans/circuitforge-core/2026-04-06-cf-voice-design.md.

Three sub-modules

cf_voice.io — Speech I/O

STT (Whisper local, cloud fallback) and TTS (Piper local, cloud fallback).

Key output: TranscriptResult with confidence and flagged_low_confidence (cf-orch#17 a11y blocker).

Protocol interface:

class STTBackend(Protocol):
    async def transcribe(self, audio: bytes, language: str = "en") -> TranscriptResult: ...

class TTSBackend(Protocol):
    async def synthesize(self, text: str, voice: str = "default") -> bytes: ...
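
A minimal sketch of the `TranscriptResult` side of this contract. The field names beyond `confidence` and `flagged_low_confidence` are assumptions, as is the 0.7 threshold (the real cutoff would come from user preferences):

```python
from dataclasses import dataclass

# Hypothetical default; the shipped value would be user-configurable.
LOW_CONFIDENCE_THRESHOLD = 0.7

@dataclass
class TranscriptResult:
    text: str
    confidence: float           # 0.0-1.0, as reported by the STT backend
    flagged_low_confidence: bool = False

    @classmethod
    def from_backend(cls, text: str, confidence: float) -> "TranscriptResult":
        # Flag low-confidence transcripts so the a11y layer (cf-orch#17)
        # can surface a "may be inaccurate" indicator to the user.
        return cls(text, confidence, confidence < LOW_CONFIDENCE_THRESHOLD)

result = TranscriptResult.from_backend("press one for billing", 0.42)
print(result.flagged_low_confidence)  # True
```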

cf_voice.context — Parallel audio classifier

Non-conversational. Runs alongside STT, never tries to understand words — classifies acoustic features only.

Event classes:

  • queue: hold_music | silence | ringback | busy | dead_air
  • speaker: ivr_synth | human_single | human_multi | transfer
  • environ: call_center | music | background_shift
  • tone: affect + shift + prosody + subtext (Elcor label)

Model strategy:

| Event class | Model |
|---|---|
| Queue / environ | YAMNet or PANN |
| Speaker type / VAD | pyannote.audio |
| IVR synth vs. human | Custom fine-tuned head |
| Tone / affect | SER (wav2vec2-based) |
| Prosody features | librosa |
| Elcor label | Small local LLM via LLMRouter |


Elcor mode (accessibility feature)

Tone shifts and affect are converted to a human-readable annotation prepended to the transcript:

"With escalating impatience:" | "Apologetically, trailing off:" | "Warmly:"

This is an explicit accessibility feature for autistic and ND users who may not reliably perceive implicit tonal/emotional cues in voice interactions. It is opt-in, user-configurable, and rendered locally — no audio or labels leave the device.

The classifier also doubles as local AMD (answering machine detection): background_shift from hold music to call-center ambient is a reliable pre-speech human-answered signal, resolving the FreeSWITCH AMD open question from the telephony spec.
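
The AMD heuristic described above can be sketched as a check over the classifier's label sequence (the flat list-of-labels shape is an assumption; the real stream carries richer events):

```python
def human_answered(labels: list[str]) -> bool:
    """Hypothetical AMD check: a background_shift arriving after
    hold_music is treated as a pre-speech human-answered signal."""
    seen_hold = False
    for label in labels:
        if label == "hold_music":
            seen_hold = True
        elif label == "background_shift" and seen_hold:
            return True
    return False

print(human_answered(["ringback", "hold_music", "background_shift"]))  # True
print(human_answered(["ringback", "silence"]))                         # False
```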

cf_voice.telephony — Outbound telephony abstraction

class TelephonyBackend(Protocol):
    async def dial(self, to: str, from_: str, webhook_url: str) -> str: ...
    async def send_dtmf(self, call_sid: str, digits: str) -> None: ...
    async def bridge(self, call_sid: str, target: str) -> None: ...
    async def hangup(self, call_sid: str) -> None: ...
    async def announce(self, call_sid: str, text: str, voice: str) -> None: ...

class SignalWireBackend(TelephonyBackend): ...
class FreeSWITCHBackend(TelephonyBackend): ...

announce() implements the adaptive service identification requirement (cf-orch#18, osprey#21).
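
A sketch of how a caller might drive this Protocol; the call flow, SID value, and the in-memory stand-in below are assumptions for illustration (the stand-in is not the shipped mock backend):

```python
import asyncio

class RecordingBackend:
    """Minimal in-memory stand-in for TelephonyBackend, for illustration."""
    def __init__(self):
        self.log = []
    async def dial(self, to, from_, webhook_url):
        self.log.append(("dial", to))
        return "CA123"  # hypothetical call SID
    async def announce(self, call_sid, text, voice):
        self.log.append(("announce", call_sid))
    async def hangup(self, call_sid):
        self.log.append(("hangup", call_sid))

async def place_identified_call(backend, to: str, from_: str) -> str:
    # Hypothetical flow: dial, self-identify per cf-orch#18, hang up.
    sid = await backend.dial(to, from_, webhook_url="https://example.invalid/hook")
    await backend.announce(sid, "This call is placed by an automated assistant.",
                           voice="default")
    await backend.hangup(sid)
    return sid

b = RecordingBackend()
sid = asyncio.run(place_identified_call(b, "+15551234567", "+15557654321"))
print(sid, [op for op, _ in b.log])  # CA123 ['dial', 'announce', 'hangup']
```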

Combined output type

@dataclass
class VoiceFrame:
    timestamp: float
    transcript: str | None
    events: list[AudioEvent]
    speaker_id: str | None

Tier mapping

| Tier | STT | TTS | Classifier | Telephony |
|---|---|---|---|---|
| Free | Whisper local | Piper local | All local | FreeSWITCH + BYOK VoIP |
| Paid | Cloud STT fallback | Cloud TTS | Same + cloud SER fallback | SignalWire via CF |

Build sequence

  1. cf_voice.io — unblocks peregrine#74 (voice I/O for nonverbal users)
  2. cf_voice.telephony — unblocks osprey#1
  3. cf_voice.context — queue state + speaker first (AMD), tone + Elcor second

Open questions

  • Piper vs. Coqui TTS for local synthesis
  • wav2vec2 SER model size vs. accuracy trade-off on available VRAM
  • Elcor label prompt design — how many prior frames to include as context
  • Event-driven vs. continuous stream output for context classifier
  • pyannote.audio license (CC BY 4.0, requires HuggingFace account acceptance — document in setup)

References

  • Telephony spec: circuitforge-plans/osprey/superpowers/specs/2026-04-04-telephony-backend-design.md
  • cf-orch#17: a11y service endpoint audit (STT confidence + context checkpoint blockers)
  • cf-orch#18: adaptive service identification contract
  • osprey#1: FTB dialer (blocked on SignalWire credentials)
  • osprey#21: bridge identification implementation
  • peregrine#74: messaging tab + voice I/O for nonverbal users
pyr0ball added this to the v0.8.0 — Pipeline + Hardware + Documents modules milestone 2026-04-06 10:35:39 -07:00
pyr0ball added the architecture, enhancement, priority:high, module:voice labels 2026-04-06 10:35:39 -07:00
Author
Owner

cf_voice is now a standalone repo (Circuit-Forge/cf-voice, MIT/BSL split) rather than a cf-core module — see #35 (closed) and cf-core#39 (closed).

This issue should track any cf-core integration points that depend on cf-voice (e.g. shared VoiceFrame SSE wire format #40, preferences.prefers_reduced_motion #38). Retitling.

pyr0ball changed title from New module: cf_voice — STT/TTS, parallel audio classifier, telephony abstraction to tracking: cf-core integration points for cf-voice (SSE wire format, preferences hooks) 2026-04-06 16:43:24 -07:00
Author
Owner

Progress update

cf_voice.telephony shipped (Notation v0.1.x):

  • TelephonyBackend Protocol (MIT)
  • MockTelephonyBackend — dev/CI, no real calls, AMD simulation (MIT)
  • SignalWireBackend — paid tier (BSL)
  • FreeSWITCHBackend — free tier self-hosted (BSL)
  • make_telephony() factory — env-driven backend selection
  • 19 new tests, all passing (31 total)
  • README telephony section added
  • Optional extras: cf-voice[signalwire], cf-voice[freeswitch]

Unblocks osprey#1. Also a cf_voice.io build fix: the real backend now raises NotImplementedError instead of ImportError when the inference extras are missing.
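
A sketch of what the env-driven backend selection in make_telephony() might look like — the variable name CF_VOICE_TELEPHONY_BACKEND and the default-to-mock behavior are assumptions:

```python
import os

# Placeholder classes standing in for the real backends.
class MockTelephonyBackend: ...
class SignalWireBackend: ...
class FreeSWITCHBackend: ...

_BACKENDS = {
    "mock": MockTelephonyBackend,
    "signalwire": SignalWireBackend,
    "freeswitch": FreeSWITCHBackend,
}

def make_telephony(env=None):
    # Hypothetical env var name; defaulting to the mock means dev/CI
    # never places real calls unless explicitly configured to.
    env = env if env is not None else os.environ
    choice = env.get("CF_VOICE_TELEPHONY_BACKEND", "mock")
    try:
        return _BACKENDS[choice]()
    except KeyError:
        raise ValueError(f"unknown telephony backend: {choice!r}") from None

print(type(make_telephony({})).__name__)  # MockTelephonyBackend
```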

Also closed: #40 (SSE wire format was already documented in README as of previous session).

Remaining:

  • cf-core preferences hooks integration
  • Queue state + speaker type classifier (Navigation v0.2.x)
Author
Owner

Closing — all deliverables complete

What shipped in this pass

cf_voice.prefs (new, MIT):

  • Preference key constants: PREF_ELCOR_MODE, PREF_CONFIDENCE_THRESHOLD, PREF_WHISPER_MODEL, PREF_ELCOR_PRIOR_FRAMES
  • get_voice_pref() / set_voice_pref() — optional cf-core integration, env var fallback, built-in defaults
  • is_elcor_enabled(), get_confidence_threshold(), get_whisper_model(), get_elcor_prior_frames() convenience helpers
  • Graceful degradation: works without circuitforge_core installed
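
The resolution order described above (cf-core store, then env var, then built-in default) can be sketched as follows; the key names, env var scheme, and default values here are assumptions:

```python
import os

# Hypothetical defaults; the real keys and values may differ.
_DEFAULTS = {
    "elcor_mode": "off",
    "confidence_threshold": "0.7",
    "whisper_model": "small",
    "elcor_prior_frames": "4",
}

def get_voice_pref(key: str, store=None) -> str:
    """Resolution sketch: cf-core store -> env var -> built-in default."""
    if store is not None:  # optional cf-core integration
        value = store.get(key)
        if value is not None:
            return value
    env_value = os.environ.get(f"CF_VOICE_{key.upper()}")
    if env_value is not None:
        return env_value
    return _DEFAULTS[key]

print(get_voice_pref("elcor_prior_frames"))  # "4" unless overridden via env
```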

cf_voice.acoustic (new, MIT Protocol + BSL stub):

  • AcousticBackend Protocol (@runtime_checkable)
  • MockAcousticBackend — simulates full call lifecycle: ringback → IVR → hold music → AMD signal (background_shift) → human answered
  • YAMNetAcousticBackend — Navigation v0.2.x stub, clear NotImplementedError
  • make_acoustic() factory
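
A synchronous sketch of the Protocol/mock pairing described above — the method name, label sequence, and sync (rather than async) signature are assumptions:

```python
from typing import Iterable, Protocol, runtime_checkable

@runtime_checkable
class AcousticBackend(Protocol):
    def classify(self, audio: bytes) -> Iterable[str]: ...

class MockAcousticBackend:
    """Sketch: replays a canned call lifecycle, one label per chunk,
    then holds on the final state. Not the shipped mock."""
    _LIFECYCLE = ["ringback", "ivr_synth", "hold_music",
                  "background_shift", "human_single"]

    def __init__(self):
        self._i = 0

    def classify(self, audio: bytes):
        label = self._LIFECYCLE[min(self._i, len(self._LIFECYCLE) - 1)]
        self._i += 1
        return [label]

backend = MockAcousticBackend()
# @runtime_checkable makes structural isinstance checks possible.
print(isinstance(backend, AcousticBackend))  # True
```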

cf_voice.context (extended):

  • Wired to cf_voice.prefs — Elcor mode and prior_frames read from user preference store
  • classify_chunk() now returns all four event types: tone + queue + speaker + environ
  • Acoustic events from MockAcousticBackend in mock mode; YAMNetAcousticBackend stub in real mode (graceful passthrough on NotImplementedError)
  • session_id propagates into ToneEvent
  • user_id + store plumbed through from construction

Tests: 64 passing (was 31)


Open questions resolved

Piper vs. Coqui TTS: Piper. Coqui-TTS is effectively abandoned (last release 2023). Piper is maintained by Nabu Casa (Home Assistant), ships as a single binary with pre-built voices, and runs on CPU without Python binding issues. .env.example will note CF_VOICE_TTS_BACKEND=piper.

wav2vec2 SER model VRAM: ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition (~1.5GB VRAM) is the right pick. On the RTX 4000 SFF Ada (8GB) it fits alongside Whisper small (500MB) with 6GB headroom for the LLM router. If VRAM is tight, the model can run on CPU at ~2× realtime — acceptable for 2s windows.

Elcor label prompt design: 4 prior frames (~8–10 seconds of context at 2.5s intervals). The prompt includes the last N affect labels in sequence so the LLM can label the shift not just the instantaneous affect. Configured via PREF_ELCOR_PRIOR_FRAMES (default 4, user-adjustable).
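
A hedged sketch of the prompt construction this implies — the wording and function name are assumptions, not the shipped prompt:

```python
def build_elcor_prompt(prior_affects: list[str], current_affect: str,
                       n_prior: int = 4) -> str:
    """Hypothetical prompt builder: include the last n_prior affect
    labels so the LLM can describe the shift, not just the
    instantaneous affect."""
    window = prior_affects[-n_prior:]
    sequence = " -> ".join(window + [current_affect])
    return (
        f"Recent affect sequence: {sequence}. "
        "Write a short Elcor-style annotation describing the shift, "
        'e.g. "With escalating impatience:".'
    )

print(build_elcor_prompt(["neutral", "neutral", "annoyed"], "angry"))
```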

Event-driven vs. continuous stream: Event-driven. The streaming path suppresses frames where shift_magnitude < 0.15 (the is_shift() threshold). Queue/environ events only emit on label change. This prevents SSE flooding on long hold-music segments.
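
The suppression step can be sketched as a simple filter over the tone stream; the (timestamp, shift_magnitude) frame shape is an assumption:

```python
SHIFT_THRESHOLD = 0.15  # matches the is_shift() cutoff described above

def emit_tone_frames(frames):
    """Sketch of the event-driven path: drop tone frames whose
    shift_magnitude is below the threshold, so long hold-music
    segments don't flood the SSE stream."""
    for timestamp, magnitude in frames:
        if magnitude >= SHIFT_THRESHOLD:
            yield timestamp, magnitude

kept = list(emit_tone_frames([(0.0, 0.02), (2.5, 0.31), (5.0, 0.14)]))
print(kept)  # [(2.5, 0.31)]
```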

pyannote.audio license: CC BY 4.0 — commercial use is permitted with attribution. The two gated models (pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0) require HuggingFace account acceptance, already documented in .env.example and the diarize.py module header.


Navigation v0.2.x (real YAMNet + pyannote wiring into ContextClassifier.stream()) will be a separate issue.

Reference: Circuit-Forge/circuitforge-core#34