tracking: cf-core integration points for cf-voice (SSE wire format, preferences hooks) #34

Closed
opened 2026-04-06 10:35:39 -07:00 by pyr0ball · 3 comments
Owner

Summary

Graduate the voice stub to a full cf_voice module. Two active consumers: Osprey (telephony + IVR) and Peregrine (voice I/O for nonverbal users). Design doc: circuitforge-plans/circuitforge-core/2026-04-06-cf-voice-design.md.

Three sub-modules

cf_voice.io — Speech I/O

STT (Whisper local, cloud fallback) and TTS (Piper local, cloud fallback).

Key output: TranscriptResult with confidence and flagged_low_confidence (cf-orch#17 a11y blocker).

Protocol interface:

class STTBackend(Protocol):
    async def transcribe(self, audio: bytes, language: str = "en") -> TranscriptResult: ...

class TTSBackend(Protocol):
    async def synthesize(self, text: str, voice: str = "default") -> bytes: ...
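
A minimal sketch of the `TranscriptResult` side of this contract. The field names beyond `confidence` and `flagged_low_confidence` are assumptions, as is the 0.7 threshold (the real cutoff would come from user preferences):

```python
from dataclasses import dataclass

# Hypothetical default; the shipped value would be user-configurable.
LOW_CONFIDENCE_THRESHOLD = 0.7

@dataclass
class TranscriptResult:
    text: str
    confidence: float           # 0.0-1.0, as reported by the STT backend
    flagged_low_confidence: bool = False

    @classmethod
    def from_backend(cls, text: str, confidence: float) -> "TranscriptResult":
        # Flag low-confidence transcripts so the a11y layer (cf-orch#17)
        # can surface a "may be inaccurate" indicator to the user.
        return cls(text, confidence, confidence < LOW_CONFIDENCE_THRESHOLD)

result = TranscriptResult.from_backend("press one for billing", 0.42)
print(result.flagged_low_confidence)  # True
```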

cf_voice.context — Parallel audio classifier

Non-conversational. Runs alongside STT, never tries to understand words — classifies acoustic features only.

Event classes:

  • queue: hold_music | silence | ringback | busy | dead_air
  • speaker: ivr_synth | human_single | human_multi | transfer
  • environ: call_center | music | background_shift
  • tone: affect + shift + prosody + subtext (Elcor label)

Model strategy:

| Event class | Model |
|---|---|
| Queue / environ | YAMNet or PANN |
| Speaker type / VAD | pyannote.audio |
| IVR synth vs. human | Custom fine-tuned head |
| Tone / affect | SER (wav2vec2-based) |
| Prosody features | librosa |
| Elcor label | Small local LLM via LLMRouter |


Elcor mode (accessibility feature)

Tone shifts and affect are converted to a human-readable annotation prepended to the transcript:

"With escalating impatience:" | "Apologetically, trailing off:" | "Warmly:"

This is an explicit accessibility feature for autistic and ND users who may not reliably perceive implicit tonal/emotional cues in voice interactions. It is opt-in, user-configurable, and rendered locally — no audio or labels leave the device.

The classifier also doubles as local AMD (answering machine detection): background_shift from hold music to call-center ambient is a reliable pre-speech human-answered signal, resolving the FreeSWITCH AMD open question from the telephony spec.
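
The AMD heuristic described above can be sketched as a check over the classifier's label sequence (the flat list-of-labels shape is an assumption; the real stream carries richer events):

```python
def human_answered(labels: list[str]) -> bool:
    """Hypothetical AMD check: a background_shift arriving after
    hold_music is treated as a pre-speech human-answered signal."""
    seen_hold = False
    for label in labels:
        if label == "hold_music":
            seen_hold = True
        elif label == "background_shift" and seen_hold:
            return True
    return False

print(human_answered(["ringback", "hold_music", "background_shift"]))  # True
print(human_answered(["ringback", "silence"]))                         # False
```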

cf_voice.telephony — Outbound telephony abstraction

class TelephonyBackend(Protocol):
    async def dial(self, to: str, from_: str, webhook_url: str) -> str: ...
    async def send_dtmf(self, call_sid: str, digits: str) -> None: ...
    async def bridge(self, call_sid: str, target: str) -> None: ...
    async def hangup(self, call_sid: str) -> None: ...
    async def announce(self, call_sid: str, text: str, voice: str) -> None: ...

class SignalWireBackend(TelephonyBackend): ...
class FreeSWITCHBackend(TelephonyBackend): ...

announce() implements the adaptive service identification requirement (cf-orch#18, osprey#21).
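
A sketch of how a caller might drive this Protocol; the call flow, SID value, and the in-memory stand-in below are assumptions for illustration (the stand-in is not the shipped mock backend):

```python
import asyncio

class RecordingBackend:
    """Minimal in-memory stand-in for TelephonyBackend, for illustration."""
    def __init__(self):
        self.log = []
    async def dial(self, to, from_, webhook_url):
        self.log.append(("dial", to))
        return "CA123"  # hypothetical call SID
    async def announce(self, call_sid, text, voice):
        self.log.append(("announce", call_sid))
    async def hangup(self, call_sid):
        self.log.append(("hangup", call_sid))

async def place_identified_call(backend, to: str, from_: str) -> str:
    # Hypothetical flow: dial, self-identify per cf-orch#18, hang up.
    sid = await backend.dial(to, from_, webhook_url="https://example.invalid/hook")
    await backend.announce(sid, "This call is placed by an automated assistant.",
                           voice="default")
    await backend.hangup(sid)
    return sid

b = RecordingBackend()
sid = asyncio.run(place_identified_call(b, "+15551234567", "+15557654321"))
print(sid, [op for op, _ in b.log])  # CA123 ['dial', 'announce', 'hangup']
```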

Combined output type

@dataclass
class VoiceFrame:
    timestamp: float
    transcript: str | None
    events: list[AudioEvent]
    speaker_id: str | None

Tier mapping

| Tier | STT | TTS | Classifier | Telephony |
|---|---|---|---|---|
| Free | Whisper local | Piper local | All local | FreeSWITCH + BYOK VoIP |
| Paid | Cloud STT fallback | Cloud TTS | Same + cloud SER fallback | SignalWire via CF |

Build sequence

  1. cf_voice.io — unblocks peregrine#74 (voice I/O for nonverbal users)
  2. cf_voice.telephony — unblocks osprey#1
  3. cf_voice.context — queue state + speaker first (AMD), tone + Elcor second

Open questions

  • Piper vs. Coqui TTS for local synthesis
  • wav2vec2 SER model size vs. accuracy trade-off on available VRAM
  • Elcor label prompt design — how many prior frames to include as context
  • Event-driven vs. continuous stream output for context classifier
  • pyannote.audio license (CC BY 4.0, requires HuggingFace account acceptance — document in setup)

References

  • Telephony spec: circuitforge-plans/osprey/superpowers/specs/2026-04-04-telephony-backend-design.md
  • cf-orch#17: a11y service endpoint audit (STT confidence + context checkpoint blockers)
  • cf-orch#18: adaptive service identification contract
  • osprey#1: FTB dialer (blocked on SignalWire credentials)
  • osprey#21: bridge identification implementation
  • peregrine#74: messaging tab + voice I/O for nonverbal users
pyr0ball added this to the v0.8.0 — Pipeline + Hardware + Documents modules milestone 2026-04-06 10:35:39 -07:00
pyr0ball added the architecture, enhancement, priority:high, module:voice labels 2026-04-06 10:35:39 -07:00
Author
Owner

cf_voice is now a standalone repo (Circuit-Forge/cf-voice, MIT/BSL split) rather than a cf-core module — see #35 (closed) and cf-core#39 (closed).

This issue should track any cf-core integration points that depend on cf-voice (e.g. shared VoiceFrame SSE wire format #40, preferences.prefers_reduced_motion #38). Retitling.

pyr0ball changed title from New module: cf_voice — STT/TTS, parallel audio classifier, telephony abstraction to tracking: cf-core integration points for cf-voice (SSE wire format, preferences hooks) 2026-04-06 16:43:24 -07:00
Author
Owner

Progress update

cf_voice.telephony shipped (Notation v0.1.x):

  • TelephonyBackend Protocol (MIT)
  • MockTelephonyBackend — dev/CI, no real calls, AMD simulation (MIT)
  • SignalWireBackend — paid tier (BSL)
  • FreeSWITCHBackend — free tier self-hosted (BSL)
  • make_telephony() factory — env-driven backend selection
  • 19 new tests, all passing (31 total)
  • README telephony section added
  • Optional extras: cf-voice[signalwire], cf-voice[freeswitch]

Unblocks osprey#1. Also a cf_voice.io build fix: the real backend now raises NotImplementedError instead of ImportError when the inference extras are missing.
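
A sketch of what the env-driven backend selection in make_telephony() might look like — the variable name CF_VOICE_TELEPHONY_BACKEND and the default-to-mock behavior are assumptions:

```python
import os

# Placeholder classes standing in for the real backends.
class MockTelephonyBackend: ...
class SignalWireBackend: ...
class FreeSWITCHBackend: ...

_BACKENDS = {
    "mock": MockTelephonyBackend,
    "signalwire": SignalWireBackend,
    "freeswitch": FreeSWITCHBackend,
}

def make_telephony(env=None):
    # Hypothetical env var name; defaulting to the mock means dev/CI
    # never places real calls unless explicitly configured to.
    env = env if env is not None else os.environ
    choice = env.get("CF_VOICE_TELEPHONY_BACKEND", "mock")
    try:
        return _BACKENDS[choice]()
    except KeyError:
        raise ValueError(f"unknown telephony backend: {choice!r}") from None

print(type(make_telephony({})).__name__)  # MockTelephonyBackend
```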

Also closed: #40 (SSE wire format was already documented in README as of previous session).

Remaining:

  • cf-core preferences hooks integration
  • Queue state + speaker type classifier (Navigation v0.2.x)
Author
Owner

Closing — all deliverables complete

What shipped in this pass

cf_voice.prefs (new, MIT):

  • Preference key constants: PREF_ELCOR_MODE, PREF_CONFIDENCE_THRESHOLD, PREF_WHISPER_MODEL, PREF_ELCOR_PRIOR_FRAMES
  • get_voice_pref() / set_voice_pref() — optional cf-core integration, env var fallback, built-in defaults
  • is_elcor_enabled(), get_confidence_threshold(), get_whisper_model(), get_elcor_prior_frames() convenience helpers
  • Graceful degradation: works without circuitforge_core installed
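
The resolution order described above (cf-core store, then env var, then built-in default) can be sketched as follows; the key names, env var scheme, and default values here are assumptions:

```python
import os

# Hypothetical defaults; the real keys and values may differ.
_DEFAULTS = {
    "elcor_mode": "off",
    "confidence_threshold": "0.7",
    "whisper_model": "small",
    "elcor_prior_frames": "4",
}

def get_voice_pref(key: str, store=None) -> str:
    """Resolution sketch: cf-core store -> env var -> built-in default."""
    if store is not None:  # optional cf-core integration
        value = store.get(key)
        if value is not None:
            return value
    env_value = os.environ.get(f"CF_VOICE_{key.upper()}")
    if env_value is not None:
        return env_value
    return _DEFAULTS[key]

print(get_voice_pref("elcor_prior_frames"))  # "4" unless overridden via env
```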

cf_voice.acoustic (new, MIT Protocol + BSL stub):

  • AcousticBackend Protocol (@runtime_checkable)
  • MockAcousticBackend — simulates full call lifecycle: ringback → IVR → hold music → AMD signal (background_shift) → human answered
  • YAMNetAcousticBackend — Navigation v0.2.x stub, clear NotImplementedError
  • make_acoustic() factory
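
A synchronous sketch of the Protocol/mock pairing described above — the method name, label sequence, and sync (rather than async) signature are assumptions:

```python
from typing import Iterable, Protocol, runtime_checkable

@runtime_checkable
class AcousticBackend(Protocol):
    def classify(self, audio: bytes) -> Iterable[str]: ...

class MockAcousticBackend:
    """Sketch: replays a canned call lifecycle, one label per chunk,
    then holds on the final state. Not the shipped mock."""
    _LIFECYCLE = ["ringback", "ivr_synth", "hold_music",
                  "background_shift", "human_single"]

    def __init__(self):
        self._i = 0

    def classify(self, audio: bytes):
        label = self._LIFECYCLE[min(self._i, len(self._LIFECYCLE) - 1)]
        self._i += 1
        return [label]

backend = MockAcousticBackend()
# @runtime_checkable makes structural isinstance checks possible.
print(isinstance(backend, AcousticBackend))  # True
```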

cf_voice.context (extended):

  • Wired to cf_voice.prefs — Elcor mode and prior_frames read from user preference store
  • classify_chunk() now returns all four event types: tone + queue + speaker + environ
  • Acoustic events from MockAcousticBackend in mock mode; YAMNetAcousticBackend stub in real mode (graceful passthrough on NotImplementedError)
  • session_id propagates into ToneEvent
  • user_id + store plumbed through from construction

Tests: 64 passing (was 31)


Open questions resolved

Piper vs. Coqui TTS: Piper. Coqui-TTS is effectively abandoned (last release 2023). Piper is maintained by Nabu Casa (Home Assistant), ships as a single binary with pre-built voices, and runs on CPU without Python binding issues. .env.example will note CF_VOICE_TTS_BACKEND=piper.

wav2vec2 SER model VRAM: ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition (~1.5GB VRAM) is the right pick. On the RTX 4000 SFF Ada (8GB) it fits alongside Whisper small (500MB) with 6GB headroom for the LLM router. If VRAM is tight, the model can run on CPU at ~2× realtime — acceptable for 2s windows.

Elcor label prompt design: 4 prior frames (~8–10 seconds of context at 2.5s intervals). The prompt includes the last N affect labels in sequence so the LLM can label the shift not just the instantaneous affect. Configured via PREF_ELCOR_PRIOR_FRAMES (default 4, user-adjustable).
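
A hedged sketch of the prompt construction this implies — the wording and function name are assumptions, not the shipped prompt:

```python
def build_elcor_prompt(prior_affects: list[str], current_affect: str,
                       n_prior: int = 4) -> str:
    """Hypothetical prompt builder: include the last n_prior affect
    labels so the LLM can describe the shift, not just the
    instantaneous affect."""
    window = prior_affects[-n_prior:]
    sequence = " -> ".join(window + [current_affect])
    return (
        f"Recent affect sequence: {sequence}. "
        "Write a short Elcor-style annotation describing the shift, "
        'e.g. "With escalating impatience:".'
    )

print(build_elcor_prompt(["neutral", "neutral", "annoyed"], "angry"))
```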

Event-driven vs. continuous stream: Event-driven. The streaming path suppresses frames where shift_magnitude < 0.15 (the is_shift() threshold). Queue/environ events only emit on label change. This prevents SSE flooding on long hold-music segments.
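
The suppression step can be sketched as a simple filter over the tone stream; the (timestamp, shift_magnitude) frame shape is an assumption:

```python
SHIFT_THRESHOLD = 0.15  # matches the is_shift() cutoff described above

def emit_tone_frames(frames):
    """Sketch of the event-driven path: drop tone frames whose
    shift_magnitude is below the threshold, so long hold-music
    segments don't flood the SSE stream."""
    for timestamp, magnitude in frames:
        if magnitude >= SHIFT_THRESHOLD:
            yield timestamp, magnitude

kept = list(emit_tone_frames([(0.0, 0.02), (2.5, 0.31), (5.0, 0.14)]))
print(kept)  # [(2.5, 0.31)]
```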

pyannote.audio license: CC BY 4.0 — commercial use is permitted with attribution. The two gated models (pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0) require HuggingFace account acceptance, already documented in .env.example and the diarize.py module header.


Navigation v0.2.x (real YAMNet + pyannote wiring into ContextClassifier.stream()) will be a separate issue.

Reference: Circuit-Forge/circuitforge-core#34