# cf-voice

CircuitForge voice annotation pipeline. Produces `VoiceFrame` objects from a live audio stream — tone label, confidence, speaker identity, and shift magnitude — and exposes `ToneEvent` as the stable SSE wire type for downstream consumers (Linnet, Osprey, Falcon).

**Status:** Notation v0.1.x. The real inference pipeline is live (faster-whisper STT, wav2vec2 SER, librosa prosody, pyannote diarization). Mock mode is available for dev/CI without a GPU or microphone.

## Install

```bash
# Mock mode only (no GPU required)
pip install -e ../cf-voice

# Real inference (STT + tone classifier + diarization)
pip install -e "../cf-voice[inference]"
```

Copy `.env.example` to `.env` and fill in `HF_TOKEN` for diarization.

## Quick start

```python
import asyncio

from cf_voice.context import ContextClassifier


async def main() -> None:
    # Mock mode (no hardware needed)
    classifier = ContextClassifier.mock()

    # Real mic capture (requires [inference] extras + CF_VOICE_MOCK unset):
    # classifier = ContextClassifier.from_env()

    # `async for` must run inside a coroutine, hence the main() wrapper.
    async for frame in classifier.stream():
        print(frame.label, frame.confidence)


asyncio.run(main())
```

CLI smoke-test:

```bash
CF_VOICE_MOCK=1 cf-voice-demo
```

---

## VoiceFrame

Defined in `cf_voice.models` and produced by `cf_voice.io` (audio capture layer). MIT licensed.

```python
from dataclasses import dataclass


@dataclass
class VoiceFrame:
    label: str              # tone descriptor, e.g. "Warmly impatient"
    confidence: float       # 0.0–1.0
    speaker_id: str         # ephemeral local label, e.g. "speaker_a"
    shift_magnitude: float  # delta from previous frame, 0.0–1.0
    timestamp: float        # session-relative seconds

    def is_reliable(self, threshold: float = 0.6) -> bool: ...
    def is_shift(self, threshold: float = 0.3) -> bool: ...
```
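
As a rough illustration of how the two helper methods might be used downstream, the sketch below redeclares a stand-in `VoiceFrame`. The method bodies (`confidence >= threshold`, `shift_magnitude >= threshold`), the `notable_frames` helper, and the `"Neutral"` label are assumptions made for the example; the real comparisons in `cf_voice.models` may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VoiceFrame:
    """Stand-in for cf_voice.models.VoiceFrame; method bodies are assumed."""
    label: str
    confidence: float
    speaker_id: str
    shift_magnitude: float
    timestamp: float

    def is_reliable(self, threshold: float = 0.6) -> bool:
        # Assumption: reliable means confidence at or above the threshold.
        return self.confidence >= threshold

    def is_shift(self, threshold: float = 0.3) -> bool:
        # Assumption: a shift means magnitude at or above the threshold.
        return self.shift_magnitude >= threshold


def notable_frames(frames: List[VoiceFrame]) -> List[VoiceFrame]:
    """Hypothetical helper: keep frames that are reliable and mark a tone shift."""
    return [f for f in frames if f.is_reliable() and f.is_shift()]


frames = [
    VoiceFrame("Warmly impatient", 0.79, "speaker_a", 0.74, 4.82),
    VoiceFrame("Neutral", 0.41, "speaker_a", 0.05, 6.82),  # illustrative label
]
print([f.label for f in notable_frames(frames)])
```

Filtering on both predicates is one way for a consumer to surface only confident register changes and ignore low-confidence noise.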

---

## ToneEvent — SSE wire format

`ToneEvent` is the stable SSE wire type emitted by Linnet's annotation stream
and consumed by `<LinnetWidget />` embeds in Osprey, Falcon, and other products.

**Field names are locked as of cf-voice v0.1.0** (cf-core#40).

### JSON shape

```json
{
  "event_type": "tone",
  "timestamp": 4.82,
  "label": "Warmly impatient",
  "confidence": 0.79,
  "speaker_id": "speaker_a",
  "subtext": "Tone: Frustrated",
  "affect": "frustrated",
  "shift_magnitude": 0.74,
  "shift_direction": "more_urgent",
  "prosody_flags": ["fast_rate", "rising"],
  "session_id": "ses_abc123"
}
```

### Field reference

| Field | Type | Stable | Description |
|---|---|---|---|
| `event_type` | `"tone"` | yes | Always `"tone"` for ToneEvent |
| `timestamp` | `float` | yes | Seconds since session start |
| `label` | `str` | yes | Human-readable tone descriptor ("Warmly impatient") |
| `confidence` | `float` | yes | 0.0–1.0. Below ~0.55 = speculative |
| `speaker_id` | `str` | yes | Ephemeral diarization label ("speaker_a"). Resets per session |
| `subtext` | `str \| null` | yes | Annotation text. Generic: `"Tone: Frustrated"`. Elcor: `"With barely concealed frustration:"` |
| `affect` | `str` | yes | AFFECT_LABELS key ("frustrated"). See `cf_voice.events.AFFECT_LABELS` |
| `shift_magnitude` | `float` | yes | 0.0–1.0. High = meaningful register change from previous frame |
| `shift_direction` | `str` | yes | `"warmer"` \| `"colder"` \| `"more_urgent"` \| `"stable"` |
| `prosody_flags` | `str[]` | no | Raw prosody signals ("fast_rate", "rising", "flat_pitch", "low_energy"). Subject to change |
| `session_id` | `str` | yes | Caller-assigned. Correlates events to a conversation session |
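
For consumers written in Python, a payload with the fields above could be decoded roughly as follows. This is a sketch only: the `ToneEvent` dataclass here is an illustrative mirror of the wire fields, not the class in `cf_voice.events`, and `parse_tone_event` is a hypothetical helper. Unknown keys are dropped so that future additions to the payload do not break the parser, and `prosody_flags` defaults to an empty list because it is the one field not marked stable.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToneEvent:
    """Illustrative mirror of the wire fields (not cf_voice.events.ToneEvent)."""
    event_type: str
    timestamp: float
    label: str
    confidence: float
    speaker_id: str
    affect: str
    shift_magnitude: float
    shift_direction: str
    session_id: str
    subtext: Optional[str] = None
    prosody_flags: List[str] = field(default_factory=list)  # unstable field


def parse_tone_event(raw: str) -> ToneEvent:
    """Decode one JSON payload, ignoring keys this sketch does not know about."""
    data = json.loads(raw)
    if data.get("event_type") != "tone":
        raise ValueError(f"not a tone event: {data.get('event_type')!r}")
    known = {k: v for k, v in data.items() if k in ToneEvent.__dataclass_fields__}
    return ToneEvent(**known)


raw = json.dumps({
    "event_type": "tone", "timestamp": 4.82, "label": "Warmly impatient",
    "confidence": 0.79, "speaker_id": "speaker_a", "subtext": "Tone: Frustrated",
    "affect": "frustrated", "shift_magnitude": 0.74,
    "shift_direction": "more_urgent", "prosody_flags": ["fast_rate", "rising"],
    "session_id": "ses_abc123",
})
event = parse_tone_event(raw)
# The ~0.55 cutoff comes from the field reference above.
print(event.label, "speculative" if event.confidence < 0.55 else "confident")
```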

### SSE envelope

Linnet emits events in standard SSE format:

```
event: tone-event
data: {"event_type":"tone","timestamp":4.82,...}

```

Host apps subscribing via `<LinnetWidget />` receive `MessageEvent` with `type === "tone-event"`.
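
Outside the widget, the same envelope can be read with a few lines of Python. The parser below is a minimal sketch of the standard SSE framing (`event:` and `data:` fields, blank line as terminator); `iter_sse_events` and `_field_value` are hypothetical names, and the input is an in-memory list of lines rather than a real HTTP response from Linnet.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def _field_value(line: str, name: str) -> str:
    """Return the text after 'name:', dropping one optional leading space."""
    value = line[len(name) + 1:]
    return value[1:] if value.startswith(" ") else value


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs from an iterable of SSE text lines."""
    event_name = "message"  # SSE default when no event: field is present
    data_lines: List[str] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                yield event_name, "\n".join(data_lines)
            event_name, data_lines = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        elif line.startswith("event:"):
            event_name = _field_value(line, "event")
        elif line.startswith("data:"):
            data_lines.append(_field_value(line, "data"))


stream = [
    "event: tone-event",
    'data: {"event_type": "tone", "timestamp": 4.82, "label": "Warmly impatient"}',
    "",
]
tone_payloads = [
    json.loads(data) for name, data in iter_sse_events(stream) if name == "tone-event"
]
print(tone_payloads[0]["label"])
```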

### Elcor mode

`subtext` switches format when the session is in Elcor mode, an easter egg unlocked by cumulative session time. The generic format is always available; Elcor is opt-in via the session flag. The table shows both formats for a sample of affects:

| Affect | Generic | Elcor |
|---|---|---|
| frustrated | `Tone: Frustrated` | `With barely concealed frustration:` |
| warm | `Tone: Warm` | `Warmly:` |
| scripted | `Tone: Scripted` | `Reading from a script:` |
| dismissive | `Tone: Dismissive` | `With polite dismissiveness:` |
| tired | `Tone: Tired` | `With audible fatigue:` |
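
The mapping in the table can be sketched as a small lookup. `render_subtext` and `ELCOR_PREFIXES` are hypothetical names, and falling back to the generic format for affects missing from the table is an assumption, not documented behaviour.

```python
# Elcor phrasings copied from the table above.
ELCOR_PREFIXES = {
    "frustrated": "With barely concealed frustration:",
    "warm": "Warmly:",
    "scripted": "Reading from a script:",
    "dismissive": "With polite dismissiveness:",
    "tired": "With audible fatigue:",
}


def render_subtext(affect: str, elcor: bool = False) -> str:
    """Return the subtext string for an affect key in generic or Elcor format."""
    if elcor and affect in ELCOR_PREFIXES:
        return ELCOR_PREFIXES[affect]
    # Assumed fallback: generic format for any affect without an Elcor phrasing.
    return f"Tone: {affect.capitalize()}"


print(render_subtext("frustrated"))
print(render_subtext("frustrated", elcor=True))
```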

---

## Mock mode

Set `CF_VOICE_MOCK=1` or pass `mock=True` to `make_io()`. Mock mode emits synthetic `VoiceFrame` objects on a timer and needs no GPU, microphone, or `HF_TOKEN`. The API surface is identical to real mode.
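
To give a feel for what "synthetic frames on a timer" means, here is a self-contained sketch of a timer-driven async generator. It does not use `MockVoiceIO` or `make_io()`; `mock_frames`, `collect`, the placeholder labels, and the interval are all assumptions chosen for the example, and frames are plain dicts rather than `VoiceFrame` objects.

```python
import asyncio
import random
from typing import AsyncIterator, Dict, List

# Placeholder labels for the sketch; the real mock may use different ones.
PLACEHOLDER_LABELS = ["Warmly impatient", "Frustrated", "Tired"]


async def mock_frames(
    count: int, interval: float = 0.01, seed: int = 0
) -> AsyncIterator[Dict]:
    """Yield `count` synthetic frame dicts, one every `interval` seconds."""
    rng = random.Random(seed)  # seeded so runs are repeatable in CI
    for i in range(count):
        await asyncio.sleep(interval)
        yield {
            "label": rng.choice(PLACEHOLDER_LABELS),
            "confidence": round(rng.uniform(0.0, 1.0), 2),
            "speaker_id": "speaker_a",
            "shift_magnitude": round(rng.uniform(0.0, 1.0), 2),
            "timestamp": round((i + 1) * interval, 3),  # session-relative seconds
        }


async def collect(count: int) -> List[Dict]:
    """Drain the generator into a list so it can be inspected synchronously."""
    return [frame async for frame in mock_frames(count)]


frames = asyncio.run(collect(3))
print(len(frames), frames[0]["speaker_id"])
```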

---

## Module structure

| Module | License | Purpose |
|--------|---------|---------|
| `cf_voice.models` | MIT | `VoiceFrame` dataclass |
| `cf_voice.events` | MIT | `AudioEvent`, `ToneEvent`, wire format types |
| `cf_voice.io` | MIT | `VoiceIO` base, `MockVoiceIO`, `make_io()` factory |
| `cf_voice.capture` | BSL 1.1 | `MicVoiceIO` — real mic capture, 2s windowing |
| `cf_voice.stt` | BSL 1.1 | `WhisperSTT` — faster-whisper async wrapper |
| `cf_voice.classify` | BSL 1.1 | `ToneClassifier` — wav2vec2 SER + librosa prosody |
| `cf_voice.diarize` | BSL 1.1 | `Diarizer` — pyannote.audio async wrapper |
| `cf_voice.context` | BSL 1.1 | `ContextClassifier` — high-level consumer API |

BSL 1.1 applies to the inference modules. The IO layer, types, and wire format are MIT licensed.

---

## Consumed by

- `Circuit-Forge/linnet` — real-time tone annotation PWA (primary consumer)
- `Circuit-Forge/osprey` — telephony bridge voice context (Navigation v0.2.x)
- `Circuit-Forge/falcon` (planned) — phone form-filling, IVR navigation