cf-voice/.env.example
pyr0ball fed6388b99 feat: real inference pipeline — STT, tone classifier, diarization, mic capture
- cf_voice/stt.py: WhisperSTT async wrapper (faster-whisper, thread-pool executor,
  rolling 50-word session prompt for cross-chunk context continuity)
- cf_voice/classify.py: ToneClassifier — wav2vec2 SER + librosa prosody flags
  (energy, ZCR speech rate, YIN pitch contour) mapped to AFFECT_LABELS
- cf_voice/diarize.py: Diarizer async wrapper around pyannote/speaker-diarization-3.1;
  speaker_at() helper for Navigation v0.2.x wiring
- cf_voice/capture.py: MicVoiceIO — sounddevice 16kHz mono capture, 2s window
  accumulation, parallel STT+classify tasks, shift_magnitude from confidence delta
- cf_voice/io.py: make_io() now returns MicVoiceIO when CF_VOICE_MOCK is unset
- cf_voice/context.py: classify_chunk() split into mock/real paths; real path
  decodes base64 PCM and runs ToneClassifier synchronously (cf-orch endpoint)
- pyproject.toml: inference extras expanded (faster-whisper, sounddevice,
  librosa, python-dotenv)
- .env.example: HF_TOKEN, CF_VOICE_WHISPER_MODEL, CF_VOICE_DEVICE, CF_VOICE_MOCK,
  CF_VOICE_CONFIDENCE_THRESHOLD

Prior art ported from: Plex-Scripts/transcription/diarization.py (pyannote
setup), devl/ogma/backend/speech/transcription_engine.py (faster-whisper
preprocessing and session prompt pattern).
2026-04-06 17:33:51 -07:00
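The `make_io()` mock/real switch described above can be sketched as follows. The module and behavior ("returns MicVoiceIO when CF_VOICE_MOCK is unset") come from the commit message; the exact truthiness parsing and the `mock_enabled` helper name are assumptions for illustration:

```python
import os

def mock_enabled(env=None) -> bool:
    """True when CF_VOICE_MOCK requests synthetic VoiceFrames.

    Per the commit message, make_io() returns MicVoiceIO when
    CF_VOICE_MOCK is unset; treating "" and "0" as falsy here is
    an assumption about the exact parsing, not confirmed by it.
    """
    env = os.environ if env is None else env
    return env.get("CF_VOICE_MOCK", "") not in ("", "0")
```

A consumer would then branch on this flag to construct either the mock IO or `MicVoiceIO` without needing a GPU, mic, or HF token in the mock path.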

# cf-voice environment — copy to .env and fill in values
# cf-voice itself does not auto-load .env; consumers (Linnet, Osprey, etc.)
# load it via python-dotenv in their own startup. For standalone cf-voice
# dev/testing, source this file manually or install python-dotenv.
# ── HuggingFace ───────────────────────────────────────────────────────────────
# Required for pyannote.audio speaker diarization model download.
# Get a free token at https://huggingface.co/settings/tokens
# Also accept the gated model terms at:
# https://huggingface.co/pyannote/speaker-diarization-3.1
# https://huggingface.co/pyannote/segmentation-3.0
HF_TOKEN=
# ── Whisper STT ───────────────────────────────────────────────────────────────
# Model size: tiny | base | small | medium | large-v2 | large-v3
# Smaller = faster / less VRAM; larger = more accurate.
# Recommended: small (500MB VRAM) for real-time use.
CF_VOICE_WHISPER_MODEL=small
# ── Compute ───────────────────────────────────────────────────────────────────
# auto (detect GPU), cuda, cpu
CF_VOICE_DEVICE=auto
# ── Mock mode ─────────────────────────────────────────────────────────────────
# Set to 1 to use synthetic VoiceFrames — no GPU, mic, or HF token required.
# Unset or 0 for real audio capture.
CF_VOICE_MOCK=
# ── Tone classifier ───────────────────────────────────────────────────────────
# Minimum confidence to emit a VoiceFrame (below this = frame skipped).
CF_VOICE_CONFIDENCE_THRESHOLD=0.55
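Since cf-voice does not auto-load `.env`, a consumer's startup might look like the sketch below. The `load_dotenv()` call is the real python-dotenv API referenced in the comments above; the `should_emit` helper is purely illustrative of the threshold semantics ("below this = frame skipped") and is not part of cf-voice:

```python
import os
# In a consumer's startup, python-dotenv would populate os.environ first:
#   from dotenv import load_dotenv
#   load_dotenv()

def confidence_threshold(env=None) -> float:
    # Default mirrors the value shipped in .env.example.
    env = os.environ if env is None else env
    return float(env.get("CF_VOICE_CONFIDENCE_THRESHOLD", "0.55"))

def should_emit(frame_confidence: float, env=None) -> bool:
    # Frames below the threshold are skipped rather than emitted.
    return frame_confidence >= confidence_threshold(env)
```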