- cf_voice/stt.py: WhisperSTT async wrapper (faster-whisper, thread-pool executor, rolling 50-word session prompt for cross-chunk context continuity)
- cf_voice/classify.py: ToneClassifier — wav2vec2 SER + librosa prosody flags (energy, ZCR speech rate, YIN pitch contour) mapped to AFFECT_LABELS
- cf_voice/diarize.py: Diarizer async wrapper around pyannote/speaker-diarization-3.1; speaker_at() helper for Navigation v0.2.x wiring
- cf_voice/capture.py: MicVoiceIO — sounddevice 16 kHz mono capture, 2 s window accumulation, parallel STT+classify tasks, shift_magnitude from confidence delta
- cf_voice/io.py: make_io() now returns MicVoiceIO when CF_VOICE_MOCK is unset
- cf_voice/context.py: classify_chunk() split into mock/real paths; real path decodes base64 PCM and runs ToneClassifier synchronously (cf-orch endpoint)
- pyproject.toml: inference extras expanded (faster-whisper, sounddevice, librosa, python-dotenv)
- .env.example: HF_TOKEN, CF_VOICE_WHISPER_MODEL, CF_VOICE_DEVICE, CF_VOICE_MOCK, CF_VOICE_CONFIDENCE_THRESHOLD

Prior art ported from: Plex-Scripts/transcription/diarization.py (pyannote setup), devl/ogma/backend/speech/transcription_engine.py (faster-whisper preprocessing and session prompt pattern).
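The rolling 50-word session prompt mentioned for cf_voice/stt.py can be sketched as follows. This is a minimal illustration of the pattern, not the actual cf_voice API: the `RollingPrompt` class and its method names are hypothetical, assuming the tail of recent transcript words is passed as faster-whisper's `initial_prompt` on the next chunk.

```python
from collections import deque


class RollingPrompt:
    """Keep a bounded tail of transcribed words for cross-chunk context.

    Hypothetical sketch — cf_voice's real WhisperSTT internals may differ.
    """

    def __init__(self, max_words: int = 50) -> None:
        self._words: deque = deque(maxlen=max_words)

    def extend(self, transcript: str) -> None:
        # Append the latest chunk's words; the deque drops the oldest
        # automatically once max_words is exceeded.
        self._words.extend(transcript.split())

    @property
    def text(self) -> str:
        # Intended to be passed as `initial_prompt=` to
        # WhisperModel.transcribe() so the decoder sees recent context.
        return " ".join(self._words)


prompt = RollingPrompt(max_words=50)
prompt.extend("hello there " * 30)  # 60 words in, only the last 50 kept
assert len(prompt.text.split()) == 50
```

Biasing the decoder with a short rolling prompt keeps terminology and casing consistent across 2 s windows without growing context unboundedly.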
# cf-voice environment — copy to .env and fill in values
# cf-voice itself does not auto-load .env; consumers (Linnet, Osprey, etc.)
# load it via python-dotenv in their own startup. For standalone cf-voice
# dev/testing, source this file manually or install python-dotenv.

# ── HuggingFace ───────────────────────────────────────────────────────────────
# Required for pyannote.audio speaker diarization model download.
# Get a free token at https://huggingface.co/settings/tokens
# Also accept the gated model terms at:
#   https://huggingface.co/pyannote/speaker-diarization-3.1
#   https://huggingface.co/pyannote/segmentation-3.0
HF_TOKEN=

# ── Whisper STT ───────────────────────────────────────────────────────────────
# Model size: tiny | base | small | medium | large-v2 | large-v3
# Smaller = faster / less VRAM; larger = more accurate.
# Recommended: small (~500 MB VRAM) for real-time use.
CF_VOICE_WHISPER_MODEL=small

# ── Compute ───────────────────────────────────────────────────────────────────
# auto (detect GPU), cuda, cpu
CF_VOICE_DEVICE=auto

# ── Mock mode ─────────────────────────────────────────────────────────────────
# Set to 1 to use synthetic VoiceFrames — no GPU, mic, or HF token required.
# Unset or 0 for real audio capture.
CF_VOICE_MOCK=

# ── Tone classifier ───────────────────────────────────────────────────────────
# Minimum confidence to emit a VoiceFrame (below this = frame skipped).
CF_VOICE_CONFIDENCE_THRESHOLD=0.55
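Since cf-voice does not auto-load .env, a consumer might load it at startup (e.g. with python-dotenv's `load_dotenv()`) and then parse these variables along the following lines. This is a hedged sketch: the `voice_settings` helper is illustrative and not part of cf_voice, though the defaults shown match this file.

```python
import os


def voice_settings(env=None) -> dict:
    """Parse cf-voice env vars into typed settings (illustrative helper)."""
    env = os.environ if env is None else env
    return {
        # CF_VOICE_MOCK: unset or "0" means real capture, "1" means mock frames.
        "mock": env.get("CF_VOICE_MOCK", "") not in ("", "0"),
        "model": env.get("CF_VOICE_WHISPER_MODEL", "small"),
        "device": env.get("CF_VOICE_DEVICE", "auto"),
        # VoiceFrames scoring below this confidence are skipped entirely.
        "confidence_threshold": float(env.get("CF_VOICE_CONFIDENCE_THRESHOLD", "0.55")),
    }


settings = voice_settings({"CF_VOICE_MOCK": "1", "CF_VOICE_CONFIDENCE_THRESHOLD": "0.6"})
assert settings["mock"] is True
assert settings["confidence_threshold"] == 0.6
```

Parsing once at startup keeps the "unset or 0" mock-mode convention in one place instead of scattering `os.environ` checks across modules.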