New modules shipped (from Linnet integration):
- acoustic.py: AST (MIT/ast-finetuned-audioset-10-10-0.4593) replaces YAMNet stub; 527 AudioSet classes mapped to queue/speaker/environ/scene labels; _LABEL_MAP includes hold_music, ringback, DTMF, background_shift, AMD signal chain
- accent.py: facebook/mms-lid-126 language ID → regional accent labels (en_gb, en_us, en_au, fr, es, de, zh, …); lazy-loaded, gated by CF_VOICE_ACCENT
- privacy.py: compound privacy risk scorer: public_env, background_voices, nature scene, accent signals; returns 0–3 score without storing any audio
- prosody.py: openSMILE-backed prosody extractor (sarcasm_risk, flat_f0_score, speech_rate, pitch_range); mock mode returns neutral values
- dimensional.py: audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim valence/arousal/dominance scorer; gated by CF_VOICE_DIMENSIONAL
- trajectory.py: rolling buffer for arousal/valence deltas, trend detection (escalating/suppressed/stable), coherence scoring, suppression/reframe flags
- telephony.py: TelephonyBackend Protocol + MockTelephonyBackend + SignalWireBackend + FreeSWITCHBackend; CallSession dataclass; make_telephony() factory
- app.py: FastAPI service (port 8007): /health + /classify; accepts base64 PCM chunks, returns full AudioEventOut including dimensional/prosody/accent fields
- prefs.py: voice preference helpers (elcor_mode, confidence_threshold, whisper_model, elcor_prior_frames); cf-core and env-var fallback

Tests: fix stale tests (YAMNetAcousticBackend → ASTAcousticBackend, scene field added to AcousticResult, speaker_at gap now resolves dominant speaker not UNKNOWN, make_io real path returns MicVoiceIO when sounddevice installed). 78 tests passing.

Closes #2, #3.
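The app.py entry above says /classify accepts base64 PCM chunks. As a rough illustration of what such a chunk could look like on the wire, here is a minimal round-trip sketch. It assumes 16-bit little-endian mono samples (the format the diarization script below requests from ffmpeg); the function names `encode_pcm_chunk` and `decode_pcm_chunk` are hypothetical and are not part of cf-voice, and the actual request schema is not shown in this commit.

```python
from __future__ import annotations

import array
import base64
import math
import sys


def encode_pcm_chunk(samples: list[float]) -> str:
    """Hypothetical helper: floats in [-1, 1] -> base64 of 16-bit LE PCM."""
    # Scale to int16 and clamp so out-of-range input cannot overflow.
    ints = array.array(
        "h", (max(-32768, min(32767, round(s * 32767))) for s in samples)
    )
    if sys.byteorder == "big":
        ints.byteswap()  # force little-endian byte order on big-endian hosts
    return base64.b64encode(ints.tobytes()).decode("ascii")


def decode_pcm_chunk(payload: str) -> list[float]:
    """Hypothetical helper: base64 of 16-bit LE PCM -> floats in [-1, 1)."""
    ints = array.array("h")
    ints.frombytes(base64.b64decode(payload))
    if sys.byteorder == "big":
        ints.byteswap()
    return [i / 32768.0 for i in ints]


# 10 ms of a 440 Hz tone at 16 kHz, encoded and decoded again.
chunk = [math.sin(2 * math.pi * 440 * n / 16_000) for n in range(160)]
payload = encode_pcm_chunk(chunk)
decoded = decode_pcm_chunk(payload)
```

The decoded samples differ from the originals only by 16-bit quantization error, which is why a server-side decoder can divide by 32768 without further correction.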
65 lines
1.7 KiB
Python
"""
Manual integration test for speaker diarization via pyannote.

Requires:
  - HF_TOKEN env var (required; the script exits if it is unset)
  - CF_VOICE_DIARIZE=1
  - ffmpeg on PATH
  - A local audio/video file (edit MEDIA_FILE below)
  - pip install cf-voice[inference]

Run:
  HF_TOKEN=hf_... CF_VOICE_DIARIZE=1 python scripts/test_diarize_real.py
"""
from __future__ import annotations

import asyncio
import os
import subprocess

import numpy as np

# HF_TOKEN must come from the environment; it is never hard-coded here.
if not os.environ.get("HF_TOKEN"):
    raise SystemExit("Set HF_TOKEN in env before running this script.")
# Enable diarization unless the caller has already set the flag.
os.environ.setdefault("CF_VOICE_DIARIZE", "1")

MEDIA_FILE = "/Library/Series/Hogan's Heroes/Season 3/Hogan's Heroes - S03E19 - Hogan, Go Home.mkv"
START_S = 120  # offset into the file, in seconds
DURATION_S = 2  # length of the extracted clip, in seconds
SAMPLE_RATE = 16_000  # Hz

# Import deferred until the environment checks above have run.
from cf_voice.diarize import Diarizer, SpeakerTracker  # noqa: E402


async def main() -> None:
    d = Diarizer.from_env()
    tracker = SpeakerTracker()

    # Decode the clip to mono 16-bit little-endian PCM on stdout.
    proc = subprocess.run(
        [
            "ffmpeg", "-i", MEDIA_FILE,
            "-ss", str(START_S),
            "-t", str(DURATION_S),
            "-ar", str(SAMPLE_RATE),
            "-ac", "1",
            "-f", "s16le",
            "-",
        ],
        capture_output=True,
        check=True,
    )
    # Scale int16 samples to float32 in [-1.0, 1.0).
    audio = np.frombuffer(proc.stdout, dtype=np.int16).astype(np.float32) / 32768.0
    if audio.size == 0:
        # Avoids a NaN RMS and an empty diarization call.
        raise SystemExit("ffmpeg returned no audio; check MEDIA_FILE and START_S.")
    rms = float(np.sqrt(np.mean(audio**2)))
    print(f"audio: {len(audio)} samples, {len(audio) / SAMPLE_RATE:.2f}s, rms={rms:.4f}")

    segs = await d.diarize_async(audio)
    print(f"segments ({len(segs)}): {segs}")

    # Query the speaker at the midpoint of the clip.
    mid = len(audio) / 2.0 / SAMPLE_RATE
    label = d.speaker_at(segs, mid, tracker)
    print(f"speaker_at({mid:.2f}s): {label}")


if __name__ == "__main__":
    asyncio.run(main())
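The script ends by asking `speaker_at` for the speaker at the clip midpoint, and the commit message notes that a query falling in a gap now resolves to the dominant speaker instead of UNKNOWN. The sketch below is one possible reading of that rule, not cf_voice's implementation: `Segment` and `speaker_at_sketch` are hypothetical names, the tracker argument is omitted, and "dominant" is assumed to mean the speaker with the most total talk time in the window.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical diarization segment; not cf_voice's actual type."""

    start_s: float
    end_s: float
    speaker: str


def speaker_at_sketch(segments: list[Segment], t_s: float) -> str | None:
    """Return the speaker active at t_s, or the dominant speaker in a gap."""
    if not segments:
        return None
    # Direct hit: the timestamp falls inside a segment.
    for seg in segments:
        if seg.start_s <= t_s < seg.end_s:
            return seg.speaker
    # Gap: fall back to whoever accumulated the most talk time
    # (an assumed definition of "dominant").
    totals: dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end_s - seg.start_s
    return max(totals, key=totals.get)


segs = [
    Segment(0.0, 0.8, "SPEAKER_A"),
    Segment(1.2, 2.0, "SPEAKER_B"),
    Segment(2.0, 2.5, "SPEAKER_A"),
]
in_segment = speaker_at_sketch(segs, 1.5)  # inside SPEAKER_B's segment
in_gap = speaker_at_sketch(segs, 1.0)      # no segment covers 1.0 s
```

Here `in_segment` is `"SPEAKER_B"`, while `in_gap` falls back to `"SPEAKER_A"`, who holds 1.3 s of talk time against 0.8 s. With a 2-second clip as configured above, a gap at the midpoint is plausible, which is likely why the fallback matters for this manual test.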