# Advanced Wake Word Topics - Pre-trained Models, Multiple Wake Words, and Voice Adaptation

## Pre-trained Mycroft Models

### Yes! Pre-trained Models Exist

Mycroft AI provides several pre-trained wake word models you can use immediately:

**Available Models:**
- **Hey Mycroft** - Original Mycroft wake word (most training data)
- **Hey Jarvis** - Popular alternative
- **Christopher** - Alternative wake word
- **Hey Ezra** - Another option
### Download Pre-trained Models

```bash
# On Heimdall
conda activate precise
cd ~/precise-models

# Create directory for pre-trained models
mkdir -p pretrained
cd pretrained

# Download Hey Mycroft (recommended starting point)
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz

# Download other models
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-jarvis.tar.gz
tar xzf hey-jarvis.tar.gz

# List available models
ls -lh *.net
```
### Test Pre-trained Model

```bash
conda activate precise

# Test Hey Mycroft
precise-listen hey-mycroft.net

# Speak "Hey Mycroft" - should see "!" when detected
# Press Ctrl+C to exit

# Test with different threshold
precise-listen hey-mycroft.net -t 0.7  # More conservative
```
### Use Pre-trained Model in Voice Server

```bash
cd ~/voice-assistant

# Start server with Hey Mycroft model
python voice_server.py \
    --enable-precise \
    --precise-model ~/precise-models/pretrained/hey-mycroft.net \
    --precise-sensitivity 0.5
```
### Fine-tune Pre-trained Models

You can use pre-trained models as a **starting point** and fine-tune with your voice:

```bash
cd ~/precise-models
mkdir -p hey-mycroft-custom

# Copy base model
cp pretrained/hey-mycroft.net hey-mycroft-custom/

# Collect your samples
cd hey-mycroft-custom
precise-collect  # Record 20-30 samples of YOUR voice

# Fine-tune from pre-trained model
precise-train -e 30 hey-mycroft-custom.net . \
    --from-checkpoint ../pretrained/hey-mycroft.net

# This is MUCH faster than training from scratch!
```

**Benefits:**
- ✅ Start with proven model
- ✅ Much less training data needed (20-30 vs 100+ samples)
- ✅ Faster training (30 mins vs 60 mins)
- ✅ Good baseline accuracy
## Multiple Wake Words

### Architecture Options

#### Option 1: Multiple Models in Parallel (Server-Side Only)

Run multiple Precise instances simultaneously:

```python
# In voice_server.py - multiple wake word detection

import queue
import time

from precise_runner import PreciseEngine, PreciseRunner

# Global runners and event queue
precise_runners = {}
wake_word_queue = queue.Queue()

def on_wake_word_detected(wake_word_name):
    """Callback factory for different wake words"""
    def callback():
        print(f"Wake word detected: {wake_word_name}")
        wake_word_queue.put({
            'wake_word': wake_word_name,
            'timestamp': time.time()
        })
    return callback

def start_multiple_wake_words(wake_word_configs):
    """
    Start multiple wake word detectors

    Args:
        wake_word_configs: List of dicts with 'name', 'model', 'sensitivity'

    Example:
        configs = [
            {'name': 'hey mycroft', 'model': 'hey-mycroft.net', 'sensitivity': 0.5},
            {'name': 'hey jarvis', 'model': 'hey-jarvis.net', 'sensitivity': 0.5}
        ]
    """
    global precise_runners

    for config in wake_word_configs:
        engine = PreciseEngine(
            '/usr/local/bin/precise-engine',
            config['model']
        )

        runner = PreciseRunner(
            engine,
            sensitivity=config['sensitivity'],
            on_activation=on_wake_word_detected(config['name'])
        )

        runner.start()
        precise_runners[config['name']] = runner

        print(f"Started wake word detector: {config['name']}")
```
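
Each detector only enqueues an event; the server's main loop still needs to consume them. A minimal sketch of a consumer, assuming `wake_word_queue` is the standard `queue.Queue` shown above (the dispatch logic is illustrative):

```python
import queue
import time

wake_word_queue = queue.Queue()

# Pretend two detectors fired
wake_word_queue.put({'wake_word': 'hey mycroft', 'timestamp': time.time()})
wake_word_queue.put({'wake_word': 'hey jarvis', 'timestamp': time.time()})

def drain_wake_events(q):
    """Consume all pending wake word events and return the names handled."""
    handled = []
    while True:
        try:
            event = q.get_nowait()
        except queue.Empty:
            break
        # Route per wake word here, e.g. 'hey jarvis' -> media commands
        handled.append(event['wake_word'])
    return handled

print(drain_wake_events(wake_word_queue))  # ['hey mycroft', 'hey jarvis']
```

Because a `queue.Queue` is thread-safe, the Precise callback threads and the server loop need no extra locking.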

**Server-Side Multiple Wake Words:**

```bash
# Start server with multiple wake words
python voice_server.py \
    --enable-precise \
    --precise-models "hey-mycroft:~/models/hey-mycroft.net:0.5,hey-jarvis:~/models/hey-jarvis.net:0.5"
```

**Performance Impact:**
- CPU: ~5-10% per model (can run 2-3 easily)
- Memory: ~50-100MB per model
- Latency: Minimal (all run in parallel)

#### Option 2: Single Model, Multiple Phrases (Edge or Server)

Train ONE model that responds to multiple phrases:

```bash
cd ~/precise-models/multi-wake
conda activate precise

# Record samples for BOTH wake words in the SAME dataset
# Label all as "wake-word" regardless of which phrase

mkdir -p wake-word not-wake-word

# Record "Hey Mycroft" samples
precise-collect  # Save to wake-word/hey-mycroft-*.wav

# Record "Hey Computer" samples
precise-collect  # Save to wake-word/hey-computer-*.wav

# Record negatives
precise-collect -f not-wake-word/random.wav

# Train single model on both phrases
precise-train -e 60 multi-wake.net .
```

**Pros:**
- ✅ Single model = less compute
- ✅ Works on edge (K210)
- ✅ Easy to deploy

**Cons:**
- ❌ Can't tell which wake word was used
- ❌ May reduce accuracy for each individual phrase
- ❌ Higher false positive risk

#### Option 3: Sequential Detection (Edge)

Detect wake word, then identify which one:

```python
# Pseudo-code for edge detection
if wake_word_detected():
    audio_snippet = last_2_seconds()

    # Run all models on the audio snippet
    scores = {
        'hey-mycroft': model1.score(audio_snippet),
        'hey-jarvis': model2.score(audio_snippet),
        'hey-computer': model3.score(audio_snippet)
    }

    # Use highest scoring wake word
    wake_word = max(scores, key=scores.get)
```
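
The pseudo-code above becomes runnable once the models are stubbed out. A sketch with placeholder scorers (the `make_scorer` biases stand in for real model inference):

```python
def make_scorer(bias):
    """Stand-in for a wake word model's confidence score."""
    def score(audio_snippet):
        # A real model would run inference on the audio here
        return bias + 0.01 * len(audio_snippet)
    return score

models = {
    'hey-mycroft': make_scorer(0.20),
    'hey-jarvis': make_scorer(0.85),
    'hey-computer': make_scorer(0.40),
}

audio_snippet = [0.0] * 10  # placeholder for ~2 seconds of samples

scores = {name: score for name, score in
          ((name, model(audio_snippet)) for name, model in models.items())}
wake_word = max(scores, key=scores.get)
print(wake_word)  # hey-jarvis
```

The `max(scores, key=scores.get)` idiom returns the dictionary key with the highest value, which is exactly the "highest scoring wake word" step.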

### Recommendations

**Server-Side (Heimdall):**
- ✅ **Use Option 1** - Multiple models in parallel
- Run 2-3 wake words easily
- Each can have different sensitivity
- Can identify which wake word was used
- Example: "Hey Mycroft" for commands, "Hey Jarvis" for queries

**Edge (Maix Duino K210):**
- ✅ **Use Option 2** - Single multi-phrase model
- K210 can handle 1 model efficiently
- Train on 2-3 phrases max
- Simpler deployment
- Lower latency

## Voice Adaptation & Multi-User Support

### Approach 1: Inclusive Training (Recommended)

Train ONE model on EVERYONE'S voices:

```bash
cd ~/precise-models/family-wake-word
conda activate precise

# Record samples from each family member
# Alice records 30 samples
precise-collect  # Save as wake-word/alice-*.wav

# Bob records 30 samples
precise-collect  # Save as wake-word/bob-*.wav

# Carol records 30 samples
precise-collect  # Save as wake-word/carol-*.wav

# Train on all voices
precise-train -e 60 family-wake-word.net .
```

**Pros:**
- ✅ Everyone can use the system
- ✅ Single model deployment
- ✅ Works for all family members
- ✅ Simple maintenance

**Cons:**
- ❌ Can't identify who spoke
- ❌ May need more training data
- ❌ No personalization

**Best for:** Family voice assistant, shared devices

### Approach 2: Speaker Identification (Advanced)

Detect wake word, then identify speaker:

```python
# Architecture with speaker ID

# Step 1: Precise detects wake word
if wake_word_detected():

    # Step 2: Capture voice sample
    voice_sample = record_audio(duration=3)

    # Step 3: Speaker identification
    # (uses voice embeddings / a neural network)
    speaker = identify_speaker(voice_sample)

    # Step 4: Process with user context
    process_command(voice_sample, user=speaker)
```

**Implementation Options:**

#### Option A: Use resemblyzer (Voice Embeddings)

```bash
pip install resemblyzer --break-system-packages

# Enrollment phase
python enroll_users.py
# Each user records 10-20 seconds of speech
# System creates voice profile (embedding)

# Runtime
python speaker_id.py
# Compares incoming audio to stored embeddings
# Returns most likely speaker
```

**Example Code:**
```python
import os

import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

# Initialize encoder
encoder = VoiceEncoder()

# Enrollment - do once per user
def enroll_user(name, audio_files):
    """Create voice profile for user"""
    embeddings = []

    for audio_file in audio_files:
        wav = preprocess_wav(audio_file)
        embedding = encoder.embed_utterance(wav)
        embeddings.append(embedding)

    # Average embeddings for robustness
    user_profile = np.mean(embeddings, axis=0)

    # Save profile
    np.save(f'profiles/{name}.npy', user_profile)
    return user_profile

# Identification - run each time
def identify_speaker(audio_file, profiles_dir='profiles'):
    """Identify which enrolled user is speaking"""
    wav = preprocess_wav(audio_file)
    test_embedding = encoder.embed_utterance(wav)

    # Load all profiles
    profiles = {}
    for profile_file in os.listdir(profiles_dir):
        name = profile_file.replace('.npy', '')
        profile = np.load(os.path.join(profiles_dir, profile_file))
        profiles[name] = profile

    # Calculate similarity to each profile
    # (resemblyzer embeddings are L2-normalized, so the dot
    # product behaves like cosine similarity)
    similarities = {}
    for name, profile in profiles.items():
        similarity = np.dot(test_embedding, profile)
        similarities[name] = similarity

    # Return most similar
    best_match = max(similarities, key=similarities.get)
    confidence = similarities[best_match]

    if confidence > 0.7:  # Threshold
        return best_match
    else:
        return "unknown"
```
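
One subtlety: averaging several embeddings (as `enroll_user` does) yields a mean vector that is no longer unit length, so a raw dot product drifts away from true cosine similarity. Normalizing inside the comparison keeps the 0.7 threshold meaningful. A standalone numpy sketch (random vectors stand in for real embeddings):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity, safe for non-unit vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
profile = rng.normal(size=256)
same = profile + 0.1 * rng.normal(size=256)   # same voice, slight variation
other = rng.normal(size=256)                  # a different voice

# The perturbed vector stays close; an unrelated vector does not
print(cosine_sim(profile, same) > cosine_sim(profile, other))  # True
```

Swapping `np.dot(test_embedding, profile)` for `cosine_sim(test_embedding, profile)` in `identify_speaker` is a drop-in change.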

#### Option B: Use pyannote.audio (Production-grade)

```bash
pip install pyannote.audio --break-system-packages

# Requires HuggingFace token (same as diarization)
```

**Example:**
```python
from pyannote.audio import Model, Inference
from scipy.spatial.distance import cosine

# Initialize (window="whole" returns one embedding per file)
model = Model.from_pretrained(
    "pyannote/embedding",
    use_auth_token="your_hf_token"
)
inference = Inference(model, window="whole")

# Enroll users
alice_profile = inference("alice_sample.wav")
bob_profile = inference("bob_sample.wav")

# Identify
test_embedding = inference("test_audio.wav")

# Compare
alice_similarity = 1 - cosine(test_embedding, alice_profile)
bob_similarity = 1 - cosine(test_embedding, bob_profile)

if alice_similarity > bob_similarity and alice_similarity > 0.7:
    speaker = "Alice"
elif bob_similarity > 0.7:
    speaker = "Bob"
else:
    speaker = "Unknown"
```

**Pros:**
- ✅ Can identify individual users
- ✅ Personalized responses
- ✅ User-specific commands/permissions
- ✅ Better for privacy (know who's speaking)

**Cons:**
- ❌ More complex implementation
- ❌ Requires enrollment phase
- ❌ Additional processing time (~100-200ms)
- ❌ May fail with similar voices

### Approach 3: Per-User Wake Word Models

Each person has their OWN wake word:

```bash
# Alice's wake word: "Hey Mycroft"
# Train on ONLY Alice's voice

# Bob's wake word: "Hey Jarvis"
# Train on ONLY Bob's voice

# Carol's wake word: "Hey Computer"
# Train on ONLY Carol's voice
```

**Deployment:**
Run all 3 models in parallel (server-side):
```python
wake_word_configs = [
    {'name': 'Alice', 'wake_word': 'hey mycroft', 'model': 'alice-wake.net'},
    {'name': 'Bob', 'wake_word': 'hey jarvis', 'model': 'bob-wake.net'},
    {'name': 'Carol', 'wake_word': 'hey computer', 'model': 'carol-wake.net'}
]
```

**Pros:**
- ✅ Automatic user identification
- ✅ Highest accuracy per user
- ✅ Clear user separation
- ✅ No additional speaker ID needed

**Cons:**
- ❌ Requires 3x models (server only)
- ❌ Users must remember their wake word
- ❌ 3x CPU usage (~15-30%)
- ❌ Can't work on edge (K210)

### Approach 4: Context-Based Adaptation

No speaker ID, but learn from interaction:

```python
# Track command patterns
user_context = {
    'last_command': 'turn on living room lights',
    'frequent_entities': ['light.living_room', 'light.bedroom'],
    'time_of_day_patterns': {'morning': 'coffee maker', 'evening': 'tv'},
    'location': 'home'  # vs 'away'
}

# Use context to improve intent recognition (pseudo-code)
if command == "turn on the lights" and is_morning():
    # Probably means the usual lights (based on history)
    entity = user_context['frequent_entities'][0]
```
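
The frequency-tracking idea above can be made concrete with a few lines. A minimal, runnable sketch (entity names are illustrative; a real server would persist these counts between restarts):

```python
from collections import Counter

class UserContext:
    """Tracks which entities a household actually uses."""

    def __init__(self):
        self.entity_counts = Counter()

    def record(self, entity):
        """Call after every successfully executed command."""
        self.entity_counts[entity] += 1

    def most_likely_entity(self, default=None):
        """Best guess when a command is ambiguous ('the lights')."""
        if not self.entity_counts:
            return default
        return self.entity_counts.most_common(1)[0][0]

ctx = UserContext()
for e in ['light.bedroom', 'light.living_room', 'light.bedroom']:
    ctx.record(e)

print(ctx.most_likely_entity())  # light.bedroom
```

This keeps the "learning" purely statistical, which sidesteps the enrollment phase the other approaches require.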

**Pros:**
- ✅ No enrollment needed
- ✅ Improves over time
- ✅ Simple to implement
- ✅ Works with any number of users

**Cons:**
- ❌ No true user identification
- ❌ May make incorrect assumptions
- ❌ Privacy concerns (tracking behavior)

## Recommended Strategy

### For Your Use Case

Based on your home lab setup, I recommend:

#### Phase 1: Single Wake Word, Inclusive Training (Week 1-2)

```bash
# Start simple
cd ~/precise-models/hey-computer
conda activate precise

# Have all family members record samples
# Alice: 30 samples of "Hey Computer"
# Bob: 30 samples of "Hey Computer"
# You: 30 samples of "Hey Computer"

# Train single model on all voices
precise-train -e 60 hey-computer.net .

# Deploy to server
python voice_server.py \
    --enable-precise \
    --precise-model hey-computer.net
```

**Why:**
- Simple to set up and test
- Everyone can use it immediately
- Single model = easier debugging
- Works on edge if you migrate later

#### Phase 2: Add Speaker Identification (Week 3-4)

```bash
# Install resemblyzer
pip install resemblyzer --break-system-packages

# Enroll users
python enroll_users.py
# Each person speaks for 20 seconds

# Update voice_server.py to identify speaker
# Use speaker ID for personalized responses
```

**Why:**
- Enables personalization
- Can track preferences per user
- User-specific command permissions
- Better privacy (know who's speaking)

#### Phase 3: Multiple Wake Words (Month 2+)

```bash
# Add alternative wake words for different contexts
# "Hey Mycroft" - General commands
# "Hey Jarvis" - Media/Plex control
# "Computer" - Quick commands (lights, temp)

# Deploy multiple models on server
python voice_server.py \
    --enable-precise \
    --precise-models "mycroft:hey-mycroft.net:0.5,jarvis:hey-jarvis.net:0.5"
```

**Why:**
- Different wake words for different contexts
- Reduces false positives (more specific triggers)
- Fun factor (Jarvis for media!)
- Server can handle 2-3 easily

## Implementation Guide: Multiple Wake Words

### Update voice_server.py for Multiple Wake Words

```python
# Add to voice_server.py

def start_multiple_wake_words(configs):
    """
    Start multiple wake word detectors

    Args:
        configs: List of dicts with 'name', 'model_path', 'sensitivity'
    """
    global precise_runners
    precise_runners = {}

    for config in configs:
        try:
            engine = PreciseEngine(
                DEFAULT_PRECISE_ENGINE,
                config['model_path']
            )

            def make_callback(wake_word_name):
                def callback():
                    print(f"Wake word detected: {wake_word_name}")
                    wake_word_queue.put({
                        'wake_word': wake_word_name,
                        'timestamp': time.time(),
                        'source': 'precise'
                    })
                return callback

            runner = PreciseRunner(
                engine,
                sensitivity=config['sensitivity'],
                on_activation=make_callback(config['name'])
            )

            runner.start()
            precise_runners[config['name']] = runner

            print(f"✓ Started: {config['name']} (sensitivity: {config['sensitivity']})")

        except Exception as e:
            print(f"✗ Failed to start {config['name']}: {e}")

    return len(precise_runners) > 0

# Add to main()
parser.add_argument('--precise-models',
                    help='Multiple models: name:path:sensitivity,name2:path2:sensitivity2')

# Parse multiple models
if args.precise_models:
    configs = []
    for model_spec in args.precise_models.split(','):
        name, path, sensitivity = model_spec.split(':')
        configs.append({
            'name': name,
            'model_path': os.path.expanduser(path),
            'sensitivity': float(sensitivity)
        })

    start_multiple_wake_words(configs)
```

### Usage Example

```bash
cd ~/voice-assistant

# Start with multiple wake words
python voice_server.py \
    --enable-precise \
    --precise-models "\
hey-mycroft:~/precise-models/pretrained/hey-mycroft.net:0.5,\
hey-jarvis:~/precise-models/pretrained/hey-jarvis.net:0.5"
```
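
The `name:path:sensitivity` spec format is easy to unit-test in isolation before wiring it into the server. A standalone sketch of the same parsing logic:

```python
import os

def parse_model_specs(spec):
    """Parse 'name:path:sensitivity,...' into config dicts."""
    configs = []
    for model_spec in spec.split(','):
        name, path, sensitivity = model_spec.split(':')
        configs.append({
            'name': name,
            'model_path': os.path.expanduser(path),
            'sensitivity': float(sensitivity)
        })
    return configs

configs = parse_model_specs(
    "mycroft:models/hey-mycroft.net:0.5,jarvis:models/hey-jarvis.net:0.7"
)
print([c['name'] for c in configs])  # ['mycroft', 'jarvis']
```

Note the format cannot contain colons in paths; if that ever matters, switching to `model_spec.rsplit(':', 2)` keeps the last two fields fixed while allowing colons earlier in the string.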

## Implementation Guide: Speaker Identification

### Add to voice_server.py

```python
# Add resemblyzer support
import os

try:
    from resemblyzer import VoiceEncoder, preprocess_wav
    import numpy as np
    SPEAKER_ID_AVAILABLE = True
except ImportError:
    SPEAKER_ID_AVAILABLE = False
    print("Warning: resemblyzer not available. Speaker ID disabled.")

# Encoder and profiles are initialized lazily
voice_encoder = None
speaker_profiles = {}

def load_speaker_profiles(profiles_dir='~/voice-assistant/profiles'):
    """Load enrolled speaker profiles"""
    global speaker_profiles, voice_encoder

    if not SPEAKER_ID_AVAILABLE:
        return False

    profiles_dir = os.path.expanduser(profiles_dir)

    if not os.path.exists(profiles_dir):
        print(f"No speaker profiles found at {profiles_dir}")
        return False

    # Initialize encoder
    voice_encoder = VoiceEncoder()

    # Load all profiles
    for profile_file in os.listdir(profiles_dir):
        if profile_file.endswith('.npy'):
            name = profile_file.replace('.npy', '')
            profile = np.load(os.path.join(profiles_dir, profile_file))
            speaker_profiles[name] = profile
            print(f"Loaded speaker profile: {name}")

    return len(speaker_profiles) > 0

def identify_speaker(audio_path, threshold=0.7):
    """Identify speaker from audio file"""
    if not SPEAKER_ID_AVAILABLE or not speaker_profiles:
        return None

    try:
        # Get embedding for test audio
        wav = preprocess_wav(audio_path)
        test_embedding = voice_encoder.embed_utterance(wav)

        # Compare to all profiles
        similarities = {}
        for name, profile in speaker_profiles.items():
            similarity = np.dot(test_embedding, profile)
            similarities[name] = similarity

        # Get best match
        best_match = max(similarities, key=similarities.get)
        confidence = similarities[best_match]

        print(f"Speaker ID: {best_match} (confidence: {confidence:.2f})")

        if confidence > threshold:
            return best_match
        else:
            return "unknown"

    except Exception as e:
        print(f"Error identifying speaker: {e}")
        return None

# Update the /process endpoint to include speaker ID
@app.route('/process', methods=['POST'])
def process():
    """Process complete voice command with speaker identification"""
    # ... existing code ...

    # Add speaker identification
    speaker = identify_speaker(temp_path) if speaker_profiles else None

    if speaker:
        print(f"Detected speaker: {speaker}")
        # Could personalize response based on speaker

    # ... rest of processing ...
```

### Enrollment Script

Create `enroll_speaker.py`:

```python
#!/usr/bin/env python3
"""
Enroll users for speaker identification

Usage:
    python enroll_speaker.py --name Alice --audio alice_sample.wav
    python enroll_speaker.py --name Alice --duration 20  # Record live
"""

import argparse
import os
import sys
import wave

import numpy as np
import pyaudio
from resemblyzer import VoiceEncoder, preprocess_wav

def record_audio(duration=20, sample_rate=16000):
    """Record audio from microphone"""
    print(f"Recording for {duration} seconds...")
    print("Speak naturally - read a paragraph, have a conversation, etc.")

    chunk = 1024
    audio_format = pyaudio.paInt16  # renamed to avoid shadowing builtin format()
    channels = 1

    p = pyaudio.PyAudio()

    stream = p.open(
        format=audio_format,
        channels=channels,
        rate=sample_rate,
        input=True,
        frames_per_buffer=chunk
    )

    frames = []
    for _ in range(0, int(sample_rate / chunk * duration)):
        data = stream.read(chunk)
        frames.append(data)

    stream.stop_stream()
    stream.close()
    p.terminate()

    # Save to temp file
    temp_file = f"/tmp/enrollment_{os.getpid()}.wav"
    wf = wave.open(temp_file, 'wb')
    wf.setnchannels(channels)
    wf.setsampwidth(p.get_sample_size(audio_format))
    wf.setframerate(sample_rate)
    wf.writeframes(b''.join(frames))
    wf.close()

    return temp_file

def enroll_speaker(name, audio_file, profiles_dir='~/voice-assistant/profiles'):
    """Create voice profile for speaker"""
    profiles_dir = os.path.expanduser(profiles_dir)
    os.makedirs(profiles_dir, exist_ok=True)

    # Initialize encoder
    encoder = VoiceEncoder()

    # Process audio
    wav = preprocess_wav(audio_file)
    embedding = encoder.embed_utterance(wav)

    # Save profile
    profile_path = os.path.join(profiles_dir, f'{name}.npy')
    np.save(profile_path, embedding)

    print(f"✓ Enrolled speaker: {name}")
    print(f"  Profile saved to: {profile_path}")

    return profile_path

def main():
    parser = argparse.ArgumentParser(description="Enroll speaker for voice identification")
    parser.add_argument('--name', required=True, help='Speaker name')
    parser.add_argument('--audio', help='Path to audio file (wav)')
    parser.add_argument('--duration', type=int, default=20,
                        help='Recording duration if not using audio file')
    parser.add_argument('--profiles-dir', default='~/voice-assistant/profiles',
                        help='Directory to save profiles')

    args = parser.parse_args()

    # Get audio file
    if args.audio:
        audio_file = args.audio
        if not os.path.exists(audio_file):
            print(f"Error: Audio file not found: {audio_file}")
            return 1
    else:
        audio_file = record_audio(args.duration)

    # Enroll speaker
    try:
        enroll_speaker(args.name, audio_file, args.profiles_dir)
        return 0
    except Exception as e:
        print(f"Error enrolling speaker: {e}")
        return 1

if __name__ == '__main__':
    sys.exit(main())
```

## Performance Comparison

### Single Wake Word
- **Latency:** 100-200ms
- **CPU:** ~5-10% (idle)
- **Memory:** ~100MB
- **Accuracy:** 95%+

### Multiple Wake Words (3 models)
- **Latency:** 100-200ms (parallel)
- **CPU:** ~15-30% (idle)
- **Memory:** ~300MB
- **Accuracy:** 95%+ each

### With Speaker Identification
- **Additional latency:** +100-200ms
- **Additional CPU:** +5% during ID
- **Additional memory:** +50MB
- **Accuracy:** 85-95% (depending on enrollment quality)

## Best Practices

### Wake Word Selection
1. **Different enough** - "Hey Mycroft" vs "Hey Jarvis" (not "Hey Alice" vs "Hey Alex")
2. **Clear consonants** - Easier to detect
3. **2-3 syllables** - Not too short, not too long
4. **Test in environment** - Check for false triggers

### Training
1. **Include all users** - If using single model
2. **Diverse conditions** - Different rooms, noise levels
3. **Regular updates** - Add false positives weekly
4. **Per-user models** - Higher accuracy, more compute

### Speaker Identification
1. **Quality enrollment** - 20+ seconds of clear speech
2. **Re-enroll periodically** - Voices change (colds, etc.)
3. **Test thresholds** - Balance accuracy vs false IDs
4. **Graceful fallback** - Handle unknown speakers
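
For the threshold-testing point above, a quick sweep over held-out similarity scores makes the trade-off concrete. The scores below are illustrative (4 genuine clips, 3 impostor clips); in practice you would collect them by running `identify_speaker` on labeled recordings:

```python
# (similarity score, is_correct_speaker) for held-out clips
scores = [(0.92, True), (0.85, True), (0.74, True), (0.69, True),
          (0.66, False), (0.55, False), (0.48, False)]

def evaluate(threshold):
    """True-accept and false-accept rates at a threshold
    (4 genuine / 3 impostor clips in this toy set)."""
    accepted = [ok for s, ok in scores if s > threshold]
    if not accepted:
        return 0.0, 0.0
    true_accepts = sum(accepted)
    return true_accepts / 4, (len(accepted) - true_accepts) / 3

for t in (0.5, 0.6, 0.7):
    tar, far = evaluate(t)
    print(f"threshold={t}: true-accept={tar:.2f}, false-accept={far:.2f}")
```

Raising the threshold from 0.5 to 0.7 here eliminates false accepts at the cost of rejecting one genuine clip, which is the balance point 3 asks you to tune.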

## Recommended Path for You

```bash
# Week 1: Start with pre-trained "Hey Mycroft"
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz
precise-listen hey-mycroft.net  # Test it!

# Week 2: Fine-tune with your voices
precise-train -e 30 hey-mycroft-custom.net . \
    --from-checkpoint hey-mycroft.net

# Week 3: Add speaker identification
pip install resemblyzer
python enroll_speaker.py --name Alan --duration 20
python enroll_speaker.py --name [Family Member] --duration 20

# Week 4: Add second wake word ("Hey Jarvis" for Plex?)
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-jarvis.tar.gz
tar xzf hey-jarvis.tar.gz
# Run both in parallel

# Month 2+: Optimize and expand
# - More wake words for different contexts
# - Per-user wake word models
# - Context-aware responses
```

This gives you a smooth progression from simple to advanced!