Ports prior voice assistant research and prototypes from devl/Devops into the Minerva repo. Includes:
- docs/: architecture, wake word guides, ESP32-S3 spec, hardware buying guide
- scripts/: voice_server.py, voice_server_enhanced.py, setup scripts
- hardware/maixduino/: edge device scripts with WiFi credentials scrubbed (hardcoded password replaced with a secrets.py pattern)
- config/.env.example: server config template
- .gitignore: excludes .env, secrets.py, model blobs, ELF firmware
- CLAUDE.md: Minerva product context and connection to the cf-voice roadmap
Advanced Wake Word Topics - Pre-trained Models, Multiple Wake Words, and Voice Adaptation
Pre-trained Mycroft Models
Yes! Pre-trained Models Exist
Mycroft AI provides several pre-trained wake word models you can use immediately:
Available Models:
- Hey Mycroft - Original Mycroft wake word (most training data)
- Hey Jarvis - Popular alternative
- Christopher - Alternative wake word
- Hey Ezra - Another option
Download Pre-trained Models
# On Heimdall
conda activate precise
cd ~/precise-models
# Create directory for pre-trained models
mkdir -p pretrained
cd pretrained
# Download Hey Mycroft (recommended starting point)
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz
# Download other models
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-jarvis.tar.gz
tar xzf hey-jarvis.tar.gz
# List available models
ls -lh *.net
Test Pre-trained Model
conda activate precise
# Test Hey Mycroft
precise-listen hey-mycroft.net
# Speak "Hey Mycroft" - should see "!" when detected
# Press Ctrl+C to exit
# Test with different threshold
precise-listen hey-mycroft.net -t 0.7 # More conservative
Use Pre-trained Model in Voice Server
cd ~/voice-assistant
# Start server with Hey Mycroft model
python voice_server.py \
--enable-precise \
--precise-model ~/precise-models/pretrained/hey-mycroft.net \
--precise-sensitivity 0.5
Fine-tune Pre-trained Models
You can use pre-trained models as a starting point and fine-tune with your voice:
cd ~/precise-models
mkdir -p hey-mycroft-custom
# Copy base model
cp pretrained/hey-mycroft.net hey-mycroft-custom/
# Collect your samples
cd hey-mycroft-custom
precise-collect # Record 20-30 samples of YOUR voice
# Fine-tune from pre-trained model
precise-train -e 30 hey-mycroft-custom.net . \
--from-checkpoint ../pretrained/hey-mycroft.net
# This is MUCH faster than training from scratch!
Benefits:
- ✅ Start with proven model
- ✅ Much less training data needed (20-30 vs 100+ samples)
- ✅ Faster training (30 mins vs 60 mins)
- ✅ Good baseline accuracy
Multiple Wake Words
Architecture Options
Option 1: Multiple Models in Parallel (Server-Side Only)
Run multiple Precise instances simultaneously:
# In voice_server.py - multiple wake word detection
import time
import threading

from precise_runner import PreciseEngine, PreciseRunner

# Global runners (wake_word_queue is assumed to be defined elsewhere in voice_server.py)
precise_runners = {}

def on_wake_word_detected(wake_word_name):
    """Callback factory for different wake words"""
    def callback():
        print(f"Wake word detected: {wake_word_name}")
        wake_word_queue.put({
            'wake_word': wake_word_name,
            'timestamp': time.time()
        })
    return callback

def start_multiple_wake_words(wake_word_configs):
    """
    Start multiple wake word detectors

    Args:
        wake_word_configs: List of dicts with 'name', 'model', 'sensitivity'

    Example:
        configs = [
            {'name': 'hey mycroft', 'model': 'hey-mycroft.net', 'sensitivity': 0.5},
            {'name': 'hey jarvis', 'model': 'hey-jarvis.net', 'sensitivity': 0.5}
        ]
    """
    global precise_runners
    for config in wake_word_configs:
        engine = PreciseEngine(
            '/usr/local/bin/precise-engine',
            config['model']
        )
        runner = PreciseRunner(
            engine,
            sensitivity=config['sensitivity'],
            on_activation=on_wake_word_detected(config['name'])
        )
        runner.start()
        precise_runners[config['name']] = runner
        print(f"Started wake word detector: {config['name']}")
Server-Side Multiple Wake Words:
# Start server with multiple wake words
python voice_server.py \
--enable-precise \
--precise-models "hey-mycroft:~/models/hey-mycroft.net:0.5,hey-jarvis:~/models/hey-jarvis.net:0.5"
Performance Impact:
- CPU: ~5-10% per model (can run 2-3 easily)
- Memory: ~50-100MB per model
- Latency: Minimal (all run in parallel)
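On the consuming side, one worker thread can drain events from all parallel detectors through the shared queue. A minimal, runnable sketch; `wake_word_queue` and the handler are stand-ins for the server's own objects:

```python
import queue
import threading
import time

wake_word_queue = queue.Queue()

def wake_word_worker(handle):
    """Drain detection events produced by any of the parallel runners."""
    while True:
        event = wake_word_queue.get()
        if event is None:  # sentinel used here to stop the worker
            break
        handle(event)

# Demo: collect events into a list instead of triggering STT
seen = []
worker = threading.Thread(target=wake_word_worker, args=(seen.append,))
worker.start()
wake_word_queue.put({'wake_word': 'hey jarvis', 'timestamp': time.time()})
wake_word_queue.put(None)
worker.join()
print([e['wake_word'] for e in seen])  # ['hey jarvis']
```

Because `queue.Queue` is thread-safe, the detector callbacks can put events from their own threads without extra locking.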
Option 2: Single Model, Multiple Phrases (Edge or Server)
Train ONE model that responds to multiple phrases:
cd ~/precise-models/multi-wake
conda activate precise
# Record samples for BOTH wake words in the SAME dataset
# Label all as "wake-word" regardless of which phrase
mkdir -p wake-word not-wake-word
# Record "Hey Mycroft" samples
precise-collect # Save to wake-word/hey-mycroft-*.wav
# Record "Hey Computer" samples
precise-collect # Save to wake-word/hey-computer-*.wav
# Record negatives
precise-collect -f not-wake-word/random.wav
# Train single model on both phrases
precise-train -e 60 multi-wake.net .
Pros:
- ✅ Single model = less compute
- ✅ Works on edge (K210)
- ✅ Easy to deploy
Cons:
- ❌ Can't tell which wake word was used
- ❌ May reduce accuracy for each individual phrase
- ❌ Higher false positive risk
Option 3: Sequential Detection (Edge)
Detect wake word, then identify which one:
# Pseudo-code for edge detection
if wake_word_detected():
    audio_snippet = last_2_seconds()

    # Run all models on the audio snippet
    scores = {
        'hey-mycroft': model1.score(audio_snippet),
        'hey-jarvis': model2.score(audio_snippet),
        'hey-computer': model3.score(audio_snippet)
    }

    # Use the highest-scoring wake word
    wake_word = max(scores, key=scores.get)
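The selection step can be made runnable on its own; `pick_wake_word` and the 0.5 cutoff are illustrative choices, not part of Precise:

```python
def pick_wake_word(scores, threshold=0.5):
    """Return the highest-scoring wake word, or None if nothing clears the threshold."""
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

print(pick_wake_word({'hey-mycroft': 0.21, 'hey-jarvis': 0.88, 'hey-computer': 0.35}))  # hey-jarvis
print(pick_wake_word({'hey-mycroft': 0.21, 'hey-jarvis': 0.18, 'hey-computer': 0.35}))  # None
```

Rejecting low-scoring maxima matters here: without the threshold, any noise that triggered the first-stage detector would always be attributed to some wake word.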
Recommendations
Server-Side (Heimdall):
- ✅ Use Option 1 - Multiple models in parallel
- Run 2-3 wake words easily
- Each can have different sensitivity
- Can identify which wake word was used
- Example: "Hey Mycroft" for commands, "Hey Jarvis" for queries
Edge (Maix Duino K210):
- ✅ Use Option 2 - Single multi-phrase model
- K210 can handle 1 model efficiently
- Train on 2-3 phrases max
- Simpler deployment
- Lower latency
Voice Adaptation & Multi-User Support
Approach 1: Inclusive Training (Recommended)
Train ONE model on EVERYONE'S voices:
cd ~/precise-models/family-wake-word
conda activate precise
# Record samples from each family member
# Alice records 30 samples
precise-collect # Save as wake-word/alice-*.wav
# Bob records 30 samples
precise-collect # Save as wake-word/bob-*.wav
# Carol records 30 samples
precise-collect # Save as wake-word/carol-*.wav
# Train on all voices
precise-train -e 60 family-wake-word.net .
Pros:
- ✅ Everyone can use the system
- ✅ Single model deployment
- ✅ Works for all family members
- ✅ Simple maintenance
Cons:
- ❌ Can't identify who spoke
- ❌ May need more training data
- ❌ No personalization
Best for: Family voice assistant, shared devices
Approach 2: Speaker Identification (Advanced)
Detect wake word, then identify speaker:
# Architecture with speaker ID

# Step 1: Precise detects wake word
if wake_word_detected():
    # Step 2: Capture voice sample
    voice_sample = record_audio(duration=3)

    # Step 3: Speaker identification
    # (uses voice embeddings from a neural network)
    speaker = identify_speaker(voice_sample)

    # Step 4: Process with user context
    process_command(voice_sample, user=speaker)
Implementation Options:
Option A: Use resemblyzer (Voice Embeddings)
pip install resemblyzer --break-system-packages
# Enrollment phase
python enroll_users.py
# Each user records 10-20 seconds of speech
# System creates voice profile (embedding)
# Runtime
python speaker_id.py
# Compares incoming audio to stored embeddings
# Returns most likely speaker
Example Code:
from resemblyzer import VoiceEncoder, preprocess_wav
import numpy as np
import os

# Initialize encoder
encoder = VoiceEncoder()

# Enrollment - do once per user
def enroll_user(name, audio_files):
    """Create voice profile for user"""
    embeddings = []
    for audio_file in audio_files:
        wav = preprocess_wav(audio_file)
        embedding = encoder.embed_utterance(wav)
        embeddings.append(embedding)

    # Average embeddings for robustness
    user_profile = np.mean(embeddings, axis=0)

    # Save profile
    np.save(f'profiles/{name}.npy', user_profile)
    return user_profile

# Identification - run each time
def identify_speaker(audio_file, profiles_dir='profiles'):
    """Identify which enrolled user is speaking"""
    wav = preprocess_wav(audio_file)
    test_embedding = encoder.embed_utterance(wav)

    # Load all profiles
    profiles = {}
    for profile_file in os.listdir(profiles_dir):
        name = profile_file.replace('.npy', '')
        profile = np.load(os.path.join(profiles_dir, profile_file))
        profiles[name] = profile

    # Calculate similarity to each profile
    # (resemblyzer utterance embeddings are L2-normalized, so a dot
    # product behaves like cosine similarity)
    similarities = {}
    for name, profile in profiles.items():
        similarity = np.dot(test_embedding, profile)
        similarities[name] = similarity

    # Return most similar
    best_match = max(similarities, key=similarities.get)
    confidence = similarities[best_match]

    if confidence > 0.7:  # Threshold
        return best_match
    else:
        return "unknown"
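The matching logic can be exercised without any audio. The unit vectors below are synthetic stand-ins for real resemblyzer embeddings (which are much higher-dimensional); the names and the 0.7 threshold are illustrative:

```python
import numpy as np

def match_speaker(test_embedding, profiles, threshold=0.7):
    """Return (name, score) for the closest profile, or ("unknown", score) below threshold."""
    similarities = {name: float(np.dot(test_embedding, p)) for name, p in profiles.items()}
    best = max(similarities, key=similarities.get)
    score = similarities[best]
    return (best if score > threshold else "unknown"), score

# Synthetic stand-ins for enrolled profiles
profiles = {
    "alice": np.array([1.0, 0.0, 0.0]),
    "bob":   np.array([0.0, 1.0, 0.0]),
}

print(match_speaker(np.array([0.95, 0.05, 0.0]), profiles))  # close to alice's profile
print(match_speaker(np.array([0.4, 0.4, 0.8]), profiles))    # below threshold -> unknown
```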
Option B: Use pyannote.audio (Production-grade)
pip install pyannote.audio --break-system-packages
# Requires HuggingFace token (same as diarization)
Example:
from pyannote.audio import Inference
from scipy.spatial.distance import cosine

# Initialize (window="whole" returns one embedding per file
# instead of a sliding window of embeddings)
inference = Inference(
    "pyannote/embedding",
    window="whole",
    use_auth_token="your_hf_token"
)

# Enroll users
alice_profile = inference("alice_sample.wav")
bob_profile = inference("bob_sample.wav")

# Identify
test_embedding = inference("test_audio.wav")

# Compare
alice_similarity = 1 - cosine(test_embedding, alice_profile)
bob_similarity = 1 - cosine(test_embedding, bob_profile)

if alice_similarity > bob_similarity and alice_similarity > 0.7:
    speaker = "Alice"
elif bob_similarity > 0.7:
    speaker = "Bob"
else:
    speaker = "Unknown"
Pros:
- ✅ Can identify individual users
- ✅ Personalized responses
- ✅ User-specific commands/permissions
- ✅ Better for privacy (know who's speaking)
Cons:
- ❌ More complex implementation
- ❌ Requires enrollment phase
- ❌ Additional processing time (~100-200ms)
- ❌ May fail with similar voices
Approach 3: Per-User Wake Word Models
Each person has their OWN wake word:
# Alice's wake word: "Hey Mycroft"
# Train on ONLY Alice's voice
# Bob's wake word: "Hey Jarvis"
# Train on ONLY Bob's voice
# Carol's wake word: "Hey Computer"
# Train on ONLY Carol's voice
Deployment: Run all 3 models in parallel (server-side):
wake_word_configs = [
{'name': 'Alice', 'wake_word': 'hey mycroft', 'model': 'alice-wake.net'},
{'name': 'Bob', 'wake_word': 'hey jarvis', 'model': 'bob-wake.net'},
{'name': 'Carol', 'wake_word': 'hey computer', 'model': 'carol-wake.net'}
]
Pros:
- ✅ Automatic user identification
- ✅ Highest accuracy per user
- ✅ Clear user separation
- ✅ No additional speaker ID needed
Cons:
- ❌ Requires 3x models (server only)
- ❌ Users must remember their wake word
- ❌ 3x CPU usage (~15-30%)
- ❌ Can't work on edge (K210)
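Since each model belongs to exactly one user, identification falls out of knowing which model fired; `user_for_model` is a name invented for this sketch, and the config shape follows the example above:

```python
wake_word_configs = [
    {'name': 'Alice', 'wake_word': 'hey mycroft', 'model': 'alice-wake.net'},
    {'name': 'Bob', 'wake_word': 'hey jarvis', 'model': 'bob-wake.net'},
    {'name': 'Carol', 'wake_word': 'hey computer', 'model': 'carol-wake.net'},
]

def user_for_model(model_file, configs):
    """Map the model that fired back to its enrolled user."""
    for config in configs:
        if config['model'] == model_file:
            return config['name']
    return None

print(user_for_model('bob-wake.net', wake_word_configs))  # Bob
```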
Approach 4: Context-Based Adaptation
No speaker ID, but learn from interaction:
# Track command patterns
user_context = {
    'last_command': 'turn on living room lights',
    'frequent_entities': ['light.living_room', 'light.bedroom'],
    'time_of_day_patterns': {'morning': 'coffee maker', 'evening': 'tv'},
    'location': 'home'  # vs 'away'
}

# Use context to improve intent recognition
# (command and time_of_day() are placeholders for your own parsing/clock helpers)
if command == "turn on the lights" and time_of_day() == 'morning':
    # Ambiguous request - fall back to the most frequently used entity
    entity = user_context['frequent_entities'][0]
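A runnable toy version of the same idea; every name here (`resolve_entity`, the context keys, the hour cutoffs) is invented for illustration:

```python
user_context = {
    'frequent_entities': ['light.living_room', 'light.bedroom'],
    'time_of_day_patterns': {'morning': 'coffee maker', 'evening': 'tv'},
}

def resolve_entity(command, hour, context):
    """Pick a target entity for an ambiguous command from simple usage history."""
    period = 'morning' if 5 <= hour < 12 else 'evening'
    if 'lights' in command:
        # Ambiguous "lights" request: fall back to the most frequently used light
        return context['frequent_entities'][0]
    return context['time_of_day_patterns'][period]

print(resolve_entity('turn on the lights', 8, user_context))  # light.living_room
print(resolve_entity('turn it on', 20, user_context))         # tv
```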
Pros:
- ✅ No enrollment needed
- ✅ Improves over time
- ✅ Simple to implement
- ✅ Works with any number of users
Cons:
- ❌ No true user identification
- ❌ May make incorrect assumptions
- ❌ Privacy concerns (tracking behavior)
Recommended Strategy
For Your Use Case
Based on your home lab setup, I recommend:
Phase 1: Single Wake Word, Inclusive Training (Week 1-2)
# Start simple
cd ~/precise-models/hey-computer
conda activate precise
# Have all family members record samples
# Alice: 30 samples of "Hey Computer"
# Bob: 30 samples of "Hey Computer"
# You: 30 samples of "Hey Computer"
# Train single model on all voices
precise-train -e 60 hey-computer.net .
# Deploy to server
python voice_server.py \
--enable-precise \
--precise-model hey-computer.net
Why:
- Simple to set up and test
- Everyone can use it immediately
- Single model = easier debugging
- Works on edge if you migrate later
Phase 2: Add Speaker Identification (Week 3-4)
# Install resemblyzer
pip install resemblyzer --break-system-packages
# Enroll users
python enroll_users.py
# Each person speaks for 20 seconds
# Update voice_server.py to identify speaker
# Use speaker ID for personalized responses
Why:
- Enables personalization
- Can track preferences per user
- User-specific command permissions
- Better privacy (know who's speaking)
Phase 3: Multiple Wake Words (Month 2+)
# Add alternative wake words for different contexts
# "Hey Mycroft" - General commands
# "Hey Jarvis" - Media/Plex control
# "Computer" - Quick commands (lights, temp)
# Deploy multiple models on server
python voice_server.py \
--enable-precise \
--precise-models "mycroft:hey-mycroft.net:0.5,jarvis:hey-jarvis.net:0.5"
Why:
- Different wake words for different contexts
- Reduces false positives (more specific triggers)
- Fun factor (Jarvis for media!)
- Server can handle 2-3 easily
Implementation Guide: Multiple Wake Words
Update voice_server.py for Multiple Wake Words
# Add to voice_server.py
def start_multiple_wake_words(configs):
    """
    Start multiple wake word detectors

    Args:
        configs: List of dicts with 'name', 'model_path', 'sensitivity'
    """
    global precise_runners
    precise_runners = {}

    for config in configs:
        try:
            engine = PreciseEngine(
                DEFAULT_PRECISE_ENGINE,
                config['model_path']
            )

            def make_callback(wake_word_name):
                def callback():
                    print(f"Wake word detected: {wake_word_name}")
                    wake_word_queue.put({
                        'wake_word': wake_word_name,
                        'timestamp': time.time(),
                        'source': 'precise'
                    })
                return callback

            runner = PreciseRunner(
                engine,
                sensitivity=config['sensitivity'],
                on_activation=make_callback(config['name'])
            )
            runner.start()

            precise_runners[config['name']] = runner
            print(f"✓ Started: {config['name']} (sensitivity: {config['sensitivity']})")
        except Exception as e:
            print(f"✗ Failed to start {config['name']}: {e}")

    return len(precise_runners) > 0

# Add to main()
parser.add_argument('--precise-models',
                    help='Multiple models: name:path:sensitivity,name2:path2:sensitivity2')

# Parse multiple models
if args.precise_models:
    configs = []
    for model_spec in args.precise_models.split(','):
        name, path, sensitivity = model_spec.split(':')
        configs.append({
            'name': name,
            'model_path': os.path.expanduser(path),
            'sensitivity': float(sensitivity)
        })
    start_multiple_wake_words(configs)
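The spec-string parsing can be pulled into a standalone helper so it is testable on its own; `parse_model_specs` is a name invented here, and the name:path:sensitivity format matches the `--precise-models` flag proposed in this guide:

```python
import os

def parse_model_specs(spec):
    """Parse 'name:path:sensitivity,...' into a list of config dicts."""
    configs = []
    for model_spec in spec.split(','):
        name, path, sensitivity = model_spec.split(':')
        configs.append({
            'name': name,
            'model_path': os.path.expanduser(path),
            'sensitivity': float(sensitivity),
        })
    return configs

configs = parse_model_specs("mycroft:/models/hey-mycroft.net:0.5,jarvis:/models/hey-jarvis.net:0.6")
print([c['name'] for c in configs])  # ['mycroft', 'jarvis']
```

Note that a model path containing a colon would break this format; the paths used in this guide don't contain one.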
Usage Example
cd ~/voice-assistant
# Start with multiple wake words
python voice_server.py \
--enable-precise \
--precise-models "\
hey-mycroft:~/precise-models/pretrained/hey-mycroft.net:0.5,\
hey-jarvis:~/precise-models/pretrained/hey-jarvis.net:0.5"
Implementation Guide: Speaker Identification
Add to voice_server.py
# Add resemblyzer support
try:
    from resemblyzer import VoiceEncoder, preprocess_wav
    import numpy as np
    SPEAKER_ID_AVAILABLE = True
except ImportError:
    SPEAKER_ID_AVAILABLE = False
    print("Warning: resemblyzer not available. Speaker ID disabled.")

# Initialize encoder
voice_encoder = None
speaker_profiles = {}

def load_speaker_profiles(profiles_dir='~/voice-assistant/profiles'):
    """Load enrolled speaker profiles"""
    global speaker_profiles, voice_encoder

    if not SPEAKER_ID_AVAILABLE:
        return False

    profiles_dir = os.path.expanduser(profiles_dir)
    if not os.path.exists(profiles_dir):
        print(f"No speaker profiles found at {profiles_dir}")
        return False

    # Initialize encoder
    voice_encoder = VoiceEncoder()

    # Load all profiles
    for profile_file in os.listdir(profiles_dir):
        if profile_file.endswith('.npy'):
            name = profile_file.replace('.npy', '')
            profile = np.load(os.path.join(profiles_dir, profile_file))
            speaker_profiles[name] = profile
            print(f"Loaded speaker profile: {name}")

    return len(speaker_profiles) > 0

def identify_speaker(audio_path, threshold=0.7):
    """Identify speaker from audio file"""
    if not SPEAKER_ID_AVAILABLE or not speaker_profiles:
        return None

    try:
        # Get embedding for test audio
        wav = preprocess_wav(audio_path)
        test_embedding = voice_encoder.embed_utterance(wav)

        # Compare to all profiles
        similarities = {}
        for name, profile in speaker_profiles.items():
            similarity = np.dot(test_embedding, profile)
            similarities[name] = similarity

        # Get best match
        best_match = max(similarities, key=similarities.get)
        confidence = similarities[best_match]

        print(f"Speaker ID: {best_match} (confidence: {confidence:.2f})")

        if confidence > threshold:
            return best_match
        else:
            return "unknown"
    except Exception as e:
        print(f"Error identifying speaker: {e}")
        return None

# Update process endpoint to include speaker ID
@app.route('/process', methods=['POST'])
def process():
    """Process complete voice command with speaker identification"""
    # ... existing code ...

    # Add speaker identification
    speaker = identify_speaker(temp_path) if speaker_profiles else None
    if speaker:
        print(f"Detected speaker: {speaker}")
        # Could personalize response based on speaker

    # ... rest of processing ...
Enrollment Script
Create enroll_speaker.py:
#!/usr/bin/env python3
"""
Enroll users for speaker identification

Usage:
    python enroll_speaker.py --name Alice --audio alice_sample.wav
    python enroll_speaker.py --name Alice --duration 20  # Record live
"""
import argparse
import os
import sys

import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav
import pyaudio
import wave

def record_audio(duration=20, sample_rate=16000):
    """Record audio from microphone"""
    print(f"Recording for {duration} seconds...")
    print("Speak naturally - read a paragraph, have a conversation, etc.")

    chunk = 1024
    format = pyaudio.paInt16
    channels = 1

    p = pyaudio.PyAudio()
    stream = p.open(
        format=format,
        channels=channels,
        rate=sample_rate,
        input=True,
        frames_per_buffer=chunk
    )

    frames = []
    for i in range(0, int(sample_rate / chunk * duration)):
        data = stream.read(chunk)
        frames.append(data)

    stream.stop_stream()
    stream.close()
    p.terminate()

    # Save to temp file
    temp_file = f"/tmp/enrollment_{os.getpid()}.wav"
    wf = wave.open(temp_file, 'wb')
    wf.setnchannels(channels)
    wf.setsampwidth(p.get_sample_size(format))
    wf.setframerate(sample_rate)
    wf.writeframes(b''.join(frames))
    wf.close()

    return temp_file

def enroll_speaker(name, audio_file, profiles_dir='~/voice-assistant/profiles'):
    """Create voice profile for speaker"""
    profiles_dir = os.path.expanduser(profiles_dir)
    os.makedirs(profiles_dir, exist_ok=True)

    # Initialize encoder
    encoder = VoiceEncoder()

    # Process audio
    wav = preprocess_wav(audio_file)
    embedding = encoder.embed_utterance(wav)

    # Save profile
    profile_path = os.path.join(profiles_dir, f'{name}.npy')
    np.save(profile_path, embedding)

    print(f"✓ Enrolled speaker: {name}")
    print(f"  Profile saved to: {profile_path}")

    return profile_path

def main():
    parser = argparse.ArgumentParser(description="Enroll speaker for voice identification")
    parser.add_argument('--name', required=True, help='Speaker name')
    parser.add_argument('--audio', help='Path to audio file (wav)')
    parser.add_argument('--duration', type=int, default=20,
                        help='Recording duration if not using audio file')
    parser.add_argument('--profiles-dir', default='~/voice-assistant/profiles',
                        help='Directory to save profiles')
    args = parser.parse_args()

    # Get audio file
    if args.audio:
        audio_file = args.audio
        if not os.path.exists(audio_file):
            print(f"Error: Audio file not found: {audio_file}")
            return 1
    else:
        audio_file = record_audio(args.duration)

    # Enroll speaker
    try:
        enroll_speaker(args.name, audio_file, args.profiles_dir)
        return 0
    except Exception as e:
        print(f"Error enrolling speaker: {e}")
        return 1

if __name__ == '__main__':
    sys.exit(main())
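The .npy save/load round-trip that enroll_speaker and load_speaker_profiles rely on can be checked in isolation; the temp directory and the fake 3-element embedding here are stand-ins for real profiles:

```python
import os
import tempfile
import numpy as np

profiles_dir = tempfile.mkdtemp()

# Save a fake profile the way enroll_speaker does
embedding = np.array([0.1, 0.2, 0.3])
np.save(os.path.join(profiles_dir, 'alice.npy'), embedding)

# Load profiles back the way load_speaker_profiles does
profiles = {}
for profile_file in os.listdir(profiles_dir):
    if profile_file.endswith('.npy'):
        name = profile_file.replace('.npy', '')
        profiles[name] = np.load(os.path.join(profiles_dir, profile_file))

print(sorted(profiles), np.allclose(profiles['alice'], embedding))  # ['alice'] True
```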
Performance Comparison
Single Wake Word
- Latency: 100-200ms
- CPU: ~5-10% (idle)
- Memory: ~100MB
- Accuracy: 95%+
Multiple Wake Words (3 models)
- Latency: 100-200ms (parallel)
- CPU: ~15-30% (idle)
- Memory: ~300MB
- Accuracy: 95%+ each
With Speaker Identification
- Additional latency: +100-200ms
- Additional CPU: +5% during ID
- Additional memory: +50MB
- Accuracy: 85-95% (depending on enrollment quality)
Best Practices
Wake Word Selection
- Different enough - "Hey Mycroft" vs "Hey Jarvis" (not "Hey Alice" vs "Hey Alex")
- Clear consonants - Easier to detect
- 2-3 syllables - Not too short, not too long
- Test in environment - Check for false triggers
Training
- Include all users - If using single model
- Diverse conditions - Different rooms, noise levels
- Regular updates - Add false positives weekly
- Per-user models - Higher accuracy, more compute
Speaker Identification
- Quality enrollment - 20+ seconds of clear speech
- Re-enroll periodically - Voices change (colds, etc.)
- Test thresholds - Balance accuracy vs false IDs
- Graceful fallback - Handle unknown speakers
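"Test thresholds" can be done systematically with similarity scores collected from genuine and impostor trials; the score lists below are made up for illustration:

```python
def sweep_thresholds(genuine, impostor, thresholds):
    """For each threshold, report (false-reject rate, false-accept rate)."""
    results = {}
    for t in thresholds:
        frr = sum(1 for s in genuine if s <= t) / len(genuine)   # genuine speaker rejected
        far = sum(1 for s in impostor if s > t) / len(impostor)  # impostor accepted
        results[t] = (frr, far)
    return results

# Made-up similarity scores from testing after enrollment
genuine = [0.82, 0.75, 0.91, 0.68, 0.88]   # same speaker as the profile
impostor = [0.45, 0.62, 0.71, 0.38, 0.55]  # different speakers

for t, (frr, far) in sweep_thresholds(genuine, impostor, [0.6, 0.7, 0.8]).items():
    print(f"threshold {t}: false reject {frr:.0%}, false accept {far:.0%}")
```

Raising the threshold trades false accepts for false rejects; pick the threshold whose trade-off fits your use (for command permissions, fewer false accepts; for casual personalization, fewer false rejects).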
Recommended Path for You
# Week 1: Start with pre-trained "Hey Mycroft"
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
precise-listen hey-mycroft.net # Test it!
# Week 2: Fine-tune with your voices
precise-train -e 30 hey-mycroft-custom.net . \
--from-checkpoint hey-mycroft.net
# Week 3: Add speaker identification
pip install resemblyzer
python enroll_speaker.py --name Alan --duration 20
python enroll_speaker.py --name [Family Member] --duration 20
# Week 4: Add second wake word ("Hey Jarvis" for Plex?)
wget hey-jarvis.tar.gz
# Run both in parallel
# Month 2+: Optimize and expand
# - More wake words for different contexts
# - Per-user wake word models
# - Context-aware responses
This gives you a smooth progression from simple to advanced!