# Wake Word Models: Pre-trained, Multiple, and Voice Adaptation

## Pre-trained Wake Word Models

### Yes! "Hey Mycroft" Already Exists

Mycroft provides several pre-trained models that you can use immediately:

### Available Pre-trained Models
**Hey Mycroft (Official)**

```bash
# Download from Mycroft's model repository
cd ~/precise-models/pretrained
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz

# Test immediately
conda activate precise
precise-listen hey-mycroft.net
# Should detect "Hey Mycroft" right away!
```
**Other Available Models:**

- **Hey Mycroft** - Best tested, most reliable
- **Christopher** - Alternative wake word
- **Hey Jarvis** - Community contributed
- **Computer** - Star Trek style
### Using Pre-trained Models

**Option 1: Use as-is**

```bash
# Just point your server to the pre-trained model
python voice_server.py \
    --enable-precise \
    --precise-model ~/precise-models/pretrained/hey-mycroft.net \
    --precise-sensitivity 0.5
```
**Option 2: Fine-tune for your voice**

```bash
# Use pre-trained as starting point, add your samples
cd ~/precise-models/my-hey-mycroft

# Record additional samples
precise-collect

# Train from checkpoint (much faster than from scratch!)
precise-train -e 30 my-hey-mycroft.net . \
    --from-checkpoint ~/precise-models/pretrained/hey-mycroft.net
# This adds your voice/environment while keeping the base model
```
**Option 3: Ensemble with custom**

```bash
# Use both pre-trained and custom model
# Require both to agree (reduces false positives)
# See implementation below
```
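The agreement logic for an ensemble can be sketched as minimal Python glue (the class name and the one-second window are illustrative, not part of Precise): wire each model's `on_activation` callback to `report(...)`, and only treat a detection as real when both models fire within the window.

```python
import time

class EnsembleWakeWord:
    """Require two wake word models to agree before accepting a detection."""

    def __init__(self, window_s=1.0):
        self.window_s = window_s  # both models must fire within this window
        self.last_fired = {}      # model name -> last activation timestamp

    def report(self, model_name, now=None):
        """Record an activation; return True only when another model
        has also fired within the agreement window."""
        now = time.time() if now is None else now
        self.last_fired[model_name] = now
        others = [t for name, t in self.last_fired.items()
                  if name != model_name]
        return any(now - t <= self.window_s for t in others)
```

Usage: call `report('pretrained')` and `report('custom')` from the respective runners' callbacks; a lone activation returns `False`, while a second activation inside the window returns `True`.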
### Advantages of Pre-trained Models

✅ **Instant deployment** - No training required
✅ **Proven accuracy** - Tested by thousands of users
✅ **Good starting point** - Fine-tune rather than train from scratch
✅ **Multiple speakers** - Already includes diverse voices
✅ **Save time** - Skip 1-2 hours of training

### Disadvantages

❌ **Generic** - Not optimized for your voice/environment
❌ **May need tuning** - Threshold adjustment required
❌ **Limited choice** - Only a few wake words available

### Recommendation

**Start with the "Hey Mycroft" pre-trained model:**

1. Deploy immediately (zero training time)
2. Test in your environment
3. Collect false positives/negatives
4. Fine-tune with your examples
5. Best of both worlds!
## Multiple Wake Words

### Can You Have Multiple Wake Words?

**Short answer:** Yes, but with tradeoffs.

### Implementation Approaches

#### Approach 1: Server-Side Multiple Models (Recommended)

Run multiple Precise models in parallel on Heimdall:
```python
# In voice_server.py
import os
import queue
import time

from precise_runner import PreciseEngine, PreciseRunner

# Global runners for each wake word
precise_runners = {}
wake_word_queue = queue.Queue()

wake_word_configs = {
    'hey_mycroft': {
        'model': '~/precise-models/pretrained/hey-mycroft.net',
        'sensitivity': 0.5,
        'response': 'Yes?'
    },
    'hey_computer': {
        'model': '~/precise-models/hey-computer/hey-computer.net',
        'sensitivity': 0.5,
        'response': "I'm listening"
    },
    'jarvis': {
        'model': '~/precise-models/jarvis/jarvis.net',
        'sensitivity': 0.6,
        'response': 'At your service, sir'
    }
}

def on_wake_word_detected(wake_word_name):
    """Build a callback that knows which wake word fired"""
    def callback():
        print(f"Wake word detected: {wake_word_name}")
        wake_word_queue.put({
            'timestamp': time.time(),
            'wake_word': wake_word_name,
            'response': wake_word_configs[wake_word_name]['response']
        })
    return callback

def start_multiple_wake_words():
    """Start one Precise listener per configured wake word"""
    for name, config in wake_word_configs.items():
        engine = PreciseEngine(
            '/usr/local/bin/precise-engine',
            os.path.expanduser(config['model'])
        )
        runner = PreciseRunner(
            engine,
            sensitivity=config['sensitivity'],
            on_activation=on_wake_word_detected(name)
        )
        runner.start()
        precise_runners[name] = runner
        print(f"Started wake word listener: {name}")
```
Resource Usage:
- CPU: ~5-10% per model (3 models = ~15-30%)
- RAM: ~100-200MB per model
- Still very manageable on Heimdall
Pros:
✅ Different wake words for different purposes
✅ Family members can choose preferred wake word
✅ Context-aware responses
✅ Easy to add/remove models
Cons:
❌ Higher CPU usage (scales linearly)
❌ Increased false positive risk (rates are roughly additive: 3 models ≈ 3× the false-accept chance)
❌ More complex configuration
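The false positive point deserves a number: if each model independently false-fires with probability p over some listening period, the chance that at least one fires is 1 − (1 − p)^n, which is approximately n·p when p is small. A quick check (the 1% rate is a made-up illustration):

```python
def combined_false_accept(p, n):
    """Probability that at least one of n independent models
    false-fires, given each false-fires with probability p."""
    return 1 - (1 - p) ** n

# Example: each model false-fires in 1% of listening windows
print(combined_false_accept(0.01, 1))  # about 0.01
print(combined_false_accept(0.01, 3))  # about 0.0297, close to 3 * 0.01
```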
#### Approach 2: Edge Multiple Models (K210)

**Challenge:** K210 has limited resources.

**Option A: Sequential checking (Feasible)**

```python
# Check each model in sequence (MaixPy-style pseudocode)
models = ['hey-mycroft.kmodel', 'hey-computer.kmodel']

for model in models:
    kpu_task = kpu.load(f"/sd/models/{model}")
    result = kpu.run(kpu_task, audio_features)
    if result > threshold:
        return model  # Wake word detected
```
Resource impact:
- Latency: +50-100ms per additional model
- Memory: Models must fit in 6MB total
- CPU: ~30% per model check
**Option B: Combined model (Advanced)**

```python
# Train a single model that recognizes multiple phrases
# Each phrase maps to a different output class
# More complex training but single inference
```
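As a concrete illustration of the combined-model idea, the decoding side could look like this (plain Python; the labels, scores, and 0.7 threshold are illustrative — a real K210 model would produce the per-class scores from audio features):

```python
def decode_multi_wake(scores, labels, threshold=0.7):
    """Pick the wake phrase from a multi-class model's output.

    scores: one confidence per class; labels: class names, where
    'background' means "no wake word heard". Returns the detected
    phrase, or None.
    """
    best = max(range(len(scores)), key=lambda i: scores[i])
    if labels[best] == 'background' or scores[best] < threshold:
        return None
    return labels[best]

labels = ['background', 'hey_mycroft', 'hey_computer']
print(decode_multi_wake([0.10, 0.85, 0.05], labels))  # hey_mycroft
print(decode_multi_wake([0.90, 0.05, 0.05], labels))  # None
```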
Recommendation for edge:
- 1-2 wake words max on K210
- Server-side for 3+ wake words
#### Approach 3: Contextual Wake Words

Different wake words trigger different behaviors:

```python
wake_word_contexts = {
    'hey_mycroft': 'general',    # General commands
    'hey_assistant': 'general',  # Alternative general
    'emergency': 'priority',     # High priority
    'goodnight': 'bedtime',      # Bedtime routine
}

def handle_wake_word(wake_word, command):
    context = wake_word_contexts[wake_word]
    if context == 'priority':
        # Skip queue, process immediately
        # Maybe call emergency contact
        pass
    elif context == 'bedtime':
        # Trigger bedtime automation
        # Lower volume for responses
        pass
    else:
        # Normal processing
        pass
```
### Best Practices for Multiple Wake Words

1. **Start with one** - Get it working well first
2. **Add gradually** - One at a time, test thoroughly
3. **Different purposes** - Each wake word should have a reason
4. **Monitor performance** - Track false positives per wake word
5. **User preference** - Let family members choose their favorite
### Recommended Configuration

**For most users:**

```python
wake_words = {
    'hey_mycroft': 'primary',      # Main wake word (pre-trained)
    'hey_computer': 'alternative'  # Custom trained for your voice
}
```

**For power users:**

```python
wake_words = {
    'hey_mycroft': 'general',
    'jarvis': 'personal_assistant',   # Custom responses
    'computer': 'technical_queries',  # Different intent parser
}
```

**For families:**

```python
wake_words = {
    'hey_mycroft': 'shared',  # Everyone can use
    'dad': 'user_alan',       # Personalized
    'mom': 'user_sarah',      # Personalized
    'kids': 'user_children',  # Kid-safe responses
}
```
## Voice Adaptation and Multi-User Support

### Challenge: Different Voices, Same Wake Word
When multiple people use the system:
- Different accents
- Different speech patterns
- Different pronunciations
- Different vocal characteristics
### Solution Approaches

#### Approach 1: Diverse Training Data (Recommended)

During initial training:
```bash
# Have everyone in the household record samples
cd ~/precise-models/hey-computer

# Alan records 30 samples
precise-collect  # Record as user 1
# Sarah records 30 samples
precise-collect  # Record as user 2
# Kids record 20 samples
precise-collect  # Record as user 3

# Combine all in the training set
# Train one model that works for everyone
./3-train-model.sh
```
Pros:
✅ Single model for everyone
✅ No user switching needed
✅ Simple to maintain
✅ Works immediately for all users
Cons:
❌ May have lower per-person accuracy
❌ Requires upfront time from everyone
❌ Hard to add new users later
#### Approach 2: Incremental Training

Start with your voice, add others over time:
```bash
# Week 1: Train with Alan's voice
cd ~/precise-models/hey-computer
# Record and train with Alan's samples
precise-train -e 60 hey-computer.net .

# Week 2: Sarah wants to use it
# Collect Sarah's samples
mkdir -p sarah-samples/wake-word
precise-collect  # Sarah records 20-30 samples

# Add to existing training set
cp sarah-samples/wake-word/* wake-word/

# Retrain (continue from checkpoint)
precise-train -e 30 hey-computer.net . \
    --from-checkpoint hey-computer.net

# Now works for both Alan and Sarah!
```
Pros:
✅ Gradual improvement
✅ Don't need everyone upfront
✅ Easy to add new users
✅ Maintains accuracy for existing users
Cons:
❌ May not work well for new users initially
❌ Requires retraining periodically
#### Approach 3: Per-User Models with Speaker Identification

Train separate models + identify who's speaking:

**Step 1: Train per-user wake word models**

```bash
# Alan's model
~/precise-models/hey-computer-alan/
# Sarah's model
~/precise-models/hey-computer-sarah/
# Kids' model
~/precise-models/hey-computer-kids/
```
**Step 2: Use speaker identification**

```python
# Pseudo-code for speaker identification
def identify_speaker(audio):
    """
    Identify speaker from voice characteristics
    using speaker embeddings (x-vectors, d-vectors)
    """
    # Extract speaker embedding
    embedding = speaker_encoder.encode(audio)

    # Compare to known users
    similarities = {
        'alan': cosine_similarity(embedding, alan_embedding),
        'sarah': cosine_similarity(embedding, sarah_embedding),
        'kids': cosine_similarity(embedding, kids_embedding),
    }

    # Return the most similar
    return max(similarities, key=similarities.get)

def process_command(audio):
    # Detect wake word with all models
    wake_detected = check_all_models(audio)
    if wake_detected:
        # Identify speaker
        speaker = identify_speaker(audio)

        # Use speaker-specific model for better accuracy
        model = f'~/precise-models/hey-computer-{speaker}/'

        # Continue with speaker context
        process_with_context(audio, speaker)
```
**Speaker identification libraries:**

- **Resemblyzer** - Simple speaker verification
- **speechbrain** - Complete toolkit
- **pyannote.audio** - You already use this for diarization!
**Implementation:**

```bash
# You already have pyannote for diarization!
conda activate voice-assistant
pip install pyannote.audio --break-system-packages
```

```python
# Can use speaker embeddings for identification
from pyannote.audio import Inference
from scipy.spatial.distance import cosine

# Load speaker embedding model
inference = Inference(
    "pyannote/embedding",
    use_auth_token=hf_token
)

# Extract embeddings for known users
alan_embedding = inference("alan_voice_sample.wav")
sarah_embedding = inference("sarah_voice_sample.wav")

# Compare with incoming audio
unknown_embedding = inference(audio_buffer)

alan_similarity = 1 - cosine(unknown_embedding, alan_embedding)
sarah_similarity = 1 - cosine(unknown_embedding, sarah_embedding)

if alan_similarity > 0.8:
    user = 'alan'
elif sarah_similarity > 0.8:
    user = 'sarah'
else:
    user = 'unknown'
```
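One clip per user makes enrollment fragile; a common refinement is to average embeddings from several enrollment clips into a single voiceprint per user. A minimal pure-Python sketch (list-based stand-ins for real embedding vectors; the 0.8 threshold mirrors the value above, and the helper names are illustrative):

```python
import math

def average_embedding(embeddings):
    """Average several enrollment embeddings into one voiceprint."""
    n = len(embeddings)
    return [sum(vals) / n for vals in zip(*embeddings)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def identify(unknown, voiceprints, threshold=0.8):
    """Return the best-matching user, or 'unknown' below threshold."""
    scores = {user: cosine_similarity(unknown, vp)
              for user, vp in voiceprints.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else 'unknown'
```

With real pyannote embeddings you would build `voiceprints` once from a handful of WAV files per user, then call `identify(...)` on each incoming buffer.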
Pros:
✅ Personalized responses per user
✅ Better accuracy (model optimized for each voice)
✅ User-specific preferences/permissions
✅ Can track who said what
Cons:
❌ More complex setup
❌ Higher resource usage
❌ Requires voice samples from each user
❌ Privacy considerations
#### Approach 4: Adaptive/Online Learning

Model improves automatically based on usage:

```python
class AdaptiveWakeWord:
    def __init__(self, base_model):
        self.base_model = base_model
        self.user_samples = []
        self.retrain_threshold = 50  # Retrain after N samples

    def on_detection(self, audio, user_confirmed=True):
        """User confirms this was a correct detection"""
        if user_confirmed:
            self.user_samples.append(audio)

        # Periodically retrain
        if len(self.user_samples) >= self.retrain_threshold:
            self.retrain_with_samples()
            self.user_samples = []

    def retrain_with_samples(self):
        """Background retraining with collected samples"""
        # Add samples to training set
        # Retrain model
        # Swap in new model
        pass
```
Pros:
✅ Automatic improvement
✅ Adapts to user's voice over time
✅ No manual retraining
✅ Gets better with use
Cons:
❌ Complex implementation
❌ Requires user feedback mechanism
❌ Risk of drift/degradation
❌ Background training overhead
### Recommended Strategy

**Phase 1: Single Wake Word, Single Model**

```bash
# Week 1-2
# Use pre-trained "Hey Mycroft"
# OR train custom "Hey Computer" with all family members' voices
# Keep it simple, get it working
```

**Phase 2: Add Fine-tuning**

```bash
# Week 3-4
# Collect false positives/negatives
# Retrain with household-specific data
# Optimize threshold
```

**Phase 3: Consider Multiple Wake Words**

```bash
# Month 2
# If needed, add a second wake word
# "Hey Mycroft" for general
# "Jarvis" for personal assistant tasks
```

**Phase 4: Personalization**

```bash
# Month 3+
# If desired, add speaker identification
# Personalized responses
# User-specific preferences
```
## Practical Examples

### Example 1: Family of 4, Single Model

```bash
# Training session with everyone
cd ~/precise-models/hey-mycroft-family

# Dad records 25 samples
precise-collect
# Mom records 25 samples
precise-collect
# Kid 1 records 15 samples
precise-collect
# Kid 2 records 15 samples
precise-collect

# Collect shared negative samples (200+)
# TV, music, conversation, etc.
precise-collect -f not-wake-word/household.wav

# Train a single model for everyone
precise-train -e 60 hey-mycroft-family.net .

# Deploy
python voice_server.py \
    --enable-precise \
    --precise-model hey-mycroft-family.net
```

**Result:** Everyone can use it, one model, simple.
### Example 2: Two Wake Words, Different Purposes

```python
# voice_server.py configuration
wake_words = {
    'hey_mycroft': {
        'model': 'hey-mycroft.net',
        'sensitivity': 0.5,
        'intent_parser': 'general',  # All commands
        'response': 'Yes?'
    },
    'emergency': {
        'model': 'emergency.net',
        'sensitivity': 0.7,            # Higher threshold
        'intent_parser': 'emergency',  # Limited commands
        'response': 'Emergency mode activated'
    }
}

# "Hey Mycroft, turn on the lights" - works
# "Emergency, call for help" - triggers emergency protocol
```
### Example 3: Speaker Identification + Personalization

```python
# Enhanced processing with speaker ID
def process_with_speaker_id(audio, speaker):
    # Different HA entity based on speaker
    entity_maps = {
        'alan': {
            'bedroom_light': 'light.master_bedroom',
            'office_light': 'light.alan_office',
        },
        'sarah': {
            'bedroom_light': 'light.master_bedroom',
            'office_light': 'light.sarah_office',
        },
        'kids': {
            'bedroom_light': 'light.kids_bedroom',
            'tv': None,  # Kids can't control TV
        }
    }

    # Transcribe command
    text = whisper_transcribe(audio)
    # e.g. "Turn on bedroom light"

    if 'bedroom light' in text:
        entity = entity_maps[speaker]['bedroom_light']
        ha_client.turn_on(entity)
        response = "Turned on your bedroom light"
        return response
```
## Resource Requirements

### Single Wake Word

- CPU: 5-10% (Heimdall)
- RAM: 100-200MB
- Model size: 1-3MB
- Training time: 30-60 min

### Multiple Wake Words (3 models)

- CPU: 15-30% (Heimdall)
- RAM: 300-600MB
- Model size: 3-9MB total
- Training time: 90-180 min

### With Speaker Identification

- CPU: +5-10% for speaker ID
- RAM: +200-300MB for embedding model
- Model size: +50MB for speaker model
- Setup time: +30-60 min for voice enrollment

### K210 Edge (Maix Duino)

- Single model: Feasible, ~30% CPU
- 2 models: Feasible, ~60% CPU, higher latency
- 3+ models: Not recommended
- Speaker ID: Not feasible (limited RAM/compute)
## Quick Decision Guide

- **Just getting started?** → Use pre-trained "Hey Mycroft"
- **Want a custom wake word?** → Train one model with all family voices
- **Need multiple wake words?** → Start server-side with 2-3 models
- **Want personalization?** → Add speaker identification
- **Deploying to edge (K210)?** → Stick to 1-2 wake words maximum
- **Family of 4+ people?** → Train a single model with everyone's voice
- **Privacy is paramount?** → Skip speaker ID, use a single universal model
## Testing Multiple Wake Words

```bash
# Test all wake words quickly
conda activate precise

# Terminal 1: Hey Mycroft
precise-listen hey-mycroft.net
# Terminal 2: Hey Computer
precise-listen hey-computer.net
# Terminal 3: Emergency
precise-listen emergency.net

# Say each wake word, verify correct detection
```
## Conclusion

**For Your Maix Duino Project:**

Recommended approach:

1. **Start with "Hey Mycroft"** - Use the pre-trained model
2. **Fine-tune if needed** - Add your household's voices
3. **Consider a 2nd wake word** - Only if you have a specific use case
4. **Speaker ID** - Phase 2/3 enhancement, not critical for MVP
5. **Keep it simple** - One wake word works great for most users
The pre-trained "Hey Mycroft" model saves you 1-2 hours and works immediately. You can always fine-tune or add custom wake words later!
Multiple wake words are cool but not necessary - Most commercial products use just one. Focus on making one wake word work really well before adding more.
Voice adaptation - Training with multiple voices upfront is simpler than per-user models. Save speaker ID for later if you need personalization.
## Quick Start with Pre-trained

```bash
# On Heimdall
cd ~/precise-models/pretrained
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz

# Test it
conda activate precise
precise-listen hey-mycroft.net

# Deploy
cd ~/voice-assistant
python voice_server.py \
    --enable-precise \
    --precise-model ~/precise-models/pretrained/hey-mycroft.net

# You're done! No training needed!
```
That's it - you have a working wake word in 5 minutes! 🎉