# Wake Word Models: Pre-trained, Multiple, and Voice Adaptation
## Pre-trained Wake Word Models
### Yes! "Hey Mycroft" Already Exists
Mycroft provides several pre-trained models that you can use immediately:
#### Available Pre-trained Models
**Hey Mycroft** (Official)
```bash
# Download from Mycroft's model repository
cd ~/precise-models/pretrained
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz
# Test immediately
conda activate precise
precise-listen hey-mycroft.net
# Should detect "Hey Mycroft" right away!
```
**Other Available Models:**
- **Hey Mycroft** - Best tested, most reliable
- **Christopher** - Alternative wake word
- **Hey Jarvis** - Community contributed
- **Computer** - Star Trek style
#### Using Pre-trained Models
**Option 1: Use as-is**
```bash
# Just point your server to the pre-trained model
python voice_server.py \
    --enable-precise \
    --precise-model ~/precise-models/pretrained/hey-mycroft.net \
    --precise-sensitivity 0.5
```
**Option 2: Fine-tune for your voice**
```bash
# Use pre-trained as starting point, add your samples
cd ~/precise-models/my-hey-mycroft
# Record additional samples
precise-collect
# Train from checkpoint (much faster than from scratch!)
precise-train -e 30 my-hey-mycroft.net . \
    --from-checkpoint ~/precise-models/pretrained/hey-mycroft.net
# This adds your voice/environment while keeping the base model
```
**Option 3: Ensemble with custom**
```python
# Use both pre-trained and custom model
# Require both to agree (reduces false positives)
# See implementation below
```
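The "require both to agree" idea can be sketched as a small gate that only fires when every model has reported an activation within a short window. This is a minimal sketch, not from the original code: the `EnsembleGate` name and the 1.5-second window are assumptions, and the wiring comments assume the `PreciseRunner` usage shown later in this guide.

```python
import time

class EnsembleGate:
    """Fire only when all models activate within a short window (hypothetical helper)."""

    def __init__(self, window=1.5):
        self.window = window   # seconds within which both models must fire
        self.last_seen = {}    # model name -> last activation timestamp

    def activation(self, model_name, now=None):
        """Record one model's activation; return True once both agree."""
        now = time.time() if now is None else now
        self.last_seen[model_name] = now
        times = list(self.last_seen.values())
        return len(times) >= 2 and max(times) - min(times) <= self.window

# Wiring sketch (assumes two PreciseRunner instances as in Approach 1 below):
#
# gate = EnsembleGate()
# pretrained_runner = PreciseRunner(pretrained_engine,
#     on_activation=lambda: gate.activation('pretrained') and handle_wake())
# custom_runner = PreciseRunner(custom_engine,
#     on_activation=lambda: gate.activation('custom') and handle_wake())
```

Because the gate only passes when both models agree, a false positive from either model alone is suppressed, at the cost of slightly higher latency.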
### Advantages of Pre-trained Models
✅ **Instant deployment** - No training required
✅ **Proven accuracy** - Tested by thousands of users
✅ **Good starting point** - Fine-tune rather than train from scratch
✅ **Multiple speakers** - Already includes diverse voices
✅ **Save time** - Skip 1-2 hours of training
### Disadvantages
❌ **Generic** - Not optimized for your voice/environment
❌ **May need tuning** - Threshold adjustment required
❌ **Limited choice** - Only a few wake words available
### Recommendation
**Start with "Hey Mycroft"** pre-trained model:
1. Deploy immediately (zero training time)
2. Test in your environment
3. Collect false positives/negatives
4. Fine-tune with your examples
5. Best of both worlds!
## Multiple Wake Words
### Can You Have Multiple Wake Words?
**Short answer:** Yes, but with tradeoffs.
### Implementation Approaches
#### Approach 1: Server-Side Multiple Models (Recommended)
Run multiple Precise models in parallel on Heimdall:
```python
# In voice_server.py
import os
import queue
import time

from precise_runner import PreciseEngine, PreciseRunner

# Global runners for each wake word
precise_runners = {}

# Consumed by the command-processing loop elsewhere in voice_server.py
wake_word_queue = queue.Queue()

wake_word_configs = {
    'hey_mycroft': {
        'model': '~/precise-models/pretrained/hey-mycroft.net',
        'sensitivity': 0.5,
        'response': 'Yes?'
    },
    'hey_computer': {
        'model': '~/precise-models/hey-computer/hey-computer.net',
        'sensitivity': 0.5,
        'response': "I'm listening"
    },
    'jarvis': {
        'model': '~/precise-models/jarvis/jarvis.net',
        'sensitivity': 0.6,
        'response': 'At your service, sir'
    }
}

def on_wake_word_detected(wake_word_name):
    """Build a callback that carries the wake word identifier"""
    def callback():
        print(f"Wake word detected: {wake_word_name}")
        wake_word_queue.put({
            'timestamp': time.time(),
            'wake_word': wake_word_name,
            'response': wake_word_configs[wake_word_name]['response']
        })
    return callback

def start_multiple_wake_words():
    """Start one Precise listener per configured wake word"""
    for name, config in wake_word_configs.items():
        engine = PreciseEngine(
            '/usr/local/bin/precise-engine',
            os.path.expanduser(config['model'])
        )
        runner = PreciseRunner(
            engine,
            sensitivity=config['sensitivity'],
            on_activation=on_wake_word_detected(name)
        )
        runner.start()
        precise_runners[name] = runner
        print(f"Started wake word listener: {name}")
```
**Resource Usage:**
- CPU: ~5-10% per model (3 models = ~15-30%)
- RAM: ~100-200MB per model
- Still very manageable on Heimdall
**Pros:**
✅ Different wake words for different purposes
✅ Family members can choose preferred wake word
✅ Context-aware responses
✅ Easy to add/remove models
**Cons:**
❌ Higher CPU usage (scales linearly)
❌ Increased false positive risk (3x models = 3x chance)
❌ More complex configuration
#### Approach 2: Edge Multiple Models (K210)
**Challenge:** K210 has limited resources
**Option A: Sequential checking** (Feasible)
```python
# Check each model in sequence
import KPU as kpu  # MaixPy KPU module

def check_wake_words(audio_features, threshold):
    models = ['hey-mycroft.kmodel', 'hey-computer.kmodel']
    for model in models:
        kpu_task = kpu.load(f"/sd/models/{model}")
        result = kpu.run(kpu_task, audio_features)
        if result > threshold:
            return model  # Wake word detected
    return None
```
**Resource impact:**
- Latency: +50-100ms per additional model
- Memory: Models must fit in 6MB total
- CPU: ~30% per model check
**Option B: Combined model** (Advanced)
```python
# Train a single model that recognizes multiple phrases
# Each phrase maps to different output class
# More complex training but single inference
```
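A combined model would emit one score per phrase, so a single inference covers every wake word. Mapping the output vector back to a wake word might look like the sketch below; the class layout (index 0 reserved for "no wake word") and the 0.7 threshold are assumptions, since the real layout depends on how the model is trained:

```python
# Hypothetical output classes for a combined multi-phrase model;
# index 0 is reserved for "no wake word".
CLASSES = ['none', 'hey_mycroft', 'hey_computer']

def decode_combined_output(scores, threshold=0.7):
    """Return the detected wake word name, or None."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    if best == 0 or scores[best] < threshold:
        return None
    return CLASSES[best]
```

This keeps the edge device at one inference per audio frame regardless of how many phrases the model knows.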
**Recommendation for edge:**
- **1-2 wake words max** on K210
- **Server-side** for 3+ wake words
#### Approach 3: Contextual Wake Words
Different wake words trigger different behaviors:
```python
wake_word_contexts = {
    'hey_mycroft': 'general',    # General commands
    'hey_assistant': 'general',  # Alternative general
    'emergency': 'priority',     # High priority
    'goodnight': 'bedtime',      # Bedtime routine
}

def handle_wake_word(wake_word, command):
    context = wake_word_contexts[wake_word]
    if context == 'priority':
        # Skip queue, process immediately
        # Maybe call emergency contact
        pass
    elif context == 'bedtime':
        # Trigger bedtime automation
        # Lower volume for responses
        pass
    else:
        # Normal processing
        pass
```
### Best Practices for Multiple Wake Words
1. **Start with one** - Get it working well first
2. **Add gradually** - One at a time, test thoroughly
3. **Different purposes** - Each wake word should have a reason
4. **Monitor performance** - Track false positives per wake word
5. **User preference** - Let family members choose their favorite
### Recommended Configuration
**For most users:**
```python
wake_words = {
    'hey_mycroft': 'primary',      # Main wake word (pre-trained)
    'hey_computer': 'alternative'  # Custom trained for your voice
}
```
**For power users:**
```python
wake_words = {
    'hey_mycroft': 'general',
    'jarvis': 'personal_assistant',   # Custom responses
    'computer': 'technical_queries',  # Different intent parser
}
```
**For families:**
```python
wake_words = {
    'hey_mycroft': 'shared',   # Everyone can use
    'dad': 'user_alan',        # Personalized
    'mom': 'user_sarah',       # Personalized
    'kids': 'user_children',   # Kid-safe responses
}
```
## Voice Adaptation and Multi-User Support
### Challenge: Different Voices, Same Wake Word
When multiple people use the system:
- Different accents
- Different speech patterns
- Different pronunciations
- Different vocal characteristics
### Solution Approaches
#### Approach 1: Diverse Training Data (Recommended)
**During initial training:**
```bash
# Have everyone in household record samples
cd ~/precise-models/hey-computer
# Alan records 30 samples
precise-collect # Record as user 1
# Sarah records 30 samples
precise-collect # Record as user 2
# Kids record 20 samples
precise-collect # Record as user 3
# Combine all in training set
# Train one model that works for everyone
./3-train-model.sh
```
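The "combine all in training set" step can be scripted. This sketch merges each person's recordings into one training folder, prefixing filenames with the user's directory name so recordings don't collide; the directory layout is hypothetical:

```python
import shutil
from pathlib import Path

def merge_samples(user_dirs, dest):
    """Copy each user's wav files into one shared training folder."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    copied = 0
    for user_dir in user_dirs:
        user_dir = Path(user_dir)
        for wav in sorted(user_dir.glob('*.wav')):
            # Prefix with the user folder name to avoid filename collisions
            shutil.copy(wav, dest / f"{user_dir.name}-{wav.name}")
            copied += 1
    return copied
```

After merging, train as usual with the combined folder as the wake-word set.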
**Pros:**
✅ Single model for everyone
✅ No user switching needed
✅ Simple to maintain
✅ Works immediately for all users
**Cons:**
❌ May have lower per-person accuracy
❌ Requires upfront time from everyone
❌ Hard to add new users later
#### Approach 2: Incremental Training
Start with your voice, add others over time:
```bash
# Week 1: Train with Alan's voice
cd ~/precise-models/hey-computer
# Record and train with Alan's samples
precise-train -e 60 hey-computer.net .
# Week 2: Sarah wants to use it
# Collect Sarah's samples
mkdir -p sarah-samples/wake-word
precise-collect # Sarah records 20-30 samples
# Add to existing training set
cp sarah-samples/wake-word/* wake-word/
# Retrain (continue from checkpoint)
precise-train -e 30 hey-computer.net . \
    --from-checkpoint hey-computer.net
# Now works for both Alan and Sarah!
```
**Pros:**
✅ Gradual improvement
✅ Don't need everyone upfront
✅ Easy to add new users
✅ Maintains accuracy for existing users
**Cons:**
❌ May not work well for new users initially
❌ Requires retraining periodically
#### Approach 3: Per-User Models with Speaker Identification
Train separate models + identify who's speaking:
**Step 1: Train per-user wake word models**
```bash
# Alan's model
~/precise-models/hey-computer-alan/
# Sarah's model
~/precise-models/hey-computer-sarah/
# Kids' model
~/precise-models/hey-computer-kids/
```
**Step 2: Use speaker identification**
```python
# Pseudo-code for speaker identification
def identify_speaker(audio):
    """
    Identify speaker from voice characteristics
    using speaker embeddings (x-vectors, d-vectors)
    """
    # Extract speaker embedding
    embedding = speaker_encoder.encode(audio)
    # Compare to known users
    similarities = {
        'alan': cosine_similarity(embedding, alan_embedding),
        'sarah': cosine_similarity(embedding, sarah_embedding),
        'kids': cosine_similarity(embedding, kids_embedding),
    }
    # Return most similar
    return max(similarities, key=similarities.get)

def process_command(audio):
    # Detect wake word with all models
    wake_detected = check_all_models(audio)
    if wake_detected:
        # Identify speaker
        speaker = identify_speaker(audio)
        # Use speaker-specific model for better accuracy
        model = f'~/precise-models/hey-computer-{speaker}/'
        # Continue with speaker context
        process_with_context(audio, speaker)
```
**Speaker identification libraries:**
- **Resemblyzer** - Simple speaker verification
- **speechbrain** - Complete toolkit
- **pyannote.audio** - You already use this for diarization!
**Implementation:**
```bash
# You already have pyannote for diarization!
conda activate voice-assistant
pip install pyannote.audio --break-system-packages
# Can use speaker embeddings for identification
```
```python
from pyannote.audio import Inference
from scipy.spatial.distance import cosine

# Load speaker embedding model
inference = Inference(
    "pyannote/embedding",
    use_auth_token=hf_token
)

# Extract embeddings for known users
alan_embedding = inference("alan_voice_sample.wav")
sarah_embedding = inference("sarah_voice_sample.wav")

# Compare with incoming audio
unknown_embedding = inference(audio_buffer)

alan_similarity = 1 - cosine(unknown_embedding, alan_embedding)
sarah_similarity = 1 - cosine(unknown_embedding, sarah_embedding)

if alan_similarity > 0.8:
    user = 'alan'
elif sarah_similarity > 0.8:
    user = 'sarah'
else:
    user = 'unknown'
```
**Pros:**
✅ Personalized responses per user
✅ Better accuracy (model optimized for each voice)
✅ User-specific preferences/permissions
✅ Can track who said what
**Cons:**
❌ More complex setup
❌ Higher resource usage
❌ Requires voice samples from each user
❌ Privacy considerations
#### Approach 4: Adaptive/Online Learning
Model improves automatically based on usage:
```python
class AdaptiveWakeWord:
    def __init__(self, base_model):
        self.base_model = base_model
        self.user_samples = []
        self.retrain_threshold = 50  # Retrain after N samples

    def on_detection(self, audio, user_confirmed=True):
        """User confirms this was a correct detection"""
        if user_confirmed:
            self.user_samples.append(audio)
            # Periodically retrain
            if len(self.user_samples) >= self.retrain_threshold:
                self.retrain_with_samples()
                self.user_samples = []

    def retrain_with_samples(self):
        """Background retraining with collected samples"""
        # Add samples to training set
        # Retrain model
        # Swap in new model
        pass
```
**Pros:**
✅ Automatic improvement
✅ Adapts to user's voice over time
✅ No manual retraining
✅ Gets better with use
**Cons:**
❌ Complex implementation
❌ Requires user feedback mechanism
❌ Risk of drift/degradation
❌ Background training overhead
## Recommended Strategy
### Phase 1: Single Wake Word, Single Model
```bash
# Week 1-2
# Use pre-trained "Hey Mycroft"
# OR train custom "Hey Computer" with all family members' voices
# Keep it simple, get it working
```
### Phase 2: Add Fine-tuning
```bash
# Week 3-4
# Collect false positives/negatives
# Retrain with household-specific data
# Optimize threshold
```
### Phase 3: Consider Multiple Wake Words
```bash
# Month 2
# If needed, add second wake word
# "Hey Mycroft" for general
# "Jarvis" for personal assistant tasks
```
### Phase 4: Personalization
```bash
# Month 3+
# If desired, add speaker identification
# Personalized responses
# User-specific preferences
```
## Practical Examples
### Example 1: Family of 4, Single Model
```bash
# Training session with everyone
cd ~/precise-models/hey-mycroft-family
# Dad records 25 samples
precise-collect
# Mom records 25 samples
precise-collect
# Kid 1 records 15 samples
precise-collect
# Kid 2 records 15 samples
precise-collect
# Collect shared negative samples (200+)
# TV, music, conversation, etc.
precise-collect -f not-wake-word/household.wav
# Train single model for everyone
precise-train -e 60 hey-mycroft-family.net .
# Deploy
python voice_server.py \
    --enable-precise \
    --precise-model hey-mycroft-family.net
```
**Result:** Everyone can use it, one model, simple.
### Example 2: Two Wake Words, Different Purposes
```python
# voice_server.py configuration
wake_words = {
    'hey_mycroft': {
        'model': 'hey-mycroft.net',
        'sensitivity': 0.5,
        'intent_parser': 'general',    # All commands
        'response': 'Yes?'
    },
    'emergency': {
        'model': 'emergency.net',
        'sensitivity': 0.7,            # Higher threshold
        'intent_parser': 'emergency',  # Limited commands
        'response': 'Emergency mode activated'
    }
}

# "Hey Mycroft, turn on the lights" - works
# "Emergency, call for help" - triggers emergency protocol
```
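Routing a transcribed command through the per-wake-word intent parser from a configuration like the one above might look like this. It's a sketch: the `dispatch` helper and the `parsers` registry are hypothetical, not part of voice_server.py.

```python
def dispatch(wake_word, command_text, wake_words, parsers):
    """Route a command to the intent parser configured for its wake word."""
    config = wake_words.get(wake_word)
    if config is None:
        return None  # Unknown wake word, ignore
    parser = parsers[config['intent_parser']]
    # Return the spoken acknowledgment plus the parsed intent
    return config['response'], parser(command_text)
```

With this shape, "Emergency, call for help" reaches only the limited emergency parser, while general commands go through the normal one.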
### Example 3: Speaker Identification + Personalization
```python
# Enhanced processing with speaker ID
def process_with_speaker_id(audio, speaker):
    # Different HA entity based on speaker
    entity_maps = {
        'alan': {
            'bedroom_light': 'light.master_bedroom',
            'office_light': 'light.alan_office',
        },
        'sarah': {
            'bedroom_light': 'light.master_bedroom',
            'office_light': 'light.sarah_office',
        },
        'kids': {
            'bedroom_light': 'light.kids_bedroom',
            'tv': None,  # Kids can't control TV
        }
    }
    # Transcribe command
    text = whisper_transcribe(audio)
    # "Turn on bedroom light"
    if 'bedroom light' in text:
        entity = entity_maps[speaker]['bedroom_light']
        ha_client.turn_on(entity)
        response = "Turned on your bedroom light"
        return response
```
## Resource Requirements
### Single Wake Word
- **CPU:** 5-10% (Heimdall)
- **RAM:** 100-200MB
- **Model size:** 1-3MB
- **Training time:** 30-60 min
### Multiple Wake Words (3 models)
- **CPU:** 15-30% (Heimdall)
- **RAM:** 300-600MB
- **Model size:** 3-9MB total
- **Training time:** 90-180 min
### With Speaker Identification
- **CPU:** +5-10% for speaker ID
- **RAM:** +200-300MB for embedding model
- **Model size:** +50MB for speaker model
- **Setup time:** +30-60 min for voice enrollment
### K210 Edge (Maix Duino)
- **Single model:** Feasible, ~30% CPU
- **2 models:** Feasible, ~60% CPU, higher latency
- **3+ models:** Not recommended
- **Speaker ID:** Not feasible (limited RAM/compute)
## Quick Decision Guide
**Just getting started?**
→ Use pre-trained "Hey Mycroft"
**Want custom wake word?**
→ Train one model with all family voices
**Need multiple wake words?**
→ Start server-side with 2-3 models
**Want personalization?**
→ Add speaker identification
**Deploying to edge (K210)?**
→ Stick to 1-2 wake words maximum
**Family of 4+ people?**
→ Train single model with everyone's voice
**Privacy is paramount?**
→ Skip speaker ID, use single universal model
## Testing Multiple Wake Words
```bash
# Test all wake words quickly
conda activate precise
# Terminal 1: Hey Mycroft
precise-listen hey-mycroft.net
# Terminal 2: Hey Computer
precise-listen hey-computer.net
# Terminal 3: Emergency
precise-listen emergency.net
# Say each wake word, verify correct detection
```
## Conclusion
### For Your Maix Duino Project:
**Recommended approach:**
1. **Start with "Hey Mycroft"** - Use pre-trained model
2. **Fine-tune if needed** - Add your household's voices
3. **Consider 2nd wake word** - Only if you have a specific use case
4. **Speaker ID** - Phase 2/3 enhancement, not critical for MVP
5. **Keep it simple** - One wake word works great for most users
**The pre-trained "Hey Mycroft" model saves you 1-2 hours** and works immediately. You can always fine-tune or add custom wake words later!
**Multiple wake words are cool but not necessary** - Most commercial products use just one. Focus on making one wake word work really well before adding more.
**Voice adaptation** - Training with multiple voices upfront is simpler than per-user models. Save speaker ID for later if you need personalization.
## Quick Start with Pre-trained
```bash
# On Heimdall
cd ~/precise-models/pretrained
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz
# Test it
conda activate precise
precise-listen hey-mycroft.net
# Deploy
cd ~/voice-assistant
python voice_server.py \
    --enable-precise \
    --precise-model ~/precise-models/pretrained/hey-mycroft.net
# You're done! No training needed!
```
**That's it - you have a working wake word in 5 minutes!** 🎉