minerva/docs/WAKE_WORD_ADVANCED.md
Wake Word Models: Pre-trained, Multiple, and Voice Adaptation

Pre-trained Wake Word Models

Yes! "Hey Mycroft" Already Exists

Mycroft provides several pre-trained models that you can use immediately:

Available Pre-trained Models

Hey Mycroft (Official)

# Download from Mycroft's model repository
mkdir -p ~/precise-models/pretrained
cd ~/precise-models/pretrained
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz

# Test immediately
conda activate precise
precise-listen hey-mycroft.net

# Should detect "Hey Mycroft" right away!

Other Available Models:

  • Hey Mycroft - Best tested, most reliable
  • Christopher - Alternative wake word
  • Hey Jarvis - Community contributed
  • Computer - Star Trek style

Using Pre-trained Models

Option 1: Use as-is

# Just point your server to the pre-trained model
python voice_server.py \
    --enable-precise \
    --precise-model ~/precise-models/pretrained/hey-mycroft.net \
    --precise-sensitivity 0.5

Option 2: Fine-tune for your voice

# Use pre-trained as starting point, add your samples
cd ~/precise-models/my-hey-mycroft

# Record additional samples
precise-collect

# Train from checkpoint (much faster than from scratch!)
precise-train -e 30 my-hey-mycroft.net . \
    --from-checkpoint ~/precise-models/pretrained/hey-mycroft.net

# This adds your voice/environment while keeping the base model

Option 3: Ensemble with custom

# Use both pre-trained and custom model
# Require both to agree (reduces false positives)
# See implementation below
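As a sketch of the ensemble idea: accept a detection only when both models fire within a short window. The EnsembleGate class and the one-second window below are illustrative, not part of Precise:

```python
import time

class EnsembleGate:
    """Fire only when every registered model detects within a window."""
    def __init__(self, window_s=1.0):
        self.window_s = window_s
        self.last_hit = {}  # model name -> timestamp of its last activation

    def on_activation(self, model_name, now=None):
        """Record a hit; return True when both models agree in time."""
        now = time.time() if now is None else now
        self.last_hit[model_name] = now
        if len(self.last_hit) < 2:
            return False  # only one model has ever fired
        times = list(self.last_hit.values())
        return max(times) - min(times) <= self.window_s

gate = EnsembleGate(window_s=1.0)
gate.on_activation('pretrained', now=100.0)      # first hit alone: no trigger
print(gate.on_activation('custom', now=100.4))   # within window: True
```

Wire each runner's on_activation callback into one shared gate; the AND requirement trades a little sensitivity for fewer false positives.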

Advantages of Pre-trained Models

  • Instant deployment - No training required
  • Proven accuracy - Tested by thousands of users
  • Good starting point - Fine-tune rather than train from scratch
  • Multiple speakers - Already includes diverse voices
  • Save time - Skip 1-2 hours of training

Disadvantages

  • Generic - Not optimized for your voice/environment
  • May need tuning - Threshold adjustment required
  • Limited choice - Only a few wake words available

Recommendation

Start with "Hey Mycroft" pre-trained model:

  1. Deploy immediately (zero training time)
  2. Test in your environment
  3. Collect false positives/negatives
  4. Fine-tune with your examples
  5. Best of both worlds!
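Step 3 can be as simple as archiving the audio buffer whenever an activation turns out to be wrong, then dropping those clips into your not-wake-word/ folder before fine-tuning. A minimal sketch using the standard wave module (the save_false_positive helper and paths are illustrative):

```python
import os
import wave

def save_false_positive(pcm_bytes, out_dir="not-wake-word",
                        rate=16000, idx=0):
    """Archive a raw 16-bit mono PCM buffer for later retraining."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"false-positive-{idx:04d}.wav")
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)     # mono
        wav.setsampwidth(2)     # 16-bit samples
        wav.setframerate(rate)  # Precise expects 16 kHz audio
        wav.writeframes(pcm_bytes)
    return path

# Example: archive one second of silence as a stand-in buffer
path = save_false_positive(b"\x00\x00" * 16000, out_dir="/tmp/not-wake-word")
print(path)
```

Call it from your activation handler whenever you dismiss a detection, and the fine-tuning data collects itself.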

Multiple Wake Words

Can You Have Multiple Wake Words?

Short answer: Yes, but with tradeoffs.

Implementation Approaches

Approach 1: Server-Side Multiple Models (Heimdall)

Run multiple Precise models in parallel on Heimdall:

# In voice_server.py
import os
import queue
import time

from precise_runner import PreciseEngine, PreciseRunner

# Queue consumed by the main server loop
wake_word_queue = queue.Queue()

# Global runners for each wake word
precise_runners = {}
wake_word_configs = {
    'hey_mycroft': {
        'model': '~/precise-models/pretrained/hey-mycroft.net',
        'sensitivity': 0.5,
        'response': 'Yes?'
    },
    'hey_computer': {
        'model': '~/precise-models/hey-computer/hey-computer.net',
        'sensitivity': 0.5,
        'response': 'I\'m listening'
    },
    'jarvis': {
        'model': '~/precise-models/jarvis/jarvis.net',
        'sensitivity': 0.6,
        'response': 'At your service, sir'
    }
}

def on_wake_word_detected(wake_word_name):
    """Callback with wake word identifier"""
    def callback():
        print(f"Wake word detected: {wake_word_name}")
        wake_word_queue.put({
            'timestamp': time.time(),
            'wake_word': wake_word_name,
            'response': wake_word_configs[wake_word_name]['response']
        })
    return callback

def start_multiple_wake_words():
    """Start multiple Precise listeners"""
    for name, config in wake_word_configs.items():
        engine = PreciseEngine(
            '/usr/local/bin/precise-engine',
            os.path.expanduser(config['model'])
        )
        
        runner = PreciseRunner(
            engine,
            sensitivity=config['sensitivity'],
            on_activation=on_wake_word_detected(name)
        )
        
        runner.start()
        precise_runners[name] = runner
        print(f"Started wake word listener: {name}")

Resource Usage:

  • CPU: ~5-10% per model (3 models = ~15-30%)
  • RAM: ~100-200MB per model
  • Still very manageable on Heimdall

Pros:

  • Different wake words for different purposes
  • Family members can choose preferred wake word
  • Context-aware responses
  • Easy to add/remove models

Cons:

  • Higher CPU usage (scales linearly)
  • Increased false positive risk (3x models = 3x chance)
  • More complex configuration
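The false-positive scaling is roughly multiplicative: if each model independently false-fires with probability p per hour, n models together false-fire at about 1 - (1-p)^n, slightly under the naive n·p estimate. A quick illustration:

```python
def combined_false_positive_rate(p_single, n_models):
    """Probability that at least one of n independent models false-fires."""
    return 1 - (1 - p_single) ** n_models

# If each model false-triggers in 2% of hours, three models together
# false-trigger in about 5.9% of hours.
print(round(combined_false_positive_rate(0.02, 3), 4))
```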

Approach 2: Edge Multiple Models (K210)

Challenge: K210 has limited resources

Option A: Sequential checking (Feasible)

# Check each model in sequence (illustrative MaixPy-style sketch)
models = ['hey-mycroft.kmodel', 'hey-computer.kmodel']

def check_wake_words(audio_features, threshold):
    for model in models:
        kpu_task = kpu.load("/sd/models/" + model)
        result = kpu.run(kpu_task, audio_features)
        if result > threshold:
            return model  # Wake word detected
    return None  # No wake word in this frame

Resource impact:

  • Latency: +50-100ms per additional model
  • Memory: Models must fit in 6MB total
  • CPU: ~30% per model check

Option B: Combined model (Advanced)

# Train a single model that recognizes multiple phrases
# Each phrase maps to different output class
# More complex training but single inference
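With a combined model, one inference returns a score per phrase, and picking the wake word is just an argmax over the output vector with a per-class threshold. A sketch of the post-processing (class names and thresholds are illustrative):

```python
CLASSES = ["hey-mycroft", "hey-computer", "not-wake-word"]
THRESHOLDS = {"hey-mycroft": 0.5, "hey-computer": 0.6}

def decode_output(scores):
    """Map a combined model's output vector to a wake word (or None)."""
    best_idx = max(range(len(scores)), key=lambda i: scores[i])
    label = CLASSES[best_idx]
    if label == "not-wake-word":
        return None  # background class won
    if scores[best_idx] < THRESHOLDS[label]:
        return None  # best class too weak to trust
    return label

print(decode_output([0.1, 0.8, 0.1]))   # hey-computer
print(decode_output([0.2, 0.3, 0.5]))   # None
```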

Recommendation for edge:

  • 1-2 wake words max on K210
  • Server-side for 3+ wake words

Approach 3: Contextual Wake Words

Different wake words trigger different behaviors:

wake_word_contexts = {
    'hey_mycroft': 'general',      # General commands
    'hey_assistant': 'general',    # Alternative general
    'emergency': 'priority',       # High priority
    'goodnight': 'bedtime',        # Bedtime routine
}

def handle_wake_word(wake_word, command):
    context = wake_word_contexts[wake_word]
    
    if context == 'priority':
        # Skip queue, process immediately
        # Maybe call emergency contact
        pass
    elif context == 'bedtime':
        # Trigger bedtime automation
        # Lower volume for responses
        pass
    else:
        # Normal processing
        pass

Best Practices for Multiple Wake Words

  1. Start with one - Get it working well first
  2. Add gradually - One at a time, test thoroughly
  3. Different purposes - Each wake word should have a reason
  4. Monitor performance - Track false positives per wake word
  5. User preference - Let family members choose their favorite

For most users:

wake_words = {
    'hey_mycroft': 'primary',    # Main wake word (pre-trained)
    'hey_computer': 'alternative' # Custom trained for your voice
}

For power users:

wake_words = {
    'hey_mycroft': 'general',
    'jarvis': 'personal_assistant',  # Custom responses
    'computer': 'technical_queries', # Different intent parser
}

For families:

wake_words = {
    'hey_mycroft': 'shared',        # Everyone can use
    'dad': 'user_alan',             # Personalized
    'mom': 'user_sarah',            # Personalized
    'kids': 'user_children',        # Kid-safe responses
}
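Whichever table you pick, the server side is a small routing step: look up the wake word, attach the matching user context, and pass it along with the command. A minimal sketch using the family table above (route_command and its return shape are illustrative):

```python
# Mirrors the "for families" table above
WAKE_WORD_USERS = {
    'hey_mycroft': 'shared',
    'dad': 'user_alan',
    'mom': 'user_sarah',
    'kids': 'user_children',
}

def route_command(wake_word, command):
    """Attach the right user context before intent parsing."""
    user = WAKE_WORD_USERS.get(wake_word, 'shared')  # default to shared
    return {
        'user': user,
        'command': command,
        'kid_safe': user == 'user_children',  # gate responses for kids
    }

print(route_command('kids', 'play music'))
```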

Voice Adaptation and Multi-User Support

Challenge: Different Voices, Same Wake Word

When multiple people use the system:

  • Different accents
  • Different speech patterns
  • Different pronunciations
  • Different vocal characteristics

Solution Approaches

Approach 1: Train with Everyone's Voice Upfront

During initial training:

# Have everyone in household record samples
cd ~/precise-models/hey-computer

# Alan records 30 samples
precise-collect  # Record as user 1

# Sarah records 30 samples  
precise-collect  # Record as user 2

# Kids record 20 samples
precise-collect  # Record as user 3

# Combine all in training set
# Train one model that works for everyone
./3-train-model.sh

Pros:

  • Single model for everyone
  • No user switching needed
  • Simple to maintain
  • Works immediately for all users

Cons:

  • May have lower per-person accuracy
  • Requires upfront time from everyone
  • Hard to add new users later

Approach 2: Incremental Training

Start with your voice, add others over time:

# Week 1: Train with Alan's voice
cd ~/precise-models/hey-computer
# Record and train with Alan's samples
precise-train -e 60 hey-computer.net .

# Week 2: Sarah wants to use it
# Collect Sarah's samples
mkdir -p sarah-samples/wake-word
precise-collect  # Sarah records 20-30 samples

# Add to existing training set
cp sarah-samples/wake-word/* wake-word/

# Retrain (continue from checkpoint)
precise-train -e 30 hey-computer.net . \
    --from-checkpoint hey-computer.net

# Now works for both Alan and Sarah!

Pros:

  • Gradual improvement
  • Don't need everyone upfront
  • Easy to add new users
  • Maintains accuracy for existing users

Cons:

  • May not work well for new users initially
  • Requires retraining periodically

Approach 3: Per-User Models with Speaker Identification

Train separate models + identify who's speaking:

Step 1: Train per-user wake word models

# Alan's model
~/precise-models/hey-computer-alan/

# Sarah's model
~/precise-models/hey-computer-sarah/

# Kids' model
~/precise-models/hey-computer-kids/

Step 2: Use speaker identification

# Pseudo-code for speaker identification
def identify_speaker(audio):
    """
    Identify speaker from voice characteristics
    Using speaker embeddings (x-vectors, d-vectors)
    """
    # Extract speaker embedding
    embedding = speaker_encoder.encode(audio)
    
    # Compare to known users
    similarities = {
        'alan': cosine_similarity(embedding, alan_embedding),
        'sarah': cosine_similarity(embedding, sarah_embedding),
        'kids': cosine_similarity(embedding, kids_embedding),
    }
    
    # Return most similar
    return max(similarities, key=similarities.get)

def process_command(audio):
    # Detect wake word with all models
    wake_detected = check_all_models(audio)
    
    if wake_detected:
        # Identify speaker
        speaker = identify_speaker(audio)
        
        # Use speaker-specific model for better accuracy
        model = f'~/precise-models/hey-computer-{speaker}/'
        
        # Continue with speaker context
        process_with_context(audio, speaker)

Speaker identification libraries:

  • Resemblyzer - Simple speaker verification
  • speechbrain - Complete toolkit
  • pyannote.audio - You already use this for diarization!

Implementation:

# You already have pyannote for diarization!
conda activate voice-assistant
pip install pyannote.audio

# Can use speaker embeddings for identification
from pyannote.audio import Inference

# Load speaker embedding model (window="whole" returns one
# embedding per file instead of a sliding window)
inference = Inference(
    "pyannote/embedding",
    window="whole",
    use_auth_token=hf_token
)

# Extract embeddings for known users
alan_embedding = inference("alan_voice_sample.wav")
sarah_embedding = inference("sarah_voice_sample.wav")

# Compare with incoming audio
unknown_embedding = inference(audio_buffer)

from scipy.spatial.distance import cosine
alan_similarity = 1 - cosine(unknown_embedding, alan_embedding)
sarah_similarity = 1 - cosine(unknown_embedding, sarah_embedding)

if alan_similarity > 0.8:
    user = 'alan'
elif sarah_similarity > 0.8:
    user = 'sarah'
else:
    user = 'unknown'
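One practical refinement to the snippet above: enroll each user from several samples and average the embeddings, which makes the cosine comparison less sensitive to any single recording. A self-contained sketch with tiny stand-in vectors instead of pyannote's real embeddings (the enroll/identify helpers and the 0.8 threshold are assumptions):

```python
import math

def _norm(v):
    return math.sqrt(sum(x * x for x in v))

def enroll(samples):
    """Average several per-sample embeddings into one user voiceprint."""
    dims = len(samples[0])
    mean = [sum(s[i] for s in samples) / len(samples) for i in range(dims)]
    n = _norm(mean)
    return [x / n for x in mean]  # unit-normalize the profile

def identify(embedding, profiles, threshold=0.8):
    """Return the best-matching user, or 'unknown' below the threshold."""
    n = _norm(embedding)
    e = [x / n for x in embedding]
    scores = {name: sum(a * b for a, b in zip(e, p))
              for name, p in profiles.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "unknown"

# Toy 3-D embeddings standing in for pyannote's 512-D vectors
profiles = {
    "alan": enroll([[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]]),
    "sarah": enroll([[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]]),
}
print(identify([0.95, 0.05, 0.05], profiles))  # alan
print(identify([0.1, 0.1, 1.0], profiles))     # unknown
```

In practice you would pass each enrollment wav through inference() and feed the resulting vectors to enroll().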

Pros:

  • Personalized responses per user
  • Better accuracy (model optimized for each voice)
  • User-specific preferences/permissions
  • Can track who said what

Cons:

  • More complex setup
  • Higher resource usage
  • Requires voice samples from each user
  • Privacy considerations

Approach 4: Adaptive/Online Learning

Model improves automatically based on usage:

class AdaptiveWakeWord:
    def __init__(self, base_model):
        self.base_model = base_model
        self.user_samples = []
        self.retrain_threshold = 50  # Retrain after N samples
    
    def on_detection(self, audio, user_confirmed=True):
        """User confirms this was correct detection"""
        if user_confirmed:
            self.user_samples.append(audio)
            
            # Periodically retrain
            if len(self.user_samples) >= self.retrain_threshold:
                self.retrain_with_samples()
                self.user_samples = []
    
    def retrain_with_samples(self):
        """Background retraining with collected samples"""
        # Add samples to training set
        # Retrain model
        # Swap in new model
        pass

Pros:

  • Automatic improvement
  • Adapts to user's voice over time
  • No manual retraining
  • Gets better with use

Cons:

  • Complex implementation
  • Requires user feedback mechanism
  • Risk of drift/degradation
  • Background training overhead

Recommended Rollout Plan

Phase 1: Single Wake Word, Single Model

# Week 1-2
# Use pre-trained "Hey Mycroft"
# OR train custom "Hey Computer" with all family members' voices
# Keep it simple, get it working

Phase 2: Add Fine-tuning

# Week 3-4
# Collect false positives/negatives
# Retrain with household-specific data
# Optimize threshold

Phase 3: Consider Multiple Wake Words

# Month 2
# If needed, add second wake word
# "Hey Mycroft" for general
# "Jarvis" for personal assistant tasks

Phase 4: Personalization

# Month 3+
# If desired, add speaker identification
# Personalized responses
# User-specific preferences

Practical Examples

Example 1: Family of 4, Single Model

# Training session with everyone
cd ~/precise-models/hey-mycroft-family

# Dad records 25 samples
precise-collect  

# Mom records 25 samples
precise-collect

# Kid 1 records 15 samples
precise-collect

# Kid 2 records 15 samples
precise-collect

# Collect shared negative samples (200+)
# TV, music, conversation, etc.
precise-collect -f not-wake-word/household.wav

# Train single model for everyone
precise-train -e 60 hey-mycroft-family.net .

# Deploy
python voice_server.py \
    --enable-precise \
    --precise-model hey-mycroft-family.net

Result: Everyone can use it, one model, simple.

Example 2: Two Wake Words, Different Purposes

# voice_server.py configuration
wake_words = {
    'hey_mycroft': {
        'model': 'hey-mycroft.net',
        'sensitivity': 0.5,
        'intent_parser': 'general',  # All commands
        'response': 'Yes?'
    },
    'emergency': {
        'model': 'emergency.net',
        'sensitivity': 0.7,  # Higher threshold
        'intent_parser': 'emergency',  # Limited commands
        'response': 'Emergency mode activated'
    }
}

# "Hey Mycroft, turn on the lights" - works
# "Emergency, call for help" - triggers emergency protocol
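A thin dispatcher can route each detection to the parser named in that configuration table. A minimal sketch (the parser functions are illustrative stubs, not real intent parsers):

```python
def parse_general(text):
    """Stub for the general-purpose intent parser."""
    return {'intent': 'general', 'text': text}

def parse_emergency(text):
    """Stub for the restricted emergency parser."""
    return {'intent': 'emergency', 'text': text, 'priority': True}

INTENT_PARSERS = {'general': parse_general, 'emergency': parse_emergency}

# Mirrors the wake_words config above: each entry names its parser
WAKE_WORDS = {
    'hey_mycroft': {'intent_parser': 'general'},
    'emergency': {'intent_parser': 'emergency'},
}

def dispatch(wake_word, text):
    """Pick the parser configured for the wake word that fired."""
    parser_name = WAKE_WORDS[wake_word]['intent_parser']
    return INTENT_PARSERS[parser_name](text)

print(dispatch('emergency', 'call for help'))
```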

Example 3: Speaker Identification + Personalization

# Enhanced processing with speaker ID
def process_with_speaker_id(audio, speaker):
    # Different HA entity based on speaker
    entity_maps = {
        'alan': {
            'bedroom_light': 'light.master_bedroom',
            'office_light': 'light.alan_office',
        },
        'sarah': {
            'bedroom_light': 'light.master_bedroom',
            'office_light': 'light.sarah_office',
        },
        'kids': {
            'bedroom_light': 'light.kids_bedroom',
            'tv': None,  # Kids can't control TV
        }
    }
    
    # Transcribe command
    text = whisper_transcribe(audio)
    response = "Sorry, I didn't catch that"  # fallback if nothing matches
    
    # "Turn on bedroom light"
    if 'bedroom light' in text:
        entity = entity_maps[speaker]['bedroom_light']
        ha_client.turn_on(entity)
        
        response = "Turned on your bedroom light"
    
    return response

Resource Requirements

Single Wake Word

  • CPU: 5-10% (Heimdall)
  • RAM: 100-200MB
  • Model size: 1-3MB
  • Training time: 30-60 min

Multiple Wake Words (3 models)

  • CPU: 15-30% (Heimdall)
  • RAM: 300-600MB
  • Model size: 3-9MB total
  • Training time: 90-180 min

With Speaker Identification

  • CPU: +5-10% for speaker ID
  • RAM: +200-300MB for embedding model
  • Model size: +50MB for speaker model
  • Setup time: +30-60 min for voice enrollment

K210 Edge (Maix Duino)

  • Single model: Feasible, ~30% CPU
  • 2 models: Feasible, ~60% CPU, higher latency
  • 3+ models: Not recommended
  • Speaker ID: Not feasible (limited RAM/compute)
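A quick sanity check before deploying to the K210: sum the .kmodel sizes plus a rough runtime overhead and confirm they fit in the ~6 MB mentioned in the memory note above. The helper and the 512 KB overhead figure are rough assumptions:

```python
K210_BUDGET_BYTES = 6 * 1024 * 1024  # usable model memory per the note above

def fits_on_k210(model_sizes_bytes, overhead_bytes=512 * 1024):
    """Return (fits, total_bytes) for a set of models plus overhead."""
    total = sum(model_sizes_bytes) + overhead_bytes
    return total <= K210_BUDGET_BYTES, total

# Two 2 MB models plus overhead: ~4.5 MB, fits
ok, total = fits_on_k210([2 * 1024 * 1024, 2 * 1024 * 1024])
print(ok, total)
```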

Quick Decision Guide

Just getting started? → Use pre-trained "Hey Mycroft"

Want custom wake word? → Train one model with all family voices

Need multiple wake words? → Start server-side with 2-3 models

Want personalization? → Add speaker identification

Deploying to edge (K210)? → Stick to 1-2 wake words maximum

Family of 4+ people? → Train single model with everyone's voice

Privacy is paramount? → Skip speaker ID, use single universal model

Testing Multiple Wake Words

# Test all wake words quickly
conda activate precise

# Terminal 1: Hey Mycroft
precise-listen hey-mycroft.net

# Terminal 2: Hey Computer  
precise-listen hey-computer.net

# Terminal 3: Emergency
precise-listen emergency.net

# Say each wake word, verify correct detection

Conclusion

For Your Maix Duino Project:

Recommended approach:

  1. Start with "Hey Mycroft" - Use pre-trained model
  2. Fine-tune if needed - Add your household's voices
  3. Consider 2nd wake word - Only if you have a specific use case
  4. Speaker ID - Phase 2/3 enhancement, not critical for MVP
  5. Keep it simple - One wake word works great for most users

The pre-trained "Hey Mycroft" model saves you 1-2 hours and works immediately. You can always fine-tune or add custom wake words later!

Multiple wake words are cool but not necessary - Most commercial products use just one. Focus on making one wake word work really well before adding more.

Voice adaptation - Training with multiple voices upfront is simpler than per-user models. Save speaker ID for later if you need personalization.

Quick Start with Pre-trained

# On Heimdall
cd ~/precise-models/pretrained
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz

# Test it
conda activate precise
precise-listen hey-mycroft.net

# Deploy
cd ~/voice-assistant
python voice_server.py \
    --enable-precise \
    --precise-model ~/precise-models/pretrained/hey-mycroft.net

# You're done! No training needed!

That's it - you have a working wake word in 5 minutes! 🎉