# Wake Word Models: Pre-trained, Multiple, and Voice Adaptation
## Pre-trained Wake Word Models
### Yes! "Hey Mycroft" Already Exists
Mycroft provides several pre-trained models that you can use immediately:
#### Available Pre-trained Models
**Hey Mycroft** (Official)
```bash
# Download from Mycroft's model repository
cd ~/precise-models/pretrained
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz
# Test immediately
conda activate precise
precise-listen hey-mycroft.net
# Should detect "Hey Mycroft" right away!
```
**Other Available Models:**
- **Hey Mycroft** - Best tested, most reliable
- **Christopher** - Alternative wake word
- **Hey Jarvis** - Community contributed
- **Computer** - Star Trek style
#### Using Pre-trained Models
**Option 1: Use as-is**
```bash
# Just point your server to the pre-trained model
python voice_server.py \
    --enable-precise \
    --precise-model ~/precise-models/pretrained/hey-mycroft.net \
    --precise-sensitivity 0.5
```
**Option 2: Fine-tune for your voice**
```bash
# Use pre-trained as starting point, add your samples
cd ~/precise-models/my-hey-mycroft
# Record additional samples
precise-collect
# Train from checkpoint (much faster than from scratch!)
precise-train -e 30 my-hey-mycroft.net . \
    --from-checkpoint ~/precise-models/pretrained/hey-mycroft.net
# This adds your voice/environment while keeping the base model
```
**Option 3: Ensemble with custom**
```python
# Use both pre-trained and custom model
# Require both to agree (reduces false positives)
# See implementation below
```
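The "require both to agree" idea can be sketched as a small gate that only fires when every model has reported an activation within a short window. This is a minimal sketch, not from the original code: the `EnsembleGate` name and the 1.5-second window are assumptions, and the wiring comments assume the `PreciseRunner` usage shown later in this guide.

```python
import time

class EnsembleGate:
    """Fire only when all models activate within a short window (hypothetical helper)."""

    def __init__(self, window=1.5):
        self.window = window   # seconds within which both models must fire
        self.last_seen = {}    # model name -> last activation timestamp

    def activation(self, model_name, now=None):
        """Record one model's activation; return True once both agree."""
        now = time.time() if now is None else now
        self.last_seen[model_name] = now
        times = list(self.last_seen.values())
        return len(times) >= 2 and max(times) - min(times) <= self.window

# Wiring sketch (assumes two PreciseRunner instances as in Approach 1 below):
#
# gate = EnsembleGate()
# pretrained_runner = PreciseRunner(pretrained_engine,
#     on_activation=lambda: gate.activation('pretrained') and handle_wake())
# custom_runner = PreciseRunner(custom_engine,
#     on_activation=lambda: gate.activation('custom') and handle_wake())
```

Because the gate only passes when both models agree, a false positive from either model alone is suppressed, at the cost of slightly higher latency.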
### Advantages of Pre-trained Models
✅ **Instant deployment** - No training required
✅ **Proven accuracy** - Tested by thousands of users
✅ **Good starting point** - Fine-tune rather than train from scratch
✅ **Multiple speakers** - Already includes diverse voices
✅ **Save time** - Skip 1-2 hours of training
### Disadvantages
❌ **Generic** - Not optimized for your voice/environment
❌ **May need tuning** - Threshold adjustment required
❌ **Limited choice** - Only a few wake words available
### Recommendation
**Start with "Hey Mycroft"** pre-trained model:
1. Deploy immediately (zero training time)
2. Test in your environment
3. Collect false positives/negatives
4. Fine-tune with your examples
5. Best of both worlds!
## Multiple Wake Words
### Can You Have Multiple Wake Words?
**Short answer:** Yes, but with tradeoffs.
### Implementation Approaches
#### Approach 1: Server-Side Multiple Models (Recommended)
Run multiple Precise models in parallel on Heimdall:
```python
# In voice_server.py
import os
import queue
import time

from precise_runner import PreciseEngine, PreciseRunner

# Global runners for each wake word
precise_runners = {}

# Consumed by the command-processing loop elsewhere in voice_server.py
wake_word_queue = queue.Queue()

wake_word_configs = {
    'hey_mycroft': {
        'model': '~/precise-models/pretrained/hey-mycroft.net',
        'sensitivity': 0.5,
        'response': 'Yes?'
    },
    'hey_computer': {
        'model': '~/precise-models/hey-computer/hey-computer.net',
        'sensitivity': 0.5,
        'response': "I'm listening"
    },
    'jarvis': {
        'model': '~/precise-models/jarvis/jarvis.net',
        'sensitivity': 0.6,
        'response': 'At your service, sir'
    }
}

def on_wake_word_detected(wake_word_name):
    """Build a callback that carries the wake word identifier"""
    def callback():
        print(f"Wake word detected: {wake_word_name}")
        wake_word_queue.put({
            'timestamp': time.time(),
            'wake_word': wake_word_name,
            'response': wake_word_configs[wake_word_name]['response']
        })
    return callback

def start_multiple_wake_words():
    """Start one Precise listener per configured wake word"""
    for name, config in wake_word_configs.items():
        engine = PreciseEngine(
            '/usr/local/bin/precise-engine',
            os.path.expanduser(config['model'])
        )
        runner = PreciseRunner(
            engine,
            sensitivity=config['sensitivity'],
            on_activation=on_wake_word_detected(name)
        )
        runner.start()
        precise_runners[name] = runner
        print(f"Started wake word listener: {name}")
```
**Resource Usage:**
- CPU: ~5-10% per model (3 models = ~15-30%)
- RAM: ~100-200MB per model
- Still very manageable on Heimdall
**Pros:**
✅ Different wake words for different purposes
✅ Family members can choose preferred wake word
✅ Context-aware responses
✅ Easy to add/remove models
**Cons:**
❌ Higher CPU usage (scales linearly)
❌ Increased false positive risk (3x models = 3x chance)
❌ More complex configuration
#### Approach 2: Edge Multiple Models (K210)
**Challenge:** K210 has limited resources
**Option A: Sequential checking** (Feasible)
```python
# Check each model in sequence
import KPU as kpu  # MaixPy KPU module

def check_wake_words(audio_features, threshold):
    models = ['hey-mycroft.kmodel', 'hey-computer.kmodel']
    for model in models:
        kpu_task = kpu.load(f"/sd/models/{model}")
        result = kpu.run(kpu_task, audio_features)
        if result > threshold:
            return model  # Wake word detected
    return None
```
**Resource impact:**
- Latency: +50-100ms per additional model
- Memory: Models must fit in 6MB total
- CPU: ~30% per model check
**Option B: Combined model** (Advanced)
```python
# Train a single model that recognizes multiple phrases
# Each phrase maps to different output class
# More complex training but single inference
```
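A combined model would emit one score per phrase, so a single inference covers every wake word. Mapping the output vector back to a wake word might look like the sketch below; the class layout (index 0 reserved for "no wake word") and the 0.7 threshold are assumptions, since the real layout depends on how the model is trained:

```python
# Hypothetical output classes for a combined multi-phrase model;
# index 0 is reserved for "no wake word".
CLASSES = ['none', 'hey_mycroft', 'hey_computer']

def decode_combined_output(scores, threshold=0.7):
    """Return the detected wake word name, or None."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    if best == 0 or scores[best] < threshold:
        return None
    return CLASSES[best]
```

This keeps the edge device at one inference per audio frame regardless of how many phrases the model knows.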
**Recommendation for edge:**
- **1-2 wake words max** on K210
- **Server-side** for 3+ wake words
#### Approach 3: Contextual Wake Words
Different wake words trigger different behaviors:
```python
wake_word_contexts = {
    'hey_mycroft': 'general',    # General commands
    'hey_assistant': 'general',  # Alternative general
    'emergency': 'priority',     # High priority
    'goodnight': 'bedtime',      # Bedtime routine
}

def handle_wake_word(wake_word, command):
    context = wake_word_contexts[wake_word]
    if context == 'priority':
        # Skip queue, process immediately
        # Maybe call emergency contact
        pass
    elif context == 'bedtime':
        # Trigger bedtime automation
        # Lower volume for responses
        pass
    else:
        # Normal processing
        pass
```
### Best Practices for Multiple Wake Words
1. **Start with one** - Get it working well first
2. **Add gradually** - One at a time, test thoroughly
3. **Different purposes** - Each wake word should have a reason
4. **Monitor performance** - Track false positives per wake word
5. **User preference** - Let family members choose their favorite
### Recommended Configuration
**For most users:**
```python
wake_words = {
    'hey_mycroft': 'primary',      # Main wake word (pre-trained)
    'hey_computer': 'alternative'  # Custom trained for your voice
}
```
**For power users:**
```python
wake_words = {
    'hey_mycroft': 'general',
    'jarvis': 'personal_assistant',   # Custom responses
    'computer': 'technical_queries',  # Different intent parser
}
```
**For families:**
```python
wake_words = {
    'hey_mycroft': 'shared',   # Everyone can use
    'dad': 'user_alan',        # Personalized
    'mom': 'user_sarah',       # Personalized
    'kids': 'user_children',   # Kid-safe responses
}
```
## Voice Adaptation and Multi-User Support
### Challenge: Different Voices, Same Wake Word
When multiple people use the system:
- Different accents
- Different speech patterns
- Different pronunciations
- Different vocal characteristics
### Solution Approaches
#### Approach 1: Diverse Training Data (Recommended)
**During initial training:**
```bash
# Have everyone in household record samples
cd ~/precise-models/hey-computer
# Alan records 30 samples
precise-collect # Record as user 1
# Sarah records 30 samples
precise-collect # Record as user 2
# Kids record 20 samples
precise-collect # Record as user 3
# Combine all in training set
# Train one model that works for everyone
./3-train-model.sh
```
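The "combine all in training set" step can be scripted. This sketch merges each person's recordings into one training folder, prefixing filenames with the user's directory name so recordings don't collide; the directory layout is hypothetical:

```python
import shutil
from pathlib import Path

def merge_samples(user_dirs, dest):
    """Copy each user's wav files into one shared training folder."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    copied = 0
    for user_dir in user_dirs:
        user_dir = Path(user_dir)
        for wav in sorted(user_dir.glob('*.wav')):
            # Prefix with the user folder name to avoid filename collisions
            shutil.copy(wav, dest / f"{user_dir.name}-{wav.name}")
            copied += 1
    return copied
```

After merging, train as usual with the combined folder as the wake-word set.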
**Pros:**
✅ Single model for everyone
✅ No user switching needed
✅ Simple to maintain
✅ Works immediately for all users
**Cons:**
❌ May have lower per-person accuracy
❌ Requires upfront time from everyone
❌ Hard to add new users later
#### Approach 2: Incremental Training
Start with your voice, add others over time:
```bash
# Week 1: Train with Alan's voice
cd ~/precise-models/hey-computer
# Record and train with Alan's samples
precise-train -e 60 hey-computer.net .
# Week 2: Sarah wants to use it
# Collect Sarah's samples
mkdir -p sarah-samples/wake-word
precise-collect # Sarah records 20-30 samples
# Add to existing training set
cp sarah-samples/wake-word/* wake-word/
# Retrain (continue from checkpoint)
precise-train -e 30 hey-computer.net . \
    --from-checkpoint hey-computer.net
# Now works for both Alan and Sarah!
```
**Pros:**
✅ Gradual improvement
✅ Don't need everyone upfront
✅ Easy to add new users
✅ Maintains accuracy for existing users
**Cons:**
❌ May not work well for new users initially
❌ Requires retraining periodically
#### Approach 3: Per-User Models with Speaker Identification
Train separate models + identify who's speaking:
**Step 1: Train per-user wake word models**
```bash
# Alan's model
~/precise-models/hey-computer-alan/
# Sarah's model
~/precise-models/hey-computer-sarah/
# Kids' model
~/precise-models/hey-computer-kids/
```
**Step 2: Use speaker identification**
```python
# Pseudo-code for speaker identification
def identify_speaker(audio):
    """
    Identify speaker from voice characteristics
    using speaker embeddings (x-vectors, d-vectors)
    """
    # Extract speaker embedding
    embedding = speaker_encoder.encode(audio)
    # Compare to known users
    similarities = {
        'alan': cosine_similarity(embedding, alan_embedding),
        'sarah': cosine_similarity(embedding, sarah_embedding),
        'kids': cosine_similarity(embedding, kids_embedding),
    }
    # Return most similar
    return max(similarities, key=similarities.get)

def process_command(audio):
    # Detect wake word with all models
    wake_detected = check_all_models(audio)
    if wake_detected:
        # Identify speaker
        speaker = identify_speaker(audio)
        # Use speaker-specific model for better accuracy
        model = f'~/precise-models/hey-computer-{speaker}/'
        # Continue with speaker context
        process_with_context(audio, speaker)
```
**Speaker identification libraries:**
- **Resemblyzer** - Simple speaker verification
- **speechbrain** - Complete toolkit
- **pyannote.audio** - You already use this for diarization!
**Implementation:**
```bash
# You already have pyannote for diarization!
conda activate voice-assistant
pip install pyannote.audio --break-system-packages
# Can use speaker embeddings for identification
```
```python
from pyannote.audio import Inference
from scipy.spatial.distance import cosine

# Load speaker embedding model
inference = Inference(
    "pyannote/embedding",
    use_auth_token=hf_token
)

# Extract embeddings for known users
alan_embedding = inference("alan_voice_sample.wav")
sarah_embedding = inference("sarah_voice_sample.wav")

# Compare with incoming audio
unknown_embedding = inference(audio_buffer)

alan_similarity = 1 - cosine(unknown_embedding, alan_embedding)
sarah_similarity = 1 - cosine(unknown_embedding, sarah_embedding)

if alan_similarity > 0.8:
    user = 'alan'
elif sarah_similarity > 0.8:
    user = 'sarah'
else:
    user = 'unknown'
```
**Pros:**
✅ Personalized responses per user
✅ Better accuracy (model optimized for each voice)
✅ User-specific preferences/permissions
✅ Can track who said what
**Cons:**
❌ More complex setup
❌ Higher resource usage
❌ Requires voice samples from each user
❌ Privacy considerations
#### Approach 4: Adaptive/Online Learning
Model improves automatically based on usage:
```python
class AdaptiveWakeWord:
    def __init__(self, base_model):
        self.base_model = base_model
        self.user_samples = []
        self.retrain_threshold = 50  # Retrain after N samples

    def on_detection(self, audio, user_confirmed=True):
        """User confirms this was a correct detection"""
        if user_confirmed:
            self.user_samples.append(audio)
            # Periodically retrain
            if len(self.user_samples) >= self.retrain_threshold:
                self.retrain_with_samples()
                self.user_samples = []

    def retrain_with_samples(self):
        """Background retraining with collected samples"""
        # Add samples to training set
        # Retrain model
        # Swap in new model
        pass
```
**Pros:**
✅ Automatic improvement
✅ Adapts to user's voice over time
✅ No manual retraining
✅ Gets better with use
**Cons:**
❌ Complex implementation
❌ Requires user feedback mechanism
❌ Risk of drift/degradation
❌ Background training overhead
## Recommended Strategy
### Phase 1: Single Wake Word, Single Model
```bash
# Week 1-2
# Use pre-trained "Hey Mycroft"
# OR train custom "Hey Computer" with all family members' voices
# Keep it simple, get it working
```
### Phase 2: Add Fine-tuning
```bash
# Week 3-4
# Collect false positives/negatives
# Retrain with household-specific data
# Optimize threshold
```
### Phase 3: Consider Multiple Wake Words
```bash
# Month 2
# If needed, add second wake word
# "Hey Mycroft" for general
# "Jarvis" for personal assistant tasks
```
### Phase 4: Personalization
```bash
# Month 3+
# If desired, add speaker identification
# Personalized responses
# User-specific preferences
```
## Practical Examples
### Example 1: Family of 4, Single Model
```bash
# Training session with everyone
cd ~/precise-models/hey-mycroft-family
# Dad records 25 samples
precise-collect
# Mom records 25 samples
precise-collect
# Kid 1 records 15 samples
precise-collect
# Kid 2 records 15 samples
precise-collect
# Collect shared negative samples (200+)
# TV, music, conversation, etc.
precise-collect -f not-wake-word/household.wav
# Train single model for everyone
precise-train -e 60 hey-mycroft-family.net .
# Deploy
python voice_server.py \
    --enable-precise \
    --precise-model hey-mycroft-family.net
```
**Result:** Everyone can use it, one model, simple.
### Example 2: Two Wake Words, Different Purposes
```python
# voice_server.py configuration
wake_words = {
    'hey_mycroft': {
        'model': 'hey-mycroft.net',
        'sensitivity': 0.5,
        'intent_parser': 'general',    # All commands
        'response': 'Yes?'
    },
    'emergency': {
        'model': 'emergency.net',
        'sensitivity': 0.7,            # Higher threshold
        'intent_parser': 'emergency',  # Limited commands
        'response': 'Emergency mode activated'
    }
}

# "Hey Mycroft, turn on the lights" - works
# "Emergency, call for help" - triggers emergency protocol
```
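Routing a transcribed command through the per-wake-word intent parser from a configuration like the one above might look like this. It's a sketch: the `dispatch` helper and the `parsers` registry are hypothetical, not part of voice_server.py.

```python
def dispatch(wake_word, command_text, wake_words, parsers):
    """Route a command to the intent parser configured for its wake word."""
    config = wake_words.get(wake_word)
    if config is None:
        return None  # Unknown wake word, ignore
    parser = parsers[config['intent_parser']]
    # Return the spoken acknowledgment plus the parsed intent
    return config['response'], parser(command_text)
```

With this shape, "Emergency, call for help" reaches only the limited emergency parser, while general commands go through the normal one.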
### Example 3: Speaker Identification + Personalization
```python
# Enhanced processing with speaker ID
def process_with_speaker_id(audio, speaker):
    # Different HA entity based on speaker
    entity_maps = {
        'alan': {
            'bedroom_light': 'light.master_bedroom',
            'office_light': 'light.alan_office',
        },
        'sarah': {
            'bedroom_light': 'light.master_bedroom',
            'office_light': 'light.sarah_office',
        },
        'kids': {
            'bedroom_light': 'light.kids_bedroom',
            'tv': None,  # Kids can't control TV
        }
    }
    # Transcribe command
    text = whisper_transcribe(audio)
    # "Turn on bedroom light"
    if 'bedroom light' in text:
        entity = entity_maps[speaker]['bedroom_light']
        ha_client.turn_on(entity)
        response = "Turned on your bedroom light"
        return response
```
## Resource Requirements
### Single Wake Word
- **CPU:** 5-10% (Heimdall)
- **RAM:** 100-200MB
- **Model size:** 1-3MB
- **Training time:** 30-60 min
### Multiple Wake Words (3 models)
- **CPU:** 15-30% (Heimdall)
- **RAM:** 300-600MB
- **Model size:** 3-9MB total
- **Training time:** 90-180 min
### With Speaker Identification
- **CPU:** +5-10% for speaker ID
- **RAM:** +200-300MB for embedding model
- **Model size:** +50MB for speaker model
- **Setup time:** +30-60 min for voice enrollment
### K210 Edge (Maix Duino)
- **Single model:** Feasible, ~30% CPU
- **2 models:** Feasible, ~60% CPU, higher latency
- **3+ models:** Not recommended
- **Speaker ID:** Not feasible (limited RAM/compute)
## Quick Decision Guide
**Just getting started?**
→ Use pre-trained "Hey Mycroft"
**Want custom wake word?**
→ Train one model with all family voices
**Need multiple wake words?**
→ Start server-side with 2-3 models
**Want personalization?**
→ Add speaker identification
**Deploying to edge (K210)?**
→ Stick to 1-2 wake words maximum
**Family of 4+ people?**
→ Train single model with everyone's voice
**Privacy is paramount?**
→ Skip speaker ID, use single universal model
## Testing Multiple Wake Words
```bash
# Test all wake words quickly
conda activate precise
# Terminal 1: Hey Mycroft
precise-listen hey-mycroft.net
# Terminal 2: Hey Computer
precise-listen hey-computer.net
# Terminal 3: Emergency
precise-listen emergency.net
# Say each wake word, verify correct detection
```
## Conclusion
### For Your Maix Duino Project:
**Recommended approach:**
1. **Start with "Hey Mycroft"** - Use pre-trained model
2. **Fine-tune if needed** - Add your household's voices
3. **Consider 2nd wake word** - Only if you have a specific use case
4. **Speaker ID** - Phase 2/3 enhancement, not critical for MVP
5. **Keep it simple** - One wake word works great for most users
**The pre-trained "Hey Mycroft" model saves you 1-2 hours** and works immediately. You can always fine-tune or add custom wake words later!
**Multiple wake words are cool but not necessary** - Most commercial products use just one. Focus on making one wake word work really well before adding more.
**Voice adaptation** - Training with multiple voices upfront is simpler than per-user models. Save speaker ID for later if you need personalization.
## Quick Start with Pre-trained
```bash
# On Heimdall
cd ~/precise-models/pretrained
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz
# Test it
conda activate precise
precise-listen hey-mycroft.net
# Deploy
cd ~/voice-assistant
python voice_server.py \
    --enable-precise \
    --precise-model ~/precise-models/pretrained/hey-mycroft.net
# You're done! No training needed!
```
**That's it - you have a working wake word in 5 minutes!** 🎉