# Advanced Wake Word Topics - Pre-trained Models, Multiple Wake Words, and Voice Adaptation

## Pre-trained Mycroft Models

### Yes! Pre-trained Models Exist

Mycroft AI provides several pre-trained wake word models you can use immediately:

**Available Models:**
- **Hey Mycroft** - Original Mycroft wake word (most training data)
- **Hey Jarvis** - Popular alternative
- **Christopher** - Alternative wake word
- **Hey Ezra** - Another option
### Download Pre-trained Models

```bash
# On Heimdall
conda activate precise
cd ~/precise-models

# Create directory for pre-trained models
mkdir -p pretrained
cd pretrained

# Download Hey Mycroft (recommended starting point)
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz

# Download other models
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-jarvis.tar.gz
tar xzf hey-jarvis.tar.gz

# List available models
ls -lh *.net
```
### Test Pre-trained Model

```bash
conda activate precise

# Test Hey Mycroft
precise-listen hey-mycroft.net

# Speak "Hey Mycroft" - should see "!" when detected
# Press Ctrl+C to exit

# Test with different threshold
precise-listen hey-mycroft.net -t 0.7  # More conservative
```
### Use Pre-trained Model in Voice Server

```bash
cd ~/voice-assistant

# Start server with Hey Mycroft model
python voice_server.py \
    --enable-precise \
    --precise-model ~/precise-models/pretrained/hey-mycroft.net \
    --precise-sensitivity 0.5
```
### Fine-tune Pre-trained Models

You can use pre-trained models as a **starting point** and fine-tune with your voice:

```bash
cd ~/precise-models
mkdir -p hey-mycroft-custom

# Copy base model
cp pretrained/hey-mycroft.net hey-mycroft-custom/

# Collect your samples
cd hey-mycroft-custom
precise-collect  # Record 20-30 samples of YOUR voice

# Fine-tune from pre-trained model
precise-train -e 30 hey-mycroft-custom.net . \
    --from-checkpoint ../pretrained/hey-mycroft.net

# This is MUCH faster than training from scratch!
```

**Benefits:**
- ✅ Start with proven model
- ✅ Much less training data needed (20-30 vs 100+ samples)
- ✅ Faster training (30 mins vs 60 mins)
- ✅ Good baseline accuracy
## Multiple Wake Words

### Architecture Options

#### Option 1: Multiple Models in Parallel (Server-Side Only)

Run multiple Precise instances simultaneously:

```python
# In voice_server.py - multiple wake word detection

import queue
import time

from precise_runner import PreciseEngine, PreciseRunner

# Global runners and event queue
precise_runners = {}
wake_word_queue = queue.Queue()

def on_wake_word_detected(wake_word_name):
    """Callback factory for different wake words"""
    def callback():
        print(f"Wake word detected: {wake_word_name}")
        wake_word_queue.put({
            'wake_word': wake_word_name,
            'timestamp': time.time()
        })
    return callback

def start_multiple_wake_words(wake_word_configs):
    """
    Start multiple wake word detectors

    Args:
        wake_word_configs: List of dicts with 'name', 'model', 'sensitivity'

    Example:
        configs = [
            {'name': 'hey mycroft', 'model': 'hey-mycroft.net', 'sensitivity': 0.5},
            {'name': 'hey jarvis', 'model': 'hey-jarvis.net', 'sensitivity': 0.5}
        ]
    """
    global precise_runners

    for config in wake_word_configs:
        engine = PreciseEngine(
            '/usr/local/bin/precise-engine',
            config['model']
        )

        runner = PreciseRunner(
            engine,
            sensitivity=config['sensitivity'],
            on_activation=on_wake_word_detected(config['name'])
        )

        runner.start()
        precise_runners[config['name']] = runner

        print(f"Started wake word detector: {config['name']}")
```
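
Each detector only enqueues an event; the server's main loop still needs to consume them. A minimal sketch of a consumer, assuming `wake_word_queue` is the standard `queue.Queue` shown above (the dispatch logic is illustrative):

```python
import queue
import time

wake_word_queue = queue.Queue()

# Pretend two detectors fired
wake_word_queue.put({'wake_word': 'hey mycroft', 'timestamp': time.time()})
wake_word_queue.put({'wake_word': 'hey jarvis', 'timestamp': time.time()})

def drain_wake_events(q):
    """Consume all pending wake word events and return the names handled."""
    handled = []
    while True:
        try:
            event = q.get_nowait()
        except queue.Empty:
            break
        # Route per wake word here, e.g. 'hey jarvis' -> media commands
        handled.append(event['wake_word'])
    return handled

print(drain_wake_events(wake_word_queue))  # ['hey mycroft', 'hey jarvis']
```

Because a `queue.Queue` is thread-safe, the Precise callback threads and the server loop need no extra locking.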

**Server-Side Multiple Wake Words:**

```bash
# Start server with multiple wake words
python voice_server.py \
    --enable-precise \
    --precise-models "hey-mycroft:~/models/hey-mycroft.net:0.5,hey-jarvis:~/models/hey-jarvis.net:0.5"
```

**Performance Impact:**
- CPU: ~5-10% per model (can run 2-3 easily)
- Memory: ~50-100MB per model
- Latency: Minimal (all run in parallel)

#### Option 2: Single Model, Multiple Phrases (Edge or Server)

Train ONE model that responds to multiple phrases:

```bash
cd ~/precise-models/multi-wake
conda activate precise

# Record samples for BOTH wake words in the SAME dataset
# Label all as "wake-word" regardless of which phrase

mkdir -p wake-word not-wake-word

# Record "Hey Mycroft" samples
precise-collect  # Save to wake-word/hey-mycroft-*.wav

# Record "Hey Computer" samples
precise-collect  # Save to wake-word/hey-computer-*.wav

# Record negatives
precise-collect -f not-wake-word/random.wav

# Train single model on both phrases
precise-train -e 60 multi-wake.net .
```

**Pros:**
- ✅ Single model = less compute
- ✅ Works on edge (K210)
- ✅ Easy to deploy

**Cons:**
- ❌ Can't tell which wake word was used
- ❌ May reduce accuracy for each individual phrase
- ❌ Higher false positive risk

#### Option 3: Sequential Detection (Edge)

Detect wake word, then identify which one:

```python
# Pseudo-code for edge detection
if wake_word_detected():
    audio_snippet = last_2_seconds()

    # Run all models on the audio snippet
    scores = {
        'hey-mycroft': model1.score(audio_snippet),
        'hey-jarvis': model2.score(audio_snippet),
        'hey-computer': model3.score(audio_snippet)
    }

    # Use highest scoring wake word
    wake_word = max(scores, key=scores.get)
```
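
The pseudo-code above becomes runnable once the models are stubbed out. A sketch with placeholder scorers (the `make_scorer` biases stand in for real model inference):

```python
def make_scorer(bias):
    """Stand-in for a wake word model's confidence score."""
    def score(audio_snippet):
        # A real model would run inference on the audio here
        return bias + 0.01 * len(audio_snippet)
    return score

models = {
    'hey-mycroft': make_scorer(0.20),
    'hey-jarvis': make_scorer(0.85),
    'hey-computer': make_scorer(0.40),
}

audio_snippet = [0.0] * 10  # placeholder for ~2 seconds of samples

scores = {name: score for name, score in
          ((name, model(audio_snippet)) for name, model in models.items())}
wake_word = max(scores, key=scores.get)
print(wake_word)  # hey-jarvis
```

The `max(scores, key=scores.get)` idiom returns the dictionary key with the highest value, which is exactly the "highest scoring wake word" step.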

### Recommendations

**Server-Side (Heimdall):**
- ✅ **Use Option 1** - Multiple models in parallel
- Run 2-3 wake words easily
- Each can have different sensitivity
- Can identify which wake word was used
- Example: "Hey Mycroft" for commands, "Hey Jarvis" for queries

**Edge (Maix Duino K210):**
- ✅ **Use Option 2** - Single multi-phrase model
- K210 can handle 1 model efficiently
- Train on 2-3 phrases max
- Simpler deployment
- Lower latency

## Voice Adaptation & Multi-User Support

### Approach 1: Inclusive Training (Recommended)

Train ONE model on EVERYONE'S voices:

```bash
cd ~/precise-models/family-wake-word
conda activate precise

# Record samples from each family member
# Alice records 30 samples
precise-collect  # Save as wake-word/alice-*.wav

# Bob records 30 samples
precise-collect  # Save as wake-word/bob-*.wav

# Carol records 30 samples
precise-collect  # Save as wake-word/carol-*.wav

# Train on all voices
precise-train -e 60 family-wake-word.net .
```

**Pros:**
- ✅ Everyone can use the system
- ✅ Single model deployment
- ✅ Works for all family members
- ✅ Simple maintenance

**Cons:**
- ❌ Can't identify who spoke
- ❌ May need more training data
- ❌ No personalization

**Best for:** Family voice assistant, shared devices

### Approach 2: Speaker Identification (Advanced)

Detect wake word, then identify speaker:

```python
# Architecture with speaker ID

# Step 1: Precise detects wake word
if wake_word_detected():

    # Step 2: Capture voice sample
    voice_sample = record_audio(duration=3)

    # Step 3: Speaker identification
    # (uses voice embeddings / a neural network)
    speaker = identify_speaker(voice_sample)

    # Step 4: Process with user context
    process_command(voice_sample, user=speaker)
```

**Implementation Options:**

#### Option A: Use resemblyzer (Voice Embeddings)

```bash
pip install resemblyzer --break-system-packages

# Enrollment phase
python enroll_users.py
# Each user records 10-20 seconds of speech
# System creates voice profile (embedding)

# Runtime
python speaker_id.py
# Compares incoming audio to stored embeddings
# Returns most likely speaker
```

**Example Code:**
```python
import os

import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

# Initialize encoder
encoder = VoiceEncoder()

# Enrollment - do once per user
def enroll_user(name, audio_files):
    """Create voice profile for user"""
    embeddings = []

    for audio_file in audio_files:
        wav = preprocess_wav(audio_file)
        embedding = encoder.embed_utterance(wav)
        embeddings.append(embedding)

    # Average embeddings for robustness
    user_profile = np.mean(embeddings, axis=0)

    # Save profile
    np.save(f'profiles/{name}.npy', user_profile)
    return user_profile

# Identification - run each time
def identify_speaker(audio_file, profiles_dir='profiles'):
    """Identify which enrolled user is speaking"""
    wav = preprocess_wav(audio_file)
    test_embedding = encoder.embed_utterance(wav)

    # Load all profiles
    profiles = {}
    for profile_file in os.listdir(profiles_dir):
        name = profile_file.replace('.npy', '')
        profile = np.load(os.path.join(profiles_dir, profile_file))
        profiles[name] = profile

    # Calculate similarity to each profile
    # (resemblyzer embeddings are L2-normalized, so the dot
    # product behaves like cosine similarity)
    similarities = {}
    for name, profile in profiles.items():
        similarity = np.dot(test_embedding, profile)
        similarities[name] = similarity

    # Return most similar
    best_match = max(similarities, key=similarities.get)
    confidence = similarities[best_match]

    if confidence > 0.7:  # Threshold
        return best_match
    else:
        return "unknown"
```
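
One subtlety: averaging several embeddings (as `enroll_user` does) yields a mean vector that is no longer unit length, so a raw dot product drifts away from true cosine similarity. Normalizing inside the comparison keeps the 0.7 threshold meaningful. A standalone numpy sketch (random vectors stand in for real embeddings):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity, safe for non-unit vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
profile = rng.normal(size=256)
same = profile + 0.1 * rng.normal(size=256)   # same voice, slight variation
other = rng.normal(size=256)                  # a different voice

# The perturbed vector stays close; an unrelated vector does not
print(cosine_sim(profile, same) > cosine_sim(profile, other))  # True
```

Swapping `np.dot(test_embedding, profile)` for `cosine_sim(test_embedding, profile)` in `identify_speaker` is a drop-in change.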

#### Option B: Use pyannote.audio (Production-grade)

```bash
pip install pyannote.audio --break-system-packages

# Requires HuggingFace token (same as diarization)
```

**Example:**
```python
from pyannote.audio import Model, Inference
from scipy.spatial.distance import cosine

# Initialize (window="whole" returns one embedding per file)
model = Model.from_pretrained(
    "pyannote/embedding",
    use_auth_token="your_hf_token"
)
inference = Inference(model, window="whole")

# Enroll users
alice_profile = inference("alice_sample.wav")
bob_profile = inference("bob_sample.wav")

# Identify
test_embedding = inference("test_audio.wav")

# Compare
alice_similarity = 1 - cosine(test_embedding, alice_profile)
bob_similarity = 1 - cosine(test_embedding, bob_profile)

if alice_similarity > bob_similarity and alice_similarity > 0.7:
    speaker = "Alice"
elif bob_similarity > 0.7:
    speaker = "Bob"
else:
    speaker = "Unknown"
```

**Pros:**
- ✅ Can identify individual users
- ✅ Personalized responses
- ✅ User-specific commands/permissions
- ✅ Better for privacy (know who's speaking)

**Cons:**
- ❌ More complex implementation
- ❌ Requires enrollment phase
- ❌ Additional processing time (~100-200ms)
- ❌ May fail with similar voices

### Approach 3: Per-User Wake Word Models

Each person has their OWN wake word:

```bash
# Alice's wake word: "Hey Mycroft"
# Train on ONLY Alice's voice

# Bob's wake word: "Hey Jarvis"
# Train on ONLY Bob's voice

# Carol's wake word: "Hey Computer"
# Train on ONLY Carol's voice
```

**Deployment:**
Run all 3 models in parallel (server-side):
```python
wake_word_configs = [
    {'name': 'Alice', 'wake_word': 'hey mycroft', 'model': 'alice-wake.net'},
    {'name': 'Bob', 'wake_word': 'hey jarvis', 'model': 'bob-wake.net'},
    {'name': 'Carol', 'wake_word': 'hey computer', 'model': 'carol-wake.net'}
]
```

**Pros:**
- ✅ Automatic user identification
- ✅ Highest accuracy per user
- ✅ Clear user separation
- ✅ No additional speaker ID needed

**Cons:**
- ❌ Requires 3x models (server only)
- ❌ Users must remember their wake word
- ❌ 3x CPU usage (~15-30%)
- ❌ Can't work on edge (K210)

### Approach 4: Context-Based Adaptation

No speaker ID, but learn from interaction:

```python
# Track command patterns
user_context = {
    'last_command': 'turn on living room lights',
    'frequent_entities': ['light.living_room', 'light.bedroom'],
    'time_of_day_patterns': {'morning': 'coffee maker', 'evening': 'tv'},
    'location': 'home'  # vs 'away'
}

# Use context to improve intent recognition (pseudo-code)
if command == "turn on the lights" and is_morning():
    # Probably means the usual lights (based on history)
    entity = user_context['frequent_entities'][0]
```
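
The frequency-tracking idea above can be made concrete with a few lines. A minimal, runnable sketch (entity names are illustrative; a real server would persist these counts between restarts):

```python
from collections import Counter

class UserContext:
    """Tracks which entities a household actually uses."""

    def __init__(self):
        self.entity_counts = Counter()

    def record(self, entity):
        """Call after every successfully executed command."""
        self.entity_counts[entity] += 1

    def most_likely_entity(self, default=None):
        """Best guess when a command is ambiguous ('the lights')."""
        if not self.entity_counts:
            return default
        return self.entity_counts.most_common(1)[0][0]

ctx = UserContext()
for e in ['light.bedroom', 'light.living_room', 'light.bedroom']:
    ctx.record(e)

print(ctx.most_likely_entity())  # light.bedroom
```

This keeps the "learning" purely statistical, which sidesteps the enrollment phase the other approaches require.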

**Pros:**
- ✅ No enrollment needed
- ✅ Improves over time
- ✅ Simple to implement
- ✅ Works with any number of users

**Cons:**
- ❌ No true user identification
- ❌ May make incorrect assumptions
- ❌ Privacy concerns (tracking behavior)

## Recommended Strategy

### For Your Use Case

Based on your home lab setup, I recommend:

#### Phase 1: Single Wake Word, Inclusive Training (Week 1-2)

```bash
# Start simple
cd ~/precise-models/hey-computer
conda activate precise

# Have all family members record samples
# Alice: 30 samples of "Hey Computer"
# Bob: 30 samples of "Hey Computer"
# You: 30 samples of "Hey Computer"

# Train single model on all voices
precise-train -e 60 hey-computer.net .

# Deploy to server
python voice_server.py \
    --enable-precise \
    --precise-model hey-computer.net
```

**Why:**
- Simple to set up and test
- Everyone can use it immediately
- Single model = easier debugging
- Works on edge if you migrate later

#### Phase 2: Add Speaker Identification (Week 3-4)

```bash
# Install resemblyzer
pip install resemblyzer --break-system-packages

# Enroll users
python enroll_users.py
# Each person speaks for 20 seconds

# Update voice_server.py to identify speaker
# Use speaker ID for personalized responses
```

**Why:**
- Enables personalization
- Can track preferences per user
- User-specific command permissions
- Better privacy (know who's speaking)

#### Phase 3: Multiple Wake Words (Month 2+)

```bash
# Add alternative wake words for different contexts
# "Hey Mycroft" - General commands
# "Hey Jarvis" - Media/Plex control
# "Computer" - Quick commands (lights, temp)

# Deploy multiple models on server
python voice_server.py \
    --enable-precise \
    --precise-models "mycroft:hey-mycroft.net:0.5,jarvis:hey-jarvis.net:0.5"
```

**Why:**
- Different wake words for different contexts
- Reduces false positives (more specific triggers)
- Fun factor (Jarvis for media!)
- Server can handle 2-3 easily

## Implementation Guide: Multiple Wake Words

### Update voice_server.py for Multiple Wake Words

```python
# Add to voice_server.py

def start_multiple_wake_words(configs):
    """
    Start multiple wake word detectors

    Args:
        configs: List of dicts with 'name', 'model_path', 'sensitivity'
    """
    global precise_runners
    precise_runners = {}

    for config in configs:
        try:
            engine = PreciseEngine(
                DEFAULT_PRECISE_ENGINE,
                config['model_path']
            )

            def make_callback(wake_word_name):
                def callback():
                    print(f"Wake word detected: {wake_word_name}")
                    wake_word_queue.put({
                        'wake_word': wake_word_name,
                        'timestamp': time.time(),
                        'source': 'precise'
                    })
                return callback

            runner = PreciseRunner(
                engine,
                sensitivity=config['sensitivity'],
                on_activation=make_callback(config['name'])
            )

            runner.start()
            precise_runners[config['name']] = runner

            print(f"✓ Started: {config['name']} (sensitivity: {config['sensitivity']})")

        except Exception as e:
            print(f"✗ Failed to start {config['name']}: {e}")

    return len(precise_runners) > 0

# Add to main()
parser.add_argument('--precise-models',
                    help='Multiple models: name:path:sensitivity,name2:path2:sensitivity2')

# Parse multiple models
if args.precise_models:
    configs = []
    for model_spec in args.precise_models.split(','):
        name, path, sensitivity = model_spec.split(':')
        configs.append({
            'name': name,
            'model_path': os.path.expanduser(path),
            'sensitivity': float(sensitivity)
        })

    start_multiple_wake_words(configs)
```

### Usage Example

```bash
cd ~/voice-assistant

# Start with multiple wake words
python voice_server.py \
    --enable-precise \
    --precise-models "\
hey-mycroft:~/precise-models/pretrained/hey-mycroft.net:0.5,\
hey-jarvis:~/precise-models/pretrained/hey-jarvis.net:0.5"
```
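
The `name:path:sensitivity` spec format is easy to unit-test in isolation before wiring it into the server. A standalone sketch of the same parsing logic:

```python
import os

def parse_model_specs(spec):
    """Parse 'name:path:sensitivity,...' into config dicts."""
    configs = []
    for model_spec in spec.split(','):
        name, path, sensitivity = model_spec.split(':')
        configs.append({
            'name': name,
            'model_path': os.path.expanduser(path),
            'sensitivity': float(sensitivity)
        })
    return configs

configs = parse_model_specs(
    "mycroft:models/hey-mycroft.net:0.5,jarvis:models/hey-jarvis.net:0.7"
)
print([c['name'] for c in configs])  # ['mycroft', 'jarvis']
```

Note the format cannot contain colons in paths; if that ever matters, switching to `model_spec.rsplit(':', 2)` keeps the last two fields fixed while allowing colons earlier in the string.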

## Implementation Guide: Speaker Identification

### Add to voice_server.py

```python
# Add resemblyzer support
import os

try:
    from resemblyzer import VoiceEncoder, preprocess_wav
    import numpy as np
    SPEAKER_ID_AVAILABLE = True
except ImportError:
    SPEAKER_ID_AVAILABLE = False
    print("Warning: resemblyzer not available. Speaker ID disabled.")

# Encoder and profiles are initialized lazily
voice_encoder = None
speaker_profiles = {}

def load_speaker_profiles(profiles_dir='~/voice-assistant/profiles'):
    """Load enrolled speaker profiles"""
    global speaker_profiles, voice_encoder

    if not SPEAKER_ID_AVAILABLE:
        return False

    profiles_dir = os.path.expanduser(profiles_dir)

    if not os.path.exists(profiles_dir):
        print(f"No speaker profiles found at {profiles_dir}")
        return False

    # Initialize encoder
    voice_encoder = VoiceEncoder()

    # Load all profiles
    for profile_file in os.listdir(profiles_dir):
        if profile_file.endswith('.npy'):
            name = profile_file.replace('.npy', '')
            profile = np.load(os.path.join(profiles_dir, profile_file))
            speaker_profiles[name] = profile
            print(f"Loaded speaker profile: {name}")

    return len(speaker_profiles) > 0

def identify_speaker(audio_path, threshold=0.7):
    """Identify speaker from audio file"""
    if not SPEAKER_ID_AVAILABLE or not speaker_profiles:
        return None

    try:
        # Get embedding for test audio
        wav = preprocess_wav(audio_path)
        test_embedding = voice_encoder.embed_utterance(wav)

        # Compare to all profiles
        similarities = {}
        for name, profile in speaker_profiles.items():
            similarity = np.dot(test_embedding, profile)
            similarities[name] = similarity

        # Get best match
        best_match = max(similarities, key=similarities.get)
        confidence = similarities[best_match]

        print(f"Speaker ID: {best_match} (confidence: {confidence:.2f})")

        if confidence > threshold:
            return best_match
        else:
            return "unknown"

    except Exception as e:
        print(f"Error identifying speaker: {e}")
        return None

# Update the /process endpoint to include speaker ID
@app.route('/process', methods=['POST'])
def process():
    """Process complete voice command with speaker identification"""
    # ... existing code ...

    # Add speaker identification
    speaker = identify_speaker(temp_path) if speaker_profiles else None

    if speaker:
        print(f"Detected speaker: {speaker}")
        # Could personalize response based on speaker

    # ... rest of processing ...
```

### Enrollment Script

Create `enroll_speaker.py`:

```python
#!/usr/bin/env python3
"""
Enroll users for speaker identification

Usage:
    python enroll_speaker.py --name Alice --audio alice_sample.wav
    python enroll_speaker.py --name Alice --duration 20  # Record live
"""

import argparse
import os
import sys
import wave

import numpy as np
import pyaudio
from resemblyzer import VoiceEncoder, preprocess_wav

def record_audio(duration=20, sample_rate=16000):
    """Record audio from microphone"""
    print(f"Recording for {duration} seconds...")
    print("Speak naturally - read a paragraph, have a conversation, etc.")

    chunk = 1024
    audio_format = pyaudio.paInt16  # renamed to avoid shadowing builtin format()
    channels = 1

    p = pyaudio.PyAudio()

    stream = p.open(
        format=audio_format,
        channels=channels,
        rate=sample_rate,
        input=True,
        frames_per_buffer=chunk
    )

    frames = []
    for _ in range(0, int(sample_rate / chunk * duration)):
        data = stream.read(chunk)
        frames.append(data)

    stream.stop_stream()
    stream.close()
    p.terminate()

    # Save to temp file
    temp_file = f"/tmp/enrollment_{os.getpid()}.wav"
    wf = wave.open(temp_file, 'wb')
    wf.setnchannels(channels)
    wf.setsampwidth(p.get_sample_size(audio_format))
    wf.setframerate(sample_rate)
    wf.writeframes(b''.join(frames))
    wf.close()

    return temp_file

def enroll_speaker(name, audio_file, profiles_dir='~/voice-assistant/profiles'):
    """Create voice profile for speaker"""
    profiles_dir = os.path.expanduser(profiles_dir)
    os.makedirs(profiles_dir, exist_ok=True)

    # Initialize encoder
    encoder = VoiceEncoder()

    # Process audio
    wav = preprocess_wav(audio_file)
    embedding = encoder.embed_utterance(wav)

    # Save profile
    profile_path = os.path.join(profiles_dir, f'{name}.npy')
    np.save(profile_path, embedding)

    print(f"✓ Enrolled speaker: {name}")
    print(f"  Profile saved to: {profile_path}")

    return profile_path

def main():
    parser = argparse.ArgumentParser(description="Enroll speaker for voice identification")
    parser.add_argument('--name', required=True, help='Speaker name')
    parser.add_argument('--audio', help='Path to audio file (wav)')
    parser.add_argument('--duration', type=int, default=20,
                        help='Recording duration if not using audio file')
    parser.add_argument('--profiles-dir', default='~/voice-assistant/profiles',
                        help='Directory to save profiles')

    args = parser.parse_args()

    # Get audio file
    if args.audio:
        audio_file = args.audio
        if not os.path.exists(audio_file):
            print(f"Error: Audio file not found: {audio_file}")
            return 1
    else:
        audio_file = record_audio(args.duration)

    # Enroll speaker
    try:
        enroll_speaker(args.name, audio_file, args.profiles_dir)
        return 0
    except Exception as e:
        print(f"Error enrolling speaker: {e}")
        return 1

if __name__ == '__main__':
    sys.exit(main())
```

## Performance Comparison

### Single Wake Word
- **Latency:** 100-200ms
- **CPU:** ~5-10% (idle)
- **Memory:** ~100MB
- **Accuracy:** 95%+

### Multiple Wake Words (3 models)
- **Latency:** 100-200ms (parallel)
- **CPU:** ~15-30% (idle)
- **Memory:** ~300MB
- **Accuracy:** 95%+ each

### With Speaker Identification
- **Additional latency:** +100-200ms
- **Additional CPU:** +5% during ID
- **Additional memory:** +50MB
- **Accuracy:** 85-95% (depending on enrollment quality)

## Best Practices

### Wake Word Selection
1. **Different enough** - "Hey Mycroft" vs "Hey Jarvis" (not "Hey Alice" vs "Hey Alex")
2. **Clear consonants** - Easier to detect
3. **2-3 syllables** - Not too short, not too long
4. **Test in environment** - Check for false triggers

### Training
1. **Include all users** - If using single model
2. **Diverse conditions** - Different rooms, noise levels
3. **Regular updates** - Add false positives weekly
4. **Per-user models** - Higher accuracy, more compute

### Speaker Identification
1. **Quality enrollment** - 20+ seconds of clear speech
2. **Re-enroll periodically** - Voices change (colds, etc.)
3. **Test thresholds** - Balance accuracy vs false IDs
4. **Graceful fallback** - Handle unknown speakers
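
For the threshold-testing point above, a quick sweep over held-out similarity scores makes the trade-off concrete. The scores below are illustrative (4 genuine clips, 3 impostor clips); in practice you would collect them by running `identify_speaker` on labeled recordings:

```python
# (similarity score, is_correct_speaker) for held-out clips
scores = [(0.92, True), (0.85, True), (0.74, True), (0.69, True),
          (0.66, False), (0.55, False), (0.48, False)]

def evaluate(threshold):
    """True-accept and false-accept rates at a threshold
    (4 genuine / 3 impostor clips in this toy set)."""
    accepted = [ok for s, ok in scores if s > threshold]
    if not accepted:
        return 0.0, 0.0
    true_accepts = sum(accepted)
    return true_accepts / 4, (len(accepted) - true_accepts) / 3

for t in (0.5, 0.6, 0.7):
    tar, far = evaluate(t)
    print(f"threshold={t}: true-accept={tar:.2f}, false-accept={far:.2f}")
```

Raising the threshold from 0.5 to 0.7 here eliminates false accepts at the cost of rejecting one genuine clip, which is the balance point 3 asks you to tune.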

## Recommended Path for You

```bash
# Week 1: Start with pre-trained "Hey Mycroft"
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz
precise-listen hey-mycroft.net  # Test it!

# Week 2: Fine-tune with your voices
precise-train -e 30 hey-mycroft-custom.net . \
    --from-checkpoint hey-mycroft.net

# Week 3: Add speaker identification
pip install resemblyzer
python enroll_speaker.py --name Alan --duration 20
python enroll_speaker.py --name [Family Member] --duration 20

# Week 4: Add second wake word ("Hey Jarvis" for Plex?)
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-jarvis.tar.gz
tar xzf hey-jarvis.tar.gz
# Run both in parallel

# Month 2+: Optimize and expand
# - More wake words for different contexts
# - Per-user wake word models
# - Context-aware responses
```

This gives you a smooth progression from simple to advanced!