Advanced Wake Word Topics - Pre-trained Models, Multiple Wake Words, and Voice Adaptation

Pre-trained Mycroft Models

Yes! Pre-trained Models Exist

Mycroft AI provides several pre-trained wake word models you can use immediately:

Available Models:

  • Hey Mycroft - Original Mycroft wake word (most training data)
  • Hey Jarvis - Popular alternative
  • Christopher - Alternative wake word
  • Hey Ezra - Another option

Download Pre-trained Models

# On Heimdall
conda activate precise
cd ~/precise-models

# Create directory for pre-trained models
mkdir -p pretrained
cd pretrained

# Download Hey Mycroft (recommended starting point)
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz

# Download other models
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-jarvis.tar.gz
tar xzf hey-jarvis.tar.gz

# List available models
ls -lh *.net

Test Pre-trained Model

conda activate precise

# Test Hey Mycroft
precise-listen hey-mycroft.net

# Speak "Hey Mycroft" - should see "!" when detected
# Press Ctrl+C to exit

# Test with different threshold
precise-listen hey-mycroft.net -t 0.7  # More conservative

Use Pre-trained Model in Voice Server

cd ~/voice-assistant

# Start server with Hey Mycroft model
python voice_server.py \
    --enable-precise \
    --precise-model ~/precise-models/pretrained/hey-mycroft.net \
    --precise-sensitivity 0.5

Fine-tune Pre-trained Models

You can use pre-trained models as a starting point and fine-tune with your voice:

cd ~/precise-models
mkdir -p hey-mycroft-custom

# Copy base model
cp pretrained/hey-mycroft.net hey-mycroft-custom/

# Collect your samples
cd hey-mycroft-custom
precise-collect  # Record 20-30 samples of YOUR voice

# Fine-tune from pre-trained model
precise-train -e 30 hey-mycroft-custom.net . \
    --from-checkpoint ../pretrained/hey-mycroft.net

# This is MUCH faster than training from scratch!

Benefits:

  • Start with proven model
  • Much less training data needed (20-30 vs 100+ samples)
  • Faster training (30 mins vs 60 mins)
  • Good baseline accuracy

Multiple Wake Words

Architecture Options

Option 1: Multiple Models in Parallel (Server-Side Only)

Run multiple Precise instances simultaneously:

# In voice_server.py - Multiple wake word detection

import time
from queue import Queue

from precise_runner import PreciseEngine, PreciseRunner

# Global runners and the queue the callbacks feed
precise_runners = {}
wake_word_queue = Queue()

def on_wake_word_detected(wake_word_name):
    """Callback factory for different wake words"""
    def callback():
        print(f"Wake word detected: {wake_word_name}")
        wake_word_queue.put({
            'wake_word': wake_word_name,
            'timestamp': time.time()
        })
    return callback

def start_multiple_wake_words(wake_word_configs):
    """
    Start multiple wake word detectors
    
    Args:
        wake_word_configs: List of dicts with 'name', 'model', 'sensitivity'
    
    Example:
        configs = [
            {'name': 'hey mycroft', 'model': 'hey-mycroft.net', 'sensitivity': 0.5},
            {'name': 'hey jarvis', 'model': 'hey-jarvis.net', 'sensitivity': 0.5}
        ]
    """
    global precise_runners
    
    for config in wake_word_configs:
        engine = PreciseEngine(
            '/usr/local/bin/precise-engine',
            config['model']
        )
        
        runner = PreciseRunner(
            engine,
            sensitivity=config['sensitivity'],
            on_activation=on_wake_word_detected(config['name'])
        )
        
        runner.start()
        precise_runners[config['name']] = runner
        
        print(f"Started wake word detector: {config['name']}")
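
The runners above only enqueue events; something still has to drain the queue and act on them. A minimal sketch of a consumer thread, assuming `wake_word_queue` is a standard `queue.Queue` as in the callbacks above (the routing logic is illustrative, not from voice_server.py):

```python
import queue
import threading

# The detector callbacks push events onto this queue.
wake_word_queue = queue.Queue()

def wake_word_consumer(stop_event):
    """Drain wake word events and dispatch them to a handler."""
    while not stop_event.is_set():
        try:
            event = wake_word_queue.get(timeout=0.5)
        except queue.Empty:
            continue
        # Route by which wake word fired, e.g. to different pipelines
        print(f"[{event['timestamp']:.0f}] handling '{event['wake_word']}'")
        wake_word_queue.task_done()

# Run the consumer alongside the PreciseRunner threads
stop = threading.Event()
worker = threading.Thread(target=wake_word_consumer, args=(stop,), daemon=True)
worker.start()
```

The `timeout` on `get()` lets the thread notice `stop_event` and shut down cleanly instead of blocking forever.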

Server-Side Multiple Wake Words:

# Start server with multiple wake words
python voice_server.py \
    --enable-precise \
    --precise-models "hey-mycroft:~/models/hey-mycroft.net:0.5,hey-jarvis:~/models/hey-jarvis.net:0.5"

Performance Impact:

  • CPU: ~5-10% per model (can run 2-3 easily)
  • Memory: ~50-100MB per model
  • Latency: Minimal (all run in parallel)

Option 2: Single Model, Multiple Phrases (Edge or Server)

Train ONE model that responds to multiple phrases:

cd ~/precise-models/multi-wake
conda activate precise

# Record samples for BOTH wake words in the SAME dataset
# Label all as "wake-word" regardless of which phrase

mkdir -p wake-word not-wake-word

# Record "Hey Mycroft" samples
precise-collect  # Save to wake-word/hey-mycroft-*.wav

# Record "Hey Computer" samples  
precise-collect  # Save to wake-word/hey-computer-*.wav

# Record negatives
precise-collect -f not-wake-word/random.wav

# Train single model on both phrases
precise-train -e 60 multi-wake.net .

Pros:

  • Single model = less compute
  • Works on edge (K210)
  • Easy to deploy

Cons:

  • Can't tell which wake word was used
  • May reduce accuracy for each individual phrase
  • Higher false positive risk
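
Because a single multi-phrase model degrades when one phrase dominates the dataset, it is worth checking sample balance before training. A small sketch, assuming samples are named with a phrase prefix as in the `precise-collect` comments above (the helper names and the 2x ratio are illustrative):

```python
from collections import Counter
from pathlib import Path

def phrase_counts(wake_dir):
    """Count wake word samples per phrase, keyed by filename prefix.

    Assumes files like hey-mycroft-001.wav / hey-computer-003.wav, where
    everything before the last '-' is the phrase name.
    """
    counts = Counter()
    for wav in Path(wake_dir).glob('*.wav'):
        phrase = wav.stem.rsplit('-', 1)[0]
        counts[phrase] += 1
    return counts

def is_balanced(counts, max_ratio=2.0):
    """Flag datasets where one phrase has over max_ratio times another."""
    if len(counts) < 2:
        return True
    values = counts.values()
    return max(values) / max(min(values), 1) <= max_ratio
```

If `is_balanced()` fails, record more samples of the under-represented phrase before running `precise-train`.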

Option 3: Sequential Detection (Edge)

Detect wake word, then identify which one:

# Pseudo-code for edge detection
if wake_word_detected():
    audio_snippet = last_2_seconds()
    
    # Run all models on the audio snippet
    scores = {
        'hey-mycroft': model1.score(audio_snippet),
        'hey-jarvis': model2.score(audio_snippet),
        'hey-computer': model3.score(audio_snippet)
    }
    
    # Use highest scoring wake word
    wake_word = max(scores, key=scores.get)
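
The pseudo-code's `max()` will always pick *something*, even when no model is confident. A small runnable sketch of the selection step that also rejects low-confidence or ambiguous results (the threshold and margin values are illustrative, not from Precise):

```python
def pick_wake_word(scores, min_score=0.6, min_margin=0.15):
    """Choose the best-scoring wake word, or None if the result is ambiguous.

    scores: dict mapping wake word name -> model confidence in [0, 1].
    A result only counts if the winner clears min_score AND beats the
    runner-up by min_margin; otherwise treat it as a false trigger.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_name, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_score >= min_score and best_score - runner_up >= min_margin:
        return best_name
    return None

# A clear winner is returned; two near-identical scores are rejected
pick_wake_word({'hey-mycroft': 0.91, 'hey-jarvis': 0.42})  # -> 'hey-mycroft'
pick_wake_word({'hey-mycroft': 0.70, 'hey-jarvis': 0.68})  # -> None
```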

Recommendations

Server-Side (Heimdall):

  • Use Option 1 - Multiple models in parallel
  • Run 2-3 wake words easily
  • Each can have different sensitivity
  • Can identify which wake word was used
  • Example: "Hey Mycroft" for commands, "Hey Jarvis" for queries

Edge (Maix Duino K210):

  • Use Option 2 - Single multi-phrase model
  • K210 can handle 1 model efficiently
  • Train on 2-3 phrases max
  • Simpler deployment
  • Lower latency

Voice Adaptation & Multi-User Support

Approach 1: Inclusive Training (Simple)

Train ONE model on EVERYONE'S voices:

cd ~/precise-models/family-wake-word
conda activate precise

# Record samples from each family member
# Alice records 30 samples
precise-collect  # Save as wake-word/alice-*.wav

# Bob records 30 samples
precise-collect  # Save as wake-word/bob-*.wav

# Carol records 30 samples
precise-collect  # Save as wake-word/carol-*.wav

# Train on all voices
precise-train -e 60 family-wake-word.net .

Pros:

  • Everyone can use the system
  • Single model deployment
  • Works for all family members
  • Simple maintenance

Cons:

  • Can't identify who spoke
  • May need more training data
  • No personalization

Best for: Family voice assistant, shared devices

Approach 2: Speaker Identification (Advanced)

Detect wake word, then identify speaker:

# Architecture with speaker ID

# Step 1: Precise detects wake word
if wake_word_detected():
    
    # Step 2: Capture voice sample
    voice_sample = record_audio(duration=3)
    
    # Step 3: Speaker identification
    speaker = identify_speaker(voice_sample)
    # Uses voice embeddings/neural network
    
    # Step 4: Process with user context
    process_command(voice_sample, user=speaker)

Implementation Options:

Option A: Use resemblyzer (Voice Embeddings)

pip install resemblyzer --break-system-packages

# Enrollment phase
python enroll_speaker.py --name Alice
# Each user records 10-20 seconds of speech
# System creates voice profile (embedding)

# Runtime
python speaker_id.py
# Compares incoming audio to stored embeddings
# Returns most likely speaker

Example Code:

import os

from resemblyzer import VoiceEncoder, preprocess_wav
import numpy as np

# Initialize encoder
encoder = VoiceEncoder()

# Enrollment - do once per user
def enroll_user(name, audio_files):
    """Create voice profile for user"""
    embeddings = []
    
    for audio_file in audio_files:
        wav = preprocess_wav(audio_file)
        embedding = encoder.embed_utterance(wav)
        embeddings.append(embedding)
    
    # Average embeddings for robustness
    user_profile = np.mean(embeddings, axis=0)
    
    # Save profile
    np.save(f'profiles/{name}.npy', user_profile)
    return user_profile

# Identification - run each time
def identify_speaker(audio_file, profiles_dir='profiles'):
    """Identify which enrolled user is speaking"""
    wav = preprocess_wav(audio_file)
    test_embedding = encoder.embed_utterance(wav)
    
    # Load all profiles
    profiles = {}
    for profile_file in os.listdir(profiles_dir):
        name = profile_file.replace('.npy', '')
        profile = np.load(os.path.join(profiles_dir, profile_file))
        profiles[name] = profile
    
    # Calculate similarity to each profile
    similarities = {}
    for name, profile in profiles.items():
        similarity = np.dot(test_embedding, profile)
        similarities[name] = similarity
    
    # Return most similar
    best_match = max(similarities, key=similarities.get)
    confidence = similarities[best_match]
    
    if confidence > 0.7:  # Threshold
        return best_match
    else:
        return "unknown"
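
The plain dot product above works as a similarity measure because resemblyzer returns (approximately) unit-length embeddings, so the dot product coincides with cosine similarity. A pure-Python illustration of that equivalence (vectors are illustrative):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length (L2 norm = 1)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])
# For unit vectors, dot product and cosine similarity coincide
assert abs(dot(a, b) - cosine_similarity(a, b)) < 1e-12
```

This is why the 0.7 threshold in `identify_speaker()` can be read directly as a cosine similarity cutoff.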

Option B: Use pyannote.audio (Production-grade)

pip install pyannote.audio --break-system-packages

# Requires HuggingFace token (same as diarization)

Example:

from pyannote.audio import Inference

# Initialize
inference = Inference(
    "pyannote/embedding",
    use_auth_token="your_hf_token"
)

# Enroll users
alice_profile = inference("alice_sample.wav")
bob_profile = inference("bob_sample.wav")

# Identify
test_embedding = inference("test_audio.wav")

# Compare
from scipy.spatial.distance import cosine
alice_similarity = 1 - cosine(test_embedding, alice_profile)
bob_similarity = 1 - cosine(test_embedding, bob_profile)

if alice_similarity > bob_similarity and alice_similarity > 0.7:
    speaker = "Alice"
elif bob_similarity > 0.7:
    speaker = "Bob"
else:
    speaker = "Unknown"

Pros:

  • Can identify individual users
  • Personalized responses
  • User-specific commands/permissions
  • Better for privacy (know who's speaking)

Cons:

  • More complex implementation
  • Requires enrollment phase
  • Additional processing time (~100-200ms)
  • May fail with similar voices

Approach 3: Per-User Wake Word Models

Each person has their OWN wake word:

# Alice's wake word: "Hey Mycroft"
# Train on ONLY Alice's voice

# Bob's wake word: "Hey Jarvis"  
# Train on ONLY Bob's voice

# Carol's wake word: "Hey Computer"
# Train on ONLY Carol's voice

Deployment: Run all 3 models in parallel (server-side):

wake_word_configs = [
    {'name': 'Alice', 'wake_word': 'hey mycroft', 'model': 'alice-wake.net'},
    {'name': 'Bob', 'wake_word': 'hey jarvis', 'model': 'bob-wake.net'},
    {'name': 'Carol', 'wake_word': 'hey computer', 'model': 'carol-wake.net'}
]

Pros:

  • Automatic user identification
  • Highest accuracy per user
  • Clear user separation
  • No additional speaker ID needed

Cons:

  • Requires 3x models (server only)
  • Users must remember their wake word
  • 3x CPU usage (~15-30%)
  • Can't work on edge (K210)

Approach 4: Context-Based Adaptation

No speaker ID, but learn from interaction:

# Track command patterns
user_context = {
    'last_command': 'turn on living room lights',
    'frequent_entities': ['light.living_room', 'light.bedroom'],
    'time_of_day_patterns': {'morning': 'coffee maker', 'evening': 'tv'},
    'location': 'home'  # vs 'away'
}

# Use context to improve intent recognition
if "turn on the lights" and time.is_morning():
    # Probably means bedroom lights (based on history)
    entity = user_context['frequent_entities'][0]

Pros:

  • No enrollment needed
  • Improves over time
  • Simple to implement
  • Works with any number of users

Cons:

  • No true user identification
  • May make incorrect assumptions
  • Privacy concerns (tracking behavior)
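
The context dict above can be kept fresh with simple frequency counting rather than anything learned. A minimal sketch of tracking per-time-of-day entity preferences (class and entity names are illustrative):

```python
from collections import Counter

class CommandContext:
    """Track which entities the household actually uses, per time of day."""

    def __init__(self):
        self.entity_counts = {}  # time-of-day bucket -> Counter of entities

    def record(self, bucket, entity):
        """Log one successful command, e.g. record('morning', 'light.bedroom')."""
        self.entity_counts.setdefault(bucket, Counter())[entity] += 1

    def most_likely(self, bucket):
        """Best guess for an ambiguous command like 'turn on the lights'."""
        counts = self.entity_counts.get(bucket)
        if not counts:
            return None
        return counts.most_common(1)[0][0]

ctx = CommandContext()
ctx.record('morning', 'light.bedroom')
ctx.record('morning', 'light.bedroom')
ctx.record('morning', 'light.living_room')
```

With no history for a bucket, `most_likely()` returns None and the server should fall back to asking which entity was meant.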

For Your Use Case

Based on your home lab setup, I recommend:

Phase 1: Single Wake Word, Inclusive Training (Week 1-2)

# Start simple
cd ~/precise-models/hey-computer
conda activate precise

# Have all family members record samples
# Alice: 30 samples of "Hey Computer"
# Bob: 30 samples of "Hey Computer"
# You: 30 samples of "Hey Computer"

# Train single model on all voices
precise-train -e 60 hey-computer.net .

# Deploy to server
python voice_server.py \
    --enable-precise \
    --precise-model hey-computer.net

Why:

  • Simple to set up and test
  • Everyone can use it immediately
  • Single model = easier debugging
  • Works on edge if you migrate later

Phase 2: Add Speaker Identification (Week 3-4)

# Install resemblyzer
pip install resemblyzer --break-system-packages

# Enroll users
python enroll_speaker.py --name <user>
# Each person speaks for 20 seconds

# Update voice_server.py to identify speaker
# Use speaker ID for personalized responses

Why:

  • Enables personalization
  • Can track preferences per user
  • User-specific command permissions
  • Better privacy (know who's speaking)

Phase 3: Multiple Wake Words (Month 2+)

# Add alternative wake words for different contexts
# "Hey Mycroft" - General commands
# "Hey Jarvis" - Media/Plex control
# "Computer" - Quick commands (lights, temp)

# Deploy multiple models on server
python voice_server.py \
    --enable-precise \
    --precise-models "mycroft:hey-mycroft.net:0.5,jarvis:hey-jarvis.net:0.5"

Why:

  • Different wake words for different contexts
  • Reduces false positives (more specific triggers)
  • Fun factor (Jarvis for media!)
  • Server can handle 2-3 easily

Implementation Guide: Multiple Wake Words

Update voice_server.py for Multiple Wake Words

# Add to voice_server.py

def start_multiple_wake_words(configs):
    """
    Start multiple wake word detectors
    
    Args:
        configs: List of dicts with 'name', 'model_path', 'sensitivity'
    """
    global precise_runners
    precise_runners = {}
    
    for config in configs:
        try:
            engine = PreciseEngine(
                DEFAULT_PRECISE_ENGINE,
                config['model_path']
            )
            
            def make_callback(wake_word_name):
                def callback():
                    print(f"Wake word detected: {wake_word_name}")
                    wake_word_queue.put({
                        'wake_word': wake_word_name,
                        'timestamp': time.time(),
                        'source': 'precise'
                    })
                return callback
            
            runner = PreciseRunner(
                engine,
                sensitivity=config['sensitivity'],
                on_activation=make_callback(config['name'])
            )
            
            runner.start()
            precise_runners[config['name']] = runner
            
            print(f"✓ Started: {config['name']} (sensitivity: {config['sensitivity']})")
            
        except Exception as e:
            print(f"✗ Failed to start {config['name']}: {e}")
    
    return len(precise_runners) > 0

# Add to main()
parser.add_argument('--precise-models', 
                   help='Multiple models: name:path:sensitivity,name2:path2:sensitivity2')

# Parse multiple models
if args.precise_models:
    configs = []
    for model_spec in args.precise_models.split(','):
        name, path, sensitivity = model_spec.split(':')
        configs.append({
            'name': name,
            'model_path': os.path.expanduser(path),
            'sensitivity': float(sensitivity)
        })
    
    start_multiple_wake_words(configs)

Usage Example

cd ~/voice-assistant

# Start with multiple wake words
python voice_server.py \
    --enable-precise \
    --precise-models "\
hey-mycroft:~/precise-models/pretrained/hey-mycroft.net:0.5,\
hey-jarvis:~/precise-models/pretrained/hey-jarvis.net:0.5"
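
Since the `--precise-models` spec packs three fields per model, the parsing step is worth isolating so a typo fails fast at startup. A sketch of the same `name:path:sensitivity` format (the flag itself is this document's convention, not a stock Precise option):

```python
import os

def parse_model_specs(spec):
    """Parse 'name:path:sensitivity[,name:path:sensitivity...]' into configs.

    Raises ValueError on malformed entries instead of crashing mid-startup.
    """
    configs = []
    for entry in spec.split(','):
        parts = entry.strip().split(':')
        if len(parts) != 3:
            raise ValueError(f"Bad model spec: {entry!r}")
        name, path, sensitivity = parts
        configs.append({
            'name': name,
            'model_path': os.path.expanduser(path),
            'sensitivity': float(sensitivity),
        })
    return configs
```

Note that `expanduser()` is needed because a quoted `~` on the command line is not expanded by the shell.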

Implementation Guide: Speaker Identification

Add to voice_server.py

# Add resemblyzer support
try:
    from resemblyzer import VoiceEncoder, preprocess_wav
    import numpy as np
    SPEAKER_ID_AVAILABLE = True
except ImportError:
    SPEAKER_ID_AVAILABLE = False
    print("Warning: resemblyzer not available. Speaker ID disabled.")

# Initialize encoder
voice_encoder = None
speaker_profiles = {}

def load_speaker_profiles(profiles_dir='~/voice-assistant/profiles'):
    """Load enrolled speaker profiles"""
    global speaker_profiles, voice_encoder
    
    if not SPEAKER_ID_AVAILABLE:
        return False
    
    profiles_dir = os.path.expanduser(profiles_dir)
    
    if not os.path.exists(profiles_dir):
        print(f"No speaker profiles found at {profiles_dir}")
        return False
    
    # Initialize encoder
    voice_encoder = VoiceEncoder()
    
    # Load all profiles
    for profile_file in os.listdir(profiles_dir):
        if profile_file.endswith('.npy'):
            name = profile_file.replace('.npy', '')
            profile = np.load(os.path.join(profiles_dir, profile_file))
            speaker_profiles[name] = profile
            print(f"Loaded speaker profile: {name}")
    
    return len(speaker_profiles) > 0

def identify_speaker(audio_path, threshold=0.7):
    """Identify speaker from audio file"""
    if not SPEAKER_ID_AVAILABLE or not speaker_profiles:
        return None
    
    try:
        # Get embedding for test audio
        wav = preprocess_wav(audio_path)
        test_embedding = voice_encoder.embed_utterance(wav)
        
        # Compare to all profiles
        similarities = {}
        for name, profile in speaker_profiles.items():
            similarity = np.dot(test_embedding, profile)
            similarities[name] = similarity
        
        # Get best match
        best_match = max(similarities, key=similarities.get)
        confidence = similarities[best_match]
        
        print(f"Speaker ID: {best_match} (confidence: {confidence:.2f})")
        
        if confidence > threshold:
            return best_match
        else:
            return "unknown"
            
    except Exception as e:
        print(f"Error identifying speaker: {e}")
        return None

# Update process endpoint to include speaker ID
@app.route('/process', methods=['POST'])
def process():
    """Process complete voice command with speaker identification"""
    # ... existing code ...
    
    # Add speaker identification
    speaker = identify_speaker(temp_path) if speaker_profiles else None
    
    if speaker:
        print(f"Detected speaker: {speaker}")
        # Could personalize response based on speaker
    
    # ... rest of processing ...

Enrollment Script

Create enroll_speaker.py:

#!/usr/bin/env python3
"""
Enroll users for speaker identification

Usage:
    python enroll_speaker.py --name Alice --audio alice_sample.wav
    python enroll_speaker.py --name Alice --duration 20  # Record live
"""

import argparse
import os
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav
import pyaudio
import wave

def record_audio(duration=20, sample_rate=16000):
    """Record audio from microphone"""
    print(f"Recording for {duration} seconds...")
    print("Speak naturally - read a paragraph, have a conversation, etc.")
    
    chunk = 1024
    audio_format = pyaudio.paInt16
    channels = 1
    
    p = pyaudio.PyAudio()
    
    stream = p.open(
        format=audio_format,
        channels=channels,
        rate=sample_rate,
        input=True,
        frames_per_buffer=chunk
    )
    
    frames = []
    for _ in range(int(sample_rate / chunk * duration)):
        data = stream.read(chunk)
        frames.append(data)
    
    stream.stop_stream()
    stream.close()
    
    # Save to temp file
    temp_file = f"/tmp/enrollment_{os.getpid()}.wav"
    wf = wave.open(temp_file, 'wb')
    wf.setnchannels(channels)
    wf.setsampwidth(p.get_sample_size(audio_format))
    wf.setframerate(sample_rate)
    wf.writeframes(b''.join(frames))
    wf.close()
    p.terminate()
    
    return temp_file

def enroll_speaker(name, audio_file, profiles_dir='~/voice-assistant/profiles'):
    """Create voice profile for speaker"""
    profiles_dir = os.path.expanduser(profiles_dir)
    os.makedirs(profiles_dir, exist_ok=True)
    
    # Initialize encoder
    encoder = VoiceEncoder()
    
    # Process audio
    wav = preprocess_wav(audio_file)
    embedding = encoder.embed_utterance(wav)
    
    # Save profile
    profile_path = os.path.join(profiles_dir, f'{name}.npy')
    np.save(profile_path, embedding)
    
    print(f"✓ Enrolled speaker: {name}")
    print(f"  Profile saved to: {profile_path}")
    
    return profile_path

def main():
    parser = argparse.ArgumentParser(description="Enroll speaker for voice identification")
    parser.add_argument('--name', required=True, help='Speaker name')
    parser.add_argument('--audio', help='Path to audio file (wav)')
    parser.add_argument('--duration', type=int, default=20, 
                       help='Recording duration if not using audio file')
    parser.add_argument('--profiles-dir', default='~/voice-assistant/profiles',
                       help='Directory to save profiles')
    
    args = parser.parse_args()
    
    # Get audio file
    if args.audio:
        audio_file = args.audio
        if not os.path.exists(audio_file):
            print(f"Error: Audio file not found: {audio_file}")
            return 1
    else:
        audio_file = record_audio(args.duration)
    
    # Enroll speaker
    try:
        enroll_speaker(args.name, audio_file, args.profiles_dir)
        return 0
    except Exception as e:
        print(f"Error enrolling speaker: {e}")
        return 1

if __name__ == '__main__':
    import sys
    sys.exit(main())

Performance Comparison

Single Wake Word

  • Latency: 100-200ms
  • CPU: ~5-10% (idle)
  • Memory: ~100MB
  • Accuracy: 95%+

Multiple Wake Words (3 models)

  • Latency: 100-200ms (parallel)
  • CPU: ~15-30% (idle)
  • Memory: ~300MB
  • Accuracy: 95%+ each

With Speaker Identification

  • Additional latency: +100-200ms
  • Additional CPU: +5% during ID
  • Additional memory: +50MB
  • Accuracy: 85-95% (depending on enrollment quality)

Best Practices

Wake Word Selection

  1. Different enough - "Hey Mycroft" vs "Hey Jarvis" (not "Hey Alice" vs "Hey Alex")
  2. Clear consonants - Easier to detect
  3. 2-3 syllables - Not too short, not too long
  4. Test in environment - Check for false triggers

Training

  1. Include all users - If using single model
  2. Diverse conditions - Different rooms, noise levels
  3. Regular updates - Add false positives weekly
  4. Per-user models - Higher accuracy, more compute

Speaker Identification

  1. Quality enrollment - 20+ seconds of clear speech
  2. Re-enroll periodically - Voices change (colds, etc.)
  3. Test thresholds - Balance accuracy vs false IDs
  4. Graceful fallback - Handle unknown speakers
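
Point 3 above ("test thresholds") can be done offline: collect similarity scores from known-genuine and known-impostor trials, then sweep candidate thresholds. A minimal sketch of that sweep (the score lists are illustrative, not measured data):

```python
def best_threshold(genuine, impostor, candidates=None):
    """Pick the threshold that minimizes total errors on labeled trials.

    genuine: similarity scores where the speaker WAS the enrolled user.
    impostor: scores where it was someone else (should fall below threshold).
    """
    if candidates is None:
        candidates = [i / 100 for i in range(40, 96, 5)]
    best, best_errors = None, float('inf')
    for t in candidates:
        false_rejects = sum(1 for s in genuine if s < t)
        false_accepts = sum(1 for s in impostor if s >= t)
        errors = false_rejects + false_accepts
        if errors < best_errors:
            best, best_errors = t, errors
    return best

# Illustrative scores; real ones come from your enrolled profiles
genuine = [0.82, 0.88, 0.75, 0.91, 0.69]
impostor = [0.41, 0.55, 0.62, 0.48]
```

Weighting false accepts more heavily than false rejects is a reasonable variation if unknown speakers triggering commands is the bigger risk.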

Suggested Timeline

# Week 1: Start with pre-trained "Hey Mycroft"
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
precise-listen hey-mycroft.net  # Test it!

# Week 2: Fine-tune with your voices
precise-train -e 30 hey-mycroft-custom.net . \
    --from-checkpoint hey-mycroft.net

# Week 3: Add speaker identification
pip install resemblyzer
python enroll_speaker.py --name Alan --duration 20
python enroll_speaker.py --name [Family Member] --duration 20

# Week 4: Add second wake word ("Hey Jarvis" for Plex?)
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-jarvis.tar.gz
tar xzf hey-jarvis.tar.gz
# Run both in parallel

# Month 2+: Optimize and expand
# - More wake words for different contexts
# - Per-user wake word models
# - Context-aware responses

This gives you a smooth progression from simple to advanced!