# Advanced Wake Word Topics - Pre-trained Models, Multiple Wake Words, and Voice Adaptation

## Pre-trained Mycroft Models

### Yes! Pre-trained Models Exist

Mycroft AI provides several pre-trained wake word models you can use immediately:

**Available Models:**

- **Hey Mycroft** - Original Mycroft wake word (most training data)
- **Hey Jarvis** - Popular alternative
- **Christopher** - Alternative wake word
- **Hey Ezra** - Another option

### Download Pre-trained Models

```bash
# On Heimdall
conda activate precise
cd ~/precise-models

# Create directory for pre-trained models
mkdir -p pretrained
cd pretrained

# Download Hey Mycroft (recommended starting point)
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz

# Download other models
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-jarvis.tar.gz
tar xzf hey-jarvis.tar.gz

# List available models
ls -lh *.net
```

### Test Pre-trained Model

```bash
conda activate precise

# Test Hey Mycroft
precise-listen hey-mycroft.net
# Speak "Hey Mycroft" - should see "!" when detected
# Press Ctrl+C to exit

# Test with a different threshold
precise-listen hey-mycroft.net -t 0.7  # More conservative
```

### Use Pre-trained Model in Voice Server

```bash
cd ~/voice-assistant

# Start server with the Hey Mycroft model
python voice_server.py \
    --enable-precise \
    --precise-model ~/precise-models/pretrained/hey-mycroft.net \
    --precise-sensitivity 0.5
```

### Fine-tune Pre-trained Models

You can use pre-trained models as a **starting point** and fine-tune with your voice:

```bash
cd ~/precise-models
mkdir -p hey-mycroft-custom

# Copy base model
cp pretrained/hey-mycroft.net hey-mycroft-custom/

# Collect your samples
cd hey-mycroft-custom
precise-collect
# Record 20-30 samples of YOUR voice

# Fine-tune from the pre-trained model
precise-train -e 30 hey-mycroft-custom.net . \
    --from-checkpoint ../pretrained/hey-mycroft.net

# This is MUCH faster than training from scratch!
```

**Benefits:**

- ✅ Start with a proven model
- ✅ Much less training data needed (20-30 vs 100+ samples)
- ✅ Faster training (30 mins vs 60 mins)
- ✅ Good baseline accuracy

## Multiple Wake Words

### Architecture Options

#### Option 1: Multiple Models in Parallel (Server-Side Only)

Run multiple Precise instances simultaneously:

```python
# In voice_server.py - multiple wake word detection
import time

from precise_runner import PreciseEngine, PreciseRunner

# Global runners (wake_word_queue is defined elsewhere in voice_server.py)
precise_runners = {}

def on_wake_word_detected(wake_word_name):
    """Callback factory for different wake words"""
    def callback():
        print(f"Wake word detected: {wake_word_name}")
        wake_word_queue.put({
            'wake_word': wake_word_name,
            'timestamp': time.time()
        })
    return callback

def start_multiple_wake_words(wake_word_configs):
    """
    Start multiple wake word detectors

    Args:
        wake_word_configs: List of dicts with 'name', 'model', 'sensitivity'

    Example:
        configs = [
            {'name': 'hey mycroft', 'model': 'hey-mycroft.net', 'sensitivity': 0.5},
            {'name': 'hey jarvis', 'model': 'hey-jarvis.net', 'sensitivity': 0.5}
        ]
    """
    global precise_runners

    for config in wake_word_configs:
        engine = PreciseEngine(
            '/usr/local/bin/precise-engine',
            config['model']
        )
        runner = PreciseRunner(
            engine,
            sensitivity=config['sensitivity'],
            on_activation=on_wake_word_detected(config['name'])
        )
        runner.start()
        precise_runners[config['name']] = runner
        print(f"Started wake word detector: {config['name']}")
```

**Server-Side Multiple Wake Words:**

```bash
# Start server with multiple wake words
python voice_server.py \
    --enable-precise \
    --precise-models "hey-mycroft:~/models/hey-mycroft.net:0.5,hey-jarvis:~/models/hey-jarvis.net:0.5"
```

**Performance Impact:**

- CPU: ~5-10% per model (can run 2-3 easily)
- Memory: ~50-100MB per model
- Latency: minimal (all run in parallel)

#### Option 2: Single Model, Multiple Phrases (Edge or Server)

Train ONE model that responds to multiple phrases:

```bash
cd ~/precise-models/multi-wake
conda activate precise

# Record samples for BOTH wake words in the SAME dataset
# Label all as "wake-word" regardless of which phrase
mkdir -p wake-word not-wake-word

# Record "Hey Mycroft" samples
precise-collect
# Save to wake-word/hey-mycroft-*.wav

# Record "Hey Computer" samples
precise-collect
# Save to wake-word/hey-computer-*.wav

# Record negatives
precise-collect -f not-wake-word/random.wav

# Train a single model on both phrases
precise-train -e 60 multi-wake.net .
```

**Pros:**

- ✅ Single model = less compute
- ✅ Works on edge (K210)
- ✅ Easy to deploy

**Cons:**

- ❌ Can't tell which wake word was used
- ❌ May reduce accuracy for each individual phrase
- ❌ Higher false-positive risk

#### Option 3: Sequential Detection (Edge)

Detect a wake word, then identify which one:

```python
# Pseudo-code for edge detection
if wake_word_detected():
    audio_snippet = last_2_seconds()

    # Run all models on the audio snippet
    scores = {
        'hey-mycroft': model1.score(audio_snippet),
        'hey-jarvis': model2.score(audio_snippet),
        'hey-computer': model3.score(audio_snippet)
    }

    # Use the highest-scoring wake word
    wake_word = max(scores, key=scores.get)
```

### Recommendations

**Server-Side (Heimdall):**

- ✅ **Use Option 1** - Multiple models in parallel
- Run 2-3 wake words easily
- Each can have a different sensitivity
- Can identify which wake word was used
- Example: "Hey Mycroft" for commands, "Hey Jarvis" for queries

**Edge (Maix Duino K210):**

- ✅ **Use Option 2** - Single multi-phrase model
- K210 can handle 1 model efficiently
- Train on 2-3 phrases max
- Simpler deployment
- Lower latency

## Voice Adaptation & Multi-User Support

### Approach 1: Inclusive Training (Recommended)

Train ONE model on EVERYONE'S voices:

```bash
cd ~/precise-models/family-wake-word
conda activate precise

# Record samples from each family member

# Alice records 30 samples
precise-collect
# Save as wake-word/alice-*.wav

# Bob records 30 samples
precise-collect
# Save as wake-word/bob-*.wav

# Carol records 30 samples
precise-collect
# Save as wake-word/carol-*.wav

# Train on all voices
precise-train -e 60 family-wake-word.net .
```

**Pros:**

- ✅ Everyone can use the system
- ✅ Single model deployment
- ✅ Works for all family members
- ✅ Simple maintenance

**Cons:**

- ❌ Can't identify who spoke
- ❌ May need more training data
- ❌ No personalization

**Best for:** Family voice assistant, shared devices

### Approach 2: Speaker Identification (Advanced)

Detect the wake word, then identify the speaker:

```python
# Architecture with speaker ID

# Step 1: Precise detects the wake word
if wake_word_detected():
    # Step 2: Capture a voice sample
    voice_sample = record_audio(duration=3)

    # Step 3: Speaker identification
    speaker = identify_speaker(voice_sample)
    # Uses voice embeddings / a neural network

    # Step 4: Process with user context
    process_command(voice_sample, user=speaker)
```

**Implementation Options:**

#### Option A: Use resemblyzer (Voice Embeddings)

```bash
pip install resemblyzer --break-system-packages

# Enrollment phase
python enroll_users.py
# Each user records 10-20 seconds of speech
# System creates a voice profile (embedding)

# Runtime
python speaker_id.py
# Compares incoming audio to stored embeddings
# Returns the most likely speaker
```

**Example Code:**

```python
import os

import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

# Initialize encoder
encoder = VoiceEncoder()

# Enrollment - do once per user
def enroll_user(name, audio_files):
    """Create a voice profile for a user"""
    embeddings = []
    for audio_file in audio_files:
        wav = preprocess_wav(audio_file)
        embedding = encoder.embed_utterance(wav)
        embeddings.append(embedding)

    # Average embeddings for robustness
    user_profile = np.mean(embeddings, axis=0)

    # Save profile
    np.save(f'profiles/{name}.npy', user_profile)
    return user_profile

# Identification - run each time
def identify_speaker(audio_file, profiles_dir='profiles'):
    """Identify which enrolled user is speaking"""
    wav = preprocess_wav(audio_file)
    test_embedding = encoder.embed_utterance(wav)

    # Load all profiles
    profiles = {}
    for profile_file in os.listdir(profiles_dir):
        name = profile_file.replace('.npy', '')
        profile = np.load(os.path.join(profiles_dir, profile_file))
        profiles[name] = profile

    # Calculate similarity to each profile
    similarities = {}
    for name, profile in profiles.items():
        similarity = np.dot(test_embedding, profile)
        similarities[name] = similarity

    # Return the most similar
    best_match = max(similarities, key=similarities.get)
    confidence = similarities[best_match]

    if confidence > 0.7:  # Threshold
        return best_match
    else:
        return "unknown"
```

#### Option B: Use pyannote.audio (Production-grade)

```bash
pip install pyannote.audio --break-system-packages
# Requires a HuggingFace token (same as diarization)
```

**Example:**

```python
from pyannote.audio import Inference
from scipy.spatial.distance import cosine

# Initialize
inference = Inference(
    "pyannote/embedding",
    use_auth_token="your_hf_token"
)

# Enroll users
alice_profile = inference("alice_sample.wav")
bob_profile = inference("bob_sample.wav")

# Identify
test_embedding = inference("test_audio.wav")

# Compare
alice_similarity = 1 - cosine(test_embedding, alice_profile)
bob_similarity = 1 - cosine(test_embedding, bob_profile)

if alice_similarity > bob_similarity and alice_similarity > 0.7:
    speaker = "Alice"
elif bob_similarity > 0.7:
    speaker = "Bob"
else:
    speaker = "Unknown"
```

**Pros:**

- ✅ Can identify individual users
- ✅ Personalized responses
- ✅ User-specific commands/permissions
- ✅ Better for privacy (know who's speaking)

**Cons:**

- ❌ More complex implementation
- ❌ Requires an enrollment phase
- ❌ Additional processing time (~100-200ms)
- ❌ May fail with similar voices

### Approach 3: Per-User Wake Word Models

Each person has their OWN wake word:

```bash
# Alice's wake word: "Hey Mycroft"
# Train on ONLY Alice's voice

# Bob's wake word: "Hey Jarvis"
# Train on ONLY Bob's voice

# Carol's wake word: "Hey Computer"
# Train on ONLY Carol's voice
```
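Since each model is trained on exactly one person, knowing *which* detector fired is itself the identification step. A minimal sketch of that lookup (the model file names are the hypothetical ones from this section, not anything voice_server.py defines):

```python
# Map each per-user wake word model to its owner
# (file names are illustrative, matching the sketch above)
WAKE_WORD_OWNERS = {
    'alice-wake.net': 'Alice',
    'bob-wake.net': 'Bob',
    'carol-wake.net': 'Carol',
}

def user_for_model(model_name):
    """Return the enrolled owner of a per-user wake word model."""
    return WAKE_WORD_OWNERS.get(model_name, 'unknown')
```

In the parallel-runner setup from Option 1, this lookup would run inside the activation callback, wherever the detector name is reported.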
**Deployment:** Run all 3 models in parallel (server-side):

```python
wake_word_configs = [
    {'name': 'Alice', 'wake_word': 'hey mycroft', 'model': 'alice-wake.net'},
    {'name': 'Bob', 'wake_word': 'hey jarvis', 'model': 'bob-wake.net'},
    {'name': 'Carol', 'wake_word': 'hey computer', 'model': 'carol-wake.net'}
]
```

**Pros:**

- ✅ Automatic user identification
- ✅ Highest accuracy per user
- ✅ Clear user separation
- ✅ No additional speaker ID needed

**Cons:**

- ❌ Requires 3x models (server only)
- ❌ Users must remember their wake word
- ❌ 3x CPU usage (~15-30%)
- ❌ Doesn't work on edge (K210)

### Approach 4: Context-Based Adaptation

No speaker ID, but learn from interaction:

```python
# Track command patterns
user_context = {
    'last_command': 'turn on living room lights',
    'frequent_entities': ['light.living_room', 'light.bedroom'],
    'time_of_day_patterns': {'morning': 'coffee maker', 'evening': 'tv'},
    'location': 'home'  # vs 'away'
}

# Use context to improve intent recognition (illustrative)
if command == "turn on the lights" and is_morning():
    # Probably means the most frequently used lights (based on history)
    entity = user_context['frequent_entities'][0]
```

**Pros:**

- ✅ No enrollment needed
- ✅ Improves over time
- ✅ Simple to implement
- ✅ Works with any number of users

**Cons:**

- ❌ No true user identification
- ❌ May make incorrect assumptions
- ❌ Privacy concerns (tracking behavior)

## Recommended Strategy

### For Your Use Case

Based on your home lab setup, I recommend:

#### Phase 1: Single Wake Word, Inclusive Training (Week 1-2)

```bash
# Start simple
cd ~/precise-models/hey-computer
conda activate precise

# Have all family members record samples
# Alice: 30 samples of "Hey Computer"
# Bob: 30 samples of "Hey Computer"
# You: 30 samples of "Hey Computer"

# Train a single model on all voices
precise-train -e 60 hey-computer.net .
# Deploy to server
python voice_server.py \
    --enable-precise \
    --precise-model hey-computer.net
```

**Why:**

- Simple to set up and test
- Everyone can use it immediately
- Single model = easier debugging
- Works on edge if you migrate later

#### Phase 2: Add Speaker Identification (Week 3-4)

```bash
# Install resemblyzer
pip install resemblyzer --break-system-packages

# Enroll users
python enroll_users.py
# Each person speaks for 20 seconds

# Update voice_server.py to identify the speaker
# Use speaker ID for personalized responses
```

**Why:**

- Enables personalization
- Can track preferences per user
- User-specific command permissions
- Better privacy (know who's speaking)

#### Phase 3: Multiple Wake Words (Month 2+)

```bash
# Add alternative wake words for different contexts
# "Hey Mycroft" - General commands
# "Hey Jarvis" - Media/Plex control
# "Computer" - Quick commands (lights, temp)

# Deploy multiple models on the server
python voice_server.py \
    --enable-precise \
    --precise-models "mycroft:hey-mycroft.net:0.5,jarvis:hey-jarvis.net:0.5"
```

**Why:**

- Different wake words for different contexts
- Reduces false positives (more specific triggers)
- Fun factor (Jarvis for media!)
- Server can handle 2-3 easily

## Implementation Guide: Multiple Wake Words

### Update voice_server.py for Multiple Wake Words

```python
# Add to voice_server.py
# (DEFAULT_PRECISE_ENGINE and wake_word_queue are defined elsewhere in the file)

def start_multiple_wake_words(configs):
    """
    Start multiple wake word detectors

    Args:
        configs: List of dicts with 'name', 'model_path', 'sensitivity'
    """
    global precise_runners
    precise_runners = {}

    for config in configs:
        try:
            engine = PreciseEngine(
                DEFAULT_PRECISE_ENGINE,
                config['model_path']
            )

            def make_callback(wake_word_name):
                def callback():
                    print(f"Wake word detected: {wake_word_name}")
                    wake_word_queue.put({
                        'wake_word': wake_word_name,
                        'timestamp': time.time(),
                        'source': 'precise'
                    })
                return callback

            runner = PreciseRunner(
                engine,
                sensitivity=config['sensitivity'],
                on_activation=make_callback(config['name'])
            )
            runner.start()
            precise_runners[config['name']] = runner
            print(f"✓ Started: {config['name']} (sensitivity: {config['sensitivity']})")
        except Exception as e:
            print(f"✗ Failed to start {config['name']}: {e}")

    return len(precise_runners) > 0

# Add to main()
parser.add_argument('--precise-models',
                    help='Multiple models: name:path:sensitivity,name2:path2:sensitivity2')

# Parse multiple models
if args.precise_models:
    configs = []
    for model_spec in args.precise_models.split(','):
        name, path, sensitivity = model_spec.split(':')
        configs.append({
            'name': name,
            'model_path': os.path.expanduser(path),
            'sensitivity': float(sensitivity)
        })
    start_multiple_wake_words(configs)
```

### Usage Example

```bash
cd ~/voice-assistant

# Start with multiple wake words
python voice_server.py \
    --enable-precise \
    --precise-models "\
hey-mycroft:~/precise-models/pretrained/hey-mycroft.net:0.5,\
hey-jarvis:~/precise-models/pretrained/hey-jarvis.net:0.5"
```

## Implementation Guide: Speaker Identification

### Add to voice_server.py

```python
# Add resemblyzer support
try:
    from resemblyzer import VoiceEncoder, preprocess_wav
    import numpy as np
    SPEAKER_ID_AVAILABLE = True
except ImportError:
    SPEAKER_ID_AVAILABLE = False
    print("Warning: resemblyzer not available. Speaker ID disabled.")

# Initialize encoder
voice_encoder = None
speaker_profiles = {}

def load_speaker_profiles(profiles_dir='~/voice-assistant/profiles'):
    """Load enrolled speaker profiles"""
    global speaker_profiles, voice_encoder

    if not SPEAKER_ID_AVAILABLE:
        return False

    profiles_dir = os.path.expanduser(profiles_dir)
    if not os.path.exists(profiles_dir):
        print(f"No speaker profiles found at {profiles_dir}")
        return False

    # Initialize encoder
    voice_encoder = VoiceEncoder()

    # Load all profiles
    for profile_file in os.listdir(profiles_dir):
        if profile_file.endswith('.npy'):
            name = profile_file.replace('.npy', '')
            profile = np.load(os.path.join(profiles_dir, profile_file))
            speaker_profiles[name] = profile
            print(f"Loaded speaker profile: {name}")

    return len(speaker_profiles) > 0

def identify_speaker(audio_path, threshold=0.7):
    """Identify the speaker from an audio file"""
    if not SPEAKER_ID_AVAILABLE or not speaker_profiles:
        return None

    try:
        # Get embedding for the test audio
        wav = preprocess_wav(audio_path)
        test_embedding = voice_encoder.embed_utterance(wav)

        # Compare to all profiles
        similarities = {}
        for name, profile in speaker_profiles.items():
            similarity = np.dot(test_embedding, profile)
            similarities[name] = similarity

        # Get the best match
        best_match = max(similarities, key=similarities.get)
        confidence = similarities[best_match]

        print(f"Speaker ID: {best_match} (confidence: {confidence:.2f})")

        if confidence > threshold:
            return best_match
        else:
            return "unknown"
    except Exception as e:
        print(f"Error identifying speaker: {e}")
        return None

# Update the process endpoint to include speaker ID
@app.route('/process', methods=['POST'])
def process():
    """Process a complete voice command with speaker identification"""
    # ... existing code ...

    # Add speaker identification
    speaker = identify_speaker(temp_path) if speaker_profiles else None
    if speaker:
        print(f"Detected speaker: {speaker}")
        # Could personalize the response based on the speaker

    # ... rest of processing ...
```

### Enrollment Script

Create `enroll_speaker.py`:

```python
#!/usr/bin/env python3
"""
Enroll users for speaker identification

Usage:
    python enroll_speaker.py --name Alice --audio alice_sample.wav
    python enroll_speaker.py --name Alice --duration 20  # Record live
"""

import argparse
import os
import sys
import wave

import numpy as np
import pyaudio
from resemblyzer import VoiceEncoder, preprocess_wav

def record_audio(duration=20, sample_rate=16000):
    """Record audio from the microphone"""
    print(f"Recording for {duration} seconds...")
    print("Speak naturally - read a paragraph, have a conversation, etc.")

    chunk = 1024
    sample_format = pyaudio.paInt16
    channels = 1

    p = pyaudio.PyAudio()
    stream = p.open(
        format=sample_format,
        channels=channels,
        rate=sample_rate,
        input=True,
        frames_per_buffer=chunk
    )

    frames = []
    for _ in range(0, int(sample_rate / chunk * duration)):
        data = stream.read(chunk)
        frames.append(data)

    stream.stop_stream()
    stream.close()
    p.terminate()

    # Save to a temp file
    temp_file = f"/tmp/enrollment_{os.getpid()}.wav"
    wf = wave.open(temp_file, 'wb')
    wf.setnchannels(channels)
    wf.setsampwidth(p.get_sample_size(sample_format))
    wf.setframerate(sample_rate)
    wf.writeframes(b''.join(frames))
    wf.close()

    return temp_file

def enroll_speaker(name, audio_file, profiles_dir='~/voice-assistant/profiles'):
    """Create a voice profile for a speaker"""
    profiles_dir = os.path.expanduser(profiles_dir)
    os.makedirs(profiles_dir, exist_ok=True)

    # Initialize encoder
    encoder = VoiceEncoder()

    # Process audio
    wav = preprocess_wav(audio_file)
    embedding = encoder.embed_utterance(wav)

    # Save profile
    profile_path = os.path.join(profiles_dir, f'{name}.npy')
    np.save(profile_path, embedding)

    print(f"✓ Enrolled speaker: {name}")
    print(f"  Profile saved to: {profile_path}")

    return profile_path

def main():
    parser = argparse.ArgumentParser(description="Enroll speaker for voice identification")
    parser.add_argument('--name', required=True, help='Speaker name')
    parser.add_argument('--audio', help='Path to audio file (wav)')
    parser.add_argument('--duration', type=int, default=20,
                        help='Recording duration if not using an audio file')
    parser.add_argument('--profiles-dir', default='~/voice-assistant/profiles',
                        help='Directory to save profiles')
    args = parser.parse_args()

    # Get the audio file
    if args.audio:
        audio_file = args.audio
        if not os.path.exists(audio_file):
            print(f"Error: Audio file not found: {audio_file}")
            return 1
    else:
        audio_file = record_audio(args.duration)

    # Enroll the speaker
    try:
        enroll_speaker(args.name, audio_file, args.profiles_dir)
        return 0
    except Exception as e:
        print(f"Error enrolling speaker: {e}")
        return 1

if __name__ == '__main__':
    sys.exit(main())
```

## Performance Comparison

### Single Wake Word

- **Latency:** 100-200ms
- **CPU:** ~5-10% (idle)
- **Memory:** ~100MB
- **Accuracy:** 95%+

### Multiple Wake Words (3 models)

- **Latency:** 100-200ms (parallel)
- **CPU:** ~15-30% (idle)
- **Memory:** ~300MB
- **Accuracy:** 95%+ each

### With Speaker Identification

- **Additional latency:** +100-200ms
- **Additional CPU:** +5% during ID
- **Additional memory:** +50MB
- **Accuracy:** 85-95% (depending on enrollment quality)

## Best Practices

### Wake Word Selection

1. **Different enough** - "Hey Mycroft" vs "Hey Jarvis" (not "Hey Alice" vs "Hey Alex")
2. **Clear consonants** - Easier to detect
3. **2-3 syllables** - Not too short, not too long
4. **Test in your environment** - Check for false triggers

### Training

1. **Include all users** - If using a single model
2. **Diverse conditions** - Different rooms, noise levels
3. **Regular updates** - Add false positives weekly
4. **Per-user models** - Higher accuracy, more compute

### Speaker Identification

1. **Quality enrollment** - 20+ seconds of clear speech
2. **Re-enroll periodically** - Voices change (colds, etc.)
3. **Test thresholds** - Balance accuracy vs false IDs
4. **Graceful fallback** - Handle unknown speakers

## Recommended Path for You

```bash
# Week 1: Start with pre-trained "Hey Mycroft"
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
precise-listen hey-mycroft.net
# Test it!

# Week 2: Fine-tune with your voices
precise-train -e 30 hey-mycroft-custom.net . \
    --from-checkpoint hey-mycroft.net

# Week 3: Add speaker identification
pip install resemblyzer
python enroll_speaker.py --name Alan --duration 20
python enroll_speaker.py --name [Family Member] --duration 20

# Week 4: Add a second wake word ("Hey Jarvis" for Plex?)
wget hey-jarvis.tar.gz
# Run both in parallel

# Month 2+: Optimize and expand
# - More wake words for different contexts
# - Per-user wake word models
# - Context-aware responses
```

This gives you a smooth progression from simple to advanced!
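One closing note on thresholds: resemblyzer's utterance embeddings are L2-normalized, so the raw dot product used in `identify_speaker()` behaves like cosine similarity. The 0.7-threshold logic can be sanity-checked with synthetic unit vectors, no audio or resemblyzer install required (a sketch; the function names here are illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity; equals the plain dot product when both vectors are unit-length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify(embedding, profiles, threshold=0.7):
    """Return the best-matching profile name, or 'unknown' if below the threshold."""
    sims = {name: cosine_similarity(embedding, vec) for name, vec in profiles.items()}
    best = max(sims, key=sims.get)
    return best if sims[best] > threshold else 'unknown'

# Synthetic unit-vector "profiles"
profiles = {'alice': [1.0, 0.0, 0.0], 'bob': [0.0, 1.0, 0.0]}

print(classify([0.9, 0.1, 0.0], profiles))   # clearly closest to alice -> alice
print(classify([0.0, 0.0, 1.0], profiles))   # orthogonal to both -> unknown
```

Raising the threshold trades false accepts for more "unknown" results, which is exactly the graceful-fallback behavior recommended above.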