Ports prior voice assistant research and prototypes from devl/Devops into the Minerva repo. Includes:
- docs/: architecture, wake word guides, ESP32-S3 spec, hardware buying guide
- scripts/: voice_server.py, voice_server_enhanced.py, setup scripts
- hardware/maixduino/: edge device scripts with WiFi credentials scrubbed (replaced hardcoded password with secrets.py pattern)
- config/.env.example: server config template
- .gitignore: excludes .env, secrets.py, model blobs, ELF firmware
- CLAUDE.md: Minerva product context and connection to cf-voice roadmap
Your Questions Answered - Quick Reference
TL;DR: Yes, Yes, and Multiple Options!
Q1: Pre-trained "Hey Mycroft" Model?
Answer: YES! ✅
Download and use immediately:
./quick_start_hey_mycroft.sh
# Done in 5 minutes - no training!
The pre-trained model works great and saves you 1-2 hours of training time.
Q2: Multiple Wake Words?
Answer: YES! ✅ (with considerations)
Server-side (Heimdall): Easy, run 3-5 wake words
python voice_server_enhanced.py \
--enable-precise \
--multi-wake-word
Edge (K210): Feasible for 1-2, challenging for 3+
Q3: Adopting New Users' Voices?
Answer: Multiple approaches ✅
- Best option: Train one model with everyone's voices upfront
- Alternative: Incremental retraining as new users join
- Advanced: Speaker identification with personalization
Detailed Answers
1. Pre-trained "Hey Mycroft" Model
Where to Get It
# Quick start script does this for you
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz
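Before pointing the server at the extracted files, it can help to confirm what the archive actually contained. The sketch below is a hypothetical helper (not part of Precise) that lists model files in a directory and reports whether each has a companion `.params` file next to it. The `.net` and `.pb` suffixes are assumptions about what the archive may hold, so adjust them to match your download.

```python
from pathlib import Path

# Assumption: model files end in one of these suffixes and each has a
# sibling "<model file name>.params" file. Adjust to what you downloaded.
MODEL_SUFFIXES = (".net", ".pb")


def find_precise_models(model_dir):
    """Return {model_path: has_params_file} for every model under model_dir."""
    found = {}
    for path in sorted(Path(model_dir).rglob("*")):
        if path.is_file() and path.suffix in MODEL_SUFFIXES:
            params = path.with_name(path.name + ".params")
            found[path] = params.is_file()
    return found


if __name__ == "__main__":
    models = find_precise_models(Path.home() / "precise-models" / "pretrained")
    for model, has_params in models.items():
        status = "ok" if has_params else "missing .params file"
        print(f"{model}: {status}")
```

If nothing is printed, the archive was probably extracted somewhere else, or it uses a different file layout than assumed here.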
How to Use
Instant deployment:
python voice_server.py \
--enable-precise \
--precise-model ~/precise-models/pretrained/hey-mycroft.net
Fine-tune with your voice:
# Record 20-30 samples of your voice saying "Hey Mycroft"
precise-collect
# Fine-tune from pre-trained
precise-train -e 30 my-hey-mycroft.net . \
--from-checkpoint ~/precise-models/pretrained/hey-mycroft.net
Advantages
✅ Zero training time - Works immediately
✅ Proven accuracy - Tested by thousands
✅ Good baseline - Already includes diverse voices
✅ Easy fine-tuning - Add your voice in 30 mins vs 60+ mins from scratch
When to Use Pre-trained vs Custom
Use Pre-trained "Hey Mycroft" when:
- You want to test quickly
- "Hey Mycroft" is an acceptable wake word
- You want proven accuracy out-of-box
Train Custom when:
- You want a different wake word ("Hey Computer", "Jarvis", etc.)
- Maximum accuracy for your specific environment
- Family-specific wake word
Hybrid (Recommended):
- Start with pre-trained "Hey Mycroft"
- Test and learn the system
- Fine-tune with your samples
- Or add custom wake word later
2. Multiple Wake Words
Can You Have Multiple?
Yes! Options:
Option A: Server-Side (Recommended)
Easy implementation:
# Use the enhanced server
python voice_server_enhanced.py \
--enable-precise \
--multi-wake-word
Configured wake words:
- "Hey Mycroft" (pre-trained)
- "Hey Computer" (custom)
- "Jarvis" (custom)
Resource impact:
- 3 models = ~15-30% CPU (Heimdall handles easily)
- ~300-600MB RAM
- Each model runs independently
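Since each model runs independently, the server has to decide what to do when more than one fires on the same audio. One simple policy is to score the chunk with every detector and keep the highest score above a threshold. This is a minimal sketch of that idea, not the code in voice_server_enhanced.py. The `Detector` callables, the `detect_any` name, and the 0.5 threshold are all assumptions for illustration.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical detector type: takes one audio chunk and returns a
# confidence score between 0.0 and 1.0.
Detector = Callable[[bytes], float]


def detect_any(
    chunk: bytes,
    detectors: Dict[str, Detector],
    threshold: float = 0.5,
) -> Optional[Tuple[str, float]]:
    """Run every detector on the chunk; return the best (name, score) or None."""
    best = None
    for name, detector in detectors.items():
        score = detector(chunk)
        # Keep a hit only if it clears the threshold and beats the current best.
        if score >= threshold and (best is None or score > best[1]):
            best = (name, score)
    return best


if __name__ == "__main__":
    # Stand-in detectors with fixed scores, just to show the call shape.
    fake = {
        "hey_mycroft": lambda chunk: 0.91,
        "hey_computer": lambda chunk: 0.12,
        "jarvis": lambda chunk: 0.64,
    }
    print(detect_any(b"\x00" * 2048, fake))
```

CPU cost grows roughly linearly with the number of detectors because every chunk is scored by every model.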
Example use cases:
"Hey Mycroft, what's the time?" → General assistant
"Jarvis, run diagnostics" → Personal assistant mode
"Emergency, call help" → Priority/emergency mode
Option B: Edge (K210)
Feasible for 1-2 wake words:
# Sequential checking: return the first model that fires, or None
def check_wake_words(models=('hey-mycroft.kmodel', 'emergency.kmodel')):
    for model in models:
        if detect_wake_word(model):
            return model
    return None
Limitations:
- +50-100ms latency per additional model
- Memory constraints (6MB total for all models)
- More models = more power consumption
Recommendation:
- K210: 1 wake word (optimal)
- K210: 2 wake words (acceptable)
- K210: 3+ wake words (not recommended)
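The numbers above can be turned into a quick sanity check before flashing models to the device. This sketch uses only the figures stated in this section (50-100ms per additional model, 6MB total for all models). The `k210_plan` name is hypothetical and the model sizes passed in are made-up placeholders, so measure your own .kmodel files.

```python
EXTRA_LATENCY_MS = (50, 100)   # per additional model, from the limitations above
MODEL_MEMORY_BUDGET_MB = 6     # total for all models, from the limitations above


def k210_plan(model_sizes_mb):
    """Rate a set of wake word models for the K210 using the guidance above."""
    count = len(model_sizes_mb)
    if count == 0:
        raise ValueError("need at least one model")
    if count == 1:
        rating = "optimal"
    elif count == 2:
        rating = "acceptable"
    else:
        rating = "not recommended"
    extra = count - 1  # only models beyond the first add latency
    return {
        "rating": rating,
        "added_latency_ms": (extra * EXTRA_LATENCY_MS[0], extra * EXTRA_LATENCY_MS[1]),
        "fits_in_memory": sum(model_sizes_mb) <= MODEL_MEMORY_BUDGET_MB,
    }


if __name__ == "__main__":
    # Placeholder sizes in MB; replace with the real sizes of your models.
    print(k210_plan([2.0, 2.5]))
```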
Option C: Contextual Wake Words
Different wake words for different purposes:
wake_word_contexts = {
    'hey_mycroft': 'general_assistant',
    'emergency': 'priority_emergency',
    'goodnight': 'bedtime_routine',
}
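A mapping like this is typically consulted right after detection to pick the handling mode. The sketch below shows one way to do the lookup, with name normalization and a fallback. The `context_for` function and the choice of `general_assistant` as the default are assumptions, not something the server is known to do.

```python
WAKE_WORD_CONTEXTS = {
    'hey_mycroft': 'general_assistant',
    'emergency': 'priority_emergency',
    'goodnight': 'bedtime_routine',
}

# Assumption: unknown wake words fall back to the general assistant.
DEFAULT_CONTEXT = 'general_assistant'


def context_for(wake_word: str) -> str:
    """Map a detected wake word (any casing or spacing) to its context."""
    key = wake_word.strip().lower().replace(' ', '_').replace('-', '_')
    return WAKE_WORD_CONTEXTS.get(key, DEFAULT_CONTEXT)


if __name__ == "__main__":
    for spoken in ("Hey Mycroft", "emergency", "Goodnight", "jarvis"):
        print(spoken, "->", context_for(spoken))
```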
Should You Use Multiple?
One wake word is usually enough!
Commercial products (Alexa, Google) typically listen for a single wake phrase at a time, and they work fine.
Use multiple when:
- Different family members want different wake words
- You want context-specific behaviors (emergency vs. general)
- You enjoy the flexibility
Start with one, add more later if needed.
3. Adopting New Users' Voices
Challenge
Same wake word, different voices:
- Mom says "Hey Mycroft" (soprano)
- Dad says "Hey Mycroft" (bass)
- Kids say "Hey Mycroft" (high-pitched)
All need to work!
Solution 1: Diverse Training (Recommended)
During initial training, have everyone record samples:
cd ~/precise-models/family-hey-mycroft
# Session 1: Mom records 30 samples
precise-collect # Mom speaks "Hey Mycroft" 30 times
# Session 2: Dad records 30 samples
precise-collect # Dad speaks "Hey Mycroft" 30 times
# Session 3: Kids record 20 samples each
precise-collect # Kids speak "Hey Mycroft" 40 times total
# Train one model with all voices
precise-train -e 60 family-hey-mycroft.net .
# Deploy
python voice_server.py \
--enable-precise \
--precise-model family-hey-mycroft.net
Pros:
✅ One model works for everyone
✅ Simple deployment
✅ No switching needed
✅ Works from day one
Cons:
❌ Need everyone's time upfront
❌ Slightly lower per-person accuracy than individual models
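With several people recording into the same folder, it is easy to end up with one voice under-represented. This sketch counts samples per speaker before training. It assumes each recording was named with a speaker prefix such as `mom-00.wav` or `dad-07.wav`, which is a naming convention you would have to choose yourself when recording. The function names and the minimum of 20 are also illustrative.

```python
from collections import Counter
from pathlib import Path


def samples_per_speaker(wake_word_dir):
    """Count .wav files per speaker, assuming '<speaker>-<number>.wav' names."""
    counts = Counter()
    for wav in Path(wake_word_dir).glob("*.wav"):
        # "mom-00" -> "mom"; a name with no dash counts as its own speaker.
        speaker = wav.stem.rsplit("-", 1)[0]
        counts[speaker] += 1
    return counts


def under_recorded(counts, minimum=20):
    """Return the speakers who have fewer than `minimum` samples."""
    return sorted(name for name, n in counts.items() if n < minimum)


if __name__ == "__main__":
    counts = samples_per_speaker("wake-word")
    print(dict(counts))
    print("Need more samples from:", under_recorded(counts))
```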
Solution 2: Incremental Training
Start with one person, add others over time:
# Week 1: Train with Dad's voice
precise-train -e 60 hey-mycroft.net .
# Week 2: Mom wants to use it
# Collect Mom's samples
precise-collect # Mom records 20-30 samples
# Add to training set
cp mom-samples/* wake-word/
# Retrain from checkpoint (faster!)
precise-train -e 30 hey-mycroft.net . \
--from-checkpoint hey-mycroft.net
# Now works for both Dad and Mom!
# Week 3: Kids want in
# Repeat process...
Pros:
✅ Don't need everyone upfront
✅ Easy to add new users
✅ Model improves gradually
Cons:
❌ New users may have issues initially
❌ Requires periodic retraining
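One thing to watch when merging a new person's recordings: if their files share names with existing ones, a plain `cp` silently overwrites the older samples. The sketch below copies new samples into the training folder with a speaker prefix and never overwrites an existing file. `add_samples` is a hypothetical helper, and the folder names in the usage line follow the example above.

```python
import shutil
from pathlib import Path


def add_samples(src_dir, dest_dir, speaker):
    """Copy .wav files into dest_dir as '<speaker>-<name>', never overwriting."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for wav in sorted(Path(src_dir).glob("*.wav")):
        target = dest / f"{speaker}-{wav.name}"
        suffix_number = 1
        # If the name is taken, append -1, -2, ... until it is free.
        while target.exists():
            target = dest / f"{speaker}-{wav.stem}-{suffix_number}{wav.suffix}"
            suffix_number += 1
        shutil.copy2(wav, target)
        copied.append(target)
    return copied


if __name__ == "__main__":
    new_files = add_samples("mom-samples", "wake-word", "mom")
    print(f"Added {len(new_files)} samples")
```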
Solution 3: Speaker Identification (Advanced)
Identify who's speaking, use personalized model/settings:
# Install speaker ID
pip install pyannote.audio scipy --break-system-packages
# Use enhanced server
python voice_server_enhanced.py \
--enable-precise \
--enable-speaker-id \
--hf-token YOUR_HF_TOKEN
Enroll users:
# Record 30-second voice sample from each person
# POST to /speakers/enroll with audio + name
curl -F "name=alan" \
-F "audio=@alan_voice.wav" \
http://localhost:5000/speakers/enroll
curl -F "name=sarah" \
-F "audio=@sarah_voice.wav" \
http://localhost:5000/speakers/enroll
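Conceptually, enrollment stores one voice embedding (a fixed-length vector) per person, and identification compares a new embedding against the stored ones. The sketch below shows that comparison with cosine similarity in plain Python. It is a common approach and an assumption here, not a description of what voice_server_enhanced.py actually does. The vectors, the `identify` function, and the 0.7 threshold are made up for illustration, and real embeddings would come from the speaker ID model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify(embedding, enrolled, threshold=0.7):
    """Return the enrolled name most similar to `embedding`, or None if below threshold."""
    best_name, best_score = None, threshold
    for name, reference in enrolled.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real ones have hundreds of dimensions.
    enrolled = {"alan": [1.0, 0.0], "sarah": [0.0, 1.0]}
    print(identify([0.9, 0.1], enrolled))
```

Returning `None` for low-similarity voices lets the server treat unknown speakers as guests rather than guessing.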
Benefits:
# Different responses per user
if speaker == 'alan':
    turn_on('light.alan_office')
elif speaker == 'sarah':
    turn_on('light.sarah_office')
# Different permissions
if speaker == 'kids' and command.startswith('buy'):
    return "Sorry, kids can't make purchases"
Pros:
✅ Personalized responses
✅ User-specific settings
✅ Better accuracy (optimized per voice)
✅ Can track who said what
Cons:
❌ More complex
❌ Privacy considerations
❌ Additional CPU/RAM (~10% + 200MB)
❌ Requires voice enrollment
Solution 4: Pre-trained Model (Easiest)
"Hey Mycroft" already includes diverse voices!
# Just use it - already trained on many voices
./quick_start_hey_mycroft.sh
The community model was trained with:
- Male and female voices
- Different accents
- Different ages
- Various environments
It should work for most family members out-of-box!
Then fine-tune if needed.
Recommended Path for Your Situation
Scenario: Family of 3-4 People
Week 1: Quick Start
# Use pre-trained "Hey Mycroft"
./quick_start_hey_mycroft.sh
# Test with all family members
# Likely works for everyone already!
Week 2: Fine-tune if Needed
# If someone has issues:
# Have them record 20 samples
# Fine-tune the model
precise-train -e 30 family-hey-mycroft.net . \
--from-checkpoint ~/precise-models/pretrained/hey-mycroft.net
Week 3: Add Features
# If you want personalization:
python voice_server_enhanced.py \
--enable-speaker-id
# Enroll each family member
Scenario: Just You (or 1-2 People)
Option 1: Pre-trained
./quick_start_hey_mycroft.sh
# Done!
Option 2: Custom Wake Word
# Train custom "Hey Computer"
cd ~/precise-models/hey-computer
./1-record-wake-word.sh # 50 samples
./2-record-not-wake-word.sh # 200 samples
./3-train-model.sh
Scenario: Multiple People + Multiple Wake Words
Full setup:
# Pre-trained for family
./quick_start_hey_mycroft.sh
# Personal wake word for Dad
cd ~/precise-models/jarvis
# Train custom wake word
# Emergency wake word
cd ~/precise-models/emergency
# Train emergency wake word
# Run multi-wake-word server
python voice_server_enhanced.py \
--enable-precise \
--multi-wake-word \
--enable-speaker-id
Quick Decision Matrix
| Your Situation | Recommendation |
|---|---|
| Just getting started | Pre-trained "Hey Mycroft" |
| Want different wake word | Train custom model |
| Family of 3-4 | Pre-trained + fine-tune if needed |
| Want personalization | Add speaker ID |
| Multiple purposes | Multiple wake words (server-side) |
| Deploying to K210 | 1 wake word, no speaker ID |
Files to Use
Quick start with pre-trained:
- quick_start_hey_mycroft.sh - Zero training, 5 minutes!
Multiple wake words:
- voice_server_enhanced.py - Multi-wake-word + speaker ID support
Training custom:
- setup_precise.sh - Set up the training environment
- Scripts in ~/precise-models/your-wake-word/
Documentation:
- WAKE_WORD_ADVANCED.md - Detailed, comprehensive guide
- PRECISE_DEPLOYMENT.md - Production deployment
Summary
✅ Yes, pre-trained "Hey Mycroft" exists and works great
✅ Yes, you can have multiple wake words (server-side is easy)
✅ Yes, multiple approaches for multi-user support
Recommended approach:
- Start with ./quick_start_hey_mycroft.sh (5 mins)
- Test with all family members
- Fine-tune if anyone has issues
- Add speaker ID later if you want personalization
- Consider multiple wake words only if you have specific use cases
Keep it simple! One pre-trained wake word works for most people.
Next Actions
Ready to start?
# 5-minute quick start
./quick_start_hey_mycroft.sh
# Or read more first
cat WAKE_WORD_ADVANCED.md
Questions?
- Pre-trained models: See WAKE_WORD_ADVANCED.md § Pre-trained
- Multiple wake words: See WAKE_WORD_ADVANCED.md § Multiple Wake Words
- Voice adaptation: See WAKE_WORD_ADVANCED.md § Voice Adaptation
Happy voice assisting! 🎙️