# Wake Word Quick Reference Card

## 🎯 TL;DR: What Should I Do?

### Recommendation for Your Setup

**Week 1:** Use pre-trained "Hey Mycroft"

```bash
./download_pretrained_models.sh --model hey-mycroft
precise-listen ~/precise-models/pretrained/hey-mycroft.net
```

**Week 2-3:** Fine-tune with all family members' voices

```bash
cd ~/precise-models/hey-mycroft-family
precise-train -e 30 custom.net . --from-checkpoint ../pretrained/hey-mycroft.net
```

**Week 4+:** Add speaker identification

```bash
pip install resemblyzer
python enroll_speaker.py --name Alan --duration 20
python enroll_speaker.py --name [Family] --duration 20
```

**Month 2+:** Add second wake word (Hey Jarvis for Plex?)

```bash
./download_pretrained_models.sh --model hey-jarvis
# Run both in parallel on server
```
## 📋 Pre-trained Models

### Available Models (Ready to Use!)

| Wake Word | Download | Best For |
|---|---|---|
| Hey Mycroft ⭐ | `--model hey-mycroft` | Default choice, most data |
| Hey Jarvis | `--model hey-jarvis` | Pop culture, media control |
| Christopher | `--model christopher` | Unique, less common |
| Hey Ezra | `--model hey-ezra` | Alternative option |
### Quick Download

```bash
# Download one
./download_pretrained_models.sh --model hey-mycroft

# Download all
./download_pretrained_models.sh --test-all

# Test immediately
precise-listen ~/precise-models/pretrained/hey-mycroft.net
```
## 🔢 Multiple Wake Words

### Option 1: Multiple Models (Server-Side) ⭐ RECOMMENDED

**What:** Run 2-3 different wake word models simultaneously
**Where:** Heimdall (server)
**Performance:** ~15-30% CPU for 3 models

```bash
# Start with multiple wake words
python voice_server.py \
  --enable-precise \
  --precise-models "\
hey-mycroft:~/models/hey-mycroft.net:0.5,\
hey-jarvis:~/models/hey-jarvis.net:0.5"
```
Pros:
- ✅ Can identify which wake word was used
- ✅ Different contexts (Mycroft=commands, Jarvis=media)
- ✅ Easy to add/remove wake words
- ✅ Each can have different sensitivity
Cons:
- ❌ Only works server-side (not on Maix Duino)
- ❌ Higher CPU usage (but still reasonable)
Use When:
- You want different wake words for different purposes
- Server has CPU to spare (yours does!)
- Want flexibility to add wake words later
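Because this option reports which wake word fired, the server can route each one to a different context. A minimal sketch of that dispatch step, in plain Python: the handler names and the `on_detection` callback are hypothetical illustrations, not part of voice_server.py.

```python
# Hypothetical sketch: route each detected wake word to its own handler.
# The dispatch table and handler names are illustrative, not voice_server.py API.

def handle_commands(utterance_id: str) -> str:
    # "Hey Mycroft" context: general voice commands
    return f"commands pipeline for {utterance_id}"

def handle_media(utterance_id: str) -> str:
    # "Hey Jarvis" context: media control (e.g. Plex)
    return f"media pipeline for {utterance_id}"

HANDLERS = {
    "hey-mycroft": handle_commands,
    "hey-jarvis": handle_media,
}

def on_detection(wake_word: str, utterance_id: str) -> str:
    """Called when any model fires; dispatch by which wake word matched."""
    handler = HANDLERS.get(wake_word)
    if handler is None:
        raise KeyError(f"no handler registered for {wake_word!r}")
    return handler(utterance_id)

print(on_detection("hey-jarvis", "utt-42"))  # → media pipeline for utt-42
```

Adding a third wake word later is then just one more entry in the table, matching the "easy to add/remove" point above.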
### Option 2: Single Multi-Phrase Model (Edge-Compatible)

**What:** One model responds to multiple phrases
**Where:** Server OR Maix Duino
**Performance:** Same as single model

```bash
# Train on multiple phrases
cd ~/precise-models/multi-wake
# Record "Hey Mycroft" samples → wake-word/
# Record "Hey Computer" samples → wake-word/
# Record negatives → not-wake-word/
precise-train -e 60 multi-wake.net .
```
Pros:
- ✅ Single model = less compute
- ✅ Works on edge (K210)
- ✅ Simple deployment
Cons:
- ❌ Can't tell which wake word was used
- ❌ May reduce accuracy
- ❌ Higher false positive risk
Use When:
- Deploying to Maix Duino (edge)
- Want backup wake words
- Don't care which was used
## 👥 Multi-User Support

### Option 1: Inclusive Training ⭐ START HERE

**What:** One model, all voices
**How:** All family members record samples

```bash
cd ~/precise-models/family-wake
# Alice records 30 samples
# Bob records 30 samples
# You record 30 samples
precise-train -e 60 family-wake.net .
```
Pros:
- ✅ Everyone can use it
- ✅ Simple deployment
- ✅ Single model
Cons:
- ❌ Can't identify who spoke
- ❌ No personalization
Use When:
- Just getting started
- Don't need to know who spoke
- Want simplicity
### Option 2: Speaker Identification (Week 4+)

**What:** Detect wake word, then identify speaker
**How:** Voice embeddings (resemblyzer or pyannote)

```bash
# Install
pip install resemblyzer

# Enroll users
python enroll_speaker.py --name Alan --duration 20
python enroll_speaker.py --name Alice --duration 20
python enroll_speaker.py --name Bob --duration 20

# Server identifies speaker automatically
```
Pros:
- ✅ Personalized responses
- ✅ User-specific permissions
- ✅ Better privacy
- ✅ Track preferences
Cons:
- ❌ More complex
- ❌ Requires enrollment
- ❌ +100-200ms latency
- ❌ May fail with similar voices
Use When:
- Want personalization
- Need user-specific commands
- Ready for advanced features
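The matching step after enrollment can be sketched as follows: compare a new utterance's embedding against each enrolled speaker's embedding by cosine similarity, and accept the best match only above a threshold (which is what produces the "may fail with similar voices" failure mode). With resemblyzer the embeddings would come from `VoiceEncoder().embed_utterance()`; here they are toy vectors, and the 0.75 threshold is an assumption to tune, not a library default.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def identify(utterance_embedding, enrolled, threshold=0.75):
    """Return the best-matching enrolled speaker, or None if below threshold."""
    best_name, best_score = None, threshold
    for name, embedding in enrolled.items():
        score = cosine(utterance_embedding, embedding)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Toy enrolled embeddings (real ones are ~256-dim vectors from the encoder).
enrolled = {"Alan": [1.0, 0.0, 0.2], "Alice": [0.0, 1.0, 0.1]}
print(identify([0.9, 0.1, 0.2], enrolled))  # → Alan
print(identify([0.0, 0.0, 1.0], enrolled))  # → None (no confident match)
```

Returning `None` on a low score is the safe default: fall back to a non-personalized response rather than guess the wrong family member.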
### Option 3: Per-User Wake Words (Advanced)

**What:** Each person has their own wake word
**How:** Multiple models, one per person

```
# Alice: "Hey Mycroft"
# Bob: "Hey Jarvis"
# You: "Hey Computer"
# Run all 3 models in parallel
```
Pros:
- ✅ Automatic user ID
- ✅ Highest accuracy per user
- ✅ Clear separation
Cons:
- ❌ 3x models = 3x CPU
- ❌ Users must remember their word
- ❌ Server-only (not edge)
Use When:
- Need automatic user ID
- Have CPU to spare
- Users want their own wake word
## 🎯 Decision Tree

```
START: Want to use voice assistant
│
├─ Single user or don't care who spoke?
│   └─ Use: Inclusive Training (Option 1)
│       └─ Download: Hey Mycroft (pre-trained)
│
├─ Multiple users AND need to know who spoke?
│   └─ Use: Speaker Identification (Option 2)
│       └─ Start with: Hey Mycroft + resemblyzer
│
├─ Want different wake words for different purposes?
│   └─ Use: Multiple Models (Option 1)
│       └─ Download: Hey Mycroft + Hey Jarvis
│
└─ Deploying to Maix Duino (edge)?
    └─ Use: Single Multi-Phrase Model (Option 2)
        └─ Train: Custom model with 2-3 phrases
```
## 📊 Comparison Table
| Feature | Inclusive | Speaker ID | Per-User Wake | Multiple Wake |
|---|---|---|---|---|
| Setup Time | 2 hours | 4 hours | 6 hours | 3 hours |
| Complexity | ⭐ Easy | ⭐⭐⭐ Medium | ⭐⭐⭐⭐ Hard | ⭐⭐ Easy |
| CPU Usage | 5-10% | 10-15% | 15-30% | 15-30% |
| Latency | 100ms | 300ms | 100ms | 100ms |
| User ID | ❌ No | ✅ Yes | ✅ Yes | ❌ No |
| Edge Deploy | ✅ Yes | ⚠️ Maybe | ❌ No | ⚠️ Partial |
| Personalize | ❌ No | ✅ Yes | ✅ Yes | ⚠️ Partial |
## 🚀 Recommended Timeline

### Week 1: Get It Working

```bash
# Use pre-trained Hey Mycroft
./download_pretrained_models.sh --model hey-mycroft

# Test it
precise-listen ~/precise-models/pretrained/hey-mycroft.net

# Deploy to server
python voice_server.py --enable-precise \
  --precise-model ~/precise-models/pretrained/hey-mycroft.net
```

### Week 2-3: Make It Yours

```bash
# Fine-tune with your family's voices
cd ~/precise-models/hey-mycroft-family

# Have everyone record 20-30 samples
precise-collect  # Alice
precise-collect  # Bob
precise-collect  # You

# Train
precise-train -e 30 custom.net . \
  --from-checkpoint ../pretrained/hey-mycroft.net
```

### Week 4+: Add Intelligence

```bash
# Speaker identification
pip install resemblyzer
python enroll_speaker.py --name Alan --duration 20
python enroll_speaker.py --name Alice --duration 20

# Now server knows who's speaking!
```

### Month 2+: Expand Features

```bash
# Add second wake word for media control
./download_pretrained_models.sh --model hey-jarvis

# Run both: Mycroft for commands, Jarvis for Plex
python voice_server.py --enable-precise \
  --precise-models "mycroft:hey-mycroft.net:0.5,jarvis:hey-jarvis.net:0.5"
```
## 💡 Pro Tips
### Wake Word Selection
- ✅ DO: Choose clear, distinct wake words
- ✅ DO: Test in your environment
- ❌ DON'T: Use similar-sounding words
- ❌ DON'T: Use common phrases
### Training
- ✅ DO: Include all intended users
- ✅ DO: Record in various conditions
- ✅ DO: Add false positives to training
- ❌ DON'T: Rush the training process
### Deployment
- ✅ DO: Start simple (one wake word)
- ✅ DO: Test thoroughly before adding features
- ✅ DO: Monitor false positive rate
- ❌ DON'T: Deploy too many wake words at once
### Speaker ID
- ✅ DO: Use 20+ seconds for enrollment
- ✅ DO: Re-enroll if accuracy drops
- ✅ DO: Test threshold values
- ❌ DON'T: Expect 100% accuracy
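"Test threshold values" in practice means sweeping candidate thresholds over labeled trials and counting false accepts (impostor let through) versus false rejects (family member refused). A minimal sketch, with made-up similarity scores for illustration:

```python
# Hypothetical sketch: sweep speaker-ID thresholds over labeled trials.
# Each trial is (similarity_score, was_same_speaker); the scores are invented.
trials = [
    (0.92, True), (0.88, True), (0.71, True),    # genuine attempts
    (0.81, False), (0.60, False), (0.45, False), # impostor attempts
]

def error_counts(threshold, trials):
    # Accept when score >= threshold; count both kinds of mistakes.
    false_accepts = sum(1 for s, same in trials if not same and s >= threshold)
    false_rejects = sum(1 for s, same in trials if same and s < threshold)
    return false_accepts, false_rejects

for t in (0.65, 0.75, 0.85):
    fa, fr = error_counts(t, trials)
    print(f"threshold {t}: {fa} false accepts, {fr} false rejects")
```

Raising the threshold trades false accepts for false rejects; pick the value whose balance suits your household, and re-run the sweep after re-enrollment.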
## 🔧 Quick Commands

```bash
# Download pre-trained model
./download_pretrained_models.sh --model hey-mycroft

# Test model
precise-listen ~/precise-models/pretrained/hey-mycroft.net

# Fine-tune from pre-trained
precise-train -e 30 custom.net . \
  --from-checkpoint ~/precise-models/pretrained/hey-mycroft.net

# Enroll speaker
python enroll_speaker.py --name Alan --duration 20

# Start with single wake word
python voice_server.py --enable-precise \
  --precise-model hey-mycroft.net

# Start with multiple wake words
python voice_server.py --enable-precise \
  --precise-models "mycroft:hey-mycroft.net:0.5,jarvis:hey-jarvis.net:0.5"

# Check status
curl http://10.1.10.71:5000/wake-word/status

# Monitor detections
curl http://10.1.10.71:5000/wake-word/detections
```
## 📚 See Also
- Full guide: ADVANCED_WAKE_WORD_TOPICS.md
- Training: MYCROFT_PRECISE_GUIDE.md
- Deployment: PRECISE_DEPLOYMENT.md
- Getting started: QUICKSTART.md
## ❓ FAQ

**Q: Can I use "Hey Mycroft" right away?**
A: Yes! Download it with `./download_pretrained_models.sh --model hey-mycroft`

**Q: How many wake words can I run at once?**
A: 2-3 comfortably on the server. The Maix Duino can handle 1.

**Q: Can I train my own custom wake word?**
A: Yes! See MYCROFT_PRECISE_GUIDE.md Phase 2.

**Q: Does speaker ID work with multiple wake words?**
A: Yes! Wake word detected → speaker identified → personalized response.

**Q: Can I use this on Maix Duino?**
A: Run it server-side first (start here), then convert to KMODEL (advanced).

**Q: How accurate is speaker identification?**
A: 85-95% with good enrollment. Re-enroll if accuracy drops.

**Q: What if someone has a cold?**
A: Accuracy may drop temporarily; the system should recover when their voice returns to normal.

**Q: Can kids use it?**
A: Yes! Include their voices in training or enroll them separately.
**Quick Decision:** Start with pre-trained Hey Mycroft. Add features later!

```bash
./download_pretrained_models.sh --model hey-mycroft
precise-listen ~/precise-models/pretrained/hey-mycroft.net
# It just works! ✨
```