# Wake Word Quick Reference Card
## 🎯 TL;DR: What Should I Do?

### Recommendation for Your Setup

**Week 1:** Use pre-trained "Hey Mycroft"

```bash
./download_pretrained_models.sh --model hey-mycroft
precise-listen ~/precise-models/pretrained/hey-mycroft.net
```

**Week 2-3:** Fine-tune with all family members' voices

```bash
cd ~/precise-models/hey-mycroft-family
precise-train -e 30 custom.net . --from-checkpoint ../pretrained/hey-mycroft.net
```

**Week 4+:** Add speaker identification

```bash
pip install resemblyzer
python enroll_speaker.py --name Alan --duration 20
python enroll_speaker.py --name [Family] --duration 20
```

**Month 2+:** Add second wake word (Hey Jarvis for Plex?)

```bash
./download_pretrained_models.sh --model hey-jarvis
# Run both in parallel on server
```

---
## 📋 Pre-trained Models

### Available Models (Ready to Use!)

| Wake Word | Download | Best For |
|-----------|----------|----------|
| **Hey Mycroft** ⭐ | `--model hey-mycroft` | Default choice, most data |
| **Hey Jarvis** | `--model hey-jarvis` | Pop culture, media control |
| **Christopher** | `--model christopher` | Unique, less common |
| **Hey Ezra** | `--model hey-ezra` | Alternative option |

### Quick Download

```bash
# Download one
./download_pretrained_models.sh --model hey-mycroft

# Download and test all models
./download_pretrained_models.sh --test-all

# Test immediately
precise-listen ~/precise-models/pretrained/hey-mycroft.net
```

---
## 🔢 Multiple Wake Words

### Option 1: Multiple Models (Server-Side) ⭐ RECOMMENDED

**What:** Run 2-3 different wake word models simultaneously
**Where:** Heimdall (server)
**Performance:** ~15-30% CPU for 3 models

```bash
# Start with multiple wake words
python voice_server.py \
--enable-precise \
--precise-models "\
hey-mycroft:~/models/hey-mycroft.net:0.5,\
hey-jarvis:~/models/hey-jarvis.net:0.5"
```

**Pros:**
- ✅ Can identify which wake word was used
- ✅ Different contexts (Mycroft=commands, Jarvis=media)
- ✅ Easy to add/remove wake words
- ✅ Each can have different sensitivity

**Cons:**
- ❌ Only works server-side (not on Maix Duino)
- ❌ Higher CPU usage (but still reasonable)

**Use When:**
- You want different wake words for different purposes
- Server has CPU to spare (yours does!)
- Want flexibility to add wake words later
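The `name:path:sensitivity` entries passed to `--precise-models` are easy to split apart. A minimal sketch of parsing that format; the actual parser inside `voice_server.py` may differ, so treat this as an illustration of the spec string only:

```python
from pathlib import Path

def parse_model_specs(spec: str) -> dict:
    """Parse 'name:path:sensitivity,...' into {name: (path, sensitivity)}."""
    models = {}
    for entry in spec.split(","):
        name, path, sensitivity = entry.strip().split(":")
        models[name] = (Path(path).expanduser(), float(sensitivity))
    return models

specs = parse_model_specs(
    "hey-mycroft:~/models/hey-mycroft.net:0.5,"
    "hey-jarvis:~/models/hey-jarvis.net:0.5"
)
print(sorted(specs))  # ['hey-jarvis', 'hey-mycroft']
```

Each wake word keeps its own sensitivity, which is what lets you tune Mycroft and Jarvis independently.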
### Option 2: Single Multi-Phrase Model (Edge-Compatible)

**What:** One model responds to multiple phrases
**Where:** Server OR Maix Duino
**Performance:** Same as single model

```bash
# Train on multiple phrases
cd ~/precise-models/multi-wake
# Record "Hey Mycroft" samples → wake-word/
# Record "Hey Computer" samples → wake-word/
# Record negatives → not-wake-word/
precise-train -e 60 multi-wake.net .
```
**Pros:**
- ✅ Single model = less compute
- ✅ Works on edge (K210)
- ✅ Simple deployment

**Cons:**
- ❌ Can't tell which wake word was used
- ❌ May reduce accuracy
- ❌ Higher false positive risk

**Use When:**
- Deploying to Maix Duino (edge)
- Want backup wake words
- Don't care which was used

---
## 👥 Multi-User Support

### Option 1: Inclusive Training ⭐ START HERE

**What:** One model, all voices
**How:** All family members record samples

```bash
cd ~/precise-models/family-wake
# Alice records 30 samples
# Bob records 30 samples
# You record 30 samples
precise-train -e 60 family-wake.net .
```

**Pros:**
- ✅ Everyone can use it
- ✅ Simple deployment
- ✅ Single model

**Cons:**
- ❌ Can't identify who spoke
- ❌ No personalization

**Use When:**
- Just getting started
- Don't need to know who spoke
- Want simplicity
### Option 2: Speaker Identification (Week 4+)

**What:** Detect wake word, then identify speaker
**How:** Voice embeddings (resemblyzer or pyannote)

```bash
# Install
pip install resemblyzer

# Enroll users
python enroll_speaker.py --name Alan --duration 20
python enroll_speaker.py --name Alice --duration 20
python enroll_speaker.py --name Bob --duration 20

# Server identifies speaker automatically
```

**Pros:**
- ✅ Personalized responses
- ✅ User-specific permissions
- ✅ Better privacy
- ✅ Track preferences

**Cons:**
- ❌ More complex
- ❌ Requires enrollment
- ❌ +100-200ms latency
- ❌ May fail with similar voices

**Use When:**
- Want personalization
- Need user-specific commands
- Ready for advanced features
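Behind the scenes, speaker ID usually boils down to comparing a fresh voice embedding against each enrolled speaker's stored embedding by cosine similarity. A minimal pure-Python sketch, assuming embeddings are already extracted (resemblyzer produces fixed-length vectors); the 0.75 threshold and toy vectors are illustrative:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def identify_speaker(embedding, enrolled, threshold=0.75):
    """Return the best-matching enrolled name, or None if nobody clears the threshold."""
    best_name, best_score = None, threshold
    for name, ref in enrolled.items():
        score = cosine(embedding, ref)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Toy 3-D "embeddings" for illustration (real ones are much longer)
enrolled = {"Alan": [1.0, 0.1, 0.0], "Alice": [0.0, 1.0, 0.2]}
print(identify_speaker([0.9, 0.2, 0.0], enrolled))  # Alan
```

Returning `None` below the threshold is what makes "may fail with similar voices" a soft failure: the server can fall back to an unpersonalized response instead of guessing.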
### Option 3: Per-User Wake Words (Advanced)

**What:** Each person has their own wake word
**How:** Multiple models, one per person

```bash
# Alice: "Hey Mycroft"
# Bob: "Hey Jarvis"
# You: "Hey Computer"

# Run all 3 models in parallel
```

**Pros:**
- ✅ Automatic user ID
- ✅ Highest accuracy per user
- ✅ Clear separation

**Cons:**
- ❌ 3x models = 3x CPU
- ❌ Users must remember their word
- ❌ Server-only (not edge)

**Use When:**
- Need automatic user ID
- Have CPU to spare
- Users want their own wake word

---
## 🎯 Decision Tree

```
START: Want to use voice assistant
│
├─ Single user or don't care who spoke?
│  └─ Use: Inclusive Training (Option 1)
│     └─ Download: Hey Mycroft (pre-trained)
│
├─ Multiple users AND need to know who spoke?
│  └─ Use: Speaker Identification (Option 2)
│     └─ Start with: Hey Mycroft + resemblyzer
│
├─ Want different wake words for different purposes?
│  └─ Use: Multiple Models (Option 1)
│     └─ Download: Hey Mycroft + Hey Jarvis
│
└─ Deploying to Maix Duino (edge)?
   └─ Use: Single Multi-Phrase Model (Option 2)
      └─ Train: Custom model with 2-3 phrases
```

---
## 📊 Comparison Table

| Feature | Inclusive | Speaker ID | Per-User Wake | Multiple Wake |
|---------|-----------|------------|---------------|---------------|
| **Setup Time** | 2 hours | 4 hours | 6 hours | 3 hours |
| **Complexity** | ⭐ Easy | ⭐⭐⭐ Medium | ⭐⭐⭐⭐ Hard | ⭐⭐ Easy |
| **CPU Usage** | 5-10% | 10-15% | 15-30% | 15-30% |
| **Latency** | 100ms | 300ms | 100ms | 100ms |
| **User ID** | ❌ No | ✅ Yes | ✅ Yes | ❌ No |
| **Edge Deploy** | ✅ Yes | ⚠️ Maybe | ❌ No | ⚠️ Partial |
| **Personalize** | ❌ No | ✅ Yes | ✅ Yes | ⚠️ Partial |

---
## 🚀 Recommended Timeline

### Week 1: Get It Working
```bash
# Use pre-trained Hey Mycroft
./download_pretrained_models.sh --model hey-mycroft

# Test it
precise-listen ~/precise-models/pretrained/hey-mycroft.net

# Deploy to server
python voice_server.py --enable-precise \
--precise-model ~/precise-models/pretrained/hey-mycroft.net
```

### Week 2-3: Make It Yours
```bash
# Fine-tune with your family's voices
cd ~/precise-models/hey-mycroft-family

# Have everyone record 20-30 samples
precise-collect  # Alice
precise-collect  # Bob
precise-collect  # You

# Train
precise-train -e 30 custom.net . \
--from-checkpoint ../pretrained/hey-mycroft.net
```

### Week 4+: Add Intelligence
```bash
# Speaker identification
pip install resemblyzer
python enroll_speaker.py --name Alan --duration 20
python enroll_speaker.py --name Alice --duration 20

# Now server knows who's speaking!
```

### Month 2+: Expand Features
```bash
# Add second wake word for media control
./download_pretrained_models.sh --model hey-jarvis

# Run both: Mycroft for commands, Jarvis for Plex
python voice_server.py --enable-precise \
--precise-models "mycroft:hey-mycroft.net:0.5,jarvis:hey-jarvis.net:0.5"
```
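Running Mycroft for commands and Jarvis for Plex implies a dispatch step on the server: once a detection fires, the wake word name picks the handler. A hedged sketch of that routing; the handler functions are made up for illustration, not part of `voice_server.py`:

```python
def handle_command(text):
    """Illustrative handler for general commands."""
    return f"command: {text}"

def handle_media(text):
    """Illustrative handler for Plex/media control."""
    return f"plex: {text}"

# Map each wake word name (as passed to --precise-models) to a handler.
HANDLERS = {
    "mycroft": handle_command,
    "jarvis": handle_media,
}

def dispatch(wake_word, utterance):
    """Route an utterance to the handler for the wake word that fired."""
    handler = HANDLERS.get(wake_word)
    if handler is None:
        return None  # unknown wake word: ignore
    return handler(utterance)

print(dispatch("jarvis", "pause the movie"))  # plex: pause the movie
```

Because Option 1 reports *which* model fired, this table is all the context-switching logic you need.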
---
## 💡 Pro Tips

### Wake Word Selection
- ✅ **DO:** Choose clear, distinct wake words
- ✅ **DO:** Test in your environment
- ❌ **DON'T:** Use similar-sounding words
- ❌ **DON'T:** Use common phrases

### Training
- ✅ **DO:** Include all intended users
- ✅ **DO:** Record in various conditions
- ✅ **DO:** Add false positives to training
- ❌ **DON'T:** Rush the training process

### Deployment
- ✅ **DO:** Start simple (one wake word)
- ✅ **DO:** Test thoroughly before adding features
- ✅ **DO:** Monitor false positive rate
- ❌ **DON'T:** Deploy too many wake words at once

### Speaker ID
- ✅ **DO:** Use 20+ seconds for enrollment
- ✅ **DO:** Re-enroll if accuracy drops
- ✅ **DO:** Test threshold values
- ❌ **DON'T:** Expect 100% accuracy

---
## 🔧 Quick Commands

```bash
# Download pre-trained model
./download_pretrained_models.sh --model hey-mycroft

# Test model
precise-listen ~/precise-models/pretrained/hey-mycroft.net

# Fine-tune from pre-trained
precise-train -e 30 custom.net . \
--from-checkpoint ~/precise-models/pretrained/hey-mycroft.net

# Enroll speaker
python enroll_speaker.py --name Alan --duration 20

# Start with single wake word
python voice_server.py --enable-precise \
--precise-model hey-mycroft.net

# Start with multiple wake words
python voice_server.py --enable-precise \
--precise-models "mycroft:hey-mycroft.net:0.5,jarvis:hey-jarvis.net:0.5"

# Check status
curl http://10.1.10.71:5000/wake-word/status

# Monitor detections
curl http://10.1.10.71:5000/wake-word/detections
```

---
## 📚 See Also

- **Full guide:** [ADVANCED_WAKE_WORD_TOPICS.md](ADVANCED_WAKE_WORD_TOPICS.md)
- **Training:** [MYCROFT_PRECISE_GUIDE.md](MYCROFT_PRECISE_GUIDE.md)
- **Deployment:** [PRECISE_DEPLOYMENT.md](PRECISE_DEPLOYMENT.md)
- **Getting started:** [QUICKSTART.md](QUICKSTART.md)

---
## ❓ FAQ

**Q: Can I use "Hey Mycroft" right away?**
A: Yes! Download with `./download_pretrained_models.sh --model hey-mycroft`

**Q: How many wake words can I run at once?**
A: 2-3 comfortably on the server; the Maix Duino can handle one.

**Q: Can I train my own custom wake word?**
A: Yes! See MYCROFT_PRECISE_GUIDE.md Phase 2.

**Q: Does speaker ID work with multiple wake words?**
A: Yes! Wake word detected → speaker identified → personalized response.

**Q: Can I use this on Maix Duino?**
A: Yes: run detection server-side first (start here), then convert the model to KMODEL for on-device use (advanced).

**Q: How accurate is speaker identification?**
A: 85-95% with good enrollment. Re-enroll if accuracy drops.

**Q: What if someone has a cold?**
A: Accuracy may drop temporarily; the system should recover once their voice returns to normal.

**Q: Can kids use it?**
A: Yes! Include their voices in training or enroll them separately.

---
**Quick Decision:** Start with pre-trained Hey Mycroft. Add features later!

```bash
./download_pretrained_models.sh --model hey-mycroft
precise-listen ~/precise-models/pretrained/hey-mycroft.net
# It just works! ✨
```