Ports prior voice assistant research and prototypes from devl/Devops into the Minerva repo. Includes:

- docs/: architecture, wake word guides, ESP32-S3 spec, hardware buying guide
- scripts/: voice_server.py, voice_server_enhanced.py, setup scripts
- hardware/maixduino/: edge device scripts with WiFi credentials scrubbed (replaced hardcoded password with secrets.py pattern)
- config/.env.example: server config template
- .gitignore: excludes .env, secrets.py, model blobs, ELF firmware
- CLAUDE.md: Minerva product context and connection to cf-voice roadmap

# Your Questions Answered - Quick Reference

## TL;DR: Yes, Yes, and Multiple Options!

### Q1: Pre-trained "Hey Mycroft" Model?

**Answer: YES! ✅**

Download and use immediately:
```bash
./quick_start_hey_mycroft.sh
# Done in 5 minutes - no training!
```

The pre-trained model works great and saves you 1-2 hours of training time.

### Q2: Multiple Wake Words?

**Answer: YES! ✅ (with considerations)**

**Server-side (Heimdall):** Easy; can run 3-5 wake words
```bash
python voice_server_enhanced.py \
    --enable-precise \
    --multi-wake-word
```

**Edge (K210):** Feasible for 1-2, challenging for 3+

### Q3: Adapting to New Users' Voices?

**Answer: Multiple approaches ✅**

**Best option:** Train one model with everyone's voices upfront
**Alternative:** Incremental retraining as new users join
**Advanced:** Speaker identification with personalization

---

## Detailed Answers

### 1. Pre-trained "Hey Mycroft" Model

#### Where to Get It

```bash
# Quick start script does this for you
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz
```

#### How to Use

**Instant deployment:**
```bash
python voice_server.py \
    --enable-precise \
    --precise-model ~/precise-models/pretrained/hey-mycroft.net
```

**Fine-tune with your voice:**
```bash
# Record 20-30 samples of your voice saying "Hey Mycroft"
precise-collect

# Fine-tune from pre-trained
# (confirm the checkpoint flag with `precise-train --help` for your Precise version)
precise-train -e 30 my-hey-mycroft.net . \
    --from-checkpoint ~/precise-models/pretrained/hey-mycroft.net
```

#### Advantages

✅ **Zero training time** - Works immediately
✅ **Proven accuracy** - Tested by thousands
✅ **Good baseline** - Already includes diverse voices
✅ **Easy fine-tuning** - Add your voice in 30 mins vs 60+ mins from scratch

#### When to Use Pre-trained vs Custom

**Use Pre-trained "Hey Mycroft" when:**
- You want to test quickly
- "Hey Mycroft" is an acceptable wake word
- You want proven accuracy out of the box

**Train Custom when:**
- You want a different wake word ("Hey Computer", "Jarvis", etc.)
- You need maximum accuracy for your specific environment
- You want a family-specific wake word

**Hybrid (Recommended):**
- Start with pre-trained "Hey Mycroft"
- Test and learn the system
- Fine-tune with your samples
- Or add a custom wake word later

---

### 2. Multiple Wake Words

#### Can You Have Multiple?

**Yes!** Options:

#### Option A: Server-Side (Recommended)

**Easy implementation:**
```bash
# Use the enhanced server
python voice_server_enhanced.py \
    --enable-precise \
    --multi-wake-word
```

**Configured wake words:**
- "Hey Mycroft" (pre-trained)
- "Hey Computer" (custom)
- "Jarvis" (custom)

**Resource impact:**
- 3 models = ~15-30% CPU (Heimdall handles this easily)
- ~300-600MB RAM
- Each model runs independently

**Example use cases:**
```text
"Hey Mycroft, what's the time?"  → General assistant
"Jarvis, run diagnostics"        → Personal assistant mode
"Emergency, call help"           → Priority/emergency mode
```

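Running several wake-word models on the server amounts to feeding each audio chunk to every detector and acting on the most confident hit. The sketch below illustrates that loop under stated assumptions: `pick_wake_word`, the `detectors` mapping, and the 0.5 threshold are hypothetical names and values, not the actual interface of `voice_server_enhanced.py`, and each detector is assumed to return a confidence between 0 and 1.

```python
from typing import Callable, Dict, Optional

# Hypothetical detector type: takes one audio chunk, returns a confidence in [0, 1].
Detector = Callable[[bytes], float]


def pick_wake_word(chunk: bytes,
                   detectors: Dict[str, Detector],
                   threshold: float = 0.5) -> Optional[str]:
    """Run every detector on the same chunk and return the most confident match.

    Returns None when no detector reaches the threshold.
    """
    best_name, best_score = None, threshold
    for name, detect in detectors.items():
        score = detect(chunk)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Stand-in detectors with fixed scores, for illustration only.
detectors = {
    "hey_mycroft": lambda chunk: 0.91,
    "hey_computer": lambda chunk: 0.12,
    "jarvis": lambda chunk: 0.40,
}
print(pick_wake_word(b"\x00" * 2048, detectors))  # prints: hey_mycroft
```

Because every model sees every chunk, CPU cost grows roughly linearly with the number of wake words, which is consistent with the resource figures listed above.
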
#### Option B: Edge (K210)

**Feasible for 1-2 wake words:**
```python
# Sequential checking: try each model in turn, stop at the first hit.
# detect_wake_word() stands for the device-specific inference helper.
def detect_any(models=('hey-mycroft.kmodel', 'emergency.kmodel')):
    for model in models:
        if detect_wake_word(model):
            return model
    return None  # nothing detected in this pass
```

**Limitations:**
- +50-100ms latency per additional model
- Memory constraints (6MB total for all models)
- More models = more power consumption

**Recommendation:**
- K210: 1 wake word (optimal)
- K210: 2 wake words (acceptable)
- K210: 3+ wake words (not recommended)

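The K210 recommendation follows from simple budget arithmetic. The sketch below applies the figures quoted above (50-100 ms extra latency per additional model, 6 MB for all models combined). The per-model size of 2.5 MB is a made-up placeholder, and `k210_budget` is an illustrative helper, so substitute the real size of your `.kmodel` files.

```python
ADDED_LATENCY_MS = (50, 100)    # per additional model, from the limits above
MODEL_MEMORY_BUDGET_MB = 6.0    # total for all models, from the limits above


def k210_budget(num_models: int, model_size_mb: float) -> dict:
    """Estimate extra latency and memory use for sequential wake-word checks."""
    extra_models = max(num_models - 1, 0)
    memory_mb = num_models * model_size_mb
    return {
        "extra_latency_ms": (extra_models * ADDED_LATENCY_MS[0],
                             extra_models * ADDED_LATENCY_MS[1]),
        "memory_mb": memory_mb,
        "fits_in_memory": memory_mb <= MODEL_MEMORY_BUDGET_MB,
    }


# model_size_mb=2.5 is a hypothetical value for illustration.
for n in (1, 2, 3):
    print(n, k210_budget(n, model_size_mb=2.5))
```

With that placeholder size, two models fit in the budget and three do not, and the third model also pushes the added latency to 100-200 ms.
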
#### Option C: Contextual Wake Words

Different wake words for different purposes:
```python
wake_word_contexts = {
    'hey_mycroft': 'general_assistant',
    'emergency': 'priority_emergency',
    'goodnight': 'bedtime_routine',
}
```

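A mapping like this is typically consulted right after detection to decide which mode handles the rest of the utterance. A minimal sketch, assuming detected wake words arrive as free-form strings; `context_for` and its fallback to the general assistant are illustrative choices, not existing server code.

```python
wake_word_contexts = {
    'hey_mycroft': 'general_assistant',
    'emergency': 'priority_emergency',
    'goodnight': 'bedtime_routine',
}


def context_for(wake_word: str, default: str = 'general_assistant') -> str:
    """Map a detected wake word to its context, falling back to a default."""
    # Normalize "Hey Mycroft" / "hey-mycroft" to the dictionary key style.
    key = wake_word.strip().lower().replace(' ', '_').replace('-', '_')
    return wake_word_contexts.get(key, default)


print(context_for('Emergency'))    # prints: priority_emergency
print(context_for('hey-mycroft'))  # prints: general_assistant
print(context_for('jarvis'))       # unknown word, prints: general_assistant
```
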
#### Should You Use Multiple?

**One wake word is usually enough!**

Commercial assistants (Alexa, Google Assistant) listen for just one or two fixed phrases per device, and that works fine.

**Use multiple when:**
- Different family members want different wake words
- You want context-specific behaviors (emergency vs. general)
- You enjoy the flexibility

**Start with one, add more later if needed.**

---

### 3. Adapting to New Users' Voices

#### Challenge

Same wake word, different voices:
- Mom says "Hey Mycroft" (soprano)
- Dad says "Hey Mycroft" (bass)
- Kids say "Hey Mycroft" (high-pitched)

All need to work!

#### Solution 1: Diverse Training (Recommended)

**During initial training, have everyone record samples:**

```bash
cd ~/precise-models/family-hey-mycroft

# Session 1: Mom records 30 samples
precise-collect  # Mom speaks "Hey Mycroft" 30 times

# Session 2: Dad records 30 samples
precise-collect  # Dad speaks "Hey Mycroft" 30 times

# Session 3: Kids record 20 samples each
precise-collect  # Kids speak "Hey Mycroft" 40 times total

# Train one model with all voices
precise-train -e 60 family-hey-mycroft.net .

# Deploy
python voice_server.py \
    --enable-precise \
    --precise-model family-hey-mycroft.net
```

**Pros:**
✅ One model works for everyone
✅ Simple deployment
✅ No switching needed
✅ Works from day one

**Cons:**
❌ Need everyone's time upfront
❌ Slightly lower per-person accuracy than individual models

#### Solution 2: Incremental Training

**Start with one person, add others over time:**

```bash
# Week 1: Train with Dad's voice
precise-train -e 60 hey-mycroft.net .

# Week 2: Mom wants to use it
# Collect Mom's samples
precise-collect  # Mom records 20-30 samples

# Add to training set
cp mom-samples/* wake-word/

# Retrain from checkpoint (faster!)
precise-train -e 30 hey-mycroft.net . \
    --from-checkpoint hey-mycroft.net

# Now works for both Dad and Mom!

# Week 3: Kids want in
# Repeat process...
```

**Pros:**
✅ Don't need everyone upfront
✅ Easy to add new users
✅ Model improves gradually

**Cons:**
❌ New users may have issues initially
❌ Requires periodic retraining

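One practical detail in the "add to training set" step: if two people's recordings share file names, a plain `cp` silently overwrites the earlier samples. Here is a minimal sketch of a safer merge, assuming the samples are `.wav` files and the training folder uses the `wake-word/` layout shown above. `add_speaker_samples` and the speaker-prefix naming scheme are illustrative and not part of Precise.

```python
import shutil
from pathlib import Path


def add_speaker_samples(src_dir: str, dest_dir: str, speaker: str) -> int:
    """Copy *.wav files into dest_dir, prefixing each name with the speaker.

    Returns the number of files copied. Existing files are never overwritten.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = 0
    for wav in sorted(Path(src_dir).glob("*.wav")):
        target = dest / f"{speaker}-{wav.name}"
        if target.exists():
            continue  # already merged in an earlier run
        shutil.copy2(wav, target)
        copied += 1
    return copied


if __name__ == "__main__":
    # Paths mirror the example above; adjust them to your layout.
    if Path("mom-samples").is_dir():
        print(add_speaker_samples("mom-samples", "wake-word", "mom"), "samples added")
```

Running the merge a second time copies nothing, so it is safe to repeat before each retraining round.
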
#### Solution 3: Speaker Identification (Advanced)

**Identify who's speaking, use personalized model/settings:**

```bash
# Install speaker ID
# (prefer a virtual environment; --break-system-packages bypasses the
# distro's externally-managed-environment guard)
pip install pyannote.audio scipy --break-system-packages

# Use enhanced server
python voice_server_enhanced.py \
    --enable-precise \
    --enable-speaker-id \
    --hf-token YOUR_HF_TOKEN
```

**Enroll users:**
```bash
# Record a 30-second voice sample from each person
# POST to /speakers/enroll with audio + name

curl -F "name=alan" \
     -F "audio=@alan_voice.wav" \
     http://localhost:5000/speakers/enroll

curl -F "name=sarah" \
     -F "audio=@sarah_voice.wav" \
     http://localhost:5000/speakers/enroll
```

**Benefits:**
```python
# turn_on() is a placeholder for your home-automation call.
def handle_command(speaker, command):
    # Different responses per user
    if speaker == 'alan':
        turn_on('light.alan_office')
    elif speaker == 'sarah':
        turn_on('light.sarah_office')

    # Different permissions
    if speaker == 'kids' and command.startswith('buy'):
        return "Sorry, kids can't make purchases"
```

**Pros:**
✅ Personalized responses
✅ User-specific settings
✅ Better accuracy (optimized per voice)
✅ Can track who said what

**Cons:**
❌ More complex
❌ Privacy considerations
❌ Additional CPU/RAM (~10% + 200MB)
❌ Requires voice enrollment

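Speaker identification generally works by turning each utterance into a fixed-length embedding and comparing it against the embeddings stored at enrollment. The sketch below shows only that matching step, using plain Python lists. The three-dimensional vectors, the `identify_speaker` helper, and the 0.75 threshold are invented for illustration; in practice the embeddings would come from a speaker-embedding model such as those used with `pyannote.audio`, and this is not the enhanced server's actual code.

```python
import math
from typing import Dict, List, Optional


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_speaker(embedding: List[float],
                     enrolled: Dict[str, List[float]],
                     threshold: float = 0.75) -> Optional[str]:
    """Return the enrolled name whose embedding is closest, or None if unsure."""
    best_name, best_score = None, threshold
    for name, reference in enrolled.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Toy enrollment data; real embeddings have hundreds of dimensions.
enrolled = {"alan": [0.9, 0.1, 0.2], "sarah": [0.1, 0.8, 0.5]}
print(identify_speaker([0.85, 0.15, 0.25], enrolled))  # prints: alan
print(identify_speaker([-0.5, 0.1, -0.7], enrolled))   # prints: None
```

Returning `None` for low-confidence matches lets the server fall back to non-personalized behavior for guests or unenrolled family members.
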
#### Solution 4: Pre-trained Model (Easiest)

**"Hey Mycroft" already includes diverse voices!**

```bash
# Just use it - already trained on many voices
./quick_start_hey_mycroft.sh
```

The community model was trained with:
- Male and female voices
- Different accents
- Different ages
- Various environments

**It should work for most family members out of the box!**

Then fine-tune if needed.

---

## Recommended Path for Your Situation

### Scenario: Family of 3-4 People

**Week 1: Quick Start**
```bash
# Use pre-trained "Hey Mycroft"
./quick_start_hey_mycroft.sh

# Test with all family members
# Likely works for everyone already!
```

**Week 2: Fine-tune if Needed**
```bash
# If someone has issues:
# Have them record 20 samples
# Fine-tune the model

precise-train -e 30 family-hey-mycroft.net . \
    --from-checkpoint ~/precise-models/pretrained/hey-mycroft.net
```

**Week 3: Add Features**
```bash
# If you want personalization:
python voice_server_enhanced.py \
    --enable-speaker-id

# Enroll each family member
```

### Scenario: Just You (or 1-2 People)

**Option 1: Pre-trained**
```bash
./quick_start_hey_mycroft.sh
# Done!
```

**Option 2: Custom Wake Word**
```bash
# Train custom "Hey Computer"
cd ~/precise-models/hey-computer
./1-record-wake-word.sh      # 50 samples
./2-record-not-wake-word.sh  # 200 samples
./3-train-model.sh
```

### Scenario: Multiple People + Multiple Wake Words

**Full setup:**
```bash
# Pre-trained for family
./quick_start_hey_mycroft.sh

# Personal wake word for Dad
cd ~/precise-models/jarvis
# Train custom wake word

# Emergency wake word
cd ~/precise-models/emergency
# Train emergency wake word

# Run multi-wake-word server
python voice_server_enhanced.py \
    --enable-precise \
    --multi-wake-word \
    --enable-speaker-id
```

---

## Quick Decision Matrix

| Your Situation | Recommendation |
|----------------|----------------|
| **Just getting started** | Pre-trained "Hey Mycroft" |
| **Want different wake word** | Train custom model |
| **Family of 3-4** | Pre-trained + fine-tune if needed |
| **Want personalization** | Add speaker ID |
| **Multiple purposes** | Multiple wake words (server-side) |
| **Deploying to K210** | 1 wake word, no speaker ID |

---

## Files to Use

**Quick start with pre-trained:**
- `quick_start_hey_mycroft.sh` - Zero training, 5 minutes!

**Multiple wake words:**
- `voice_server_enhanced.py` - Multi-wake-word + speaker ID support

**Training custom:**
- `setup_precise.sh` - Sets up the training environment
- Scripts in `~/precise-models/your-wake-word/`

**Documentation:**
- `WAKE_WORD_ADVANCED.md` - Detailed, comprehensive guide
- `PRECISE_DEPLOYMENT.md` - Production deployment

---

## Summary

✅ **Yes**, pre-trained "Hey Mycroft" exists and works great
✅ **Yes**, you can have multiple wake words (server-side is easy)
✅ **Yes**, there are multiple approaches for multi-user support

**Recommended approach:**
1. Start with `./quick_start_hey_mycroft.sh` (5 mins)
2. Test with all family members
3. Fine-tune if anyone has issues
4. Add speaker ID later if you want personalization
5. Consider multiple wake words only if you have specific use cases

**Keep it simple!** One pre-trained wake word works for most people.

---

## Next Actions

**Ready to start?**

```bash
# 5-minute quick start
./quick_start_hey_mycroft.sh

# Or read more first
cat WAKE_WORD_ADVANCED.md
```

**Questions?**
- Pre-trained models: See WAKE_WORD_ADVANCED.md § Pre-trained
- Multiple wake words: See WAKE_WORD_ADVANCED.md § Multiple Wake Words
- Voice adaptation: See WAKE_WORD_ADVANCED.md § Voice Adaptation

**Happy voice assisting! 🎙️**