# Mycroft Precise Wake Word Training Guide

## Overview

Mycroft Precise is a neural network-based wake word detector that you can train on custom wake words. This guide covers two deployment approaches for your Maix Duino voice assistant:

1. **Server-side detection** (Recommended to start) - Run Precise on Heimdall
2. **Edge detection** (Advanced) - Convert the model for the K210 on the Maix Duino
## Architecture Options

### Option A: Server-Side Wake Word Detection (Recommended)

```
Maix Duino                          Heimdall
┌─────────────────┐                ┌──────────────────────┐
│ Continuous      │  Audio Stream  │ Mycroft Precise      │
│ Audio Capture   │───────────────>│ Wake Word Detection  │
│                 │                │                      │
│ LED Feedback    │<───────────────│ Whisper STT          │
│ Speaker Output  │    Response    │ HA Integration       │
│                 │                │ Piper TTS            │
└─────────────────┘                └──────────────────────┘
```

**Pros:**
- Easier setup and debugging
- Better accuracy (more compute available)
- Easy to retrain and update models
- Can use ensemble models

**Cons:**
- Continuous audio streaming (bandwidth)
- Slightly higher latency (~100-200 ms)
- Requires a stable network

### Option B: Edge Detection on Maix Duino (Advanced)

```
Maix Duino                          Heimdall
┌─────────────────┐                ┌──────────────────────┐
│ Precise Model   │                │                      │
│ (K210 KPU)      │                │                      │
│ Wake Detection  │ Audio (on wake)│ Whisper STT          │
│                 │───────────────>│ HA Integration       │
│ Audio Capture   │                │ Piper TTS            │
│ LED Feedback    │<───────────────│                      │
└─────────────────┘    Response    └──────────────────────┘
```

**Pros:**
- Lower latency (~50 ms wake detection)
- Less network traffic
- Works even if the server is down
- Better privacy (no continuous streaming)

**Cons:**
- Complex model conversion (TensorFlow → ONNX → KMODEL)
- Limited by K210 compute
- Harder to update models
- Requires careful optimization

## Recommended Approach: Start with Server-Side

Begin with server-side detection on Heimdall, then optimize to edge detection once everything works.

## Phase 1: Mycroft Precise Setup on Heimdall

### Install Mycroft Precise

```bash
# SSH to Heimdall
ssh alan@10.1.10.71

# Create a conda environment for Precise
conda create -n precise python=3.7 -y
conda activate precise

# Install TensorFlow 1.x (Precise requires this)
pip install tensorflow==1.15.5

# Install Precise
pip install mycroft-precise

# Install audio dependencies
sudo apt-get install -y portaudio19-dev sox libatlas-base-dev

# Install precise-engine (for faster inference)
wget https://github.com/MycroftAI/mycroft-precise/releases/download/v0.3.0/precise-engine_0.3.0_x86_64.tar.gz
tar xvf precise-engine_0.3.0_x86_64.tar.gz
sudo cp precise-engine/precise-engine /usr/local/bin/
sudo chmod +x /usr/local/bin/precise-engine
```

Note: `--break-system-packages` is unnecessary inside a conda environment (and older pips bundled with Python 3.7 don't recognize the flag), so the `pip install` commands above omit it.

### Verify Installation

```bash
precise-engine --version
# Should output: Precise v0.3.0

precise-listen --help
# Should show help text
```

## Phase 2: Training Your Custom Wake Word

### Step 1: Collect Wake Word Samples

You'll need ~50-100 samples of your wake word. Choose something:
- 2-3 syllables long
- Easy to pronounce
- Unlikely to occur in normal speech

Example wake words:
- "Hey Computer" (recommended - similar to commercial products)
- "Okay Jarvis"
- "Hello Assistant"
- "Activate Assistant"

```bash
# Create project directory
mkdir -p ~/precise-models/hey-computer
cd ~/precise-models/hey-computer

# Record wake word samples
precise-collect
```

When prompted:
1. Type your wake word ("hey computer")
2. Press SPACE to record
3. Say the wake word clearly
4. Press SPACE to stop
5. Repeat 50-100 times

**Tips for good samples:**
- Vary your tone and speed
- Different distances from the mic
- Different background noise levels
- Different pronunciations
- Have family members record too

### Step 2: Collect "Not Wake Word" Samples

Record background audio and similar-sounding phrases:

```bash
# Create not-wake-word directory
mkdir -p not-wake-word

# Record random speech, music, TV, etc.
# These help the model learn what NOT to trigger on
precise-collect -f not-wake-word/random.wav
```

Collect ~200-500 samples of:
- Normal conversation
- TV/music in the background
- Similar-sounding phrases ("hey commuter", "they computed", etc.)
- Ambient noise
- Other household sounds

### Step 3: Generate Training Data

```bash
# Organize samples
mkdir -p hey-computer/{wake-word,not-wake-word,test/wake-word,test/not-wake-word}

# Split samples (80% train, 20% test):
# - Move 80% of wake-word samples to hey-computer/wake-word/
# - Move 20% to hey-computer/test/wake-word/
# - Move 80% of not-wake-word samples to hey-computer/not-wake-word/
# - Move 20% to hey-computer/test/not-wake-word/

# Generate training data
precise-train-incremental hey-computer.net hey-computer/
```
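
The 80/20 split above can be scripted rather than done by hand. A minimal sketch (the function name, seed, and directory arguments are illustrative, not part of the Precise tooling):

```python
import random
import shutil
from pathlib import Path

def split_samples(src_dir, train_dir, test_dir, test_fraction=0.2, seed=42):
    """Move a random test_fraction of the WAV samples in src_dir into
    test_dir, and the remainder into train_dir."""
    files = sorted(Path(src_dir).glob("*.wav"))
    random.Random(seed).shuffle(files)  # deterministic shuffle for repeatability
    n_test = int(len(files) * test_fraction)
    train_dir, test_dir = Path(train_dir), Path(test_dir)
    train_dir.mkdir(parents=True, exist_ok=True)
    test_dir.mkdir(parents=True, exist_ok=True)
    for i, f in enumerate(files):
        dest = test_dir if i < n_test else train_dir
        shutil.move(str(f), str(dest / f.name))

# Example usage, following the directory layout above:
# split_samples("raw-wake-word", "hey-computer/wake-word", "hey-computer/test/wake-word")
# split_samples("raw-not-wake-word", "hey-computer/not-wake-word", "hey-computer/test/not-wake-word")
```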

### Step 4: Train the Model

```bash
# Basic training (will take 30-60 minutes)
precise-train -e 60 hey-computer.net hey-computer/

# For better accuracy, train longer
precise-train -e 120 hey-computer.net hey-computer/

# Watch for overfitting - validation loss should decrease
# Stop if validation loss starts increasing
```

Training output will show:

```
Epoch 1/60
loss: 0.4523 - val_loss: 0.3891
Epoch 2/60
loss: 0.3102 - val_loss: 0.2845
...
```
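
To act on the overfitting advice above, you can track `val_loss` across epochs and stop once it rises for several epochs in a row. A small sketch (the patience value of 3 is an illustrative choice, not a Precise default):

```python
import re

def val_losses(log_text):
    """Extract the val_loss values, in epoch order, from training output."""
    return [float(m) for m in re.findall(r"val_loss: ([0-9.]+)", log_text)]

def should_stop(losses, patience=3):
    """Return True once val_loss has risen for `patience` consecutive epochs."""
    rises = 0
    for prev, cur in zip(losses, losses[1:]):
        rises = rises + 1 if cur > prev else 0
        if rises >= patience:
            return True
    return False
```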

### Step 5: Test the Model

```bash
# Test with microphone
precise-listen hey-computer.net

# Speak your wake word - you should see "!" when it is detected
# Speak other phrases - it should not trigger

# Test with audio files
precise-test hey-computer.net hey-computer/test/

# Should show accuracy metrics:
#   Wake word accuracy: 95%+
#   False positive rate: <5%
```

### Step 6: Optimize Sensitivity

```bash
# Adjust the activation threshold
precise-listen hey-computer.net -t 0.5  # Default
precise-listen hey-computer.net -t 0.7  # More conservative
precise-listen hey-computer.net -t 0.3  # More aggressive

# Find the optimal threshold for your use case:
# Higher = fewer false positives, more false negatives
# Lower  = more false positives, fewer false negatives
```
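
If you save the model's confidence scores for a set of labeled clips, you can sweep thresholds offline instead of tuning by ear. A minimal sketch - the score lists here are illustrative; in practice they would come from running the model over your test set:

```python
def sweep_thresholds(wake_scores, other_scores, thresholds=(0.3, 0.5, 0.7)):
    """For each candidate threshold, report the false-negative rate on
    wake-word clips and the false-positive rate on everything else."""
    results = {}
    for t in thresholds:
        fn = sum(s < t for s in wake_scores) / len(wake_scores)
        fp = sum(s >= t for s in other_scores) / len(other_scores)
        results[t] = {"false_negative_rate": fn, "false_positive_rate": fp}
    return results

# Example: scores near 1.0 for wake-word clips, near 0.0 for others
stats = sweep_thresholds([0.9, 0.8, 0.4], [0.1, 0.2, 0.6])
```

Pick the threshold whose trade-off matches the guidance above: raise it to suppress false positives, lower it to catch more true activations.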

## Phase 3: Integration with Voice Server

### Update voice_server.py

Add Mycroft Precise support to the server:

```python
# Add to imports
import os

from precise_runner import PreciseEngine, PreciseRunner
import pyaudio

# Add to configuration
PRECISE_MODEL = os.getenv("PRECISE_MODEL",
                          "/home/alan/precise-models/hey-computer.net")
PRECISE_SENSITIVITY = float(os.getenv("PRECISE_SENSITIVITY", "0.5"))

# Global precise runner
precise_runner = None

def on_activation():
    """Called when the wake word is detected"""
    print("Wake word detected!")
    # Trigger recording and processing
    # (implementation depends on your audio streaming setup)

def start_precise_listener():
    """Start Mycroft Precise wake word detection"""
    global precise_runner

    engine = PreciseEngine(
        '/usr/local/bin/precise-engine',
        PRECISE_MODEL
    )

    precise_runner = PreciseRunner(
        engine,
        sensitivity=PRECISE_SENSITIVITY,
        on_activation=on_activation
    )

    precise_runner.start()
    print(f"Precise listening with model: {PRECISE_MODEL}")
```

### Server-Side Wake Word Detection Architecture

For server-side detection, you need continuous audio streaming from the Maix Duino:

```python
# New endpoint for audio streaming
@app.route('/stream', methods=['POST'])
def stream_audio():
    """
    Receive a continuous audio stream for wake word detection.

    This endpoint processes incoming audio chunks and runs them
    through Mycroft Precise for wake word detection.
    """
    # Implementation here
    pass
```
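
One way to fill in that stub: `precise_runner` ships a `ReadWriteStream` you can pass to `PreciseRunner(engine, stream=...)`, so the runner reads audio your endpoint writes instead of opening a local microphone. The sketch below shows the same buffering idea as a self-contained class (the class name and blocking semantics are illustrative, not the library's implementation):

```python
import threading

class AudioBuffer:
    """Minimal ReadWriteStream-style buffer: the /stream endpoint writes
    posted chunks, and the wake word engine reads fixed-size frames."""

    def __init__(self):
        self._buf = b''
        self._cond = threading.Condition()

    def write(self, data):
        """Append a chunk of PCM audio (called by the HTTP endpoint)."""
        with self._cond:
            self._buf += data
            self._cond.notify_all()

    def read(self, n, timeout=None):
        """Block until n bytes are available (or timeout), then return
        up to n bytes (called by the wake word engine's reader thread)."""
        with self._cond:
            while len(self._buf) < n:
                if not self._cond.wait(timeout):
                    break  # timed out; return whatever we have
            chunk, self._buf = self._buf[:n], self._buf[n:]
            return chunk
```

In the Flask endpoint you would then call `audio_buffer.write(request.get_data())` and return immediately, keeping the request handler cheap.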

## Phase 4: Maix Duino Integration (Server-Side Detection)

### Update maix_voice_client.py

For server-side detection, stream audio continuously:

```python
# Add to configuration
STREAM_ENDPOINT = "/stream"
WAKE_WORD_CHECK_INTERVAL = 0.1  # Check every 100ms

def stream_audio_continuous():
    """
    Stream audio to the server for wake word detection.

    The server will notify us when the wake word is detected.
    """
    import socket
    import struct
    import time

    # Create socket connection
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_addr = (VOICE_SERVER_URL.replace('http://', '').split(':')[0], 8888)

    try:
        sock.connect(server_addr)
        print("Connected to wake word server")

        while True:
            # Capture audio chunk
            chunk = i2s_dev.record(CHUNK_SIZE)

            if chunk:
                # Send the chunk size first, then the chunk
                sock.sendall(struct.pack('>I', len(chunk)))
                sock.sendall(chunk)

            # Check for wake word detection signal
            # (simplified - a real implementation needs a non-blocking socket)

            time.sleep(0.01)

    except Exception as e:
        print(f"Streaming error: {e}")
    finally:
        sock.close()
```
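
The matching receiver on Heimdall has to reassemble the length-prefixed frames the client sends. A minimal sketch (port 8888 as above; `handle_frame` is a hypothetical callback standing in for the step that feeds audio into Precise):

```python
import socket
import struct

def recv_exact(sock, n):
    """Read exactly n bytes from the socket, or raise on disconnect."""
    data = b''
    while len(data) < n:
        part = sock.recv(n - len(data))
        if not part:
            raise ConnectionError("client disconnected")
        data += part
    return data

def serve_frames(handle_frame, host='0.0.0.0', port=8888):
    """Accept one client and pass each length-prefixed audio frame
    to handle_frame(frame_bytes)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(1)
        conn, _addr = srv.accept()
        with conn:
            while True:
                # 4-byte big-endian length prefix, matching the client's
                # struct.pack('>I', len(chunk))
                (length,) = struct.unpack('>I', recv_exact(conn, 4))
                handle_frame(recv_exact(conn, length))
```

A production version would accept reconnects and run in its own thread next to the Flask app; this shows only the framing.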

## Phase 5: Edge Detection on Maix Duino (Advanced)

### Convert Precise Model to KMODEL

This is complex and requires several conversion steps:

```bash
# Step 1: Convert TensorFlow model to ONNX
pip install tf2onnx

python -m tf2onnx.convert \
    --saved-model hey-computer.net \
    --output hey-computer.onnx

# Step 2: Optimize ONNX model
pip install onnx

python -c "
import onnx
from onnx import optimizer

model = onnx.load('hey-computer.onnx')
passes = ['eliminate_deadend', 'eliminate_identity',
          'eliminate_nop_dropout', 'eliminate_nop_pad']
optimized = optimizer.optimize(model, passes)
onnx.save(optimized, 'hey-computer-opt.onnx')
"

# Step 3: Convert ONNX to KMODEL (for K210)
# Use nncase (https://github.com/kendryte/nncase)
# This step is hardware-specific and complex

# Install nncase
pip install nncase

# Convert (adjust parameters based on your model)
ncc compile hey-computer-opt.onnx \
    -i onnx \
    --dataset calibration_data \
    -o hey-computer.kmodel \
    --target k210
```

**Note:** KMODEL conversion is non-trivial and may require model architecture adjustments. The K210 has limitations:
- Max model size: ~6 MB
- Limited operator support
- Quantization required for performance

### Testing KMODEL on Maix Duino

```python
# Load model in maix_voice_client.py
import KPU as kpu

def load_wake_word_model_kmodel():
    """Load the converted KMODEL for wake word detection"""
    global kpu_task

    try:
        kpu_task = kpu.load("/sd/models/hey-computer.kmodel")
        print("Wake word model loaded on K210")
        return True
    except Exception as e:
        print(f"Failed to load model: {e}")
        return False

def detect_wake_word_kmodel():
    """Run wake word detection using the K210 KPU"""
    global kpu_task

    # Capture audio
    audio_chunk = i2s_dev.record(CHUNK_SIZE)

    # Preprocess for the model (depends on the model's input format)
    # This is model-specific - adjust based on your training
    features = preprocess_audio(audio_chunk)

    # Run inference
    output = kpu.run_yolo2(kpu_task, features)  # Adjust based on model type

    # Check confidence
    if output[0] > WAKE_WORD_THRESHOLD:
        return True

    return False
```

## Recommended Wake Words

Based on testing and community feedback:

**Best performers:**
1. "Hey Computer" - clear, distinct syllables, hard consonants
2. "Okay Jarvis" - pop culture reference, easy to say
3. "Hey Mycroft" - original Mycroft wake word (lots of training data available)

**Avoid:**
- Single-syllable words (too easy to trigger)
- Common phrases ("okay", "hey there")
- Names of people in your household
- Words that sound like common speech patterns

## Training Tips

### For Best Accuracy

1. **Diverse training data:**
   - Multiple speakers
   - Various distances (1 ft to 15 ft)
   - Different noise conditions
   - Accent variations

2. **Quality over quantity:**
   - 50 good samples > 200 poor samples
   - Clear pronunciation
   - Consistent volume

3. **Hard negatives:**
   - Include similar-sounding phrases
   - Include partial wake words
   - Include common false triggers you notice

4. **Regular retraining:**
   - Add false positives to the training set
   - Add missed detections
   - Retrain every few weeks initially

### Collecting Hard Negatives

```bash
# Run Precise in test mode and collect false positives
precise-listen hey-computer.net --save-false-positives

# This saves audio clips when the model triggers incorrectly
# Add these to your not-wake-word training set
# Retrain to reduce false positives
```
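
Folding those saved clips back into the training set can be scripted. A minimal sketch (directory names and the renaming scheme are illustrative):

```python
import shutil
from pathlib import Path

def add_hard_negatives(false_positive_dir, not_wake_word_dir, prefix="fp"):
    """Copy saved false-positive clips into the not-wake-word set,
    renaming them so repeated collection runs don't collide.
    Returns the number of clips copied."""
    dest = Path(not_wake_word_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = 0
    for i, wav in enumerate(sorted(Path(false_positive_dir).glob("*.wav"))):
        shutil.copy(wav, dest / f"{prefix}-{i:04d}{wav.suffix}")
        copied += 1
    return copied
```

After copying, rerun `precise-train` so the model learns from the new negatives.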

## Performance Benchmarks

### Server-Side Detection (Heimdall)
- **Latency:** 100-200 ms from utterance to detection
- **Accuracy:** 95%+ with good training
- **False positive rate:** <1 per hour with tuning
- **CPU usage:** ~5-10% (single core)
- **Network:** ~256 kbps for a continuous 16 kHz/16-bit mono stream

### Edge Detection (Maix Duino)
- **Latency:** 50-100 ms
- **Accuracy:** 80-90% (limited by K210 quantization)
- **False positive rate:** varies by model optimization
- **CPU usage:** ~30% of the K210 (leaves room for other tasks)
- **Network:** none until the wake word is detected

## Monitoring and Debugging

### Log Wake Word Detections

```python
# Add to voice_server.py
import datetime

def log_wake_word(confidence, timestamp=None):
    """Log wake word detections for analysis"""
    if timestamp is None:
        timestamp = datetime.datetime.now()

    log_file = "/home/alan/voice-assistant/logs/wake_words.log"

    with open(log_file, 'a') as f:
        f.write(f"{timestamp.isoformat()},{confidence}\n")
```

### Analyze False Positives

```bash
# Follow the wake word log
tail -f ~/voice-assistant/logs/wake_words.log

# Summarize detection confidences
# (each log line is "timestamp,confidence", so take the second field)
awk -F',' '{print $2}' ~/voice-assistant/logs/wake_words.log | \
    sort -n | uniq -c
```
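
For a quick look at when false triggers cluster, the CSV log written by `log_wake_word` can be grouped by hour. A small sketch:

```python
from collections import Counter
from datetime import datetime

def detections_per_hour(log_lines):
    """Count detections per (date, hour) from 'isotimestamp,confidence' lines."""
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        timestamp, _confidence = line.split(',', 1)
        ts = datetime.fromisoformat(timestamp)
        counts[(ts.date(), ts.hour)] += 1
    return counts

# Example usage:
# with open("/home/alan/voice-assistant/logs/wake_words.log") as f:
#     for hour, n in detections_per_hour(f).most_common(5):
#         print(hour, n)
```

Hours with unusually many detections (e.g. when the TV is on) tell you what kind of hard negatives to collect next.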

## Production Deployment

### Systemd Service with Precise

Update the systemd service to include Precise:

```ini
[Unit]
Description=Voice Assistant with Wake Word Detection
After=network.target

[Service]
Type=simple
User=alan
WorkingDirectory=/home/alan/voice-assistant
Environment="PATH=/home/alan/miniconda3/envs/precise/bin:/usr/local/bin:/usr/bin:/bin"
EnvironmentFile=/home/alan/voice-assistant/config/.env
ExecStart=/home/alan/miniconda3/envs/precise/bin/python voice_server.py --enable-precise
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
```

## Troubleshooting

### Precise Won't Start

```bash
# Check TensorFlow version
python -c "import tensorflow as tf; print(tf.__version__)"
# Should be 1.15.x

# Check the model file
file hey-computer.net
# Should be "TensorFlow SavedModel"

# Test the model directly
precise-engine hey-computer.net
# Should load without errors
```

### Low Accuracy

1. **Collect more training data** - especially hard negatives
2. **Increase training epochs** - try 200-300 epochs
3. **Verify the training/test split** - should be 80/20
4. **Check audio quality** - the sample rate should match (16 kHz)
5. **Try different wake words** - some are easier to detect

### High False Positive Rate

1. **Increase the threshold** - try 0.6, 0.7, 0.8
2. **Add false positives to training** - retrain with false triggers
3. **Collect more negative samples** - expand the not-wake-word set
4. **Use ensemble models** - run multiple models, require agreement

### KMODEL Conversion Fails

This is expected - K210 conversion is complex:

1. **Simplify the model architecture** - reduce the layer count
2. **Use quantization-aware training** - train with quantization in mind
3. **Check operator support** - the K210 doesn't support all TF ops
4. **Consider alternatives:**
   - Use pre-trained models for the K210
   - Stick with server-side detection
   - Use Porcupine instead (it has K210 support)

## Alternative: Use Pre-trained Models

Mycroft provides some pre-trained models:

```bash
# Download the Hey Mycroft model
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz

# Test it
precise-listen hey-mycroft.net
```

Then train your own wake word starting from this base:

```bash
# Fine-tune from the pre-trained model
precise-train -e 60 my-wake-word.net my-wake-word/ \
    --from-checkpoint hey-mycroft.net
```

## Next Steps

1. **Start with server-side** - get it working on Heimdall first
2. **Collect good training data** - quality samples are key
3. **Test and tune the threshold** - find the sweet spot for your environment
4. **Monitor performance** - track false positives and misses
5. **Iterate on training** - add hard examples, retrain
6. **Consider edge deployment** - once server-side is solid

## Resources

- Mycroft Precise docs: https://github.com/MycroftAI/mycroft-precise
- Training guide: https://mycroft-ai.gitbook.io/docs/mycroft-technologies/precise
- Community models: https://github.com/MycroftAI/precise-data
- K210 docs: https://canaan-creative.com/developer
- nncase: https://github.com/kendryte/nncase

## Conclusion

Mycroft Precise gives you full control over your wake word detection with complete privacy. Start with server-side detection for easier development, then optimize to edge detection once you have a well-trained model.

The key to success is good training data - invest time in collecting diverse, high-quality samples!