Mycroft Precise Wake Word Training Guide
Overview
Mycroft Precise is a neural network-based wake word detector that you can train on custom wake words. This guide covers two deployment approaches for your Maix Duino voice assistant:
- Server-side detection (Recommended to start) - Run Precise on Heimdall
- Edge detection (Advanced) - Convert model for K210 on Maix Duino
Architecture Options
Option A: Server-Side Wake Word Detection (Recommended)
Maix Duino Heimdall
┌─────────────────┐ ┌──────────────────────┐
│ Continuous │ Audio Stream │ Mycroft Precise │
│ Audio Capture │───────────────>│ Wake Word Detection │
│ │ │ │
│ LED Feedback │<───────────────│ Whisper STT │
│ Speaker Output │ Response │ HA Integration │
│ │ │ Piper TTS │
└─────────────────┘ └──────────────────────┘
Pros:
- Easier setup and debugging
- Better accuracy (more compute available)
- Easy to retrain and update models
- Can use ensemble models
Cons:
- Continuous audio streaming (bandwidth)
- Slightly higher latency (~100-200ms)
- Requires stable network
Option B: Edge Detection on Maix Duino (Advanced)
Maix Duino Heimdall
┌─────────────────┐ ┌──────────────────────┐
│ Precise Model │ │ │
│ (K210 KPU) │ │ │
│ Wake Detection │ Audio (on wake)│ Whisper STT │
│ │───────────────>│ HA Integration │
│ Audio Capture │ │ Piper TTS │
│ LED Feedback │<───────────────│ │
└─────────────────┘ Response └──────────────────────┘
Pros:
- Lower latency (~50ms wake detection)
- Less network traffic
- Works even if server is down
- Better privacy (no continuous streaming)
Cons:
- Complex model conversion (TensorFlow → ONNX → KMODEL)
- Limited by K210 compute
- Harder to update models
- Requires careful optimization
Recommended Approach: Start with Server-Side
Begin with server-side detection on Heimdall, then optimize to edge detection once everything works.
Phase 1: Mycroft Precise Setup on Heimdall
Install Mycroft Precise
# SSH to Heimdall
ssh alan@10.1.10.71
# Create conda environment for Precise
conda create -n precise python=3.7 -y
conda activate precise
# Install TensorFlow 1.x (Precise requires this; inside the conda env
# the --break-system-packages flag is not needed)
pip install tensorflow==1.15.5
# Install Precise
pip install mycroft-precise
# Install audio dependencies
sudo apt-get install -y portaudio19-dev sox libatlas-base-dev
# Install precise-engine (for faster inference)
wget https://github.com/MycroftAI/mycroft-precise/releases/download/v0.3.0/precise-engine_0.3.0_x86_64.tar.gz
tar xvf precise-engine_0.3.0_x86_64.tar.gz
sudo cp precise-engine/precise-engine /usr/local/bin/
sudo chmod +x /usr/local/bin/precise-engine
Verify Installation
precise-engine --version
# Should output: Precise v0.3.0
precise-listen --help
# Should show help text
Phase 2: Training Your Custom Wake Word
Step 1: Collect Wake Word Samples
You'll need ~50-100 samples of your wake word. Choose something:
- 3-4 syllables long
- Easy to pronounce
- Unlikely to occur in normal speech
Example wake words:
- "Hey Computer" (recommended - similar to commercial products)
- "Okay Jarvis"
- "Hello Assistant"
- "Activate Assistant"
# Create project directory
mkdir -p ~/precise-models/hey-computer
cd ~/precise-models/hey-computer
# Record wake word samples
precise-collect
When prompted:
- Type your wake word ("hey computer")
- Press SPACE to record
- Say the wake word clearly
- Press SPACE to stop
- Repeat 50-100 times
Tips for good samples:
- Vary your tone and speed
- Different distances from mic
- Different background noise levels
- Different pronunciations
- Have family members record too
Step 2: Collect "Not Wake Word" Samples
Record background audio and similar-sounding phrases:
# Create not-wake-word directory
mkdir -p not-wake-word
# Record random speech, music, TV, etc.
# These help the model learn what NOT to trigger on
precise-collect -f not-wake-word/random.wav
Collect ~200-500 samples of:
- Normal conversation
- TV/music in background
- Similar sounding phrases ("hey commuter", "they computed", etc.)
- Ambient noise
- Other household sounds
Step 3: Generate Training Data
# Organize samples
mkdir -p hey-computer/{wake-word,not-wake-word,test/wake-word,test/not-wake-word}
# Split samples (80% train, 20% test)
# Move 80% of wake-word samples to hey-computer/wake-word/
# Move 20% to hey-computer/test/wake-word/
# Move 80% of not-wake-word to hey-computer/not-wake-word/
# Move 20% to hey-computer/test/not-wake-word/
# Optional: incremental training (retrains while folding false
# activations from the not-wake-word data back into the training set)
precise-train-incremental hey-computer.net hey-computer/
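The manual 80/20 split described in the comments above can be scripted. A minimal sketch (`split_samples` is a hypothetical helper; it assumes a flat directory of .wav recordings and the directory layout used in this guide):

```python
import random
import shutil
from pathlib import Path

def split_samples(src_dir, train_dir, test_dir, test_fraction=0.2, seed=0):
    """Move a random test_fraction of .wav files from src_dir into
    test_dir, and the remainder into train_dir."""
    files = sorted(Path(src_dir).glob("*.wav"))
    random.Random(seed).shuffle(files)
    n_test = int(len(files) * test_fraction)
    Path(train_dir).mkdir(parents=True, exist_ok=True)
    Path(test_dir).mkdir(parents=True, exist_ok=True)
    for f in files[:n_test]:
        shutil.move(str(f), str(Path(test_dir) / f.name))
    for f in files[n_test:]:
        shutil.move(str(f), str(Path(train_dir) / f.name))

# Usage (assumes raw recordings collected into recordings/wake-word):
# split_samples("recordings/wake-word",
#               "hey-computer/wake-word",
#               "hey-computer/test/wake-word")
```

The fixed seed makes the split reproducible, so retraining runs see the same train/test partition.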
Step 4: Train the Model
# Basic training (will take 30-60 minutes)
precise-train -e 60 hey-computer.net hey-computer/
# For better accuracy, train longer
precise-train -e 120 hey-computer.net hey-computer/
# Watch for overfitting - validation loss should decrease
# Stop if validation loss starts increasing
Training output will show:
Epoch 1/60
loss: 0.4523 - val_loss: 0.3891
Epoch 2/60
loss: 0.3102 - val_loss: 0.2845
...
Step 5: Test the Model
# Test with microphone
precise-listen hey-computer.net
# Speak your wake word - should see "!" when detected
# Speak other phrases - should not trigger
# Test with audio files
precise-test hey-computer.net hey-computer/test/
# Should show accuracy metrics:
# Wake word accuracy: 95%+
# False positive rate: <5%
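Those two numbers come from simple confusion counts. A small helper makes the calculation explicit (the helper name and example counts are illustrative, not precise-test output):

```python
def wake_word_metrics(tp, fn, fp, tn):
    """Compute the two numbers that matter for a wake word model:
    recall on wake-word clips (how often it fires when it should) and
    false positive rate (how often it fires when it shouldn't)."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return recall, fpr

# e.g. 19 of 20 wake-word clips detected, 3 of 100 negatives triggered:
# recall = 0.95, false positive rate = 0.03
```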
Step 6: Optimize Sensitivity
# Adjust activation sensitivity (precise-listen's -s flag, default 0.5)
precise-listen hey-computer.net -s 0.5  # Default
precise-listen hey-computer.net -s 0.3  # More conservative
precise-listen hey-computer.net -s 0.7  # More aggressive
# Find the optimal sensitivity for your use case
# Lower = fewer false positives, more missed detections
# Higher = more false positives, fewer missed detections
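Whether you tune sensitivity or a raw decision threshold, the tradeoff can be explored offline if you save the model's confidence score for each labeled test clip. A sketch (hypothetical `(score, label)` pairs; `sweep_thresholds` is not part of Precise):

```python
def sweep_thresholds(scored, thresholds):
    """scored: list of (confidence, is_wake_word) pairs from running the
    model over labeled test clips. Returns per-threshold counts of
    (false_positives, false_negatives) so you can pick a balance point."""
    results = {}
    for t in thresholds:
        fp = sum(1 for s, label in scored if s >= t and not label)
        fn = sum(1 for s, label in scored if s < t and label)
        results[t] = (fp, fn)
    return results
```

Raising the threshold can only shrink the false-positive count and grow the false-negative count, so sweeping a handful of values quickly shows the operating point that suits your environment.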
Phase 3: Integration with Voice Server
Update voice_server.py
Add Mycroft Precise support to the server:
# Add to imports
from precise_runner import PreciseEngine, PreciseRunner
import pyaudio

# Add to configuration (precise-engine needs a .pb model; create it from
# the trained .net with: precise-convert hey-computer.net)
PRECISE_MODEL = os.getenv("PRECISE_MODEL",
                          "/home/alan/precise-models/hey-computer.pb")
PRECISE_SENSITIVITY = float(os.getenv("PRECISE_SENSITIVITY", "0.5"))

# Global precise runner
precise_runner = None

def on_activation():
    """Called when wake word is detected"""
    print("Wake word detected!")
    # Trigger recording and processing
    # (Implementation depends on your audio streaming setup)

def start_precise_listener():
    """Start Mycroft Precise wake word detection"""
    global precise_runner
    engine = PreciseEngine(
        '/usr/local/bin/precise-engine',
        PRECISE_MODEL
    )
    precise_runner = PreciseRunner(
        engine,
        sensitivity=PRECISE_SENSITIVITY,
        on_activation=on_activation
    )
    precise_runner.start()
    print(f"Precise listening with model: {PRECISE_MODEL}")
Server-Side Wake Word Detection Architecture
For server-side detection, you need continuous audio streaming from Maix Duino:
# New endpoint for audio streaming (one possible implementation: write
# each POSTed chunk into a ReadWriteStream that PreciseRunner reads,
# by passing stream=audio_stream when constructing the runner)
from flask import request
from precise_runner import ReadWriteStream

audio_stream = ReadWriteStream()

@app.route('/stream', methods=['POST'])
def stream_audio():
    """
    Receive a chunk of raw 16 kHz 16-bit mono PCM audio and feed it
    to Mycroft Precise for wake word detection.
    """
    audio_stream.write(request.data)
    return '', 204
Phase 4: Maix Duino Integration (Server-Side Detection)
Update maix_voice_client.py
For server-side detection, stream audio continuously:
# Add to configuration
STREAM_ENDPOINT = "/stream"
WAKE_WORD_CHECK_INTERVAL = 0.1 # Check every 100ms
def stream_audio_continuous():
    """
    Stream audio to server for wake word detection
    Server will notify us when wake word is detected
    """
    import socket
    import struct

    # Create socket connection
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_addr = (VOICE_SERVER_URL.replace('http://', '').split(':')[0], 8888)

    try:
        sock.connect(server_addr)
        print("Connected to wake word server")

        while True:
            # Capture audio chunk
            chunk = i2s_dev.record(CHUNK_SIZE)
            if chunk:
                # Send chunk size first, then chunk
                sock.sendall(struct.pack('>I', len(chunk)))
                sock.sendall(chunk)

            # Check for wake word detection signal
            # (simplified - actual implementation needs non-blocking socket)
            time.sleep(0.01)
    except Exception as e:
        print(f"Streaming error: {e}")
    finally:
        sock.close()
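The server side needs a matching receiver that undoes the length-prefix framing used by the client above. A minimal sketch (helper names are hypothetical; port 8888 matches the client, and `handle_chunk` might write into PreciseRunner's audio stream):

```python
import socket
import struct

def recv_exact(conn, n):
    """Read exactly n bytes from the socket (TCP can deliver partial reads)."""
    buf = b""
    while len(buf) < n:
        part = conn.recv(n - len(buf))
        if not part:
            raise ConnectionError("client disconnected")
        buf += part
    return buf

def read_chunk(conn):
    """Read one chunk as framed by the client: a 4-byte big-endian
    length, then that many bytes of raw PCM audio."""
    (length,) = struct.unpack(">I", recv_exact(conn, 4))
    return recv_exact(conn, length)

def serve_wake_word(handle_chunk, host="0.0.0.0", port=8888):
    """Accept one client and feed every received chunk to handle_chunk."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(1)
        conn, addr = srv.accept()
        print(f"Client connected: {addr}")
        with conn:
            while True:
                handle_chunk(read_chunk(conn))
```

`recv_exact` matters here: a single `recv` call is not guaranteed to return the full chunk, so the loop reassembles it before handing the audio on.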
Phase 5: Edge Detection on Maix Duino (Advanced)
Convert Precise Model to KMODEL
This is complex and requires several conversion steps:
# Step 1: Convert the trained model to ONNX
# (Precise saves .net models in Keras HDF5 format, so use tf2onnx's
#  --keras input rather than --saved-model)
pip install tf2onnx --break-system-packages
python -m tf2onnx.convert \
    --keras hey-computer.net \
    --output hey-computer.onnx
# Step 2: Optimize ONNX model
# (note: onnx.optimizer was removed in onnx 1.9+; on newer versions,
#  pip install onnxoptimizer and import onnxoptimizer instead)
pip install onnx --break-system-packages
python -c "
import onnx
from onnx import optimizer
model = onnx.load('hey-computer.onnx')
passes = ['eliminate_deadend', 'eliminate_identity',
          'eliminate_nop_dropout', 'eliminate_nop_pad']
optimized = optimizer.optimize(model, passes)
onnx.save(optimized, 'hey-computer-opt.onnx')
"
# Step 3: Convert ONNX to KMODEL (for K210)
# Use nncase (https://github.com/kendryte/nncase)
# This step is hardware-specific and complex
# Install nncase
pip install nncase --break-system-packages
# Convert (adjust parameters based on your model)
ncc compile hey-computer-opt.onnx \
-i onnx \
--dataset calibration_data \
-o hey-computer.kmodel \
--target k210
Note: KMODEL conversion is non-trivial and may require model architecture adjustments. The K210 has limitations:
- Max model size: ~6MB
- Limited operator support
- Quantization required for performance
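Before attempting a conversion, you can sanity-check the ~6MB budget from the parameter count alone, since nncase quantizes weights to int8 (roughly one byte per parameter). This is a rough estimate only; it ignores activation buffers and model metadata, and the helper names are hypothetical:

```python
def kmodel_size_estimate_mb(param_count, bytes_per_param=1):
    """Rough size estimate for an int8-quantized model: ~1 byte per
    parameter, ignoring activation buffers and metadata overhead."""
    return param_count * bytes_per_param / (1024 * 1024)

def fits_k210(param_count, budget_mb=6.0):
    """Check the estimate against the ~6MB K210 model budget."""
    return kmodel_size_estimate_mb(param_count) <= budget_mb

# A small GRU-based wake word net of a few hundred thousand parameters
# fits comfortably; a 10M-parameter model does not.
```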
Testing KMODEL on Maix Duino
# Load model in maix_voice_client.py
import KPU as kpu

def load_wake_word_model_kmodel():
    """Load converted KMODEL for wake word detection"""
    global kpu_task
    try:
        kpu_task = kpu.load("/sd/models/hey-computer.kmodel")
        print("Wake word model loaded on K210")
        return True
    except Exception as e:
        print(f"Failed to load model: {e}")
        return False

def detect_wake_word_kmodel():
    """Run wake word detection using K210 KPU"""
    global kpu_task

    # Capture audio
    audio_chunk = i2s_dev.record(CHUNK_SIZE)

    # Preprocess for model (depends on model input format)
    # This is model-specific - adjust based on your training
    features = preprocess_audio(audio_chunk)

    # Run inference (kpu.forward is MaixPy's generic KPU inference call;
    # kpu.run_yolo2 is only for YOLO object detection models)
    fmap = kpu.forward(kpu_task, features)
    output = fmap[:]

    # Check confidence
    if output[0] > WAKE_WORD_THRESHOLD:
        return True
    return False
Recommended Wake Words
Based on testing and community feedback:
Best performers:
- "Hey Computer" - Clear, distinct, four syllables, hard consonants
- "Okay Jarvis" - Pop culture reference, easy to say
- "Hey Mycroft" - Original Mycroft wake word (lots of training data available)
Avoid:
- Single syllable words (too easy to trigger)
- Common phrases ("okay", "hey there")
- Names of people in your household
- Words that sound like common speech patterns
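These rules of thumb can be turned into a rough pre-check. The sketch below is a crude heuristic only (syllables approximated as vowel groups, function names hypothetical), not a substitute for actually testing a candidate wake word:

```python
import re

def estimate_syllables(phrase):
    """Crude syllable count: number of vowel groups per word."""
    return sum(len(re.findall(r"[aeiouy]+", w.lower()))
               for w in phrase.split())

def screen_wake_word(phrase, household_names=()):
    """Apply the rules of thumb above; returns a list of warnings."""
    warnings = []
    if estimate_syllables(phrase) < 3:
        warnings.append("too short: aim for 3+ syllables")
    if phrase.lower() in {"okay", "hey", "hey there"}:
        warnings.append("too common in normal speech")
    for name in household_names:
        if name.lower() in phrase.lower():
            warnings.append(f"contains household name: {name}")
    return warnings
```

For example, "hey computer" passes cleanly, while a one-syllable phrase or one containing a family member's name is flagged.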
Training Tips
For Best Accuracy
1. Diverse training data:
   - Multiple speakers
   - Various distances (1ft to 15ft)
   - Different noise conditions
   - Accent variations
2. Quality over quantity:
   - 50 good samples > 200 poor samples
   - Clear pronunciation
   - Consistent volume
3. Hard negatives:
   - Include similar-sounding phrases
   - Include partial wake words
   - Include common false triggers you notice
4. Regular retraining:
   - Add false positives to training set
   - Add missed detections
   - Retrain every few weeks initially
Collecting Hard Negatives
# Run Precise in test mode and collect false positives
precise-listen hey-computer.net --save-false-positives
# This will save audio clips when model triggers incorrectly
# Add these to your not-wake-word training set
# Retrain to reduce false positives
Performance Benchmarks
Server-Side Detection (Heimdall)
- Latency: 100-200ms from utterance to detection
- Accuracy: 95%+ with good training
- False positive rate: <1 per hour with tuning
- CPU usage: ~5-10% (single core)
- Network: ~256kbps for a raw 16kHz, 16-bit mono PCM stream
Edge Detection (Maix Duino)
- Latency: 50-100ms
- Accuracy: 80-90% (limited by K210 quantization)
- False positive rate: Varies by model optimization
- CPU usage: ~30% K210 (leaves room for other tasks)
- Network: 0 until wake detected
Monitoring and Debugging
Log Wake Word Detections
# Add to voice_server.py
import datetime

def log_wake_word(confidence, timestamp=None):
    """Log wake word detections for analysis"""
    if timestamp is None:
        timestamp = datetime.datetime.now()

    log_file = "/home/alan/voice-assistant/logs/wake_words.log"
    with open(log_file, 'a') as f:
        f.write(f"{timestamp.isoformat()},{confidence}\n")
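Because each log line is just `timestamp,confidence`, offline analysis is straightforward. A sketch that buckets detections by hour, useful for spotting bursts of false positives (the helper name is hypothetical):

```python
import datetime
from collections import Counter

def detections_per_hour(lines):
    """Parse 'timestamp,confidence' log lines and count detections
    bucketed by hour, to spot bursts of false positives."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        ts_str, _conf = line.split(",", 1)
        ts = datetime.datetime.fromisoformat(ts_str)
        counts[ts.replace(minute=0, second=0, microsecond=0)] += 1
    return counts

# Usage:
# with open("/home/alan/voice-assistant/logs/wake_words.log") as f:
#     for hour, n in sorted(detections_per_hour(f).items()):
#         print(hour, n)
```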
Analyze False Positives
# Check wake word log
tail -f ~/voice-assistant/logs/wake_words.log
# Summarize detection confidences (log lines are "timestamp,confidence")
awk -F',' '{print $2}' ~/voice-assistant/logs/wake_words.log | \
  sort -n | uniq -c
Production Deployment
Systemd Service with Precise
Update the systemd service to include Precise:
[Unit]
Description=Voice Assistant with Wake Word Detection
After=network.target
[Service]
Type=simple
User=alan
WorkingDirectory=/home/alan/voice-assistant
Environment="PATH=/home/alan/miniconda3/envs/precise/bin:/usr/local/bin:/usr/bin:/bin"
EnvironmentFile=/home/alan/voice-assistant/config/.env
ExecStart=/home/alan/miniconda3/envs/precise/bin/python voice_server.py --enable-precise
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target
Troubleshooting
Precise Won't Start
# Check TensorFlow version
python -c "import tensorflow as tf; print(tf.__version__)"
# Should be 1.15.x
# Check model file
file hey-computer.net
# Should report HDF5 data (Precise saves models in Keras HDF5 format)
# Convert for the engine, then test it
precise-convert hey-computer.net
precise-engine hey-computer.pb
# Should load without errors
Low Accuracy
- Collect more training data - Especially hard negatives
- Increase training epochs - Try 200-300 epochs
- Verify training/test split - Should be 80/20
- Check audio quality - Sample rate should match (16kHz)
- Try different wake words - Some are easier to detect
High False Positive Rate
- Increase threshold - Try 0.6, 0.7, 0.8
- Add false positives to training - Retrain with false triggers
- Collect more negative samples - Expand not-wake-word set
- Use ensemble models - Run multiple models, require agreement
KMODEL Conversion Fails
This is expected - K210 conversion is complex:
- Simplify model architecture - Reduce layer count
- Use quantization-aware training - Train with quantization in mind
- Check operator support - K210 doesn't support all TF ops
- Consider alternatives:
- Use pre-trained models for K210
- Stick with server-side detection
- Use Porcupine instead (has K210 support)
Alternative: Use Pre-trained Models
Mycroft provides some pre-trained models:
# Download Hey Mycroft model
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz
# Test it
precise-listen hey-mycroft.net
Then train your own wake word starting from this base:
# Fine-tune from the pre-trained model (precise-train resumes training
# from an existing model file, so start from a copy)
cp hey-mycroft.net my-wake-word.net
precise-train -e 60 my-wake-word.net my-wake-word/
Next Steps
- Start with server-side - Get it working on Heimdall first
- Collect good training data - Quality samples are key
- Test and tune threshold - Find the sweet spot for your environment
- Monitor performance - Track false positives and misses
- Iterate on training - Add hard examples, retrain
- Consider edge deployment - Once server-side is solid
Resources
- Mycroft Precise Docs: https://github.com/MycroftAI/mycroft-precise
- Training Guide: https://mycroft-ai.gitbook.io/docs/mycroft-technologies/precise
- Community Models: https://github.com/MycroftAI/precise-data
- K210 Docs: https://canaan-creative.com/developer
- nncase: https://github.com/kendryte/nncase
Conclusion
Mycroft Precise gives you full control over your wake word detection with complete privacy. Start with server-side detection for easier development, then optimize to edge detection once you have a well-trained model.
The key to success is good training data - invest time in collecting diverse, high-quality samples!