Mycroft Precise Wake Word Training Guide
Overview
Mycroft Precise is a neural network-based wake word detector that you can train on custom wake words. This guide covers two deployment approaches for your Maix Duino voice assistant:
- Server-side detection (Recommended to start) - Run Precise on Heimdall
- Edge detection (Advanced) - Convert model for K210 on Maix Duino
Architecture Options
Option A: Server-Side Wake Word Detection (Recommended)
Maix Duino Heimdall
┌─────────────────┐ ┌──────────────────────┐
│ Continuous │ Audio Stream │ Mycroft Precise │
│ Audio Capture │───────────────>│ Wake Word Detection │
│ │ │ │
│ LED Feedback │<───────────────│ Whisper STT │
│ Speaker Output │ Response │ HA Integration │
│ │ │ Piper TTS │
└─────────────────┘ └──────────────────────┘
Pros:
- Easier setup and debugging
- Better accuracy (more compute available)
- Easy to retrain and update models
- Can use ensemble models
Cons:
- Continuous audio streaming (bandwidth)
- Slightly higher latency (~100-200ms)
- Requires stable network
Option B: Edge Detection on Maix Duino (Advanced)
Maix Duino Heimdall
┌─────────────────┐ ┌──────────────────────┐
│ Precise Model │ │ │
│ (K210 KPU) │ │ │
│ Wake Detection │ Audio (on wake)│ Whisper STT │
│ │───────────────>│ HA Integration │
│ Audio Capture │ │ Piper TTS │
│ LED Feedback │<───────────────│ │
└─────────────────┘ Response └──────────────────────┘
Pros:
- Lower latency (~50ms wake detection)
- Less network traffic
- Works even if server is down
- Better privacy (no continuous streaming)
Cons:
- Complex model conversion (TensorFlow → ONNX → KMODEL)
- Limited by K210 compute
- Harder to update models
- Requires careful optimization
Recommended Approach: Start with Server-Side
Begin with server-side detection on Heimdall, then optimize to edge detection once everything works.
Phase 1: Mycroft Precise Setup on Heimdall
Install Mycroft Precise
# SSH to Heimdall
ssh alan@10.1.10.71
# Create conda environment for Precise
conda create -n precise python=3.7 -y
conda activate precise
# Install TensorFlow 1.x (Precise requires this; inside the conda env
# the --break-system-packages flag is not needed)
pip install tensorflow==1.15.5
# Install Precise
pip install mycroft-precise
# Install audio dependencies
sudo apt-get install -y portaudio19-dev sox libatlas-base-dev
# Install precise-engine (for faster inference)
wget https://github.com/MycroftAI/mycroft-precise/releases/download/v0.3.0/precise-engine_0.3.0_x86_64.tar.gz
tar xvf precise-engine_0.3.0_x86_64.tar.gz
sudo cp precise-engine/precise-engine /usr/local/bin/
sudo chmod +x /usr/local/bin/precise-engine
Verify Installation
precise-engine --version
# Should output: Precise v0.3.0
precise-listen --help
# Should show help text
Phase 2: Training Your Custom Wake Word
Step 1: Collect Wake Word Samples
You'll need ~50-100 samples of your wake word. Choose something:
- 3-4 syllables long
- Easy to pronounce
- Unlikely to occur in normal speech
Example wake words:
- "Hey Computer" (recommended - similar to commercial products)
- "Okay Jarvis"
- "Hello Assistant"
- "Activate Assistant"
# Create project directory
mkdir -p ~/precise-models/hey-computer
cd ~/precise-models/hey-computer
# Record wake word samples
precise-collect
When prompted:
- Type your wake word ("hey computer")
- Press SPACE to record
- Say the wake word clearly
- Press SPACE to stop
- Repeat 50-100 times
Tips for good samples:
- Vary your tone and speed
- Different distances from mic
- Different background noise levels
- Different pronunciations
- Have family members record too
Step 2: Collect "Not Wake Word" Samples
Record background audio and similar-sounding phrases:
# Create not-wake-word directory
mkdir -p not-wake-word
# Record random speech, music, TV, etc.
# These help the model learn what NOT to trigger on
precise-collect -f not-wake-word/random.wav
Collect ~200-500 samples of:
- Normal conversation
- TV/music in background
- Similar sounding phrases ("hey commuter", "they computed", etc.)
- Ambient noise
- Other household sounds
Step 3: Generate Training Data
# Organize samples
mkdir -p hey-computer/{wake-word,not-wake-word,test/wake-word,test/not-wake-word}
# Split samples (80% train, 20% test)
# Move 80% of wake-word samples to hey-computer/wake-word/
# Move 20% to hey-computer/test/wake-word/
# Move 80% of not-wake-word to hey-computer/not-wake-word/
# Move 20% to hey-computer/test/not-wake-word/
# Optional: incremental training (retrains while folding false
# activations from the not-wake-word data back into the training set)
precise-train-incremental hey-computer.net hey-computer/
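The manual 80/20 split described in the comments above can be scripted. A minimal sketch (`split_samples` is a hypothetical helper; it assumes a flat directory of .wav recordings and the directory layout used in this guide):

```python
import random
import shutil
from pathlib import Path

def split_samples(src_dir, train_dir, test_dir, test_fraction=0.2, seed=0):
    """Move a random test_fraction of .wav files from src_dir into
    test_dir, and the remainder into train_dir."""
    files = sorted(Path(src_dir).glob("*.wav"))
    random.Random(seed).shuffle(files)
    n_test = int(len(files) * test_fraction)
    Path(train_dir).mkdir(parents=True, exist_ok=True)
    Path(test_dir).mkdir(parents=True, exist_ok=True)
    for f in files[:n_test]:
        shutil.move(str(f), str(Path(test_dir) / f.name))
    for f in files[n_test:]:
        shutil.move(str(f), str(Path(train_dir) / f.name))

# Usage (assumes raw recordings collected into recordings/wake-word):
# split_samples("recordings/wake-word",
#               "hey-computer/wake-word",
#               "hey-computer/test/wake-word")
```

The fixed seed makes the split reproducible, so retraining runs see the same train/test partition.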
Step 4: Train the Model
# Basic training (will take 30-60 minutes)
precise-train -e 60 hey-computer.net hey-computer/
# For better accuracy, train longer
precise-train -e 120 hey-computer.net hey-computer/
# Watch for overfitting - validation loss should decrease
# Stop if validation loss starts increasing
Training output will show:
Epoch 1/60
loss: 0.4523 - val_loss: 0.3891
Epoch 2/60
loss: 0.3102 - val_loss: 0.2845
...
Step 5: Test the Model
# Test with microphone
precise-listen hey-computer.net
# Speak your wake word - should see "!" when detected
# Speak other phrases - should not trigger
# Test with audio files
precise-test hey-computer.net hey-computer/test/
# Should show accuracy metrics:
# Wake word accuracy: 95%+
# False positive rate: <5%
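Those two numbers come from simple confusion counts. A small helper makes the calculation explicit (the helper name and example counts are illustrative, not precise-test output):

```python
def wake_word_metrics(tp, fn, fp, tn):
    """Compute the two numbers that matter for a wake word model:
    recall on wake-word clips (how often it fires when it should) and
    false positive rate (how often it fires when it shouldn't)."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return recall, fpr

# e.g. 19 of 20 wake-word clips detected, 3 of 100 negatives triggered:
# recall = 0.95, false positive rate = 0.03
```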
Step 6: Optimize Sensitivity
# Adjust activation sensitivity (precise-listen's -s flag, default 0.5)
precise-listen hey-computer.net -s 0.5  # Default
precise-listen hey-computer.net -s 0.3  # More conservative
precise-listen hey-computer.net -s 0.7  # More aggressive
# Find the optimal sensitivity for your use case
# Lower = fewer false positives, more missed detections
# Higher = more false positives, fewer missed detections
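Whether you tune sensitivity or a raw decision threshold, the tradeoff can be explored offline if you save the model's confidence score for each labeled test clip. A sketch (hypothetical `(score, label)` pairs; `sweep_thresholds` is not part of Precise):

```python
def sweep_thresholds(scored, thresholds):
    """scored: list of (confidence, is_wake_word) pairs from running the
    model over labeled test clips. Returns per-threshold counts of
    (false_positives, false_negatives) so you can pick a balance point."""
    results = {}
    for t in thresholds:
        fp = sum(1 for s, label in scored if s >= t and not label)
        fn = sum(1 for s, label in scored if s < t and label)
        results[t] = (fp, fn)
    return results
```

Raising the threshold can only shrink the false-positive count and grow the false-negative count, so sweeping a handful of values quickly shows the operating point that suits your environment.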
Phase 3: Integration with Voice Server
Update voice_server.py
Add Mycroft Precise support to the server:
# Add to imports
from precise_runner import PreciseEngine, PreciseRunner
import pyaudio

# Add to configuration (precise-engine needs a .pb model; create it from
# the trained .net with: precise-convert hey-computer.net)
PRECISE_MODEL = os.getenv("PRECISE_MODEL",
                          "/home/alan/precise-models/hey-computer.pb")
PRECISE_SENSITIVITY = float(os.getenv("PRECISE_SENSITIVITY", "0.5"))

# Global precise runner
precise_runner = None

def on_activation():
    """Called when wake word is detected"""
    print("Wake word detected!")
    # Trigger recording and processing
    # (Implementation depends on your audio streaming setup)

def start_precise_listener():
    """Start Mycroft Precise wake word detection"""
    global precise_runner
    engine = PreciseEngine(
        '/usr/local/bin/precise-engine',
        PRECISE_MODEL
    )
    precise_runner = PreciseRunner(
        engine,
        sensitivity=PRECISE_SENSITIVITY,
        on_activation=on_activation
    )
    precise_runner.start()
    print(f"Precise listening with model: {PRECISE_MODEL}")
Server-Side Wake Word Detection Architecture
For server-side detection, you need continuous audio streaming from Maix Duino:
# New endpoint for audio streaming (one possible implementation: write
# each POSTed chunk into a ReadWriteStream that PreciseRunner reads,
# by passing stream=audio_stream when constructing the runner)
from flask import request
from precise_runner import ReadWriteStream

audio_stream = ReadWriteStream()

@app.route('/stream', methods=['POST'])
def stream_audio():
    """
    Receive a chunk of raw 16 kHz 16-bit mono PCM audio and feed it
    to Mycroft Precise for wake word detection.
    """
    audio_stream.write(request.data)
    return '', 204
Phase 4: Maix Duino Integration (Server-Side Detection)
Update maix_voice_client.py
For server-side detection, stream audio continuously:
# Add to configuration
STREAM_ENDPOINT = "/stream"
WAKE_WORD_CHECK_INTERVAL = 0.1 # Check every 100ms
def stream_audio_continuous():
    """
    Stream audio to server for wake word detection
    Server will notify us when wake word is detected
    """
    import socket
    import struct

    # Create socket connection
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_addr = (VOICE_SERVER_URL.replace('http://', '').split(':')[0], 8888)

    try:
        sock.connect(server_addr)
        print("Connected to wake word server")

        while True:
            # Capture audio chunk
            chunk = i2s_dev.record(CHUNK_SIZE)
            if chunk:
                # Send chunk size first, then chunk
                sock.sendall(struct.pack('>I', len(chunk)))
                sock.sendall(chunk)

            # Check for wake word detection signal
            # (simplified - actual implementation needs non-blocking socket)
            time.sleep(0.01)
    except Exception as e:
        print(f"Streaming error: {e}")
    finally:
        sock.close()
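The server side needs a matching receiver that undoes the length-prefix framing used by the client above. A minimal sketch (helper names are hypothetical; port 8888 matches the client, and `handle_chunk` might write into PreciseRunner's audio stream):

```python
import socket
import struct

def recv_exact(conn, n):
    """Read exactly n bytes from the socket (TCP can deliver partial reads)."""
    buf = b""
    while len(buf) < n:
        part = conn.recv(n - len(buf))
        if not part:
            raise ConnectionError("client disconnected")
        buf += part
    return buf

def read_chunk(conn):
    """Read one chunk as framed by the client: a 4-byte big-endian
    length, then that many bytes of raw PCM audio."""
    (length,) = struct.unpack(">I", recv_exact(conn, 4))
    return recv_exact(conn, length)

def serve_wake_word(handle_chunk, host="0.0.0.0", port=8888):
    """Accept one client and feed every received chunk to handle_chunk."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(1)
        conn, addr = srv.accept()
        print(f"Client connected: {addr}")
        with conn:
            while True:
                handle_chunk(read_chunk(conn))
```

`recv_exact` matters here: a single `recv` call is not guaranteed to return the full chunk, so the loop reassembles it before handing the audio on.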
Phase 5: Edge Detection on Maix Duino (Advanced)
Convert Precise Model to KMODEL
This is complex and requires several conversion steps:
# Step 1: Convert the trained model to ONNX
# (Precise saves .net models in Keras HDF5 format, so use tf2onnx's
#  --keras input rather than --saved-model)
pip install tf2onnx --break-system-packages
python -m tf2onnx.convert \
    --keras hey-computer.net \
    --output hey-computer.onnx
# Step 2: Optimize ONNX model
# (note: onnx.optimizer was removed in onnx 1.9+; on newer versions,
#  pip install onnxoptimizer and import onnxoptimizer instead)
pip install onnx --break-system-packages
python -c "
import onnx
from onnx import optimizer
model = onnx.load('hey-computer.onnx')
passes = ['eliminate_deadend', 'eliminate_identity',
          'eliminate_nop_dropout', 'eliminate_nop_pad']
optimized = optimizer.optimize(model, passes)
onnx.save(optimized, 'hey-computer-opt.onnx')
"
# Step 3: Convert ONNX to KMODEL (for K210)
# Use nncase (https://github.com/kendryte/nncase)
# This step is hardware-specific and complex
# Install nncase
pip install nncase --break-system-packages
# Convert (adjust parameters based on your model)
ncc compile hey-computer-opt.onnx \
-i onnx \
--dataset calibration_data \
-o hey-computer.kmodel \
--target k210
Note: KMODEL conversion is non-trivial and may require model architecture adjustments. The K210 has limitations:
- Max model size: ~6MB
- Limited operator support
- Quantization required for performance
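Before attempting a conversion, you can sanity-check the ~6MB budget from the parameter count alone, since nncase quantizes weights to int8 (roughly one byte per parameter). This is a rough estimate only; it ignores activation buffers and model metadata, and the helper names are hypothetical:

```python
def kmodel_size_estimate_mb(param_count, bytes_per_param=1):
    """Rough size estimate for an int8-quantized model: ~1 byte per
    parameter, ignoring activation buffers and metadata overhead."""
    return param_count * bytes_per_param / (1024 * 1024)

def fits_k210(param_count, budget_mb=6.0):
    """Check the estimate against the ~6MB K210 model budget."""
    return kmodel_size_estimate_mb(param_count) <= budget_mb

# A small GRU-based wake word net of a few hundred thousand parameters
# fits comfortably; a 10M-parameter model does not.
```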
Testing KMODEL on Maix Duino
# Load model in maix_voice_client.py
import KPU as kpu

def load_wake_word_model_kmodel():
    """Load converted KMODEL for wake word detection"""
    global kpu_task
    try:
        kpu_task = kpu.load("/sd/models/hey-computer.kmodel")
        print("Wake word model loaded on K210")
        return True
    except Exception as e:
        print(f"Failed to load model: {e}")
        return False

def detect_wake_word_kmodel():
    """Run wake word detection using K210 KPU"""
    global kpu_task

    # Capture audio
    audio_chunk = i2s_dev.record(CHUNK_SIZE)

    # Preprocess for model (depends on model input format)
    # This is model-specific - adjust based on your training
    features = preprocess_audio(audio_chunk)

    # Run inference (kpu.forward is MaixPy's generic KPU inference call;
    # kpu.run_yolo2 is only for YOLO object detection models)
    fmap = kpu.forward(kpu_task, features)
    output = fmap[:]

    # Check confidence
    if output[0] > WAKE_WORD_THRESHOLD:
        return True
    return False
Recommended Wake Words
Based on testing and community feedback:
Best performers:
- "Hey Computer" - Clear, distinct, four syllables, hard consonants
- "Okay Jarvis" - Pop culture reference, easy to say
- "Hey Mycroft" - Original Mycroft wake word (lots of training data available)
Avoid:
- Single syllable words (too easy to trigger)
- Common phrases ("okay", "hey there")
- Names of people in your household
- Words that sound like common speech patterns
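These rules of thumb can be turned into a rough pre-check. The sketch below is a crude heuristic only (syllables approximated as vowel groups, function names hypothetical), not a substitute for actually testing a candidate wake word:

```python
import re

def estimate_syllables(phrase):
    """Crude syllable count: number of vowel groups per word."""
    return sum(len(re.findall(r"[aeiouy]+", w.lower()))
               for w in phrase.split())

def screen_wake_word(phrase, household_names=()):
    """Apply the rules of thumb above; returns a list of warnings."""
    warnings = []
    if estimate_syllables(phrase) < 3:
        warnings.append("too short: aim for 3+ syllables")
    if phrase.lower() in {"okay", "hey", "hey there"}:
        warnings.append("too common in normal speech")
    for name in household_names:
        if name.lower() in phrase.lower():
            warnings.append(f"contains household name: {name}")
    return warnings
```

For example, "hey computer" passes cleanly, while a one-syllable phrase or one containing a family member's name is flagged.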
Training Tips
For Best Accuracy
1. Diverse training data:
   - Multiple speakers
   - Various distances (1ft to 15ft)
   - Different noise conditions
   - Accent variations
2. Quality over quantity:
   - 50 good samples > 200 poor samples
   - Clear pronunciation
   - Consistent volume
3. Hard negatives:
   - Include similar-sounding phrases
   - Include partial wake words
   - Include common false triggers you notice
4. Regular retraining:
   - Add false positives to training set
   - Add missed detections
   - Retrain every few weeks initially
Collecting Hard Negatives
# Run Precise in test mode and collect false positives
precise-listen hey-computer.net --save-false-positives
# This will save audio clips when model triggers incorrectly
# Add these to your not-wake-word training set
# Retrain to reduce false positives
Performance Benchmarks
Server-Side Detection (Heimdall)
- Latency: 100-200ms from utterance to detection
- Accuracy: 95%+ with good training
- False positive rate: <1 per hour with tuning
- CPU usage: ~5-10% (single core)
- Network: ~256kbps for a raw 16kHz, 16-bit mono PCM stream
Edge Detection (Maix Duino)
- Latency: 50-100ms
- Accuracy: 80-90% (limited by K210 quantization)
- False positive rate: Varies by model optimization
- CPU usage: ~30% K210 (leaves room for other tasks)
- Network: 0 until wake detected
Monitoring and Debugging
Log Wake Word Detections
# Add to voice_server.py
import datetime

def log_wake_word(confidence, timestamp=None):
    """Log wake word detections for analysis"""
    if timestamp is None:
        timestamp = datetime.datetime.now()

    log_file = "/home/alan/voice-assistant/logs/wake_words.log"
    with open(log_file, 'a') as f:
        f.write(f"{timestamp.isoformat()},{confidence}\n")
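Because each log line is just `timestamp,confidence`, offline analysis is straightforward. A sketch that buckets detections by hour, useful for spotting bursts of false positives (the helper name is hypothetical):

```python
import datetime
from collections import Counter

def detections_per_hour(lines):
    """Parse 'timestamp,confidence' log lines and count detections
    bucketed by hour, to spot bursts of false positives."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        ts_str, _conf = line.split(",", 1)
        ts = datetime.datetime.fromisoformat(ts_str)
        counts[ts.replace(minute=0, second=0, microsecond=0)] += 1
    return counts

# Usage:
# with open("/home/alan/voice-assistant/logs/wake_words.log") as f:
#     for hour, n in sorted(detections_per_hour(f).items()):
#         print(hour, n)
```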
Analyze False Positives
# Check wake word log
tail -f ~/voice-assistant/logs/wake_words.log
# Summarize detection confidences (log lines are "timestamp,confidence")
awk -F',' '{print $2}' ~/voice-assistant/logs/wake_words.log | \
  sort -n | uniq -c
Production Deployment
Systemd Service with Precise
Update the systemd service to include Precise:
[Unit]
Description=Voice Assistant with Wake Word Detection
After=network.target
[Service]
Type=simple
User=alan
WorkingDirectory=/home/alan/voice-assistant
Environment="PATH=/home/alan/miniconda3/envs/precise/bin:/usr/local/bin:/usr/bin:/bin"
EnvironmentFile=/home/alan/voice-assistant/config/.env
ExecStart=/home/alan/miniconda3/envs/precise/bin/python voice_server.py --enable-precise
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target
Troubleshooting
Precise Won't Start
# Check TensorFlow version
python -c "import tensorflow as tf; print(tf.__version__)"
# Should be 1.15.x
# Check model file
file hey-computer.net
# Should report HDF5 data (Precise saves models in Keras HDF5 format)
# Convert for the engine, then test it
precise-convert hey-computer.net
precise-engine hey-computer.pb
# Should load without errors
Low Accuracy
- Collect more training data - Especially hard negatives
- Increase training epochs - Try 200-300 epochs
- Verify training/test split - Should be 80/20
- Check audio quality - Sample rate should match (16kHz)
- Try different wake words - Some are easier to detect
High False Positive Rate
- Increase threshold - Try 0.6, 0.7, 0.8
- Add false positives to training - Retrain with false triggers
- Collect more negative samples - Expand not-wake-word set
- Use ensemble models - Run multiple models, require agreement
KMODEL Conversion Fails
This is expected - K210 conversion is complex:
- Simplify model architecture - Reduce layer count
- Use quantization-aware training - Train with quantization in mind
- Check operator support - K210 doesn't support all TF ops
- Consider alternatives:
- Use pre-trained models for K210
- Stick with server-side detection
- Use Porcupine instead (has K210 support)
Alternative: Use Pre-trained Models
Mycroft provides some pre-trained models:
# Download Hey Mycroft model
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz
# Test it
precise-listen hey-mycroft.net
Then train your own wake word starting from this base:
# Fine-tune from the pre-trained model (precise-train resumes training
# from an existing model file, so start from a copy)
cp hey-mycroft.net my-wake-word.net
precise-train -e 60 my-wake-word.net my-wake-word/
Next Steps
- Start with server-side - Get it working on Heimdall first
- Collect good training data - Quality samples are key
- Test and tune threshold - Find the sweet spot for your environment
- Monitor performance - Track false positives and misses
- Iterate on training - Add hard examples, retrain
- Consider edge deployment - Once server-side is solid
Resources
- Mycroft Precise Docs: https://github.com/MycroftAI/mycroft-precise
- Training Guide: https://mycroft-ai.gitbook.io/docs/mycroft-technologies/precise
- Community Models: https://github.com/MycroftAI/precise-data
- K210 Docs: https://canaan-creative.com/developer
- nncase: https://github.com/kendryte/nncase
Conclusion
Mycroft Precise gives you full control over your wake word detection with complete privacy. Start with server-side detection for easier development, then optimize to edge detection once you have a well-trained model.
The key to success is good training data - invest time in collecting diverse, high-quality samples!