Mycroft Precise Wake Word Training Guide

Overview

Mycroft Precise is a neural network-based wake word detector that you can train on custom wake words. This guide covers two deployment approaches for your Maix Duino voice assistant:

  1. Server-side detection (Recommended to start) - Run Precise on Heimdall
  2. Edge detection (Advanced) - Convert model for K210 on Maix Duino

Architecture Options

Option A: Server-Side Detection on Heimdall (Recommended)

Maix Duino                          Heimdall
┌─────────────────┐                ┌──────────────────────┐
│ Continuous      │ Audio Stream   │ Mycroft Precise      │
│ Audio Capture   │───────────────>│ Wake Word Detection  │
│                 │                │                      │
│ LED Feedback    │<───────────────│ Whisper STT          │
│ Speaker Output  │   Response     │ HA Integration       │
│                 │                │ Piper TTS            │
└─────────────────┘                └──────────────────────┘

Pros:

  • Easier setup and debugging
  • Better accuracy (more compute available)
  • Easy to retrain and update models
  • Can use ensemble models

Cons:

  • Continuous audio streaming (bandwidth)
  • Slightly higher latency (~100-200ms)
  • Requires stable network

Option B: Edge Detection on Maix Duino (Advanced)

Maix Duino                          Heimdall
┌─────────────────┐                ┌──────────────────────┐
│ Precise Model   │                │                      │
│ (K210 KPU)      │                │                      │
│ Wake Detection  │ Audio (on wake)│ Whisper STT          │
│                 │───────────────>│ HA Integration       │
│ Audio Capture   │                │ Piper TTS            │
│ LED Feedback    │<───────────────│                      │
└─────────────────┘   Response     └──────────────────────┘

Pros:

  • Lower latency (~50ms wake detection)
  • Less network traffic
  • Works even if server is down
  • Better privacy (no continuous streaming)

Cons:

  • Complex model conversion (TensorFlow → ONNX → KMODEL)
  • Limited by K210 compute
  • Harder to update models
  • Requires careful optimization

Begin with server-side detection on Heimdall, then optimize to edge detection once everything works.

Phase 1: Mycroft Precise Setup on Heimdall

Install Mycroft Precise

# SSH to Heimdall
ssh alan@10.1.10.71

# Create conda environment for Precise
conda create -n precise python=3.7 -y
conda activate precise

# Install TensorFlow 1.x (Precise requires this)
pip install tensorflow==1.15.5

# Install Precise
pip install mycroft-precise

# Install audio dependencies
sudo apt-get install -y portaudio19-dev sox libatlas-base-dev

# Install precise-engine (for faster inference)
wget https://github.com/MycroftAI/mycroft-precise/releases/download/v0.3.0/precise-engine_0.3.0_x86_64.tar.gz
tar xvf precise-engine_0.3.0_x86_64.tar.gz
sudo cp precise-engine/precise-engine /usr/local/bin/
sudo chmod +x /usr/local/bin/precise-engine

Verify Installation

precise-engine --version
# Should output: Precise v0.3.0

precise-listen --help
# Should show help text

Phase 2: Training Your Custom Wake Word

Step 1: Collect Wake Word Samples

You'll need ~50-100 samples of your wake word. Choose something:

  • 3-4 syllables long
  • Easy to pronounce
  • Unlikely to occur in normal speech

Example wake words:

  • "Hey Computer" (recommended - similar to commercial products)
  • "Okay Jarvis"
  • "Hello Assistant"
  • "Activate Assistant"

# Create project directory
mkdir -p ~/precise-models/hey-computer
cd ~/precise-models/hey-computer

# Record wake word samples
precise-collect

When prompted:

  1. Type your wake word ("hey computer")
  2. Press SPACE to record
  3. Say the wake word clearly
  4. Press SPACE to stop
  5. Repeat 50-100 times

Tips for good samples:

  • Vary your tone and speed
  • Different distances from mic
  • Different background noise levels
  • Different pronunciations
  • Have family members record too

Step 2: Collect "Not Wake Word" Samples

Record background audio and similar-sounding phrases:

# Create not-wake-word directory
mkdir -p not-wake-word

# Record random speech, music, TV, etc.
# These help the model learn what NOT to trigger on
precise-collect -f not-wake-word/random.wav

Collect ~200-500 samples of:

  • Normal conversation
  • TV/music in background
  • Similar sounding phrases ("hey commuter", "they computed", etc.)
  • Ambient noise
  • Other household sounds

Step 3: Organize Training Data

# Organize samples
mkdir -p hey-computer/{wake-word,not-wake-word,test/wake-word,test/not-wake-word}

# Split samples (80% train, 20% test)
# Move 80% of wake-word samples to hey-computer/wake-word/
# Move 20% to hey-computer/test/wake-word/
# Move 80% of not-wake-word to hey-computer/not-wake-word/
# Move 20% to hey-computer/test/not-wake-word/

# Incremental training: retrains while automatically mining false
# activations from extra audio (most useful after an initial precise-train run)
precise-train-incremental hey-computer.net hey-computer/
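
The manual 80/20 split described in the comments above can be scripted. A minimal sketch, assuming your recorded clips are plain .wav files collected in one source directory (function and directory names are illustrative):

```python
import os
import random
import shutil

def split_samples(src_dir, train_dir, test_dir, test_fraction=0.2, seed=0):
    """Move a random test_fraction of the .wav files in src_dir into
    test_dir, leaving the rest in train_dir (the 80/20 split above)."""
    os.makedirs(train_dir, exist_ok=True)
    os.makedirs(test_dir, exist_ok=True)
    wavs = sorted(f for f in os.listdir(src_dir) if f.endswith(".wav"))
    random.Random(seed).shuffle(wavs)  # fixed seed keeps the split reproducible
    n_test = int(len(wavs) * test_fraction)
    for name in wavs[:n_test]:
        shutil.move(os.path.join(src_dir, name), os.path.join(test_dir, name))
    for name in wavs[n_test:]:
        shutil.move(os.path.join(src_dir, name), os.path.join(train_dir, name))
    return len(wavs) - n_test, n_test
```

Run it once per class, e.g. for wake-word samples into hey-computer/wake-word/ and hey-computer/test/wake-word/, then again for the not-wake-word set.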

Step 4: Train the Model

# Basic training (will take 30-60 minutes)
precise-train -e 60 hey-computer.net hey-computer/

# For better accuracy, train longer
precise-train -e 120 hey-computer.net hey-computer/

# Watch for overfitting - validation loss should decrease
# Stop if validation loss starts increasing

Training output will show:

Epoch 1/60
loss: 0.4523 - val_loss: 0.3891
Epoch 2/60
loss: 0.3102 - val_loss: 0.2845
...

Step 5: Test the Model

# Test with microphone
precise-listen hey-computer.net

# Speak your wake word - should see "!" when detected
# Speak other phrases - should not trigger

# Test with audio files
precise-test hey-computer.net hey-computer/test/

# Should show accuracy metrics:
# Wake word accuracy: 95%+
# False positive rate: <5%

Step 6: Optimize Sensitivity

# Adjust activation threshold
precise-listen hey-computer.net -t 0.5   # Default
precise-listen hey-computer.net -t 0.7   # More conservative
precise-listen hey-computer.net -t 0.3   # More aggressive

# Find optimal threshold for your use case
# Higher = fewer false positives, more false negatives
# Lower = more false positives, fewer false negatives
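
Rather than trying thresholds by hand, you can pick one systematically: score a set of labeled clips with the model, then sweep every candidate threshold. A sketch; the score lists and the fp_weight parameter are assumptions you would fill in from your own test runs, not part of Precise itself:

```python
def best_threshold(pos_scores, neg_scores, fp_weight=2.0):
    """Sweep candidate thresholds and return the one minimizing weighted
    errors: false positives count fp_weight times as much as false
    negatives (raise fp_weight if spurious triggers annoy you more)."""
    candidates = sorted(set(pos_scores) | set(neg_scores))
    best_t, best_cost = 0.5, float("inf")
    for t in candidates:
        fn = sum(1 for s in pos_scores if s < t)    # missed wake words
        fp = sum(1 for s in neg_scores if s >= t)   # false triggers
        cost = fn + fp_weight * fp
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t
```

With scores like [0.8, 0.9, 0.7] on wake-word clips and [0.1, 0.2, 0.6] on negatives, this picks 0.7: the lowest threshold that rejects every negative without missing a positive.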

Phase 3: Integration with Voice Server

Update voice_server.py

Add Mycroft Precise support to the server:

# Add to imports
from precise_runner import PreciseEngine, PreciseRunner
import pyaudio

# Add to configuration
PRECISE_MODEL = os.getenv("PRECISE_MODEL", 
                          "/home/alan/precise-models/hey-computer.net")
PRECISE_SENSITIVITY = float(os.getenv("PRECISE_SENSITIVITY", "0.5"))

# Global precise runner
precise_runner = None

def on_activation():
    """Called when wake word is detected"""
    print("Wake word detected!")
    # Trigger recording and processing
    # (Implementation depends on your audio streaming setup)

def start_precise_listener():
    """Start Mycroft Precise wake word detection"""
    global precise_runner
    
    engine = PreciseEngine(
        '/usr/local/bin/precise-engine',
        PRECISE_MODEL
    )
    
    precise_runner = PreciseRunner(
        engine,
        sensitivity=PRECISE_SENSITIVITY,
        on_activation=on_activation
    )
    
    precise_runner.start()
    print(f"Precise listening with model: {PRECISE_MODEL}")

Server-Side Wake Word Detection Architecture

For server-side detection, you need continuous audio streaming from Maix Duino:

# New endpoint for audio streaming
@app.route('/stream', methods=['POST'])
def stream_audio():
    """
    Receive a chunk of the continuous audio stream for wake word detection

    Chunks are written into the stream PreciseRunner reads from, so
    Precise scores audio as it arrives. Sketch: assumes precise_runner
    was created with PreciseRunner(engine, stream=audio_stream), where
    audio_stream is a precise_runner.ReadWriteStream shared with this handler.
    """
    audio_stream.write(request.data)
    return '', 204

Phase 4: Maix Duino Integration (Server-Side Detection)

Update maix_voice_client.py

For server-side detection, stream audio continuously:

# Add to configuration
STREAM_ENDPOINT = "/stream"
WAKE_WORD_CHECK_INTERVAL = 0.1  # Check every 100ms

def stream_audio_continuous():
    """
    Stream audio to server for wake word detection
    
    Server will notify us when wake word is detected
    """
    import socket
    import struct
    
    # Create socket connection
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_addr = (VOICE_SERVER_URL.replace('http://', '').split(':')[0], 8888)
    
    try:
        sock.connect(server_addr)
        print("Connected to wake word server")
        
        while True:
            # Capture audio chunk
            chunk = i2s_dev.record(CHUNK_SIZE)
            
            if chunk:
                # Send chunk size first, then chunk
                sock.sendall(struct.pack('>I', len(chunk)))
                sock.sendall(chunk)
            
            # Check for wake word detection signal
            # (simplified - actual implementation needs non-blocking socket)
            
            time.sleep(0.01)
    
    except Exception as e:
        print(f"Streaming error: {e}")
    finally:
        sock.close()
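
On the server side, the matching receiver has to read exactly the 4-byte big-endian length header, then exactly that many payload bytes; socket recv can return short reads, so both must loop. A minimal sketch of this framing (function names are illustrative, not part of voice_server.py):

```python
import struct

def read_exact(sock, n):
    """Read exactly n bytes from a socket-like object, or None on EOF."""
    buf = b""
    while len(buf) < n:
        part = sock.recv(n - len(buf))
        if not part:
            return None
        buf += part
    return buf

def iter_audio_chunks(sock):
    """Yield audio chunks framed as a 4-byte big-endian length prefix
    followed by that many bytes -- the framing the Maix client sends."""
    while True:
        header = read_exact(sock, 4)
        if header is None:
            return
        (size,) = struct.unpack(">I", header)
        chunk = read_exact(sock, size)
        if chunk is None:
            return
        yield chunk
```

Each yielded chunk can then be fed straight into the Precise stream on the server.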

Phase 5: Edge Detection on Maix Duino (Advanced)

Convert Precise Model to KMODEL

This is complex and requires several conversion steps:

# Step 1: Convert TensorFlow model to ONNX
pip install tf2onnx --break-system-packages

python -m tf2onnx.convert \
    --saved-model hey-computer.net \
    --output hey-computer.onnx

# Step 2: Optimize ONNX model
pip install onnx --break-system-packages

python -c "
import onnx
from onnx import optimizer

model = onnx.load('hey-computer.onnx')
passes = ['eliminate_deadend', 'eliminate_identity', 
          'eliminate_nop_dropout', 'eliminate_nop_pad']
optimized = optimizer.optimize(model, passes)
onnx.save(optimized, 'hey-computer-opt.onnx')
"

# Step 3: Convert ONNX to KMODEL (for K210)
# Use nncase (https://github.com/kendryte/nncase)
# This step is hardware-specific and complex

# Install nncase
pip install nncase --break-system-packages

# Convert (adjust parameters based on your model)
ncc compile hey-computer-opt.onnx \
    -i onnx \
    --dataset calibration_data \
    -o hey-computer.kmodel \
    --target k210

Note: KMODEL conversion is non-trivial and may require model architecture adjustments. The K210 has limitations:

  • Max model size: ~6MB
  • Limited operators support
  • Quantization required for performance
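
Before attempting a conversion, a back-of-envelope check against the ~6MB budget can save time. This sketch assumes int8 quantization at roughly one byte per weight and ignores activation memory and graph overhead, so treat it as an optimistic lower bound:

```python
def fits_k210(num_params, bytes_per_param=1, budget_bytes=6 * 1024 * 1024):
    """Rough check: does an int8-quantized model (~1 byte per weight,
    ignoring activations and graph overhead) fit the ~6MB K210 budget?"""
    return num_params * bytes_per_param <= budget_bytes
```

A 1M-parameter model passes comfortably; a 10M-parameter model will not fit even before overhead.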

Testing KMODEL on Maix Duino

# Load model in maix_voice_client.py
import KPU as kpu

def load_wake_word_model_kmodel():
    """Load converted KMODEL for wake word detection"""
    global kpu_task
    
    try:
        kpu_task = kpu.load("/sd/models/hey-computer.kmodel")
        print("Wake word model loaded on K210")
        return True
    except Exception as e:
        print(f"Failed to load model: {e}")
        return False

def detect_wake_word_kmodel():
    """Run wake word detection using K210 KPU"""
    global kpu_task
    
    # Capture audio
    audio_chunk = i2s_dev.record(CHUNK_SIZE)
    
    # Preprocess for model (depends on model input format)
    # This is model-specific - adjust based on your training
    
    # Run inference
    features = preprocess_audio(audio_chunk)
    fmap = kpu.forward(kpu_task, features)  # generic KPU inference (run_yolo2 is for object detection)
    output = fmap[:]  # copy feature map values into a list
    
    # Check confidence
    if output[0] > WAKE_WORD_THRESHOLD:
        return True
    
    return False

Wake Word Recommendations

Based on testing and community feedback:

Best performers:

  1. "Hey Computer" - Clear and distinct, with hard consonants
  2. "Okay Jarvis" - Pop culture reference, easy to say
  3. "Hey Mycroft" - Original Mycroft wake word (lots of training data available)

Avoid:

  • Single syllable words (too easy to trigger)
  • Common phrases ("okay", "hey there")
  • Names of people in your household
  • Words that sound like common speech patterns

Training Tips

For Best Accuracy

  1. Diverse training data:

    • Multiple speakers
    • Various distances (1ft to 15ft)
    • Different noise conditions
    • Accent variations
  2. Quality over quantity:

    • 50 good samples > 200 poor samples
    • Clear pronunciation
    • Consistent volume
  3. Hard negatives:

    • Include similar-sounding phrases
    • Include partial wake words
    • Include common false triggers you notice
  4. Regular retraining:

    • Add false positives to training set
    • Add missed detections
    • Retrain every few weeks initially

Collecting Hard Negatives

# Run Precise in listen mode, saving clips that trigger incorrectly
precise-listen hey-computer.net -d false-positives/

# This will save audio clips when model triggers incorrectly
# Add these to your not-wake-word training set
# Retrain to reduce false positives

Performance Benchmarks

Server-Side Detection (Heimdall)

  • Latency: 100-200ms from utterance to detection
  • Accuracy: 95%+ with good training
  • False positive rate: <1 per hour with tuning
  • CPU usage: ~5-10% (single core)
  • Network: ~256 kbps continuous stream (16 kHz × 16-bit mono, uncompressed)

Edge Detection (Maix Duino)

  • Latency: 50-100ms
  • Accuracy: 80-90% (limited by K210 quantization)
  • False positive rate: Varies by model optimization
  • CPU usage: ~30% K210 (leaves room for other tasks)
  • Network: 0 until wake detected

Monitoring and Debugging

Log Wake Word Detections

# Add to voice_server.py
import datetime

def log_wake_word(confidence, timestamp=None):
    """Log wake word detections for analysis"""
    if timestamp is None:
        timestamp = datetime.datetime.now()
    
    log_file = "/home/alan/voice-assistant/logs/wake_words.log"
    
    with open(log_file, 'a') as f:
        f.write(f"{timestamp.isoformat()},{confidence}\n")

Analyze False Positives

# Check wake word log
tail -f ~/voice-assistant/logs/wake_words.log

# Find patterns in false positives
grep "wake_word" ~/voice-assistant/logs/wake_words.log | \
    awk -F',' '{print $2}' | \
    sort -n | uniq -c
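
The log format written by log_wake_word above (ISO timestamp, confidence per line) is just as easy to summarize in Python; a sketch, with illustrative dict keys:

```python
import statistics

def summarize_wake_log(path):
    """Parse a wake_words.log of 'timestamp,confidence' lines and
    report detection count plus confidence statistics."""
    confidences = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            _, conf = line.rsplit(",", 1)  # timestamp may itself contain no commas
            confidences.append(float(conf))
    if not confidences:
        return {"count": 0}
    return {
        "count": len(confidences),
        "mean": statistics.mean(confidences),
        "min": min(confidences),
        "max": max(confidences),
    }
```

A cluster of detections with confidence barely above your threshold is a hint to raise it or collect hard negatives.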

Production Deployment

Systemd Service with Precise

Update the systemd service to include Precise:

[Unit]
Description=Voice Assistant with Wake Word Detection
After=network.target

[Service]
Type=simple
User=alan
WorkingDirectory=/home/alan/voice-assistant
Environment="PATH=/home/alan/miniconda3/envs/precise/bin:/usr/local/bin:/usr/bin:/bin"
EnvironmentFile=/home/alan/voice-assistant/config/.env
ExecStart=/home/alan/miniconda3/envs/precise/bin/python voice_server.py --enable-precise
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target

Troubleshooting

Precise Won't Start

# Check TensorFlow version
python -c "import tensorflow as tf; print(tf.__version__)"
# Should be 1.15.x

# Check model file
file hey-computer.net
# Should be "Hierarchical Data Format (version 5) data" - Precise saves Keras HDF5

# Convert to the .pb format precise-engine expects, then test it
precise-convert hey-computer.net
precise-engine hey-computer.pb
# Should load without errors

Low Accuracy

  1. Collect more training data - Especially hard negatives
  2. Increase training epochs - Try 200-300 epochs
  3. Verify training/test split - Should be 80/20
  4. Check audio quality - Sample rate should match (16kHz)
  5. Try different wake words - Some are easier to detect
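
Point 4 is easy to automate for WAV files with the standard library alone; a small sketch:

```python
import wave

def check_sample_rate(path, expected_hz=16000):
    """Return True if the WAV file matches the sample rate the model
    was trained on (Precise expects 16 kHz mono audio)."""
    with wave.open(path, "rb") as w:
        return w.getframerate() == expected_hz
```

Running this over the training directories before a retrain catches clips recorded at the wrong rate, which silently degrade accuracy.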

High False Positive Rate

  1. Increase threshold - Try 0.6, 0.7, 0.8
  2. Add false positives to training - Retrain with false triggers
  3. Collect more negative samples - Expand not-wake-word set
  4. Use ensemble models - Run multiple models, require agreement
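
The ensemble idea in point 4 can be as simple as requiring agreement between independently trained models before activating; a sketch (threshold and min_agree are assumptions to tune for your setup):

```python
def ensemble_activated(scores, threshold=0.5, min_agree=2):
    """Treat the wake word as detected only when at least min_agree
    of the per-model confidence scores clear the threshold."""
    return sum(s >= threshold for s in scores) >= min_agree
```

With three models, requiring two to agree cuts false positives sharply while tolerating one model having a bad moment.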

KMODEL Conversion Fails

This is expected - K210 conversion is complex:

  1. Simplify model architecture - Reduce layer count
  2. Use quantization-aware training - Train with quantization in mind
  3. Check operator support - K210 doesn't support all TF ops
  4. Consider alternatives:
    • Use pre-trained models for K210
    • Stick with server-side detection
    • Use Porcupine instead (has K210 support)

Alternative: Use Pre-trained Models

Mycroft provides some pre-trained models:

# Download Hey Mycroft model
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz

# Test it
precise-listen hey-mycroft.net

Then train your own wake word starting from this base:

# Fine-tune from the pre-trained model: precise-train resumes from an
# existing model file, so start from a copy of it
cp hey-mycroft.net my-wake-word.net
precise-train -e 60 my-wake-word.net my-wake-word/

Next Steps

  1. Start with server-side - Get it working on Heimdall first
  2. Collect good training data - Quality samples are key
  3. Test and tune threshold - Find the sweet spot for your environment
  4. Monitor performance - Track false positives and misses
  5. Iterate on training - Add hard examples, retrain
  6. Consider edge deployment - Once server-side is solid

Conclusion

Mycroft Precise gives you full control over your wake word detection with complete privacy. Start with server-side detection for easier development, then optimize to edge detection once you have a well-trained model.

The key to success is good training data - invest time in collecting diverse, high-quality samples!