# Mycroft Precise Wake Word Training Guide

## Overview

Mycroft Precise is a neural network-based wake word detector that you can train on custom wake words. This guide covers two deployment approaches for your Maix Duino voice assistant:

1. **Server-side detection** (Recommended to start) - Run Precise on Heimdall
2. **Edge detection** (Advanced) - Convert the model for the K210 on the Maix Duino
## Architecture Options

### Option A: Server-Side Wake Word Detection (Recommended)

```
Maix Duino                          Heimdall
┌─────────────────┐                ┌──────────────────────┐
│ Continuous      │  Audio Stream  │ Mycroft Precise      │
│ Audio Capture   │───────────────>│ Wake Word Detection  │
│                 │                │                      │
│ LED Feedback    │<───────────────│ Whisper STT          │
│ Speaker Output  │    Response    │ HA Integration       │
│                 │                │ Piper TTS            │
└─────────────────┘                └──────────────────────┘
```

**Pros:**
- Easier setup and debugging
- Better accuracy (more compute available)
- Easy to retrain and update models
- Can use ensemble models

**Cons:**
- Continuous audio streaming (bandwidth)
- Slightly higher latency (~100-200 ms)
- Requires a stable network

### Option B: Edge Detection on Maix Duino (Advanced)

```
Maix Duino                          Heimdall
┌─────────────────┐                ┌──────────────────────┐
│ Precise Model   │                │                      │
│ (K210 KPU)      │                │                      │
│ Wake Detection  │ Audio (on wake)│ Whisper STT          │
│                 │───────────────>│ HA Integration       │
│ Audio Capture   │                │ Piper TTS            │
│ LED Feedback    │<───────────────│                      │
└─────────────────┘    Response    └──────────────────────┘
```

**Pros:**
- Lower latency (~50 ms wake detection)
- Less network traffic
- Works even if the server is down
- Better privacy (no continuous streaming)

**Cons:**
- Complex model conversion (TensorFlow → ONNX → KMODEL)
- Limited by K210 compute
- Harder to update models
- Requires careful optimization

## Recommended Approach: Start with Server-Side

Begin with server-side detection on Heimdall, then optimize to edge detection once everything works.

## Phase 1: Mycroft Precise Setup on Heimdall

### Install Mycroft Precise

```bash
# SSH to Heimdall
ssh alan@10.1.10.71

# Create a conda environment for Precise
conda create -n precise python=3.7 -y
conda activate precise

# Install TensorFlow 1.x (Precise requires this)
pip install tensorflow==1.15.5

# Install Precise
pip install mycroft-precise

# Install audio dependencies
sudo apt-get install -y portaudio19-dev sox libatlas-base-dev

# Install precise-engine (for faster inference)
wget https://github.com/MycroftAI/mycroft-precise/releases/download/v0.3.0/precise-engine_0.3.0_x86_64.tar.gz
tar xvf precise-engine_0.3.0_x86_64.tar.gz
sudo cp precise-engine/precise-engine /usr/local/bin/
sudo chmod +x /usr/local/bin/precise-engine
```

Note: `--break-system-packages` is unnecessary inside a conda environment (and older pips bundled with Python 3.7 don't recognize the flag), so the `pip install` commands above omit it.

### Verify Installation

```bash
precise-engine --version
# Should output: Precise v0.3.0

precise-listen --help
# Should show help text
```

## Phase 2: Training Your Custom Wake Word

### Step 1: Collect Wake Word Samples

You'll need ~50-100 samples of your wake word. Choose something:
- 2-3 syllables long
- Easy to pronounce
- Unlikely to occur in normal speech

Example wake words:
- "Hey Computer" (recommended - similar to commercial products)
- "Okay Jarvis"
- "Hello Assistant"
- "Activate Assistant"

```bash
# Create project directory
mkdir -p ~/precise-models/hey-computer
cd ~/precise-models/hey-computer

# Record wake word samples
precise-collect
```

When prompted:
1. Type your wake word ("hey computer")
2. Press SPACE to record
3. Say the wake word clearly
4. Press SPACE to stop
5. Repeat 50-100 times

**Tips for good samples:**
- Vary your tone and speed
- Different distances from the mic
- Different background noise levels
- Different pronunciations
- Have family members record too

### Step 2: Collect "Not Wake Word" Samples

Record background audio and similar-sounding phrases:

```bash
# Create not-wake-word directory
mkdir -p not-wake-word

# Record random speech, music, TV, etc.
# These help the model learn what NOT to trigger on
precise-collect -f not-wake-word/random.wav
```

Collect ~200-500 samples of:
- Normal conversation
- TV/music in the background
- Similar-sounding phrases ("hey commuter", "they computed", etc.)
- Ambient noise
- Other household sounds

### Step 3: Generate Training Data

```bash
# Organize samples
mkdir -p hey-computer/{wake-word,not-wake-word,test/wake-word,test/not-wake-word}

# Split samples (80% train, 20% test):
# - Move 80% of wake-word samples to hey-computer/wake-word/
# - Move 20% to hey-computer/test/wake-word/
# - Move 80% of not-wake-word samples to hey-computer/not-wake-word/
# - Move 20% to hey-computer/test/not-wake-word/

# Generate training data
precise-train-incremental hey-computer.net hey-computer/
```
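
The 80/20 split above can be scripted rather than done by hand. A minimal sketch (the function name, seed, and directory arguments are illustrative, not part of the Precise tooling):

```python
import random
import shutil
from pathlib import Path

def split_samples(src_dir, train_dir, test_dir, test_fraction=0.2, seed=42):
    """Move a random test_fraction of the WAV samples in src_dir into
    test_dir, and the remainder into train_dir."""
    files = sorted(Path(src_dir).glob("*.wav"))
    random.Random(seed).shuffle(files)  # deterministic shuffle for repeatability
    n_test = int(len(files) * test_fraction)
    train_dir, test_dir = Path(train_dir), Path(test_dir)
    train_dir.mkdir(parents=True, exist_ok=True)
    test_dir.mkdir(parents=True, exist_ok=True)
    for i, f in enumerate(files):
        dest = test_dir if i < n_test else train_dir
        shutil.move(str(f), str(dest / f.name))

# Example usage, following the directory layout above:
# split_samples("raw-wake-word", "hey-computer/wake-word", "hey-computer/test/wake-word")
# split_samples("raw-not-wake-word", "hey-computer/not-wake-word", "hey-computer/test/not-wake-word")
```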

### Step 4: Train the Model

```bash
# Basic training (will take 30-60 minutes)
precise-train -e 60 hey-computer.net hey-computer/

# For better accuracy, train longer
precise-train -e 120 hey-computer.net hey-computer/

# Watch for overfitting - validation loss should decrease
# Stop if validation loss starts increasing
```

Training output will show:

```
Epoch 1/60
loss: 0.4523 - val_loss: 0.3891
Epoch 2/60
loss: 0.3102 - val_loss: 0.2845
...
```
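
To act on the overfitting advice above, you can track `val_loss` across epochs and stop once it rises for several epochs in a row. A small sketch (the patience value of 3 is an illustrative choice, not a Precise default):

```python
import re

def val_losses(log_text):
    """Extract the val_loss values, in epoch order, from training output."""
    return [float(m) for m in re.findall(r"val_loss: ([0-9.]+)", log_text)]

def should_stop(losses, patience=3):
    """Return True once val_loss has risen for `patience` consecutive epochs."""
    rises = 0
    for prev, cur in zip(losses, losses[1:]):
        rises = rises + 1 if cur > prev else 0
        if rises >= patience:
            return True
    return False
```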

### Step 5: Test the Model

```bash
# Test with microphone
precise-listen hey-computer.net

# Speak your wake word - you should see "!" when it is detected
# Speak other phrases - it should not trigger

# Test with audio files
precise-test hey-computer.net hey-computer/test/

# Should show accuracy metrics:
#   Wake word accuracy: 95%+
#   False positive rate: <5%
```

### Step 6: Optimize Sensitivity

```bash
# Adjust the activation threshold
precise-listen hey-computer.net -t 0.5  # Default
precise-listen hey-computer.net -t 0.7  # More conservative
precise-listen hey-computer.net -t 0.3  # More aggressive

# Find the optimal threshold for your use case:
# Higher = fewer false positives, more false negatives
# Lower  = more false positives, fewer false negatives
```
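
If you save the model's confidence scores for a set of labeled clips, you can sweep thresholds offline instead of tuning by ear. A minimal sketch - the score lists here are illustrative; in practice they would come from running the model over your test set:

```python
def sweep_thresholds(wake_scores, other_scores, thresholds=(0.3, 0.5, 0.7)):
    """For each candidate threshold, report the false-negative rate on
    wake-word clips and the false-positive rate on everything else."""
    results = {}
    for t in thresholds:
        fn = sum(s < t for s in wake_scores) / len(wake_scores)
        fp = sum(s >= t for s in other_scores) / len(other_scores)
        results[t] = {"false_negative_rate": fn, "false_positive_rate": fp}
    return results

# Example: scores near 1.0 for wake-word clips, near 0.0 for others
stats = sweep_thresholds([0.9, 0.8, 0.4], [0.1, 0.2, 0.6])
```

Pick the threshold whose trade-off matches the guidance above: raise it to suppress false positives, lower it to catch more true activations.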

## Phase 3: Integration with Voice Server

### Update voice_server.py

Add Mycroft Precise support to the server:

```python
# Add to imports
import os

from precise_runner import PreciseEngine, PreciseRunner
import pyaudio

# Add to configuration
PRECISE_MODEL = os.getenv("PRECISE_MODEL",
                          "/home/alan/precise-models/hey-computer.net")
PRECISE_SENSITIVITY = float(os.getenv("PRECISE_SENSITIVITY", "0.5"))

# Global precise runner
precise_runner = None

def on_activation():
    """Called when the wake word is detected"""
    print("Wake word detected!")
    # Trigger recording and processing
    # (implementation depends on your audio streaming setup)

def start_precise_listener():
    """Start Mycroft Precise wake word detection"""
    global precise_runner

    engine = PreciseEngine(
        '/usr/local/bin/precise-engine',
        PRECISE_MODEL
    )

    precise_runner = PreciseRunner(
        engine,
        sensitivity=PRECISE_SENSITIVITY,
        on_activation=on_activation
    )

    precise_runner.start()
    print(f"Precise listening with model: {PRECISE_MODEL}")
```

### Server-Side Wake Word Detection Architecture

For server-side detection, you need continuous audio streaming from the Maix Duino:

```python
# New endpoint for audio streaming
@app.route('/stream', methods=['POST'])
def stream_audio():
    """
    Receive a continuous audio stream for wake word detection.

    This endpoint processes incoming audio chunks and runs them
    through Mycroft Precise for wake word detection.
    """
    # Implementation here
    pass
```
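
One way to fill in that stub: `precise_runner` ships a `ReadWriteStream` you can pass to `PreciseRunner(engine, stream=...)`, so the runner reads audio your endpoint writes instead of opening a local microphone. The sketch below shows the same buffering idea as a self-contained class (the class name and blocking semantics are illustrative, not the library's implementation):

```python
import threading

class AudioBuffer:
    """Minimal ReadWriteStream-style buffer: the /stream endpoint writes
    posted chunks, and the wake word engine reads fixed-size frames."""

    def __init__(self):
        self._buf = b''
        self._cond = threading.Condition()

    def write(self, data):
        """Append a chunk of PCM audio (called by the HTTP endpoint)."""
        with self._cond:
            self._buf += data
            self._cond.notify_all()

    def read(self, n, timeout=None):
        """Block until n bytes are available (or timeout), then return
        up to n bytes (called by the wake word engine's reader thread)."""
        with self._cond:
            while len(self._buf) < n:
                if not self._cond.wait(timeout):
                    break  # timed out; return whatever we have
            chunk, self._buf = self._buf[:n], self._buf[n:]
            return chunk
```

In the Flask endpoint you would then call `audio_buffer.write(request.get_data())` and return immediately, keeping the request handler cheap.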

## Phase 4: Maix Duino Integration (Server-Side Detection)

### Update maix_voice_client.py

For server-side detection, stream audio continuously:

```python
# Add to configuration
STREAM_ENDPOINT = "/stream"
WAKE_WORD_CHECK_INTERVAL = 0.1  # Check every 100ms

def stream_audio_continuous():
    """
    Stream audio to the server for wake word detection.

    The server will notify us when the wake word is detected.
    """
    import socket
    import struct
    import time

    # Create socket connection
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_addr = (VOICE_SERVER_URL.replace('http://', '').split(':')[0], 8888)

    try:
        sock.connect(server_addr)
        print("Connected to wake word server")

        while True:
            # Capture audio chunk
            chunk = i2s_dev.record(CHUNK_SIZE)

            if chunk:
                # Send the chunk size first, then the chunk
                sock.sendall(struct.pack('>I', len(chunk)))
                sock.sendall(chunk)

            # Check for wake word detection signal
            # (simplified - a real implementation needs a non-blocking socket)

            time.sleep(0.01)

    except Exception as e:
        print(f"Streaming error: {e}")
    finally:
        sock.close()
```
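
The matching receiver on Heimdall has to reassemble the length-prefixed frames the client sends. A minimal sketch (port 8888 as above; `handle_frame` is a hypothetical callback standing in for the step that feeds audio into Precise):

```python
import socket
import struct

def recv_exact(sock, n):
    """Read exactly n bytes from the socket, or raise on disconnect."""
    data = b''
    while len(data) < n:
        part = sock.recv(n - len(data))
        if not part:
            raise ConnectionError("client disconnected")
        data += part
    return data

def serve_frames(handle_frame, host='0.0.0.0', port=8888):
    """Accept one client and pass each length-prefixed audio frame
    to handle_frame(frame_bytes)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(1)
        conn, _addr = srv.accept()
        with conn:
            while True:
                # 4-byte big-endian length prefix, matching the client's
                # struct.pack('>I', len(chunk))
                (length,) = struct.unpack('>I', recv_exact(conn, 4))
                handle_frame(recv_exact(conn, length))
```

A production version would accept reconnects and run in its own thread next to the Flask app; this shows only the framing.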

## Phase 5: Edge Detection on Maix Duino (Advanced)

### Convert Precise Model to KMODEL

This is complex and requires several conversion steps:

```bash
# Step 1: Convert TensorFlow model to ONNX
pip install tf2onnx

python -m tf2onnx.convert \
    --saved-model hey-computer.net \
    --output hey-computer.onnx

# Step 2: Optimize ONNX model
pip install onnx

python -c "
import onnx
from onnx import optimizer

model = onnx.load('hey-computer.onnx')
passes = ['eliminate_deadend', 'eliminate_identity',
          'eliminate_nop_dropout', 'eliminate_nop_pad']
optimized = optimizer.optimize(model, passes)
onnx.save(optimized, 'hey-computer-opt.onnx')
"

# Step 3: Convert ONNX to KMODEL (for K210)
# Use nncase (https://github.com/kendryte/nncase)
# This step is hardware-specific and complex

# Install nncase
pip install nncase

# Convert (adjust parameters based on your model)
ncc compile hey-computer-opt.onnx \
    -i onnx \
    --dataset calibration_data \
    -o hey-computer.kmodel \
    --target k210
```

**Note:** KMODEL conversion is non-trivial and may require model architecture adjustments. The K210 has limitations:
- Max model size: ~6 MB
- Limited operator support
- Quantization required for performance

### Testing KMODEL on Maix Duino

```python
# Load model in maix_voice_client.py
import KPU as kpu

def load_wake_word_model_kmodel():
    """Load the converted KMODEL for wake word detection"""
    global kpu_task

    try:
        kpu_task = kpu.load("/sd/models/hey-computer.kmodel")
        print("Wake word model loaded on K210")
        return True
    except Exception as e:
        print(f"Failed to load model: {e}")
        return False

def detect_wake_word_kmodel():
    """Run wake word detection using the K210 KPU"""
    global kpu_task

    # Capture audio
    audio_chunk = i2s_dev.record(CHUNK_SIZE)

    # Preprocess for the model (depends on the model's input format)
    # This is model-specific - adjust based on your training
    features = preprocess_audio(audio_chunk)

    # Run inference
    output = kpu.run_yolo2(kpu_task, features)  # Adjust based on model type

    # Check confidence
    if output[0] > WAKE_WORD_THRESHOLD:
        return True

    return False
```

## Recommended Wake Words

Based on testing and community feedback:

**Best performers:**
1. "Hey Computer" - clear, distinct syllables, hard consonants
2. "Okay Jarvis" - pop culture reference, easy to say
3. "Hey Mycroft" - original Mycroft wake word (lots of training data available)

**Avoid:**
- Single-syllable words (too easy to trigger)
- Common phrases ("okay", "hey there")
- Names of people in your household
- Words that sound like common speech patterns

## Training Tips

### For Best Accuracy

1. **Diverse training data:**
   - Multiple speakers
   - Various distances (1 ft to 15 ft)
   - Different noise conditions
   - Accent variations

2. **Quality over quantity:**
   - 50 good samples > 200 poor samples
   - Clear pronunciation
   - Consistent volume

3. **Hard negatives:**
   - Include similar-sounding phrases
   - Include partial wake words
   - Include common false triggers you notice

4. **Regular retraining:**
   - Add false positives to the training set
   - Add missed detections
   - Retrain every few weeks initially

### Collecting Hard Negatives

```bash
# Run Precise in test mode and collect false positives
precise-listen hey-computer.net --save-false-positives

# This saves audio clips when the model triggers incorrectly
# Add these to your not-wake-word training set
# Retrain to reduce false positives
```
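
Folding those saved clips back into the training set can be scripted. A minimal sketch (directory names and the renaming scheme are illustrative):

```python
import shutil
from pathlib import Path

def add_hard_negatives(false_positive_dir, not_wake_word_dir, prefix="fp"):
    """Copy saved false-positive clips into the not-wake-word set,
    renaming them so repeated collection runs don't collide.
    Returns the number of clips copied."""
    dest = Path(not_wake_word_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = 0
    for i, wav in enumerate(sorted(Path(false_positive_dir).glob("*.wav"))):
        shutil.copy(wav, dest / f"{prefix}-{i:04d}{wav.suffix}")
        copied += 1
    return copied
```

After copying, rerun `precise-train` so the model learns from the new negatives.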

## Performance Benchmarks

### Server-Side Detection (Heimdall)
- **Latency:** 100-200 ms from utterance to detection
- **Accuracy:** 95%+ with good training
- **False positive rate:** <1 per hour with tuning
- **CPU usage:** ~5-10% (single core)
- **Network:** ~256 kbps for a continuous 16 kHz/16-bit mono stream

### Edge Detection (Maix Duino)
- **Latency:** 50-100 ms
- **Accuracy:** 80-90% (limited by K210 quantization)
- **False positive rate:** varies by model optimization
- **CPU usage:** ~30% of the K210 (leaves room for other tasks)
- **Network:** none until the wake word is detected

## Monitoring and Debugging

### Log Wake Word Detections

```python
# Add to voice_server.py
import datetime

def log_wake_word(confidence, timestamp=None):
    """Log wake word detections for analysis"""
    if timestamp is None:
        timestamp = datetime.datetime.now()

    log_file = "/home/alan/voice-assistant/logs/wake_words.log"

    with open(log_file, 'a') as f:
        f.write(f"{timestamp.isoformat()},{confidence}\n")
```

### Analyze False Positives

```bash
# Follow the wake word log
tail -f ~/voice-assistant/logs/wake_words.log

# Summarize detection confidences
# (each log line is "timestamp,confidence", so take the second field)
awk -F',' '{print $2}' ~/voice-assistant/logs/wake_words.log | \
    sort -n | uniq -c
```
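
For a quick look at when false triggers cluster, the CSV log written by `log_wake_word` can be grouped by hour. A small sketch:

```python
from collections import Counter
from datetime import datetime

def detections_per_hour(log_lines):
    """Count detections per (date, hour) from 'isotimestamp,confidence' lines."""
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        timestamp, _confidence = line.split(',', 1)
        ts = datetime.fromisoformat(timestamp)
        counts[(ts.date(), ts.hour)] += 1
    return counts

# Example usage:
# with open("/home/alan/voice-assistant/logs/wake_words.log") as f:
#     for hour, n in detections_per_hour(f).most_common(5):
#         print(hour, n)
```

Hours with unusually many detections (e.g. when the TV is on) tell you what kind of hard negatives to collect next.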

## Production Deployment

### Systemd Service with Precise

Update the systemd service to include Precise:

```ini
[Unit]
Description=Voice Assistant with Wake Word Detection
After=network.target

[Service]
Type=simple
User=alan
WorkingDirectory=/home/alan/voice-assistant
Environment="PATH=/home/alan/miniconda3/envs/precise/bin:/usr/local/bin:/usr/bin:/bin"
EnvironmentFile=/home/alan/voice-assistant/config/.env
ExecStart=/home/alan/miniconda3/envs/precise/bin/python voice_server.py --enable-precise
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
```

## Troubleshooting

### Precise Won't Start

```bash
# Check TensorFlow version
python -c "import tensorflow as tf; print(tf.__version__)"
# Should be 1.15.x

# Check the model file
file hey-computer.net
# Should be "TensorFlow SavedModel"

# Test the model directly
precise-engine hey-computer.net
# Should load without errors
```

### Low Accuracy

1. **Collect more training data** - especially hard negatives
2. **Increase training epochs** - try 200-300 epochs
3. **Verify the training/test split** - should be 80/20
4. **Check audio quality** - the sample rate should match (16 kHz)
5. **Try different wake words** - some are easier to detect

### High False Positive Rate

1. **Increase the threshold** - try 0.6, 0.7, 0.8
2. **Add false positives to training** - retrain with false triggers
3. **Collect more negative samples** - expand the not-wake-word set
4. **Use ensemble models** - run multiple models, require agreement

### KMODEL Conversion Fails

This is expected - K210 conversion is complex:

1. **Simplify the model architecture** - reduce the layer count
2. **Use quantization-aware training** - train with quantization in mind
3. **Check operator support** - the K210 doesn't support all TF ops
4. **Consider alternatives:**
   - Use pre-trained models for the K210
   - Stick with server-side detection
   - Use Porcupine instead (it has K210 support)

## Alternative: Use Pre-trained Models

Mycroft provides some pre-trained models:

```bash
# Download the Hey Mycroft model
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz

# Test it
precise-listen hey-mycroft.net
```

Then train your own wake word starting from this base:

```bash
# Fine-tune from the pre-trained model
precise-train -e 60 my-wake-word.net my-wake-word/ \
    --from-checkpoint hey-mycroft.net
```

## Next Steps

1. **Start with server-side** - get it working on Heimdall first
2. **Collect good training data** - quality samples are key
3. **Test and tune the threshold** - find the sweet spot for your environment
4. **Monitor performance** - track false positives and misses
5. **Iterate on training** - add hard examples, retrain
6. **Consider edge deployment** - once server-side is solid

## Resources

- Mycroft Precise docs: https://github.com/MycroftAI/mycroft-precise
- Training guide: https://mycroft-ai.gitbook.io/docs/mycroft-technologies/precise
- Community models: https://github.com/MycroftAI/precise-data
- K210 docs: https://canaan-creative.com/developer
- nncase: https://github.com/kendryte/nncase

## Conclusion

Mycroft Precise gives you full control over your wake word detection with complete privacy. Start with server-side detection for easier development, then optimize to edge detection once you have a well-trained model.

The key to success is good training data - invest time in collecting diverse, high-quality samples!