feat: import mycroft-precise work as Minerva foundation

Ports prior voice assistant research and prototypes from devl/Devops
into the Minerva repo. Includes:

- docs/: architecture, wake word guides, ESP32-S3 spec, hardware buying guide
- scripts/: voice_server.py, voice_server_enhanced.py, setup scripts
- hardware/maixduino/: edge device scripts with WiFi credentials scrubbed
  (replaced hardcoded password with secrets.py pattern)
- config/.env.example: server config template
- .gitignore: excludes .env, secrets.py, model blobs, ELF firmware
- CLAUDE.md: Minerva product context and connection to cf-voice roadmap
pyr0ball 2026-04-06 22:21:12 -07:00
parent fca5a107de
commit 173f7f37d4
30 changed files with 12519 additions and 0 deletions

.gitignore vendored Normal file

@@ -0,0 +1,29 @@
# Credentials
secrets.py
config/.env
*.env
!*.env.example
# Models (large binary files)
models/*.pb
models/*.pb.params
models/*.net
models/*.tflite
models/*.kmodel
# OEM firmware blobs
*.elf
*.7z
*.bin
# Python
__pycache__/
*.pyc
*.pyo
# Logs
logs/
# IDE
.vscode/
.idea/

CLAUDE.md Normal file

@@ -0,0 +1,165 @@
# Minerva — Developer Context
**Product code:** `MNRV`
**Status:** Concept / early prototype
**Domain:** Privacy-first, local-only voice assistant hardware platform
---
## What Minerva Is
A 100% local, FOSS voice assistant hardware platform. No cloud. No subscriptions. No data leaving the local network.
The goal is a reference hardware + software stack for a privacy-first voice assistant that anyone can build, extend, or self-host — including people without technical backgrounds if the assembly docs are good enough.
Core design principles (same as all CF products):
- **Local-first inference** — Whisper STT, Piper TTS, Mycroft Precise wake word all run on the host server
- **Edge where possible** — wake word detection moves to edge hardware over time (K210 → ESP32-S3 → custom)
- **No cloud dependency** — Home Assistant optional, not required
- **100% FOSS stack**
---
## Hardware Targets
### Phase 1 (current): Maix Duino (K210)
- K210 dual-core RISC-V @ 400MHz with KPU neural accelerator
- Audio: I2S microphone + speaker output
- Connectivity: ESP32 WiFi/BLE co-processor
- Programming: MaixPy (MicroPython)
- Status: server-side wake word working; edge inference in progress
### Phase 2: ESP32-S3
- More accessible, cheaper, better WiFi
- On-device wake word with Espressif ESP-SR
- See `docs/ESP32_S3_VOICE_ASSISTANT_SPEC.md`
### Phase 3: Custom hardware
- Dedicated PCB for CF reference platform
- Hardware-accelerated wake word + VAD
- Designed for accessibility: large buttons, LED feedback, easy mounting
---
## Software Stack
### Edge device (Maix Duino / ESP32-S3)
- Firmware: MaixPy or ESP-IDF
- Client: `hardware/maixduino/maix_voice_client.py`
- Audio: I2S capture and playback
- Network: WiFi → Minerva server
### Server (runs on Heimdall or any Linux box)
- Voice server: `scripts/voice_server.py` (Flask + Whisper + Precise)
- Enhanced version: `scripts/voice_server_enhanced.py` (adds speaker ID via pyannote)
- STT: Whisper (local)
- Wake word: Mycroft Precise
- TTS: Piper
- Home Assistant: REST API integration (optional)
- Conda env: `whisper_cli` (existing on Heimdall)
---
## Directory Structure
```
minerva/
├── docs/ # Architecture, guides, reference docs
│ ├── maix-voice-assistant-architecture.md
│ ├── MYCROFT_PRECISE_GUIDE.md
│ ├── PRECISE_DEPLOYMENT.md
│ ├── ESP32_S3_VOICE_ASSISTANT_SPEC.md
│ ├── HARDWARE_BUYING_GUIDE.md
│ ├── LCD_CAMERA_FEATURES.md
│ ├── K210_PERFORMANCE_VERIFICATION.md
│ ├── WAKE_WORD_ADVANCED.md
│ ├── ADVANCED_WAKE_WORD_TOPICS.md
│ └── QUESTIONS_ANSWERED.md
├── scripts/ # Server-side scripts
│ ├── voice_server.py # Core Flask + Whisper + Precise server
│ ├── voice_server_enhanced.py # + speaker identification (pyannote)
│ ├── setup_voice_assistant.sh # Server setup
│ ├── setup_precise.sh # Mycroft Precise training environment
│ └── download_pretrained_models.sh
├── hardware/
│ └── maixduino/ # K210 edge device scripts
│ ├── maix_voice_client.py # Production client
│ ├── maix_simple_record_test.py # Audio capture test
│ ├── maix_test_simple.py # Hardware/network test
│ ├── maix_debug_wifi.py # WiFi diagnostics
│ ├── maix_discover_modules.py # Module discovery
│ ├── secrets.py.example # WiFi/server credential template
│ ├── MICROPYTHON_QUIRKS.md
│ └── README.md
├── config/
│ └── .env.example # Server config template
├── models/ # Wake word models (gitignored, large)
└── CLAUDE.md # This file
```
---
## Credentials / Secrets
**Never commit real credentials.** Pattern:
- Server: copy `config/.env.example` → `config/.env`, fill in real values
- Edge device: copy `hardware/maixduino/secrets.py.example` → `secrets.py`, fill in WiFi + server URL
Both files are gitignored. `.example` files are committed as templates.
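A rough sketch of what the edge-device template contains — the variable names below are illustrative assumptions, not a verbatim copy of `secrets.py.example`:

```python
# Hypothetical sketch of hardware/maixduino/secrets.py.example.
# Field names are assumptions for illustration; check the real template.
WIFI_SSID = "your-ssid-here"
WIFI_PASSWORD = "your-wifi-password-here"
SERVER_URL = "http://10.1.10.71:5000"  # Minerva voice server (Heimdall)
```

The client then does `from secrets import ...` instead of hardcoding credentials, and `.gitignore` keeps the filled-in `secrets.py` out of version control.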
---
## Running the Server
```bash
# Activate environment
conda activate whisper_cli
# Basic server (Whisper + Precise wake word)
python scripts/voice_server.py \
--enable-precise \
--precise-model models/hey-minerva.net \
--precise-sensitivity 0.5
# Enhanced server (+ speaker identification)
python scripts/voice_server_enhanced.py \
--enable-speaker-id \
--hf-token $HF_TOKEN
# Test health
curl http://localhost:5000/health
curl http://localhost:5000/wake-word/status
```
---
## Connection to CF Voice Infrastructure
Minerva is the **hardware platform** for cf-voice. As `circuitforge_core.voice` matures:
- `cf_voice.io` (STT/TTS) → replaces the ad hoc Whisper/Piper calls in `voice_server.py`
- `cf_voice.context` (parallel classifier) → augments Mycroft Precise with tone/environment detection
- `cf_voice.telephony` → future: Minerva as an always-on household linnet node
Minerva hardware + cf-voice software = the CF reference voice assistant stack.
---
## Roadmap
See Forgejo milestones on this repo. High-level:
1. **Alpha — Server-side pipeline** — Whisper + Precise + Piper working end-to-end on Heimdall
2. **Beta — Edge wake word** — wake word on K210 or ESP32-S3; audio only streams post-wake
3. **Hardware v1** — documented reference build; buying guide; assembly instructions
4. **cf-voice integration** — Minerva uses cf_voice modules from circuitforge-core
5. **Platform** — multiple hardware targets; custom PCB design
---
## Related
- `cf-voice` module design: `circuitforge-plans/circuitforge-core/2026-04-06-cf-voice-design.md`
- `linnet` product: real-time tone annotation, will eventually embed Minerva as a hardware node
- Heimdall server: primary dev/deployment target (10.1.10.71 on LAN)

config/.env.example Normal file

@@ -0,0 +1,24 @@
# Minerva Voice Server — configuration
# Copy to config/.env and fill in real values. Never commit .env.
# Server
SERVER_HOST=0.0.0.0
SERVER_PORT=5000
# Whisper STT
WHISPER_MODEL=base
# Mycroft Precise wake word
# PRECISE_MODEL=/path/to/wake-word.net
# PRECISE_SENSITIVITY=0.5
# Home Assistant integration (optional)
# HA_URL=http://homeassistant.local:8123
# HA_TOKEN=your_long_lived_access_token_here
# HuggingFace (for speaker identification, optional)
# HF_TOKEN=your_huggingface_token_here
# Logging
LOG_LEVEL=INFO
LOG_FILE=logs/minerva.log

docs/ADVANCED_WAKE_WORD_TOPICS.md Executable file

@@ -0,0 +1,905 @@
# Advanced Wake Word Topics - Pre-trained Models, Multiple Wake Words, and Voice Adaptation
## Pre-trained Mycroft Models
### Yes! Pre-trained Models Exist
Mycroft AI provides several pre-trained wake word models you can use immediately:
**Available Models:**
- **Hey Mycroft** - Original Mycroft wake word (most training data)
- **Hey Jarvis** - Popular alternative
- **Christopher** - Alternative wake word
- **Hey Ezra** - Another option
### Download Pre-trained Models
```bash
# On Heimdall
conda activate precise
cd ~/precise-models
# Create directory for pre-trained models
mkdir -p pretrained
cd pretrained
# Download Hey Mycroft (recommended starting point)
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz
# Download other models
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-jarvis.tar.gz
tar xzf hey-jarvis.tar.gz
# List available models
ls -lh *.net
```
### Test Pre-trained Model
```bash
conda activate precise
# Test Hey Mycroft
precise-listen hey-mycroft.net
# Speak "Hey Mycroft" - should see "!" when detected
# Press Ctrl+C to exit
# Test with different threshold
precise-listen hey-mycroft.net -t 0.7 # More conservative
```
### Use Pre-trained Model in Voice Server
```bash
cd ~/voice-assistant
# Start server with Hey Mycroft model
python voice_server.py \
--enable-precise \
--precise-model ~/precise-models/pretrained/hey-mycroft.net \
--precise-sensitivity 0.5
```
### Fine-tune Pre-trained Models
You can use pre-trained models as a **starting point** and fine-tune with your voice:
```bash
cd ~/precise-models
mkdir -p hey-mycroft-custom
# Copy base model
cp pretrained/hey-mycroft.net hey-mycroft-custom/
# Collect your samples
cd hey-mycroft-custom
precise-collect # Record 20-30 samples of YOUR voice
# Fine-tune from pre-trained model
precise-train -e 30 hey-mycroft-custom.net . \
--from-checkpoint ../pretrained/hey-mycroft.net
# This is MUCH faster than training from scratch!
```
**Benefits:**
- ✅ Start with proven model
- ✅ Much less training data needed (20-30 vs 100+ samples)
- ✅ Faster training (30 mins vs 60 mins)
- ✅ Good baseline accuracy
## Multiple Wake Words
### Architecture Options
#### Option 1: Multiple Models in Parallel (Server-Side Only)
Run multiple Precise instances simultaneously:
```python
# In voice_server.py - Multiple wake word detection
from precise_runner import PreciseEngine, PreciseRunner
import threading
# Global runners
precise_runners = {}
def on_wake_word_detected(wake_word_name):
"""Callback factory for different wake words"""
def callback():
print(f"Wake word detected: {wake_word_name}")
wake_word_queue.put({
'wake_word': wake_word_name,
'timestamp': time.time()
})
return callback
def start_multiple_wake_words(wake_word_configs):
"""
Start multiple wake word detectors
Args:
wake_word_configs: List of dicts with 'name', 'model', 'sensitivity'
Example:
configs = [
{'name': 'hey mycroft', 'model': 'hey-mycroft.net', 'sensitivity': 0.5},
{'name': 'hey jarvis', 'model': 'hey-jarvis.net', 'sensitivity': 0.5}
]
"""
global precise_runners
for config in wake_word_configs:
engine = PreciseEngine(
'/usr/local/bin/precise-engine',
config['model']
)
runner = PreciseRunner(
engine,
sensitivity=config['sensitivity'],
on_activation=on_wake_word_detected(config['name'])
)
runner.start()
precise_runners[config['name']] = runner
print(f"Started wake word detector: {config['name']}")
```
**Server-Side Multiple Wake Words:**
```bash
# Start server with multiple wake words
python voice_server.py \
--enable-precise \
--precise-models "hey-mycroft:~/models/hey-mycroft.net:0.5,hey-jarvis:~/models/hey-jarvis.net:0.5"
```
**Performance Impact:**
- CPU: ~5-10% per model (can run 2-3 easily)
- Memory: ~50-100MB per model
- Latency: Minimal (all run in parallel)
#### Option 2: Single Model, Multiple Phrases (Edge or Server)
Train ONE model that responds to multiple phrases:
```bash
cd ~/precise-models/multi-wake
conda activate precise
# Record samples for BOTH wake words in the SAME dataset
# Label all as "wake-word" regardless of which phrase
mkdir -p wake-word not-wake-word
# Record "Hey Mycroft" samples
precise-collect # Save to wake-word/hey-mycroft-*.wav
# Record "Hey Computer" samples
precise-collect # Save to wake-word/hey-computer-*.wav
# Record negatives
precise-collect -f not-wake-word/random.wav
# Train single model on both phrases
precise-train -e 60 multi-wake.net .
```
**Pros:**
- ✅ Single model = less compute
- ✅ Works on edge (K210)
- ✅ Easy to deploy
**Cons:**
- ❌ Can't tell which wake word was used
- ❌ May reduce accuracy for each individual phrase
- ❌ Higher false positive risk
#### Option 3: Sequential Detection (Edge)
Detect wake word, then identify which one:
```python
# Pseudo-code for edge detection
if wake_word_detected():
audio_snippet = last_2_seconds()
# Run all models on the audio snippet
scores = {
'hey-mycroft': model1.score(audio_snippet),
'hey-jarvis': model2.score(audio_snippet),
'hey-computer': model3.score(audio_snippet)
}
# Use highest scoring wake word
wake_word = max(scores, key=scores.get)
```
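The "use highest scoring wake word" step works, but a raw argmax will happily pick a winner even when every model is unsure. A small sketch that also requires a minimum score and a margin over the runner-up (the thresholds are illustrative, not tuned values):

```python
def pick_wake_word(scores, min_score=0.5, margin=0.1):
    """Return the best-scoring wake word, or None on low confidence / near-ties.

    scores: dict mapping wake-word name -> model score in [0, 1].
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_name, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    # Reject weak detections and near-ties between two models
    if best_score < min_score or best_score - runner_up < margin:
        return None
    return best_name

pick_wake_word({'hey-mycroft': 0.9, 'hey-jarvis': 0.3})    # -> 'hey-mycroft'
pick_wake_word({'hey-mycroft': 0.55, 'hey-jarvis': 0.52})  # -> None (too close)
```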
### Recommendations
**Server-Side (Heimdall):**
- ✅ **Use Option 1** - Multiple models in parallel
- Run 2-3 wake words easily
- Each can have different sensitivity
- Can identify which wake word was used
- Example: "Hey Mycroft" for commands, "Hey Jarvis" for queries
**Edge (Maix Duino K210):**
- ✅ **Use Option 2** - Single multi-phrase model
- K210 can handle 1 model efficiently
- Train on 2-3 phrases max
- Simpler deployment
- Lower latency
## Voice Adaptation & Multi-User Support
### Approach 1: Inclusive Training (Recommended)
Train ONE model on EVERYONE'S voices:
```bash
cd ~/precise-models/family-wake-word
conda activate precise
# Record samples from each family member
# Alice records 30 samples
precise-collect # Save as wake-word/alice-*.wav
# Bob records 30 samples
precise-collect # Save as wake-word/bob-*.wav
# Carol records 30 samples
precise-collect # Save as wake-word/carol-*.wav
# Train on all voices
precise-train -e 60 family-wake-word.net .
```
**Pros:**
- ✅ Everyone can use the system
- ✅ Single model deployment
- ✅ Works for all family members
- ✅ Simple maintenance
**Cons:**
- ❌ Can't identify who spoke
- ❌ May need more training data
- ❌ No personalization
**Best for:** Family voice assistant, shared devices
### Approach 2: Speaker Identification (Advanced)
Detect wake word, then identify speaker:
```python
# Architecture with speaker ID
# Step 1: Precise detects wake word
if wake_word_detected():
# Step 2: Capture voice sample
voice_sample = record_audio(duration=3)
# Step 3: Speaker identification
speaker = identify_speaker(voice_sample)
# Uses voice embeddings/neural network
# Step 4: Process with user context
process_command(voice_sample, user=speaker)
```
**Implementation Options:**
#### Option A: Use resemblyzer (Voice Embeddings)
```bash
pip install resemblyzer --break-system-packages
# Enrollment phase
python enroll_users.py
# Each user records 10-20 seconds of speech
# System creates voice profile (embedding)
# Runtime
python speaker_id.py
# Compares incoming audio to stored embeddings
# Returns most likely speaker
```
**Example Code:**
```python
from resemblyzer import VoiceEncoder, preprocess_wav
import numpy as np
import os  # used below for listing and joining profile paths
# Initialize encoder
encoder = VoiceEncoder()
# Enrollment - do once per user
def enroll_user(name, audio_files):
"""Create voice profile for user"""
embeddings = []
for audio_file in audio_files:
wav = preprocess_wav(audio_file)
embedding = encoder.embed_utterance(wav)
embeddings.append(embedding)
# Average embeddings for robustness
user_profile = np.mean(embeddings, axis=0)
# Save profile
np.save(f'profiles/{name}.npy', user_profile)
return user_profile
# Identification - run each time
def identify_speaker(audio_file, profiles_dir='profiles'):
"""Identify which enrolled user is speaking"""
wav = preprocess_wav(audio_file)
test_embedding = encoder.embed_utterance(wav)
# Load all profiles
profiles = {}
for profile_file in os.listdir(profiles_dir):
name = profile_file.replace('.npy', '')
profile = np.load(os.path.join(profiles_dir, profile_file))
profiles[name] = profile
# Calculate similarity to each profile
similarities = {}
for name, profile in profiles.items():
similarity = np.dot(test_embedding, profile)
similarities[name] = similarity
# Return most similar
best_match = max(similarities, key=similarities.get)
confidence = similarities[best_match]
if confidence > 0.7: # Threshold
return best_match
else:
return "unknown"
```
#### Option B: Use pyannote.audio (Production-grade)
```bash
pip install pyannote.audio --break-system-packages
# Requires HuggingFace token (same as diarization)
```
**Example:**
```python
from pyannote.audio import Inference
# Initialize
inference = Inference(
    "pyannote/embedding",
    window="whole",  # one embedding per file, so the cosine comparison below works
    use_auth_token="your_hf_token"
)
# Enroll users
alice_profile = inference("alice_sample.wav")
bob_profile = inference("bob_sample.wav")
# Identify
test_embedding = inference("test_audio.wav")
# Compare
from scipy.spatial.distance import cosine
alice_similarity = 1 - cosine(test_embedding, alice_profile)
bob_similarity = 1 - cosine(test_embedding, bob_profile)
if alice_similarity > bob_similarity and alice_similarity > 0.7:
speaker = "Alice"
elif bob_similarity > 0.7:
speaker = "Bob"
else:
speaker = "Unknown"
```
**Pros:**
- ✅ Can identify individual users
- ✅ Personalized responses
- ✅ User-specific commands/permissions
- ✅ Better for privacy (know who's speaking)
**Cons:**
- ❌ More complex implementation
- ❌ Requires enrollment phase
- ❌ Additional processing time (~100-200ms)
- ❌ May fail with similar voices
### Approach 3: Per-User Wake Word Models
Each person has their OWN wake word:
```bash
# Alice's wake word: "Hey Mycroft"
# Train on ONLY Alice's voice
# Bob's wake word: "Hey Jarvis"
# Train on ONLY Bob's voice
# Carol's wake word: "Hey Computer"
# Train on ONLY Carol's voice
```
**Deployment:**
Run all 3 models in parallel (server-side):
```python
wake_word_configs = [
{'name': 'Alice', 'wake_word': 'hey mycroft', 'model': 'alice-wake.net'},
{'name': 'Bob', 'wake_word': 'hey jarvis', 'model': 'bob-wake.net'},
{'name': 'Carol', 'wake_word': 'hey computer', 'model': 'carol-wake.net'}
]
```
**Pros:**
- ✅ Automatic user identification
- ✅ Highest accuracy per user
- ✅ Clear user separation
- ✅ No additional speaker ID needed
**Cons:**
- ❌ Requires 3x models (server only)
- ❌ Users must remember their wake word
- ❌ 3x CPU usage (~15-30%)
- ❌ Can't work on edge (K210)
### Approach 4: Context-Based Adaptation
No speaker ID, but learn from interaction:
```python
# Track command patterns
user_context = {
'last_command': 'turn on living room lights',
'frequent_entities': ['light.living_room', 'light.bedroom'],
'time_of_day_patterns': {'morning': 'coffee maker', 'evening': 'tv'},
'location': 'home' # vs 'away'
}
# Use context to improve intent recognition
if "turn on the lights" in command and is_morning():  # command / is_morning(): illustrative placeholders
    # Probably means bedroom lights (based on history)
    entity = user_context['frequent_entities'][0]
```
**Pros:**
- ✅ No enrollment needed
- ✅ Improves over time
- ✅ Simple to implement
- ✅ Works with any number of users
**Cons:**
- ❌ No true user identification
- ❌ May make incorrect assumptions
- ❌ Privacy concerns (tracking behavior)
## Recommended Strategy
### For Your Use Case
Based on your home lab setup, I recommend:
#### Phase 1: Single Wake Word, Inclusive Training (Week 1-2)
```bash
# Start simple
cd ~/precise-models/hey-computer
conda activate precise
# Have all family members record samples
# Alice: 30 samples of "Hey Computer"
# Bob: 30 samples of "Hey Computer"
# You: 30 samples of "Hey Computer"
# Train single model on all voices
precise-train -e 60 hey-computer.net .
# Deploy to server
python voice_server.py \
--enable-precise \
--precise-model hey-computer.net
```
**Why:**
- Simple to set up and test
- Everyone can use it immediately
- Single model = easier debugging
- Works on edge if you migrate later
#### Phase 2: Add Speaker Identification (Week 3-4)
```bash
# Install resemblyzer
pip install resemblyzer --break-system-packages
# Enroll users
python enroll_speaker.py --name Alice --duration 20
# Each person speaks for 20 seconds
# Update voice_server.py to identify speaker
# Use speaker ID for personalized responses
```
**Why:**
- Enables personalization
- Can track preferences per user
- User-specific command permissions
- Better privacy (know who's speaking)
#### Phase 3: Multiple Wake Words (Month 2+)
```bash
# Add alternative wake words for different contexts
# "Hey Mycroft" - General commands
# "Hey Jarvis" - Media/Plex control
# "Computer" - Quick commands (lights, temp)
# Deploy multiple models on server
python voice_server.py \
--enable-precise \
--precise-models "mycroft:hey-mycroft.net:0.5,jarvis:hey-jarvis.net:0.5"
```
**Why:**
- Different wake words for different contexts
- Reduces false positives (more specific triggers)
- Fun factor (Jarvis for media!)
- Server can handle 2-3 easily
## Implementation Guide: Multiple Wake Words
### Update voice_server.py for Multiple Wake Words
```python
# Add to voice_server.py
def start_multiple_wake_words(configs):
"""
Start multiple wake word detectors
Args:
configs: List of dicts with 'name', 'model_path', 'sensitivity'
"""
global precise_runners
precise_runners = {}
for config in configs:
try:
engine = PreciseEngine(
DEFAULT_PRECISE_ENGINE,
config['model_path']
)
def make_callback(wake_word_name):
def callback():
print(f"Wake word detected: {wake_word_name}")
wake_word_queue.put({
'wake_word': wake_word_name,
'timestamp': time.time(),
'source': 'precise'
})
return callback
runner = PreciseRunner(
engine,
sensitivity=config['sensitivity'],
on_activation=make_callback(config['name'])
)
runner.start()
precise_runners[config['name']] = runner
print(f"✓ Started: {config['name']} (sensitivity: {config['sensitivity']})")
except Exception as e:
print(f"✗ Failed to start {config['name']}: {e}")
return len(precise_runners) > 0
# Add to main()
parser.add_argument('--precise-models',
help='Multiple models: name:path:sensitivity,name2:path2:sensitivity2')
# Parse multiple models
if args.precise_models:
configs = []
for model_spec in args.precise_models.split(','):
name, path, sensitivity = model_spec.split(':')
configs.append({
'name': name,
'model_path': os.path.expanduser(path),
'sensitivity': float(sensitivity)
})
start_multiple_wake_words(configs)
```
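One caveat in the `--precise-models` parser above: `model_spec.split(':')` breaks if a model path itself contains a colon. Peeling the name off the front and the sensitivity off the back tolerates that; a sketch:

```python
import os

def parse_model_spec(spec):
    """Parse 'name:path:sensitivity', allowing colons inside the path."""
    name, rest = spec.split(':', 1)          # wake word name can't contain ':'
    path, sensitivity = rest.rsplit(':', 1)  # sensitivity is always the last field
    return {
        'name': name,
        'model_path': os.path.expanduser(path),
        'sensitivity': float(sensitivity),
    }

# Path with an embedded colon still parses cleanly:
parse_model_spec('jarvis:~/models/v2:final/hey-jarvis.net:0.5')
```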
### Usage Example
```bash
cd ~/voice-assistant
# Start with multiple wake words
python voice_server.py \
--enable-precise \
--precise-models "\
hey-mycroft:~/precise-models/pretrained/hey-mycroft.net:0.5,\
hey-jarvis:~/precise-models/pretrained/hey-jarvis.net:0.5"
```
## Implementation Guide: Speaker Identification
### Add to voice_server.py
```python
# Add resemblyzer support
try:
from resemblyzer import VoiceEncoder, preprocess_wav
import numpy as np
SPEAKER_ID_AVAILABLE = True
except ImportError:
SPEAKER_ID_AVAILABLE = False
print("Warning: resemblyzer not available. Speaker ID disabled.")
# Initialize encoder
voice_encoder = None
speaker_profiles = {}
def load_speaker_profiles(profiles_dir='~/voice-assistant/profiles'):
"""Load enrolled speaker profiles"""
global speaker_profiles, voice_encoder
if not SPEAKER_ID_AVAILABLE:
return False
profiles_dir = os.path.expanduser(profiles_dir)
if not os.path.exists(profiles_dir):
print(f"No speaker profiles found at {profiles_dir}")
return False
# Initialize encoder
voice_encoder = VoiceEncoder()
# Load all profiles
for profile_file in os.listdir(profiles_dir):
if profile_file.endswith('.npy'):
name = profile_file.replace('.npy', '')
profile = np.load(os.path.join(profiles_dir, profile_file))
speaker_profiles[name] = profile
print(f"Loaded speaker profile: {name}")
return len(speaker_profiles) > 0
def identify_speaker(audio_path, threshold=0.7):
"""Identify speaker from audio file"""
if not SPEAKER_ID_AVAILABLE or not speaker_profiles:
return None
try:
# Get embedding for test audio
wav = preprocess_wav(audio_path)
test_embedding = voice_encoder.embed_utterance(wav)
# Compare to all profiles
similarities = {}
for name, profile in speaker_profiles.items():
similarity = np.dot(test_embedding, profile)
similarities[name] = similarity
# Get best match
best_match = max(similarities, key=similarities.get)
confidence = similarities[best_match]
print(f"Speaker ID: {best_match} (confidence: {confidence:.2f})")
if confidence > threshold:
return best_match
else:
return "unknown"
except Exception as e:
print(f"Error identifying speaker: {e}")
return None
# Update process endpoint to include speaker ID
@app.route('/process', methods=['POST'])
def process():
"""Process complete voice command with speaker identification"""
# ... existing code ...
# Add speaker identification
speaker = identify_speaker(temp_path) if speaker_profiles else None
if speaker:
print(f"Detected speaker: {speaker}")
# Could personalize response based on speaker
# ... rest of processing ...
```
### Enrollment Script
Create `enroll_speaker.py`:
```python
#!/usr/bin/env python3
"""
Enroll users for speaker identification
Usage:
python enroll_speaker.py --name Alice --audio alice_sample.wav
python enroll_speaker.py --name Alice --duration 20 # Record live
"""
import argparse
import os
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav
import pyaudio
import wave
def record_audio(duration=20, sample_rate=16000):
"""Record audio from microphone"""
print(f"Recording for {duration} seconds...")
print("Speak naturally - read a paragraph, have a conversation, etc.")
chunk = 1024
format = pyaudio.paInt16
channels = 1
p = pyaudio.PyAudio()
stream = p.open(
format=format,
channels=channels,
rate=sample_rate,
input=True,
frames_per_buffer=chunk
)
frames = []
for i in range(0, int(sample_rate / chunk * duration)):
data = stream.read(chunk)
frames.append(data)
stream.stop_stream()
stream.close()
p.terminate()
# Save to temp file
temp_file = f"/tmp/enrollment_{os.getpid()}.wav"
wf = wave.open(temp_file, 'wb')
wf.setnchannels(channels)
wf.setsampwidth(p.get_sample_size(format))
wf.setframerate(sample_rate)
wf.writeframes(b''.join(frames))
wf.close()
return temp_file
def enroll_speaker(name, audio_file, profiles_dir='~/voice-assistant/profiles'):
"""Create voice profile for speaker"""
profiles_dir = os.path.expanduser(profiles_dir)
os.makedirs(profiles_dir, exist_ok=True)
# Initialize encoder
encoder = VoiceEncoder()
# Process audio
wav = preprocess_wav(audio_file)
embedding = encoder.embed_utterance(wav)
# Save profile
profile_path = os.path.join(profiles_dir, f'{name}.npy')
np.save(profile_path, embedding)
print(f"✓ Enrolled speaker: {name}")
print(f" Profile saved to: {profile_path}")
return profile_path
def main():
parser = argparse.ArgumentParser(description="Enroll speaker for voice identification")
parser.add_argument('--name', required=True, help='Speaker name')
parser.add_argument('--audio', help='Path to audio file (wav)')
parser.add_argument('--duration', type=int, default=20,
help='Recording duration if not using audio file')
parser.add_argument('--profiles-dir', default='~/voice-assistant/profiles',
help='Directory to save profiles')
args = parser.parse_args()
# Get audio file
if args.audio:
audio_file = args.audio
if not os.path.exists(audio_file):
print(f"Error: Audio file not found: {audio_file}")
return 1
else:
audio_file = record_audio(args.duration)
# Enroll speaker
try:
enroll_speaker(args.name, audio_file, args.profiles_dir)
return 0
except Exception as e:
print(f"Error enrolling speaker: {e}")
return 1
if __name__ == '__main__':
import sys
sys.exit(main())
```
## Performance Comparison
| Setup | Latency | CPU (idle) | Memory | Accuracy |
|-------|---------|------------|--------|----------|
| Single wake word | 100-200ms | ~5-10% | ~100MB | 95%+ |
| Multiple wake words (3 models) | 100-200ms (parallel) | ~15-30% | ~300MB | 95%+ each |
| + Speaker identification | +100-200ms | +5% during ID | +50MB | 85-95% (depends on enrollment quality) |
## Best Practices
### Wake Word Selection
1. **Different enough** - "Hey Mycroft" vs "Hey Jarvis" (not "Hey Alice" vs "Hey Alex")
2. **Clear consonants** - Easier to detect
3. **2-3 syllables** - Not too short, not too long
4. **Test in environment** - Check for false triggers
### Training
1. **Include all users** - If using single model
2. **Diverse conditions** - Different rooms, noise levels
3. **Regular updates** - Add false positives weekly
4. **Per-user models** - Higher accuracy, more compute
### Speaker Identification
1. **Quality enrollment** - 20+ seconds of clear speech
2. **Re-enroll periodically** - Voices change (colds, etc.)
3. **Test thresholds** - Balance accuracy vs false IDs
4. **Graceful fallback** - Handle unknown speakers
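"Test thresholds" can be made concrete: replay labeled trials through your similarity scorer and measure false accepts vs. false rejects at each candidate threshold. A self-contained sketch — the similarity numbers are invented for illustration, not real measurements:

```python
# Each trial: (similarity to the claimed profile, is_genuine_speaker).
# These scores are made up for illustration; collect real ones from your setup.
trials = [(0.91, True), (0.84, True), (0.62, True),
          (0.71, False), (0.45, False), (0.30, False)]

def error_rates(trials, threshold):
    """Return (false_accept_rate, false_reject_rate) at a given threshold."""
    genuine = [s for s, ok in trials if ok]
    impostor = [s for s, ok in trials if not ok]
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr

for t in (0.5, 0.6, 0.7, 0.8):
    far, frr = error_rates(trials, t)
    print(f"threshold {t:.1f}: FAR={far:.2f} FRR={frr:.2f}")
```

Pick the threshold whose trade-off matches your use case: a household assistant can usually tolerate a few false rejects far better than letting an unknown voice through.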
## Recommended Path for You
```bash
# Week 1: Start with pre-trained "Hey Mycroft"
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz
precise-listen hey-mycroft.net # Test it!
# Week 2: Fine-tune with your voices
precise-train -e 30 hey-mycroft-custom.net . \
--from-checkpoint hey-mycroft.net
# Week 3: Add speaker identification
pip install resemblyzer
python enroll_speaker.py --name Alan --duration 20
python enroll_speaker.py --name [Family Member] --duration 20
# Week 4: Add second wake word ("Hey Jarvis" for Plex?)
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-jarvis.tar.gz
# Run both in parallel
# Month 2+: Optimize and expand
# - More wake words for different contexts
# - Per-user wake word models
# - Context-aware responses
```
This gives you a smooth progression from simple to advanced!

File diff suppressed because it is too large

docs/HARDWARE_BUYING_GUIDE.md Executable file

@@ -0,0 +1,542 @@
# Voice Assistant Hardware - Buying Guide for Second Unit
**Date:** 2025-11-29
**Context:** You have one Maix Duino (K210), planning multi-room deployment
**Question:** What should I buy for the second unit?
---
## Quick Answer
**Best Overall:** **Buy another Maix Duino K210** (~$30-40)
**Runner-up:** **ESP32-S3 with audio board** (~$20-30)
**Budget:** **Generic ESP32 + I2S** (~$15-20)
**Future-proof:** **Sipeed Maix-III** (~$60-80, when available)
---
## Analysis: Why Another Maix Duino K210?
### Pros ✅
- **Identical to first unit** - Code reuse, same workflow
- **Proven solution** - You'll know exactly what to expect
- **Stock availability** - Still widely available despite being "outdated"
- **Same accessories** - Microphones, displays, cables compatible
- **Edge detection ready** - Can upgrade to edge wake word later
- **Low cost** - ~$30-40 for full kit with LCD and camera
- **Multi-room consistency** - All units behave identically
### Cons ❌
- "Outdated" hardware (but doesn't matter for your use case)
- Limited future support from Sipeed
### Verdict: ✅ **RECOMMENDED - Best choice for consistency**
---
## Alternative Options
### Option 1: Another Maix Duino K210
**Price:** $30-40 (kit with LCD)
**Where:** AliExpress, Amazon, Seeed Studio
**Specific Model:**
- **Sipeed Maix Duino** (original, what you have)
- Includes: LCD, camera module
- Need to add: I2S microphone
**Why Choose:**
- Identical setup to first unit
- Code works without modification
- Same troubleshooting experience
- Bulk buy discount possible
**Link Examples:**
- Seeed Studio: https://www.seeedstudio.com/Sipeed-Maix-Duino-Kit-for-RISC-V-AI-IoT.html
- AliExpress: Search "Sipeed Maix Duino" (~$25-35)
---
### Option 2: Sipeed Maix Bit/Dock (K210 variant)
**Price:** $15-25 (smaller form factor)
**Differences from Maix Duino:**
- Smaller board
- May need separate LCD
- Same K210 chip
- Same capabilities
**Why Choose:**
- Cheaper
- More compact
- Same software
**Why Skip:**
- Need separate accessories
- Different form factor means different mounting
- Less convenient than all-in-one Duino
**Verdict:** ⚠️ Only if you want smaller/cheaper
---
### Option 3: ESP32-S3 with Audio Kit
**Price:** $20-30
**Chip:** ESP32-S3 (Xtensa dual-core @ 240MHz)
**Examples:**
- **ESP32-S3-Box** (~$30) - Has LCD, microphone, speaker built-in
- **Seeed XIAO ESP32-S3 Sense** (~$15) - Tiny, needs accessories
- **M5Stack Core S3** (~$50) - Premium, all-in-one
**Pros:**
- ✅ More modern than K210
- ✅ Better WiFi/BLE support
- ✅ Lower power consumption
- ✅ Active development
- ✅ Arduino/ESP-IDF support
**Cons:**
- ❌ No KPU (neural accelerator)
- ❌ Different code needed (ESP32 vs MaixPy)
- ❌ Less ML capability (for future edge wake word)
- ❌ Different ecosystem
**Best ESP32-S3 Choice:** **ESP32-S3-Box**
- All-in-one like your Maix Duino
- Built-in mic, speaker, LCD
- Good for server-side wake word
- Cheaper than Maix Duino
**Verdict:** 🤔 Good alternative if you want to experiment
---
### Option 4: Raspberry Pi Zero 2 W
**Price:** $15-20 (board only, need accessories)
**Pros:**
- ✅ Full Linux
- ✅ Familiar ecosystem
- ✅ Tons of support
- ✅ Easy Python development
**Cons:**
- ❌ No neural accelerator
- ❌ No dedicated audio hardware
- ❌ More power hungry (~500mW vs 200mW)
- ❌ Overkill for audio streaming
- ❌ Need USB sound card or I2S HAT
- ❌ Larger form factor
**Verdict:** ❌ Not ideal for this project
---
### Option 5: Sipeed Maix-III AXera-Pi (Future)
**Price:** $60-80 (when available)
**Chip:** AX620A (much more powerful than K210)
**Pros:**
- ✅ Modern hardware (2023)
- ✅ Better AI performance
- ✅ Linux + Python support
- ✅ Sipeed ecosystem continuity
- ✅ Great for edge wake word
**Cons:**
- ❌ More expensive
- ❌ Newer = less community support
- ❌ Overkill for server-side wake word
- ❌ Stock availability varies
**Verdict:** 🔮 Future-proof option if budget allows
---
### Option 6: Generic ESP32 + I2S Breakout
**Price:** $10-15 (cheapest option)
**What You Need:**
- ESP32 DevKit (~$5)
- I2S MEMS mic (~$5)
- Optional: I2S speaker amp (~$5)
**Pros:**
- ✅ Cheapest option
- ✅ Minimal, focused on audio only
- ✅ Very low power
- ✅ WiFi built-in
**Cons:**
- ❌ No LCD (would need separate)
- ❌ No camera
- ❌ DIY assembly required
- ❌ No neural accelerator
- ❌ Different code from K210
**Verdict:** 💰 Budget choice, but less polished
---
## Comparison Table
| Option | Price | Same Code? | LCD | AI Accel | Best For |
|--------|-------|------------|-----|----------|----------|
| **Maix Duino K210** | $30-40 | ✅ Yes | ✅ Included | ✅ KPU | **Multi-room consistency** |
| Maix Bit/Dock (K210) | $15-25 | ✅ Yes | ⚠️ Optional | ✅ KPU | Compact/Budget |
| ESP32-S3-Box | $25-35 | ❌ No | ✅ Included | ❌ No | Modern alternative |
| ESP32-S3 DIY | $15-25 | ❌ No | ❌ No | ❌ No | Custom build |
| Raspberry Pi Zero 2 W | $30+ | ❌ No | ❌ No | ❌ No | Linux/overkill |
| Maix-III | $60-80 | ⚠️ Similar | ✅ Varies | ✅ NPU | Future-proof |
| Generic ESP32 | $10-15 | ❌ No | ❌ No | ❌ No | Absolute budget |
---
## Recommended Purchase Plan
### Phase 1: Second Identical Unit (NOW)
**Buy:** Sipeed Maix Duino K210 (same as first)
**Cost:** ~$30-40
**Why:** Code reuse, proven solution, multi-room consistency
**What to Order:**
- [ ] Sipeed Maix Duino board with LCD and camera
- [ ] I2S MEMS microphone (if not included)
- [ ] Small speaker or audio output (3-5W)
- [ ] USB-C cable
- [ ] MicroSD card (4GB+)
**Total Cost:** ~$40-50 with accessories
---
### Phase 2: Third+ Units (LATER)
**Option A:** More Maix Duinos (if still available)
**Option B:** Switch to ESP32-S3-Box for variety/testing
**Option C:** Wait for Maix-III if you want cutting edge
---
## Where to Buy Maix Duino
### Recommended Sellers
**1. Seeed Studio (Official Partner)**
- URL: https://www.seeedstudio.com/
- Search: "Sipeed Maix Duino"
- Price: ~$35-45
- Shipping: International, good support
- **Pro:** Official, reliable, good documentation
- **Con:** Can be out of stock
**2. AliExpress (Direct from Sipeed/China)**
- Search: "Sipeed Maix Duino"
- Price: ~$25-35
- Shipping: 2-4 weeks (free or cheap)
- **Pro:** Cheapest, often bundled with accessories
- **Con:** Longer shipping, variable quality control
- **Tip:** Look for "Sipeed Official Store"
**3. Amazon**
- Search: "Maix Duino K210"
- Price: ~$40-50
- Shipping: Fast (Prime eligible sometimes)
- **Pro:** Fast shipping, easy returns
- **Con:** Higher price, limited stock
**4. Adafruit / SparkFun**
- May carry Sipeed products
- Higher price but US-based support
- Check availability
---
## Accessories to Buy
### Essential (for each unit)
**1. I2S MEMS Microphone**
- **Recommended:** Adafruit I2S MEMS Microphone Breakout (~$7)
- Model: SPH0645LM4H
- URL: https://www.adafruit.com/product/3421
- **Alternative:** INMP441 I2S Microphone (~$3 on AliExpress)
- Cheaper, works well
- Search: "INMP441 I2S microphone"
**2. Speaker / Audio Output**
- **Option A:** Small 3-5W speaker (~$5-10)
- Search: "3W 8 ohm speaker"
- **Option B:** I2S speaker amplifier + speaker
- MAX98357A I2S amp (~$5)
- 4-8 ohm speaker (~$5)
- **Option C:** Line out to existing speakers (cheapest)
**3. MicroSD Card**
- 4GB or larger
- FAT32 formatted
- Class 10 recommended
- ~$5
**4. USB-C Cable**
- For power and programming
- ~$3-5
---
### Optional but Nice
**1. Enclosure/Case**
- 3D print custom case
- Find STL files on Thingiverse
- Or use small project box (~$5)
**2. Microphone Array** (for better pickup)
- 2 or 4-mic array board (~$15-25)
- Better voice detection
- Phase 2+ enhancement
**3. Battery Pack** (for portable testing)
- USB-C power bank
- Makes testing easier
- Already have? Use it!
**4. Mounting Hardware**
- Velcro strips
- 3M command strips
- Wall mount brackets
- ~$5
---
## Multi-Unit Strategy
### Same Hardware (Recommended)
**Buy:** 2-4x Maix Duino K210 units
**Benefit:**
- All units identical
- Same code deployment
- Easy troubleshooting
- Bulk buy discount
**Deployment:**
- Unit 1: Living room
- Unit 2: Bedroom
- Unit 3: Kitchen
- Unit 4: Office
### Mixed Hardware (Experimental)
**Buy:**
- 2x Maix Duino K210 (proven)
- 1x ESP32-S3-Box (modern)
- 1x Maix-III (future-proof)
**Benefit:**
- Test different platforms
- Evaluate performance
- Future-proofing
**Drawback:**
- More complex code
- Different troubleshooting
- Inconsistent UX
**Verdict:** ⚠️ Only if you want to experiment
---
## Budget Options
### Ultra-Budget Multi-Room (~$50 total)
- 2x Generic ESP32 + I2S mic ($10 each = $20)
- 2x Speakers ($5 each = $10)
- 2x SD cards ($5 each = $10)
- Cables ($10)
- **Total:** ~$50 for 2 units
**Pros:** Cheap
**Cons:** No LCD, DIY assembly, different code
---
### Mid-Budget Multi-Room (~$100 total)
- 2x Maix Duino K210 ($35 each = $70)
- 2x I2S mics ($5 each = $10)
- 2x Speakers ($5 each = $10)
- Accessories ($10)
- **Total:** ~$100 for 2 units
**Pros:** Proven, consistent, LCD included
**Cons:** "Outdated" hardware (doesn't matter for your use)
---
### Premium Multi-Room (~$200 total)
- 2x Maix-III AXera-Pi ($70 each = $140)
- 2x I2S mics ($10 each = $20)
- 2x Speakers ($10 each = $20)
- Accessories ($20)
- **Total:** ~$200 for 2 units
**Pros:** Future-proof, modern, powerful
**Cons:** More expensive, newer = less support
---
## My Recommendation
### For Second Unit: Buy Another Maix Duino K210 ✅
**Reasoning:**
1. **Code reuse** - Everything you develop for unit 1 works on unit 2
2. **Known quantity** - No surprises, you know it works
3. **Multi-room consistency** - All units behave the same
4. **Edge wake word ready** - Can upgrade later if desired
5. **Cost-effective** - ~$40 for full kit with LCD
6. **Stock available** - Still widely sold despite being "outdated"
**Where to Buy:**
- **Best:** AliExpress "Sipeed Official Store" (~$30 + shipping)
- **Fastest:** Amazon (~$45 with Prime)
- **Support:** Seeed Studio (~$40 + shipping)
**What to Order:**
```
Shopping List for Second Unit:
[ ] 1x Sipeed Maix Duino Kit (board + LCD + camera) - $30-35
[ ] 1x I2S MEMS microphone (INMP441 or SPH0645) - $5-7
[ ] 1x Small speaker (3W, 8 ohm) - $5-10
[ ] 1x MicroSD card (8GB+, Class 10) - $5
[ ] 1x USB-C cable - $3-5
[ ] Optional: Enclosure/mounting - $5-10
Total: ~$50-75 (depending on shipping and options)
```
---
### For Third+ Units: Evaluate
By the time you're ready for 3rd/4th units:
- You'll have experience with K210
- You'll know if you want consistency (more K210s)
- Or variety (try ESP32-S3 or Maix-III)
- Maix-III may have better availability
- Prices may have changed
**Decision:** Revisit when units 1 and 2 are working
---
## Future-Proofing Considerations
### Will K210 be Supported?
- **MaixPy:** Still actively maintained for K210
- **Community:** Large existing user base
- **Models:** Pre-trained models still work
- **Lifespan:** Good for 3-5+ years
**Verdict:** ✅ Safe to buy more K210s now
### When to Switch Hardware?
Consider switching when:
- [ ] K210 becomes hard to find
- [ ] You need better performance (edge ML)
- [ ] Power consumption is critical
- [ ] New features require newer hardware
**Timeline:** Probably 2-3 years out
---
## Special Considerations
### Different Rooms, Different Needs?
**Living Room (Primary):**
- Needs: Best audio, LCD display, polish
- **Hardware:** Maix Duino K210 with all features
**Bedroom (Secondary):**
- Needs: Simple, no bright LCD at night
- **Hardware:** Maix Duino K210, disable LCD at night
**Kitchen (Ambient Noise):**
- Needs: Better microphone array
- **Hardware:** Maix Duino K210 + 4-mic array
**Office (Minimal):**
- Needs: Cheap, basic audio only
- **Hardware:** Generic ESP32 + I2S mic
### All Same vs Customized?
**Recommendation:** Start with all same (Maix Duino), customize later if needed.
---
## Action Plan
### This Week
1. **Order second Maix Duino K210** (~$30-40)
2. **Order I2S microphone** (~$5-7)
3. **Order speaker** (~$5-10)
4. **Order SD card** (~$5)
**Total Investment:** ~$50-65
### Next Month
1. Wait for delivery (2-4 weeks from AliExpress)
2. Test unit 1 while waiting
3. Refine code and setup process
4. Prepare for unit 2 deployment
### In 2-3 Months
1. Deploy unit 2 (should be easy after unit 1)
2. Test multi-room
3. Decide on unit 3/4 based on experience
4. Consider bulk order if expanding
---
## Summary
**Buy for Second Unit:**
- ✅ **Sipeed Maix Duino K210** (same as first) - ~$35
- ✅ **I2S MEMS microphone** (INMP441) - ~$5
- ✅ **Small speaker** (3W, 8 ohm) - ~$8
- ✅ **MicroSD card** (8GB Class 10) - ~$5
- ✅ **USB-C cable** - ~$5
**Total:** ~$60 shipped
**Why:** Code reuse, consistency, proven solution, future-expandable
**Where:** AliExpress (cheap) or Amazon (fast)
**When:** Order now, 2-4 weeks delivery
**Third+ Units:** Decide after testing 2 units (probably buy more K210s)
---
## Quick Links
**Official Sipeed Store (AliExpress):**
https://sipeed.aliexpress.com/store/1101739727
**Seeed Studio:**
https://www.seeedstudio.com/catalogsearch/result/?q=maix+duino
**Amazon Search:**
"Sipeed Maix Duino K210"
**Microphone (Adafruit):**
https://www.adafruit.com/product/3421
**Alternative Mic (AliExpress):**
Search: "INMP441 I2S microphone breakout"
---
**Happy Building! 🏠🎙️**

# K210 Performance Verification for Voice Assistant
**Date:** 2025-11-29
**Source:** https://github.com/sipeed/MaixPy Performance Comparison
**Question:** Is K210 suitable for our Mycroft Precise wake word detection project?
---
## K210 Specifications
- **Processor:** K210 dual-core RISC-V @ 400MHz
- **AI Accelerator:** KPU (Neural Network Processor)
- **SRAM:** 8MB
- **Status:** Considered "outdated" by Sipeed (2018 release)
---
## Performance Comparison (from MaixPy GitHub)
### YOLOv2 Object Detection
| Chip | Performance | Notes |
|------|------------|-------|
| K210 | 1.8 ms | Limited to older models |
| V831 | 20-40 ms | More modern, but slower |
| R329 | N/A | Newer hardware |
### Our Use Case: Audio Processing
**For wake word detection, we need:**
- Audio input (16kHz, mono) ✅ K210 has I2S
- Real-time processing ✅ K210 KPU can handle this
- Network communication ✅ K210 has ESP32 WiFi
- Low latency (<100ms) ✅ Achievable
---
## Deployment Strategy Analysis
### Option A: Server-Side Wake Word (Recommended)
**K210 Role:** Audio I/O only
- Capture audio from I2S microphone ✅ Well supported
- Stream to Heimdall via WiFi ✅ No problem
- Receive and play TTS audio ✅ Works fine
- LED/display feedback ✅ Easy
**K210 Requirements:** MINIMAL
- No AI processing needed
- Simple audio streaming
- Network communication only
- **Verdict:** ✅ K210 is MORE than capable
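The latency side of this verdict is mostly buffering arithmetic. As a sketch, assuming a hypothetical 2048-sample streaming chunk (our choice for illustration, not a measured value):

```python
def chunk_latency_ms(chunk_samples, sample_rate_hz=16000):
    """Time needed to fill one audio chunk before it can be streamed."""
    return 1000.0 * chunk_samples / sample_rate_hz

buffering = chunk_latency_ms(2048)   # 128.0 ms per chunk at 16 kHz
```

128 ms of buffering leaves roughly 70 ms for WiFi transit and server-side detection inside a 200 ms target; smaller chunks trade buffering delay for more network overhead.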
### Option B: Edge Wake Word (Future)
**K210 Role:** Wake word detection on-device
- Load KMODEL wake word model ⚠️ Needs conversion
- Run inference on KPU ⚠️ Quantization required
- Detect wake word locally ⚠️ Possible but limited
**K210 Limitations:**
- KMODEL conversion complex (TF→ONNX→KMODEL)
- Quantization may reduce accuracy (80-90% vs 95%+)
- Limited to simpler models
- **Verdict:** ⚠️ Possible but challenging
---
## Why K210 is PERFECT for Our Project
### 1. We're Starting with Server-Side Detection
- K210 only does audio I/O
- All AI processing on Heimdall (powerful server)
- No need for cutting-edge hardware
- **K210 is ideal for this role**
### 2. Audio Processing is Not Computationally Intensive
Unlike YOLOv2 (60 FPS video processing):
- Audio: 16kHz sample rate = 16,000 samples/second
- Wake word: Simple streaming
- No real-time neural network inference needed (server-side)
- **K210's "old" specs don't matter**
### 3. Edge Detection is Optional (Future Enhancement)
- We can prove the concept with server-side first
- Edge detection is a nice-to-have optimization
- If we need edge later, we can:
- Use simpler wake word models
- Accept slightly lower accuracy
- Or upgrade hardware then
- **Starting point doesn't require latest hardware**
### 4. K210 Advantages We Actually Care About
- ✅ Well-documented (mature platform)
- ✅ Stable MaixPy firmware
- ✅ Large community and examples
- ✅ Proven audio processing
- ✅ Already have the hardware!
- ✅ Cost-effective ($30 vs $100+ newer boards)
---
## Performance Targets vs K210 Capabilities
### What We Need:
- Audio capture: 16kHz, 1 channel ✅ K210: Easy
- Audio streaming: ~128 kbps over WiFi ✅ K210: No problem
- Wake word latency: <200ms ✅ K210: Achievable (server-side)
- LED feedback: Instant ✅ K210: Trivial
- Audio playback: 16kHz TTS ✅ K210: Supported
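The ~128 kbps streaming figure can be sanity-checked with quick arithmetic: raw 16-bit PCM at 16 kHz mono is actually 256 kbps, so ~128 kbps implies 8-bit samples or roughly 2:1 compression (an assumption on our part, not something the streaming code here specifies):

```python
def pcm_bitrate_kbps(sample_rate_hz, bits_per_sample, channels=1):
    """Raw PCM bitrate in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000

raw = pcm_bitrate_kbps(16000, 16)      # 256.0 kbps: raw 16-bit mono
compact = pcm_bitrate_kbps(16000, 8)   # 128.0 kbps: 8-bit or ~2:1 compressed
```

Either figure is trivial for the K210's WiFi link, which is the point of the comparison above.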
### What We DON'T Need (for initial deployment):
- ❌ Real-time video processing
- ❌ Complex neural networks on device
- ❌ Multi-model inference
- ❌ High-resolution image processing
- ❌ Latest and greatest AI accelerator
---
## Comparison to Alternatives
### If we bought newer hardware:
**V831 ($50-70):**
- Pros: Newer, better supported
- Cons:
- More expensive
- SLOWER at neural networks than K210
- Still need server for Whisper anyway
- Overkill for audio I/O
**ESP32-S3 ($10-20):**
- Pros: Cheap, WiFi built-in
- Cons:
- No KPU (if we want edge detection later)
- Less capable for ML
- Would work for server-side though
**Raspberry Pi Zero 2 W ($15):**
- Pros: Full Linux, familiar
- Cons:
- No dedicated audio hardware
- No neural accelerator
- More power hungry
- Overkill for our needs
**Verdict:** K210 is actually the sweet spot for this project!
---
## Real-World Comparison
### What K210 CAN Do (Proven):
- Audio classification ✅
- Simple keyword spotting ✅
- Voice activity detection ✅
- Audio streaming ✅
- Multi-microphone beamforming ✅
### What We're Asking It To Do:
- Stream audio to server ✅ Much easier
- (Optional future) Simple wake word detection ✅ Proven capability
---
## Recommendation: Proceed with K210
### Phase 1: Server-Side (Now)
K210 role: Audio I/O device
- **Difficulty:** Easy
- **Performance:** Excellent
- **K210 utilization:** ~10-20%
- **Status:** No concerns whatsoever
### Phase 2: Edge Detection (Future)
K210 role: Wake word detection + audio I/O
- **Difficulty:** Moderate (model conversion)
- **Performance:** Good enough (80-90% accuracy)
- **K210 utilization:** ~30-40%
- **Status:** Feasible, community has done it
---
## Conclusion
**Is K210 outdated?** Yes, for cutting-edge ML applications.
**Is K210 suitable for our project?** ABSOLUTELY YES!
**Why:**
1. We're using server-side processing (K210 just streams audio)
2. K210's audio capabilities are excellent
3. Mature platform = more examples and stability
4. Already have the hardware
5. Cost-effective
6. Can optionally upgrade to edge detection later
**The "outdated" warning is for people wanting latest ML performance. We're using it as an audio I/O device with WiFi - it's perfect for that!**
---
## Additional Notes
### From MaixPy GitHub Warning:
> "We now recommend users choose the MaixCAM ... For 2018 K210 ... limited performance"
**Our Response:**
- We don't need 2024 performance for audio streaming
- Server does the heavy lifting (Heimdall with NVIDIA GPU)
- K210 mature platform is actually an advantage
- If we need more later, we can upgrade edge device while keeping server
### Community Validation:
Many Mycroft Precise + K210 projects exist:
- Audio streaming: Proven ✅
- Edge wake word: Proven ✅
- Full voice assistant: Proven ✅
**The K210 is "outdated" for video/vision ML, not for audio projects.**
---
**Final Verdict:** ✅ PROCEED WITH CONFIDENCE
The K210 is perfect for our use case. Ignore the "outdated" warning - that's for people doing real-time video processing or wanting the latest ML features. For a voice assistant where the heavy lifting happens server-side, the K210 is an excellent, mature, cost-effective choice!

docs/LCD_CAMERA_FEATURES.md
# Maix Duino LCD & Camera Feature Analysis
**Date:** 2025-11-29
**Hardware:** Sipeed Maix Duino (K210)
**Question:** What's the overhead for using LCD display and camera?
---
## Hardware Capabilities
### LCD Display
- **Resolution:** Typically 320x240 or 240x135 (depending on model)
- **Interface:** SPI
- **Color:** RGB565 (16-bit color)
- **Frame Rate:** Up to 60 FPS (limited by SPI bandwidth)
- **Status:** ✅ Included with most Maix Duino kits
### Camera
- **Resolution:** Various (OV2640 common: 2MP, up to 1600x1200)
- **Interface:** DVP (Digital Video Port)
- **Frame Rate:** Up to 60 FPS (lower at high resolution)
- **Status:** ✅ Often included with Maix Duino kits
### K210 Resources
- **CPU:** Dual-core RISC-V @ 400MHz
- **KPU:** Neural network accelerator
- **SRAM:** 8MB total (6MB available for apps)
- **Flash:** 16MB
---
## LCD Usage for Voice Assistant
### Use Case 1: Status Display (Minimal Overhead)
**What to Show:**
- Current state (idle/listening/processing/responding)
- Wake word detected indicator
- WiFi status and signal strength
- Server connection status
- Volume level
- Time/date
**Overhead:**
- **CPU:** ~2-5% (simple text/icons)
- **RAM:** ~200KB (framebuffer + assets)
- **Power:** ~50mW additional
- **Complexity:** Low (MaixPy has built-in LCD support)
**Code Example:**
```python
import lcd
import image
lcd.init()
lcd.rotation(2) # Rotate if needed
# Simple status display
img = image.Image(size=(320, 240))
img.draw_string(10, 10, "Listening...", color=(0, 255, 0), scale=3)
img.draw_circle(300, 20, 10, color=(0, 255, 0), fill=True) # Status LED
lcd.display(img)
```
**Verdict:** ✅ **Very Low Overhead - Highly Recommended**
---
### Use Case 2: Audio Waveform Visualizer (Moderate Overhead)
#### Input Waveform (Microphone)
**What to Show:**
- Real-time audio level meter
- Waveform display (oscilloscope style)
- VU meter
- Frequency spectrum (simple bars)
**Overhead:**
- **CPU:** ~10-15% (real-time drawing)
- **RAM:** ~300KB (framebuffer + audio buffer)
- **Frame Rate:** 15-30 FPS (sufficient for audio visualization)
- **Complexity:** Moderate (drawing primitives + FFT)
**Implementation:**
```python
# MaixPy-style sketch; the audio capture API varies by firmware build
import lcd, audio, image
lcd.init()
audio.init()
def draw_waveform(audio_buffer):
img = image.Image(size=(320, 240))
# Draw waveform
width = 320
height = 240
center = height // 2
# Sample every Nth point to fit on screen
step = len(audio_buffer) // width
for x in range(width - 1):
y1 = center + (audio_buffer[x * step] // 256)
y2 = center + (audio_buffer[(x + 1) * step] // 256)
img.draw_line(x, y1, x + 1, y2, color=(0, 255, 0))
# Add level meter
level = max(abs(min(audio_buffer)), abs(max(audio_buffer)))
bar_height = (level * height) // 32768
img.draw_rectangle(0, height - bar_height, 20, bar_height,
color=(0, 255, 0), fill=True)
lcd.display(img)
```
**Verdict:** ✅ **Moderate Overhead - Feasible and Cool!**
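The level-meter math inside the drawing loop above is plain integer arithmetic, so it can be pulled out and unit-tested on the host (assuming signed 16-bit samples, as used throughout):

```python
def level_bar(samples, height=240):
    """Map the peak |amplitude| of signed 16-bit samples to a bar
    height in pixels -- same arithmetic as the drawing loop."""
    peak = max(abs(min(samples)), abs(max(samples)))
    return (peak * height) // 32768

# a half-scale peak (16384) fills half of a 240px screen
```

Keeping helpers like this free of `lcd`/`image` calls makes the visualizer logic testable without hardware.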
---
#### Output Waveform (TTS Response)
**What to Show:**
- TTS audio being played back
- Speaking animation (mouth/sound waves)
- Response text scrolling
**Overhead:**
- **CPU:** ~10-15% (similar to input)
- **RAM:** ~300KB
- **Complexity:** Moderate
**Note:** Can reuse same visualization code as input waveform.
**Verdict:** ✅ **Same as Input - Totally Doable**
---
### Use Case 3: Spectrum Analyzer (Higher Overhead)
**What to Show:**
- Frequency bars (FFT visualization)
- 8-16 frequency bands
- Classic "equalizer" look
**Overhead:**
- **CPU:** ~20-30% (FFT computation + drawing)
- **RAM:** ~500KB (FFT buffers + framebuffer)
- **Complexity:** Moderate-High (FFT required)
**Implementation Note:**
- The K210 includes a hardware FFT accelerator (separate from the KPU)
- Can do simple 8-band analysis with minimal CPU
- More bands = more CPU
**Verdict:** ⚠️ **Higher Overhead - Use Sparingly**
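For reference, here is a host-side sketch of the band computation (a naive O(n²) DFT and our own `band_energies` helper, purely illustrative; a real K210 build would use the hardware FFT):

```python
import math

def band_energies(samples, n_bands=8):
    """Naive DFT of a small window; magnitudes grouped into n_bands
    equal-width bands (the classic 'equalizer bars' look)."""
    n = len(samples)
    half = n // 2                       # keep bins below Nyquist only
    mags = []
    for k in range(half):
        re = sum(samples[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = sum(samples[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im))
    per_band = half // n_bands
    return [sum(mags[b * per_band:(b + 1) * per_band]) for b in range(n_bands)]
```

With a 64-sample window at 16 kHz the bins are 250 Hz wide, so each of the 8 bands spans 1 kHz; a 3 kHz tone lands in band 3. The O(n²) cost is exactly why this mode is flagged "use sparingly" above.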
---
### Use Case 4: Interactive UI (High Overhead)
**What to Show:**
- Touchscreen controls (if touchscreen available)
- Settings menu
- Volume slider
- Wake word selection
- Network configuration
**Overhead:**
- **CPU:** ~20-40% (touch detection + UI rendering)
- **RAM:** ~1MB (UI framework + assets)
- **Complexity:** High (need UI framework)
**Verdict:** ⚠️ **High Overhead - Nice-to-Have Later**
---
## Camera Usage for Voice Assistant
### Use Case 1: Person Detection (Wake on Face)
**What to Do:**
- Detect person in frame
- Only listen when someone present
- Privacy mode: disable when no one around
**Overhead:**
- **CPU:** ~30-40% (KPU handles inference)
- **RAM:** ~1.5MB (model + frame buffers)
- **Power:** ~200mW additional
- **Complexity:** Moderate (pre-trained models available)
**Pros:**
- ✅ Privacy enhancement (only listen when occupied)
- ✅ Power saving (sleep when empty room)
- ✅ Pre-trained models available for K210
**Cons:**
- ❌ Adds latency (check camera before listening)
- ❌ Privacy concerns (camera always on)
- ❌ Moderate resource usage
**Verdict:** 🤔 **Interesting but Complex - Phase 2+**
---
### Use Case 2: Visual Context (Future AI Integration)
**What to Do:**
- "What am I holding?" queries
- Visual scene understanding
- QR code scanning
- Gesture control
**Overhead:**
- **CPU:** 40-60% (vision processing)
- **RAM:** 2-3MB (models + buffers)
- **Complexity:** High (requires vision models)
**Verdict:** ❌ **Too Complex for Initial Release - Future Feature**
---
### Use Case 3: Visual Wake Word (Gesture Detection)
**What to Do:**
- Wave hand to activate
- Thumbs up/down for feedback
- Alternative to voice wake word
**Overhead:**
- **CPU:** ~30-40% (gesture detection)
- **RAM:** ~1.5MB
- **Complexity:** Moderate-High
**Verdict:** 🤔 **Novel Idea - Phase 3+**
---
## Recommended LCD Implementation
### Phase 1: Basic Status Display (Recommended NOW)
```
┌─────────────────────────┐
│ Voice Assistant │
│ │
│ Status: Listening ● │
│ WiFi: ████░░ 75% │
│ Server: Connected │
│ │
│ Volume: [██████░░░] │
│ │
│ Time: 14:23 │
└─────────────────────────┘
```
**Features:**
- Current state indicator
- WiFi signal strength
- Server connection status
- Volume level bar
- Clock
- Wake word indicator (pulsing circle)
**Overhead:** ~2-5% CPU, 200KB RAM
---
### Phase 2: Waveform Visualization (Cool Addition)
```
┌─────────────────────────┐
│ Listening... [●] │
├─────────────────────────┤
│ ╱╲ ╱╲ ╱╲ ╱╲ │
│ ╲╱  ╲  ╲╱  ╲ │
│ │
│ Level: [████░░░░░░] │
└─────────────────────────┘
```
**Features:**
- Real-time waveform (15-30 FPS)
- Audio level meter
- State indicator
- Simple and clean
**Overhead:** ~10-15% CPU, 300KB RAM
---
### Phase 3: Enhanced Visualizer (Polish)
```
┌─────────────────────────┐
│ Hey Computer! [●] │
├─────────────────────────┤
│ ▁▂▃▄▅▆▇█ ▁▂▃▄▅▆▇█ │
│ ▁▂▃▄▅▆▇█ ▁▂▃▄▅▆▇█ │
│ │
│ "Turn off the lights" │
└─────────────────────────┘
```
**Features:**
- Spectrum analyzer (8-16 bands)
- Transcription display
- Animated response
- More polished UI
**Overhead:** ~20-30% CPU, 500KB RAM
---
## Resource Budget Analysis
### Total K210 Resources
- **CPU:** 2 cores @ 400MHz (assume ~100% available)
- **RAM:** 6MB available for app
- **Bandwidth:** SPI (LCD), I2S (audio), WiFi
### Current Voice Assistant Usage (Server-Side Wake Word)
| Component | CPU % | RAM (KB) |
|-----------|-------|----------|
| Audio Capture (I2S) | 5% | 128 |
| Audio Playback | 5% | 128 |
| WiFi Streaming | 10% | 256 |
| Network Stack | 5% | 512 |
| MaixPy Runtime | 10% | 1024 |
| **Base Total** | **35%** | **~2MB** |
### With LCD Features
| Display Mode | CPU % | RAM (KB) | Total CPU | Total RAM |
|--------------|-------|----------|-----------|-----------|
| **None** | 0% | 0 | 35% | 2MB |
| **Status Only** | 2-5% | 200 | 37-40% | 2.2MB |
| **Waveform** | 10-15% | 300 | 45-50% | 2.3MB |
| **Spectrum** | 20-30% | 500 | 55-65% | 2.5MB |
### With Camera Features
| Feature | CPU % | RAM (KB) | Feasible? |
|---------|-------|----------|-----------|
| Person Detection | 30-40% | 1500 | ⚠️ Tight |
| Gesture Control | 30-40% | 1500 | ⚠️ Tight |
| Visual Context | 40-60% | 2500 | ❌ Too much |
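The two tables above combine into a quick feasibility check. The numbers below are representative midpoints pulled from those tables (estimates, as noted throughout, not measurements):

```python
BASE = {"cpu_pct": 35, "ram_kb": 2048}          # server-side wake word baseline
FEATURES = {
    "status":        {"cpu_pct": 4,  "ram_kb": 200},
    "waveform":      {"cpu_pct": 13, "ram_kb": 300},
    "spectrum":      {"cpu_pct": 25, "ram_kb": 500},
    "person_detect": {"cpu_pct": 35, "ram_kb": 1500},
}

def budget(features, cpu_limit_pct=80, ram_limit_kb=6144):
    """Sum baseline + selected features; flag combos that exceed a
    comfortable CPU margin or the ~6MB of app RAM."""
    cpu = BASE["cpu_pct"] + sum(FEATURES[f]["cpu_pct"] for f in features)
    ram = BASE["ram_kb"] + sum(FEATURES[f]["ram_kb"] for f in features)
    return {"cpu_pct": cpu, "ram_kb": ram,
            "fits": cpu <= cpu_limit_pct and ram <= ram_limit_kb}
```

`budget(["waveform"])` comes out to 48% CPU and ~2.3MB, matching the table; stacking the spectrum analyzer on top of person detection blows through the CPU margin, which is the core argument for skipping camera features initially.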
---
## Recommendations
### ✅ IMPLEMENT NOW: Basic Status Display
- **Why:** Very low overhead, huge UX improvement
- **Overhead:** 2-5% CPU, 200KB RAM
- **Benefit:** Users know what's happening at a glance
- **Difficulty:** Easy (MaixPy has good LCD support)
### ✅ IMPLEMENT SOON: Waveform Visualizer
- **Why:** Cool factor, moderate overhead
- **Overhead:** 10-15% CPU, 300KB RAM
- **Benefit:** Engaging, confirms mic is working, looks professional
- **Difficulty:** Moderate (simple drawing code)
### 🤔 CONSIDER LATER: Spectrum Analyzer
- **Why:** Higher overhead, diminishing returns
- **Overhead:** 20-30% CPU, 500KB RAM
- **Benefit:** Looks cool but not essential
- **Difficulty:** Moderate-High (FFT required)
### ❌ SKIP FOR NOW: Camera Features
- **Why:** High overhead, complex, privacy concerns
- **Overhead:** 30-60% CPU, 1.5-2.5MB RAM
- **Benefit:** Novel but not core functionality
- **Difficulty:** High (model integration, privacy handling)
---
## Implementation Priority
### Phase 1 (Week 1): Core Functionality
- [x] Audio capture and streaming
- [x] Server integration
- [ ] Basic LCD status display
- Idle/Listening/Processing states
- WiFi status
- Connection indicator
### Phase 2 (Week 2-3): Visual Enhancement
- [ ] Audio waveform visualizer
- Input (microphone) waveform
- Output (TTS) waveform
- Level meters
- Clean, minimal design
### Phase 3 (Month 2): Polish
- [ ] Spectrum analyzer option
- [ ] Animated transitions
- [ ] Settings display
- [ ] Network configuration UI (optional)
### Phase 4 (Month 3+): Advanced Features
- [ ] Camera person detection (privacy mode)
- [ ] Gesture control experiments
- [ ] Visual wake word alternative
---
## Code Structure Recommendation
```python
# main.py structure with modular display
# (display_manager / voice_client are our own modules, sketched here)
import lcd, audio
from display_manager import DisplayManager
from voice_client import VoiceClient

# Initialize
lcd.init()
display = DisplayManager(mode='waveform')  # or 'status' or 'spectrum'
voice_client = VoiceClient(server='10.1.10.71')

# Main loop
while True:
    # Audio processing
    audio_buffer = audio.capture()

    # Update display (non-blocking)
    if display.mode == 'status':
        display.show_status(state='listening', wifi_level=75)
    elif display.mode == 'waveform':
        display.show_waveform(audio_buffer)
    elif display.mode == 'spectrum':
        display.show_spectrum(audio_buffer)

    # Network communication
    voice_client.stream_audio(audio_buffer)
```
---
## Estimated Overhead
### Status Display Only
- **CPU:** 38% total (3% for display)
- **RAM:** 2.2MB total (200KB for display)
- **Battery Life:** -2% (minimal impact)
- **WiFi Latency:** No impact
- **Verdict:** ✅ Negligible impact, worth it!
### Waveform Visualizer
- **CPU:** 48% total (13% for display)
- **RAM:** 2.3MB total (300KB for display)
- **Battery Life:** -5% (minor impact)
- **WiFi Latency:** No impact (still <200ms)
- **Verdict:** ✅ Acceptable, looks great!
### Spectrum Analyzer
- **CPU:** 60% total (25% for display)
- **RAM:** 2.5MB total (500KB for display)
- **Battery Life:** -8% (noticeable)
- **WiFi Latency:** Possible minor impact
- **Verdict:** ⚠️ Usable but pushing limits
---
## Camera: Should You Use It?
### Pros
- ✅ Already have the hardware (free!)
- ✅ Novel features (person detection, gestures)
- ✅ Privacy enhancement potential
- ✅ Future-proofing
### Cons
- ❌ High resource usage (30-60% CPU, 1.5-2.5MB RAM)
- ❌ Complex implementation
- ❌ Privacy concerns (camera always on)
- ❌ Not core to voice assistant
- ❌ Competes with audio processing resources
### Recommendation
**Skip camera for initial implementation.** Focus on core voice assistant functionality. Revisit in Phase 3+ when:
1. Core features are stable
2. You want to experiment
3. You have time for optimization
4. You want to differentiate from commercial assistants
---
## Final Recommendations
### Start With (NOW):
```python
# Simple status display
# - State indicator
# - WiFi status
# - Connection status
# - Time/date
# Overhead: ~3% CPU, 200KB RAM
```
### Add Next (Week 2):
```python
# Waveform visualizer
# - Real-time audio waveform
# - Level meter
# - Clean design
# Overhead: +10% CPU, +100KB RAM
```
### Maybe Later (Month 2+):
```python
# Spectrum analyzer
# - 8-16 frequency bands
# - FFT visualization
# - Optional mode
# Overhead: +15% CPU, +200KB RAM
```
### Skip (For Now):
```python
# Camera features
# - Person detection
# - Gestures
# - Visual context
# Too complex, revisit later
```
---
## Example: Combined Status + Waveform Display
```
┌───────────────────────────────┐
│ Voice Assistant [LISTENING]│
├───────────────────────────────┤
│ │
│ ╱╲ ╱╲ ╱╲ ╱╲ ╱╲ │
│ ╲╱  ╲  ╲╱  ╲ │
│ ╲╱ ╲╱ │
│ │
│ Vol: [████████░░] WiFi: ▂▃▅█ │
│ │
│ Server: 10.1.10.71 ● 14:23 │
└───────────────────────────────┘
```
**Total Overhead:** ~15% CPU, 300KB RAM
**Impact:** Minimal, excellent UX improvement
**Coolness Factor:** 9/10
---
## Conclusion
### LCD: YES! Definitely Use It! ✅
- **Status display:** Low overhead, huge benefit
- **Waveform:** Moderate overhead, looks amazing
- **Spectrum:** Higher overhead, nice-to-have
**Recommendation:** Start with status, add waveform, consider spectrum later.
### Camera: Skip For Now ❌
- High overhead
- Complex implementation
- Not core functionality
- Revisit in Phase 3+
**Focus on nailing the voice assistant first, then add visual features incrementally!**
---
**TL;DR:** Use the LCD for status + waveform visualization (~15% overhead total). Skip the camera for now. Your K210 can easily handle this! 🎉

docs/MYCROFT_PRECISE_GUIDE.md
# Mycroft Precise Wake Word Training Guide
## Overview
Mycroft Precise is a neural network-based wake word detector that you can train on custom wake words. This guide covers two deployment approaches for your Maix Duino voice assistant:
1. **Server-side detection** (Recommended to start) - Run Precise on Heimdall
2. **Edge detection** (Advanced) - Convert model for K210 on Maix Duino
## Architecture Options
### Option A: Server-Side Wake Word Detection (Recommended)
```
Maix Duino Heimdall
┌─────────────────┐ ┌──────────────────────┐
│ Continuous │ Audio Stream │ Mycroft Precise │
│ Audio Capture │───────────────>│ Wake Word Detection │
│ │ │ │
│ LED Feedback │<───────────────│ Whisper STT │
│ Speaker Output │ Response │ HA Integration │
│ │ │ Piper TTS │
└─────────────────┘ └──────────────────────┘
```
**Pros:**
- Easier setup and debugging
- Better accuracy (more compute available)
- Easy to retrain and update models
- Can use ensemble models
**Cons:**
- Continuous audio streaming (bandwidth)
- Slightly higher latency (~100-200ms)
- Requires stable network
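With server-side detection, Heimdall has to slice the continuous PCM stream into fixed-size windows before handing them to the detector. A minimal sketch of that framing step (the `PcmFramer` name and 2048-sample frame size are our own choices for illustration, not part of the Precise API):

```python
import struct

class PcmFramer:
    """Accumulate raw 16-bit mono PCM bytes from the network and emit
    fixed-size frames, the shape a server-side wake-word loop consumes."""

    def __init__(self, chunk_samples=2048):
        self.chunk_bytes = chunk_samples * 2   # int16 -> 2 bytes/sample
        self.buf = b""

    def feed(self, data):
        """Append raw bytes; return any complete frames as tuples of int16."""
        self.buf += data
        frames = []
        while len(self.buf) >= self.chunk_bytes:
            raw, self.buf = self.buf[:self.chunk_bytes], self.buf[self.chunk_bytes:]
            frames.append(struct.unpack("<%dh" % (len(raw) // 2), raw))
        return frames
```

Because WiFi packets arrive in arbitrary sizes, the framer keeps a remainder buffer between calls; each complete frame can then be passed to the wake-word model on Heimdall.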
### Option B: Edge Detection on Maix Duino (Advanced)
```
Maix Duino Heimdall
┌─────────────────┐ ┌──────────────────────┐
│ Precise Model │ │ │
│ (K210 KPU) │ │ │
│ Wake Detection │ Audio (on wake)│ Whisper STT │
│ │───────────────>│ HA Integration │
│ Audio Capture │ │ Piper TTS │
│ LED Feedback │<───────────────│ │
└─────────────────┘ Response └──────────────────────┘
```
**Pros:**
- Lower latency (~50ms wake detection)
- Less network traffic
- Works even if server is down
- Better privacy (no continuous streaming)
**Cons:**
- Complex model conversion (TensorFlow → ONNX → KMODEL)
- Limited by K210 compute
- Harder to update models
- Requires careful optimization
## Recommended Approach: Start with Server-Side
Begin with server-side detection on Heimdall, then optimize to edge detection once everything works.
## Phase 1: Mycroft Precise Setup on Heimdall
### Install Mycroft Precise
```bash
# SSH to Heimdall
ssh alan@10.1.10.71
# Create conda environment for Precise
conda create -n precise python=3.7 -y
conda activate precise
# Install TensorFlow 1.x (Precise requires this; inside the conda env,
# --break-system-packages is unnecessary)
pip install tensorflow==1.15.5
# Install Precise
pip install mycroft-precise
# Install audio dependencies
sudo apt-get install -y portaudio19-dev sox libatlas-base-dev
# Install precise-engine (for faster inference)
wget https://github.com/MycroftAI/mycroft-precise/releases/download/v0.3.0/precise-engine_0.3.0_x86_64.tar.gz
tar xvf precise-engine_0.3.0_x86_64.tar.gz
sudo cp precise-engine/precise-engine /usr/local/bin/
sudo chmod +x /usr/local/bin/precise-engine
```
### Verify Installation
```bash
precise-engine --version
# Should output: Precise v0.3.0
precise-listen --help
# Should show help text
```
## Phase 2: Training Your Custom Wake Word
### Step 1: Collect Wake Word Samples
You'll need ~50-100 samples of your wake word. Choose something:
- 2-3 syllables long
- Easy to pronounce
- Unlikely to occur in normal speech
Example wake words:
- "Hey Computer" (recommended - similar to commercial products)
- "Okay Jarvis"
- "Hello Assistant"
- "Activate Assistant"
```bash
# Create project directory
mkdir -p ~/precise-models/hey-computer
cd ~/precise-models/hey-computer
# Record wake word samples
precise-collect
```
When prompted:
1. Type your wake word ("hey computer")
2. Press SPACE to record
3. Say the wake word clearly
4. Press SPACE to stop
5. Repeat 50-100 times
**Tips for good samples:**
- Vary your tone and speed
- Different distances from mic
- Different background noise levels
- Different pronunciations
- Have family members record too
### Step 2: Collect "Not Wake Word" Samples
Record background audio and similar-sounding phrases:
```bash
# Create not-wake-word directory
mkdir -p not-wake-word
# Record random speech, music, TV, etc.
# These help the model learn what NOT to trigger on
precise-collect -f not-wake-word/random.wav
```
Collect ~200-500 samples of:
- Normal conversation
- TV/music in background
- Similar sounding phrases ("hey commuter", "they computed", etc.)
- Ambient noise
- Other household sounds
### Step 3: Organize Training Data
```bash
# Organize samples
mkdir -p hey-computer/{wake-word,not-wake-word,test/wake-word,test/not-wake-word}
# Split samples (80% train, 20% test)
# Move 80% of wake-word samples to hey-computer/wake-word/
# Move 20% to hey-computer/test/wake-word/
# Move 80% of not-wake-word to hey-computer/not-wake-word/
# Move 20% to hey-computer/test/not-wake-word/
# Optional: incremental training, which mines extra audio for false
# positives while it trains (basic training is covered in Step 4)
precise-train-incremental hey-computer.net hey-computer/
```
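The 80/20 split described in the comments above can be scripted instead of done by hand. A minimal sketch, assuming the `hey-computer` directory layout created above and WAV recordings:

```python
import random
import shutil
from pathlib import Path

def split_samples(src, train_dir, test_dir, train_frac=0.8, seed=0):
    """Move train_frac of the WAVs in src to train_dir, the rest to test_dir."""
    files = sorted(Path(src).glob("*.wav"))
    random.Random(seed).shuffle(files)  # deterministic shuffle for reproducibility
    cut = int(len(files) * train_frac)
    for dest, group in ((train_dir, files[:cut]), (test_dir, files[cut:])):
        Path(dest).mkdir(parents=True, exist_ok=True)
        for f in group:
            shutil.move(str(f), str(Path(dest) / f.name))
    return cut, len(files) - cut
```

Run it once for the wake-word recordings (into `hey-computer/wake-word/` and `hey-computer/test/wake-word/`) and once for the not-wake-word recordings.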
### Step 4: Train the Model
```bash
# Basic training (will take 30-60 minutes)
precise-train -e 60 hey-computer.net hey-computer/
# For better accuracy, train longer
precise-train -e 120 hey-computer.net hey-computer/
# Watch for overfitting - validation loss should decrease
# Stop if validation loss starts increasing
```
Training output will show:
```
Epoch 1/60
loss: 0.4523 - val_loss: 0.3891
Epoch 2/60
loss: 0.3102 - val_loss: 0.2845
...
```
### Step 5: Test the Model
```bash
# Test with microphone
precise-listen hey-computer.net
# Speak your wake word - should see "!" when detected
# Speak other phrases - should not trigger
# Test with audio files
precise-test hey-computer.net hey-computer/test/
# Should show accuracy metrics:
# Wake word accuracy: 95%+
# False positive rate: <5%
```
### Step 6: Optimize Sensitivity
```bash
# Adjust activation threshold
precise-listen hey-computer.net -t 0.5 # Default
precise-listen hey-computer.net -t 0.7 # More conservative
precise-listen hey-computer.net -t 0.3 # More aggressive
# Find optimal threshold for your use case
# Higher = fewer false positives, more false negatives
# Lower = more false positives, fewer false negatives
```
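The tradeoff can be made concrete on a held-out test set: score each clip once, then compute miss and false-trigger rates at several thresholds. A sketch with made-up confidence scores; in practice, use per-clip scores from your own evaluation:

```python
def rates(wake_scores, not_wake_scores, threshold):
    """False-negative (missed wake) and false-positive rates at a threshold."""
    fn = sum(s < threshold for s in wake_scores) / len(wake_scores)
    fp = sum(s >= threshold for s in not_wake_scores) / len(not_wake_scores)
    return fn, fp

# Hypothetical per-clip confidences
wake = [0.92, 0.85, 0.61, 0.42]
not_wake = [0.12, 0.33, 0.55, 0.08]
for t in (0.3, 0.5, 0.7):
    fn, fp = rates(wake, not_wake, t)
    print(f"threshold {t}: miss rate {fn:.2f}, false-positive rate {fp:.2f}")
```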
## Phase 3: Integration with Voice Server
### Update voice_server.py
Add Mycroft Precise support to the server:
```python
# Add to imports
import os
from precise_runner import PreciseEngine, PreciseRunner
import pyaudio

# Add to configuration
PRECISE_MODEL = os.getenv("PRECISE_MODEL",
                          "/home/alan/precise-models/hey-computer.net")
PRECISE_SENSITIVITY = float(os.getenv("PRECISE_SENSITIVITY", "0.5"))

# Global precise runner
precise_runner = None

def on_activation():
    """Called when wake word is detected"""
    print("Wake word detected!")
    # Trigger recording and processing
    # (Implementation depends on your audio streaming setup)

def start_precise_listener():
    """Start Mycroft Precise wake word detection"""
    global precise_runner
    engine = PreciseEngine(
        '/usr/local/bin/precise-engine',
        PRECISE_MODEL
    )
    precise_runner = PreciseRunner(
        engine,
        sensitivity=PRECISE_SENSITIVITY,
        on_activation=on_activation
    )
    precise_runner.start()
    print(f"Precise listening with model: {PRECISE_MODEL}")
```
### Server-Side Wake Word Detection Architecture
For server-side detection, you need continuous audio streaming from Maix Duino:
```python
# New endpoint for audio streaming
# Minimal sketch: incoming chunks are written into a ReadWriteStream
# that Precise consumes (pass stream=audio_stream when creating the
# PreciseRunner instead of letting it open the microphone)
from flask import request, jsonify
from precise_runner import ReadWriteStream

audio_stream = ReadWriteStream()

@app.route('/stream', methods=['POST'])
def stream_audio():
    """
    Receive continuous audio stream for wake word detection

    Each POST body is a raw 16kHz 16-bit mono PCM chunk; it is written
    into the stream that Mycroft Precise reads from.
    """
    chunk = request.get_data()
    if chunk:
        audio_stream.write(chunk)
    return jsonify({"status": "ok"})
```
## Phase 4: Maix Duino Integration (Server-Side Detection)
### Update maix_voice_client.py
For server-side detection, stream audio continuously:
```python
# Add to configuration
STREAM_ENDPOINT = "/stream"
WAKE_WORD_CHECK_INTERVAL = 0.1  # Check every 100ms

def stream_audio_continuous():
    """
    Stream audio to server for wake word detection
    Server will notify us when wake word is detected
    """
    import socket
    import struct
    # Create socket connection
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_addr = (VOICE_SERVER_URL.replace('http://', '').split(':')[0], 8888)
    try:
        sock.connect(server_addr)
        print("Connected to wake word server")
        while True:
            # Capture audio chunk
            chunk = i2s_dev.record(CHUNK_SIZE)
            if chunk:
                # Send chunk size first, then chunk
                sock.sendall(struct.pack('>I', len(chunk)))
                sock.sendall(chunk)
            # Check for wake word detection signal
            # (simplified - actual implementation needs non-blocking socket)
            time.sleep(0.01)
    except Exception as e:
        print(f"Streaming error: {e}")
    finally:
        sock.close()
```
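On the server side, the matching receiver must reassemble those length-prefixed frames from the TCP byte stream. The framing logic can be isolated and tested on its own; a sketch, assuming the socket loop accumulates received bytes into `buf`:

```python
import struct

def parse_frames(buf):
    """Split a byte buffer into complete 4-byte-length-prefixed frames.

    Returns (frames, remainder), where remainder holds any trailing
    partial frame to retry once more bytes arrive.
    """
    frames = []
    offset = 0
    while offset + 4 <= len(buf):
        (size,) = struct.unpack_from('>I', buf, offset)
        if offset + 4 + size > len(buf):
            break  # incomplete frame - wait for more data
        frames.append(buf[offset + 4:offset + 4 + size])
        offset += 4 + size
    return frames, buf[offset:]
```

Each complete frame is one audio chunk, ready to be fed into the wake word detector.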
## Phase 5: Edge Detection on Maix Duino (Advanced)
### Convert Precise Model to KMODEL
This is complex and requires several conversion steps:
```bash
# Step 1: Convert TensorFlow model to ONNX
# (Precise .net files are Keras HDF5 models, so use --keras, not --saved-model)
pip install tf2onnx --break-system-packages
python -m tf2onnx.convert \
    --keras hey-computer.net \
    --output hey-computer.onnx
# Step 2: Optimize ONNX model
# (onnx>=1.9 moved the optimizer into the separate onnxoptimizer package)
pip install onnx onnxoptimizer --break-system-packages
python -c "
import onnx
import onnxoptimizer as optimizer
model = onnx.load('hey-computer.onnx')
passes = ['eliminate_deadend', 'eliminate_identity',
          'eliminate_nop_dropout', 'eliminate_nop_pad']
optimized = optimizer.optimize(model, passes)
onnx.save(optimized, 'hey-computer-opt.onnx')
"
# Step 3: Convert ONNX to KMODEL (for K210)
# Use nncase (https://github.com/kendryte/nncase)
# This step is hardware-specific and complex
# Install nncase
pip install nncase --break-system-packages
# Convert (adjust parameters based on your model)
ncc compile hey-computer-opt.onnx \
-i onnx \
--dataset calibration_data \
-o hey-computer.kmodel \
--target k210
```
**Note:** KMODEL conversion is non-trivial and may require model architecture adjustments. The K210 has limitations:
- Max model size: ~6MB
- Limited operators support
- Quantization required for performance
### Testing KMODEL on Maix Duino
```python
# Load model in maix_voice_client.py
import KPU as kpu

def load_wake_word_model_kmodel():
    """Load converted KMODEL for wake word detection"""
    global kpu_task
    try:
        kpu_task = kpu.load("/sd/models/hey-computer.kmodel")
        print("Wake word model loaded on K210")
        return True
    except Exception as e:
        print(f"Failed to load model: {e}")
        return False

def detect_wake_word_kmodel():
    """Run wake word detection using K210 KPU"""
    global kpu_task
    # Capture audio
    audio_chunk = i2s_dev.record(CHUNK_SIZE)
    # Preprocess for model (depends on model input format)
    # This is model-specific - adjust based on your training
    features = preprocess_audio(audio_chunk)
    # Run inference (kpu.forward is the generic path; run_yolo2 is for object detection)
    fmap = kpu.forward(kpu_task, features)
    output = fmap[:]
    # Check confidence
    if output[0] > WAKE_WORD_THRESHOLD:
        return True
    return False
```
## Recommended Wake Words
Based on testing and community feedback:
**Best performers:**
1. "Hey Computer" - Clear and distinct, with hard consonants
2. "Okay Jarvis" - Pop culture reference, easy to say
3. "Hey Mycroft" - Original Mycroft wake word (lots of training data available)
**Avoid:**
- Single syllable words (too easy to trigger)
- Common phrases ("okay", "hey there")
- Names of people in your household
- Words that sound like common speech patterns
## Training Tips
### For Best Accuracy
1. **Diverse training data:**
- Multiple speakers
- Various distances (1ft to 15ft)
- Different noise conditions
- Accent variations
2. **Quality over quantity:**
- 50 good samples > 200 poor samples
- Clear pronunciation
- Consistent volume
3. **Hard negatives:**
- Include similar-sounding phrases
- Include partial wake words
- Include common false triggers you notice
4. **Regular retraining:**
- Add false positives to training set
- Add missed detections
- Retrain every few weeks initially
### Collecting Hard Negatives
```bash
# Run Precise in test mode and collect false positives
precise-listen hey-computer.net --save-false-positives
# This will save audio clips when model triggers incorrectly
# Add these to your not-wake-word training set
# Retrain to reduce false positives
```
## Performance Benchmarks
### Server-Side Detection (Heimdall)
- **Latency:** 100-200ms from utterance to detection
- **Accuracy:** 95%+ with good training
- **False positive rate:** <1 per hour with tuning
- **CPU usage:** ~5-10% (single core)
- **Network:** ~128kbps continuous stream
### Edge Detection (Maix Duino)
- **Latency:** 50-100ms
- **Accuracy:** 80-90% (limited by K210 quantization)
- **False positive rate:** Varies by model optimization
- **CPU usage:** ~30% K210 (leaves room for other tasks)
- **Network:** 0 until wake detected
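The network figures above follow directly from the audio format: raw 16-bit PCM at 16 kHz is 256 kbps, so a ~128 kbps stream implies 8-bit samples or light compression. A quick sanity check:

```python
def stream_kbps(sample_rate_hz, bits_per_sample, channels=1):
    """Bitrate of a raw PCM stream in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000

print(stream_kbps(16_000, 16))  # raw 16-bit mono: 256.0
print(stream_kbps(16_000, 8))   # 8-bit mono, matching the ~128 kbps figure: 128.0
```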
## Monitoring and Debugging
### Log Wake Word Detections
```python
# Add to voice_server.py
import datetime

def log_wake_word(confidence, timestamp=None):
    """Log wake word detections for analysis"""
    if timestamp is None:
        timestamp = datetime.datetime.now()
    log_file = "/home/alan/voice-assistant/logs/wake_words.log"
    with open(log_file, 'a') as f:
        f.write(f"{timestamp.isoformat()},{confidence}\n")
```
### Analyze False Positives
```bash
# Check wake word log
tail -f ~/voice-assistant/logs/wake_words.log
# Summarize detection confidences (log lines are "timestamp,confidence")
awk -F',' '{print $2}' ~/voice-assistant/logs/wake_words.log | \
    sort -n | uniq -c
```
## Production Deployment
### Systemd Service with Precise
Update the systemd service to include Precise:
```ini
[Unit]
Description=Voice Assistant with Wake Word Detection
After=network.target
[Service]
Type=simple
User=alan
WorkingDirectory=/home/alan/voice-assistant
Environment="PATH=/home/alan/miniconda3/envs/precise/bin:/usr/local/bin:/usr/bin:/bin"
EnvironmentFile=/home/alan/voice-assistant/config/.env
ExecStart=/home/alan/miniconda3/envs/precise/bin/python voice_server.py --enable-precise
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target
```
## Troubleshooting
### Precise Won't Start
```bash
# Check TensorFlow version
python -c "import tensorflow as tf; print(tf.__version__)"
# Should be 1.15.x
# Check model file
file hey-computer.net
# Precise models are Keras HDF5 files - expect "Hierarchical Data Format (version 5) data"
# Test model directly
precise-engine hey-computer.net
# Should load without errors
```
### Low Accuracy
1. **Collect more training data** - Especially hard negatives
2. **Increase training epochs** - Try 200-300 epochs
3. **Verify training/test split** - Should be 80/20
4. **Check audio quality** - Sample rate should match (16kHz)
5. **Try different wake words** - Some are easier to detect
### High False Positive Rate
1. **Increase threshold** - Try 0.6, 0.7, 0.8
2. **Add false positives to training** - Retrain with false triggers
3. **Collect more negative samples** - Expand not-wake-word set
4. **Use ensemble models** - Run multiple models, require agreement
### KMODEL Conversion Fails
This is expected - K210 conversion is complex:
1. **Simplify model architecture** - Reduce layer count
2. **Use quantization-aware training** - Train with quantization in mind
3. **Check operator support** - K210 doesn't support all TF ops
4. **Consider alternatives:**
- Use pre-trained models for K210
- Stick with server-side detection
- Use Porcupine instead (has K210 support)
## Alternative: Use Pre-trained Models
Mycroft provides some pre-trained models:
```bash
# Download Hey Mycroft model
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz
# Test it
precise-listen hey-mycroft.net
```
Then train your own wake word starting from this base:
```bash
# Fine-tune from pre-trained model
precise-train -e 60 my-wake-word.net my-wake-word/ \
--from-checkpoint hey-mycroft.net
```
## Next Steps
1. **Start with server-side** - Get it working on Heimdall first
2. **Collect good training data** - Quality samples are key
3. **Test and tune threshold** - Find the sweet spot for your environment
4. **Monitor performance** - Track false positives and misses
5. **Iterate on training** - Add hard examples, retrain
6. **Consider edge deployment** - Once server-side is solid
## Resources
- Mycroft Precise Docs: https://github.com/MycroftAI/mycroft-precise
- Training Guide: https://mycroft-ai.gitbook.io/docs/mycroft-technologies/precise
- Community Models: https://github.com/MycroftAI/precise-data
- K210 Docs: https://canaan-creative.com/developer
- nncase: https://github.com/kendryte/nncase
## Conclusion
Mycroft Precise gives you full control over your wake word detection with complete privacy. Start with server-side detection for easier development, then optimize to edge detection once you have a well-trained model.
The key to success is good training data - invest time in collecting diverse, high-quality samples!
---

*File: `docs/PRECISE_DEPLOYMENT.md`*
# Mycroft Precise Deployment Guide
## Quick Reference: Server vs Edge Detection
### Server-Side Detection (Recommended for Start)
**Setup:**
```bash
# 1. On Heimdall: Setup Precise
./setup_precise.sh --wake-word "hey computer"
# 2. Train your model (follow scripts in ~/precise-models/hey-computer/)
cd ~/precise-models/hey-computer
./1-record-wake-word.sh
./2-record-not-wake-word.sh
# Organize samples, then:
./3-train-model.sh
./4-test-model.sh
# 3. Start voice server with Precise
cd ~/voice-assistant
conda activate precise
python voice_server.py \
--enable-precise \
--precise-model ~/precise-models/hey-computer/hey-computer.net \
--precise-sensitivity 0.5
```
**Architecture:**
- Maix Duino → Continuous audio stream → Heimdall
- Heimdall runs Precise on audio stream
- On wake word: Process command with Whisper
- Response → TTS → Stream back to Maix Duino
**Pros:** Easier setup, better accuracy, simple updates
**Cons:** More network traffic, requires stable connection
### Edge Detection (Advanced - Future Phase)
**Setup:**
```bash
# 1. Train model on Heimdall (same as above)
# 2. Convert to KMODEL for K210
# 3. Deploy to Maix Duino
# (See MYCROFT_PRECISE_GUIDE.md for detailed conversion steps)
```
**Architecture:**
- Maix Duino runs Precise locally on K210
- Only sends audio after wake word detected
- Lower latency, less network traffic
**Pros:** Lower latency, less bandwidth, works offline
**Cons:** Complex conversion, lower accuracy, harder updates
## Phase-by-Phase Deployment
### Phase 1: Server Setup (Day 1)
```bash
# On Heimdall
ssh alan@10.1.10.71
# 1. Setup voice assistant base
./setup_voice_assistant.sh
# 2. Setup Mycroft Precise
./setup_precise.sh --wake-word "hey computer"
# 3. Configure environment
vim ~/voice-assistant/config/.env
```
Update `.env`:
```bash
HA_URL=http://your-home-assistant:8123
HA_TOKEN=your_token_here
PRECISE_MODEL=/home/alan/precise-models/hey-computer/hey-computer.net
PRECISE_SENSITIVITY=0.5
```
### Phase 2: Wake Word Training (Day 1-2)
```bash
# Navigate to training directory
cd ~/precise-models/hey-computer
conda activate precise
# Record samples (30-60 minutes)
./1-record-wake-word.sh # Record 50-100 wake word samples
./2-record-not-wake-word.sh # Record 200-500 negative samples
# Organize samples
# Move 80% of wake-word recordings to wake-word/
# Move 20% of wake-word recordings to test/wake-word/
# Move 80% of not-wake-word to not-wake-word/
# Move 20% of not-wake-word to test/not-wake-word/
# Train model (30-60 minutes)
./3-train-model.sh
# Test model
./4-test-model.sh
# Evaluate on test set
./5-evaluate-model.sh
# Tune threshold
./6-tune-threshold.sh
```
### Phase 3: Server Integration (Day 2)
#### Option A: Manual Testing
```bash
cd ~/voice-assistant
conda activate precise
# Start server with Precise enabled
python voice_server.py \
--enable-precise \
--precise-model ~/precise-models/hey-computer/hey-computer.net \
--precise-sensitivity 0.5 \
--ha-url http://your-ha:8123 \
--ha-token your_token
```
#### Option B: Systemd Service
Update systemd service to use Precise environment:
```bash
sudo vim /etc/systemd/system/voice-assistant.service
```
```ini
[Unit]
Description=Voice Assistant with Wake Word Detection
After=network.target
[Service]
Type=simple
User=alan
WorkingDirectory=/home/alan/voice-assistant
Environment="PATH=/home/alan/miniconda3/envs/precise/bin:/usr/local/bin:/usr/bin:/bin"
EnvironmentFile=/home/alan/voice-assistant/config/.env
ExecStart=/home/alan/miniconda3/envs/precise/bin/python voice_server.py \
--enable-precise \
--precise-model /home/alan/precise-models/hey-computer/hey-computer.net \
--precise-sensitivity 0.5
Restart=on-failure
RestartSec=10
StandardOutput=append:/home/alan/voice-assistant/logs/voice_assistant.log
StandardError=append:/home/alan/voice-assistant/logs/voice_assistant_error.log
[Install]
WantedBy=multi-user.target
```
Enable and start:
```bash
sudo systemctl daemon-reload
sudo systemctl enable voice-assistant
sudo systemctl start voice-assistant
sudo systemctl status voice-assistant
```
### Phase 4: Maix Duino Setup (Day 2-3)
For server-side wake word detection, Maix Duino streams audio:
Update `maix_voice_client.py`:
```python
# Use simplified mode - just stream audio
# Server handles wake word detection
CONTINUOUS_STREAM = True # Enable continuous streaming
WAKE_WORD_CHECK_INTERVAL = 0 # Server-side detection
```
Flash and test:
1. Copy updated script to SD card
2. Boot Maix Duino
3. Check serial console for connection
4. Speak wake word
5. Verify server logs show detection
### Phase 5: Testing & Tuning (Day 3-7)
#### Test Wake Word Detection
```bash
# Monitor server logs
journalctl -u voice-assistant -f
# Or check detections via API
curl http://10.1.10.71:5000/wake-word/detections
```
#### Test End-to-End Flow
1. Say wake word: "Hey Computer"
2. Wait for LED/beep on Maix Duino
3. Say command: "Turn on the living room lights"
4. Verify HA command executes
5. Hear TTS response
#### Monitor Performance
```bash
# Check wake word log
tail -f ~/voice-assistant/logs/wake_words.log
# Check detection count (one log line per detection)
wc -l < ~/voice-assistant/logs/wake_words.log
# Check accuracy
# Should see detections when you say wake word
# Should NOT see detections during normal conversation
```
#### Tune Sensitivity
If too many false positives:
```bash
# Increase threshold (more conservative)
# Edit systemd service or restart with:
python voice_server.py --precise-sensitivity 0.7
```
If missing wake words:
```bash
# Decrease threshold (more aggressive)
python voice_server.py --precise-sensitivity 0.3
```
#### Collect Hard Examples
```bash
# When you notice false positives, record them
cd ~/precise-models/hey-computer
precise-collect -f not-wake-word/false-positive-$(date +%s).wav
# When wake word is missed, record it
precise-collect -f wake-word/missed-$(date +%s).wav
# After collecting 10-20 examples, retrain
./3-train-model.sh
```
## Monitoring Commands
### Check System Status
```bash
# Service status
sudo systemctl status voice-assistant
# Server health
curl http://10.1.10.71:5000/health
# Wake word status
curl http://10.1.10.71:5000/wake-word/status
# Recent detections
curl http://10.1.10.71:5000/wake-word/detections
```
### View Logs
```bash
# Real-time server logs
journalctl -u voice-assistant -f
# Last 50 lines
journalctl -u voice-assistant -n 50
# Specific log file
tail -f ~/voice-assistant/logs/voice_assistant.log
# Wake word detections
tail -f ~/voice-assistant/logs/wake_words.log
# Maix Duino serial console
screen /dev/ttyUSB0 115200
```
### Performance Metrics
```bash
# CPU usage (should be ~5-10% idle, spikes during processing)
top -p $(pgrep -f voice_server.py)
# Memory usage
ps aux | grep voice_server.py
# Network traffic (if streaming audio)
iftop -i eth0 # or your network interface
```
## Troubleshooting
### Wake Word Not Detecting
**Check model is loaded:**
```bash
curl http://10.1.10.71:5000/wake-word/status
# Should show: "enabled": true
```
**Test model directly:**
```bash
conda activate precise
precise-listen ~/precise-models/hey-computer/hey-computer.net
# Speak wake word - should see "!"
```
**Check sensitivity:**
```bash
# Try lower threshold
precise-listen ~/precise-models/hey-computer/hey-computer.net -t 0.3
```
**Verify audio input:**
```bash
# Test microphone
arecord -d 5 test.wav
aplay test.wav
```
### Too Many False Positives
**Increase threshold:**
```bash
# Edit service or restart with higher sensitivity
python voice_server.py --precise-sensitivity 0.7
```
**Retrain with false positives:**
```bash
cd ~/precise-models/hey-computer
# Record false triggers in not-wake-word/
precise-collect -f not-wake-word/false-triggers.wav
# Add to not-wake-word training set
./3-train-model.sh
```
### Server Won't Start with Precise
**Check Precise installation:**
```bash
conda activate precise
python -c "from precise_runner import PreciseRunner; print('OK')"
```
**Check engine:**
```bash
precise-engine --version
# Should show: Precise v0.3.0
```
**Check model file:**
```bash
ls -lh ~/precise-models/hey-computer/hey-computer.net
file ~/precise-models/hey-computer/hey-computer.net
```
**Check permissions:**
```bash
chmod +x /usr/local/bin/precise-engine
chmod 644 ~/precise-models/hey-computer/hey-computer.net
```
### Audio Quality Issues
**Test audio path:**
```bash
# Record test on server
arecord -f S16_LE -r 16000 -c 1 -d 5 test.wav
# Transcribe with Whisper
conda activate voice-assistant
python -c "
import whisper
model = whisper.load_model('base')
result = model.transcribe('test.wav')
print(result['text'])
"
```
**If poor quality:**
- Check microphone connection
- Verify sample rate (16kHz)
- Test with USB microphone
- Check for interference/noise
### Maix Duino Connection Issues
**Check WiFi:**
```python
# In Maix Duino serial console
import network
wlan = network.WLAN(network.STA_IF)
print(wlan.isconnected())
print(wlan.ifconfig())
```
**Check server reachability:**
```python
# From Maix Duino
import urequests
response = urequests.get('http://10.1.10.71:5000/health')
print(response.json())
```
**Check audio streaming:**
```bash
# On Heimdall, monitor network
sudo tcpdump -i any -n host <maix-duino-ip>
# Should see continuous packets when streaming
```
## Optimization Tips
### Reduce Latency
1. **Use smaller Whisper model:**
```bash
# Edit .env
WHISPER_MODEL=base # or tiny
```
2. **Optimize Precise sensitivity:**
```bash
# Find sweet spot between false positives and latency
# Lower threshold = faster trigger but more false positives
```
3. **Pre-load models:**
```python
# Models load on startup, not first request
# Adds ~30s startup time but eliminates first-request delay
```
### Improve Accuracy
1. **Use larger Whisper model:**
```bash
WHISPER_MODEL=large
```
2. **Train more wake word samples:**
```bash
# Aim for 100+ high-quality samples
# Diverse speakers, conditions, distances
```
3. **Increase training epochs:**
```bash
# In 3-train-model.sh
precise-train -e 120 hey-computer.net . # vs default 60
```
### Reduce False Positives
1. **Collect hard negatives:**
```bash
# Record TV, music, similar phrases
# Add to not-wake-word training set
```
2. **Increase threshold:**
```bash
--precise-sensitivity 0.7 # vs default 0.5
```
3. **Use ensemble model:**
```python
# Run multiple models, require agreement
# Advanced - requires code modification
```
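The agreement rule itself is simple. A minimal k-of-n vote over per-model confidences might look like the following sketch, where each score would come from one Precise model instance:

```python
def ensemble_activate(confidences, threshold=0.5, min_agree=2):
    """Trigger only when at least min_agree models clear the threshold."""
    return sum(c >= threshold for c in confidences) >= min_agree
```

With three models, `min_agree=2` tolerates one noisy model while still suppressing single-model false triggers.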
## Production Checklist
- [ ] Wake word model trained with 50+ samples
- [ ] Model tested with <5% false positive rate
- [ ] Server service enabled and auto-starting
- [ ] Home Assistant token configured
- [ ] Maix Duino WiFi configured
- [ ] End-to-end test successful
- [ ] Logs rotating properly
- [ ] Monitoring in place
- [ ] Backup of trained model
- [ ] Documentation updated
## Backup and Recovery
### Backup Trained Model
```bash
# Backup model
cp ~/precise-models/hey-computer/hey-computer.net \
~/precise-models/hey-computer/hey-computer.net.backup
# Backup to another host
scp ~/precise-models/hey-computer/hey-computer.net \
user@backup-host:/path/to/backups/
```
### Restore from Backup
```bash
# Restore model
cp ~/precise-models/hey-computer/hey-computer.net.backup \
~/precise-models/hey-computer/hey-computer.net
# Restart service
sudo systemctl restart voice-assistant
```
## Next Steps
Once basic server-side detection is working:
1. **Add more intents** - Expand Home Assistant control
2. **Implement TTS playback** - Complete the audio response loop
3. **Multi-room support** - Deploy multiple Maix Duino units
4. **Voice profiles** - Train model on family members
5. **Edge deployment** - Convert model for K210 (advanced)
## Resources
- Main guide: MYCROFT_PRECISE_GUIDE.md
- Quick start: QUICKSTART.md
- Architecture: maix-voice-assistant-architecture.md
- Mycroft Docs: https://github.com/MycroftAI/mycroft-precise
- Community: https://community.mycroft.ai/
## Support
### Log an Issue
```bash
# Collect debug info
echo "=== System Info ===" > debug.log
uname -a >> debug.log
conda list >> debug.log
echo "=== Service Status ===" >> debug.log
systemctl status voice-assistant >> debug.log
echo "=== Recent Logs ===" >> debug.log
journalctl -u voice-assistant -n 100 >> debug.log
echo "=== Wake Word Status ===" >> debug.log
curl http://10.1.10.71:5000/wake-word/status >> debug.log
```
Then share `debug.log` when asking for help.
### Common Issues Database
| Symptom | Likely Cause | Solution |
|---------|--------------|----------|
| No wake detection | Model not loaded | Check `/wake-word/status` |
| Service won't start | Missing dependencies | Reinstall Precise |
| High false positives | Low threshold | Increase to 0.7+ |
| Missing wake words | High threshold | Decrease to 0.3-0.4 |
| Poor transcription | Bad audio quality | Check microphone |
| HA commands fail | Wrong token | Update .env |
| High CPU usage | Large Whisper model | Use smaller model |
## Conclusion
With Mycroft Precise, you have complete control over your wake word detection. Start with server-side detection for easier debugging, collect good training data, and tune the threshold for your environment. Once it's working well, you can optionally optimize to edge detection for lower latency.
The key to success: **Quality training data > Quantity**
Happy voice assisting! 🎙️
---

*File: `docs/QUESTIONS_ANSWERED.md`*
# Your Questions Answered - Quick Reference
## TL;DR: Yes, Yes, and Multiple Options!
### Q1: Pre-trained "Hey Mycroft" Model?
**Answer: YES! ✅**
Download and use immediately:
```bash
./quick_start_hey_mycroft.sh
# Done in 5 minutes - no training!
```
The pre-trained model works great and saves you 1-2 hours of training time.
### Q2: Multiple Wake Words?
**Answer: YES! ✅ (with considerations)**
**Server-side (Heimdall):** Easy, run 3-5 wake words
```bash
python voice_server_enhanced.py \
--enable-precise \
--multi-wake-word
```
**Edge (K210):** Feasible for 1-2, challenging for 3+
### Q3: Adopting New Users' Voices?
**Answer: Multiple approaches ✅**
**Best option:** Train one model with everyone's voices upfront
**Alternative:** Incremental retraining as new users join
**Advanced:** Speaker identification with personalization
---
## Detailed Answers
### 1. Pre-trained "Hey Mycroft" Model
#### Where to Get It
```bash
# Quick start script does this for you
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz
```
#### How to Use
**Instant deployment:**
```bash
python voice_server.py \
--enable-precise \
--precise-model ~/precise-models/pretrained/hey-mycroft.net
```
**Fine-tune with your voice:**
```bash
# Record 20-30 samples of your voice saying "Hey Mycroft"
precise-collect
# Fine-tune from pre-trained
precise-train -e 30 my-hey-mycroft.net . \
--from-checkpoint ~/precise-models/pretrained/hey-mycroft.net
```
#### Advantages
✅ **Zero training time** - Works immediately
✅ **Proven accuracy** - Tested by thousands
✅ **Good baseline** - Already includes diverse voices
✅ **Easy fine-tuning** - Add your voice in 30 mins vs 60+ mins from scratch
#### When to Use Pre-trained vs Custom
**Use Pre-trained "Hey Mycroft" when:**
- You want to test quickly
- "Hey Mycroft" is an acceptable wake word
- You want proven accuracy out-of-box
**Train Custom when:**
- You want a different wake word ("Hey Computer", "Jarvis", etc.)
- Maximum accuracy for your specific environment
- Family-specific wake word
**Hybrid (Recommended):**
- Start with pre-trained "Hey Mycroft"
- Test and learn the system
- Fine-tune with your samples
- Or add custom wake word later
---
### 2. Multiple Wake Words
#### Can You Have Multiple?
**Yes!** Options:
#### Option A: Server-Side (Recommended)
**Easy implementation:**
```bash
# Use the enhanced server
python voice_server_enhanced.py \
--enable-precise \
--multi-wake-word
```
**Configured wake words:**
- "Hey Mycroft" (pre-trained)
- "Hey Computer" (custom)
- "Jarvis" (custom)
**Resource impact:**
- 3 models = ~15-30% CPU (Heimdall handles easily)
- ~300-600MB RAM
- Each model runs independently
**Example use cases:**
```python
"Hey Mycroft, what's the time?" → General assistant
"Jarvis, run diagnostics" → Personal assistant mode
"Emergency, call help" → Priority/emergency mode
```
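One way to wire this up server-side is a registry that maps each wake word to its model and a labelled callback, then starts one runner per entry. A sketch: the model paths are hypothetical, and the `PreciseRunner` wiring is shown in comments because it mirrors the single-model setup:

```python
def build_registry(models):
    """Map each wake word name to its model path and a labelled callback."""
    registry = {}
    for name, path in models.items():
        registry[name] = {
            "model": path,
            # default-arg binding captures the current name in the closure
            "on_activation": (lambda n=name: f"Wake word detected: {n}"),
        }
    return registry

# Hypothetical model paths - adjust to your training output
REGISTRY = build_registry({
    "hey-mycroft": "/home/alan/precise-models/pretrained/hey-mycroft.net",
    "jarvis": "/home/alan/precise-models/jarvis/jarvis.net",
})

# Wiring (one engine process per model, as in the single-model setup):
#   for entry in REGISTRY.values():
#       engine = PreciseEngine('/usr/local/bin/precise-engine', entry["model"])
#       PreciseRunner(engine, on_activation=entry["on_activation"]).start()
```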
#### Option B: Edge (K210)
**Feasible for 1-2 wake words:**
```python
# Sequential checking
for model in ['hey-mycroft.kmodel', 'emergency.kmodel']:
if detect_wake_word(model):
return model
```
**Limitations:**
- +50-100ms latency per additional model
- Memory constraints (6MB total for all models)
- More models = more power consumption
**Recommendation:**
- K210: 1 wake word (optimal)
- K210: 2 wake words (acceptable)
- K210: 3+ wake words (not recommended)
#### Option C: Contextual Wake Words
Different wake words for different purposes:
```python
wake_word_contexts = {
    'hey_mycroft': 'general_assistant',
    'emergency': 'priority_emergency',
    'goodnight': 'bedtime_routine',
}
```
#### Should You Use Multiple?
**One wake word is usually enough!**
Commercial products (Alexa, Google) use one wake word and they work fine.
**Use multiple when:**
- Different family members want different wake words
- You want context-specific behaviors (emergency vs. general)
- You enjoy the flexibility
**Start with one, add more later if needed.**
---
### 3. Adopting New Users' Voices
#### Challenge
Same wake word, different voices:
- Mom says "Hey Mycroft" (soprano)
- Dad says "Hey Mycroft" (bass)
- Kids say "Hey Mycroft" (high-pitched)
All need to work!
#### Solution 1: Diverse Training (Recommended)
**During initial training, have everyone record samples:**
```bash
cd ~/precise-models/family-hey-mycroft
# Session 1: Mom records 30 samples
precise-collect # Mom speaks "Hey Mycroft" 30 times
# Session 2: Dad records 30 samples
precise-collect # Dad speaks "Hey Mycroft" 30 times
# Session 3: Kids record 20 samples each
precise-collect # Kids speak "Hey Mycroft" 40 times total
# Train one model with all voices
precise-train -e 60 family-hey-mycroft.net .
# Deploy
python voice_server.py \
--enable-precise \
--precise-model family-hey-mycroft.net
```
**Pros:**
✅ One model works for everyone
✅ Simple deployment
✅ No switching needed
✅ Works from day one
**Cons:**
❌ Need everyone's time upfront
❌ Slightly lower per-person accuracy than individual models
#### Solution 2: Incremental Training
**Start with one person, add others over time:**
```bash
# Week 1: Train with Dad's voice
precise-train -e 60 hey-mycroft.net .
# Week 2: Mom wants to use it
# Collect Mom's samples
precise-collect # Mom records 20-30 samples
# Add to training set
cp mom-samples/* wake-word/
# Retrain from checkpoint (faster!)
precise-train -e 30 hey-mycroft.net . \
--from-checkpoint hey-mycroft.net
# Now works for both Dad and Mom!
# Week 3: Kids want in
# Repeat process...
```
**Pros:**
✅ Don't need everyone upfront
✅ Easy to add new users
✅ Model improves gradually
**Cons:**
❌ New users may have issues initially
❌ Requires periodic retraining
#### Solution 3: Speaker Identification (Advanced)
**Identify who's speaking, use personalized model/settings:**
```bash
# Install speaker ID
pip install pyannote.audio scipy --break-system-packages
# Use enhanced server
python voice_server_enhanced.py \
--enable-precise \
--enable-speaker-id \
--hf-token YOUR_HF_TOKEN
```
**Enroll users:**
```bash
# Record 30-second voice sample from each person
# POST to /speakers/enroll with audio + name
curl -F "name=alan" \
-F "audio=@alan_voice.wav" \
http://localhost:5000/speakers/enroll
curl -F "name=sarah" \
-F "audio=@sarah_voice.wav" \
http://localhost:5000/speakers/enroll
```
**Benefits:**
```python
# Different responses per user
if speaker == 'alan':
turn_on('light.alan_office')
elif speaker == 'sarah':
turn_on('light.sarah_office')
# Different permissions
if speaker == 'kids' and command.startswith('buy'):
return "Sorry, kids can't make purchases"
```
**Pros:**
✅ Personalized responses
✅ User-specific settings
✅ Better accuracy (optimized per voice)
✅ Can track who said what
**Cons:**
❌ More complex
❌ Privacy considerations
❌ Additional CPU/RAM (~10% + 200MB)
❌ Requires voice enrollment
#### Solution 4: Pre-trained Model (Easiest)
**"Hey Mycroft" already includes diverse voices!**
```bash
# Just use it - already trained on many voices
./quick_start_hey_mycroft.sh
```
The community model was trained with:
- Male and female voices
- Different accents
- Different ages
- Various environments
**It should work for most family members out of the box!**
Then fine-tune if needed.
---
## Recommended Path for Your Situation
### Scenario: Family of 3-4 People
**Week 1: Quick Start**
```bash
# Use pre-trained "Hey Mycroft"
./quick_start_hey_mycroft.sh
# Test with all family members
# Likely works for everyone already!
```
**Week 2: Fine-tune if Needed**
```bash
# If someone has issues:
# Have them record 20 samples
# Fine-tune the model
precise-train -e 30 family-hey-mycroft.net . \
--from-checkpoint ~/precise-models/pretrained/hey-mycroft.net
```
**Week 3: Add Features**
```bash
# If you want personalization:
python voice_server_enhanced.py \
--enable-speaker-id
# Enroll each family member
```
### Scenario: Just You (or 1-2 People)
**Option 1: Pre-trained**
```bash
./quick_start_hey_mycroft.sh
# Done!
```
**Option 2: Custom Wake Word**
```bash
# Train custom "Hey Computer"
cd ~/precise-models/hey-computer
./1-record-wake-word.sh # 50 samples
./2-record-not-wake-word.sh # 200 samples
./3-train-model.sh
```
### Scenario: Multiple People + Multiple Wake Words
**Full setup:**
```bash
# Pre-trained for family
./quick_start_hey_mycroft.sh
# Personal wake word for Dad
cd ~/precise-models/jarvis
# Train custom wake word
# Emergency wake word
cd ~/precise-models/emergency
# Train emergency wake word
# Run multi-wake-word server
python voice_server_enhanced.py \
--enable-precise \
--multi-wake-word \
--enable-speaker-id
```
---
## Quick Decision Matrix
| Your Situation | Recommendation |
|----------------|----------------|
| **Just getting started** | Pre-trained "Hey Mycroft" |
| **Want different wake word** | Train custom model |
| **Family of 3-4** | Pre-trained + fine-tune if needed |
| **Want personalization** | Add speaker ID |
| **Multiple purposes** | Multiple wake words (server-side) |
| **Deploying to K210** | 1 wake word, no speaker ID |
---
## Files to Use
**Quick start with pre-trained:**
- `quick_start_hey_mycroft.sh` - Zero training, 5 minutes!
**Multiple wake words:**
- `voice_server_enhanced.py` - Multi-wake-word + speaker ID support
**Training custom:**
- `setup_precise.sh` - Setup training environment
- Scripts in `~/precise-models/your-wake-word/`
**Documentation:**
- `WAKE_WORD_ADVANCED.md` - Detailed guide (this is comprehensive!)
- `PRECISE_DEPLOYMENT.md` - Production deployment
---
## Summary
- **Yes**, pre-trained "Hey Mycroft" exists and works great
- **Yes**, you can have multiple wake words (server-side is easy)
- **Yes**, there are multiple approaches for multi-user support
**Recommended approach:**
1. Start with `./quick_start_hey_mycroft.sh` (5 mins)
2. Test with all family members
3. Fine-tune if anyone has issues
4. Add speaker ID later if you want personalization
5. Consider multiple wake words only if you have specific use cases
**Keep it simple!** One pre-trained wake word works for most people.
---
## Next Actions
**Ready to start?**
```bash
# 5-minute quick start
./quick_start_hey_mycroft.sh
# Or read more first
cat WAKE_WORD_ADVANCED.md
```
**Questions?**
- Pre-trained models: See WAKE_WORD_ADVANCED.md § Pre-trained
- Multiple wake words: See WAKE_WORD_ADVANCED.md § Multiple Wake Words
- Voice adaptation: See WAKE_WORD_ADVANCED.md § Voice Adaptation
**Happy voice assisting! 🎙️**

---
`docs/QUICKSTART.md`
# Maix Duino Voice Assistant - Quick Start Guide
## Overview
This guide will walk you through setting up a local, privacy-focused voice assistant using your Maix Duino board and Home Assistant integration. All processing happens on your local network - no cloud services required.
## What You'll Build
- Wake word detection on Maix Duino (edge device)
- Speech-to-text using Whisper on Heimdall
- Home Assistant integration for smart home control
- Text-to-speech responses using Piper
- All processing local to your 10.1.10.0/24 network
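The pieces above can be sketched as a single request flow. Every function below is an illustrative stub standing in for the real component named in its comment; none of these names come from `voice_server.py`.

```python
# Illustrative end-to-end flow only; each stage is a placeholder.

def wake_word_detected(audio):
    # Edge device: wake word trigger (Precise model or amplitude placeholder)
    return audio.get("wake", False)

def transcribe(audio):
    # Heimdall: Whisper speech-to-text
    return audio.get("text", "")

def parse_intent(text):
    # Heimdall: regex-based intent parser
    if "lights" in text:
        return "turn_on", {"entity": "light.living_room"}
    return "unknown", {}

def execute_intent(intent, params):
    # Home Assistant REST API call
    return f"ok: {intent}"

def synthesize(reply):
    # Piper text-to-speech, audio returned to the edge device
    return f"<audio:{reply}>"

def handle_utterance(audio):
    """Full pipeline: wake word -> STT -> intent -> HA -> TTS."""
    if not wake_word_detected(audio):
        return None
    intent, params = parse_intent(transcribe(audio))
    return synthesize(execute_intent(intent, params))
```

Each stub gets replaced by the real component as you work through the phases below.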
## Hardware Requirements
- [x] Sipeed Maix Duino board (you have this!)
- [ ] I2S MEMS microphone (or microphone array)
- [ ] Small speaker (3-5W) or audio output
- [ ] MicroSD card (4GB+) formatted as FAT32
- [ ] USB-C cable for power and programming
## Network Prerequisites
- Maix Duino will need WiFi access to your 10.1.10.0/24 network
- Heimdall (10.1.10.71) for AI processing
- Home Assistant instance (configure URL in setup)
## Setup Process
### Phase 1: Server Setup (Heimdall)
#### Step 1: Run the setup script
```bash
# Transfer files to Heimdall
scp setup_voice_assistant.sh voice_server.py alan@10.1.10.71:~/
# SSH to Heimdall
ssh alan@10.1.10.71
# Make setup script executable and run it
chmod +x setup_voice_assistant.sh
./setup_voice_assistant.sh
```
#### Step 2: Configure Home Assistant access
```bash
# Edit the config file
vim ~/voice-assistant/config/.env
```
Update these values:
```env
HA_URL=http://your-home-assistant:8123
HA_TOKEN=your_long_lived_access_token_here
```
To get a long-lived access token:
1. Open Home Assistant
2. Click your profile (bottom left)
3. Scroll to "Long-Lived Access Tokens"
4. Click "Create Token"
5. Copy the token and paste it in .env
#### Step 3: Test the server
```bash
cd ~/voice-assistant
./test_server.sh
```
You should see:
```
Loading Whisper model: medium
Whisper model loaded successfully
Starting voice processing server on 0.0.0.0:5000
```
#### Step 4: Test with curl (from another terminal)
```bash
# Test health endpoint
curl http://10.1.10.71:5000/health
# Should return:
# {"status":"healthy","whisper_loaded":true,"ha_connected":true}
```
### Phase 2: Maix Duino Setup
#### Step 1: Flash MaixPy firmware
1. Download latest MaixPy firmware from: https://dl.sipeed.com/MAIX/MaixPy/release/
2. Download Kflash GUI: https://github.com/sipeed/kflash_gui
3. Connect Maix Duino via USB
4. Flash firmware using Kflash GUI
#### Step 2: Prepare SD card
```bash
# Format SD card as FAT32
# Create directory structure:
mkdir -p /path/to/sdcard/models
# Copy the client script
cp maix_voice_client.py /path/to/sdcard/main.py
```
#### Step 3: Configure WiFi settings
Edit `/path/to/sdcard/main.py`:
```python
# WiFi Settings
WIFI_SSID = "YourNetworkName"
WIFI_PASSWORD = "YourPassword"
# Server Settings
VOICE_SERVER_URL = "http://10.1.10.71:5000"
```
#### Step 4: Test the board
1. Insert SD card into Maix Duino
2. Connect to serial console (115200 baud)
```bash
screen /dev/ttyUSB0 115200
# or
minicom -D /dev/ttyUSB0 -b 115200
```
3. Power on the board
4. Watch the serial output for connection status
### Phase 3: Integration & Testing
#### Test 1: Basic connectivity
1. Maix Duino should connect to WiFi and display IP on LCD
2. Server should show in logs when Maix connects
#### Test 2: Audio capture
The current implementation uses amplitude-based wake word detection as a placeholder. To test:
1. Clap loudly near the microphone
2. Speak a command (e.g., "turn on the living room lights")
3. Watch the LCD for transcription and response
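The amplitude trigger can be sketched as an RMS check over each PCM frame. This is a hedged illustration only: the threshold and frame size are assumed values, not the ones in `maix_voice_client.py`.

```python
# Amplitude-based placeholder detector: fires on any loud sound
# (clap, loud speech), not on a specific phrase.
import array
import math

AMPLITUDE_THRESHOLD = 8000  # hypothetical default; tune per microphone

def is_loud(frame):
    """True when the RMS amplitude of a 16-bit PCM frame exceeds the trigger."""
    if len(frame) == 0:
        return False
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms > AMPLITUDE_THRESHOLD

quiet = array.array('h', [100, -120, 90, -80] * 64)       # background noise
clap = array.array('h', [20000, -21000, 19500, -20500] * 64)  # simulated clap
```

This is why any loud noise triggers recording in the current build; Phase 4 replaces it with a trained model.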
#### Test 3: Home Assistant control
Supported commands (add more in voice_server.py):
- "Turn on the living room lights"
- "Turn off the bedroom lights"
- "What's the temperature?"
- "Toggle the kitchen lights"
### Phase 4: Wake Word Training (Advanced)
The placeholder wake word detection uses simple amplitude triggering. For production use:
#### Option A: Use Porcupine (easiest)
1. Sign up at: https://console.picovoice.ai/
2. Train custom wake word
3. Download .ppn model
4. Convert to .kmodel for K210
#### Option B: Use Mycroft Precise (FOSS)
```bash
# On a machine with GPU
conda create -n precise python=3.6
conda activate precise
pip install precise-runner
# Record wake word samples
precise-collect
# Train model
precise-train -e 60 my-wake-word.net my-wake-word/
# Convert to .kmodel
# (requires additional tools - see MaixPy docs)
```
## Architecture Diagram
```
┌─────────────────────────────────────────────────────────────┐
│ Your Home Network (10.1.10.0/24) │
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Maix Duino │────────>│ Heimdall │ │
│ │ 10.1.10.xxx │ Audio │ 10.1.10.71 │ │
│ │ │<────────│ │ │
│ │ - Wake Word │ Response│ - Whisper │ │
│ │ - Mic Input │ │ - Piper TTS │ │
│ │ - Speaker │ │ - Flask API │ │
│ └──────────────┘ └──────┬───────┘ │
│ │ │
│ │ REST API │
│ v │
│ ┌──────────────┐ │
│ │ Home Asst. │ │
│ │ homeassistant│ │
│ │ │ │
│ │ - Devices │ │
│ │ - Automation │ │
│ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
```
## Troubleshooting
### Maix Duino won't connect to WiFi
Check the serial output for errors. Common issues:
- Wrong SSID/password
- WPA3 not supported (use WPA2)
- 5 GHz network (use 2.4 GHz)
### Whisper transcription is slow
```bash
# Use a smaller model on Heimdall
# Edit ~/voice-assistant/config/.env:
WHISPER_MODEL=base # or tiny for fastest
```
### Home Assistant commands don't work
```bash
# Check server logs
journalctl -u voice-assistant -f
# Test HA connection manually
curl -H "Authorization: Bearer YOUR_TOKEN" \
http://your-ha:8123/api/states
```
### Audio quality is poor
1. Check microphone connections
2. Adjust `SAMPLE_RATE` in maix_voice_client.py
3. Test with USB microphone first
4. Consider microphone array for better pickup
### Out of memory on Maix Duino
```python
# In main_loop(), add more frequent GC:
if gc.mem_free() < 200000: # Increase threshold
gc.collect()
```
## Adding New Intents
Edit `voice_server.py` and add patterns to `IntentParser.PATTERNS`:
```python
PATTERNS = {
# Existing patterns...
'set_temperature': [
r'set (?:the )?temperature to (\d+)',
r'make it (\d+) degrees',
],
}
```
Then add the handler in `execute_intent()`:
```python
elif intent == 'set_temperature':
temp = params.get('temperature')
success = ha_client.call_service(
'climate', 'set_temperature',
entity_id, temperature=temp
)
return f"Set temperature to {temp} degrees"
```
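The regex mechanics can be sketched standalone. This is a minimal illustration of how the patterns above might be matched against a transcript; the actual `IntentParser` in `voice_server.py` may be structured differently.

```python
# Match a transcript against intent patterns, returning the first hit.
import re

PATTERNS = {
    'set_temperature': [
        r'set (?:the )?temperature to (\d+)',
        r'make it (\d+) degrees',
    ],
}

def parse_intent(text):
    """Return (intent, captured groups) for the first matching pattern."""
    text = text.lower()
    for intent, patterns in PATTERNS.items():
        for pattern in patterns:
            m = re.search(pattern, text)
            if m:
                return intent, m.groups()
    return None, ()
```

Captured groups (like the temperature value) become the `params` passed to `execute_intent()`.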
## Entity Mapping
Add your Home Assistant entities to `IntentParser.ENTITY_MAP`:
```python
ENTITY_MAP = {
# Lights
'living room light': 'light.living_room',
'bedroom light': 'light.bedroom',
# Climate
'thermostat': 'climate.main_floor',
'temperature': 'sensor.main_floor_temperature',
# Switches
'coffee maker': 'switch.coffee_maker',
'fan': 'switch.bedroom_fan',
# Media
'tv': 'media_player.living_room_tv',
'music': 'media_player.whole_house',
}
```
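A minimal sketch of resolving a spoken phrase against the map (abbreviated here). `voice_server.py` may match differently; checking longest names first prevents a short name from shadowing a longer one.

```python
# Resolve a transcribed command to a Home Assistant entity ID by
# substring match, longest friendly name first.
ENTITY_MAP = {
    'living room light': 'light.living_room',
    'bedroom light': 'light.bedroom',
    'coffee maker': 'switch.coffee_maker',
}

def resolve_entity(text):
    """Return the entity whose friendly name appears in the command."""
    text = text.lower()
    for name in sorted(ENTITY_MAP, key=len, reverse=True):
        if name in text:
            return ENTITY_MAP[name]
    return None
```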
## Performance Tuning
### Reduce latency
1. Use Whisper `tiny` or `base` model
2. Implement streaming audio (currently batch)
3. Pre-load TTS models
4. Use faster TTS engine (e.g., espeak)
### Improve accuracy
1. Use Whisper `large` model (slower)
2. Train custom wake word
3. Add NLU layer (Rasa, spaCy)
4. Collect and fine-tune on your voice
## Next Steps
### Short term
- [ ] Add more Home Assistant entity mappings
- [ ] Implement Piper TTS playback on Maix Duino
- [ ] Train custom wake word model
- [ ] Add LED animations for better feedback
- [ ] Implement conversation context
### Medium term
- [ ] Multi-room support (multiple Maix Duino units)
- [ ] Voice profiles for different users
- [ ] Integration with Plex for media control
- [ ] Calendar and reminder functionality
- [ ] Weather updates from local weather station
### Long term
- [ ] Custom skills/plugins system
- [ ] Integration with other services (Nextcloud, Matrix)
- [ ] Sound event detection (doorbell, smoke alarm)
- [ ] Intercom functionality between rooms
- [ ] Voice-controlled automation creation
## Alternatives & Fallbacks
If the Maix Duino proves limiting:
### Raspberry Pi Zero 2 W
- More processing power
- Better software support
- USB audio support
- Cost: ~$15
### ESP32-S3
- Better WiFi
- More RAM (8MB)
- Cheaper (~$10)
- Good community support
### Orange Pi Zero 2
- ARM Cortex-A53 quad-core
- 512MB-1GB RAM
- Full Linux support
- Cost: ~$20
## Resources
### Documentation
- Maix Duino: https://wiki.sipeed.com/hardware/en/maix/
- MaixPy: https://maixpy.sipeed.com/
- Whisper: https://github.com/openai/whisper
- Piper TTS: https://github.com/rhasspy/piper
- Home Assistant API: https://developers.home-assistant.io/
### Community Projects
- Rhasspy: https://rhasspy.readthedocs.io/
- Willow: https://github.com/toverainc/willow
- Mycroft: https://mycroft.ai/
### Wake Word Tools
- Porcupine: https://picovoice.ai/platform/porcupine/
- Mycroft Precise: https://github.com/MycroftAI/mycroft-precise
- Snowboy (archived): https://github.com/Kitt-AI/snowboy
## Getting Help
### Check logs
```bash
# Server logs (if using systemd)
sudo journalctl -u voice-assistant -f
# Or manual log file
tail -f ~/voice-assistant/logs/voice_assistant.log
# Maix Duino serial console
screen /dev/ttyUSB0 115200
```
### Common issues and solutions
See the Troubleshooting section above
### Useful commands
```bash
# Restart service
sudo systemctl restart voice-assistant
# Check service status
sudo systemctl status voice-assistant
# Test HA connection
curl http://10.1.10.71:5000/health
# Monitor Maix Duino
minicom -D /dev/ttyUSB0 -b 115200
```
## Cost Breakdown
| Item | Cost | Status |
|------|------|--------|
| Maix Duino | $30 | Have it! |
| I2S Microphone | $5-10 | Need |
| Speaker | $10 | Need (or use existing) |
| MicroSD Card | $5 | Have it? |
| **Total** | **$15-25** | (vs $50+ commercial) |
**Benefits of local solution:**
- No subscription fees
- Complete privacy (no cloud)
- Customizable to your needs
- Integration with existing infrastructure
- Learning experience!
## Conclusion
You now have everything you need to build a local, privacy-focused voice assistant! The setup leverages your existing infrastructure (Heimdall for processing, Home Assistant for automation) while keeping costs minimal.
Start with the basic setup, test each component, then iterate and improve. The beauty of this approach is you can enhance it over time without being locked into a commercial platform.
Good luck, and enjoy your new voice assistant! 🎙️

---
`docs/WAKE_WORD_ADVANCED.md`
# Wake Word Models: Pre-trained, Multiple, and Voice Adaptation
## Pre-trained Wake Word Models
### Yes! "Hey Mycroft" Already Exists
Mycroft provides several pre-trained models that you can use immediately:
#### Available Pre-trained Models
**Hey Mycroft** (Official)
```bash
# Download from Mycroft's model repository
cd ~/precise-models/pretrained
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz
# Test immediately
conda activate precise
precise-listen hey-mycroft.net
# Should detect "Hey Mycroft" right away!
```
**Other Available Models:**
- **Hey Mycroft** - Best tested, most reliable
- **Christopher** - Alternative wake word
- **Hey Jarvis** - Community contributed
- **Computer** - Star Trek style
#### Using Pre-trained Models
**Option 1: Use as-is**
```bash
# Just point your server to the pre-trained model
python voice_server.py \
--enable-precise \
--precise-model ~/precise-models/pretrained/hey-mycroft.net \
--precise-sensitivity 0.5
```
**Option 2: Fine-tune for your voice**
```bash
# Use pre-trained as starting point, add your samples
cd ~/precise-models/my-hey-mycroft
# Record additional samples
precise-collect
# Train from checkpoint (much faster than from scratch!)
precise-train -e 30 my-hey-mycroft.net . \
--from-checkpoint ~/precise-models/pretrained/hey-mycroft.net
# This adds your voice/environment while keeping the base model
```
**Option 3: Ensemble with custom**
```python
# Use both pre-trained and custom model
# Require both to agree (reduces false positives)
# See implementation below
```
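A minimal sketch of the ensemble idea: each argument stands in for a model's activation probability on the current audio chunk. The names and default thresholds are assumptions, not part of the `precise-runner` API.

```python
# Accept a wake word only when both the pre-trained and the custom
# model agree above their thresholds, trading a little sensitivity
# for fewer false positives.

def ensemble_detect(prob_pretrained, prob_custom,
                    thresh_pretrained=0.5, thresh_custom=0.5):
    """True only when both models fire on the same chunk."""
    return (prob_pretrained >= thresh_pretrained
            and prob_custom >= thresh_custom)
```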
### Advantages of Pre-trained Models
- **Instant deployment** - No training required
- **Proven accuracy** - Tested by thousands of users
- **Good starting point** - Fine-tune rather than train from scratch
- **Multiple speakers** - Already includes diverse voices
- **Save time** - Skip 1-2 hours of training
### Disadvantages
- **Generic** - Not optimized for your voice/environment
- **May need tuning** - Threshold adjustment required
- **Limited choice** - Only a few wake words available
### Recommendation
**Start with "Hey Mycroft"** pre-trained model:
1. Deploy immediately (zero training time)
2. Test in your environment
3. Collect false positives/negatives
4. Fine-tune with your examples
5. Best of both worlds!
## Multiple Wake Words
### Can You Have Multiple Wake Words?
**Short answer:** Yes, but with tradeoffs.
### Implementation Approaches
#### Approach 1: Server-Side Multiple Models (Recommended)
Run multiple Precise models in parallel on Heimdall:
```python
# In voice_server.py
from precise_runner import PreciseEngine, PreciseRunner
# Global runners for each wake word
precise_runners = {}
wake_word_configs = {
'hey_mycroft': {
'model': '~/precise-models/pretrained/hey-mycroft.net',
'sensitivity': 0.5,
'response': 'Yes?'
},
'hey_computer': {
'model': '~/precise-models/hey-computer/hey-computer.net',
'sensitivity': 0.5,
'response': 'I\'m listening'
},
'jarvis': {
'model': '~/precise-models/jarvis/jarvis.net',
'sensitivity': 0.6,
'response': 'At your service, sir'
}
}
def on_wake_word_detected(wake_word_name):
"""Callback with wake word identifier"""
def callback():
print(f"Wake word detected: {wake_word_name}")
wake_word_queue.put({
'timestamp': time.time(),
'wake_word': wake_word_name,
'response': wake_word_configs[wake_word_name]['response']
})
return callback
def start_multiple_wake_words():
"""Start multiple Precise listeners"""
for name, config in wake_word_configs.items():
engine = PreciseEngine(
'/usr/local/bin/precise-engine',
os.path.expanduser(config['model'])
)
runner = PreciseRunner(
engine,
sensitivity=config['sensitivity'],
on_activation=on_wake_word_detected(name)
)
runner.start()
precise_runners[name] = runner
print(f"Started wake word listener: {name}")
```
**Resource Usage:**
- CPU: ~5-10% per model (3 models = ~15-30%)
- RAM: ~100-200MB per model
- Still very manageable on Heimdall
**Pros:**
✅ Different wake words for different purposes
✅ Family members can choose preferred wake word
✅ Context-aware responses
✅ Easy to add/remove models
**Cons:**
❌ Higher CPU usage (scales linearly)
❌ Increased false positive risk (3x models = 3x chance)
❌ More complex configuration
#### Approach 2: Edge Multiple Models (K210)
**Challenge:** K210 has limited resources
**Option A: Sequential checking** (Feasible)
```python
# Check each model in sequence
models = ['hey-mycroft.kmodel', 'hey-computer.kmodel']
for model in models:
kpu_task = kpu.load(f"/sd/models/{model}")
result = kpu.run(kpu_task, audio_features)
if result > threshold:
return model # Wake word detected
```
**Resource impact:**
- Latency: +50-100ms per additional model
- Memory: Models must fit in 6MB total
- CPU: ~30% per model check
**Option B: Combined model** (Advanced)
```python
# Train a single model that recognizes multiple phrases
# Each phrase maps to different output class
# More complex training but single inference
```
**Recommendation for edge:**
- **1-2 wake words max** on K210
- **Server-side** for 3+ wake words
#### Approach 3: Contextual Wake Words
Different wake words trigger different behaviors:
```python
wake_word_contexts = {
'hey_mycroft': 'general', # General commands
'hey_assistant': 'general', # Alternative general
'emergency': 'priority', # High priority
'goodnight': 'bedtime', # Bedtime routine
}
def handle_wake_word(wake_word, command):
context = wake_word_contexts[wake_word]
if context == 'priority':
# Skip queue, process immediately
# Maybe call emergency contact
pass
elif context == 'bedtime':
# Trigger bedtime automation
# Lower volume for responses
pass
else:
# Normal processing
pass
```
### Best Practices for Multiple Wake Words
1. **Start with one** - Get it working well first
2. **Add gradually** - One at a time, test thoroughly
3. **Different purposes** - Each wake word should have a reason
4. **Monitor performance** - Track false positives per wake word
5. **User preference** - Let family members choose their favorite
### Recommended Configuration
**For most users:**
```python
wake_words = {
'hey_mycroft': 'primary', # Main wake word (pre-trained)
'hey_computer': 'alternative' # Custom trained for your voice
}
```
**For power users:**
```python
wake_words = {
'hey_mycroft': 'general',
'jarvis': 'personal_assistant', # Custom responses
'computer': 'technical_queries', # Different intent parser
}
```
**For families:**
```python
wake_words = {
'hey_mycroft': 'shared', # Everyone can use
'dad': 'user_alan', # Personalized
'mom': 'user_sarah', # Personalized
'kids': 'user_children', # Kid-safe responses
}
```
## Voice Adaptation and Multi-User Support
### Challenge: Different Voices, Same Wake Word
When multiple people use the system:
- Different accents
- Different speech patterns
- Different pronunciations
- Different vocal characteristics
### Solution Approaches
#### Approach 1: Diverse Training Data (Recommended)
**During initial training:**
```bash
# Have everyone in household record samples
cd ~/precise-models/hey-computer
# Alan records 30 samples
precise-collect # Record as user 1
# Sarah records 30 samples
precise-collect # Record as user 2
# Kids record 20 samples
precise-collect # Record as user 3
# Combine all in training set
# Train one model that works for everyone
./3-train-model.sh
```
**Pros:**
✅ Single model for everyone
✅ No user switching needed
✅ Simple to maintain
✅ Works immediately for all users
**Cons:**
❌ May have lower per-person accuracy
❌ Requires upfront time from everyone
❌ Hard to add new users later
#### Approach 2: Incremental Training
Start with your voice, add others over time:
```bash
# Week 1: Train with Alan's voice
cd ~/precise-models/hey-computer
# Record and train with Alan's samples
precise-train -e 60 hey-computer.net .
# Week 2: Sarah wants to use it
# Collect Sarah's samples
mkdir -p sarah-samples/wake-word
precise-collect # Sarah records 20-30 samples
# Add to existing training set
cp sarah-samples/wake-word/* wake-word/
# Retrain (continue from checkpoint)
precise-train -e 30 hey-computer.net . \
--from-checkpoint hey-computer.net
# Now works for both Alan and Sarah!
```
**Pros:**
✅ Gradual improvement
✅ Don't need everyone upfront
✅ Easy to add new users
✅ Maintains accuracy for existing users
**Cons:**
❌ May not work well for new users initially
❌ Requires retraining periodically
#### Approach 3: Per-User Models with Speaker Identification
Train separate models + identify who's speaking:
**Step 1: Train per-user wake word models**
```bash
# Alan's model
~/precise-models/hey-computer-alan/
# Sarah's model
~/precise-models/hey-computer-sarah/
# Kids' model
~/precise-models/hey-computer-kids/
```
**Step 2: Use speaker identification**
```python
# Pseudo-code for speaker identification
def identify_speaker(audio):
"""
Identify speaker from voice characteristics
Using speaker embeddings (x-vectors, d-vectors)
"""
# Extract speaker embedding
embedding = speaker_encoder.encode(audio)
# Compare to known users
similarities = {
'alan': cosine_similarity(embedding, alan_embedding),
'sarah': cosine_similarity(embedding, sarah_embedding),
'kids': cosine_similarity(embedding, kids_embedding),
}
# Return most similar
return max(similarities, key=similarities.get)
def process_command(audio):
# Detect wake word with all models
wake_detected = check_all_models(audio)
if wake_detected:
# Identify speaker
speaker = identify_speaker(audio)
# Use speaker-specific model for better accuracy
model = f'~/precise-models/hey-computer-{speaker}/'
# Continue with speaker context
process_with_context(audio, speaker)
```
**Speaker identification libraries:**
- **Resemblyzer** - Simple speaker verification
- **speechbrain** - Complete toolkit
- **pyannote.audio** - You already use this for diarization!
**Implementation:**
```bash
# You already have pyannote for diarization!
conda activate voice-assistant
pip install pyannote.audio --break-system-packages
# Can use speaker embeddings for identification
```
```python
from pyannote.audio import Inference
# Load speaker embedding model
inference = Inference(
"pyannote/embedding",
use_auth_token=hf_token
)
# Extract embeddings for known users
alan_embedding = inference("alan_voice_sample.wav")
sarah_embedding = inference("sarah_voice_sample.wav")
# Compare with incoming audio
unknown_embedding = inference(audio_buffer)
from scipy.spatial.distance import cosine
alan_similarity = 1 - cosine(unknown_embedding, alan_embedding)
sarah_similarity = 1 - cosine(unknown_embedding, sarah_embedding)
if alan_similarity > 0.8:
user = 'alan'
elif sarah_similarity > 0.8:
user = 'sarah'
else:
user = 'unknown'
```
**Pros:**
✅ Personalized responses per user
✅ Better accuracy (model optimized for each voice)
✅ User-specific preferences/permissions
✅ Can track who said what
**Cons:**
❌ More complex setup
❌ Higher resource usage
❌ Requires voice samples from each user
❌ Privacy considerations
#### Approach 4: Adaptive/Online Learning
Model improves automatically based on usage:
```python
class AdaptiveWakeWord:
def __init__(self, base_model):
self.base_model = base_model
self.user_samples = []
self.retrain_threshold = 50 # Retrain after N samples
def on_detection(self, audio, user_confirmed=True):
"""User confirms this was correct detection"""
if user_confirmed:
self.user_samples.append(audio)
# Periodically retrain
if len(self.user_samples) >= self.retrain_threshold:
self.retrain_with_samples()
self.user_samples = []
def retrain_with_samples(self):
"""Background retraining with collected samples"""
# Add samples to training set
# Retrain model
# Swap in new model
pass
```
**Pros:**
✅ Automatic improvement
✅ Adapts to user's voice over time
✅ No manual retraining
✅ Gets better with use
**Cons:**
❌ Complex implementation
❌ Requires user feedback mechanism
❌ Risk of drift/degradation
❌ Background training overhead
## Recommended Strategy
### Phase 1: Single Wake Word, Single Model
```bash
# Week 1-2
# Use pre-trained "Hey Mycroft"
# OR train custom "Hey Computer" with all family members' voices
# Keep it simple, get it working
```
### Phase 2: Add Fine-tuning
```bash
# Week 3-4
# Collect false positives/negatives
# Retrain with household-specific data
# Optimize threshold
```
### Phase 3: Consider Multiple Wake Words
```bash
# Month 2
# If needed, add second wake word
# "Hey Mycroft" for general
# "Jarvis" for personal assistant tasks
```
### Phase 4: Personalization
```bash
# Month 3+
# If desired, add speaker identification
# Personalized responses
# User-specific preferences
```
## Practical Examples
### Example 1: Family of 4, Single Model
```bash
# Training session with everyone
cd ~/precise-models/hey-mycroft-family
# Dad records 25 samples
precise-collect
# Mom records 25 samples
precise-collect
# Kid 1 records 15 samples
precise-collect
# Kid 2 records 15 samples
precise-collect
# Collect shared negative samples (200+)
# TV, music, conversation, etc.
precise-collect -f not-wake-word/household.wav
# Train single model for everyone
precise-train -e 60 hey-mycroft-family.net .
# Deploy
python voice_server.py \
--enable-precise \
--precise-model hey-mycroft-family.net
```
**Result:** Everyone can use it, one model, simple.
### Example 2: Two Wake Words, Different Purposes
```python
# voice_server.py configuration
wake_words = {
'hey_mycroft': {
'model': 'hey-mycroft.net',
'sensitivity': 0.5,
'intent_parser': 'general', # All commands
'response': 'Yes?'
},
'emergency': {
'model': 'emergency.net',
'sensitivity': 0.7, # Higher threshold
'intent_parser': 'emergency', # Limited commands
'response': 'Emergency mode activated'
}
}
# "Hey Mycroft, turn on the lights" - works
# "Emergency, call for help" - triggers emergency protocol
```
### Example 3: Speaker Identification + Personalization
```python
# Enhanced processing with speaker ID
def process_with_speaker_id(audio, speaker):
# Different HA entity based on speaker
entity_maps = {
'alan': {
'bedroom_light': 'light.master_bedroom',
'office_light': 'light.alan_office',
},
'sarah': {
'bedroom_light': 'light.master_bedroom',
'office_light': 'light.sarah_office',
},
'kids': {
'bedroom_light': 'light.kids_bedroom',
'tv': None, # Kids can't control TV
}
}
# Transcribe command
text = whisper_transcribe(audio)
# "Turn on bedroom light"
if 'bedroom light' in text:
entity = entity_maps[speaker]['bedroom_light']
ha_client.turn_on(entity)
response = f"Turned on your bedroom light"
return response
```
## Resource Requirements
### Single Wake Word
- **CPU:** 5-10% (Heimdall)
- **RAM:** 100-200MB
- **Model size:** 1-3MB
- **Training time:** 30-60 min
### Multiple Wake Words (3 models)
- **CPU:** 15-30% (Heimdall)
- **RAM:** 300-600MB
- **Model size:** 3-9MB total
- **Training time:** 90-180 min
### With Speaker Identification
- **CPU:** +5-10% for speaker ID
- **RAM:** +200-300MB for embedding model
- **Model size:** +50MB for speaker model
- **Setup time:** +30-60 min for voice enrollment
### K210 Edge (Maix Duino)
- **Single model:** Feasible, ~30% CPU
- **2 models:** Feasible, ~60% CPU, higher latency
- **3+ models:** Not recommended
- **Speaker ID:** Not feasible (limited RAM/compute)
## Quick Decision Guide
**Just getting started?**
→ Use pre-trained "Hey Mycroft"
**Want custom wake word?**
→ Train one model with all family voices
**Need multiple wake words?**
→ Start server-side with 2-3 models
**Want personalization?**
→ Add speaker identification
**Deploying to edge (K210)?**
→ Stick to 1-2 wake words maximum
**Family of 4+ people?**
→ Train single model with everyone's voice
**Privacy is paramount?**
→ Skip speaker ID, use single universal model
## Testing Multiple Wake Words
```bash
# Test all wake words quickly
conda activate precise
# Terminal 1: Hey Mycroft
precise-listen hey-mycroft.net
# Terminal 2: Hey Computer
precise-listen hey-computer.net
# Terminal 3: Emergency
precise-listen emergency.net
# Say each wake word, verify correct detection
```
## Conclusion
### For Your Maix Duino Project:
**Recommended approach:**
1. **Start with "Hey Mycroft"** - Use pre-trained model
2. **Fine-tune if needed** - Add your household's voices
3. **Consider 2nd wake word** - Only if you have a specific use case
4. **Speaker ID** - Phase 2/3 enhancement, not critical for MVP
5. **Keep it simple** - One wake word works great for most users
**The pre-trained "Hey Mycroft" model saves you 1-2 hours** and works immediately. You can always fine-tune or add custom wake words later!
**Multiple wake words are cool but not necessary** - Most commercial products use just one. Focus on making one wake word work really well before adding more.
**Voice adaptation** - Training with multiple voices upfront is simpler than per-user models. Save speaker ID for later if you need personalization.
## Quick Start with Pre-trained
```bash
# On Heimdall
cd ~/precise-models/pretrained
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz
# Test it
conda activate precise
precise-listen hey-mycroft.net
# Deploy
cd ~/voice-assistant
python voice_server.py \
--enable-precise \
--precise-model ~/precise-models/pretrained/hey-mycroft.net
# You're done! No training needed!
```
**That's it - you have a working wake word in 5 minutes!** 🎉
docs/WAKE_WORD_QUICK_REF.md
# Wake Word Quick Reference Card
## 🎯 TL;DR: What Should I Do?
### Recommendation for Your Setup
**Week 1:** Use pre-trained "Hey Mycroft"
```bash
./download_pretrained_models.sh --model hey-mycroft
precise-listen ~/precise-models/pretrained/hey-mycroft.net
```
**Week 2-3:** Fine-tune with all family members' voices
```bash
cd ~/precise-models/hey-mycroft-family
precise-train -e 30 custom.net . --from-checkpoint ../pretrained/hey-mycroft.net
```
**Week 4+:** Add speaker identification
```bash
pip install resemblyzer
python enroll_speaker.py --name Alan --duration 20
python enroll_speaker.py --name [Family] --duration 20
```
**Month 2+:** Add second wake word (Hey Jarvis for Plex?)
```bash
./download_pretrained_models.sh --model hey-jarvis
# Run both in parallel on server
```
---
## 📋 Pre-trained Models
### Available Models (Ready to Use!)
| Wake Word | Download | Best For |
|-----------|----------|----------|
| **Hey Mycroft** ⭐ | `--model hey-mycroft` | Default choice, most data |
| **Hey Jarvis** | `--model hey-jarvis` | Pop culture, media control |
| **Christopher** | `--model christopher` | Unique, less common |
| **Hey Ezra** | `--model hey-ezra` | Alternative option |
### Quick Download
```bash
# Download one
./download_pretrained_models.sh --model hey-mycroft
# Download all
./download_pretrained_models.sh --test-all
# Test immediately
precise-listen ~/precise-models/pretrained/hey-mycroft.net
```
---
## 🔢 Multiple Wake Words
### Option 1: Multiple Models (Server-Side) ⭐ RECOMMENDED
**What:** Run 2-3 different wake word models simultaneously
**Where:** Heimdall (server)
**Performance:** ~15-30% CPU for 3 models
```bash
# Start with multiple wake words
python voice_server.py \
--enable-precise \
--precise-models "\
hey-mycroft:~/models/hey-mycroft.net:0.5,\
hey-jarvis:~/models/hey-jarvis.net:0.5"
```
**Pros:**
- ✅ Can identify which wake word was used
- ✅ Different contexts (Mycroft=commands, Jarvis=media)
- ✅ Easy to add/remove wake words
- ✅ Each can have different sensitivity
**Cons:**
- ❌ Only works server-side (not on Maix Duino)
- ❌ Higher CPU usage (but still reasonable)
**Use When:**
- You want different wake words for different purposes
- Server has CPU to spare (yours does!)
- Want flexibility to add wake words later
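The `--precise-models` flag above packs `name:path:threshold` triples into one comma-separated string. As a rough sketch of how the server side might unpack it (`parse_model_specs` is a hypothetical helper, not part of the real CLI):

```python
def parse_model_specs(spec):
    """Parse 'name:path:threshold' triples separated by commas."""
    models = {}
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Split name off the front and threshold off the back,
        # so a path may itself be colon-free or not.
        name, rest = entry.split(":", 1)
        path, threshold = rest.rsplit(":", 1)
        models[name] = {"path": path, "threshold": float(threshold)}
    return models

specs = parse_model_specs(
    "hey-mycroft:~/models/hey-mycroft.net:0.5,"
    "hey-jarvis:~/models/hey-jarvis.net:0.5")
```

Each model then runs with its own sensitivity, which is what makes per-wake-word tuning possible.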
### Option 2: Single Multi-Phrase Model (Edge-Compatible)
**What:** One model responds to multiple phrases
**Where:** Server OR Maix Duino
**Performance:** Same as single model
```bash
# Train on multiple phrases
cd ~/precise-models/multi-wake
# Record "Hey Mycroft" samples → wake-word/
# Record "Hey Computer" samples → wake-word/
# Record negatives → not-wake-word/
precise-train -e 60 multi-wake.net .
```
**Pros:**
- ✅ Single model = less compute
- ✅ Works on edge (K210)
- ✅ Simple deployment
**Cons:**
- ❌ Can't tell which wake word was used
- ❌ May reduce accuracy
- ❌ Higher false positive risk
**Use When:**
- Deploying to Maix Duino (edge)
- Want backup wake words
- Don't care which was used
---
## 👥 Multi-User Support
### Option 1: Inclusive Training ⭐ START HERE
**What:** One model, all voices
**How:** All family members record samples
```bash
cd ~/precise-models/family-wake
# Alice records 30 samples
# Bob records 30 samples
# You record 30 samples
precise-train -e 60 family-wake.net .
```
**Pros:**
- ✅ Everyone can use it
- ✅ Simple deployment
- ✅ Single model
**Cons:**
- ❌ Can't identify who spoke
- ❌ No personalization
**Use When:**
- Just getting started
- Don't need to know who spoke
- Want simplicity
### Option 2: Speaker Identification (Week 4+)
**What:** Detect wake word, then identify speaker
**How:** Voice embeddings (resemblyzer or pyannote)
```bash
# Install
pip install resemblyzer
# Enroll users
python enroll_speaker.py --name Alan --duration 20
python enroll_speaker.py --name Alice --duration 20
python enroll_speaker.py --name Bob --duration 20
# Server identifies speaker automatically
```
**Pros:**
- ✅ Personalized responses
- ✅ User-specific permissions
- ✅ Better privacy
- ✅ Track preferences
**Cons:**
- ❌ More complex
- ❌ Requires enrollment
- ❌ +100-200ms latency
- ❌ May fail with similar voices
**Use When:**
- Want personalization
- Need user-specific commands
- Ready for advanced features
### Option 3: Per-User Wake Words (Advanced)
**What:** Each person has their own wake word
**How:** Multiple models, one per person
```bash
# Alice: "Hey Mycroft"
# Bob: "Hey Jarvis"
# You: "Hey Computer"
# Run all 3 models in parallel
```
**Pros:**
- ✅ Automatic user ID
- ✅ Highest accuracy per user
- ✅ Clear separation
**Cons:**
- ❌ 3x models = 3x CPU
- ❌ Users must remember their word
- ❌ Server-only (not edge)
**Use When:**
- Need automatic user ID
- Have CPU to spare
- Users want their own wake word
---
## 🎯 Decision Tree
```
START: Want to use voice assistant
├─ Single user or don't care who spoke?
│ └─ Use: Inclusive Training (Option 1)
│ └─ Download: Hey Mycroft (pre-trained)
├─ Multiple users AND need to know who spoke?
│ └─ Use: Speaker Identification (Option 2)
│ └─ Start with: Hey Mycroft + resemblyzer
├─ Want different wake words for different purposes?
│ └─ Use: Multiple Models (Option 1)
│ └─ Download: Hey Mycroft + Hey Jarvis
└─ Deploying to Maix Duino (edge)?
└─ Use: Single Multi-Phrase Model (Option 2)
└─ Train: Custom model with 2-3 phrases
```
---
## 📊 Comparison Table
| Feature | Inclusive | Speaker ID | Per-User Wake | Multiple Wake |
|---------|-----------|------------|---------------|---------------|
| **Setup Time** | 2 hours | 4 hours | 6 hours | 3 hours |
| **Complexity** | ⭐ Easy | ⭐⭐⭐ Medium | ⭐⭐⭐⭐ Hard | ⭐⭐ Easy |
| **CPU Usage** | 5-10% | 10-15% | 15-30% | 15-30% |
| **Latency** | 100ms | 300ms | 100ms | 100ms |
| **User ID** | ❌ No | ✅ Yes | ✅ Yes | ❌ No |
| **Edge Deploy** | ✅ Yes | ⚠️ Maybe | ❌ No | ⚠️ Partial |
| **Personalize** | ❌ No | ✅ Yes | ✅ Yes | ⚠️ Partial |
---
## 🚀 Recommended Timeline
### Week 1: Get It Working
```bash
# Use pre-trained Hey Mycroft
./download_pretrained_models.sh --model hey-mycroft
# Test it
precise-listen ~/precise-models/pretrained/hey-mycroft.net
# Deploy to server
python voice_server.py --enable-precise \
--precise-model ~/precise-models/pretrained/hey-mycroft.net
```
### Week 2-3: Make It Yours
```bash
# Fine-tune with your family's voices
cd ~/precise-models/hey-mycroft-family
# Have everyone record 20-30 samples
precise-collect # Alice
precise-collect # Bob
precise-collect # You
# Train
precise-train -e 30 custom.net . \
--from-checkpoint ../pretrained/hey-mycroft.net
```
### Week 4+: Add Intelligence
```bash
# Speaker identification
pip install resemblyzer
python enroll_speaker.py --name Alan --duration 20
python enroll_speaker.py --name Alice --duration 20
# Now server knows who's speaking!
```
### Month 2+: Expand Features
```bash
# Add second wake word for media control
./download_pretrained_models.sh --model hey-jarvis
# Run both: Mycroft for commands, Jarvis for Plex
python voice_server.py --enable-precise \
--precise-models "mycroft:hey-mycroft.net:0.5,jarvis:hey-jarvis.net:0.5"
```
---
## 💡 Pro Tips
### Wake Word Selection
- ✅ **DO:** Choose clear, distinct wake words
- ✅ **DO:** Test in your environment
- ❌ **DON'T:** Use similar-sounding words
- ❌ **DON'T:** Use common phrases
### Training
- ✅ **DO:** Include all intended users
- ✅ **DO:** Record in various conditions
- ✅ **DO:** Add false positives to training
- ❌ **DON'T:** Rush the training process
### Deployment
- ✅ **DO:** Start simple (one wake word)
- ✅ **DO:** Test thoroughly before adding features
- ✅ **DO:** Monitor false positive rate
- ❌ **DON'T:** Deploy too many wake words at once
### Speaker ID
- ✅ **DO:** Use 20+ seconds for enrollment
- ✅ **DO:** Re-enroll if accuracy drops
- ✅ **DO:** Test threshold values
- ❌ **DON'T:** Expect 100% accuracy
---
## 🔧 Quick Commands
```bash
# Download pre-trained model
./download_pretrained_models.sh --model hey-mycroft
# Test model
precise-listen ~/precise-models/pretrained/hey-mycroft.net
# Fine-tune from pre-trained
precise-train -e 30 custom.net . \
--from-checkpoint ~/precise-models/pretrained/hey-mycroft.net
# Enroll speaker
python enroll_speaker.py --name Alan --duration 20
# Start with single wake word
python voice_server.py --enable-precise \
--precise-model hey-mycroft.net
# Start with multiple wake words
python voice_server.py --enable-precise \
--precise-models "mycroft:hey-mycroft.net:0.5,jarvis:hey-jarvis.net:0.5"
# Check status
curl http://10.1.10.71:5000/wake-word/status
# Monitor detections
curl http://10.1.10.71:5000/wake-word/detections
```
---
## 📚 See Also
- **Full guide:** [ADVANCED_WAKE_WORD_TOPICS.md](ADVANCED_WAKE_WORD_TOPICS.md)
- **Training:** [MYCROFT_PRECISE_GUIDE.md](MYCROFT_PRECISE_GUIDE.md)
- **Deployment:** [PRECISE_DEPLOYMENT.md](PRECISE_DEPLOYMENT.md)
- **Getting started:** [QUICKSTART.md](QUICKSTART.md)
---
## ❓ FAQ
**Q: Can I use "Hey Mycroft" right away?**
A: Yes! Download with `./download_pretrained_models.sh --model hey-mycroft`
**Q: How many wake words can I run at once?**
A: 2-3 comfortably on server. Maix Duino can handle 1.
**Q: Can I train my own custom wake word?**
A: Yes! See MYCROFT_PRECISE_GUIDE.md Phase 2.
**Q: Does speaker ID work with multiple wake words?**
A: Yes! Wake word detected → Speaker identified → Personalized response.
**Q: Can I use this on Maix Duino?**
A: Yes — run wake word detection server-side first (start here), then convert the model to KMODEL for on-device use (advanced).
**Q: How accurate is speaker identification?**
A: 85-95% with good enrollment. Re-enroll if accuracy drops.
**Q: What if someone has a cold?**
A: May reduce accuracy temporarily. System should recover when voice returns to normal.
**Q: Can kids use it?**
A: Yes! Include their voices in training or enroll them separately.
---
**Quick Decision:** Start with pre-trained Hey Mycroft. Add features later!
```bash
./download_pretrained_models.sh --model hey-mycroft
precise-listen ~/precise-models/pretrained/hey-mycroft.net
# It just works! ✨
```
# Maix Duino Voice Assistant - System Architecture
## Overview
Local voice assistant using Sipeed Maix Duino board integrated with Home Assistant, leveraging existing home lab infrastructure for AI processing.
## Hardware Components
### Maix Duino Board
- **Processor**: K210 dual-core RISC-V @ 400MHz
- **AI Accelerator**: KPU for neural network inference
- **Audio**: I2S microphone + speaker output
- **Connectivity**: ESP32 for WiFi/BLE
- **Programming**: MaixPy (MicroPython)
### Recommended Accessories
- I2S MEMS microphone (or microphone array for better pickup)
- Small speaker (3-5W) or audio output to existing speakers
- USB-C power supply (5V/2A minimum)
## Software Architecture
### Edge Layer (Maix Duino)
```
┌─────────────────────────────────────┐
│ Maix Duino (MaixPy) │
├─────────────────────────────────────┤
│ • Wake Word Detection (KPU) │
│ • Audio Capture (I2S) │
│ • Audio Streaming → Heimdall │
│ • Audio Playback ← Heimdall │
│ • LED Feedback (listening status) │
└─────────────────────────────────────┘
↕ WiFi/HTTP/WebSocket
┌─────────────────────────────────────┐
│ Voice Processing Server │
│ (Heimdall - 10.1.10.71) │
├─────────────────────────────────────┤
│ • Whisper STT (existing setup!) │
│ • Intent Recognition (Rasa/custom) │
│ • Piper TTS │
│ • Home Assistant API Client │
└─────────────────────────────────────┘
↕ REST API/MQTT
┌─────────────────────────────────────┐
│ Home Assistant │
│ (Your HA instance) │
├─────────────────────────────────────┤
│ • Device Control │
│ • State Management │
│ • Automation Triggers │
└─────────────────────────────────────┘
```
## Communication Flow
### 1. Wake Word Detection (Local)
```
User says "Hey Assistant"
Maix Duino KPU detects wake word
LED turns on (listening mode)
Start audio streaming to Heimdall
```
### 2. Speech Processing (Heimdall)
```
Audio stream received
Whisper transcribes to text
Intent parser extracts command
Query Home Assistant API
Generate response text
Piper TTS creates audio
Stream audio back to Maix Duino
```
### 3. Playback & Feedback
```
Receive audio stream
Play through speaker
LED indicates completion
Return to wake word detection
```
## Network Configuration
### Maix Duino Network Settings
- **IP**: 10.1.10.xxx (assign static via DHCP reservation)
- **Gateway**: 10.1.10.1
- **DNS**: 10.1.10.4 (Pi-hole)
### Service Endpoints
- **Voice Processing Server**: http://10.1.10.71:5000
- **Home Assistant**: (your existing HA URL)
- **MQTT Broker**: (optional, if using MQTT)
### Caddy Reverse Proxy Entry
Add to `/mnt/project/epona_-_Caddyfile`:
```caddy
# Voice Assistant API
handle /voice-assistant* {
uri strip_prefix /voice-assistant
reverse_proxy http://10.1.10.71:5000
}
```
## Software Stack
### Maix Duino (MaixPy)
- **Firmware**: Latest MaixPy release
- **Libraries**:
- `Maix.KPU` - Neural network inference
- `Maix.I2S` - Audio capture/playback
- `socket` - Network communication
- `ujson` - JSON handling
### Heimdall Server (Python)
- **Environment**: Create new conda env
```bash
conda create -n voice-assistant python=3.10
conda activate voice-assistant
```
- **Dependencies**:
- `openai-whisper` (already installed!)
- `piper-tts` - Text-to-speech
- `flask` - REST API server
- `requests` - HTTP client
- `pyaudio` - Audio handling
- `websockets` - Real-time streaming
### Optional: Intent Recognition
- **Rasa** - Full NLU framework (heavier but powerful)
- **Simple pattern matching** - Lightweight, start here
- **LLM-based** - Use your existing LLM setup on Heimdall
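For the pattern-matching route, a minimal sketch of what an intent parser could look like (the rules and entity IDs here are illustrative, not a fixed schema):

```python
import re

# Ordered (pattern, action, entity) rules; first match wins.
INTENT_RULES = [
    (re.compile(r"turn on .*living room light"),  "turn_on",   "light.living_room"),
    (re.compile(r"turn off .*living room light"), "turn_off",  "light.living_room"),
    (re.compile(r"temperature"),                  "get_state", "sensor.temperature"),
]

def parse_intent(text):
    """Map a transcription to {'action': ..., 'entity': ...}, or None if unrecognized."""
    text = text.lower()
    for pattern, action, entity in INTENT_RULES:
        if pattern.search(text):
            return {"action": action, "entity": entity}
    return None
```

A flat rule list like this covers a surprising number of commands and is easy to debug; graduate to Rasa or an LLM only once the rules get unwieldy.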
## Data Flow Examples
### Example 1: Turn on lights
```
User: "Hey Assistant, turn on the living room lights"
Wake word detected → Start recording
Whisper STT: "turn on the living room lights"
Intent Parser: {
"action": "turn_on",
"entity": "light.living_room"
}
Home Assistant API:
POST /api/services/light/turn_on
{"entity_id": "light.living_room"}
Response: "Living room lights turned on"
Piper TTS → Audio playback
```
### Example 2: Get status
```
User: "What's the temperature?"
Whisper STT: "what's the temperature"
Intent Parser: {
"action": "get_state",
"entity": "sensor.temperature"
}
Home Assistant API:
GET /api/states/sensor.temperature
Response: "The temperature is 72 degrees"
Piper TTS → Audio playback
```
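Both examples call the Home Assistant REST API with a long-lived access token. A stdlib-only sketch of building that service call (the URL and token are placeholders for your own instance):

```python
import json
import urllib.request

HA_URL = "http://homeassistant.local:8123"   # placeholder; use your HA address
HA_TOKEN = "YOUR_LONG_LIVED_TOKEN"           # created under Profile -> long-lived tokens

def build_service_request(domain, service, entity_id):
    """Build the POST request for a Home Assistant service call."""
    url = HA_URL + "/api/services/" + domain + "/" + service
    body = json.dumps({"entity_id": entity_id}).encode("utf-8")
    headers = {
        "Authorization": "Bearer " + HA_TOKEN,
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

# req = build_service_request("light", "turn_on", "light.living_room")
# urllib.request.urlopen(req)   # fires the actual call
```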
## Phase 1 Implementation Plan
### Step 1: Maix Duino Setup (Week 1)
- [ ] Flash latest MaixPy firmware
- [ ] Test audio input/output
- [ ] Implement basic network communication
- [ ] Test streaming audio to server
### Step 2: Server Setup (Week 1-2)
- [ ] Create conda environment on Heimdall
- [ ] Set up Flask API server
- [ ] Integrate Whisper (already have this!)
- [ ] Install and test Piper TTS
- [ ] Create basic Home Assistant API client
### Step 3: Wake Word Training (Week 2)
- [ ] Record wake word samples
- [ ] Train custom wake word model
- [ ] Convert model for K210 KPU
- [ ] Test on-device detection
### Step 4: Integration (Week 3)
- [ ] Connect all components
- [ ] Test end-to-end flow
- [ ] Add error handling
- [ ] Implement fallbacks
### Step 5: Enhancement (Week 4+)
- [ ] Add more intents
- [ ] Improve NLU accuracy
- [ ] Add multi-room support
- [ ] Implement conversation context
## Development Tools
### Testing Wake Word
```python
# Use existing diarization.py for testing audio quality
python3 /path/to/diarization.py test_audio.wav \
--format vtt \
--model medium
```
### Monitoring
- Heimdall logs: `/var/log/voice-assistant/`
- Maix Duino serial console: 115200 baud
- Home Assistant logs: Standard HA logging
## Security Considerations
1. **No external cloud services** - Everything local
2. **Network isolation** - Keep on 10.1.10.0/24
3. **Authentication** - Use HA long-lived tokens
4. **Rate limiting** - Prevent abuse
5. **Audio privacy** - Only stream after wake word
## Resource Requirements
### Heimdall
- **CPU**: Minimal (< 5% idle, spikes during STT)
- **RAM**: ~2GB for Whisper medium model
- **Storage**: ~5GB for models
- **Network**: Low bandwidth (16kHz audio stream)
### Maix Duino
- **Power**: ~1-2W typical
- **Storage**: 16MB flash (plenty for wake word model)
- **RAM**: 8MB SRAM (sufficient for audio buffering)
## Alternative Architectures
### Option A: Fully On-Device (Limited)
- Everything on Maix Duino
- Very limited vocabulary
- No internet required
- Lower accuracy
### Option B: Hybrid (Recommended)
- Wake word on Maix Duino
- Processing on Heimdall
- Best balance of speed/accuracy
### Option C: Raspberry Pi Alternative
- If K210 proves limiting
- More processing power
- Still local/FOSS
- Higher cost
## Expansion Ideas
### Future Enhancements
1. **Multi-room**: Deploy multiple Maix Duino units
2. **Music playback**: Integrate with Plex
3. **Timers/Reminders**: Local scheduling
4. **Weather**: Pull from local weather station
5. **Calendar**: Sync with Nextcloud
6. **Intercom**: Room-to-room communication
7. **Sound events**: Doorbell, smoke alarm detection
### Integration with Existing Infrastructure
- **Plex**: Voice control for media playback
- **qBittorrent**: Status queries, torrent management
- **Nextcloud**: Calendar/contact queries
- **Matrix**: Send messages via voice
## Cost Estimate
- Maix Duino board: ~$20-30 (already have!)
- Microphone: ~$5-10 (if not included)
- Speaker: ~$10-15 (or use existing)
- **Total**: $0-55 (mostly already have)
Compare to commercial solutions:
- Google Home Mini: $50 (requires cloud)
- Amazon Echo Dot: $50 (requires cloud)
- Apple HomePod Mini: $99 (requires cloud)
## Success Criteria
### Minimum Viable Product (MVP)
- ✓ Wake word detection < 1 second
- ✓ Speech-to-text accuracy > 90%
- ✓ Home Assistant command execution
- ✓ Response time < 3 seconds total
- ✓ All processing local (no cloud)
### Enhanced Version
- ✓ Multi-intent conversations
- ✓ Context awareness
- ✓ Multiple wake words
- ✓ Room-aware responses
- ✓ Custom voice training
## Resources & Documentation
### Official Documentation
- Maix Duino: https://wiki.sipeed.com/hardware/en/maix/
- MaixPy: https://maixpy.sipeed.com/
- Home Assistant API: https://developers.home-assistant.io/
### Wake Word Tools
- Mycroft Precise: https://github.com/MycroftAI/mycroft-precise
- Porcupine: https://github.com/Picovoice/porcupine
### TTS Options
- Piper: https://github.com/rhasspy/piper
- Coqui TTS: https://github.com/coqui-ai/TTS
### Community Projects
- Rhasspy: https://rhasspy.readthedocs.io/ (full voice assistant framework)
- Willow: https://github.com/toverainc/willow (ESP32-based alternative)
## Next Steps
1. **Test current setup**: Verify Maix Duino boots and can connect to WiFi
2. **Audio test**: Record and playback test on the board
3. **Server setup**: Create conda environment and install dependencies
4. **Simple prototype**: Wake word → beep (no processing yet)
5. **Iterate**: Add complexity step by step
# MicroPython/MaixPy Quirks and Compatibility Notes
**Date:** 2025-12-03
**MicroPython Version:** v0.6.2-89-gd8901fd22 on 2024-06-17
**Hardware:** Sipeed Maixduino (K210)
This document captures all the compatibility issues and workarounds discovered while developing the voice assistant client for Maixduino.
---
## String Formatting
### ❌ F-strings NOT supported
```python
# WRONG - SyntaxError
message = f"IP: {ip}"
temperature = f"Temp: {temp}°C"
```
### ✅ Use string concatenation
```python
# CORRECT
message = "IP: " + str(ip)
temperature = "Temp: " + str(temp) + "°C"
```
---
## Conditional Expressions (Ternary Operator)
### ❌ Inline ternary expressions NOT supported
```python
# WRONG - SyntaxError
plural = "s" if count > 1 else ""
message = "Found " + str(count) + " item" + ("s" if count > 1 else "")
```
### ✅ Use explicit if/else blocks
```python
# CORRECT
if count > 1:
plural = "s"
else:
plural = ""
message = "Found " + str(count) + " item" + plural
```
---
## String Methods
### ❌ decode() doesn't accept keyword arguments
```python
# WRONG - TypeError: function doesn't take keyword arguments
text = response.decode('utf-8', errors='ignore')
```
### ✅ Use positional arguments only (or catch exceptions)
```python
# CORRECT
try:
text = response.decode('utf-8')
except:
text = str(response)
```
---
## Display/LCD Color Format
### ❌ RGB tuples NOT accepted
```python
# WRONG - TypeError: can't convert tuple to int
COLOR_RED = (255, 0, 0)
lcd.draw_string(10, 50, "Hello", COLOR_RED, 0)
```
### ✅ Use bit-packed integers
```python
# CORRECT - Pack RGB into 16-bit or 24-bit integer
def rgb_to_int(r, g, b):
return (r << 16) | (g << 8) | b
COLOR_RED = rgb_to_int(255, 0, 0)
lcd.draw_string(10, 50, "Hello", COLOR_RED, 0)
```
---
## Network - WiFi Module
### ❌ Standard network.WLAN NOT available
```python
# WRONG - AttributeError: 'module' object has no attribute 'WLAN'
import network
nic = network.WLAN(network.STA_IF)
```
### ✅ Use network.ESP32_SPI for Maixduino
```python
# CORRECT - Requires full pin configuration
from network import ESP32_SPI
from fpioa_manager import fm
# Register all 6 SPI pins
fm.register(25, fm.fpioa.GPIOHS10, force=True) # CS
fm.register(8, fm.fpioa.GPIOHS11, force=True) # RST
fm.register(9, fm.fpioa.GPIOHS12, force=True) # RDY
fm.register(28, fm.fpioa.GPIOHS13, force=True) # MOSI
fm.register(26, fm.fpioa.GPIOHS14, force=True) # MISO
fm.register(27, fm.fpioa.GPIOHS15, force=True) # SCLK
nic = ESP32_SPI(
cs=fm.fpioa.GPIOHS10,
rst=fm.fpioa.GPIOHS11,
rdy=fm.fpioa.GPIOHS12,
mosi=fm.fpioa.GPIOHS13,
miso=fm.fpioa.GPIOHS14,
sclk=fm.fpioa.GPIOHS15
)
nic.connect(SSID, PASSWORD)
```
### ❌ active() method NOT available
```python
# WRONG - AttributeError: 'ESP32_SPI' object has no attribute 'active'
nic.active(True)
```
### ✅ Just use connect() directly
```python
# CORRECT
nic.connect(SSID, PASSWORD)
```
---
## I2S Audio
### ❌ record() doesn't return a plain byte buffer
```python
# WRONG - TypeError: object with buffer protocol required
chunk = i2s_dev.record(1024)
```
### ✅ Returns Audio object, use to_bytes()
```python
# CORRECT
audio_obj = i2s_dev.record(total_bytes)
audio_data = audio_obj.to_bytes()
```
**Note:** Audio data often comes in unexpected formats:
- Expected: 16-bit mono PCM
- Reality: Often 32-bit or stereo (4x expected size)
- Solution: Implement format detection and conversion
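A sketch of that conversion, assuming 32-bit little-endian stereo frames with the 16 significant bits in the upper half of each slot — verify the layout against what your mic actually returns:

```python
def stereo32_to_mono16(data):
    """Convert 32-bit little-endian stereo PCM to 16-bit mono (4x smaller).

    Keeps the top 16 bits of the left-channel sample of each frame.
    Assumes the mic packs its useful bits into the upper half of each
    32-bit slot; check your hardware's actual layout first.
    """
    frames = len(data) // 8               # 8 bytes per stereo 32-bit frame
    out = bytearray(frames * 2)           # 2 bytes per mono 16-bit sample
    for i in range(frames):
        src = i * 8
        # Little-endian: the high 16 bits of the left sample are bytes 2-3
        out[i * 2] = data[src + 2]
        out[i * 2 + 1] = data[src + 3]
    return out
```

Copying byte pairs into a pre-allocated bytearray avoids the large intermediate allocations that trigger MemoryError on the K210.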
---
## Memory Management
### Memory is VERY limited (~6MB total, much less available)
**Problems encountered:**
- Creating large bytearrays fails (>100KB can fail)
- Multiple allocations cause fragmentation
- In-place operations preferred over creating new buffers
### ❌ Creating new buffers
```python
# WRONG - MemoryError on large data
compressed = bytearray()
for i in range(0, len(data), 4):
compressed.extend(data[i:i+2]) # Allocates new memory
```
### ✅ Work with smaller chunks or compress during transmission
```python
# CORRECT - Process in smaller pieces
chunk_size = 512
for i in range(0, len(data), chunk_size):
chunk = data[i:i+chunk_size]
process_chunk(chunk) # Handle incrementally
```
**Solutions implemented:**
1. Reduce recording duration (3s → 1s)
2. Compress audio (μ-law: 50% size reduction)
3. Stream transmission in small chunks (512 bytes)
4. Add delays between sends to prevent buffer overflow
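The μ-law step (item 2) can be done in pure Python with no large temporaries. A sketch of G.711 μ-law encoding into a pre-allocated buffer:

```python
_BIAS = 0x84   # standard G.711 bias
_CLIP = 32635

def linear_to_ulaw(sample):
    """Encode one signed 16-bit sample as an 8-bit G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(-sample if sample < 0 else sample, _CLIP) + _BIAS
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not magnitude & mask:
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF

def compress_ulaw(pcm):
    """Compress 16-bit little-endian PCM bytes to mu-law (half the size)."""
    out = bytearray(len(pcm) // 2)
    for i in range(0, len(pcm) - 1, 2):
        sample = pcm[i] | (pcm[i + 1] << 8)
        if sample >= 0x8000:              # manual sign extension, no struct needed
            sample -= 0x10000
        out[i // 2] = linear_to_ulaw(sample)
    return out
```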
---
## String Operations
### ❌ Arithmetic in string concatenation
```python
# WRONG - SyntaxError (sometimes)
message = "Count: #" + str(count + 1)
```
### ✅ Separate arithmetic from concatenation
```python
# CORRECT
next_count = count + 1
message = "Count: #" + str(next_count)
```
---
## Bytearray Operations
### ❌ Item deletion NOT supported
```python
# WRONG - TypeError: 'bytearray' object doesn't support item deletion
del audio_data[expected_size:]
```
### ✅ Create new bytearray with slice
```python
# CORRECT
audio_data = audio_data[:expected_size]
# Or create new buffer
trimmed = bytearray(expected_size)
trimmed[:] = audio_data[:expected_size]
```
---
## HTTP Requests
### ❌ urequests module NOT available
```python
# WRONG - ImportError: no module named 'urequests'
import urequests
response = urequests.post(url, data=data)
```
### ✅ Use raw socket HTTP
```python
# CORRECT
import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((host, port))
# Manual HTTP headers
headers = "POST /path HTTP/1.1\r\n"
headers += "Host: " + host + "\r\n"
headers += "Content-Type: audio/wav\r\n"
headers += "Content-Length: " + str(len(data)) + "\r\n"
headers += "Connection: close\r\n\r\n"
s.send(headers.encode())
s.send(data)
response = s.recv(1024)  # loop recv() until b"" for responses larger than 1KB
s.close()
```
**Socket I/O errors common:**
- `[Errno 5] EIO` - Buffer overflow or disconnect
- Solutions:
- Send smaller chunks (512-1024 bytes)
- Add delays between sends (`time.sleep_ms(10)`)
- Enable keepalive if supported
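The `Content-Type: audio/wav` header above implies a WAV container around the audio bytes. A minimal 44-byte header builder (this sketch shows plain PCM, format code 1; a μ-law payload would use format code 7 instead):

```python
import struct  # MicroPython exposes the same API (also available as ustruct)

def wav_header(data_len, sample_rate=16000, channels=1, bits=16):
    """Build a 44-byte PCM WAV header for a payload of data_len bytes."""
    byte_rate = sample_rate * channels * bits // 8
    block_align = channels * bits // 8
    return struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + data_len, b"WAVE",
        b"fmt ", 16, 1, channels, sample_rate,   # 16-byte fmt chunk, format 1 = PCM
        byte_rate, block_align, bits,
        b"data", data_len,
    )
```

Send the header first, then stream the audio chunks; `Content-Length` is 44 plus the payload size.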
---
## Best Practices for MaixPy
1. **Avoid complex expressions** - Break into simple steps
2. **Pre-allocate when possible** - Reduce fragmentation
3. **Use small buffers** - 512-1024 byte chunks work well
4. **Add delays in loops** - Prevent watchdog/buffer issues
5. **Explicit type conversions** - Always use `str()`, `int()`, etc.
6. **Test incrementally** - Memory errors appear suddenly
7. **Monitor serial output** - Errors often give hints
8. **Simplify, simplify** - Complexity = bugs in MicroPython
---
## Testing Methodology
When porting Python code to MaixPy:
1. Start with simplest version (hardcoded values)
2. Test each function individually via REPL
3. Add features incrementally
4. Watch for memory errors (usually allocation failures)
5. If error occurs, simplify the last change
6. Use print statements liberally (no debugger available)
---
## Hardware-Specific Notes
### Maixduino ESP32 WiFi
- Requires manual pin registration
- 6 pins must be configured (CS, RST, RDY, MOSI, MISO, SCLK)
- Connection can be slow (20+ seconds)
- Stability improves with smaller packet sizes
### I2S Microphone
- Returns Audio objects, not raw bytes
- Format is often different than configured
- May return stereo when mono requested
- May return 32-bit when 16-bit requested
- Always implement format detection/conversion
### BOOT Button (GPIO 16)
- Active low (0 = pressed, 1 = released)
- Requires pull-up configuration
- Debounce by waiting for release
- Can be used without interrupts (polling is fine)
---
## Resources
- **MaixPy Documentation:** https://maixpy.sipeed.com/
- **K210 Datasheet:** https://canaan.io/product/kendryteai
- **ESP32 SPI Firmware:** https://github.com/sipeed/MaixPy_scripts/tree/master/network
---
## Summary of Successful Patterns
**Audio recording and transmission pipeline:**

1. Record audio → Audio object (128KB for 1 second)
2. Convert to bytes → `to_bytes()` (still 128KB)
3. Detect format → check actual size vs expected
4. Convert to mono 16-bit → in-place copy (32KB)
5. Compress with μ-law → 50% reduction (16KB)
6. Send in chunks → 512 bytes at a time, with delays
7. Parse response → simple string operations

**Total: ~87.5% size reduction (128KB → 16KB) — fits in memory!**
This approach works reliably on K210 with ~6MB RAM.
---
**Last Updated:** 2025-12-03
**Status:** Fully tested and working
hardware/maixduino/README.md
@ -0,0 +1,184 @@
# Maixduino Scripts
Scripts to copy/paste into MaixPy IDE for running on the Maix Duino board.
## Files
### 1. maix_test_simple.py
**Purpose:** Hardware and connectivity test
**Use:** Copy/paste into MaixPy IDE to test before deploying full application
**Tests:**
- LCD display functionality
- WiFi connection
- Network connection to Heimdall server (port 3006)
- I2S audio hardware initialization
**Before running:**
1. Edit WiFi credentials (lines 16-17):
```python
WIFI_SSID = "YourNetworkName"
WIFI_PASSWORD = "YourPassword"
```
2. Verify server URL is correct (line 18):
```python
SERVER_URL = "http://10.1.10.71:3006"
```
3. Copy entire file contents
4. Paste into MaixPy IDE
5. Click RUN button
**Expected output:**
- Display will show test results
- Serial console will print detailed progress
- Will report OK/FAIL for each test
---
### 2. maix_voice_client.py
**Purpose:** Full voice assistant client
**Use:** Copy/paste into MaixPy IDE after test passes
**Features:**
- Wake word detection (placeholder - uses amplitude trigger)
- Audio recording after wake word
- Sends audio to Heimdall server for processing
- Displays transcription and response on LCD
- LED feedback for status
**Before running:**
1. Edit WiFi credentials (lines 38-39)
2. Verify server URL (line 42)
3. Adjust audio settings if needed (lines 45-62)
**For SD card deployment:**
1. Copy this file to SD card as `main.py`
2. Board will auto-run on boot
---
## Deployment Workflow
### Step 1: Test Hardware (maix_test_simple.py)
```
1. Edit WiFi settings
2. Paste into MaixPy IDE
3. Click RUN
4. Verify all tests pass
```
### Step 2: Deploy Full Client (maix_voice_client.py)
**Option A - IDE Testing:**
```
1. Edit WiFi settings
2. Paste into MaixPy IDE
3. Click RUN for testing
```
**Option B - Permanent SD Card:**
```
1. Edit WiFi settings
2. Save to SD card as: /sd/main.py
3. Reboot board - auto-runs on boot
```
---
## Hardware Requirements
### Maix Duino Board
- K210 processor with KPU
- LCD display (built-in)
- I2S microphone (check connections)
- ESP32 WiFi module (built-in)
### I2S Pin Configuration (Default)
```python
Pin 20: I2S0_IN_D0 (Data)
Pin 19: I2S0_WS (Word Select)
Pin 18: I2S0_SCLK (Clock)
```
**Note:** If your microphone uses different pins, edit the pin assignments in the scripts.
---
## Troubleshooting
### WiFi Won't Connect
- Verify SSID and password are correct
- Ensure WiFi is 2.4GHz (not 5GHz - Maix doesn't support 5GHz)
- Check signal strength
- Try moving closer to router
### Server Connection Fails
- Verify Heimdall server is running on port 3006
- Check firewall allows port 3006
- Ensure Maix is on same network (10.1.10.0/24)
- Test from another device: `curl http://10.1.10.71:3006/health`
### Audio Initialization Fails
- Check microphone is properly connected
- Verify I2S pins match your hardware
- Try alternate pin configuration if needed
- Check microphone requires 3.3V (not 5V)
### Script Errors in MaixPy IDE
- Ensure using latest MaixPy firmware
- Check for typos when editing WiFi credentials
- Verify entire script was copied (check for truncation)
- Look at serial console for detailed error messages
---
## MaixPy IDE Tips
### Running Scripts
1. Connect board via USB
2. Select correct board model: Tools → Select Board
3. Click connect button (turns red when connected)
4. Paste code into editor
5. Click run button (red triangle)
6. Watch serial console and LCD for output
### Stopping Scripts
- Click run button again to stop
- Or press reset button on board
### Serial Console
- Shows detailed debug output
- Useful for troubleshooting
- Can copy errors for debugging
---
## Network Configuration
- **Heimdall Server:** 10.1.10.71:3006
- **Maix Duino:** Gets IP via DHCP (shown on LCD during test)
- **Network:** 10.1.10.0/24
---
## Next Steps
After both scripts work:
1. Verify Heimdall server is processing audio
2. Test wake word detection
3. Integrate with Home Assistant (optional)
4. Train custom wake word (optional)
5. Deploy to SD card for permanent installation
---
## Related Documentation
- **Project overview:** `../PROJECT_SUMMARY.md`
- **Heimdall setup:** `../QUICKSTART.md`
- **Wake word training:** `../MYCROFT_PRECISE_GUIDE.md`
- **Server deployment:** `../docs/PRECISE_DEPLOYMENT.md`
---
**Last Updated:** 2025-12-03
**Location:** `/Library/Development/devl/Devops/projects/mycroft-precise/maixduino-scripts/`

# Maixduino Voice Assistant - Session Progress
**Date:** 2025-12-03
**Session Duration:** ~4 hours
**Goal:** Get audio recording and transcription working on Maixduino → Heimdall server
---
## 🎉 Major Achievements
### ✅ Full Audio Pipeline Working!
We successfully built and tested the complete audio capture → compression → transmission → transcription pipeline:
1. **WiFi Connection** - Maixduino connects to network (10.1.10.98)
2. **Audio Recording** - I2S microphone captures audio (MSM261S4030H0 MEMS mic)
3. **Format Conversion** - Converts 32-bit stereo to 16-bit mono (4x size reduction)
4. **μ-law Compression** - Compresses PCM audio by 50%
5. **HTTP Transmission** - Sends compressed WAV to Heimdall server
6. **Whisper Transcription** - Server transcribes and returns text
7. **LCD Display** - Shows transcription on Maixduino screen
8. **Button Loop** - Press BOOT button for repeated recordings
**Total size reduction:** 128KB → 32KB (mono) → 16KB (compressed) = **87.5% reduction!**
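The arithmetic behind those numbers, as a quick sanity check:

```python
# Back-of-the-envelope check of the size chain for 1 second of audio
SAMPLE_RATE = 16000   # Hz
SECONDS = 1

raw = SAMPLE_RATE * SECONDS * 4 * 2   # 32-bit (4-byte) stereo, as captured by hardware
mono = SAMPLE_RATE * SECONDS * 2      # 16-bit mono after conversion
ulaw = SAMPLE_RATE * SECONDS * 1      # 8-bit mu-law after compression

print(raw, mono, ulaw)                # 128000 32000 16000
print(1 - ulaw / raw)                 # 0.875 -> 87.5% reduction
```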
---
## 🔧 Technical Accomplishments
### Audio Recording Pipeline
- **Initial Problem:** `i2s_dev.record()` returned immediately (1ms instead of 1000ms)
- **Root Cause:** Recording API is asynchronous/non-blocking
- **Solution:** Use chunked recording with `wait_record()` blocking calls
- **Pattern:**
```python
for i in range(frame_cnt):
audio_chunk = i2s_dev.record(chunk_size)
i2s_dev.wait_record() # CRITICAL: blocks until complete
chunks.append(audio_chunk.to_bytes())
```
### Memory Management
- **K210 has very limited RAM** (~6MB total, much less available)
- Successfully handled 128KB → 16KB data transformation without OOM errors
- Techniques used:
- Record in small chunks (2048 samples)
- Stream HTTP transmission (512-byte chunks with delays)
- In-place data conversion where possible
- Explicit garbage collection hints (`audio_data = None`)
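The stream-instead-of-buffer idea can be sketched as a plain generator (illustrative only; the on-device code inlines the same loop rather than using a generator):

```python
def stream_chunks(data, chunk_size=512):
    """Yield fixed-size slices so the full payload is never duplicated in RAM."""
    for i in range(0, len(data), chunk_size):
        yield data[i:i + chunk_size]

payload = bytes(16000)                            # 1s of mu-law audio
sizes = [len(c) for c in stream_chunks(payload)]  # 31 full 512-byte chunks + one 128-byte tail
```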
### Network Communication
- **Raw socket HTTP** (no urequests library available)
- **Chunked streaming** with flow control (10ms delays)
- **Simple WAV format** with μ-law compression (format code 7)
- **Robust error handling** with serial output debugging
---
## 🐛 MicroPython/MaixPy Quirks Discovered
### String Operations
- ❌ **F-strings NOT supported** - Must use `"text " + str(var)` concatenation
- ❌ **Ternary operators fail** - Use explicit `if/else` blocks instead
- ❌ **`split()` needs explicit delimiter** - `text.split(" ")` not `text.split()`
- ❌ **Escape sequences problematic** - Avoid `\n` in strings, causes syntax errors
### Data Types & Methods
- ❌ **`decode()` doesn't accept kwargs** - Use `decode('utf-8')` not `decode('utf-8', errors='ignore')`
- ❌ **RGB tuples not accepted** - Must convert to packed integers: `(r << 16) | (g << 8) | b`
- ❌ **Bytearray item deletion unsupported** - `del arr[n:]` fails, use slicing instead
- ❌ **Arithmetic in string concat** - Separate calculations: `next = count + 1; "text" + str(next)`
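The portable spellings of these patterns, runnable both on CPython and on this MaixPy build:

```python
count = 7

# No f-strings: build messages with str() and concatenation
msg = "Recorded " + str(count) + " chunks"

# No ternary expressions: use an explicit if/else
if count > 0:
    status = "ok"
else:
    status = "empty"

# split() needs an explicit delimiter
words = "hey mycroft".split(" ")

# decode() takes positional arguments only (no errors= kwarg)
text = b"hello".decode('utf-8')

# Pack RGB tuples into a single integer for the LCD API
red = (255 << 16) | (0 << 8) | 0

# Keep arithmetic out of string concatenation
next_count = count + 1
label = "Recording #" + str(next_count)
```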
### I2S Audio Specific
- ❌ **`record()` is non-blocking** - Returns immediately, must use `wait_record()`
- ❌ **Audio object not directly iterable** - Must call `.to_bytes()` first
- ⚠️ **Data format mismatch** - Hardware returns 32-bit stereo even when configured for 16-bit mono (4x expected size)
### Network/WiFi
- ❌ **`network.WLAN` not available** - Must use `network.ESP32_SPI` with full pin config
- ❌ **`active()` method doesn't exist** - Just call `connect()` directly
- ⚠️ **Requires ALL 6 pins configured** - CS, RST, RDY, MOSI, MISO, SCLK
### General Syntax
- ⚠️ **`if __name__ == "__main__"` sometimes causes syntax errors** - Safer to just call `main()` directly
- ⚠️ **Import statements mid-function can cause syntax errors** - Keep imports at top of file
- ⚠️ **Some valid Python causes "invalid syntax" for unknown reasons** - Simplify complex expressions
---
## 📊 Current Status
### ✅ Working
- WiFi connectivity (ESP32 SPI)
- I2S audio initialization
- Chunked audio recording with `wait_record()`
- Audio format detection and conversion (32-bit stereo → 16-bit mono)
- μ-law compression (50% size reduction)
- HTTP transmission to server (chunked streaming)
- Whisper transcription (server-side)
- JSON response parsing
- LCD display (with word wrapping)
- Button-triggered recording loop
- Countdown timer before recording
### ⚠️ Partially Working
- **Recording duration** - Currently getting ~0.9 seconds instead of full 1 second
- Formula: `frame_cnt = seconds * sample_rate // chunk_size`
- Current: `7 frames × (2048/16000) = 0.896s`
- May need to increase `frame_cnt` or adjust chunk size
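One way to fix the shortfall is ceiling division (a hypothetical tweak, not yet applied to the script; `frame_count` is an illustrative name):

```python
SAMPLE_RATE = 16000
CHUNK_SIZE = 2048

def frame_count(seconds):
    """Chunks needed to cover `seconds` of audio, rounding up instead of down."""
    return (seconds * SAMPLE_RATE + CHUNK_SIZE - 1) // CHUNK_SIZE

floor_frames = 1 * SAMPLE_RATE // CHUNK_SIZE   # 7 frames -> 0.896s
ceil_frames = frame_count(1)                   # 8 frames -> 1.024s
```

Eight frames record slightly long (1.024s), which is harmless; the surplus samples can be trimmed after conversion.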
### ❌ Not Yet Implemented
- Mycroft Precise wake word detection
- Full voice assistant loop
- Command processing
- Home Assistant integration
- Multi-second recording support
- Real-time audio streaming
---
## 🔬 Technical Details
### Hardware Configuration
**Maixduino Board:**
- Processor: K210 dual-core RISC-V @ 400MHz
- RAM: ~6MB total (limited available memory)
- WiFi: ESP32 module via SPI
- Microphone: MSM261S4030H0 MEMS (onboard)
- IP Address: 10.1.10.98
**I2S Pins:**
- Pin 20: I2S0_IN_D0 (data)
- Pin 19: I2S0_WS (word select)
- Pin 18: I2S0_SCLK (clock)
**ESP32 SPI Pins:**
- Pin 25: CS (chip select)
- Pin 8: RST (reset)
- Pin 9: RDY (ready)
- Pin 28: MOSI (master out)
- Pin 26: MISO (master in)
- Pin 27: SCLK (clock)
**GPIO:**
- Pin 16: BOOT button (active low, pull-up)
### Server Configuration
**Heimdall Server:**
- IP: 10.1.10.71
- Port: 3006
- Framework: Flask
- Model: Whisper base
- Environment: Conda `whisper_cli`
**Endpoints:**
- `/health` - Health check
- `/transcribe` - POST audio for transcription
### Audio Format
**Recording:**
- Sample Rate: 16kHz
- Hardware Output: 32-bit stereo (128KB for 1 second)
- After Conversion: 16-bit mono (32KB for 1 second)
- After Compression: 8-bit μ-law (16KB for 1 second)
**WAV Header:**
- Format Code: 7 (μ-law)
- Channels: 1 (mono)
- Sample Rate: 16000 Hz
- Bits per Sample: 8
- Includes `fact` chunk (required for μ-law)
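The header layout above can be assembled with `struct` (a host-side sketch of the same 58-byte layout; `ulaw_wav_header` is an illustrative name, not the on-device function):

```python
import struct

def ulaw_wav_header(data_size, sample_rate=16000):
    """58-byte WAV header for 8-bit mu-law mono, including the fact chunk."""
    h = bytearray()
    # RIFF chunk size = total file size - 8 = 50 + data_size
    h += b'RIFF' + struct.pack('<I', 50 + data_size) + b'WAVE'
    # fmt chunk: 18-byte payload (format 7 = mu-law, mono, byte rate = sample
    # rate for 8-bit samples, block align 1, 8 bits/sample, extension size 0)
    h += b'fmt ' + struct.pack('<IHHIIHHH', 18, 7, 1, sample_rate,
                               sample_rate, 1, 8, 0)
    # fact chunk: total sample count (one byte per sample for mu-law)
    h += b'fact' + struct.pack('<II', 4, data_size)
    h += b'data' + struct.pack('<I', data_size)
    return bytes(h)

header = ulaw_wav_header(16000)   # header for 1s of compressed audio
```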
---
## 📝 Code Files
### Main Script
**File:** `/Library/Development/devl/Devops/projects/mycroft-precise/maixduino-scripts/maix_simple_record_test.py`
**Key Functions:**
- `init_wifi()` - ESP32 SPI WiFi connection
- `init_audio()` - I2S microphone setup
- `record_audio()` - Chunked recording with `wait_record()`
- `convert_to_mono_16bit()` - Format conversion (32-bit stereo → 16-bit mono)
- `compress_ulaw()` - μ-law compression
- `create_wav_header()` - WAV file header generation
- `send_to_server()` - HTTP POST with chunked streaming
- `display_transcription()` - LCD output with word wrapping
- `main()` - Button loop for repeated recordings
### Server Script
**File:** `/devl/voice-assistant/simple_transcribe_server.py`
**Features:**
- Accepts raw WAV or multipart uploads
- Whisper base model transcription
- JSON response with transcription text
- Handles μ-law compressed audio
### Documentation
**File:** `/Library/Development/devl/Devops/projects/mycroft-precise/maixduino-scripts/MICROPYTHON_QUIRKS.md`
Complete reference of all MicroPython compatibility issues discovered during development.
---
## 🎯 Next Steps
### Immediate (Tonight)
1. ✅ Switch to Linux laptop with direct serial access
2. ⏭️ Tune recording duration to get full 1 second
- Try `frame_cnt = 8` instead of 7
- Or adjust chunk size to get exact timing
3. ⏭️ Test transcription quality with proper-length recordings
### Short Term (This Week)
1. Increase recording duration to 2-3 seconds for better transcription
2. Test memory limits with longer recordings
3. Optimize compression/transmission for speed
4. Add visual feedback during transmission
### Medium Term (Next Week)
1. Install Mycroft Precise in `whisper_cli` environment
2. Test "hey mycroft" wake word detection on server
3. Integrate wake word into recording loop
4. Add command processing and Home Assistant integration
### Long Term (Future)
1. Explore edge wake word detection (Precise on K210)
2. Multi-device deployment
3. Continuous listening mode
4. Voice profiles and speaker identification
---
## 🐛 Known Issues
### Recording Duration
- **Issue:** Recording is ~0.9 seconds instead of 1.0 seconds
- **Cause:** `16000 / 2048 = 7.8125`, so the integer division `16000 // 2048` rounds down to 7 frames
- **Impact:** Minor - transcription still works
- **Fix:** Increase `frame_cnt` to 8 or adjust chunk size
### Data Format Mismatch
- **Issue:** Hardware returns 4x expected data (128KB vs 32KB)
- **Cause:** I2S outputting 32-bit stereo despite 16-bit mono config
- **Impact:** None - conversion function handles it
- **Status:** Working as intended
### Syntax Error Sensitivity
- **Issue:** Some valid Python causes "invalid syntax" in MicroPython
- **Patterns:** Import statements mid-function, certain arithmetic expressions
- **Workaround:** Simplify code, avoid complex expressions
- **Status:** Documented in MICROPYTHON_QUIRKS.md
---
## 💡 Key Learnings
### I2S Recording Pattern
The correct pattern for MaixPy I2S recording:
```python
chunk_size = 2048
frame_cnt = seconds * sample_rate // chunk_size
for i in range(frame_cnt):
audio_chunk = i2s_dev.record(chunk_size)
i2s_dev.wait_record() # BLOCKS until recording complete
data.append(audio_chunk.to_bytes())
```
**Critical:** `wait_record()` is REQUIRED or recording returns immediately!
### Memory Management
K210 has very limited RAM. Successful strategies:
- Work in small chunks (512-2048 bytes)
- Stream data instead of buffering
- Free variables explicitly when done
- Avoid creating large intermediate buffers
### MicroPython Compatibility
MicroPython is NOT Python. Many standard features missing:
- F-strings, ternary operators, keyword arguments
- Some string methods, complex expressions
- Common modules (`urequests`, a full `json` parser)
**Rule:** Test incrementally, simplify everything, check quirks doc.
---
## 📚 Resources Used
### Documentation
- [MaixPy I2S API Reference](https://wiki.sipeed.com/soft/maixpy/en/api_reference/Maix/i2s.html)
- [MaixPy I2S Usage Guide](https://wiki.sipeed.com/soft/maixpy/en/modules/on_chip/i2s.html)
- [Maixduino Hardware Wiki](https://wiki.sipeed.com/hardware/en/maix/maixpy_develop_kit_board/maix_duino.html)
### Code Examples
- [Official record_wav.py](https://github.com/sipeed/MaixPy-v1_scripts/blob/master/multimedia/audio/record_wav.py)
- [MaixPy Scripts Repository](https://github.com/sipeed/MaixPy-v1_scripts)
### Tools
- MaixPy IDE (copy/paste to board)
- Serial monitor (debugging)
- Heimdall server (Whisper transcription)
---
## 🔄 Ready for Next Session
### Current State
- ✅ Code is working and stable
- ✅ Can record, compress, transmit, transcribe, display
- ✅ Button loop allows repeated testing
- ⚠️ Recording duration slightly short (~0.9s)
### Files Ready
- `/Library/Development/devl/Devops/projects/mycroft-precise/maixduino-scripts/maix_simple_record_test.py`
- `/Library/Development/devl/Devops/projects/mycroft-precise/maixduino-scripts/MICROPYTHON_QUIRKS.md`
- `/devl/voice-assistant/simple_transcribe_server.py`
### For Serial Access Session
1. Connect Maixduino via USB to Linux laptop
2. Install pyserial: `pip install pyserial`
3. Find device: `ls /dev/ttyUSB*` or `/dev/ttyACM*`
4. Connect: `screen /dev/ttyUSB0 115200` or use MaixPy IDE
5. Can directly modify code, test immediately, see serial output
### Quick Test Commands
```python
# Test WiFi
from network import ESP32_SPI
# ... (full init code in maix_test_simple.py)
# Test I2S
from Maix import I2S
rx = I2S(I2S.DEVICE_0)
# ...
# Test recording
audio = rx.record(2048)
rx.wait_record()
print(len(audio.to_bytes()))
```
---
## 🎊 Success Metrics
Today we achieved:
- ✅ WiFi connection working
- ✅ Audio recording working (with proper blocking)
- ✅ Format conversion working (4x reduction)
- ✅ Compression working (2x reduction)
- ✅ Network transmission working (chunked streaming)
- ✅ Server transcription working
- ✅ Display output working
- ✅ Button loop working
- ✅ End-to-end pipeline complete!
**Total:** 9/9 core features working! 🚀
Minor tuning needed, but the foundation is solid and ready for wake word integration.
---
**Session Summary:** Massive progress! From zero to working audio transcription pipeline in one session. Overcame significant MicroPython compatibility challenges and memory limitations. Ready for next phase: wake word detection.
**Status:** ✅ Ready for Linux serial access and fine-tuning
**Next Session:** Tune recording duration, then integrate Mycroft Precise wake word detection
---
*End of Session Report - 2025-12-03*

# Debug script to discover WiFi module methods
# This will help us figure out the correct API
import lcd
lcd.init()
lcd.clear()
print("=" * 40)
print("WiFi Module Debug")
print("=" * 40)
# Try to import WiFi module
try:
from network_esp32 import wifi
print("SUCCESS: Imported network_esp32.wifi")
lcd.draw_string(10, 10, "WiFi module found!", 0xFFFF, 0x0000)
# List all attributes/methods
print("\nAvailable methods:")
lcd.draw_string(10, 30, "Checking methods...", 0xFFFF, 0x0000)
attrs = dir(wifi)
y = 50
for i, attr in enumerate(attrs):
if not attr.startswith('_'):
print(" - " + attr)
if i < 10: # Only show first 10 on screen
lcd.draw_string(10, y, attr[:20], 0x07E0, 0x0000)
y += 15
print("\nTotal methods: " + str(len(attrs)))
except Exception as e:
print("ERROR importing wifi: " + str(e))
lcd.draw_string(10, 10, "WiFi import failed!", 0xF800, 0x0000)
lcd.draw_string(10, 30, str(e)[:30], 0xF800, 0x0000)
print("\n" + "=" * 40)
print("Debug complete - check serial output")
print("=" * 40)

# Discover what network/WiFi modules are actually available
import lcd
import sys
lcd.init()
lcd.clear()
print("=" * 40)
print("Module Discovery")
print("=" * 40)
# Try different possible module names
modules_to_try = [
"network",
"network_esp32",
"network_esp8285",
"esp32_spi",
"esp8285",
"wifi",
"ESP32_SPI",
"WIFI"
]
found = []
y = 10
for module_name in modules_to_try:
try:
mod = __import__(module_name)
msg = "FOUND: " + module_name
print(msg)
lcd.draw_string(10, y, msg[:25], 0x07E0, 0x0000) # Green
y += 15
found.append(module_name)
# Show methods
print(" Methods: " + str(dir(mod)))
except Exception as e:
msg = "NONE: " + module_name
print(msg + " (" + str(e) + ")")
print("\n" + "=" * 40)
if found:
print("Found modules: " + str(found))
lcd.draw_string(10, y + 20, "Found: " + str(len(found)), 0xFFFF, 0x0000)
else:
print("No WiFi modules found!")
lcd.draw_string(10, y + 20, "No WiFi found!", 0xF800, 0x0000)
print("=" * 40)

# Simple Audio Recording and Transcription Test
# Record audio for RECORD_SECONDS (currently 1 second), send to server, display transcription
#
# This tests the full audio pipeline without wake word detection
import time
import lcd
import socket
import struct
from Maix import GPIO, I2S
from fpioa_manager import fm
# ===== CONFIGURATION =====
# Load credentials from secrets.py (gitignored)
try:
from secrets import SECRETS
except ImportError:
SECRETS = {}
WIFI_SSID = "Tell My WiFi Love Her"
WIFI_PASSWORD = SECRETS.get("wifi_password", "") # set in secrets.py
SERVER_HOST = "10.1.10.71"
SERVER_PORT = 3006
RECORD_SECONDS = 1 # Reduced to 1 second to save memory
SAMPLE_RATE = 16000
# ==========================
# Colors
def rgb_to_int(r, g, b):
return (r << 16) | (g << 8) | b
COLOR_BLACK = 0
COLOR_WHITE = rgb_to_int(255, 255, 255)
COLOR_RED = rgb_to_int(255, 0, 0)
COLOR_GREEN = rgb_to_int(0, 255, 0)
COLOR_BLUE = rgb_to_int(0, 0, 255)
COLOR_YELLOW = rgb_to_int(255, 255, 0)
COLOR_CYAN = 0x00FFFF # Cyan: rgb_to_int(0, 255, 255)
def display_msg(msg, color=COLOR_WHITE, y=50, clear=False):
"""Display message on LCD"""
if clear:
lcd.clear(COLOR_BLACK)
lcd.draw_string(10, y, msg[:30], color, COLOR_BLACK)
print(msg)
def init_wifi():
"""Initialize WiFi connection"""
from network import ESP32_SPI
lcd.init()
lcd.clear(COLOR_BLACK)
display_msg("Connecting WiFi...", COLOR_BLUE, 10)
# Register ESP32 SPI pins
fm.register(25, fm.fpioa.GPIOHS10, force=True) # CS
fm.register(8, fm.fpioa.GPIOHS11, force=True) # RST
fm.register(9, fm.fpioa.GPIOHS12, force=True) # RDY
fm.register(28, fm.fpioa.GPIOHS13, force=True) # MOSI
fm.register(26, fm.fpioa.GPIOHS14, force=True) # MISO
fm.register(27, fm.fpioa.GPIOHS15, force=True) # SCLK
nic = ESP32_SPI(
cs=fm.fpioa.GPIOHS10, rst=fm.fpioa.GPIOHS11, rdy=fm.fpioa.GPIOHS12,
mosi=fm.fpioa.GPIOHS13, miso=fm.fpioa.GPIOHS14, sclk=fm.fpioa.GPIOHS15
)
nic.connect(WIFI_SSID, WIFI_PASSWORD)
# Wait for connection
timeout = 20
while timeout > 0:
time.sleep(1)
if nic.isconnected():
ip = nic.ifconfig()[0]
display_msg("WiFi OK: " + str(ip), COLOR_GREEN, 30)
return nic
timeout -= 1
display_msg("WiFi FAILED!", COLOR_RED, 30)
return None
def init_audio():
"""Initialize I2S audio"""
display_msg("Init audio...", COLOR_BLUE, 50)
# Register I2S pins
fm.register(20, fm.fpioa.I2S0_IN_D0, force=True)
fm.register(19, fm.fpioa.I2S0_WS, force=True)
fm.register(18, fm.fpioa.I2S0_SCLK, force=True)
# Initialize I2S
rx = I2S(I2S.DEVICE_0)
rx.channel_config(rx.CHANNEL_0, rx.RECEIVER, align_mode=I2S.STANDARD_MODE)
rx.set_sample_rate(SAMPLE_RATE)
display_msg("Audio OK!", COLOR_GREEN, 70)
return rx
def convert_to_mono_16bit(audio_data):
"""Convert audio to mono 16-bit by returning a slice"""
expected_size = SAMPLE_RATE * RECORD_SECONDS * 2 # 16-bit mono
actual_size = len(audio_data)
print("Expected size: " + str(expected_size) + ", Actual: " + str(actual_size))
# If we got 4x the expected data, downsample to mono
if actual_size == expected_size * 4:
print("Extracting mono from stereo/32-bit...")
# Create new buffer with only the data we need (every 4th pair of bytes)
mono_data = bytearray(expected_size)
write_pos = 0
# Read every 4 bytes, take first 2 bytes only
for read_pos in range(0, actual_size, 4):
if write_pos + 1 < expected_size and read_pos + 1 < actual_size:
mono_data[write_pos] = audio_data[read_pos]
mono_data[write_pos + 1] = audio_data[read_pos + 1]
write_pos += 2
# Free original buffer explicitly
audio_data = None
return mono_data
# If we got 2x the expected data, extract mono
elif actual_size == expected_size * 2:
print("Extracting mono from stereo...")
mono_data = bytearray(expected_size)
write_pos = 0
for read_pos in range(0, actual_size, 4):
if write_pos + 1 < expected_size and read_pos + 1 < actual_size:
mono_data[write_pos] = audio_data[read_pos]
mono_data[write_pos + 1] = audio_data[read_pos + 1]
write_pos += 2
# Free original
audio_data = None
return mono_data
# Otherwise assume it's already correct format
print("Audio data appears to be correct format")
return audio_data
def record_audio(i2s_dev, seconds):
"""Record audio for specified seconds using chunked recording with wait"""
# Clear screen and show big recording indicator
lcd.clear(COLOR_BLACK)
# Show large "RECORDING" text
display_msg("*** RECORDING ***", COLOR_RED, 60)
display_msg("Speak now!", COLOR_YELLOW, 100)
display_msg("(listening...)", COLOR_WHITE, 130)
chunk_size = 2048
channels = 1
# Calculate number of chunks needed
frame_cnt = seconds * SAMPLE_RATE // chunk_size
print("Recording " + str(frame_cnt) + " frames...")
# Recording loop with wait
all_chunks = []
for i in range(frame_cnt):
# Start recording this chunk
audio_chunk = i2s_dev.record(chunk_size * channels)
# CRITICAL: Wait for recording to complete
i2s_dev.wait_record()
# Convert to bytes and store
chunk_bytes = audio_chunk.to_bytes()
all_chunks.append(chunk_bytes)
# Combine all chunks
print("Combining " + str(len(all_chunks)) + " chunks...")
audio_data = bytearray()
for chunk in all_chunks:
audio_data.extend(chunk)
print("Recorded " + str(len(audio_data)) + " bytes")
# Convert to mono 16-bit if needed
audio_data = convert_to_mono_16bit(audio_data)
print("Final size: " + str(len(audio_data)) + " bytes")
return audio_data
def compress_ulaw(data):
"""Compress 16-bit PCM to 8-bit μ-law (50% size reduction)"""
# μ-law compression lookup table (simplified)
BIAS = 0x84
CLIP = 32635
compressed = bytearray()
# Process 16-bit samples (2 bytes each)
for i in range(0, len(data), 2):
# Get 16-bit sample (little endian)
sample = struct.unpack('<h', data[i:i+2])[0]
# Get sign and magnitude
sign = 0x80 if sample < 0 else 0x00
if sample < 0:
sample = -sample
if sample > CLIP:
sample = CLIP
# Add bias
sample = sample + BIAS
# Find exponent (position of highest bit)
exponent = 7
for exp in range(7, -1, -1):
if sample & (1 << (exp + 7)):
exponent = exp
break
# Get mantissa (top 4 bits after exponent)
mantissa = (sample >> (exponent + 3)) & 0x0F
# Combine: sign (1 bit) + exponent (3 bits) + mantissa (4 bits)
ulaw_byte = sign | (exponent << 4) | mantissa
# Invert bits (μ-law standard)
compressed.append(ulaw_byte ^ 0xFF)
return compressed
def create_wav_header(data_size, sample_rate=16000, is_ulaw=False):
"""Create WAV file header"""
header = bytearray()
# RIFF header
header.extend(b'RIFF')
    # RIFF chunk size = total file size - 8 (58-byte header for mu-law, 46 for PCM)
    if is_ulaw:
        header.extend(struct.pack('<I', 50 + data_size))
    else:
        header.extend(struct.pack('<I', 38 + data_size))
header.extend(b'WAVE')
# fmt chunk
header.extend(b'fmt ')
header.extend(struct.pack('<I', 18)) # Chunk size (with extension)
header.extend(struct.pack('<H', 7 if is_ulaw else 1)) # 7=μ-law, 1=PCM
header.extend(struct.pack('<H', 1)) # Mono
header.extend(struct.pack('<I', sample_rate))
header.extend(struct.pack('<I', sample_rate * (1 if is_ulaw else 2))) # Byte rate
header.extend(struct.pack('<H', 1 if is_ulaw else 2)) # Block align
header.extend(struct.pack('<H', 8 if is_ulaw else 16)) # Bits per sample
header.extend(struct.pack('<H', 0)) # Extension size
# fact chunk (required for μ-law)
if is_ulaw:
header.extend(b'fact')
header.extend(struct.pack('<I', 4))
header.extend(struct.pack('<I', data_size)) # Sample count
# data chunk
header.extend(b'data')
header.extend(struct.pack('<I', data_size))
return header
def send_to_server(audio_data):
"""Send audio to server and get transcription"""
lcd.clear(COLOR_BLACK)
display_msg("Processing...", COLOR_BLUE, 60)
display_msg("Compressing audio", COLOR_WHITE, 100)
print("Sending to server...")
try:
# Compress audio using μ-law (50% size reduction)
print("Compressing audio...")
compressed_data = compress_ulaw(audio_data)
print("Compressed: " + str(len(audio_data)) + " -> " + str(len(compressed_data)) + " bytes")
# Update display
display_msg("Sending to server", COLOR_WHITE, 130)
# Create WAV file with μ-law format
wav_header = create_wav_header(len(compressed_data), is_ulaw=True)
wav_size = len(wav_header) + len(compressed_data)
# Simple HTTP POST with raw WAV data
headers = "POST /transcribe HTTP/1.1\r\n"
headers += "Host: " + SERVER_HOST + "\r\n"
headers += "Content-Type: audio/wav\r\n"
headers += "Content-Length: " + str(wav_size) + "\r\n"
headers += "Connection: close\r\n\r\n"
# Connect with better socket settings
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(30)
# Try to set socket options for better stability
try:
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
except:
pass # Some MicroPython builds don't support this
print("Connecting to " + SERVER_HOST + ":" + str(SERVER_PORT))
s.connect((SERVER_HOST, SERVER_PORT))
# Send headers
print("Sending headers...")
sent = s.send(headers.encode())
print("Sent " + str(sent) + " bytes of headers")
# Send WAV header
print("Sending WAV header...")
sent = s.send(wav_header)
print("Sent " + str(sent) + " bytes of WAV header")
# Send audio data in small chunks with delay
print("Sending audio data (" + str(len(compressed_data)) + " bytes)...")
chunk_size = 512 # Even smaller chunks for stability
total_chunks = (len(compressed_data) + chunk_size - 1) // chunk_size
bytes_sent = 0
for i in range(0, len(compressed_data), chunk_size):
chunk = compressed_data[i:i+chunk_size]
try:
sent = s.send(chunk)
bytes_sent += sent
chunk_num = i // chunk_size + 1
if chunk_num % 10 == 0: # Progress update every 10 chunks
print("Sent " + str(bytes_sent) + "/" + str(len(compressed_data)) + " bytes")
# Small delay to let socket buffer drain
time.sleep_ms(10)
except Exception as e:
print("Send error at byte " + str(bytes_sent) + ": " + str(e))
raise
print("All data sent! Total: " + str(bytes_sent) + " bytes")
# Update display for waiting
lcd.clear(COLOR_BLACK)
display_msg("Transcribing...", COLOR_CYAN, 60)
display_msg("Please wait", COLOR_WHITE, 100)
# Read response
response = b""
while True:
chunk = s.recv(1024)
if not chunk:
break
response += chunk
s.close()
# Parse response (MicroPython decode doesn't accept keyword args)
try:
response_str = response.decode('utf-8')
except:
response_str = str(response)
print("Response: " + response_str[:200])
# Extract JSON from response
if '{"' in response_str:
json_start = response_str.index('{"')
json_str = response_str[json_start:]
        # Minimal hand-rolled JSON parsing (this MaixPy build has no json module)
if '"text":' in json_str:
text_start = json_str.index('"text":') + 7
text_str = json_str[text_start:]
# Find the value between quotes
if '"' in text_str:
quote_start = text_str.index('"') + 1
quote_end = text_str.index('"', quote_start)
transcription = text_str[quote_start:quote_end]
return transcription
return "Error parsing response"
except Exception as e:
print("Error: " + str(e))
return "Error: " + str(e)
def display_transcription(text):
    """Display transcription on LCD"""
    # Print first: the wrapping loop below consumes the string
    print("Transcription: " + text)
    lcd.clear(COLOR_BLACK)
    display_msg("TRANSCRIPTION:", COLOR_GREEN, 10)
    # Simple line wrapping every 20 chars
    y = 40
    while len(text) > 0:
        chunk = text[:20]
        display_msg(chunk, COLOR_WHITE, y)
        text = text[20:]
        y += 20
        if y > 200:
            break
def main():
"""Main program with loop for multiple recordings"""
print("=" * 40)
print("Simple Audio Recording Test")
print("=" * 40)
# Initialize
nic = init_wifi()
if not nic:
return
i2s = init_audio()
# Setup button (boot button on GPIO 16)
fm.register(16, fm.fpioa.GPIOHS0, force=True)
button = GPIO(GPIO.GPIOHS0, GPIO.IN, GPIO.PULL_UP)
display_msg("Ready!", COLOR_GREEN, 110, clear=True)
display_msg("Press BOOT button", COLOR_WHITE, 130)
display_msg("to record", COLOR_WHITE, 150)
print("Press BOOT button to record, or Ctrl+C to exit")
recording_count = 0
# Main loop
while True:
# Wait for button press (button is active low)
if button.value() == 0:
recording_count += 1
print("\n--- Recording #" + str(recording_count) + " ---")
# Debounce - wait for button release
while button.value() == 0:
time.sleep_ms(10)
# Give user time to prepare (countdown)
lcd.clear(COLOR_BLACK)
display_msg("GET READY!", COLOR_YELLOW, 80)
display_msg("3...", COLOR_WHITE, 120)
time.sleep(1)
display_msg("2...", COLOR_WHITE, 140)
time.sleep(1)
display_msg("1...", COLOR_WHITE, 160)
time.sleep(1)
# Record
audio_data = record_audio(i2s, RECORD_SECONDS)
# Send to server
transcription = send_to_server(audio_data)
# Display result
display_transcription(transcription)
# Wait a bit before showing ready again
time.sleep(2)
# Show ready for next recording
display_msg("Ready!", COLOR_GREEN, 110, clear=True)
display_msg("Press BOOT button", COLOR_WHITE, 130)
next_count = recording_count + 1
display_msg("to record (#" + str(next_count) + ")", COLOR_WHITE, 150)
print("Ready for next recording. Press BOOT button.")
time.sleep_ms(50) # Small delay to reduce CPU usage
# Run main
main()

# Maix Duino - Simple Test Script
# Copy/paste this into MaixPy IDE and click RUN
#
# This script tests:
# 1. LCD display
# 2. WiFi connectivity
# 3. Network connection to Heimdall server
# 4. I2S audio initialization (without recording yet)
import time
import lcd
from Maix import GPIO, I2S
from fpioa_manager import fm
# Import the correct network module
try:
import network
# Create ESP32_SPI instance (for Maix Duino with ESP32)
nic = None # Will be initialized in test_wifi
except Exception as e:
print("Network module import error: " + str(e))
nic = None
# ===== CONFIGURATION - EDIT THESE =====
# Load credentials from secrets.py (gitignored)
try:
from secrets import SECRETS
except ImportError:
SECRETS = {}
WIFI_SSID = "Tell My WiFi Love Her" # <<< CHANGE THIS
WIFI_PASSWORD = SECRETS.get("wifi_password", "") # set in secrets.py # <<< CHANGE THIS
SERVER_URL = "http://10.1.10.71:3006" # Heimdall voice server
# =======================================
# Colors (as tuples for easy reference)
COLOR_BLACK = (0, 0, 0)
COLOR_WHITE = (255, 255, 255)
COLOR_RED = (255, 0, 0)
COLOR_GREEN = (0, 255, 0)
COLOR_BLUE = (0, 0, 255)
COLOR_YELLOW = (255, 255, 0)
def display_msg(msg, color=COLOR_WHITE, y=50):
"""Display message on LCD"""
    # lcd.draw_string() takes packed integer colors, not RGB tuples:
    # convert (R, G, B) to a single int: (R << 16) | (G << 8) | B
color_int = (color[0] << 16) | (color[1] << 8) | color[2]
bg_int = 0 # Black background
lcd.draw_string(10, y, msg, color_int, bg_int)
print(msg)
def test_lcd():
"""Test LCD display"""
lcd.init()
lcd.clear(COLOR_BLACK)
display_msg("MaixDuino Test", COLOR_YELLOW, 10)
display_msg("Initializing...", COLOR_WHITE, 30)
time.sleep(1)
return True
def test_wifi():
"""Test WiFi connection"""
global nic
display_msg("Connecting WiFi...", COLOR_BLUE, 50)
try:
# Initialize ESP32_SPI network interface
print("Initializing ESP32_SPI...")
# Create network interface instance with Maix Duino pins
# Maix Duino ESP32 default pins:
# CS=25, RST=8, RDY=9, MOSI=28, MISO=26, SCLK=27
from network import ESP32_SPI
from fpioa_manager import fm
from Maix import GPIO
# Register pins for ESP32 SPI communication
fm.register(25, fm.fpioa.GPIOHS10, force=True) # CS
fm.register(8, fm.fpioa.GPIOHS11, force=True) # RST
fm.register(9, fm.fpioa.GPIOHS12, force=True) # RDY
fm.register(28, fm.fpioa.GPIOHS13, force=True) # MOSI
fm.register(26, fm.fpioa.GPIOHS14, force=True) # MISO
fm.register(27, fm.fpioa.GPIOHS15, force=True) # SCLK
nic = ESP32_SPI(
cs=fm.fpioa.GPIOHS10,
rst=fm.fpioa.GPIOHS11,
rdy=fm.fpioa.GPIOHS12,
mosi=fm.fpioa.GPIOHS13,
miso=fm.fpioa.GPIOHS14,
sclk=fm.fpioa.GPIOHS15
)
print("Connecting to " + WIFI_SSID + "...")
# Connect to WiFi (no need to call active() first)
nic.connect(WIFI_SSID, WIFI_PASSWORD)
# Wait for connection
timeout = 20
while timeout > 0:
time.sleep(1)
timeout -= 1
if nic.isconnected():
# Successfully connected!
ip_info = nic.ifconfig()
ip = ip_info[0] if ip_info else "Unknown"
display_msg("WiFi OK!", COLOR_GREEN, 70)
display_msg("IP: " + str(ip), COLOR_WHITE, 90)
print("Connected! IP: " + str(ip))
time.sleep(2)
return True
else:
print("Waiting... " + str(timeout) + "s")
# Timeout reached
display_msg("WiFi FAILED!", COLOR_RED, 70)
print("Connection timeout")
return False
except Exception as e:
display_msg("WiFi error!", COLOR_RED, 70)
print("WiFi error: " + str(e))
import sys
sys.print_exception(e)
return False
def test_server():
"""Test connection to Heimdall server"""
display_msg("Testing server...", COLOR_BLUE, 110)
try:
# Try socket connection to server
import socket
url = SERVER_URL + "/health"
print("Trying: " + url)
# Parse URL to get host and port
host = "10.1.10.71"
port = 3006
# Create socket
s = socket.socket()
s.settimeout(5)
print("Connecting to " + host + ":" + str(port))
s.connect((host, port))
# Send HTTP GET request
request = "GET /health HTTP/1.1\r\nHost: " + host + "\r\nConnection: close\r\n\r\n"
s.send(request.encode())
# Read response
response = s.recv(1024).decode()
s.close()
print("Server response received")
if "200" in response or "OK" in response:
display_msg("Server OK!", COLOR_GREEN, 130)
print("Server is reachable!")
time.sleep(2)
return True
else:
display_msg("Server responded", COLOR_YELLOW, 130)
print("Response: " + response[:100])
return True # Still counts as success if we got a response
except Exception as e:
display_msg("Server FAILED!", COLOR_RED, 130)
error_msg = str(e)[:30]
display_msg(error_msg, COLOR_RED, 150)
print("Server connection failed: " + str(e))
return False
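The `"200" in response or "OK" in response` check above can false-positive: any response body containing the substring "200" (or "OK") passes, regardless of the actual status. A stricter host-side sketch parses the status line of the raw response instead (a minimal helper, not part of the original script):

```python
def http_status(raw):
    # Parse the status code from a raw HTTP/1.x response buffer,
    # e.g. "HTTP/1.1 200 OK\r\n..." -> 200
    status_line = raw.split("\r\n", 1)[0]
    return int(status_line.split()[1])
```

On the K210 the same logic works on the decoded `response` string before deciding success.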
def test_audio():
"""Test I2S audio initialization"""
display_msg("Testing audio...", COLOR_BLUE, 170)
try:
# Register I2S pins (Maix Duino pinout)
fm.register(20, fm.fpioa.I2S0_IN_D0, force=True)
fm.register(19, fm.fpioa.I2S0_WS, force=True)
fm.register(18, fm.fpioa.I2S0_SCLK, force=True)
# Initialize I2S
rx = I2S(I2S.DEVICE_0)
rx.channel_config(rx.CHANNEL_0, rx.RECEIVER, align_mode=I2S.STANDARD_MODE)
rx.set_sample_rate(16000)
display_msg("Audio OK!", COLOR_GREEN, 190)
print("I2S initialized: " + str(rx))
time.sleep(2)
return True
except Exception as e:
display_msg("Audio FAILED!", COLOR_RED, 190)
print("Audio init failed: " + str(e))
return False
def main():
"""Run all tests"""
print("=" * 40)
print("MaixDuino Voice Assistant Test")
print("=" * 40)
# Test LCD
if not test_lcd():
print("LCD test failed!")
return
# Test WiFi
if not test_wifi():
print("WiFi test failed!")
red_int = (255 << 16) | (0 << 8) | 0 # Red color
lcd.draw_string(10, 210, "STOPPED - Check WiFi", red_int, 0)
return
# Test server connection
server_ok = test_server()
# Test audio
audio_ok = test_audio()
# Summary
lcd.clear(COLOR_BLACK)
display_msg("=== TEST RESULTS ===", COLOR_YELLOW, 10)
display_msg("LCD: OK", COLOR_GREEN, 40)
display_msg("WiFi: OK", COLOR_GREEN, 60)
if server_ok:
display_msg("Server: OK", COLOR_GREEN, 80)
else:
display_msg("Server: FAIL", COLOR_RED, 80)
if audio_ok:
display_msg("Audio: OK", COLOR_GREEN, 100)
else:
display_msg("Audio: FAIL", COLOR_RED, 100)
if server_ok and audio_ok:
display_msg("Ready for voice app!", COLOR_GREEN, 140)
else:
display_msg("Fix errors first", COLOR_YELLOW, 140)
print("\nTest complete!")
# Run the test
if __name__ == "__main__":
main()


@ -0,0 +1,465 @@
# Maix Duino Voice Assistant Client
# Path: maix_voice_client.py (upload to Maix Duino SD card)
#
# Purpose and usage:
# This script runs on the Maix Duino board and handles:
# - Wake word detection using KPU
# - Audio capture from I2S microphone
# - Streaming audio to voice processing server
# - Playing back TTS responses
# - LED feedback for user interaction
#
# Requirements:
# - MaixPy firmware (latest version)
# - I2S microphone connected
# - Speaker or audio output connected
# - WiFi configured (see config below)
#
# Upload to board:
# 1. Copy this file to SD card as boot.py or main.py
# 2. Update WiFi credentials below
# 3. Update server URL to your Heimdall IP
# 4. Power cycle the board
import time
import audio
import image
from Maix import GPIO
from fpioa_manager import fm
from machine import I2S
import KPU as kpu
import sensor
import lcd
import gc
# ----- Configuration -----
# WiFi Settings
WIFI_SSID = "YourSSID"
WIFI_PASSWORD = "YourPassword"
# Server Settings
VOICE_SERVER_URL = "http://10.1.10.71:5000"
PROCESS_ENDPOINT = "/process"
# Audio Settings
SAMPLE_RATE = 16000 # 16kHz for Whisper
CHANNELS = 1 # Mono
SAMPLE_WIDTH = 2 # 16-bit
CHUNK_SIZE = 1024
# Wake Word Settings
WAKE_WORD_THRESHOLD = 0.7 # Confidence threshold (0.0-1.0)
WAKE_WORD_MODEL = "/sd/models/wake_word.kmodel" # Path to wake word model
# LED Pin for feedback
LED_PIN = 13 # Onboard LED (adjust if needed)
# Recording Settings
MAX_RECORD_TIME = 10 # Maximum seconds to record after wake word
SILENCE_THRESHOLD = 500 # Amplitude threshold for silence detection
SILENCE_DURATION = 2 # Seconds of silence before stopping recording
# ----- Color definitions for LCD -----
COLOR_RED = (255, 0, 0)
COLOR_GREEN = (0, 255, 0)
COLOR_BLUE = (0, 0, 255)
COLOR_YELLOW = (255, 255, 0)
COLOR_BLACK = (0, 0, 0)
COLOR_WHITE = (255, 255, 255)
# ----- Global Variables -----
led = None
i2s_dev = None
kpu_task = None
listening = False
def init_hardware():
"""Initialize hardware components"""
global led, i2s_dev
# Initialize LED
fm.register(LED_PIN, fm.fpioa.GPIO0)
led = GPIO(GPIO.GPIO0, GPIO.OUT)
led.value(0) # Turn off initially
# Initialize LCD
lcd.init()
lcd.clear(COLOR_BLACK)
lcd.draw_string(lcd.width()//2 - 50, lcd.height()//2,
"Initializing...",
lcd.WHITE, lcd.BLACK)
# Initialize I2S for audio (microphone)
# Note: Pin configuration may vary based on your specific hardware
fm.register(20, fm.fpioa.I2S0_IN_D0)
fm.register(19, fm.fpioa.I2S0_WS)
fm.register(18, fm.fpioa.I2S0_SCLK)
i2s_dev = I2S(I2S.DEVICE_0)
i2s_dev.channel_config(I2S.CHANNEL_0, I2S.RECEIVER,
align_mode=I2S.STANDARD_MODE,
data_width=I2S.RESOLUTION_16_BIT)
i2s_dev.set_sample_rate(SAMPLE_RATE)
print("Hardware initialized")
def init_network():
"""Initialize WiFi connection"""
import network
lcd.clear(COLOR_BLACK)
lcd.draw_string(10, 50, "Connecting to WiFi...", COLOR_WHITE, COLOR_BLACK)
wlan = network.WLAN(network.STA_IF)
wlan.active(True)
if not wlan.isconnected():
print(f"Connecting to {WIFI_SSID}...")
wlan.connect(WIFI_SSID, WIFI_PASSWORD)
# Wait for connection
timeout = 20
while not wlan.isconnected() and timeout > 0:
time.sleep(1)
timeout -= 1
print(f"Waiting for connection... {timeout}s")
if not wlan.isconnected():
print("Failed to connect to WiFi")
lcd.clear(COLOR_BLACK)
lcd.draw_string(10, 50, "WiFi Failed!", COLOR_RED, COLOR_BLACK)
return False
print("Network connected:", wlan.ifconfig())
lcd.clear(COLOR_BLACK)
lcd.draw_string(10, 50, "WiFi Connected", COLOR_GREEN, COLOR_BLACK)
lcd.draw_string(10, 70, f"IP: {wlan.ifconfig()[0]}", COLOR_WHITE, COLOR_BLACK)
time.sleep(2)
return True
def load_wake_word_model():
"""Load wake word detection model"""
global kpu_task
try:
# This is a placeholder - you'll need to train and convert a wake word model
# For now, we'll skip KPU wake word and use a simpler approach
print("Wake word model loading skipped (implement after model training)")
return True
except Exception as e:
print(f"Failed to load wake word model: {e}")
return False
def detect_wake_word():
"""
Detect wake word in audio stream
Returns:
True if wake word detected, False otherwise
Note: This is a simplified version. For production, you should:
1. Train a wake word model using Mycroft Precise or similar
2. Convert the model to .kmodel format for K210
3. Load and run inference using KPU
For now, we'll use a simple amplitude-based trigger
"""
# Simple amplitude-based detection (placeholder)
# Replace with actual KPU inference
audio_data = i2s_dev.record(CHUNK_SIZE)
if audio_data:
        # Calculate mean absolute amplitude
        amplitude = 0
        for i in range(0, len(audio_data), 2):
            # MicroPython's int.from_bytes() is unsigned; convert to signed 16-bit
            sample = int.from_bytes(audio_data[i:i+2], 'little')
            if sample >= 32768:
                sample -= 65536
            amplitude += abs(sample)
        amplitude = amplitude / (len(audio_data) // 2)
# Simple threshold detection (replace with KPU inference)
if amplitude > 3000: # Adjust threshold based on your microphone
return True
return False
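On the host (CPython), the per-sample amplitude loop above can be expressed more compactly with `struct`, which decodes signed 16-bit PCM directly. A small sketch for offline testing of threshold values (the sample bytes below are illustrative):

```python
import struct

def mean_abs_amplitude(pcm):
    # Average |sample| over little-endian signed 16-bit PCM bytes
    n = len(pcm) // 2
    samples = struct.unpack("<%dh" % n, pcm[:n * 2])
    return sum(abs(s) for s in samples) / n
```

Feeding recorded chunks through this on a workstation is a quick way to pick a realistic trigger threshold before flashing the board.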
def record_audio(max_duration=MAX_RECORD_TIME):
"""
Record audio until silence or max duration
Returns:
bytes: Recorded audio data in WAV format
"""
print(f"Recording audio (max {max_duration}s)...")
audio_buffer = bytearray()
start_time = time.time()
    silence_start = None
    last_shown = -1  # last whole second drawn to the LCD
    # Record in chunks
    while True:
        elapsed = time.time() - start_time
        # Check max duration
        if elapsed > max_duration:
            print("Max recording duration reached")
            break
        # Record chunk
        chunk = i2s_dev.record(CHUNK_SIZE)
        if chunk:
            audio_buffer.extend(chunk)
            # Calculate mean absolute amplitude for silence detection
            amplitude = 0
            for i in range(0, len(chunk), 2):
                # MicroPython's int.from_bytes() is unsigned; convert to signed 16-bit
                sample = int.from_bytes(chunk[i:i+2], 'little')
                if sample >= 32768:
                    sample -= 65536
                amplitude += abs(sample)
            amplitude = amplitude / (len(chunk) // 2)
            # Silence detection
            if amplitude < SILENCE_THRESHOLD:
                if silence_start is None:
                    silence_start = time.time()
                elif time.time() - silence_start > SILENCE_DURATION:
                    print("Silence detected, stopping recording")
                    break
            else:
                silence_start = None
        # Update LCD once per whole second of recording
        if int(elapsed) != last_shown:
            last_shown = int(elapsed)
            lcd.clear(COLOR_BLACK)
            lcd.draw_string(10, 50, f"Recording... {int(elapsed)}s",
                            COLOR_RED, COLOR_BLACK)
print(f"Recorded {len(audio_buffer)} bytes")
# Convert to WAV format
return create_wav(audio_buffer)
def create_wav(audio_data):
"""Create WAV file header and combine with audio data"""
import struct
# WAV header
sample_rate = SAMPLE_RATE
channels = CHANNELS
sample_width = SAMPLE_WIDTH
data_size = len(audio_data)
# RIFF header
wav = bytearray(b'RIFF')
wav.extend(struct.pack('<I', 36 + data_size)) # File size - 8
wav.extend(b'WAVE')
# fmt chunk
wav.extend(b'fmt ')
wav.extend(struct.pack('<I', 16)) # fmt chunk size
wav.extend(struct.pack('<H', 1)) # PCM format
wav.extend(struct.pack('<H', channels))
wav.extend(struct.pack('<I', sample_rate))
wav.extend(struct.pack('<I', sample_rate * channels * sample_width))
wav.extend(struct.pack('<H', channels * sample_width))
wav.extend(struct.pack('<H', sample_width * 8))
# data chunk
wav.extend(b'data')
wav.extend(struct.pack('<I', data_size))
wav.extend(audio_data)
return bytes(wav)
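The header layout above can be sanity-checked on the host with the stdlib `wave` reader. This sketch duplicates the same RIFF/fmt/data layout in CPython (it is a test harness, not board code) and verifies that a standard parser accepts it:

```python
import io
import struct
import wave

def create_wav(audio_data, sample_rate=16000, channels=1, sample_width=2):
    # Same chunk layout as the on-device function: RIFF / fmt / data, 16-bit PCM
    data_size = len(audio_data)
    wav = bytearray(b'RIFF')
    wav += struct.pack('<I', 36 + data_size) + b'WAVE'
    wav += b'fmt ' + struct.pack('<IHHIIHH', 16, 1, channels, sample_rate,
                                 sample_rate * channels * sample_width,
                                 channels * sample_width, sample_width * 8)
    wav += b'data' + struct.pack('<I', data_size) + bytes(audio_data)
    return bytes(wav)

# Host-side check: the stdlib wave reader should accept the header
with wave.open(io.BytesIO(create_wav(b'\x00\x00' * 160)), 'rb') as w:
    assert w.getframerate() == 16000 and w.getnchannels() == 1
```

If `wave.open` raises, the header is malformed and Whisper-side decoding will likely fail too.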
def send_audio_to_server(audio_data):
"""
Send audio to voice processing server and get response
Returns:
dict: Response from server or None on failure
"""
import urequests
try:
# Prepare multipart form data
url = f"{VOICE_SERVER_URL}{PROCESS_ENDPOINT}"
print(f"Sending audio to {url}...")
lcd.clear(COLOR_BLACK)
lcd.draw_string(10, 50, "Processing...", COLOR_YELLOW, COLOR_BLACK)
# Send POST request with audio file
# Note: MaixPy's urequests doesn't support multipart, so we need a workaround
# For now, send raw audio with appropriate headers
headers = {
'Content-Type': 'audio/wav',
}
response = urequests.post(url, data=audio_data, headers=headers)
if response.status_code == 200:
result = response.json()
response.close()
return result
else:
print(f"Server error: {response.status_code}")
response.close()
return None
except Exception as e:
print(f"Error sending audio: {e}")
return None
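Because the client sends raw WAV bytes with `Content-Type: audio/wav` (not multipart form data), the server's `/process` handler must read the request body directly. A minimal stdlib sketch of the matching server side, assuming the JSON shape (`success`, `transcription`, `response`) the client expects; the actual voice_server.py may differ:

```python
import json
from http.server import BaseHTTPRequestHandler

class ProcessHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the raw WAV payload from the body (no multipart parsing)
        length = int(self.headers.get("Content-Length", 0))
        wav_bytes = self.rfile.read(length)
        ok = wav_bytes.startswith(b"RIFF")
        body = json.dumps({
            "success": ok,
            "transcription": "",
            "response": "ok" if ok else "not a WAV payload",
        }).encode()
        self.send_response(200 if ok else 400)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Keep the console quiet during tests
        pass
```

Here `ok` only checks the RIFF magic; a real handler would hand `wav_bytes` to the STT pipeline.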
def display_response(response_text):
"""Display response on LCD"""
lcd.clear(COLOR_BLACK)
# Word wrap for LCD
words = response_text.split()
lines = []
current_line = ""
for word in words:
test_line = current_line + word + " "
if len(test_line) * 8 > lcd.width() - 20: # Rough character width
if current_line:
lines.append(current_line.strip())
current_line = word + " "
else:
current_line = test_line
if current_line:
lines.append(current_line.strip())
# Display lines
y = 30
for line in lines[:5]: # Max 5 lines
lcd.draw_string(10, y, line, COLOR_GREEN, COLOR_BLACK)
y += 20
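The greedy word-wrap above can be factored into a pure function and tested off-device, since it only depends on a character budget (the on-device code derives that budget from `lcd.width()` at ~8 px per character). A host-testable sketch:

```python
def wrap_text(text, max_chars):
    # Greedy word wrap mirroring the LCD logic: append words to the
    # current line until the next word would exceed max_chars
    lines, current = [], ""
    for word in text.split():
        candidate = (current + " " + word).strip()
        if len(candidate) > max_chars and current:
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines
```

A word longer than `max_chars` still gets its own line rather than being truncated, matching the original behavior.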
def set_led(state):
"""Control LED state"""
if led:
led.value(1 if state else 0)
def main_loop():
"""Main voice assistant loop"""
global listening
# Show ready status
lcd.clear(COLOR_BLACK)
lcd.draw_string(10, lcd.height()//2 - 10, "Say wake word...",
COLOR_BLUE, COLOR_BLACK)
print("Voice assistant ready. Listening for wake word...")
while True:
try:
# Listen for wake word
if detect_wake_word():
print("Wake word detected!")
# Visual feedback
set_led(True)
lcd.clear(COLOR_BLACK)
lcd.draw_string(10, 50, "Listening...", COLOR_RED, COLOR_BLACK)
# Small delay to skip the wake word itself
time.sleep(0.5)
# Record command
audio_data = record_audio()
# Send to server
response = send_audio_to_server(audio_data)
if response and response.get('success'):
transcription = response.get('transcription', '')
response_text = response.get('response', 'No response')
print(f"You said: {transcription}")
print(f"Response: {response_text}")
# Display response
display_response(response_text)
# TODO: Play TTS audio response
else:
lcd.clear(COLOR_BLACK)
lcd.draw_string(10, 50, "Error processing",
COLOR_RED, COLOR_BLACK)
# Turn off LED
set_led(False)
# Pause before listening again
time.sleep(2)
# Reset display
lcd.clear(COLOR_BLACK)
lcd.draw_string(10, lcd.height()//2 - 10, "Say wake word...",
COLOR_BLUE, COLOR_BLACK)
# Small delay to prevent tight loop
time.sleep(0.1)
# Garbage collection
if gc.mem_free() < 100000: # If free memory < 100KB
gc.collect()
except KeyboardInterrupt:
print("Exiting...")
break
except Exception as e:
print(f"Error in main loop: {e}")
time.sleep(1)
def main():
"""Main entry point"""
print("=" * 40)
print("Maix Duino Voice Assistant")
print("=" * 40)
# Initialize hardware
init_hardware()
# Connect to network
if not init_network():
print("Failed to initialize network. Exiting.")
return
# Load wake word model (optional)
load_wake_word_model()
# Start main loop
try:
main_loop()
except Exception as e:
print(f"Fatal error: {e}")
finally:
# Cleanup
set_led(False)
lcd.clear(COLOR_BLACK)
lcd.draw_string(10, lcd.height()//2, "Stopped",
COLOR_RED, COLOR_BLACK)
# Run main program
if __name__ == "__main__":
main()


@ -0,0 +1,7 @@
# Copy this file to secrets.py and fill in your values
# secrets.py is gitignored — never commit it
SECRETS = {
"wifi_ssid": "YourNetworkName",
"wifi_password": "YourWiFiPassword",
"voice_server_url": "http://10.1.10.71:5000", # replace with your Minerva server IP
}
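The commit scrubs hardcoded WiFi credentials in favor of this `secrets.py` pattern. A sketch of how a client script might consume it, with a fallback so the board still boots for bench testing when `secrets.py` is absent (the placeholder values are illustrative, not part of the original scripts):

```python
# Load credentials from the gitignored secrets.py if present;
# fall back to placeholders so the script still imports cleanly.
try:
    from secrets import SECRETS
    WIFI_SSID = SECRETS["wifi_ssid"]
    WIFI_PASSWORD = SECRETS["wifi_password"]
    VOICE_SERVER_URL = SECRETS["voice_server_url"]
except (ImportError, KeyError):
    WIFI_SSID = "YourSSID"
    WIFI_PASSWORD = "YourPassword"
    VOICE_SERVER_URL = "http://10.1.10.71:5000"
```

Keeping the fallback values obviously fake makes it clear in logs when the real secrets file failed to load.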


@ -0,0 +1,409 @@
#!/usr/bin/env bash
#
# Path: download_pretrained_models.sh
#
# Purpose and usage:
# Downloads and sets up pre-trained Mycroft Precise wake word models
# - Downloads Hey Mycroft, Hey Jarvis, and other available models
# - Tests each model with microphone
# - Configures voice server to use them
#
# Requirements:
# - Mycroft Precise installed (run setup_precise.sh first)
# - Internet connection for downloads
# - Microphone for testing
#
# Usage:
# ./download_pretrained_models.sh [--test-all] [--model MODEL_NAME]
#
# Author: PRbL Library
# Created: $(date +"%Y-%m-%d")
# ----- PRbL Color and output functions -----
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[0;33m'
BLUE='\033[0;34m'
PURPLE='\033[0;35m'
CYAN='\033[0;36m'
NC='\033[0m' # No Color
print_status() {
local level="$1"
shift
case "$level" in
"info") echo -e "${BLUE}[INFO]${NC} $*" >&2 ;;
"success") echo -e "${GREEN}[SUCCESS]${NC} $*" >&2 ;;
"warning") echo -e "${YELLOW}[WARNING]${NC} $*" >&2 ;;
"error") echo -e "${RED}[ERROR]${NC} $*" >&2 ;;
"debug") [[ "$VERBOSE" == "true" ]] && echo -e "${PURPLE}[DEBUG]${NC} $*" >&2 ;;
*) echo -e "$*" >&2 ;;
esac
}
# ----- Configuration -----
MODELS_DIR="$HOME/precise-models/pretrained"
TEST_ALL=false
SPECIFIC_MODEL=""
VERBOSE=false
# Available pre-trained models
declare -A MODELS=(
["hey-mycroft"]="https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz"
["hey-jarvis"]="https://github.com/MycroftAI/precise-data/raw/models-dev/hey-jarvis.tar.gz"
["christopher"]="https://github.com/MycroftAI/precise-data/raw/models-dev/christopher.tar.gz"
["hey-ezra"]="https://github.com/MycroftAI/precise-data/raw/models-dev/hey-ezra.tar.gz"
)
# ----- Dependency checking -----
command_exists() {
command -v "$1" &> /dev/null
}
check_dependencies() {
local missing=()
if ! command_exists wget; then
missing+=("wget")
fi
if ! command_exists precise-listen; then
missing+=("precise-listen (run setup_precise.sh first)")
fi
if [[ ${#missing[@]} -gt 0 ]]; then
print_status error "Missing dependencies: ${missing[*]}"
return 1
fi
return 0
}
# ----- Parse arguments -----
parse_args() {
while [[ $# -gt 0 ]]; do
case "$1" in
--test-all)
TEST_ALL=true
shift
;;
--model)
SPECIFIC_MODEL="$2"
shift 2
;;
-v|--verbose)
VERBOSE=true
shift
;;
-h|--help)
cat << EOF
Usage: $(basename "$0") [OPTIONS]
Download and test pre-trained Mycroft Precise wake word models
Options:
--test-all Download and test all available models
--model NAME Download and test specific model
-v, --verbose Enable verbose output
-h, --help Show this help message
Available models:
hey-mycroft Original Mycroft wake word (most data)
hey-jarvis Popular alternative
christopher Alternative wake word
hey-ezra Another option
Examples:
$(basename "$0") --model hey-mycroft
$(basename "$0") --test-all
EOF
exit 0
;;
*)
print_status error "Unknown option: $1"
exit 1
;;
esac
done
}
# ----- Functions -----
create_models_directory() {
print_status info "Creating models directory: $MODELS_DIR"
mkdir -p "$MODELS_DIR" || {
print_status error "Failed to create directory"
return 1
}
return 0
}
download_model() {
local model_name="$1"
local model_url="${MODELS[${model_name}]}"
if [[ -z "$model_url" ]]; then
print_status error "Unknown model: $model_name"
return 1
fi
# Check if already downloaded
if [[ -f "$MODELS_DIR/${model_name}.net" ]]; then
print_status info "Model already exists: $model_name"
return 0
fi
print_status info "Downloading $model_name..."
local temp_file="/tmp/${model_name}-$$.tar.gz"
wget -q --show-progress -O "$temp_file" "$model_url" || {
print_status error "Failed to download $model_name"
rm -f "$temp_file"
return 1
}
# Extract
print_status info "Extracting $model_name..."
tar xzf "$temp_file" -C "$MODELS_DIR" || {
print_status error "Failed to extract $model_name"
rm -f "$temp_file"
return 1
}
rm -f "$temp_file"
# Verify extraction
if [[ -f "$MODELS_DIR/${model_name}.net" ]]; then
print_status success "Downloaded: $model_name"
return 0
else
print_status error "Extraction failed for $model_name"
return 1
fi
}
test_model() {
local model_name="$1"
local model_file="$MODELS_DIR/${model_name}.net"
if [[ ! -f "$model_file" ]]; then
print_status error "Model file not found: $model_file"
return 1
fi
print_status info "Testing model: $model_name"
echo ""
echo -e "${CYAN}Instructions:${NC}"
echo " - Speak the wake word: '$model_name'"
echo " - You should see '!' when detected"
echo " - Press Ctrl+C to stop testing"
echo ""
read -p "Press Enter to start test..."
# Activate conda environment if needed
if command_exists conda; then
eval "$(conda shell.bash hook)"
conda activate precise 2>/dev/null || true
fi
precise-listen "$model_file" || {
print_status warning "Test interrupted or failed"
return 1
}
return 0
}
create_multi_wake_config() {
print_status info "Creating multi-wake-word configuration..."
local config_file="$MODELS_DIR/multi-wake-config.sh"
cat > "$config_file" << 'EOF'
#!/bin/bash
# Multi-wake-word configuration
# Generated by download_pretrained_models.sh
# Start voice server with multiple wake words
cd ~/voice-assistant
# List of wake word models
MODELS=""
EOF
# Add each downloaded model to config
for model_name in "${!MODELS[@]}"; do
if [[ -f "$MODELS_DIR/${model_name}.net" ]]; then
echo "# Found: $model_name" >> "$config_file"
echo "MODELS=\"\${MODELS}${model_name}:$MODELS_DIR/${model_name}.net:0.5,\"" >> "$config_file"
fi
done
cat >> "$config_file" << 'EOF'
# Remove trailing comma
MODELS="${MODELS%,}"
# Activate environment
eval "$(conda shell.bash hook)"
conda activate precise
# Start server
python voice_server.py \
--enable-precise \
--precise-models "$MODELS" \
--ha-token "$HA_TOKEN"
EOF
chmod +x "$config_file"
print_status success "Created: $config_file"
echo ""
print_status info "To use multiple wake words, run:"
print_status info " $config_file"
return 0
}
list_downloaded_models() {
print_status info "Downloaded models in $MODELS_DIR:"
echo ""
local count=0
for model_name in "${!MODELS[@]}"; do
if [[ -f "$MODELS_DIR/${model_name}.net" ]]; then
local size=$(du -h "$MODELS_DIR/${model_name}.net" | cut -f1)
            echo -e "  ${GREEN}✓${NC} ${model_name}.net (${size})"
((count++))
else
            echo -e "  ${YELLOW}✗${NC} ${model_name}.net (not downloaded)"
fi
done
echo ""
print_status success "Total downloaded: $count"
return 0
}
compare_models() {
print_status info "Model comparison:"
echo ""
cat << 'EOF'
┌─────────────────┬──────────────┬─────────────┬─────────────────┐
│ Wake Word │ Popularity │ Difficulty │ Recommended For │
├─────────────────┼──────────────┼─────────────┼─────────────────┤
│ Hey Mycroft │ ★★★★★ │ Easy │ Default choice │
│ Hey Jarvis │ ★★★★☆ │ Easy │ Pop culture │
│ Christopher │ ★★☆☆☆ │ Medium │ Unique name │
│ Hey Ezra │ ★★☆☆☆ │ Medium │ Alternative │
└─────────────────┴──────────────┴─────────────┴─────────────────┘
Recommendations:
- Start with: Hey Mycroft (most training data)
- For media: Hey Jarvis (Plex/entertainment)
- For uniqueness: Christopher or Hey Ezra
Multiple wake words:
- Use different wake words for different contexts
- Example: "Hey Mycroft" for commands, "Hey Jarvis" for media
- Server can run 2-3 models simultaneously
EOF
}
# ----- Main -----
main() {
print_status info "Mycroft Precise Pre-trained Model Downloader"
echo ""
# Parse arguments
parse_args "$@"
# Check dependencies
check_dependencies || exit 1
# Create directory
create_models_directory || exit 1
# Show comparison
if [[ -z "$SPECIFIC_MODEL" && "$TEST_ALL" != "true" ]]; then
compare_models
echo ""
print_status info "Use --model <name> to download a specific model"
print_status info "Use --test-all to download all models"
echo ""
list_downloaded_models
exit 0
fi
# Download models
if [[ -n "$SPECIFIC_MODEL" ]]; then
# Download specific model
download_model "$SPECIFIC_MODEL" || exit 1
# Offer to test
echo ""
read -p "Test this model now? (y/N): " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
test_model "$SPECIFIC_MODEL"
fi
elif [[ "$TEST_ALL" == "true" ]]; then
# Download all models
for model_name in "${!MODELS[@]}"; do
download_model "$model_name"
echo ""
done
# Offer to test each
echo ""
print_status success "All models downloaded"
echo ""
read -p "Test each model? (y/N): " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
for model_name in "${!MODELS[@]}"; do
if [[ -f "$MODELS_DIR/${model_name}.net" ]]; then
echo ""
test_model "$model_name"
fi
done
fi
fi
# List results
echo ""
list_downloaded_models
# Create multi-wake config if multiple models
local model_count=$(find "$MODELS_DIR" -name "*.net" | wc -l)
if [[ $model_count -gt 1 ]]; then
echo ""
create_multi_wake_config
fi
# Final instructions
echo ""
print_status success "Setup complete!"
echo ""
print_status info "Next steps:"
print_status info "1. Test a model: precise-listen $MODELS_DIR/hey-mycroft.net"
print_status info "2. Use in server: python voice_server.py --enable-precise --precise-model $MODELS_DIR/hey-mycroft.net"
print_status info "3. Fine-tune: precise-train -e 30 custom.net . --from-checkpoint $MODELS_DIR/hey-mycroft.net"
if [[ $model_count -gt 1 ]]; then
echo ""
print_status info "For multiple wake words:"
print_status info " $MODELS_DIR/multi-wake-config.sh"
fi
}
# Run main
main "$@"


@ -0,0 +1,456 @@
#!/usr/bin/env bash
#
# Path: quick_start_hey_mycroft.sh
#
# Purpose and usage:
# Zero-training quick start using pre-trained "Hey Mycroft" model
# Gets you a working voice assistant in 5 minutes!
#
# Requirements:
# - Heimdall already setup (ran setup_voice_assistant.sh)
# - Mycroft Precise installed (ran setup_precise.sh)
#
# Usage:
# ./quick_start_hey_mycroft.sh [--test-only]
#
# Author: PRbL Library
# ----- PRbL Color and output functions -----
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[0;33m'
BLUE='\033[0;34m'
PURPLE='\033[0;35m'
CYAN='\033[0;36m'
NC='\033[0m'
print_status() {
local level="$1"
shift
case "$level" in
"info") echo -e "${BLUE}[INFO]${NC} $*" >&2 ;;
"success") echo -e "${GREEN}[SUCCESS]${NC} $*" >&2 ;;
"warning") echo -e "${YELLOW}[WARNING]${NC} $*" >&2 ;;
"error") echo -e "${RED}[ERROR]${NC} $*" >&2 ;;
*) echo -e "$*" >&2 ;;
esac
}
# ----- Configuration -----
MODELS_DIR="$HOME/precise-models/pretrained"
MODEL_URL="https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz"
MODEL_NAME="hey-mycroft"
TEST_ONLY=false
# ----- Parse arguments -----
parse_args() {
while [[ $# -gt 0 ]]; do
case "$1" in
--test-only)
TEST_ONLY=true
shift
;;
-h|--help)
cat << EOF
Usage: $(basename "$0") [OPTIONS]
Quick start with pre-trained "Hey Mycroft" wake word model.
No training required!
Options:
--test-only Just test the model, don't start server
-h, --help Show this help
Examples:
$(basename "$0") # Download, test, and run server
$(basename "$0") --test-only # Just download and test
EOF
exit 0
;;
*)
print_status error "Unknown option: $1"
exit 1
;;
esac
done
}
# ----- Functions -----
check_prerequisites() {
print_status info "Checking prerequisites..."
# Check conda
if ! command -v conda &> /dev/null; then
print_status error "conda not found"
return 1
fi
# Check precise environment
if ! conda env list | grep -q "^precise\s"; then
print_status error "Precise environment not found"
print_status info "Run: ./setup_precise.sh first"
return 1
fi
# Check voice-assistant directory
if [[ ! -d "$HOME/voice-assistant" ]]; then
print_status error "Voice assistant not setup"
print_status info "Run: ./setup_voice_assistant.sh first"
return 1
fi
print_status success "Prerequisites OK"
return 0
}
download_pretrained_model() {
print_status info "Downloading pre-trained 'Hey Mycroft' model..."
# Create directory
mkdir -p "$MODELS_DIR"
# Check if already downloaded
if [[ -f "$MODELS_DIR/${MODEL_NAME}.net" ]]; then
print_status info "Model already downloaded"
return 0
fi
# Download
cd "$MODELS_DIR" || return 1
print_status info "Fetching from GitHub..."
wget -q --show-progress "$MODEL_URL" || {
print_status error "Failed to download model"
return 1
}
# Extract
print_status info "Extracting model..."
tar xzf hey-mycroft.tar.gz || {
print_status error "Failed to extract model"
return 1
}
    # Verify and clean up the downloaded archive
    rm -f hey-mycroft.tar.gz
    if [[ ! -f "${MODEL_NAME}.net" ]]; then
        print_status error "Model file not found after extraction"
        return 1
    fi
    print_status success "Model downloaded: $MODELS_DIR/${MODEL_NAME}.net"
    return 0
}
test_model() {
print_status info "Testing wake word model..."
cd "$MODELS_DIR" || return 1
# Activate conda
eval "$(conda shell.bash hook)"
conda activate precise || {
print_status error "Failed to activate precise environment"
return 1
}
cat << EOF
${CYAN}═══════════════════════════════════════════════════${NC}
${CYAN} Wake Word Test: "Hey Mycroft"${NC}
${CYAN}═══════════════════════════════════════════════════${NC}
${YELLOW}Instructions:${NC}
1. Speak "Hey Mycroft" into your microphone
2. You should see ${GREEN}"!"${NC} when detected
3. Try other phrases - should ${RED}not${NC} trigger
4. Press ${RED}Ctrl+C${NC} when done testing
${CYAN}Starting in 3 seconds...${NC}
EOF
sleep 3
# Test the model
precise-listen "${MODEL_NAME}.net" || {
print_status error "Model test failed"
return 1
}
print_status success "Model test complete!"
return 0
}
update_config() {
print_status info "Updating voice assistant configuration..."
local config_file="$HOME/voice-assistant/config/.env"
if [[ ! -f "$config_file" ]]; then
print_status error "Config file not found: $config_file"
return 1
fi
# Update PRECISE_MODEL if exists, otherwise add it
if grep -q "^PRECISE_MODEL=" "$config_file"; then
sed -i "s|^PRECISE_MODEL=.*|PRECISE_MODEL=$MODELS_DIR/${MODEL_NAME}.net|" "$config_file"
else
echo "PRECISE_MODEL=$MODELS_DIR/${MODEL_NAME}.net" >> "$config_file"
fi
# Update sensitivity if not set
if ! grep -q "^PRECISE_SENSITIVITY=" "$config_file"; then
echo "PRECISE_SENSITIVITY=0.5" >> "$config_file"
fi
print_status success "Configuration updated"
return 0
}
start_server() {
print_status info "Starting voice assistant server..."
cd "$HOME/voice-assistant" || return 1
# Activate conda
eval "$(conda shell.bash hook)"
conda activate precise || {
print_status error "Failed to activate environment"
return 1
}
cat << EOF
${CYAN}═══════════════════════════════════════════════════${NC}
${GREEN} Starting Voice Assistant Server${NC}
${CYAN}═══════════════════════════════════════════════════${NC}
${YELLOW}Configuration:${NC}
Wake word: ${GREEN}Hey Mycroft${NC}
Model: ${MODEL_NAME}.net
Server: http://0.0.0.0:5000
${YELLOW}What to do next:${NC}
1. Wait for "Precise listening started" message
2. Say ${GREEN}"Hey Mycroft"${NC} to test wake word
3. Say a command like ${GREEN}"turn on the lights"${NC}
4. Check server logs for activity
${YELLOW}Press Ctrl+C to stop the server${NC}
${CYAN}Starting server...${NC}
EOF
# Check if HA token is set
if ! grep -q "^HA_TOKEN=..*" config/.env; then
print_status warning "Home Assistant token not set!"
print_status warning "Commands won't execute without it."
print_status info "Edit config/.env and add your HA token"
echo
read -p "Continue anyway? (y/N): " -n 1 -r
echo
if [[ ! $REPLY =~ ^[Yy]$ ]]; then
return 1
fi
fi
# Start server
python voice_server.py \
--enable-precise \
--precise-model "$MODELS_DIR/${MODEL_NAME}.net" \
--precise-sensitivity 0.5
return $?
}
create_systemd_service() {
print_status info "Creating systemd service..."
local service_file="/etc/systemd/system/voice-assistant.service"
# Check if we should update existing service
if [[ -f "$service_file" ]]; then
print_status warning "Service file already exists"
read -p "Update with Hey Mycroft configuration? (y/N): " -n 1 -r
echo
if [[ ! $REPLY =~ ^[Yy]$ ]]; then
return 0
fi
fi
# Create service file
sudo tee "$service_file" > /dev/null << EOF
[Unit]
Description=Voice Assistant with Hey Mycroft Wake Word
After=network.target
[Service]
Type=simple
User=$USER
WorkingDirectory=$HOME/voice-assistant
Environment="PATH=$HOME/miniconda3/envs/precise/bin:/usr/local/bin:/usr/bin:/bin"
EnvironmentFile=$HOME/voice-assistant/config/.env
ExecStart=$HOME/miniconda3/envs/precise/bin/python voice_server.py \\
--enable-precise \\
--precise-model $MODELS_DIR/${MODEL_NAME}.net \\
--precise-sensitivity 0.5
Restart=on-failure
RestartSec=10
StandardOutput=append:$HOME/voice-assistant/logs/voice_assistant.log
StandardError=append:$HOME/voice-assistant/logs/voice_assistant_error.log
[Install]
WantedBy=multi-user.target
EOF
# Reload systemd
sudo systemctl daemon-reload
print_status success "Systemd service created"
cat << EOF
${CYAN}To enable and start the service:${NC}
sudo systemctl enable voice-assistant
sudo systemctl start voice-assistant
sudo systemctl status voice-assistant
${CYAN}To view logs:${NC}
journalctl -u voice-assistant -f
EOF
read -p "Enable service now? (y/N): " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
sudo systemctl enable voice-assistant
sudo systemctl start voice-assistant
sleep 2
sudo systemctl status voice-assistant
fi
}
print_next_steps() {
cat << EOF
${GREEN}═══════════════════════════════════════════════════${NC}
${GREEN} Success! Your voice assistant is ready!${NC}
${GREEN}═══════════════════════════════════════════════════${NC}
${CYAN}What you have:${NC}
✓ Pre-trained "Hey Mycroft" wake word
✓ Voice assistant server configured
✓ Ready to control Home Assistant
${CYAN}Quick test:${NC}
1. Say: ${GREEN}"Hey Mycroft"${NC}
2. Say: ${GREEN}"Turn on the living room lights"${NC}
3. Check if command executed
${CYAN}Next steps:${NC}
1. ${YELLOW}Configure Home Assistant entities${NC}
Edit: ~/voice-assistant/config/.env
Add: HA_TOKEN=your_token_here
2. ${YELLOW}Add more entity mappings${NC}
Edit: voice_server.py
Update: IntentParser.ENTITY_MAP
3. ${YELLOW}Fine-tune for your voice (optional)${NC}
cd ~/precise-models/hey-mycroft-custom
./1-record-wake-word.sh
# Record 20-30 samples
precise-train -e 30 hey-mycroft-custom.net . \\
--from-checkpoint $MODELS_DIR/${MODEL_NAME}.net
4. ${YELLOW}Setup Maix Duino${NC}
See: QUICKSTART.md Phase 2
${CYAN}Useful commands:${NC}
# Test wake word only
cd $MODELS_DIR && conda activate precise
precise-listen ${MODEL_NAME}.net
# Check server health
curl http://localhost:5000/health
# Monitor logs
journalctl -u voice-assistant -f
${CYAN}Documentation:${NC}
README.md - Project overview
WAKE_WORD_ADVANCED.md - Multiple wake words guide
QUICKSTART.md - Complete setup guide
${GREEN}Happy voice assisting! 🎙️${NC}
EOF
}
# ----- Main -----
main() {
cat << EOF
${CYAN}═══════════════════════════════════════════════════${NC}
${CYAN} Quick Start: Hey Mycroft Wake Word${NC}
${CYAN}═══════════════════════════════════════════════════${NC}
${YELLOW}This script will:${NC}
1. Download pre-trained "Hey Mycroft" model
2. Test wake word detection
3. Configure voice assistant server
4. Start the server (optional)
${YELLOW}Total time: ~5 minutes (no training!)${NC}
EOF
# Parse arguments
parse_args "$@"
# Check prerequisites
check_prerequisites || exit 1
# Download model
download_pretrained_model || exit 1
# Test model
print_status info "Ready to test wake word"
read -p "Test now? (Y/n): " -n 1 -r
echo
if [[ ! $REPLY =~ ^[Nn]$ ]]; then
test_model
fi
# If test-only mode, stop here
if [[ "$TEST_ONLY" == "true" ]]; then
print_status success "Test complete!"
print_status info "Model location: $MODELS_DIR/${MODEL_NAME}.net"
exit 0
fi
# Update configuration
update_config || exit 1
# Start server
read -p "Start voice assistant server now? (Y/n): " -n 1 -r
echo
if [[ ! $REPLY =~ ^[Nn]$ ]]; then
start_server
else
# Offer to create systemd service
read -p "Create systemd service instead? (y/N): " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
create_systemd_service
fi
fi
# Print next steps
print_next_steps
}
# Run main
main "$@"

scripts/setup_precise.sh Executable file

@@ -0,0 +1,630 @@
#!/usr/bin/env bash
#
# Path: setup_precise.sh
#
# Purpose and usage:
# Sets up Mycroft Precise wake word detection on Heimdall
# - Creates conda environment for Precise
# - Installs TensorFlow 1.x and dependencies
# - Downloads precise-engine
# - Sets up training directories
# - Provides helper scripts for training
#
# Requirements:
# - conda/miniconda installed
# - Internet connection for downloads
# - Microphone for recording samples
#
# Usage:
# ./setup_precise.sh [--wake-word "phrase"] [--env-name NAME]
#
# Author: PRbL Library
# Created: $(date +"%Y-%m-%d")
# ----- PRbL Color and output functions -----
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[0;33m'
BLUE='\033[0;34m'
PURPLE='\033[0;35m'
CYAN='\033[0;36m'
NC='\033[0m' # No Color
print_status() {
local level="$1"
shift
case "$level" in
"info") echo -e "${BLUE}[INFO]${NC} $*" >&2 ;;
"success") echo -e "${GREEN}[SUCCESS]${NC} $*" >&2 ;;
"warning") echo -e "${YELLOW}[WARNING]${NC} $*" >&2 ;;
"error") echo -e "${RED}[ERROR]${NC} $*" >&2 ;;
"debug") [[ "$VERBOSE" == "true" ]] && echo -e "${PURPLE}[DEBUG]${NC} $*" >&2 ;;
*) echo -e "$*" >&2 ;;
esac
}
# ----- Configuration -----
CONDA_ENV_NAME="precise"
WAKE_WORD="hey computer"
MODELS_DIR="$HOME/precise-models"
VERBOSE=false
# ----- Dependency checking -----
command_exists() {
command -v "$1" &> /dev/null
}
check_conda() {
if ! command_exists conda; then
print_status error "conda not found. Please install miniconda first."
return 1
fi
return 0
}
# ----- Parse arguments -----
parse_args() {
while [[ $# -gt 0 ]]; do
case "$1" in
--wake-word)
WAKE_WORD="$2"
shift 2
;;
--env-name)
CONDA_ENV_NAME="$2"
shift 2
;;
-v|--verbose)
VERBOSE=true
shift
;;
-h|--help)
cat << EOF
Usage: $(basename "$0") [OPTIONS]
Options:
--wake-word "phrase" Wake word to train (default: "hey computer")
--env-name NAME Custom conda environment name (default: precise)
-v, --verbose Enable verbose output
-h, --help Show this help message
Examples:
$(basename "$0") --wake-word "hey jarvis"
$(basename "$0") --env-name mycroft-precise
EOF
exit 0
;;
*)
print_status error "Unknown option: $1"
exit 1
;;
esac
done
}
# ----- Setup functions -----
create_conda_environment() {
print_status info "Creating conda environment: $CONDA_ENV_NAME"
# Check if environment already exists
if conda env list | grep -q "^${CONDA_ENV_NAME}\s"; then
print_status warning "Environment $CONDA_ENV_NAME already exists"
read -p "Remove and recreate? (y/N): " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
print_status info "Removing existing environment..."
conda env remove -n "$CONDA_ENV_NAME" -y
else
print_status info "Using existing environment"
return 0
fi
fi
# Create new environment with Python 3.7 (required for TF 1.15)
print_status info "Creating Python 3.7 environment..."
conda create -n "$CONDA_ENV_NAME" python=3.7 -y || {
print_status error "Failed to create conda environment"
return 1
}
print_status success "Conda environment created"
return 0
}
install_tensorflow() {
print_status info "Installing TensorFlow 1.15..."
# Activate conda environment
eval "$(conda shell.bash hook)"
conda activate "$CONDA_ENV_NAME" || {
print_status error "Failed to activate conda environment"
return 1
}
# Install TensorFlow 1.15 (last 1.x release). No --break-system-packages here:
# the flag is unnecessary inside a conda env and unknown to the older pip
# that ships with Python 3.7.
pip install tensorflow==1.15.5 || {
print_status error "Failed to install TensorFlow"
return 1
}
# Verify installation
python -c "import tensorflow as tf; print(f'TensorFlow {tf.__version__} installed')" || {
print_status error "TensorFlow installation verification failed"
return 1
}
print_status success "TensorFlow 1.15 installed"
return 0
}
install_precise() {
print_status info "Installing Mycroft Precise..."
# Activate conda environment
eval "$(conda shell.bash hook)"
conda activate "$CONDA_ENV_NAME" || {
print_status error "Failed to activate conda environment"
return 1
}
# Install audio dependencies
print_status info "Installing system audio dependencies..."
if command_exists apt-get; then
sudo apt-get update
sudo apt-get install -y portaudio19-dev sox libatlas-base-dev || {
print_status warning "Some audio dependencies failed to install"
}
fi
# Install Python audio libraries
pip install pyaudio || {
print_status warning "PyAudio installation failed (may need manual installation)"
}
# Install Precise
pip install mycroft-precise || {
print_status error "Failed to install Mycroft Precise"
return 1
}
# Verify installation
python -c "import precise_runner; print('Precise installed successfully')" || {
print_status error "Precise installation verification failed"
return 1
}
print_status success "Mycroft Precise installed"
return 0
}
download_precise_engine() {
print_status info "Downloading precise-engine..."
local engine_version="0.3.0"
local engine_url="https://github.com/MycroftAI/mycroft-precise/releases/download/v${engine_version}/precise-engine_${engine_version}_x86_64.tar.gz"
local temp_dir=$(mktemp -d)
# Download engine
wget -q --show-progress -O "$temp_dir/precise-engine.tar.gz" "$engine_url" || {
print_status error "Failed to download precise-engine"
rm -rf "$temp_dir"
return 1
}
# Extract
tar xzf "$temp_dir/precise-engine.tar.gz" -C "$temp_dir" || {
print_status error "Failed to extract precise-engine"
rm -rf "$temp_dir"
return 1
}
# Install to /usr/local/bin
sudo cp "$temp_dir/precise-engine/precise-engine" /usr/local/bin/ || {
print_status error "Failed to install precise-engine"
rm -rf "$temp_dir"
return 1
}
sudo chmod +x /usr/local/bin/precise-engine
# Clean up
rm -rf "$temp_dir"
# Verify installation
precise-engine --version || {
print_status error "precise-engine installation verification failed"
return 1
}
print_status success "precise-engine installed"
return 0
}
create_training_directory() {
print_status info "Creating training directory structure..."
# Sanitize wake word for directory name
local wake_word_dir=$(echo "$WAKE_WORD" | tr ' ' '-' | tr '[:upper:]' '[:lower:]')
local project_dir="$MODELS_DIR/$wake_word_dir"
mkdir -p "$project_dir"/{wake-word,not-wake-word,test/wake-word,test/not-wake-word}
print_status success "Training directory created: $project_dir"
# Store project path for later use
echo "$project_dir" > "$MODELS_DIR/.current_project"
return 0
}
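The wake-word sanitization used above (spaces to hyphens, lowercased) can be factored into a small reusable helper; a minimal sketch:

```shell
# Sketch of the wake-word -> directory-name sanitization used above:
# spaces become hyphens and everything is lowercased,
# so "Hey Jarvis" becomes "hey-jarvis".
sanitize_wake_word() {
    echo "$1" | tr ' ' '-' | tr '[:upper:]' '[:lower:]'
}
```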
create_training_scripts() {
print_status info "Creating training helper scripts..."
local wake_word_dir=$(echo "$WAKE_WORD" | tr ' ' '-' | tr '[:upper:]' '[:lower:]')
local project_dir="$MODELS_DIR/$wake_word_dir"
# Create recording script
cat > "$project_dir/1-record-wake-word.sh" << 'EOF'
#!/bin/bash
# Step 1: Record wake word samples
# Run this script and follow the prompts to record ~50-100 samples
eval "$(conda shell.bash hook)"
conda activate precise
echo "Recording wake word samples..."
echo "Press SPACE to start/stop recording"
echo "Press Ctrl+C when done (aim for 50-100 samples)"
echo ""
precise-collect
EOF
# Create not-wake-word recording script
cat > "$project_dir/2-record-not-wake-word.sh" << 'EOF'
#!/bin/bash
# Step 2: Record "not wake word" samples
# Record random speech, TV, music, similar-sounding phrases
eval "$(conda shell.bash hook)"
conda activate precise
echo "Recording not-wake-word samples..."
echo "Record:"
echo " - Normal conversation"
echo " - TV/music background"
echo " - Similar sounding phrases"
echo " - Ambient noise"
echo ""
echo "Press SPACE to start/stop recording"
echo "Press Ctrl+C when done (aim for 200-500 samples)"
echo ""
precise-collect -f not-wake-word/samples.wav
EOF
# Create training script
cat > "$project_dir/3-train-model.sh" << EOF
#!/bin/bash
# Step 3: Train the model
# This will train for 60 epochs (adjust -e parameter for more/less)
eval "\$(conda shell.bash hook)"
conda activate precise
echo "Training wake word model..."
echo "This will take 30-60 minutes..."
echo ""
# Train model
precise-train -e 60 ${wake_word_dir}.net .
echo ""
echo "Training complete!"
echo "Test with: precise-listen ${wake_word_dir}.net"
EOF
# Create testing script
cat > "$project_dir/4-test-model.sh" << EOF
#!/bin/bash
# Step 4: Test the model with live microphone
eval "\$(conda shell.bash hook)"
conda activate precise
echo "Testing wake word model..."
echo "Speak your wake word - you should see '!' when detected"
echo "Speak other phrases - should not trigger"
echo ""
echo "Press Ctrl+C to exit"
echo ""
precise-listen ${wake_word_dir}.net
EOF
# Create evaluation script
cat > "$project_dir/5-evaluate-model.sh" << EOF
#!/bin/bash
# Step 5: Evaluate model on test set
eval "\$(conda shell.bash hook)"
conda activate precise
echo "Evaluating wake word model on test set..."
echo ""
precise-test ${wake_word_dir}.net test/
echo ""
echo "Check metrics above:"
echo " - Wake word accuracy should be >95%"
echo " - False positive rate should be <5%"
EOF
# Create tuning script
cat > "$project_dir/6-tune-threshold.sh" << EOF
#!/bin/bash
# Step 6: Tune activation threshold
eval "\$(conda shell.bash hook)"
conda activate precise
echo "Testing different thresholds..."
echo ""
echo "Default threshold: 0.5"
echo "Higher = fewer false positives, may miss some wake words"
echo "Lower = catch more wake words, more false positives"
echo ""
for threshold in 0.3 0.5 0.7; do
echo "Testing threshold: \$threshold"
echo "Press Ctrl+C to try next threshold"
precise-listen ${wake_word_dir}.net -t \$threshold
done
EOF
# Make all scripts executable
chmod +x "$project_dir"/*.sh
print_status success "Training scripts created in $project_dir"
return 0
}
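The threshold tradeoff that script 6 explores (higher threshold, fewer activations) can also be sanity-checked offline against a list of detector scores; a sketch with made-up scores standing in for Precise's 0..1 network output:

```shell
# Count how many detector scores would trigger at a given threshold.
# First argument is the threshold; remaining arguments are scores.
count_activations() {
    local threshold="$1"; shift
    printf '%s\n' "$@" | awk -v t="$threshold" '$1 >= t { n++ } END { print n+0 }'
}
```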
create_readme() {
print_status info "Creating README..."
local wake_word_dir=$(echo "$WAKE_WORD" | tr ' ' '-' | tr '[:upper:]' '[:lower:]')
local project_dir="$MODELS_DIR/$wake_word_dir"
cat > "$project_dir/README.md" << EOF
# Wake Word Training: "$WAKE_WORD"
## Quick Start
Follow these steps in order:
### 1. Record Wake Word Samples
\`\`\`bash
./1-record-wake-word.sh
\`\`\`
Record 50-100 samples:
- Vary your tone and speed
- Different distances from microphone
- Different background noise levels
- Have family members record too
### 2. Record Not-Wake-Word Samples
\`\`\`bash
./2-record-not-wake-word.sh
\`\`\`
Record 200-500 samples of:
- Normal conversation
- TV/music in background
- Similar sounding phrases
- Ambient household noise
### 3. Organize Samples
Move files into an 80/20 training/test split. Pick the subsets by hand; the
same glob cannot be used for both destinations, because the first command
would move every matching file:
\`\`\`bash
# ~80% of wake-word samples (e.g. the first 40 of 50):
mv <chosen wake-word samples> wake-word/
# the remaining ~20%:
mv <remaining wake-word samples> test/wake-word/
# ~80% of not-wake-word samples:
mv <chosen not-wake-word samples> not-wake-word/
# the remaining ~20%:
mv <remaining not-wake-word samples> test/not-wake-word/
\`\`\`
### 4. Train Model
\`\`\`bash
./3-train-model.sh
\`\`\`
Wait 30-60 minutes for training to complete.
### 5. Test Model
\`\`\`bash
./4-test-model.sh
\`\`\`
Speak your wake word and verify detection.
### 6. Evaluate Model
\`\`\`bash
./5-evaluate-model.sh
\`\`\`
Check accuracy metrics on test set.
### 7. Tune Threshold
\`\`\`bash
./6-tune-threshold.sh
\`\`\`
Find the best threshold for your environment.
## Tips for Good Training
1. **Quality over quantity** - Clear samples are better than many poor ones
2. **Diverse conditions** - Different noise levels, distances, speakers
3. **Hard negatives** - Include similar-sounding phrases in not-wake-word set
4. **Regular updates** - Add false positives/negatives and retrain
## Next Steps
Once trained and tested:
1. Copy model to voice assistant server:
\`\`\`bash
cp ${wake_word_dir}.net ~/voice-assistant/models/
\`\`\`
2. Update voice assistant config:
\`\`\`bash
vim ~/voice-assistant/config/.env
# Set: PRECISE_MODEL=~/voice-assistant/models/${wake_word_dir}.net
\`\`\`
3. Restart voice assistant service:
\`\`\`bash
sudo systemctl restart voice-assistant
\`\`\`
## Troubleshooting
**Low accuracy?**
- Collect more training samples
- Increase training epochs (edit 3-train-model.sh, change -e 60 to -e 120)
- Verify 80/20 train/test split
**Too many false positives?**
- Increase threshold (use 6-tune-threshold.sh)
- Add false trigger audio to not-wake-word set
- Retrain with more diverse negative samples
**Misses wake words?**
- Lower threshold
- Add missed samples to training set
- Ensure good audio quality
## Resources
- Mycroft Precise Docs: https://github.com/MycroftAI/mycroft-precise
- Training Guide: https://mycroft-ai.gitbook.io/docs/mycroft-technologies/precise
- Community Models: https://github.com/MycroftAI/precise-data
EOF
print_status success "README created in $project_dir"
return 0
}
download_pretrained_models() {
print_status info "Downloading pre-trained models..."
# Create models directory
mkdir -p "$MODELS_DIR/pretrained"
# Download Hey Mycroft model (as example/base)
local model_url="https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz"
if [[ ! -f "$MODELS_DIR/pretrained/hey-mycroft.net" ]]; then
print_status info "Downloading Hey Mycroft model..."
wget -q --show-progress -O "$MODELS_DIR/pretrained/hey-mycroft.tar.gz" "$model_url" || {
print_status warning "Failed to download pre-trained model (optional)"
return 0
}
tar xzf "$MODELS_DIR/pretrained/hey-mycroft.tar.gz" -C "$MODELS_DIR/pretrained/" || {
print_status warning "Failed to extract pre-trained model"
return 0
}
print_status success "Pre-trained model downloaded"
else
print_status info "Pre-trained model already exists"
fi
return 0
}
print_next_steps() {
local wake_word_dir=$(echo "$WAKE_WORD" | tr ' ' '-' | tr '[:upper:]' '[:lower:]')
local project_dir="$MODELS_DIR/$wake_word_dir"
cat << EOF
${GREEN}Setup complete!${NC}
Wake word: "$WAKE_WORD"
Project directory: $project_dir
${BLUE}Next steps:${NC}
1. ${CYAN}Activate conda environment:${NC}
conda activate $CONDA_ENV_NAME
2. ${CYAN}Navigate to project directory:${NC}
cd $project_dir
3. ${CYAN}Follow the README or run scripts in order:${NC}
./1-record-wake-word.sh # Record wake word samples
./2-record-not-wake-word.sh # Record negative samples
# Organize samples into train/test directories
./3-train-model.sh # Train the model (30-60 min)
./4-test-model.sh # Test with microphone
./5-evaluate-model.sh # Check accuracy metrics
./6-tune-threshold.sh # Find best threshold
${BLUE}Helpful commands:${NC}
Test pre-trained model:
conda activate $CONDA_ENV_NAME
precise-listen $MODELS_DIR/pretrained/hey-mycroft.net
Check precise-engine:
precise-engine --version
${BLUE}Resources:${NC}
Full guide: See MYCROFT_PRECISE_GUIDE.md
Project README: $project_dir/README.md
Mycroft Docs: https://github.com/MycroftAI/mycroft-precise
EOF
}
# ----- Main -----
main() {
print_status info "Starting Mycroft Precise setup..."
# Parse arguments
parse_args "$@"
# Check dependencies
check_conda || exit 1
# Setup steps
create_conda_environment || exit 1
install_tensorflow || exit 1
install_precise || exit 1
download_precise_engine || exit 1
create_training_directory || exit 1
create_training_scripts || exit 1
create_readme || exit 1
download_pretrained_models || exit 1
# Print next steps
print_next_steps
}
# Run main
main "$@"

scripts/setup_voice_assistant.sh Executable file

@@ -0,0 +1,429 @@
#!/usr/bin/env bash
#
# Path: setup_voice_assistant.sh
#
# Purpose and usage:
# Sets up the voice assistant server environment on Heimdall
# - Creates conda environment
# - Installs dependencies (Whisper, Flask, Piper TTS)
# - Downloads and configures TTS models
# - Sets up systemd service (optional)
# - Configures environment variables
#
# Requirements:
# - conda/miniconda installed
# - Internet connection for downloads
# - Sudo access (for systemd service setup)
#
# Usage:
# ./setup_voice_assistant.sh [--no-service] [--env-name NAME]
#
# Author: PRbL Library
# Created: $(date +"%Y-%m-%d")
# ----- PRbL Color and output functions -----
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[0;33m'
BLUE='\033[0;34m'
PURPLE='\033[0;35m'
CYAN='\033[0;36m'
NC='\033[0m' # No Color
print_status() {
local level="$1"
shift
case "$level" in
"info") echo -e "${BLUE}[INFO]${NC} $*" >&2 ;;
"success") echo -e "${GREEN}[SUCCESS]${NC} $*" >&2 ;;
"warning") echo -e "${YELLOW}[WARNING]${NC} $*" >&2 ;;
"error") echo -e "${RED}[ERROR]${NC} $*" >&2 ;;
"debug") [[ "$VERBOSE" == "true" ]] && echo -e "${PURPLE}[DEBUG]${NC} $*" >&2 ;;
*) echo -e "$*" >&2 ;;
esac
}
# ----- Configuration -----
CONDA_ENV_NAME="voice-assistant"
PROJECT_DIR="$HOME/voice-assistant"
INSTALL_SYSTEMD=true
VERBOSE=false
# ----- Dependency checking -----
command_exists() {
command -v "$1" &> /dev/null
}
check_conda() {
if ! command_exists conda; then
print_status error "conda not found. Please install miniconda first."
print_status info "Install with: wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh"
print_status info " bash Miniconda3-latest-Linux-x86_64.sh"
return 1
fi
return 0
}
# ----- Parse arguments -----
parse_args() {
while [[ $# -gt 0 ]]; do
case "$1" in
--no-service)
INSTALL_SYSTEMD=false
shift
;;
--env-name)
CONDA_ENV_NAME="$2"
shift 2
;;
-v|--verbose)
VERBOSE=true
shift
;;
-h|--help)
cat << EOF
Usage: $(basename "$0") [OPTIONS]
Options:
--no-service Don't install systemd service
--env-name NAME Custom conda environment name (default: voice-assistant)
-v, --verbose Enable verbose output
-h, --help Show this help message
EOF
exit 0
;;
*)
print_status error "Unknown option: $1"
exit 1
;;
esac
done
}
# ----- Setup functions -----
create_project_directory() {
print_status info "Creating project directory: $PROJECT_DIR"
if [[ ! -d "$PROJECT_DIR" ]]; then
mkdir -p "$PROJECT_DIR" || {
print_status error "Failed to create project directory"
return 1
}
fi
# Create subdirectories
mkdir -p "$PROJECT_DIR"/{logs,models,config}
print_status success "Project directory created"
return 0
}
create_conda_environment() {
print_status info "Creating conda environment: $CONDA_ENV_NAME"
# Check if environment already exists
if conda env list | grep -q "^${CONDA_ENV_NAME}\s"; then
print_status warning "Environment $CONDA_ENV_NAME already exists"
read -p "Remove and recreate? (y/N): " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
print_status info "Removing existing environment..."
conda env remove -n "$CONDA_ENV_NAME" -y
else
print_status info "Using existing environment"
return 0
fi
fi
# Create new environment
print_status info "Creating Python 3.10 environment..."
conda create -n "$CONDA_ENV_NAME" python=3.10 -y || {
print_status error "Failed to create conda environment"
return 1
}
print_status success "Conda environment created"
return 0
}
install_python_dependencies() {
print_status info "Installing Python dependencies..."
# Activate conda environment
eval "$(conda shell.bash hook)"
conda activate "$CONDA_ENV_NAME" || {
print_status error "Failed to activate conda environment"
return 1
}
# Install base dependencies
print_status info "Installing base packages..."
pip install --upgrade pip --break-system-packages || true
# Install Whisper (OpenAI)
print_status info "Installing OpenAI Whisper..."
pip install -U openai-whisper --break-system-packages || {
print_status error "Failed to install Whisper"
return 1
}
# Install Flask
print_status info "Installing Flask..."
pip install flask --break-system-packages || {
print_status error "Failed to install Flask"
return 1
}
# Install requests
print_status info "Installing requests..."
pip install requests --break-system-packages || {
print_status error "Failed to install requests"
return 1
}
# Install python-dotenv
print_status info "Installing python-dotenv..."
pip install python-dotenv --break-system-packages || {
print_status warning "Failed to install python-dotenv (optional)"
}
# Install Piper TTS
print_status info "Installing Piper TTS..."
# Note: Piper TTS installation method varies, adjust as needed
# For now, we'll install the Python package if available
pip install piper-tts --break-system-packages || {
print_status warning "Piper TTS pip package not found"
print_status info "You may need to install Piper manually from: https://github.com/rhasspy/piper"
}
# Install PyAudio for audio handling
print_status info "Installing PyAudio dependencies..."
if command_exists apt-get; then
sudo apt-get install -y portaudio19-dev python3-pyaudio || {
print_status warning "Failed to install portaudio dev packages"
}
fi
pip install pyaudio --break-system-packages || {
print_status warning "Failed to install PyAudio (may need manual installation)"
}
print_status success "Python dependencies installed"
return 0
}
download_piper_models() {
print_status info "Downloading Piper TTS models..."
local models_dir="$PROJECT_DIR/models/piper"
mkdir -p "$models_dir"
# Download a default voice model
# Example: en_US-lessac-medium
local model_url="https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx"
local config_url="https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json"
if [[ ! -f "$models_dir/en_US-lessac-medium.onnx" ]]; then
print_status info "Downloading voice model..."
wget -q --show-progress -O "$models_dir/en_US-lessac-medium.onnx" "$model_url" || {
print_status warning "Failed to download Piper model (manual download may be needed)"
}
wget -q --show-progress -O "$models_dir/en_US-lessac-medium.onnx.json" "$config_url" || {
print_status warning "Failed to download Piper config"
}
else
print_status info "Piper model already downloaded"
fi
print_status success "Piper models ready"
return 0
}
create_config_file() {
print_status info "Creating configuration file..."
local config_file="$PROJECT_DIR/config/.env"
if [[ -f "$config_file" ]]; then
print_status warning "Config file already exists: $config_file"
return 0
fi
cat > "$config_file" << 'EOF'
# Voice Assistant Configuration
# Path: ~/voice-assistant/config/.env
# Home Assistant Configuration
HA_URL=http://homeassistant.local:8123
HA_TOKEN=your_long_lived_access_token_here
# Server Configuration
SERVER_HOST=0.0.0.0
SERVER_PORT=5000
# Whisper Configuration
WHISPER_MODEL=medium
# Piper TTS Configuration
PIPER_MODEL=/path/to/piper/model.onnx
PIPER_CONFIG=/path/to/piper/model.onnx.json
# Logging
LOG_LEVEL=INFO
LOG_FILE=/home/$USER/voice-assistant/logs/voice_assistant.log
EOF
# Update paths in config
sed -i "s|/path/to/piper/model.onnx|$PROJECT_DIR/models/piper/en_US-lessac-medium.onnx|g" "$config_file"
sed -i "s|/path/to/piper/model.onnx.json|$PROJECT_DIR/models/piper/en_US-lessac-medium.onnx.json|g" "$config_file"
sed -i "s|/home/\$USER|$HOME|g" "$config_file"
chmod 600 "$config_file"
print_status success "Config file created: $config_file"
print_status warning "Please edit $config_file and add your Home Assistant token"
return 0
}
create_systemd_service() {
if [[ "$INSTALL_SYSTEMD" != "true" ]]; then
print_status info "Skipping systemd service installation"
return 0
fi
print_status info "Creating systemd service..."
local service_file="/etc/systemd/system/voice-assistant.service"
# Create service file
sudo tee "$service_file" > /dev/null << EOF
[Unit]
Description=Voice Assistant Server
After=network.target
[Service]
Type=simple
User=$USER
WorkingDirectory=$PROJECT_DIR
Environment="PATH=$HOME/miniconda3/envs/$CONDA_ENV_NAME/bin:/usr/local/bin:/usr/bin:/bin"
EnvironmentFile=$PROJECT_DIR/config/.env
ExecStart=$HOME/miniconda3/envs/$CONDA_ENV_NAME/bin/python $PROJECT_DIR/voice_server.py
Restart=on-failure
RestartSec=10
StandardOutput=append:$PROJECT_DIR/logs/voice_assistant.log
StandardError=append:$PROJECT_DIR/logs/voice_assistant_error.log
[Install]
WantedBy=multi-user.target
EOF
# Reload systemd
sudo systemctl daemon-reload
print_status success "Systemd service created"
print_status info "To enable and start the service:"
print_status info " sudo systemctl enable voice-assistant"
print_status info " sudo systemctl start voice-assistant"
return 0
}
create_test_script() {
print_status info "Creating test script..."
local test_script="$PROJECT_DIR/test_server.sh"
cat > "$test_script" << 'EOF'
#!/bin/bash
# Test script for voice assistant server
# Activate conda environment
eval "$(conda shell.bash hook)"
conda activate voice-assistant
# Load environment variables (set -a exports every variable the file defines;
# unlike `export $(grep ... | xargs)`, this survives values containing spaces)
if [[ -f ~/voice-assistant/config/.env ]]; then
set -a
source ~/voice-assistant/config/.env
set +a
fi
# Run server
cd ~/voice-assistant
python voice_server.py --verbose
EOF
chmod +x "$test_script"
print_status success "Test script created: $test_script"
return 0
}
install_voice_server_script() {
print_status info "Installing voice_server.py..."
# Check if voice_server.py exists in outputs
if [[ -f "$HOME/voice_server.py" ]]; then
cp "$HOME/voice_server.py" "$PROJECT_DIR/voice_server.py"
print_status success "voice_server.py installed"
elif [[ -f "./voice_server.py" ]]; then
cp "./voice_server.py" "$PROJECT_DIR/voice_server.py"
print_status success "voice_server.py installed"
else
print_status warning "voice_server.py not found in current directory"
print_status info "Please copy voice_server.py to $PROJECT_DIR manually"
fi
return 0
}
# ----- Main -----
main() {
print_status info "Starting voice assistant setup..."
# Parse arguments
parse_args "$@"
# Check dependencies
check_conda || exit 1
# Setup steps
create_project_directory || exit 1
create_conda_environment || exit 1
install_python_dependencies || exit 1
download_piper_models || exit 1
create_config_file || exit 1
install_voice_server_script || exit 1
create_test_script || exit 1
if [[ "$INSTALL_SYSTEMD" == "true" ]]; then
create_systemd_service || exit 1
fi
# Final instructions
print_status success "Setup complete!"
echo
print_status info "Next steps:"
print_status info "1. Edit config file: vim $PROJECT_DIR/config/.env"
print_status info "2. Add your Home Assistant long-lived access token"
print_status info "3. Test the server: $PROJECT_DIR/test_server.sh"
print_status info "4. Configure your Maix Duino device"
if [[ "$INSTALL_SYSTEMD" == "true" ]]; then
echo
print_status info "To run as a service:"
print_status info " sudo systemctl enable voice-assistant"
print_status info " sudo systemctl start voice-assistant"
print_status info " sudo systemctl status voice-assistant"
fi
echo
print_status info "Project directory: $PROJECT_DIR"
print_status info "Conda environment: $CONDA_ENV_NAME"
print_status info "Activate with: conda activate $CONDA_ENV_NAME"
}
# Run main
main "$@"

scripts/voice_server.py Executable file

@@ -0,0 +1,700 @@
#!/usr/bin/env python3
"""
Voice Processing Server for Maix Duino Voice Assistant
Purpose and usage:
This server runs on Heimdall (10.1.10.71) and handles:
- Audio stream reception from Maix Duino
- Speech-to-text using Whisper
- Intent recognition and Home Assistant API calls
- Text-to-speech using Piper
- Audio response streaming back to device
Path: /home/alan/voice-assistant/voice_server.py
Requirements:
- whisper (already installed)
- piper-tts
- flask
- requests
- python-dotenv
Usage:
python3 voice_server.py [--host HOST] [--port PORT] [--ha-url URL]
"""
import os
import sys
import argparse
import tempfile
import wave
import io
import re
import threading
import queue
import time  # used by the wake word callback timestamp
from pathlib import Path
from typing import Optional, Dict, Any, Tuple
import whisper
import requests
from flask import Flask, request, jsonify, send_file
from werkzeug.exceptions import BadRequest
# Try to load environment variables
try:
from dotenv import load_dotenv
load_dotenv()
except ImportError:
print("Warning: python-dotenv not installed. Using environment variables only.")
# Try to import Mycroft Precise
PRECISE_AVAILABLE = False
try:
from precise_runner import PreciseEngine, PreciseRunner
import pyaudio
PRECISE_AVAILABLE = True
except ImportError:
print("Warning: Mycroft Precise not installed. Wake word detection disabled.")
print("Install with: pip install mycroft-precise pyaudio")
# Configuration
DEFAULT_HOST = "0.0.0.0"
DEFAULT_PORT = 5000
DEFAULT_WHISPER_MODEL = "medium"
DEFAULT_HA_URL = os.getenv("HA_URL", "http://homeassistant.local:8123")
DEFAULT_HA_TOKEN = os.getenv("HA_TOKEN", "")
DEFAULT_PRECISE_MODEL = os.getenv("PRECISE_MODEL", "")
DEFAULT_PRECISE_SENSITIVITY = float(os.getenv("PRECISE_SENSITIVITY", "0.5"))
DEFAULT_PRECISE_ENGINE = "/usr/local/bin/precise-engine"
# Initialize Flask app
app = Flask(__name__)
app.config['MAX_CONTENT_LENGTH'] = 16 * 1024 * 1024 # 16MB max audio file
# Global variables for loaded models
whisper_model = None
ha_client = None
precise_runner = None
precise_enabled = False
wake_word_queue = queue.Queue() # Queue for wake word detections
class HomeAssistantClient:
"""Client for interacting with Home Assistant API"""
def __init__(self, base_url: str, token: str):
self.base_url = base_url.rstrip('/')
self.token = token
self.session = requests.Session()
self.session.headers.update({
'Authorization': f'Bearer {token}',
'Content-Type': 'application/json'
})
def get_state(self, entity_id: str) -> Optional[Dict[str, Any]]:
"""Get the state of an entity"""
try:
response = self.session.get(f'{self.base_url}/api/states/{entity_id}')
response.raise_for_status()
return response.json()
except requests.RequestException as e:
print(f"Error getting state for {entity_id}: {e}")
return None
def call_service(self, domain: str, service: str, entity_id: str,
**kwargs) -> bool:
"""Call a Home Assistant service"""
try:
data = {'entity_id': entity_id}
data.update(kwargs)
response = self.session.post(
f'{self.base_url}/api/services/{domain}/{service}',
json=data
)
response.raise_for_status()
return True
except requests.RequestException as e:
print(f"Error calling service {domain}.{service}: {e}")
return False
def turn_on(self, entity_id: str, **kwargs) -> bool:
"""Turn on an entity"""
domain = entity_id.split('.')[0]
return self.call_service(domain, 'turn_on', entity_id, **kwargs)
def turn_off(self, entity_id: str, **kwargs) -> bool:
"""Turn off an entity"""
domain = entity_id.split('.')[0]
return self.call_service(domain, 'turn_off', entity_id, **kwargs)
def toggle(self, entity_id: str, **kwargs) -> bool:
"""Toggle an entity"""
domain = entity_id.split('.')[0]
return self.call_service(domain, 'toggle', entity_id, **kwargs)
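For reference, the request that `call_service()` issues follows Home Assistant's REST API shape: `POST /api/services/<domain>/<service>` with a bearer token and a JSON body. A self-contained, offline sketch of the URL, headers, and payload, using placeholder values rather than a live configuration:

```python
# Sketch (offline): what call_service() sends for a
# "turn on the living room lights" command. The URL, token, and
# brightness value below are illustrative placeholders.
base_url = "http://homeassistant.local:8123"
token = "EXAMPLE_LONG_LIVED_TOKEN"
domain, service = "light", "turn_on"

url = f"{base_url}/api/services/{domain}/{service}"
headers = {
    "Authorization": f"Bearer {token}",
    "Content-Type": "application/json",
}
payload = {"entity_id": "light.living_room", "brightness_pct": 60}
```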
class IntentParser:
"""Simple pattern-based intent recognition"""
# Intent patterns (can be expanded or replaced with ML-based NLU)
PATTERNS = {
'turn_on': [
r'turn on (the )?(.+)',
r'switch on (the )?(.+)',
r'enable (the )?(.+)',
],
'turn_off': [
r'turn off (the )?(.+)',
r'switch off (the )?(.+)',
r'disable (the )?(.+)',
],
'toggle': [
r'toggle (the )?(.+)',
],
'get_state': [
r'what(?:\'s| is) (the )?(.+)',
r'how is (the )?(.+)',
r'status of (the )?(.+)',
],
'get_temperature': [
r'what(?:\'s| is) the temperature',
r'how (?:warm|cold|hot) is it',
],
}
# Entity name mapping (friendly names to entity IDs)
ENTITY_MAP = {
'living room light': 'light.living_room',
'living room lights': 'light.living_room',
'bedroom light': 'light.bedroom',
'bedroom lights': 'light.bedroom',
'kitchen light': 'light.kitchen',
'kitchen lights': 'light.kitchen',
'all lights': 'group.all_lights',
'temperature': 'sensor.temperature',
'thermostat': 'climate.thermostat',
}
def parse(self, text: str) -> Optional[Tuple[str, str, Dict[str, Any]]]:
"""
Parse text into intent, entity, and parameters
Returns:
(intent, entity_id, params) or None if no match
"""
text = text.lower().strip()
for intent, patterns in self.PATTERNS.items():
for pattern in patterns:
match = re.match(pattern, text, re.IGNORECASE)
if match:
# Extract entity name from match groups
entity_name = None
for group in match.groups():
# Strip before comparing: the optional article group captures 'the ' with a trailing space
if group and group.strip().lower() not in ('the', 'a', 'an'):
entity_name = group.strip().lower()
break
# Map entity name to entity ID
entity_id = None
if entity_name:
entity_id = self.ENTITY_MAP.get(entity_name)
# For get_temperature, use default sensor
if intent == 'get_temperature':
entity_id = self.ENTITY_MAP.get('temperature')
if entity_id:
return (intent, entity_id, {})
return None
def load_whisper_model(model_name: str = DEFAULT_WHISPER_MODEL):
"""Load Whisper model"""
global whisper_model
if whisper_model is None:
print(f"Loading Whisper model: {model_name}")
whisper_model = whisper.load_model(model_name)
print("Whisper model loaded successfully")
return whisper_model
def transcribe_audio(audio_file_path: str) -> Optional[str]:
"""Transcribe audio file using Whisper"""
try:
model = load_whisper_model()
result = model.transcribe(audio_file_path)
return result['text'].strip()
except Exception as e:
print(f"Error transcribing audio: {e}")
return None
def generate_tts(text: str) -> Optional[bytes]:
"""
Generate speech from text using Piper TTS
TODO: Implement Piper TTS integration
For now, returns None - implement based on Piper installation
"""
# Placeholder for TTS implementation
print(f"TTS requested for: {text}")
# Piper TTS integration goes here; the piper CLI reads text on stdin:
# echo "<text>" | piper --model <model.onnx> --output_file <file.wav>
return None
def on_wake_word_detected():
"""
Callback when Mycroft Precise detects wake word
This function is called by the Precise runner when the wake word
is detected. It signals the main application to start recording
and processing the user's command.
"""
print("Wake word detected by Precise!")
wake_word_queue.put({
'timestamp': time.time(),
'source': 'precise'
})
def start_precise_listener(model_path: str, sensitivity: float = 0.5,
engine_path: str = DEFAULT_PRECISE_ENGINE):
"""
Start Mycroft Precise wake word detection
Args:
model_path: Path to .net model file
sensitivity: Detection threshold (0.0-1.0, default 0.5)
engine_path: Path to precise-engine binary
Returns:
PreciseRunner instance if successful, None otherwise
"""
global precise_runner, precise_enabled
if not PRECISE_AVAILABLE:
print("Error: Mycroft Precise not available")
return None
# Verify model exists
if not os.path.exists(model_path):
print(f"Error: Precise model not found: {model_path}")
return None
# Verify engine exists
if not os.path.exists(engine_path):
print(f"Error: precise-engine not found: {engine_path}")
print("Download from: https://github.com/MycroftAI/mycroft-precise/releases")
return None
try:
# Create Precise engine
engine = PreciseEngine(engine_path, model_path)
# Create runner with callback
precise_runner = PreciseRunner(
engine,
sensitivity=sensitivity,
on_activation=on_wake_word_detected
)
# Start listening
precise_runner.start()
precise_enabled = True
print("Precise listening started:")
print(f" Model: {model_path}")
print(f" Sensitivity: {sensitivity}")
print(f" Engine: {engine_path}")
return precise_runner
except Exception as e:
print(f"Error starting Precise: {e}")
return None
def stop_precise_listener():
"""Stop Mycroft Precise wake word detection"""
global precise_runner, precise_enabled
if precise_runner:
try:
precise_runner.stop()
precise_enabled = False
print("Precise listener stopped")
except Exception as e:
print(f"Error stopping Precise: {e}")
def record_audio_after_wake(duration: int = 5) -> Optional[bytes]:
"""
Record audio after wake word is detected
Args:
duration: Maximum recording duration in seconds
Returns:
WAV audio data or None
Note: This is for server-side wake word detection where
the server is also doing audio capture. For Maix Duino
client-side wake detection, audio comes from the client.
"""
if not PRECISE_AVAILABLE:
return None
try:
# Audio settings
CHUNK = 1024
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000
p = pyaudio.PyAudio()
# Open stream
stream = p.open(
format=FORMAT,
channels=CHANNELS,
rate=RATE,
input=True,
frames_per_buffer=CHUNK
)
print(f"Recording for {duration} seconds...")
frames = []
for _ in range(0, int(RATE / CHUNK * duration)):
data = stream.read(CHUNK)
frames.append(data)
# Stop and close stream
stream.stop_stream()
stream.close()
p.terminate()
# Convert to WAV
wav_buffer = io.BytesIO()
with wave.open(wav_buffer, 'wb') as wf:
wf.setnchannels(CHANNELS)
wf.setsampwidth(p.get_sample_size(FORMAT))
wf.setframerate(RATE)
wf.writeframes(b''.join(frames))
return wav_buffer.getvalue()
except Exception as e:
print(f"Error recording audio: {e}")
return None
import time  # module-level, so it runs before any handler; belongs with the imports at the top of the file
def execute_intent(intent: str, entity_id: str, params: Dict[str, Any]) -> str:
"""Execute an intent and return response text"""
if intent == 'turn_on':
success = ha_client.turn_on(entity_id)
if success:
entity_name = entity_id.split('.')[-1].replace('_', ' ')
return f"Turned on {entity_name}"
else:
return "Sorry, I couldn't turn that on"
elif intent == 'turn_off':
success = ha_client.turn_off(entity_id)
if success:
entity_name = entity_id.split('.')[-1].replace('_', ' ')
return f"Turned off {entity_name}"
else:
return "Sorry, I couldn't turn that off"
elif intent == 'toggle':
success = ha_client.toggle(entity_id)
if success:
entity_name = entity_id.split('.')[-1].replace('_', ' ')
return f"Toggled {entity_name}"
else:
return "Sorry, I couldn't toggle that"
elif intent in ['get_state', 'get_temperature']:
state = ha_client.get_state(entity_id)
if state:
entity_name = entity_id.split('.')[-1].replace('_', ' ')
value = state.get('state', 'unknown')
unit = state.get('attributes', {}).get('unit_of_measurement', '')
return f"The {entity_name} is {value} {unit}".strip()
else:
return "Sorry, I couldn't get that information"
return "I didn't understand that command"
# Flask routes
@app.route('/health', methods=['GET'])
def health():
"""Health check endpoint"""
return jsonify({
'status': 'healthy',
'whisper_loaded': whisper_model is not None,
'ha_connected': ha_client is not None,
'precise_enabled': precise_enabled,
'precise_available': PRECISE_AVAILABLE
})
@app.route('/wake-word/status', methods=['GET'])
def wake_word_status():
"""Get wake word detection status"""
return jsonify({
'enabled': precise_enabled,
'available': PRECISE_AVAILABLE,
'model': DEFAULT_PRECISE_MODEL if precise_enabled else None,
'sensitivity': DEFAULT_PRECISE_SENSITIVITY if precise_enabled else None
})
@app.route('/wake-word/detections', methods=['GET'])
def wake_word_detections():
"""
Get recent wake word detections (non-blocking)
Returns any wake word detections in the queue.
Used for testing and monitoring.
"""
detections = []
try:
while not wake_word_queue.empty():
detections.append(wake_word_queue.get_nowait())
except queue.Empty:
pass
return jsonify({
'detections': detections,
'count': len(detections)
})
@app.route('/transcribe', methods=['POST'])
def transcribe():
"""
Transcribe audio file
Expects: WAV audio file in request body
Returns: JSON with transcribed text
"""
if 'audio' not in request.files:
raise BadRequest('No audio file provided')
audio_file = request.files['audio']
# Save to temporary file
with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as temp_file:
audio_file.save(temp_file.name)
temp_path = temp_file.name
try:
# Transcribe
text = transcribe_audio(temp_path)
if text:
return jsonify({
'success': True,
'text': text
})
else:
return jsonify({
'success': False,
'error': 'Transcription failed'
}), 500
finally:
# Clean up temp file
if os.path.exists(temp_path):
os.remove(temp_path)
@app.route('/process', methods=['POST'])
def process():
"""
Process complete voice command
Expects: WAV audio file in request body
Returns: JSON with response and audio file
"""
if 'audio' not in request.files:
raise BadRequest('No audio file provided')
audio_file = request.files['audio']
# Save to temporary file
with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as temp_file:
audio_file.save(temp_file.name)
temp_path = temp_file.name
try:
# Step 1: Transcribe
text = transcribe_audio(temp_path)
if not text:
return jsonify({
'success': False,
'error': 'Transcription failed'
}), 500
print(f"Transcribed: {text}")
# Step 2: Parse intent
parser = IntentParser()
intent_result = parser.parse(text)
if not intent_result:
response_text = "I didn't understand that command"
else:
intent, entity_id, params = intent_result
print(f"Intent: {intent}, Entity: {entity_id}")
# Step 3: Execute intent
response_text = execute_intent(intent, entity_id, params)
print(f"Response: {response_text}")
# Step 4: Generate TTS (placeholder for now)
# audio_response = generate_tts(response_text)
return jsonify({
'success': True,
'transcription': text,
'response': response_text,
# 'audio_available': audio_response is not None
})
finally:
# Clean up temp file
if os.path.exists(temp_path):
os.remove(temp_path)
@app.route('/tts', methods=['POST'])
def tts():
"""
Generate TTS audio
Expects: JSON with 'text' field
Returns: WAV audio file
"""
data = request.get_json()
if not data or 'text' not in data:
raise BadRequest('No text provided')
text = data['text']
# Generate TTS
audio_data = generate_tts(text)
if audio_data:
return send_file(
io.BytesIO(audio_data),
mimetype='audio/wav',
as_attachment=True,
download_name='response.wav'
)
else:
return jsonify({
'success': False,
'error': 'TTS generation not implemented yet'
}), 501
def main():
parser = argparse.ArgumentParser(
description="Voice Processing Server for Maix Duino Voice Assistant"
)
parser.add_argument('--host', default=DEFAULT_HOST,
help=f'Server host (default: {DEFAULT_HOST})')
parser.add_argument('--port', type=int, default=DEFAULT_PORT,
help=f'Server port (default: {DEFAULT_PORT})')
parser.add_argument('--whisper-model', default=DEFAULT_WHISPER_MODEL,
help=f'Whisper model to use (default: {DEFAULT_WHISPER_MODEL})')
parser.add_argument('--ha-url', default=DEFAULT_HA_URL,
help=f'Home Assistant URL (default: {DEFAULT_HA_URL})')
parser.add_argument('--ha-token', default=DEFAULT_HA_TOKEN,
help='Home Assistant long-lived access token')
parser.add_argument('--enable-precise', action='store_true',
help='Enable Mycroft Precise wake word detection')
parser.add_argument('--precise-model', default=DEFAULT_PRECISE_MODEL,
help='Path to Precise .net model file')
parser.add_argument('--precise-sensitivity', type=float,
default=DEFAULT_PRECISE_SENSITIVITY,
help='Precise sensitivity threshold (0.0-1.0, default: 0.5)')
parser.add_argument('--precise-engine', default=DEFAULT_PRECISE_ENGINE,
help=f'Path to precise-engine binary (default: {DEFAULT_PRECISE_ENGINE})')
args = parser.parse_args()
# Validate HA configuration
if not args.ha_token:
print("Warning: No Home Assistant token provided!")
print("Set HA_TOKEN environment variable or use --ha-token")
print("Commands will not execute without authentication.")
# Initialize global clients
global ha_client
ha_client = HomeAssistantClient(args.ha_url, args.ha_token)
# Load Whisper model
print(f"Starting voice processing server on {args.host}:{args.port}")
load_whisper_model(args.whisper_model)
# Start Precise if enabled
if args.enable_precise:
if not PRECISE_AVAILABLE:
print("Error: --enable-precise specified but Mycroft Precise not installed")
print("Install with: pip install mycroft-precise pyaudio")
sys.exit(1)
if not args.precise_model:
print("Error: --enable-precise requires --precise-model")
sys.exit(1)
print("\nStarting Mycroft Precise wake word detection...")
precise_result = start_precise_listener(
args.precise_model,
args.precise_sensitivity,
args.precise_engine
)
if not precise_result:
print("Error: Failed to start Precise listener")
sys.exit(1)
print("\nWake word detection active!")
print("The server will detect wake words and queue them for processing.")
print("Use /wake-word/detections endpoint to check for detections.\n")
# Start Flask server
try:
app.run(host=args.host, port=args.port, debug=False)
except KeyboardInterrupt:
print("\nShutting down...")
if args.enable_precise:
stop_precise_listener()
sys.exit(0)
if __name__ == '__main__':
main()
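The pattern matcher in `IntentParser.parse` can be exercised standalone. Below is a minimal, self-contained sketch of the same matching logic, trimmed to two intents and one mapped entity (names are illustrative); the article filter strips whitespace before comparing, since the optional `(the )?` group captures a trailing space:

```python
import re

# Trimmed-down versions of the PATTERNS and ENTITY_MAP tables above.
PATTERNS = {
    'turn_on': [r'turn on (the )?(.+)', r'switch on (the )?(.+)'],
    'turn_off': [r'turn off (the )?(.+)', r'switch off (the )?(.+)'],
}
ENTITY_MAP = {'kitchen light': 'light.kitchen'}

def parse(text):
    """Return (intent, entity_id) for a recognized command, else None."""
    text = text.lower().strip()
    for intent, patterns in PATTERNS.items():
        for pattern in patterns:
            match = re.match(pattern, text)
            if match:
                # Skip the optional article group; keep the entity phrase.
                for group in match.groups():
                    if group and group.strip() not in ('the', 'a', 'an'):
                        entity_id = ENTITY_MAP.get(group.strip())
                        if entity_id:
                            return (intent, entity_id)
    return None

print(parse("Turn on the kitchen light"))  # ('turn_on', 'light.kitchen')
```

Unmapped phrases ("turn on the lava lamp") fall through to `None`, which is what makes the `ENTITY_MAP` the effective vocabulary of the assistant.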

scripts/voice_server_enhanced.py (new executable file)

@@ -0,0 +1,580 @@
#!/usr/bin/env python3
"""
Enhanced Voice Server with Multiple Wake Words and Speaker Identification
Path: /home/alan/voice-assistant/voice_server_enhanced.py
This enhanced version adds:
- Multiple wake word support
- Speaker identification using pyannote.audio
- Per-user customization
- Wake word-specific responses
Usage:
python3 voice_server_enhanced.py \
--enable-precise \
--multi-wake-word \
--enable-speaker-id
"""
import os
import sys
import json
import argparse
import tempfile
import wave
import io
import re
import threading
import queue
import time
from pathlib import Path
from typing import Optional, Dict, Any, Tuple, List
import whisper
import requests
from flask import Flask, request, jsonify, send_file
from werkzeug.exceptions import BadRequest
try:
from dotenv import load_dotenv
load_dotenv()
except ImportError:
pass
# Mycroft Precise
PRECISE_AVAILABLE = False
try:
from precise_runner import PreciseEngine, PreciseRunner
import pyaudio
PRECISE_AVAILABLE = True
except ImportError:
print("Warning: Mycroft Precise not installed")
# Speaker identification
SPEAKER_ID_AVAILABLE = False
try:
from pyannote.audio import Inference
from scipy.spatial.distance import cosine
import numpy as np
SPEAKER_ID_AVAILABLE = True
except ImportError:
print("Warning: Speaker ID not available. Install: pip install pyannote.audio scipy")
# Configuration
DEFAULT_HOST = "0.0.0.0"
DEFAULT_PORT = 5000
DEFAULT_WHISPER_MODEL = "medium"
DEFAULT_HA_URL = os.getenv("HA_URL", "http://homeassistant.local:8123")
DEFAULT_HA_TOKEN = os.getenv("HA_TOKEN", "")
DEFAULT_PRECISE_ENGINE = "/usr/local/bin/precise-engine"
DEFAULT_HF_TOKEN = os.getenv("HF_TOKEN", "")
# Wake word configurations
WAKE_WORD_CONFIGS = {
'hey_mycroft': {
'model': os.path.expanduser('~/precise-models/pretrained/hey-mycroft.net'),
'sensitivity': 0.5,
'response': 'Yes?',
'enabled': True,
'context': 'general'
},
'hey_computer': {
'model': os.path.expanduser('~/precise-models/hey-computer/hey-computer.net'),
'sensitivity': 0.5,
'response': 'I\'m listening',
'enabled': False, # Disabled by default (requires training)
'context': 'general'
},
'jarvis': {
'model': os.path.expanduser('~/precise-models/jarvis/jarvis.net'),
'sensitivity': 0.6,
'response': 'At your service',
'enabled': False,
'context': 'personal'
},
}
# Speaker profiles (stored in JSON file)
SPEAKER_PROFILES_FILE = os.path.expanduser('~/voice-assistant/config/speaker_profiles.json')
# Flask app
app = Flask(__name__)
app.config['MAX_CONTENT_LENGTH'] = 16 * 1024 * 1024
# Global state
whisper_model = None
ha_client = None
precise_runners = {}
precise_enabled = False
speaker_id_enabled = False
speaker_inference = None
speaker_profiles = {}
wake_word_queue = queue.Queue()
class HomeAssistantClient:
"""Client for Home Assistant API"""
def __init__(self, base_url: str, token: str):
self.base_url = base_url.rstrip('/')
self.token = token
self.session = requests.Session()
self.session.headers.update({
'Authorization': f'Bearer {token}',
'Content-Type': 'application/json'
})
def get_state(self, entity_id: str) -> Optional[Dict[str, Any]]:
try:
response = self.session.get(f'{self.base_url}/api/states/{entity_id}')
response.raise_for_status()
return response.json()
except requests.RequestException as e:
print(f"Error getting state for {entity_id}: {e}")
return None
def call_service(self, domain: str, service: str, entity_id: str, **kwargs) -> bool:
try:
data = {'entity_id': entity_id}
data.update(kwargs)
response = self.session.post(
f'{self.base_url}/api/services/{domain}/{service}',
json=data
)
response.raise_for_status()
return True
except requests.RequestException as e:
print(f"Error calling service {domain}.{service}: {e}")
return False
def turn_on(self, entity_id: str, **kwargs) -> bool:
domain = entity_id.split('.')[0]
return self.call_service(domain, 'turn_on', entity_id, **kwargs)
def turn_off(self, entity_id: str, **kwargs) -> bool:
domain = entity_id.split('.')[0]
return self.call_service(domain, 'turn_off', entity_id, **kwargs)
class SpeakerIdentification:
"""Speaker identification using pyannote.audio"""
def __init__(self, hf_token: str):
if not SPEAKER_ID_AVAILABLE:
raise ImportError("Speaker ID dependencies not available")
self.inference = Inference(
"pyannote/embedding",
window="whole",  # one embedding per file, not a sliding window
use_auth_token=hf_token
)
self.profiles = {}
def enroll_speaker(self, name: str, audio_file: str):
"""Enroll a speaker from audio file"""
embedding = self.inference(audio_file)
self.profiles[name] = {
'embedding': embedding.tolist(), # Convert to list for JSON
'enrolled': time.time()
}
print(f"Enrolled speaker: {name}")
def identify_speaker(self, audio_file: str, threshold: float = 0.7) -> str:
"""Identify speaker from audio file; returns 'unknown' below threshold"""
if not self.profiles:
return 'unknown'  # keep the return type consistent with the no-match case below
unknown_embedding = self.inference(audio_file)
best_match = None
best_similarity = 0.0
for name, profile in self.profiles.items():
known_embedding = np.array(profile['embedding'])
similarity = 1 - cosine(unknown_embedding, known_embedding)
if similarity > best_similarity:
best_similarity = similarity
best_match = name
if best_similarity >= threshold:
return best_match
return 'unknown'
def load_profiles(self, filepath: str):
"""Load speaker profiles from JSON"""
if os.path.exists(filepath):
with open(filepath, 'r') as f:
self.profiles = json.load(f)
print(f"Loaded {len(self.profiles)} speaker profiles")
def save_profiles(self, filepath: str):
"""Save speaker profiles to JSON"""
os.makedirs(os.path.dirname(filepath), exist_ok=True)
with open(filepath, 'w') as f:
json.dump(self.profiles, f, indent=2)
print(f"Saved {len(self.profiles)} speaker profiles")
def load_whisper_model(model_name: str = DEFAULT_WHISPER_MODEL):
"""Load Whisper model"""
global whisper_model
if whisper_model is None:
print(f"Loading Whisper model: {model_name}")
whisper_model = whisper.load_model(model_name)
print("Whisper model loaded")
return whisper_model
def transcribe_audio(audio_file_path: str) -> Optional[str]:
"""Transcribe audio file"""
try:
model = load_whisper_model()
result = model.transcribe(audio_file_path)
return result['text'].strip()
except Exception as e:
print(f"Error transcribing: {e}")
return None
def on_wake_word_detected(wake_word_name: str):
"""Callback factory for wake word detection"""
def callback():
config = WAKE_WORD_CONFIGS.get(wake_word_name, {})
print(f"Wake word detected: {wake_word_name}")
wake_word_queue.put({
'timestamp': time.time(),
'wake_word': wake_word_name,
'response': config.get('response', 'Yes?'),
'context': config.get('context', 'general')
})
return callback
def start_multiple_wake_words(configs: Dict[str, Dict], engine_path: str):
"""Start multiple Precise wake word listeners"""
global precise_runners, precise_enabled
if not PRECISE_AVAILABLE:
print("Error: Precise not available")
return False
active_count = 0
for name, config in configs.items():
if not config.get('enabled', False):
continue
model_path = config['model']
if not os.path.exists(model_path):
print(f"Warning: Model not found: {model_path} (skipping {name})")
continue
try:
engine = PreciseEngine(engine_path, model_path)
runner = PreciseRunner(
engine,
sensitivity=config.get('sensitivity', 0.5),
on_activation=on_wake_word_detected(name)
)
runner.start()
precise_runners[name] = runner
active_count += 1
print(f"✓ Started wake word: {name}")
print(f" Model: {model_path}")
print(f" Sensitivity: {config.get('sensitivity', 0.5)}")
except Exception as e:
print(f"✗ Failed to start {name}: {e}")
if active_count > 0:
precise_enabled = True
print(f"\nTotal active wake words: {active_count}")
return True
return False
def stop_all_wake_words():
"""Stop all wake word listeners"""
global precise_runners, precise_enabled
for name, runner in precise_runners.items():
try:
runner.stop()
print(f"Stopped wake word: {name}")
except Exception as e:
print(f"Error stopping {name}: {e}")
precise_runners = {}
precise_enabled = False
def init_speaker_identification(hf_token: str) -> Optional[SpeakerIdentification]:
"""Initialize speaker identification"""
global speaker_inference, speaker_id_enabled
if not SPEAKER_ID_AVAILABLE:
print("Speaker ID not available")
return None
try:
speaker_inference = SpeakerIdentification(hf_token)
# Load existing profiles
if os.path.exists(SPEAKER_PROFILES_FILE):
speaker_inference.load_profiles(SPEAKER_PROFILES_FILE)
speaker_id_enabled = True
print("Speaker identification initialized")
return speaker_inference
except Exception as e:
print(f"Error initializing speaker ID: {e}")
return None
# Flask routes
@app.route('/health', methods=['GET'])
def health():
"""Health check"""
return jsonify({
'status': 'healthy',
'whisper_loaded': whisper_model is not None,
'ha_connected': ha_client is not None,
'precise_enabled': precise_enabled,
'active_wake_words': list(precise_runners.keys()),
'speaker_id_enabled': speaker_id_enabled,
'enrolled_speakers': list(speaker_inference.profiles.keys()) if speaker_inference else []
})
@app.route('/wake-words', methods=['GET'])
def list_wake_words():
"""List all configured wake words"""
wake_words = []
for name, config in WAKE_WORD_CONFIGS.items():
wake_words.append({
'name': name,
'enabled': config.get('enabled', False),
'active': name in precise_runners,
'model': config['model'],
'sensitivity': config.get('sensitivity', 0.5),
'response': config.get('response', ''),
'context': config.get('context', 'general')
})
return jsonify({
'wake_words': wake_words,
'total': len(wake_words),
'active': len(precise_runners)
})
@app.route('/wake-words/<name>/enable', methods=['POST'])
def enable_wake_word(name):
"""Enable a wake word"""
if name not in WAKE_WORD_CONFIGS:
return jsonify({'error': 'Wake word not found'}), 404
config = WAKE_WORD_CONFIGS[name]
config['enabled'] = True
# Start the wake word if not already running
if name not in precise_runners:
# Restart all wake words to pick up changes
# (simpler than starting individual ones)
return jsonify({
'message': f'Enabled {name}. Restart server to activate.'
})
return jsonify({'message': f'Wake word {name} enabled'})
@app.route('/speakers/enroll', methods=['POST'])
def enroll_speaker():
"""Enroll a new speaker"""
if not speaker_id_enabled or not speaker_inference:
return jsonify({'error': 'Speaker ID not enabled'}), 400
if 'audio' not in request.files:
return jsonify({'error': 'No audio file'}), 400
name = request.form.get('name')
if not name:
return jsonify({'error': 'No speaker name provided'}), 400
audio_file = request.files['audio']
# Save temporarily
with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as temp:
audio_file.save(temp.name)
temp_path = temp.name
try:
speaker_inference.enroll_speaker(name, temp_path)
speaker_inference.save_profiles(SPEAKER_PROFILES_FILE)
return jsonify({
'message': f'Enrolled speaker: {name}',
'total_speakers': len(speaker_inference.profiles)
})
except Exception as e:
return jsonify({'error': str(e)}), 500
finally:
if os.path.exists(temp_path):
os.remove(temp_path)
@app.route('/speakers', methods=['GET'])
def list_speakers():
"""List enrolled speakers"""
if not speaker_id_enabled or not speaker_inference:
return jsonify({'error': 'Speaker ID not enabled'}), 400
speakers = []
for name, profile in speaker_inference.profiles.items():
speakers.append({
'name': name,
'enrolled': profile.get('enrolled', 0)
})
return jsonify({
'speakers': speakers,
'total': len(speakers)
})
@app.route('/process-enhanced', methods=['POST'])
def process_enhanced():
"""
Enhanced processing with speaker ID and wake word context
"""
if 'audio' not in request.files:
return jsonify({'error': 'No audio file'}), 400
wake_word = request.form.get('wake_word', 'unknown')
audio_file = request.files['audio']
with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as temp:
audio_file.save(temp.name)
temp_path = temp.name
try:
# Identify speaker (if enabled)
speaker = 'unknown'
if speaker_id_enabled and speaker_inference:
speaker = speaker_inference.identify_speaker(temp_path)
print(f"Identified speaker: {speaker}")
# Transcribe
text = transcribe_audio(temp_path)
if not text:
return jsonify({'error': 'Transcription failed'}), 500
print(f"[{speaker}] via [{wake_word}]: {text}")
# Get wake word config
config = WAKE_WORD_CONFIGS.get(wake_word, {})
context = config.get('context', 'general')
# Process based on context and speaker
response = f"Heard via {wake_word}: {text}"
return jsonify({
'success': True,
'transcription': text,
'speaker': speaker,
'wake_word': wake_word,
'context': context,
'response': response
})
finally:
if os.path.exists(temp_path):
os.remove(temp_path)
def main():
parser = argparse.ArgumentParser(
description="Enhanced Voice Server with Multi-Wake-Word and Speaker ID"
)
parser.add_argument('--host', default=DEFAULT_HOST)
parser.add_argument('--port', type=int, default=DEFAULT_PORT)
parser.add_argument('--whisper-model', default=DEFAULT_WHISPER_MODEL)
parser.add_argument('--ha-url', default=DEFAULT_HA_URL)
parser.add_argument('--ha-token', default=DEFAULT_HA_TOKEN)
parser.add_argument('--enable-precise', action='store_true',
help='Enable wake word detection')
parser.add_argument('--multi-wake-word', action='store_true',
help='Enable multiple wake words')
parser.add_argument('--precise-engine', default=DEFAULT_PRECISE_ENGINE)
parser.add_argument('--enable-speaker-id', action='store_true',
help='Enable speaker identification')
parser.add_argument('--hf-token', default=DEFAULT_HF_TOKEN,
help='HuggingFace token for speaker ID')
args = parser.parse_args()
# Initialize HA client
global ha_client
ha_client = HomeAssistantClient(args.ha_url, args.ha_token)
# Load Whisper
print(f"Starting enhanced voice server on {args.host}:{args.port}")
load_whisper_model(args.whisper_model)
# Start Precise (multiple wake words)
if args.enable_precise:
if not PRECISE_AVAILABLE:
print("Error: Precise not available")
sys.exit(1)
# Enable all or just first wake word
if args.multi_wake_word:
# Enable all configured wake words
enabled_count = sum(1 for c in WAKE_WORD_CONFIGS.values() if c.get('enabled'))
print(f"\nStarting {enabled_count} wake words...")
else:
# Enable only first wake word
first_key = list(WAKE_WORD_CONFIGS.keys())[0]
WAKE_WORD_CONFIGS[first_key]['enabled'] = True
for key in list(WAKE_WORD_CONFIGS.keys())[1:]:
WAKE_WORD_CONFIGS[key]['enabled'] = False
if not start_multiple_wake_words(WAKE_WORD_CONFIGS, args.precise_engine):
print("Error: No wake words started")
sys.exit(1)
# Initialize speaker ID
if args.enable_speaker_id:
if not args.hf_token:
print("Error: --hf-token required for speaker ID")
sys.exit(1)
if not init_speaker_identification(args.hf_token):
print("Warning: Speaker ID initialization failed")
# Start server
try:
print("\n" + "="*50)
print("Server ready!")
print("="*50 + "\n")
app.run(host=args.host, port=args.port, debug=False)
except KeyboardInterrupt:
print("\nShutting down...")
stop_all_wake_words()
sys.exit(0)
if __name__ == '__main__':
main()
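`SpeakerIdentification.identify_speaker` ranks enrolled profiles by cosine similarity and falls back to `'unknown'` below the threshold. A self-contained sketch of that scoring, using synthetic 4-dimensional vectors in place of real pyannote embeddings (which are much higher-dimensional; profile names are illustrative):

```python
import numpy as np

# Synthetic "voice embeddings" standing in for pyannote output.
profiles = {
    'alan':  np.array([1.0, 0.0, 0.0, 0.0]),
    'guest': np.array([0.0, 1.0, 0.0, 0.0]),
}

def identify(embedding, threshold=0.7):
    """Return the best-matching profile name, or 'unknown' below threshold."""
    best_name, best_sim = 'unknown', 0.0
    for name, known in profiles.items():
        # Cosine similarity = 1 - cosine distance.
        sim = float(np.dot(embedding, known) /
                    (np.linalg.norm(embedding) * np.linalg.norm(known)))
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim >= threshold else 'unknown'

print(identify(np.array([0.9, 0.1, 0.0, 0.0])))  # 'alan'
```

The threshold is the safety valve: an embedding roughly equidistant from all profiles scores low everywhere and is reported as `'unknown'` rather than misattributed.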