feat: import mycroft-precise work as Minerva foundation
Ports prior voice assistant research and prototypes from devl/Devops into the Minerva repo. Includes:

- docs/: architecture, wake word guides, ESP32-S3 spec, hardware buying guide
- scripts/: voice_server.py, voice_server_enhanced.py, setup scripts
- hardware/maixduino/: edge device scripts with WiFi credentials scrubbed (hardcoded password replaced with the secrets.py pattern)
- config/.env.example: server config template
- .gitignore: excludes .env, secrets.py, model blobs, ELF firmware
- CLAUDE.md: Minerva product context and connection to cf-voice roadmap
Parent: fca5a107de
Commit: 173f7f37d4
30 changed files with 12519 additions and 0 deletions
.gitignore (vendored, new file, 29 lines)
@@ -0,0 +1,29 @@
# Credentials
secrets.py
config/.env
*.env
!*.env.example

# Models (large binary files)
models/*.pb
models/*.pb.params
models/*.net
models/*.tflite
models/*.kmodel

# OEM firmware blobs
*.elf
*.7z
*.bin

# Python
__pycache__/
*.pyc
*.pyo

# Logs
logs/

# IDE
.vscode/
.idea/
CLAUDE.md (new file, 165 lines)
@@ -0,0 +1,165 @@
# Minerva — Developer Context

**Product code:** `MNRV`
**Status:** Concept / early prototype
**Domain:** Privacy-first, local-only voice assistant hardware platform

---

## What Minerva Is

A 100% local, FOSS voice assistant hardware platform. No cloud. No subscriptions. No data leaving the local network.

The goal is a reference hardware + software stack for a privacy-first voice assistant that anyone can build, extend, or self-host — including people without technical backgrounds if the assembly docs are good enough.

Core design principles (same as all CF products):
- **Local-first inference** — Whisper STT, Piper TTS, Mycroft Precise wake word all run on the host server
- **Edge where possible** — wake word detection moves to edge hardware over time (K210 → ESP32-S3 → custom)
- **No cloud dependency** — Home Assistant optional, not required
- **100% FOSS stack**

---

## Hardware Targets

### Phase 1 (current): Maix Duino (K210)
- K210 dual-core RISC-V @ 400MHz with KPU neural accelerator
- Audio: I2S microphone + speaker output
- Connectivity: ESP32 WiFi/BLE co-processor
- Programming: MaixPy (MicroPython)
- Status: server-side wake word working; edge inference in progress

### Phase 2: ESP32-S3
- More accessible, cheaper, better WiFi
- On-device wake word with Espressif ESP-SR
- See `docs/ESP32_S3_VOICE_ASSISTANT_SPEC.md`

### Phase 3: Custom hardware
- Dedicated PCB for CF reference platform
- Hardware-accelerated wake word + VAD
- Designed for accessibility: large buttons, LED feedback, easy mounting

---

## Software Stack

### Edge device (Maix Duino / ESP32-S3)
- Firmware: MaixPy or ESP-IDF
- Client: `hardware/maixduino/maix_voice_client.py`
- Audio: I2S capture and playback
- Network: WiFi → Minerva server

### Server (runs on Heimdall or any Linux box)
- Voice server: `scripts/voice_server.py` (Flask + Whisper + Precise)
- Enhanced version: `scripts/voice_server_enhanced.py` (adds speaker ID via pyannote)
- STT: Whisper (local)
- Wake word: Mycroft Precise
- TTS: Piper
- Home Assistant: REST API integration (optional)
- Conda env: `whisper_cli` (existing on Heimdall)

---

## Directory Structure

```
minerva/
├── docs/                               # Architecture, guides, reference docs
│   ├── maix-voice-assistant-architecture.md
│   ├── MYCROFT_PRECISE_GUIDE.md
│   ├── PRECISE_DEPLOYMENT.md
│   ├── ESP32_S3_VOICE_ASSISTANT_SPEC.md
│   ├── HARDWARE_BUYING_GUIDE.md
│   ├── LCD_CAMERA_FEATURES.md
│   ├── K210_PERFORMANCE_VERIFICATION.md
│   ├── WAKE_WORD_ADVANCED.md
│   ├── ADVANCED_WAKE_WORD_TOPICS.md
│   └── QUESTIONS_ANSWERED.md
├── scripts/                            # Server-side scripts
│   ├── voice_server.py                 # Core Flask + Whisper + Precise server
│   ├── voice_server_enhanced.py        # + speaker identification (pyannote)
│   ├── setup_voice_assistant.sh        # Server setup
│   ├── setup_precise.sh                # Mycroft Precise training environment
│   └── download_pretrained_models.sh
├── hardware/
│   └── maixduino/                      # K210 edge device scripts
│       ├── maix_voice_client.py        # Production client
│       ├── maix_simple_record_test.py  # Audio capture test
│       ├── maix_test_simple.py         # Hardware/network test
│       ├── maix_debug_wifi.py          # WiFi diagnostics
│       ├── maix_discover_modules.py    # Module discovery
│       ├── secrets.py.example          # WiFi/server credential template
│       ├── MICROPYTHON_QUIRKS.md
│       └── README.md
├── config/
│   └── .env.example                    # Server config template
├── models/                             # Wake word models (gitignored, large)
└── CLAUDE.md                           # This file
```

---

## Credentials / Secrets

**Never commit real credentials.** Pattern:

- Server: copy `config/.env.example` → `config/.env`, fill in real values
- Edge device: copy `hardware/maixduino/secrets.py.example` → `secrets.py`, fill in WiFi + server URL

Both files are gitignored. `.example` files are committed as templates.
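The copy pattern can be sketched as a script. This demo runs in a scratch directory with placeholder template contents (the real templates ship in this commit), so it illustrates the pattern rather than the exact repo state:

```shell
set -eu
# Scratch layout standing in for the repo; template contents are placeholders.
tmp=$(mktemp -d)
mkdir -p "$tmp/config" "$tmp/hardware/maixduino"
printf 'SERVER_PORT=5000\n' > "$tmp/config/.env.example"
printf 'WIFI_SSID = "changeme"\n' > "$tmp/hardware/maixduino/secrets.py.example"

# The pattern: copy template -> real file, then edit in real values.
cp "$tmp/config/.env.example" "$tmp/config/.env"
cp "$tmp/hardware/maixduino/secrets.py.example" "$tmp/hardware/maixduino/secrets.py"

ls -A "$tmp/config"
```

Because `.env` and `secrets.py` match the `.gitignore` rules above, `git status` stays clean after this step.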
---

## Running the Server

```bash
# Activate environment
conda activate whisper_cli

# Basic server (Whisper + Precise wake word)
python scripts/voice_server.py \
    --enable-precise \
    --precise-model models/hey-minerva.net \
    --precise-sensitivity 0.5

# Enhanced server (+ speaker identification)
python scripts/voice_server_enhanced.py \
    --enable-speaker-id \
    --hf-token $HF_TOKEN

# Test health
curl http://localhost:5000/health
curl http://localhost:5000/wake-word/status
```

---

## Connection to CF Voice Infrastructure

Minerva is the **hardware platform** for cf-voice. As `circuitforge_core.voice` matures:

- `cf_voice.io` (STT/TTS) → replaces the ad hoc Whisper/Piper calls in `voice_server.py`
- `cf_voice.context` (parallel classifier) → augments Mycroft Precise with tone/environment detection
- `cf_voice.telephony` → future: Minerva as an always-on household linnet node

Minerva hardware + cf-voice software = the CF reference voice assistant stack.

---

## Roadmap

See Forgejo milestones on this repo. High-level:

1. **Alpha — Server-side pipeline** — Whisper + Precise + Piper working end-to-end on Heimdall
2. **Beta — Edge wake word** — wake word on K210 or ESP32-S3; audio only streams post-wake
3. **Hardware v1** — documented reference build; buying guide; assembly instructions
4. **cf-voice integration** — Minerva uses cf_voice modules from circuitforge-core
5. **Platform** — multiple hardware targets; custom PCB design

---

## Related

- `cf-voice` module design: `circuitforge-plans/circuitforge-core/2026-04-06-cf-voice-design.md`
- `linnet` product: real-time tone annotation, will eventually embed Minerva as a hardware node
- Heimdall server: primary dev/deployment target (10.1.10.71 on LAN)
config/.env.example (new file, 24 lines)
@@ -0,0 +1,24 @@
# Minerva Voice Server — configuration
# Copy to config/.env and fill in real values. Never commit .env.

# Server
SERVER_HOST=0.0.0.0
SERVER_PORT=5000

# Whisper STT
WHISPER_MODEL=base

# Mycroft Precise wake word
# PRECISE_MODEL=/path/to/wake-word.net
# PRECISE_SENSITIVITY=0.5

# Home Assistant integration (optional)
# HA_URL=http://homeassistant.local:8123
# HA_TOKEN=your_long_lived_access_token_here

# HuggingFace (for speaker identification, optional)
# HF_TOKEN=your_huggingface_token_here

# Logging
LOG_LEVEL=INFO
LOG_FILE=logs/minerva.log
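How `voice_server.py` actually consumes this file is not shown in this commit; as an illustration, a dependency-free `KEY=VALUE` parser over the template format above (skipping blank lines and `#` comments) could look like:

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blanks and '#' comment lines."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        key, _, value = line.partition('=')
        config[key.strip()] = value.strip()
    return config

config = parse_env("""
# Server
SERVER_HOST=0.0.0.0
SERVER_PORT=5000
LOG_LEVEL=INFO
""")
# config['SERVER_PORT'] == '5000'
```

A library such as python-dotenv would handle the same format, including quoting edge cases this sketch ignores.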
docs/ADVANCED_WAKE_WORD_TOPICS.md (new executable file, 905 lines)
@@ -0,0 +1,905 @@
# Advanced Wake Word Topics - Pre-trained Models, Multiple Wake Words, and Voice Adaptation

## Pre-trained Mycroft Models

### Yes! Pre-trained Models Exist

Mycroft AI provides several pre-trained wake word models you can use immediately:

**Available Models:**
- **Hey Mycroft** - Original Mycroft wake word (most training data)
- **Hey Jarvis** - Popular alternative
- **Christopher** - Alternative wake word
- **Hey Ezra** - Another option

### Download Pre-trained Models

```bash
# On Heimdall
conda activate precise
cd ~/precise-models

# Create directory for pre-trained models
mkdir -p pretrained
cd pretrained

# Download Hey Mycroft (recommended starting point)
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz

# Download other models
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-jarvis.tar.gz
tar xzf hey-jarvis.tar.gz

# List available models
ls -lh *.net
```

### Test Pre-trained Model

```bash
conda activate precise

# Test Hey Mycroft
precise-listen hey-mycroft.net

# Speak "Hey Mycroft" - should see "!" when detected
# Press Ctrl+C to exit

# Test with different threshold
precise-listen hey-mycroft.net -t 0.7  # More conservative
```

### Use Pre-trained Model in Voice Server

```bash
cd ~/voice-assistant

# Start server with Hey Mycroft model
python voice_server.py \
    --enable-precise \
    --precise-model ~/precise-models/pretrained/hey-mycroft.net \
    --precise-sensitivity 0.5
```

### Fine-tune Pre-trained Models

You can use pre-trained models as a **starting point** and fine-tune with your voice:

```bash
cd ~/precise-models
mkdir -p hey-mycroft-custom

# Copy base model
cp pretrained/hey-mycroft.net hey-mycroft-custom/

# Collect your samples
cd hey-mycroft-custom
precise-collect  # Record 20-30 samples of YOUR voice

# Fine-tune from pre-trained model
precise-train -e 30 hey-mycroft-custom.net . \
    --from-checkpoint ../pretrained/hey-mycroft.net

# This is MUCH faster than training from scratch!
```

**Benefits:**
- ✅ Start with proven model
- ✅ Much less training data needed (20-30 vs 100+ samples)
- ✅ Faster training (30 mins vs 60 mins)
- ✅ Good baseline accuracy

## Multiple Wake Words

### Architecture Options

#### Option 1: Multiple Models in Parallel (Server-Side Only)

Run multiple Precise instances simultaneously:

```python
# In voice_server.py - multiple wake word detection

import queue
import time

from precise_runner import PreciseEngine, PreciseRunner

# Global runners and detection queue (shared with the request handlers)
precise_runners = {}
wake_word_queue = queue.Queue()

def on_wake_word_detected(wake_word_name):
    """Callback factory for different wake words"""
    def callback():
        print(f"Wake word detected: {wake_word_name}")
        wake_word_queue.put({
            'wake_word': wake_word_name,
            'timestamp': time.time()
        })
    return callback

def start_multiple_wake_words(wake_word_configs):
    """
    Start multiple wake word detectors

    Args:
        wake_word_configs: List of dicts with 'name', 'model', 'sensitivity'

    Example:
        configs = [
            {'name': 'hey mycroft', 'model': 'hey-mycroft.net', 'sensitivity': 0.5},
            {'name': 'hey jarvis', 'model': 'hey-jarvis.net', 'sensitivity': 0.5}
        ]
    """
    global precise_runners

    for config in wake_word_configs:
        engine = PreciseEngine(
            '/usr/local/bin/precise-engine',
            config['model']
        )

        runner = PreciseRunner(
            engine,
            sensitivity=config['sensitivity'],
            on_activation=on_wake_word_detected(config['name'])
        )

        runner.start()
        precise_runners[config['name']] = runner

        print(f"Started wake word detector: {config['name']}")
```

**Server-Side Multiple Wake Words:**
```bash
# Start server with multiple wake words
python voice_server.py \
    --enable-precise \
    --precise-models "hey-mycroft:~/models/hey-mycroft.net:0.5,hey-jarvis:~/models/hey-jarvis.net:0.5"
```

**Performance Impact:**
- CPU: ~5-10% per model (can run 2-3 easily)
- Memory: ~50-100MB per model
- Latency: minimal (all run in parallel)

#### Option 2: Single Model, Multiple Phrases (Edge or Server)

Train ONE model that responds to multiple phrases:

```bash
cd ~/precise-models/multi-wake
conda activate precise

# Record samples for BOTH wake words in the SAME dataset
# Label all as "wake-word" regardless of which phrase

mkdir -p wake-word not-wake-word

# Record "Hey Mycroft" samples
precise-collect  # Save to wake-word/hey-mycroft-*.wav

# Record "Hey Computer" samples
precise-collect  # Save to wake-word/hey-computer-*.wav

# Record negatives
precise-collect -f not-wake-word/random.wav

# Train single model on both phrases
precise-train -e 60 multi-wake.net .
```

**Pros:**
- ✅ Single model = less compute
- ✅ Works on edge (K210)
- ✅ Easy to deploy

**Cons:**
- ❌ Can't tell which wake word was used
- ❌ May reduce accuracy for each individual phrase
- ❌ Higher false positive risk

#### Option 3: Sequential Detection (Edge)

Detect wake word, then identify which one:

```python
# Pseudo-code for edge detection
if wake_word_detected():
    audio_snippet = last_2_seconds()

    # Run all models on the audio snippet
    scores = {
        'hey-mycroft': model1.score(audio_snippet),
        'hey-jarvis': model2.score(audio_snippet),
        'hey-computer': model3.score(audio_snippet)
    }

    # Use highest scoring wake word
    wake_word = max(scores, key=scores.get)
```

### Recommendations

**Server-Side (Heimdall):**
- ✅ **Use Option 1** - multiple models in parallel
- Run 2-3 wake words easily
- Each can have different sensitivity
- Can identify which wake word was used
- Example: "Hey Mycroft" for commands, "Hey Jarvis" for queries

**Edge (Maix Duino K210):**
- ✅ **Use Option 2** - single multi-phrase model
- K210 can handle 1 model efficiently
- Train on 2-3 phrases max
- Simpler deployment
- Lower latency

## Voice Adaptation & Multi-User Support

### Approach 1: Inclusive Training (Recommended)

Train ONE model on EVERYONE'S voices:

```bash
cd ~/precise-models/family-wake-word
conda activate precise

# Record samples from each family member
# Alice records 30 samples
precise-collect  # Save as wake-word/alice-*.wav

# Bob records 30 samples
precise-collect  # Save as wake-word/bob-*.wav

# Carol records 30 samples
precise-collect  # Save as wake-word/carol-*.wav

# Train on all voices
precise-train -e 60 family-wake-word.net .
```

**Pros:**
- ✅ Everyone can use the system
- ✅ Single model deployment
- ✅ Works for all family members
- ✅ Simple maintenance

**Cons:**
- ❌ Can't identify who spoke
- ❌ May need more training data
- ❌ No personalization

**Best for:** Family voice assistant, shared devices

### Approach 2: Speaker Identification (Advanced)

Detect wake word, then identify speaker:

```python
# Architecture with speaker ID (pseudo-code)

# Step 1: Precise detects wake word
if wake_word_detected():

    # Step 2: Capture voice sample
    voice_sample = record_audio(duration=3)

    # Step 3: Speaker identification
    speaker = identify_speaker(voice_sample)
    # Uses voice embeddings/neural network

    # Step 4: Process with user context
    process_command(voice_sample, user=speaker)
```

**Implementation Options:**

#### Option A: Use resemblyzer (Voice Embeddings)
```bash
pip install resemblyzer --break-system-packages

# Enrollment phase
python enroll_users.py
# Each user records 10-20 seconds of speech
# System creates voice profile (embedding)

# Runtime
python speaker_id.py
# Compares incoming audio to stored embeddings
# Returns most likely speaker
```

**Example Code:**
```python
import os

import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

# Initialize encoder
encoder = VoiceEncoder()

# Enrollment - do once per user
def enroll_user(name, audio_files):
    """Create voice profile for user"""
    embeddings = []

    for audio_file in audio_files:
        wav = preprocess_wav(audio_file)
        embedding = encoder.embed_utterance(wav)
        embeddings.append(embedding)

    # Average embeddings for robustness
    user_profile = np.mean(embeddings, axis=0)

    # Save profile
    np.save(f'profiles/{name}.npy', user_profile)
    return user_profile

# Identification - run each time
def identify_speaker(audio_file, profiles_dir='profiles'):
    """Identify which enrolled user is speaking"""
    wav = preprocess_wav(audio_file)
    test_embedding = encoder.embed_utterance(wav)

    # Load all profiles
    profiles = {}
    for profile_file in os.listdir(profiles_dir):
        name = profile_file.replace('.npy', '')
        profile = np.load(os.path.join(profiles_dir, profile_file))
        profiles[name] = profile

    # Calculate similarity to each profile
    similarities = {}
    for name, profile in profiles.items():
        similarity = np.dot(test_embedding, profile)
        similarities[name] = similarity

    # Return most similar
    best_match = max(similarities, key=similarities.get)
    confidence = similarities[best_match]

    if confidence > 0.7:  # Threshold
        return best_match
    else:
        return "unknown"
```
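A note on the `np.dot` comparison above: resemblyzer's utterance embeddings are L2-normalized, so a dot product between two of them equals cosine similarity. The *averaged* enrollment profile, however, is generally shorter than unit length, so the raw dot underestimates similarity against it; renormalizing (or computing cosine explicitly) avoids that. A small self-contained sketch with toy 2-D vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity; equals np.dot(a, b) only for unit-length vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two unit vectors: raw dot product already equals cosine similarity.
a = np.array([1.0, 0.0])
b = np.array([0.6, 0.8])
print(np.dot(a, b), cosine_similarity(a, b))

# An averaged profile is shorter than unit length (norm ~0.89 here),
# so a raw dot against it underestimates the true cosine similarity.
profile = np.mean([a, b], axis=0)
print(np.dot(b, profile), cosine_similarity(b, profile))
```

In practice this mainly shifts where the 0.7 threshold should sit; using `cosine_similarity` keeps the threshold meaningful regardless of profile norm.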

#### Option B: Use pyannote.audio (Production-grade)
```bash
pip install pyannote.audio --break-system-packages

# Requires HuggingFace token (same as diarization)
```

**Example:**
```python
from pyannote.audio import Inference
from scipy.spatial.distance import cosine

# Initialize (window="whole" yields one embedding per file)
inference = Inference(
    "pyannote/embedding",
    window="whole",
    use_auth_token="your_hf_token"
)

# Enroll users
alice_profile = inference("alice_sample.wav")
bob_profile = inference("bob_sample.wav")

# Identify
test_embedding = inference("test_audio.wav")

# Compare
alice_similarity = 1 - cosine(test_embedding, alice_profile)
bob_similarity = 1 - cosine(test_embedding, bob_profile)

if alice_similarity > bob_similarity and alice_similarity > 0.7:
    speaker = "Alice"
elif bob_similarity > 0.7:
    speaker = "Bob"
else:
    speaker = "Unknown"
```

**Pros:**
- ✅ Can identify individual users
- ✅ Personalized responses
- ✅ User-specific commands/permissions
- ✅ Better for privacy (know who's speaking)

**Cons:**
- ❌ More complex implementation
- ❌ Requires enrollment phase
- ❌ Additional processing time (~100-200ms)
- ❌ May fail with similar voices

### Approach 3: Per-User Wake Word Models

Each person has their OWN wake word:

```bash
# Alice's wake word: "Hey Mycroft"
# Train on ONLY Alice's voice

# Bob's wake word: "Hey Jarvis"
# Train on ONLY Bob's voice

# Carol's wake word: "Hey Computer"
# Train on ONLY Carol's voice
```

**Deployment:**
Run all 3 models in parallel (server-side):
```python
wake_word_configs = [
    {'name': 'Alice', 'wake_word': 'hey mycroft', 'model': 'alice-wake.net'},
    {'name': 'Bob', 'wake_word': 'hey jarvis', 'model': 'bob-wake.net'},
    {'name': 'Carol', 'wake_word': 'hey computer', 'model': 'carol-wake.net'}
]
```

**Pros:**
- ✅ Automatic user identification
- ✅ Highest accuracy per user
- ✅ Clear user separation
- ✅ No additional speaker ID needed

**Cons:**
- ❌ Requires 3x models (server only)
- ❌ Users must remember their wake word
- ❌ 3x CPU usage (~15-30%)
- ❌ Can't work on edge (K210)

### Approach 4: Context-Based Adaptation

No speaker ID, but learn from interaction:

```python
# Track command patterns
user_context = {
    'last_command': 'turn on living room lights',
    'frequent_entities': ['light.living_room', 'light.bedroom'],
    'time_of_day_patterns': {'morning': 'coffee maker', 'evening': 'tv'},
    'location': 'home'  # vs 'away'
}

# Use context to improve intent recognition (sketch)
if command == "turn on the lights" and is_morning():
    # Probably means bedroom lights (based on history)
    entity = user_context['frequent_entities'][0]
```

**Pros:**
- ✅ No enrollment needed
- ✅ Improves over time
- ✅ Simple to implement
- ✅ Works with any number of users

**Cons:**
- ❌ No true user identification
- ❌ May make incorrect assumptions
- ❌ Privacy concerns (tracking behavior)

## Recommended Strategy

### For Your Use Case

Based on your home lab setup, I recommend:

#### Phase 1: Single Wake Word, Inclusive Training (Week 1-2)
```bash
# Start simple
cd ~/precise-models/hey-computer
conda activate precise

# Have all family members record samples
# Alice: 30 samples of "Hey Computer"
# Bob: 30 samples of "Hey Computer"
# You: 30 samples of "Hey Computer"

# Train single model on all voices
precise-train -e 60 hey-computer.net .

# Deploy to server
python voice_server.py \
    --enable-precise \
    --precise-model hey-computer.net
```

**Why:**
- Simple to set up and test
- Everyone can use it immediately
- Single model = easier debugging
- Works on edge if you migrate later

#### Phase 2: Add Speaker Identification (Week 3-4)
```bash
# Install resemblyzer
pip install resemblyzer --break-system-packages

# Enroll users
python enroll_users.py
# Each person speaks for 20 seconds

# Update voice_server.py to identify speaker
# Use speaker ID for personalized responses
```

**Why:**
- Enables personalization
- Can track preferences per user
- User-specific command permissions
- Better privacy (know who's speaking)

#### Phase 3: Multiple Wake Words (Month 2+)
```bash
# Add alternative wake words for different contexts
# "Hey Mycroft" - General commands
# "Hey Jarvis" - Media/Plex control
# "Computer" - Quick commands (lights, temp)

# Deploy multiple models on server
python voice_server.py \
    --enable-precise \
    --precise-models "mycroft:hey-mycroft.net:0.5,jarvis:hey-jarvis.net:0.5"
```

**Why:**
- Different wake words for different contexts
- Reduces false positives (more specific triggers)
- Fun factor (Jarvis for media!)
- Server can handle 2-3 easily

## Implementation Guide: Multiple Wake Words

### Update voice_server.py for Multiple Wake Words

```python
# Add to voice_server.py

def start_multiple_wake_words(configs):
    """
    Start multiple wake word detectors

    Args:
        configs: List of dicts with 'name', 'model_path', 'sensitivity'
    """
    global precise_runners
    precise_runners = {}

    for config in configs:
        try:
            engine = PreciseEngine(
                DEFAULT_PRECISE_ENGINE,
                config['model_path']
            )

            def make_callback(wake_word_name):
                def callback():
                    print(f"Wake word detected: {wake_word_name}")
                    wake_word_queue.put({
                        'wake_word': wake_word_name,
                        'timestamp': time.time(),
                        'source': 'precise'
                    })
                return callback

            runner = PreciseRunner(
                engine,
                sensitivity=config['sensitivity'],
                on_activation=make_callback(config['name'])
            )

            runner.start()
            precise_runners[config['name']] = runner

            print(f"✓ Started: {config['name']} (sensitivity: {config['sensitivity']})")

        except Exception as e:
            print(f"✗ Failed to start {config['name']}: {e}")

    return len(precise_runners) > 0

# Add to main()
parser.add_argument('--precise-models',
                    help='Multiple models: name:path:sensitivity,name2:path2:sensitivity2')

# Parse multiple models
if args.precise_models:
    configs = []
    for model_spec in args.precise_models.split(','):
        name, path, sensitivity = model_spec.split(':')
        configs.append({
            'name': name,
            'model_path': os.path.expanduser(path),
            'sensitivity': float(sensitivity)
        })

    start_multiple_wake_words(configs)
```
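The `name:path:sensitivity` parsing above can be pulled out and exercised standalone (same format and same `split(':')` behavior, which assumes paths contain no literal `:`):

```python
import os

def parse_precise_models(spec):
    """Parse 'name:path:sensitivity[,name2:path2:sensitivity2]...' into
    the config dicts expected by start_multiple_wake_words()."""
    configs = []
    for model_spec in spec.split(','):
        name, path, sensitivity = model_spec.split(':')
        configs.append({
            'name': name,
            'model_path': os.path.expanduser(path),
            'sensitivity': float(sensitivity),
        })
    return configs

configs = parse_precise_models("mycroft:hey-mycroft.net:0.5,jarvis:hey-jarvis.net:0.4")
# configs[0]['name'] == 'mycroft'; configs[1]['sensitivity'] == 0.4
```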
|
||||||
|
|
||||||
|
### Usage Example
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cd ~/voice-assistant
|
||||||
|
|
||||||
|
# Start with multiple wake words
|
||||||
|
python voice_server.py \
|
||||||
|
--enable-precise \
|
||||||
|
--precise-models "\
|
||||||
|
hey-mycroft:~/precise-models/pretrained/hey-mycroft.net:0.5,\
|
||||||
|
hey-jarvis:~/precise-models/pretrained/hey-jarvis.net:0.5"
|
||||||
|
```
## Implementation Guide: Speaker Identification

### Add to voice_server.py

```python
# Add resemblyzer support
try:
    from resemblyzer import VoiceEncoder, preprocess_wav
    import numpy as np
    SPEAKER_ID_AVAILABLE = True
except ImportError:
    SPEAKER_ID_AVAILABLE = False
    print("Warning: resemblyzer not available. Speaker ID disabled.")

# Initialize encoder
voice_encoder = None
speaker_profiles = {}


def load_speaker_profiles(profiles_dir='~/voice-assistant/profiles'):
    """Load enrolled speaker profiles"""
    global speaker_profiles, voice_encoder

    if not SPEAKER_ID_AVAILABLE:
        return False

    profiles_dir = os.path.expanduser(profiles_dir)

    if not os.path.exists(profiles_dir):
        print(f"No speaker profiles found at {profiles_dir}")
        return False

    # Initialize encoder
    voice_encoder = VoiceEncoder()

    # Load all profiles
    for profile_file in os.listdir(profiles_dir):
        if profile_file.endswith('.npy'):
            name = profile_file.replace('.npy', '')
            profile = np.load(os.path.join(profiles_dir, profile_file))
            speaker_profiles[name] = profile
            print(f"Loaded speaker profile: {name}")

    return len(speaker_profiles) > 0


def identify_speaker(audio_path, threshold=0.7):
    """Identify speaker from audio file"""
    if not SPEAKER_ID_AVAILABLE or not speaker_profiles:
        return None

    try:
        # Get embedding for test audio
        wav = preprocess_wav(audio_path)
        test_embedding = voice_encoder.embed_utterance(wav)

        # Compare to all profiles
        similarities = {}
        for name, profile in speaker_profiles.items():
            similarity = np.dot(test_embedding, profile)
            similarities[name] = similarity

        # Get best match
        best_match = max(similarities, key=similarities.get)
        confidence = similarities[best_match]

        print(f"Speaker ID: {best_match} (confidence: {confidence:.2f})")

        if confidence > threshold:
            return best_match
        else:
            return "unknown"

    except Exception as e:
        print(f"Error identifying speaker: {e}")
        return None


# Update process endpoint to include speaker ID
@app.route('/process', methods=['POST'])
def process():
    """Process complete voice command with speaker identification"""
    # ... existing code ...

    # Add speaker identification
    speaker = identify_speaker(temp_path) if speaker_profiles else None

    if speaker:
        print(f"Detected speaker: {speaker}")
        # Could personalize response based on speaker

    # ... rest of processing ...
```
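A note on the `np.dot(test_embedding, profile)` comparison above: it equals cosine similarity only because resemblyzer's utterance embeddings come back unit-length. If you ever average several enrollment embeddings into one profile (a common tweak, but an assumption here, not something voice_server.py does), the average is no longer unit-length, so normalize explicitly. A minimal sketch:

```python
import numpy as np

def cosine_similarity(a, b):
    # Explicit cosine similarity; equivalent to np.dot(a, b)
    # only when both vectors are already unit-length.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])  # same direction, different magnitude
print(cosine_similarity(a, b))  # → 1.0 (a raw dot product would give 50.0)
```

Using the normalized form costs almost nothing and keeps the `threshold=0.7` comparison meaningful regardless of how the profile vector was produced.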
### Enrollment Script

Create `enroll_speaker.py`:

```python
#!/usr/bin/env python3
"""
Enroll users for speaker identification

Usage:
    python enroll_speaker.py --name Alice --audio alice_sample.wav
    python enroll_speaker.py --name Alice --duration 20  # Record live
"""

import argparse
import os
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav
import pyaudio
import wave


def record_audio(duration=20, sample_rate=16000):
    """Record audio from microphone"""
    print(f"Recording for {duration} seconds...")
    print("Speak naturally - read a paragraph, have a conversation, etc.")

    chunk = 1024
    format = pyaudio.paInt16
    channels = 1

    p = pyaudio.PyAudio()

    stream = p.open(
        format=format,
        channels=channels,
        rate=sample_rate,
        input=True,
        frames_per_buffer=chunk
    )

    frames = []
    for i in range(0, int(sample_rate / chunk * duration)):
        data = stream.read(chunk)
        frames.append(data)

    stream.stop_stream()
    stream.close()
    p.terminate()

    # Save to temp file
    temp_file = f"/tmp/enrollment_{os.getpid()}.wav"
    wf = wave.open(temp_file, 'wb')
    wf.setnchannels(channels)
    wf.setsampwidth(p.get_sample_size(format))
    wf.setframerate(sample_rate)
    wf.writeframes(b''.join(frames))
    wf.close()

    return temp_file


def enroll_speaker(name, audio_file, profiles_dir='~/voice-assistant/profiles'):
    """Create voice profile for speaker"""
    profiles_dir = os.path.expanduser(profiles_dir)
    os.makedirs(profiles_dir, exist_ok=True)

    # Initialize encoder
    encoder = VoiceEncoder()

    # Process audio
    wav = preprocess_wav(audio_file)
    embedding = encoder.embed_utterance(wav)

    # Save profile
    profile_path = os.path.join(profiles_dir, f'{name}.npy')
    np.save(profile_path, embedding)

    print(f"✓ Enrolled speaker: {name}")
    print(f"  Profile saved to: {profile_path}")

    return profile_path


def main():
    parser = argparse.ArgumentParser(description="Enroll speaker for voice identification")
    parser.add_argument('--name', required=True, help='Speaker name')
    parser.add_argument('--audio', help='Path to audio file (wav)')
    parser.add_argument('--duration', type=int, default=20,
                        help='Recording duration if not using audio file')
    parser.add_argument('--profiles-dir', default='~/voice-assistant/profiles',
                        help='Directory to save profiles')

    args = parser.parse_args()

    # Get audio file
    if args.audio:
        audio_file = args.audio
        if not os.path.exists(audio_file):
            print(f"Error: Audio file not found: {audio_file}")
            return 1
    else:
        audio_file = record_audio(args.duration)

    # Enroll speaker
    try:
        enroll_speaker(args.name, audio_file, args.profiles_dir)
        return 0
    except Exception as e:
        print(f"Error enrolling speaker: {e}")
        return 1


if __name__ == '__main__':
    import sys
    sys.exit(main())
```
## Performance Comparison

### Single Wake Word
- **Latency:** 100-200ms
- **CPU:** ~5-10% (idle)
- **Memory:** ~100MB
- **Accuracy:** 95%+

### Multiple Wake Words (3 models)
- **Latency:** 100-200ms (parallel)
- **CPU:** ~15-30% (idle)
- **Memory:** ~300MB
- **Accuracy:** 95%+ each

### With Speaker Identification
- **Additional latency:** +100-200ms
- **Additional CPU:** +5% during ID
- **Additional memory:** +50MB
- **Accuracy:** 85-95% (depending on enrollment quality)

## Best Practices

### Wake Word Selection
1. **Different enough** - "Hey Mycroft" vs "Hey Jarvis" (not "Hey Alice" vs "Hey Alex")
2. **Clear consonants** - Easier to detect
3. **2-3 syllables** - Not too short, not too long
4. **Test in environment** - Check for false triggers

### Training
1. **Include all users** - If using a single model
2. **Diverse conditions** - Different rooms, noise levels
3. **Regular updates** - Add false positives weekly
4. **Per-user models** - Higher accuracy, more compute

### Speaker Identification
1. **Quality enrollment** - 20+ seconds of clear speech
2. **Re-enroll periodically** - Voices change (colds, etc.)
3. **Test thresholds** - Balance accuracy vs false IDs
4. **Graceful fallback** - Handle unknown speakers

## Recommended Path for You

```bash
# Week 1: Start with pre-trained "Hey Mycroft"
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
precise-listen hey-mycroft.net  # Test it!

# Week 2: Fine-tune with your voices
precise-train -e 30 hey-mycroft-custom.net . \
    --from-checkpoint hey-mycroft.net

# Week 3: Add speaker identification
pip install resemblyzer
python enroll_speaker.py --name Alan --duration 20
python enroll_speaker.py --name [Family Member] --duration 20

# Week 4: Add second wake word ("Hey Jarvis" for Plex?)
wget hey-jarvis.tar.gz
# Run both in parallel

# Month 2+: Optimize and expand
# - More wake words for different contexts
# - Per-user wake word models
# - Context-aware responses
```

This gives you a smooth progression from simple to advanced!
1089 docs/ESP32_S3_VOICE_ASSISTANT_SPEC.md Executable file
File diff suppressed because it is too large

542 docs/HARDWARE_BUYING_GUIDE.md Executable file
@@ -0,0 +1,542 @@
# Voice Assistant Hardware - Buying Guide for Second Unit

**Date:** 2025-11-29
**Context:** You have one Maix Duino (K210), planning multi-room deployment
**Question:** What should I buy for the second unit?

---

## Quick Answer

**Best Overall:** **Buy another Maix Duino K210** (~$30-40)
**Runner-up:** **ESP32-S3 with audio board** (~$20-30)
**Budget:** **Generic ESP32 + I2S** (~$15-20)
**Future-proof:** **Sipeed Maix-III** (~$60-80, when available)

---

## Analysis: Why Another Maix Duino K210?

### Pros ✅
- **Identical to first unit** - Code reuse, same workflow
- **Proven solution** - You'll know exactly what to expect
- **Stock availability** - Still widely available despite being "outdated"
- **Same accessories** - Microphones, displays, cables compatible
- **Edge detection ready** - Can upgrade to edge wake word later
- **Low cost** - ~$30-40 for full kit with LCD and camera
- **Multi-room consistency** - All units behave identically

### Cons ❌
- "Outdated" hardware (but doesn't matter for your use case)
- Limited future support from Sipeed

### Verdict: ✅ **RECOMMENDED - Best choice for consistency**

---

## Alternative Options

### Option 1: Another Maix Duino K210
**Price:** $30-40 (kit with LCD)
**Where:** AliExpress, Amazon, Seeed Studio

**Specific Model:**
- **Sipeed Maix Duino** (original, what you have)
  - Includes: LCD, camera module
  - Need to add: I2S microphone

**Why Choose:**
- Identical setup to first unit
- Code works without modification
- Same troubleshooting experience
- Bulk buy discount possible

**Link Examples:**
- Seeed Studio: https://www.seeedstudio.com/Sipeed-Maix-Duino-Kit-for-RISC-V-AI-IoT.html
- AliExpress: Search "Sipeed Maix Duino" (~$25-35)

---

### Option 2: Sipeed Maix Bit/Dock (K210 variant)
**Price:** $15-25 (smaller form factor)

**Differences from Maix Duino:**
- Smaller board
- May need separate LCD
- Same K210 chip
- Same capabilities

**Why Choose:**
- Cheaper
- More compact
- Same software

**Why Skip:**
- Need separate accessories
- Different form factor means different mounting
- Less convenient than all-in-one Duino

**Verdict:** ⚠️ Only if you want smaller/cheaper

---

### Option 3: ESP32-S3 with Audio Kit
**Price:** $20-30
**Chip:** ESP32-S3 (Xtensa dual-core @ 240MHz)

**Examples:**
- **ESP32-S3-Box** (~$30) - Has LCD, microphone, speaker built-in
- **Seeed XIAO ESP32-S3 Sense** (~$15) - Tiny, needs accessories
- **M5Stack Core S3** (~$50) - Premium, all-in-one

**Pros:**
- ✅ More modern than K210
- ✅ Better WiFi/BLE support
- ✅ Lower power consumption
- ✅ Active development
- ✅ Arduino/ESP-IDF support

**Cons:**
- ❌ No KPU (neural accelerator)
- ❌ Different code needed (ESP32 vs MaixPy)
- ❌ Less ML capability (for future edge wake word)
- ❌ Different ecosystem

**Best ESP32-S3 Choice:** **ESP32-S3-Box**
- All-in-one like your Maix Duino
- Built-in mic, speaker, LCD
- Good for server-side wake word
- Cheaper than Maix Duino

**Verdict:** 🤔 Good alternative if you want to experiment

---
### Option 4: Raspberry Pi Zero 2 W
**Price:** $15-20 (board only, need accessories)

**Pros:**
- ✅ Full Linux
- ✅ Familiar ecosystem
- ✅ Tons of support
- ✅ Easy Python development

**Cons:**
- ❌ No neural accelerator
- ❌ No dedicated audio hardware
- ❌ More power hungry (~500mW vs 200mW)
- ❌ Overkill for audio streaming
- ❌ Need USB sound card or I2S HAT
- ❌ Larger form factor

**Verdict:** ❌ Not ideal for this project

---

### Option 5: Sipeed Maix-III AXera-Pi (Future)
**Price:** $60-80 (when available)
**Chip:** AX620A (much more powerful than K210)

**Pros:**
- ✅ Modern hardware (2023)
- ✅ Better AI performance
- ✅ Linux + Python support
- ✅ Sipeed ecosystem continuity
- ✅ Great for edge wake word

**Cons:**
- ❌ More expensive
- ❌ Newer = less community support
- ❌ Overkill for server-side wake word
- ❌ Stock availability varies

**Verdict:** 🔮 Future-proof option if budget allows

---

### Option 6: Generic ESP32 + I2S Breakout
**Price:** $10-15 (cheapest option)

**What You Need:**
- ESP32 DevKit (~$5)
- I2S MEMS mic (~$5)
- Optional: I2S speaker amp (~$5)

**Pros:**
- ✅ Cheapest option
- ✅ Minimal, focused on audio only
- ✅ Very low power
- ✅ WiFi built-in

**Cons:**
- ❌ No LCD (would need separate)
- ❌ No camera
- ❌ DIY assembly required
- ❌ No neural accelerator
- ❌ Different code from K210

**Verdict:** 💰 Budget choice, but less polished

---

## Comparison Table

| Option | Price | Same Code? | LCD | AI Accel | Best For |
|--------|-------|------------|-----|----------|----------|
| **Maix Duino K210** | $30-40 | ✅ Yes | ✅ Included | ✅ KPU | **Multi-room consistency** |
| Maix Bit/Dock (K210) | $15-25 | ✅ Yes | ⚠️ Optional | ✅ KPU | Compact/Budget |
| ESP32-S3-Box | $25-35 | ❌ No | ✅ Included | ❌ No | Modern alternative |
| ESP32-S3 DIY | $15-25 | ❌ No | ❌ No | ❌ No | Custom build |
| Raspberry Pi Zero 2 W | $30+ | ❌ No | ❌ No | ❌ No | Linux/overkill |
| Maix-III | $60-80 | ⚠️ Similar | ✅ Varies | ✅ NPU | Future-proof |
| Generic ESP32 | $10-15 | ❌ No | ❌ No | ❌ No | Absolute budget |

---

## Recommended Purchase Plan

### Phase 1: Second Identical Unit (NOW)
**Buy:** Sipeed Maix Duino K210 (same as first)
**Cost:** ~$30-40
**Why:** Code reuse, proven solution, multi-room consistency

**What to Order:**
- [ ] Sipeed Maix Duino board with LCD and camera
- [ ] I2S MEMS microphone (if not included)
- [ ] Small speaker or audio output (3-5W)
- [ ] USB-C cable
- [ ] MicroSD card (4GB+)

**Total Cost:** ~$40-50 with accessories

---

### Phase 2: Third+ Units (LATER)
**Option A:** More Maix Duinos (if still available)
**Option B:** Switch to ESP32-S3-Box for variety/testing
**Option C:** Wait for Maix-III if you want cutting edge

---

## Where to Buy Maix Duino

### Recommended Sellers

**1. Seeed Studio (Official Partner)**
- URL: https://www.seeedstudio.com/
- Search: "Sipeed Maix Duino"
- Price: ~$35-45
- Shipping: International, good support
- **Pro:** Official, reliable, good documentation
- **Con:** Can be out of stock

**2. AliExpress (Direct from Sipeed/China)**
- Search: "Sipeed Maix Duino"
- Price: ~$25-35
- Shipping: 2-4 weeks (free or cheap)
- **Pro:** Cheapest, often bundled with accessories
- **Con:** Longer shipping, variable quality control
- **Tip:** Look for "Sipeed Official Store"

**3. Amazon**
- Search: "Maix Duino K210"
- Price: ~$40-50
- Shipping: Fast (Prime eligible sometimes)
- **Pro:** Fast shipping, easy returns
- **Con:** Higher price, limited stock

**4. Adafruit / SparkFun**
- May carry Sipeed products
- Higher price but US-based support
- Check availability

---
## Accessories to Buy

### Essential (for each unit)

**1. I2S MEMS Microphone**
- **Recommended:** Adafruit I2S MEMS Microphone Breakout (~$7)
  - Model: SPH0645LM4H
  - URL: https://www.adafruit.com/product/3421
- **Alternative:** INMP441 I2S Microphone (~$3 on AliExpress)
  - Cheaper, works well
  - Search: "INMP441 I2S microphone"

**2. Speaker / Audio Output**
- **Option A:** Small 3-5W speaker (~$5-10)
  - Search: "3W 8 ohm speaker"
- **Option B:** I2S speaker amplifier + speaker
  - MAX98357A I2S amp (~$5)
  - 4-8 ohm speaker (~$5)
- **Option C:** Line out to existing speakers (cheapest)

**3. MicroSD Card**
- 4GB or larger
- FAT32 formatted
- Class 10 recommended
- ~$5

**4. USB-C Cable**
- For power and programming
- ~$3-5

---

### Optional but Nice

**1. Enclosure/Case**
- 3D print custom case
- Find STL files on Thingiverse
- Or use small project box (~$5)

**2. Microphone Array** (for better pickup)
- 2 or 4-mic array board (~$15-25)
- Better voice detection
- Phase 2+ enhancement

**3. Battery Pack** (for portable testing)
- USB-C power bank
- Makes testing easier
- Already have? Use it!

**4. Mounting Hardware**
- Velcro strips
- 3M command strips
- Wall mount brackets
- ~$5

---

## Multi-Unit Strategy

### Same Hardware (Recommended)
**Buy:** 2-4x Maix Duino K210 units
**Benefit:**
- All units identical
- Same code deployment
- Easy troubleshooting
- Bulk buy discount

**Deployment:**
- Unit 1: Living room
- Unit 2: Bedroom
- Unit 3: Kitchen
- Unit 4: Office

### Mixed Hardware (Experimental)
**Buy:**
- 2x Maix Duino K210 (proven)
- 1x ESP32-S3-Box (modern)
- 1x Maix-III (future-proof)

**Benefit:**
- Test different platforms
- Evaluate performance
- Future-proofing

**Drawback:**
- More complex code
- Different troubleshooting
- Inconsistent UX

**Verdict:** ⚠️ Only if you want to experiment

---
## Budget Options

### Ultra-Budget Multi-Room (~$50 total)
- 2x Generic ESP32 + I2S mic ($10 each = $20)
- 2x Speakers ($5 each = $10)
- 2x SD cards ($5 each = $10)
- Cables ($10)
- **Total:** ~$50 for 2 units

**Pros:** Cheap
**Cons:** No LCD, DIY assembly, different code

---

### Mid-Budget Multi-Room (~$100 total)
- 2x Maix Duino K210 ($35 each = $70)
- 2x I2S mics ($5 each = $10)
- 2x Speakers ($5 each = $10)
- Accessories ($10)
- **Total:** ~$100 for 2 units

**Pros:** Proven, consistent, LCD included
**Cons:** "Outdated" hardware (doesn't matter for your use)

---

### Premium Multi-Room (~$200 total)
- 2x Maix-III AXera-Pi ($70 each = $140)
- 2x I2S mics ($10 each = $20)
- 2x Speakers ($10 each = $20)
- Accessories ($20)
- **Total:** ~$200 for 2 units

**Pros:** Future-proof, modern, powerful
**Cons:** More expensive, newer = less support

---

## My Recommendation

### For Second Unit: Buy Another Maix Duino K210 ✅

**Reasoning:**
1. **Code reuse** - Everything you develop for unit 1 works on unit 2
2. **Known quantity** - No surprises, you know it works
3. **Multi-room consistency** - All units behave the same
4. **Edge wake word ready** - Can upgrade later if desired
5. **Cost-effective** - ~$40 for full kit with LCD
6. **Stock available** - Still widely sold despite being "outdated"

**Where to Buy:**
- **Best:** AliExpress "Sipeed Official Store" (~$30 + shipping)
- **Fastest:** Amazon (~$45 with Prime)
- **Support:** Seeed Studio (~$40 + shipping)

**What to Order:**
```
Shopping List for Second Unit:
[ ] 1x Sipeed Maix Duino Kit (board + LCD + camera) - $30-35
[ ] 1x I2S MEMS microphone (INMP441 or SPH0645) - $5-7
[ ] 1x Small speaker (3W, 8 ohm) - $5-10
[ ] 1x MicroSD card (8GB+, Class 10) - $5
[ ] 1x USB-C cable - $3-5
[ ] Optional: Enclosure/mounting - $5-10

Total: ~$50-75 (depending on shipping and options)
```

---

### For Third+ Units: Evaluate

By the time you're ready for 3rd/4th units:
- You'll have experience with K210
- You'll know if you want consistency (more K210s)
- Or variety (try ESP32-S3 or Maix-III)
- Maix-III may have better availability
- Prices may have changed

**Decision:** Revisit when units 1 and 2 are working

---
## Future-Proofing Considerations

### Will K210 be Supported?
- **MaixPy:** Still actively maintained for K210
- **Community:** Large existing user base
- **Models:** Pre-trained models still work
- **Lifespan:** Good for 3-5+ years

**Verdict:** ✅ Safe to buy more K210s now

### When to Switch Hardware?
Consider switching when:
- [ ] K210 becomes hard to find
- [ ] You need better performance (edge ML)
- [ ] Power consumption is critical
- [ ] New features require newer hardware

**Timeline:** Probably 2-3 years out

---

## Special Considerations

### Different Rooms, Different Needs?

**Living Room (Primary):**
- Needs: Best audio, LCD display, polish
- **Hardware:** Maix Duino K210 with all features

**Bedroom (Secondary):**
- Needs: Simple, no bright LCD at night
- **Hardware:** Maix Duino K210, disable LCD at night

**Kitchen (Ambient Noise):**
- Needs: Better microphone array
- **Hardware:** Maix Duino K210 + 4-mic array

**Office (Minimal):**
- Needs: Cheap, basic audio only
- **Hardware:** Generic ESP32 + I2S mic

### All Same vs Customized?

**Recommendation:** Start with all same (Maix Duino), customize later if needed.

---

## Action Plan

### This Week
1. **Order second Maix Duino K210** (~$30-40)
2. **Order I2S microphone** (~$5-7)
3. **Order speaker** (~$5-10)
4. **Order SD card** (~$5)

**Total Investment:** ~$50-65

### Next Month
1. Wait for delivery (2-4 weeks from AliExpress)
2. Test unit 1 while waiting
3. Refine code and setup process
4. Prepare for unit 2 deployment

### In 2-3 Months
1. Deploy unit 2 (should be easy after unit 1)
2. Test multi-room
3. Decide on unit 3/4 based on experience
4. Consider bulk order if expanding

---

## Summary

**Buy for Second Unit:**
- ✅ **Sipeed Maix Duino K210** (same as first) - ~$35
- ✅ **I2S MEMS microphone** (INMP441) - ~$5
- ✅ **Small speaker** (3W, 8 ohm) - ~$8
- ✅ **MicroSD card** (8GB Class 10) - ~$5
- ✅ **USB-C cable** - ~$5

**Total:** ~$60 shipped

**Why:** Code reuse, consistency, proven solution, future-expandable

**Where:** AliExpress (cheap) or Amazon (fast)

**When:** Order now, 2-4 weeks delivery

**Third+ Units:** Decide after testing 2 units (probably buy more K210s)

---

## Quick Links

**Official Sipeed Store (AliExpress):**
https://sipeed.aliexpress.com/store/1101739727

**Seeed Studio:**
https://www.seeedstudio.com/catalogsearch/result/?q=maix+duino

**Amazon Search:**
"Sipeed Maix Duino K210"

**Microphone (Adafruit):**
https://www.adafruit.com/product/3421

**Alternative Mic (AliExpress):**
Search: "INMP441 I2S microphone breakout"

---

**Happy Building! 🏠🎙️**
223 docs/K210_PERFORMANCE_VERIFICATION.md Executable file
@@ -0,0 +1,223 @@
|
# K210 Performance Verification for Voice Assistant
|
||||||
|
|
||||||
|
**Date:** 2025-11-29
|
||||||
|
**Source:** https://github.com/sipeed/MaixPy Performance Comparison
|
||||||
|
**Question:** Is K210 suitable for our Mycroft Precise wake word detection project?
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## K210 Specifications
|
||||||
|
|
||||||
|
- **Processor:** K210 dual-core RISC-V @ 400MHz
|
||||||
|
- **AI Accelerator:** KPU (Neural Network Processor)
|
||||||
|
- **SRAM:** 8MB
|
||||||
|
- **Status:** Considered "outdated" by Sipeed (2018 release)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Performance Comparison (from MaixPy GitHub)
|
||||||
|
|
||||||
|
### YOLOv2 Object Detection
|
||||||
|
| Chip | Performance | Notes |
|
||||||
|
|------|------------|-------|
|
||||||
|
| K210 | 1.8 ms | Limited to older models |
|
||||||
|
| V831 | 20-40 ms | More modern, but slower |
|
||||||
|
| R329 | N/A | Newer hardware |
|
||||||
|
|
||||||
|
### Our Use Case: Audio Processing
|
||||||
|
|
||||||
|
**For wake word detection, we need:**
|
||||||
|
- Audio input (16kHz, mono) ✅ K210 has I2S
|
||||||
|
- Real-time processing ✅ K210 KPU can handle this
|
||||||
|
- Network communication ✅ K210 has ESP32 WiFi
|
||||||
|
- Low latency (<100ms) ✅ Achievable
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Deployment Strategy Analysis
|
||||||
|
|
||||||
|
### Option A: Server-Side Wake Word (Recommended)
|
||||||
|
**K210 Role:** Audio I/O only
|
||||||
|
- Capture audio from I2S microphone ✅ Well supported
|
||||||
|
- Stream to Heimdall via WiFi ✅ No problem
|
||||||
|
- Receive and play TTS audio ✅ Works fine
|
||||||
|
- LED/display feedback ✅ Easy
|
||||||
|
|
||||||
|
**K210 Requirements:** MINIMAL
|
||||||
|
- No AI processing needed
|
||||||
|
- Simple audio streaming
|
||||||
|
- Network communication only
|
||||||
|
- **Verdict:** ✅ K210 is MORE than capable
### Option B: Edge Wake Word (Future)

**K210 Role:** Wake word detection on-device

- Load KMODEL wake word model ⚠️ Needs conversion
- Run inference on KPU ⚠️ Quantization required
- Detect wake word locally ⚠️ Possible but limited

**K210 Limitations:**

- KMODEL conversion complex (TF→ONNX→KMODEL)
- Quantization may reduce accuracy (80-90% vs 95%+)
- Limited to simpler models
- **Verdict:** ⚠️ Possible but challenging

---

## Why K210 is PERFECT for Our Project

### 1. We're Starting with Server-Side Detection

- K210 only does audio I/O
- All AI processing on Heimdall (powerful server)
- No need for cutting-edge hardware
- **K210 is ideal for this role**

### 2. Audio Processing is Not Computationally Intensive

Unlike YOLOv2 (60 FPS video processing):

- Audio: 16kHz sample rate = 16,000 samples/second
- Wake word: Simple streaming
- No real-time neural network inference needed (server-side)
- **K210's "old" specs don't matter**

### 3. Edge Detection is Optional (Future Enhancement)

- We can prove the concept with server-side first
- Edge detection is a nice-to-have optimization
- If we need edge later, we can:
  - Use simpler wake word models
  - Accept slightly lower accuracy
  - Or upgrade hardware then
- **Starting point doesn't require latest hardware**

### 4. K210 Advantages We Actually Care About

- ✅ Well-documented (mature platform)
- ✅ Stable MaixPy firmware
- ✅ Large community and examples
- ✅ Proven audio processing
- ✅ Already have the hardware!
- ✅ Cost-effective ($30 vs $100+ newer boards)

---

## Performance Targets vs K210 Capabilities

### What We Need:

- Audio capture: 16kHz, 1 channel ✅ K210: Easy
- Audio streaming: ~128 kbps over WiFi ✅ K210: No problem
- Wake word latency: <200ms ✅ K210: Achievable (server-side)
- LED feedback: Instant ✅ K210: Trivial
- Audio playback: 16kHz TTS ✅ K210: Supported
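The streaming figure above is worth sanity-checking: raw 16-bit PCM at 16 kHz is 256 kbps, so the ~128 kbps target implies 8-bit samples or light compression. The arithmetic, as a quick sketch:

```python
# Sanity-check the audio bandwidth target above.
sample_rate = 16000   # Hz, mono capture
bits_raw = 16         # raw PCM sample depth

raw_kbps = sample_rate * bits_raw // 1000   # 256 kbps uncompressed
halved_kbps = raw_kbps // 2                 # 128 kbps (8-bit samples or a ~2:1 codec)

# Either figure is far below typical 802.11n throughput,
# so WiFi bandwidth is not the bottleneck here.
```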
### What We DON'T Need (for initial deployment):

- ❌ Real-time video processing
- ❌ Complex neural networks on device
- ❌ Multi-model inference
- ❌ High-resolution image processing
- ❌ Latest and greatest AI accelerator

---

## Comparison to Alternatives

### If we bought newer hardware:

**V831 ($50-70):**

- Pros: Newer, better supported
- Cons:
  - More expensive
  - SLOWER at neural networks than K210
  - Still need server for Whisper anyway
  - Overkill for audio I/O

**ESP32-S3 ($10-20):**

- Pros: Cheap, WiFi built-in
- Cons:
  - No KPU (if we want edge detection later)
  - Less capable for ML
  - Would work for server-side though

**Raspberry Pi Zero 2 W ($15):**

- Pros: Full Linux, familiar
- Cons:
  - No dedicated audio hardware
  - No neural accelerator
  - More power hungry
  - Overkill for our needs

**Verdict:** K210 is actually the sweet spot for this project!

---

## Real-World Comparison

### What K210 CAN Do (Proven):

- Audio classification ✅
- Simple keyword spotting ✅
- Voice activity detection ✅
- Audio streaming ✅
- Multi-microphone beamforming ✅

### What We're Asking It To Do:

- Stream audio to server ✅ Much easier than the proven tasks above
- (Optional future) Simple wake word detection ✅ Proven capability

---

## Recommendation: Proceed with K210

### Phase 1: Server-Side (Now)

K210 role: Audio I/O device

- **Difficulty:** Easy
- **Performance:** Excellent
- **K210 utilization:** ~10-20%
- **Status:** No concerns whatsoever

### Phase 2: Edge Detection (Future)

K210 role: Wake word detection + audio I/O

- **Difficulty:** Moderate (model conversion)
- **Performance:** Good enough (80-90% accuracy)
- **K210 utilization:** ~30-40%
- **Status:** Feasible, community has done it

---

## Conclusion

**Is K210 outdated?** Yes, for cutting-edge ML applications.

**Is K210 suitable for our project?** ABSOLUTELY YES!

**Why:**

1. We're using server-side processing (K210 just streams audio)
2. K210's audio capabilities are excellent
3. Mature platform = more examples and stability
4. Already have the hardware
5. Cost-effective
6. Can optionally upgrade to edge detection later

**The "outdated" warning is for people wanting the latest ML performance. We're using the K210 as an audio I/O device with WiFi, and it's perfect for that!**

---

## Additional Notes

### From MaixPy GitHub Warning:

> "We now recommend users choose the MaixCAM ... For 2018 K210 ... limited performance"

**Our Response:**

- We don't need 2024 performance for audio streaming
- Server does the heavy lifting (Heimdall with NVIDIA GPU)
- K210's mature platform is actually an advantage
- If we need more later, we can upgrade the edge device while keeping the server

### Community Validation:

Many Mycroft Precise + K210 projects exist:

- Audio streaming: Proven ✅
- Edge wake word: Proven ✅
- Full voice assistant: Proven ✅

**The K210 is "outdated" for video/vision ML, not for audio projects.**

---

**Final Verdict:** ✅ PROCEED WITH CONFIDENCE

The K210 is perfect for our use case. Ignore the "outdated" warning - that's for people doing real-time video processing or wanting the latest ML features. For a voice assistant where the heavy lifting happens server-side, the K210 is an excellent, mature, cost-effective choice!
---

**docs/LCD_CAMERA_FEATURES.md** (new executable file, +566 lines)
# Maix Duino LCD & Camera Feature Analysis

**Date:** 2025-11-29
**Hardware:** Sipeed Maix Duino (K210)
**Question:** What's the overhead for using the LCD display and camera?

---

## Hardware Capabilities

### LCD Display

- **Resolution:** Typically 320x240 or 240x135 (depending on model)
- **Interface:** SPI
- **Color:** RGB565 (16-bit color)
- **Frame Rate:** Up to 60 FPS (limited by SPI bandwidth)
- **Status:** ✅ Included with most Maix Duino kits

### Camera

- **Resolution:** Various (OV2640 common: 2MP, up to 1600x1200)
- **Interface:** DVP (Digital Video Port)
- **Frame Rate:** Up to 60 FPS (lower at high resolution)
- **Status:** ✅ Often included with Maix Duino kits

### K210 Resources

- **CPU:** Dual-core RISC-V @ 400MHz
- **KPU:** Neural network accelerator
- **SRAM:** 8MB total (6MB available for apps)
- **Flash:** 16MB

---

## LCD Usage for Voice Assistant

### Use Case 1: Status Display (Minimal Overhead)

**What to Show:**

- Current state (idle/listening/processing/responding)
- Wake word detected indicator
- WiFi status and signal strength
- Server connection status
- Volume level
- Time/date

**Overhead:**

- **CPU:** ~2-5% (simple text/icons)
- **RAM:** ~200KB (framebuffer + assets)
- **Power:** ~50mW additional
- **Complexity:** Low (MaixPy has built-in LCD support)

**Code Example:**

```python
import lcd
import image

lcd.init()
lcd.rotation(2)  # Rotate if needed

# Simple status display
img = image.Image(size=(320, 240))
img.draw_string(10, 10, "Listening...", color=(0, 255, 0), scale=3)
img.draw_circle(300, 20, 10, color=(0, 255, 0), fill=True)  # Status LED
lcd.display(img)
```

**Verdict:** ✅ **Very Low Overhead - Highly Recommended**

---

### Use Case 2: Audio Waveform Visualizer (Moderate Overhead)

#### Input Waveform (Microphone)

**What to Show:**

- Real-time audio level meter
- Waveform display (oscilloscope style)
- VU meter
- Frequency spectrum (simple bars)

**Overhead:**

- **CPU:** ~10-15% (real-time drawing)
- **RAM:** ~300KB (framebuffer + audio buffer)
- **Frame Rate:** 15-30 FPS (sufficient for audio visualization)
- **Complexity:** Moderate (drawing primitives + FFT)

**Implementation:**

```python
import lcd, audio, image
import array

lcd.init()
audio.init()

def draw_waveform(audio_buffer):
    img = image.Image(size=(320, 240))

    # Draw waveform
    width = 320
    height = 240
    center = height // 2

    # Sample every Nth point to fit on screen
    step = max(1, len(audio_buffer) // width)  # guard against short buffers

    for x in range(width - 1):
        # Scale int16 samples (±32768) down to pixel offsets (±128)
        y1 = center + (audio_buffer[x * step] // 256)
        y2 = center + (audio_buffer[(x + 1) * step] // 256)
        img.draw_line(x, y1, x + 1, y2, color=(0, 255, 0))

    # Add level meter
    level = max(abs(min(audio_buffer)), abs(max(audio_buffer)))
    bar_height = (level * height) // 32768
    img.draw_rectangle(0, height - bar_height, 20, bar_height,
                       color=(0, 255, 0), fill=True)

    lcd.display(img)
```
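The per-column downsampling inside `draw_waveform` is independent of the LCD API, so it can be checked on a desktop. A small sketch of the same index math (function name is illustrative, not a MaixPy API):

```python
def downsample_indices(n_samples: int, width: int):
    """Pick one sample index per screen column, mirroring draw_waveform's step logic."""
    step = max(1, n_samples // width)  # guard against buffers narrower than the screen
    return [x * step for x in range(width) if x * step < n_samples]

idx = downsample_indices(1024, 320)    # map a 1024-sample buffer onto a 320-px screen
```

With 1024 samples and a 320-pixel screen the step is 3, so columns sample indices 0, 3, ..., 957 and never run past the buffer.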
**Verdict:** ✅ **Moderate Overhead - Feasible and Cool!**

---

#### Output Waveform (TTS Response)

**What to Show:**

- TTS audio being played back
- Speaking animation (mouth/sound waves)
- Response text scrolling

**Overhead:**

- **CPU:** ~10-15% (similar to input)
- **RAM:** ~300KB
- **Complexity:** Moderate

**Note:** Can reuse the same visualization code as the input waveform.

**Verdict:** ✅ **Same as Input - Totally Doable**

---

### Use Case 3: Spectrum Analyzer (Higher Overhead)

**What to Show:**

- Frequency bars (FFT visualization)
- 8-16 frequency bands
- Classic "equalizer" look

**Overhead:**

- **CPU:** ~20-30% (FFT computation + drawing)
- **RAM:** ~500KB (FFT buffers + framebuffer)
- **Complexity:** Moderate-High (FFT required)

**Implementation Note:**

- The K210 includes a hardware FFT accelerator (separate from the KPU)
- Can do simple 8-band analysis with minimal CPU
- More bands = more CPU

**Verdict:** ⚠️ **Higher Overhead - Use Sparingly**
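The band math itself is straightforward: take the magnitude spectrum and average it into a handful of buckets. A deliberately naive sketch of just that math (an O(n²) DFT for clarity; a real build would use the K210's hardware FFT, and all names here are illustrative):

```python
import math

def band_levels(samples, n_bands=8):
    """Naive DFT -> n_bands averaged magnitudes. Illustrative only."""
    n = len(samples)
    half = n // 2  # keep positive frequencies only
    mags = []
    for k in range(half):
        re = sum(s * math.cos(-2 * math.pi * k * t / n) for t, s in enumerate(samples))
        im = sum(s * math.sin(-2 * math.pi * k * t / n) for t, s in enumerate(samples))
        mags.append(math.hypot(re, im))
    per_band = half // n_bands
    return [sum(mags[b * per_band:(b + 1) * per_band]) / per_band
            for b in range(n_bands)]

# A pure tone at bin 4 of a 64-sample window should dominate band 1 (bins 4-7).
tone = [math.sin(2 * math.pi * 4 * t / 64) for t in range(64)]
levels = band_levels(tone)
```

Each band level would then be drawn as one rectangle, exactly like the level meter in the waveform example.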
### Use Case 4: Interactive UI (High Overhead)

**What to Show:**

- Touchscreen controls (if touchscreen available)
- Settings menu
- Volume slider
- Wake word selection
- Network configuration

**Overhead:**

- **CPU:** ~20-40% (touch detection + UI rendering)
- **RAM:** ~1MB (UI framework + assets)
- **Complexity:** High (needs a UI framework)

**Verdict:** ⚠️ **High Overhead - Nice-to-Have Later**

---

## Camera Usage for Voice Assistant

### Use Case 1: Person Detection (Wake on Face)

**What to Do:**

- Detect a person in frame
- Only listen when someone is present
- Privacy mode: disable when no one is around

**Overhead:**

- **CPU:** ~30-40% (KPU handles inference)
- **RAM:** ~1.5MB (model + frame buffers)
- **Power:** ~200mW additional
- **Complexity:** Moderate (pre-trained models available)

**Pros:**

- ✅ Privacy enhancement (only listen when the room is occupied)
- ✅ Power saving (sleep when the room is empty)
- ✅ Pre-trained models available for K210

**Cons:**

- ❌ Adds latency (check camera before listening)
- ❌ Privacy concerns (camera always on)
- ❌ Moderate resource usage

**Verdict:** 🤔 **Interesting but Complex - Phase 2+**

---

### Use Case 2: Visual Context (Future AI Integration)

**What to Do:**

- "What am I holding?" queries
- Visual scene understanding
- QR code scanning
- Gesture control

**Overhead:**

- **CPU:** 40-60% (vision processing)
- **RAM:** 2-3MB (models + buffers)
- **Complexity:** High (requires vision models)

**Verdict:** ❌ **Too Complex for Initial Release - Future Feature**

---

### Use Case 3: Visual Wake Word (Gesture Detection)

**What to Do:**

- Wave a hand to activate
- Thumbs up/down for feedback
- Alternative to the voice wake word

**Overhead:**

- **CPU:** ~30-40% (gesture detection)
- **RAM:** ~1.5MB
- **Complexity:** Moderate-High

**Verdict:** 🤔 **Novel Idea - Phase 3+**

---

## Recommended LCD Implementation

### Phase 1: Basic Status Display (Recommended NOW)

```
┌─────────────────────────┐
│  Voice Assistant        │
│                         │
│  Status: Listening ●    │
│  WiFi:   ████░░ 75%     │
│  Server: Connected      │
│                         │
│  Volume: [██████░░░]    │
│                         │
│  Time: 14:23            │
└─────────────────────────┘
```
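The WiFi and volume bars in the mock-up reduce to one small function. A sketch (the glyphs and widths are chosen to match the mock-up, not a MaixPy API):

```python
def bar(percent: int, width: int = 10, filled: str = "█", empty: str = "░") -> str:
    """Render a 0-100 percent value as a fixed-width text bar."""
    cells = round(percent * width / 100)
    return filled * cells + empty * (width - cells)

wifi = bar(75, width=6)  # matches the "WiFi: ████░░ 75%" line above
```

The same function covers the volume bar by changing `width`, and on the real display each character cell would become a filled rectangle.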
**Features:**

- Current state indicator
- WiFi signal strength
- Server connection status
- Volume level bar
- Clock
- Wake word indicator (pulsing circle)

**Overhead:** ~2-5% CPU, 200KB RAM

---

### Phase 2: Waveform Visualization (Cool Addition)

```
┌─────────────────────────┐
│ Listening...        [●] │
├─────────────────────────┤
│   ╱╲    ╱╲    ╱╲   ╱╲   │
│  ╱  ╲╱ ╲  ╱ ╲╱  ╲       │
│                         │
│ Level: [████░░░░░░]     │
└─────────────────────────┘
```

**Features:**

- Real-time waveform (15-30 FPS)
- Audio level meter
- State indicator
- Simple and clean

**Overhead:** ~10-15% CPU, 300KB RAM

---

### Phase 3: Enhanced Visualizer (Polish)

```
┌─────────────────────────┐
│ Hey Computer!       [●] │
├─────────────────────────┤
│ ▁▂▃▄▅▆▇█ ▁▂▃▄▅▆▇█       │
│ ▁▂▃▄▅▆▇█ ▁▂▃▄▅▆▇█       │
│                         │
│ "Turn off the lights"   │
└─────────────────────────┘
```

**Features:**

- Spectrum analyzer (8-16 bands)
- Transcription display
- Animated response
- More polished UI

**Overhead:** ~20-30% CPU, 500KB RAM

---

## Resource Budget Analysis

### Total K210 Resources

- **CPU:** 2 cores @ 400MHz (assume ~100% available)
- **RAM:** 6MB available for the app
- **Bandwidth:** SPI (LCD), I2S (audio), WiFi

### Current Voice Assistant Usage (Server-Side Wake Word)

| Component | CPU % | RAM (KB) |
|-----------|-------|----------|
| Audio Capture (I2S) | 5% | 128 |
| Audio Playback | 5% | 128 |
| WiFi Streaming | 10% | 256 |
| Network Stack | 5% | 512 |
| MaixPy Runtime | 10% | 1024 |
| **Base Total** | **35%** | **~2MB** |
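The table arithmetic is easy to keep honest with a tiny budget checker. A sketch (the figures are the estimates from the table above, not measurements; the 80% CPU limit is an assumed headroom policy):

```python
# Estimated base load from the table above: (CPU %, RAM KB) per component.
BASE = {
    "audio_capture":  (5, 128),
    "audio_playback": (5, 128),
    "wifi_streaming": (10, 256),
    "network_stack":  (5, 512),
    "maixpy_runtime": (10, 1024),
}

def totals(components):
    cpu = sum(c for c, _ in components.values())
    ram_kb = sum(r for _, r in components.values())
    return cpu, ram_kb

def fits(components, cpu_limit=80, ram_limit_kb=6 * 1024):
    """Leave ~20% CPU headroom; 6MB app RAM per the resource list above."""
    cpu, ram = totals(components)
    return cpu <= cpu_limit and ram <= ram_limit_kb

base_cpu, base_ram = totals(BASE)  # 35% CPU, 2048 KB
```

Adding a candidate feature is then one dictionary entry, e.g. a spectrum display at `(25, 500)` lands at 60% CPU and still fits.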
### With LCD Features

| Display Mode | CPU % | RAM (KB) | Total CPU | Total RAM |
|--------------|-------|----------|-----------|-----------|
| **None** | 0% | 0 | 35% | 2MB |
| **Status Only** | 2-5% | 200 | 37-40% | 2.2MB |
| **Waveform** | 10-15% | 300 | 45-50% | 2.3MB |
| **Spectrum** | 20-30% | 500 | 55-65% | 2.5MB |

### With Camera Features

| Feature | CPU % | RAM (KB) | Feasible? |
|---------|-------|----------|-----------|
| Person Detection | 30-40% | 1500 | ⚠️ Tight |
| Gesture Control | 30-40% | 1500 | ⚠️ Tight |
| Visual Context | 40-60% | 2500 | ❌ Too much |

---

## Recommendations

### ✅ IMPLEMENT NOW: Basic Status Display

- **Why:** Very low overhead, huge UX improvement
- **Overhead:** 2-5% CPU, 200KB RAM
- **Benefit:** Users know what's happening at a glance
- **Difficulty:** Easy (MaixPy has good LCD support)

### ✅ IMPLEMENT SOON: Waveform Visualizer

- **Why:** Cool factor, moderate overhead
- **Overhead:** 10-15% CPU, 300KB RAM
- **Benefit:** Engaging, confirms the mic is working, looks professional
- **Difficulty:** Moderate (simple drawing code)

### 🤔 CONSIDER LATER: Spectrum Analyzer

- **Why:** Higher overhead, diminishing returns
- **Overhead:** 20-30% CPU, 500KB RAM
- **Benefit:** Looks cool but not essential
- **Difficulty:** Moderate-High (FFT required)

### ❌ SKIP FOR NOW: Camera Features

- **Why:** High overhead, complex, privacy concerns
- **Overhead:** 30-60% CPU, 1.5-2.5MB RAM
- **Benefit:** Novel but not core functionality
- **Difficulty:** High (model integration, privacy handling)

---

## Implementation Priority

### Phase 1 (Week 1): Core Functionality

- [x] Audio capture and streaming
- [x] Server integration
- [ ] Basic LCD status display
  - Idle/Listening/Processing states
  - WiFi status
  - Connection indicator

### Phase 2 (Week 2-3): Visual Enhancement

- [ ] Audio waveform visualizer
  - Input (microphone) waveform
  - Output (TTS) waveform
  - Level meters
  - Clean, minimal design

### Phase 3 (Month 2): Polish

- [ ] Spectrum analyzer option
- [ ] Animated transitions
- [ ] Settings display
- [ ] Network configuration UI (optional)

### Phase 4 (Month 3+): Advanced Features

- [ ] Camera person detection (privacy mode)
- [ ] Gesture control experiments
- [ ] Visual wake word alternative

---

## Code Structure Recommendation

```python
# main.py structure with modular display

import lcd, audio, network
from display_manager import DisplayManager
from audio_processor import AudioProcessor
from voice_client import VoiceClient

# Initialize
lcd.init()
display = DisplayManager(mode='waveform')  # or 'status' or 'spectrum'
voice_client = VoiceClient()               # connection to the voice server

# Main loop
while True:
    # Audio processing
    audio_buffer = audio.capture()

    # Update display (non-blocking)
    if display.mode == 'status':
        display.show_status(state='listening', wifi_level=75)
    elif display.mode == 'waveform':
        display.show_waveform(audio_buffer)
    elif display.mode == 'spectrum':
        display.show_spectrum(audio_buffer)

    # Network communication
    voice_client.stream_audio(audio_buffer)
```
---

## Estimated Overhead

### Status Display Only

- **CPU:** 38% total (3% for display)
- **RAM:** 2.2MB total (200KB for display)
- **Battery Life:** -2% (minimal impact)
- **WiFi Latency:** No impact
- **Verdict:** ✅ Negligible impact, worth it!

### Waveform Visualizer

- **CPU:** 48% total (13% for display)
- **RAM:** 2.3MB total (300KB for display)
- **Battery Life:** -5% (minor impact)
- **WiFi Latency:** No impact (still <200ms)
- **Verdict:** ✅ Acceptable, looks great!

### Spectrum Analyzer

- **CPU:** 60% total (25% for display)
- **RAM:** 2.5MB total (500KB for display)
- **Battery Life:** -8% (noticeable)
- **WiFi Latency:** Possible minor impact
- **Verdict:** ⚠️ Usable but pushing limits

---

## Camera: Should You Use It?

### Pros

- ✅ Already have the hardware (free!)
- ✅ Novel features (person detection, gestures)
- ✅ Privacy enhancement potential
- ✅ Future-proofing

### Cons

- ❌ High resource usage (30-60% CPU, 1.5-2.5MB RAM)
- ❌ Complex implementation
- ❌ Privacy concerns (camera always on)
- ❌ Not core to a voice assistant
- ❌ Competes with audio processing for resources

### Recommendation

**Skip the camera for the initial implementation.** Focus on core voice assistant functionality. Revisit in Phase 3+ when:

1. Core features are stable
2. You want to experiment
3. You have time for optimization
4. You want to differentiate from commercial assistants

---

## Final Recommendations

### Start With (NOW):

```python
# Simple status display
# - State indicator
# - WiFi status
# - Connection status
# - Time/date
# Overhead: ~3% CPU, 200KB RAM
```

### Add Next (Week 2):

```python
# Waveform visualizer
# - Real-time audio waveform
# - Level meter
# - Clean design
# Overhead: +10% CPU, +100KB RAM
```

### Maybe Later (Month 2+):

```python
# Spectrum analyzer
# - 8-16 frequency bands
# - FFT visualization
# - Optional mode
# Overhead: +15% CPU, +200KB RAM
```

### Skip (For Now):

```python
# Camera features
# - Person detection
# - Gestures
# - Visual context
# Too complex, revisit later
```

---

## Example: Combined Status + Waveform Display

```
┌───────────────────────────────┐
│ Voice Assistant   [LISTENING] │
├───────────────────────────────┤
│                               │
│   ╱╲    ╱╲    ╱╲   ╱╲   ╱╲    │
│  ╱  ╲  ╱  ╲╱  ╲   ╱  ╲╱  ╲    │
│        ╲╱          ╲╱         │
│                               │
│ Vol: [████████░░] WiFi: ▂▃▅█  │
│                               │
│ Server: 10.1.10.71 ●    14:23 │
└───────────────────────────────┘
```

**Total Overhead:** ~15% CPU, 300KB RAM
**Impact:** Minimal, excellent UX improvement
**Coolness Factor:** 9/10

---

## Conclusion

### LCD: YES! Definitely Use It! ✅

- **Status display:** Low overhead, huge benefit
- **Waveform:** Moderate overhead, looks amazing
- **Spectrum:** Higher overhead, nice-to-have

**Recommendation:** Start with status, add waveform, consider spectrum later.

### Camera: Skip For Now ❌

- High overhead
- Complex implementation
- Not core functionality
- Revisit in Phase 3+

**Focus on nailing the voice assistant first, then add visual features incrementally!**

---

**TL;DR:** Use the LCD for status + waveform visualization (~15% overhead total). Skip the camera for now. Your K210 can easily handle this! 🎉
---

**docs/MYCROFT_PRECISE_GUIDE.md** (new executable file, +638 lines)
|
# Mycroft Precise Wake Word Training Guide
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
Mycroft Precise is a neural network-based wake word detector that you can train on custom wake words. This guide covers two deployment approaches for your Maix Duino voice assistant:
|
||||||
|
|
||||||
|
1. **Server-side detection** (Recommended to start) - Run Precise on Heimdall
|
||||||
|
2. **Edge detection** (Advanced) - Convert model for K210 on Maix Duino
|
||||||
|
|
||||||
|
## Architecture Options
|
||||||
|
|
||||||
|
### Option A: Server-Side Wake Word Detection (Recommended)
|
||||||
|
|
||||||
|
```
|
||||||
|
Maix Duino Heimdall
|
||||||
|
┌─────────────────┐ ┌──────────────────────┐
|
||||||
|
│ Continuous │ Audio Stream │ Mycroft Precise │
|
||||||
|
│ Audio Capture │───────────────>│ Wake Word Detection │
|
||||||
|
│ │ │ │
|
||||||
|
│ LED Feedback │<───────────────│ Whisper STT │
|
||||||
|
│ Speaker Output │ Response │ HA Integration │
|
||||||
|
│ │ │ Piper TTS │
|
||||||
|
└─────────────────┘ └──────────────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
**Pros:**
|
||||||
|
- Easier setup and debugging
|
||||||
|
- Better accuracy (more compute available)
|
||||||
|
- Easy to retrain and update models
|
||||||
|
- Can use ensemble models
|
||||||
|
|
||||||
|
**Cons:**
|
||||||
|
- Continuous audio streaming (bandwidth)
|
||||||
|
- Slightly higher latency (~100-200ms)
|
||||||
|
- Requires stable network
|
||||||
|
|
||||||
|
### Option B: Edge Detection on Maix Duino (Advanced)
|
||||||
|
|
||||||
|
```
|
||||||
|
Maix Duino Heimdall
|
||||||
|
┌─────────────────┐ ┌──────────────────────┐
|
||||||
|
│ Precise Model │ │ │
|
||||||
|
│ (K210 KPU) │ │ │
|
||||||
|
│ Wake Detection │ Audio (on wake)│ Whisper STT │
|
||||||
|
│ │───────────────>│ HA Integration │
|
||||||
|
│ Audio Capture │ │ Piper TTS │
|
||||||
|
│ LED Feedback │<───────────────│ │
|
||||||
|
└─────────────────┘ Response └──────────────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
**Pros:**
|
||||||
|
- Lower latency (~50ms wake detection)
|
||||||
|
- Less network traffic
|
||||||
|
- Works even if server is down
|
||||||
|
- Better privacy (no continuous streaming)
|
||||||
|
|
||||||
|
**Cons:**
|
||||||
|
- Complex model conversion (TensorFlow → ONNX → KMODEL)
|
||||||
|
- Limited by K210 compute
|
||||||
|
- Harder to update models
|
||||||
|
- Requires careful optimization
|
||||||
|
|
||||||
|
## Recommended Approach: Start with Server-Side
|
||||||
|
|
||||||
|
Begin with server-side detection on Heimdall, then optimize to edge detection once everything works.
|
||||||
|
|
||||||
|
## Phase 1: Mycroft Precise Setup on Heimdall
|
||||||
|
|
||||||
|
### Install Mycroft Precise
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# SSH to Heimdall
|
||||||
|
ssh alan@10.1.10.71
|
||||||
|
|
||||||
|
# Create conda environment for Precise
|
||||||
|
conda create -n precise python=3.7 -y
|
||||||
|
conda activate precise
|
||||||
|
|
||||||
|
# Install TensorFlow 1.x (Precise requires this)
|
||||||
|
pip install tensorflow==1.15.5 --break-system-packages
|
||||||
|
|
||||||
|
# Install Precise
|
||||||
|
pip install mycroft-precise --break-system-packages
|
||||||
|
|
||||||
|
# Install audio dependencies
|
||||||
|
sudo apt-get install -y portaudio19-dev sox libatlas-base-dev
|
||||||
|
|
||||||
|
# Install precise-engine (for faster inference)
|
||||||
|
wget https://github.com/MycroftAI/mycroft-precise/releases/download/v0.3.0/precise-engine_0.3.0_x86_64.tar.gz
|
||||||
|
tar xvf precise-engine_0.3.0_x86_64.tar.gz
|
||||||
|
sudo cp precise-engine/precise-engine /usr/local/bin/
|
||||||
|
sudo chmod +x /usr/local/bin/precise-engine
|
||||||
|
```
|
||||||
|
|
||||||
|
### Verify Installation
|
||||||
|
|
||||||
|
```bash
|
||||||
|
precise-engine --version
|
||||||
|
# Should output: Precise v0.3.0
|
||||||
|
|
||||||
|
precise-listen --help
|
||||||
|
# Should show help text
|
||||||
|
```
|
||||||
|
|
||||||
|
## Phase 2: Training Your Custom Wake Word
|
||||||
|
|
||||||
|
### Step 1: Collect Wake Word Samples
|
||||||
|
|
||||||
|
You'll need ~50-100 samples of your wake word. Choose something:
|
||||||
|
- 2-3 syllables long
|
||||||
|
- Easy to pronounce
|
||||||
|
- Unlikely to occur in normal speech
|
||||||
|
|
||||||
|
Example wake words:
|
||||||
|
- "Hey Computer" (recommended - similar to commercial products)
|
||||||
|
- "Okay Jarvis"
|
||||||
|
- "Hello Assistant"
|
||||||
|
- "Activate Assistant"
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Create project directory
|
||||||
|
mkdir -p ~/precise-models/hey-computer
|
||||||
|
cd ~/precise-models/hey-computer
|
||||||
|
|
||||||
|
# Record wake word samples
|
||||||
|
precise-collect
|
||||||
|
```
|
||||||
|
|
||||||
|
When prompted:
|
||||||
|
1. Type your wake word ("hey computer")
|
||||||
|
2. Press SPACE to record
|
||||||
|
3. Say the wake word clearly
|
||||||
|
4. Press SPACE to stop
|
||||||
|
5. Repeat 50-100 times
|
||||||
|
|
||||||
|
**Tips for good samples:**
|
||||||
|
- Vary your tone and speed
|
||||||
|
- Different distances from mic
|
||||||
|
- Different background noise levels
|
||||||
|
- Different pronunciations
|
||||||
|
- Have family members record too
|
||||||
|
|
||||||
|
### Step 2: Collect "Not Wake Word" Samples
|
||||||
|
|
||||||
|
Record background audio and similar-sounding phrases:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Create not-wake-word directory
|
||||||
|
mkdir -p not-wake-word
|
||||||
|
|
||||||
|
# Record random speech, music, TV, etc.
|
||||||
|
# These help the model learn what NOT to trigger on
|
||||||
|
precise-collect -f not-wake-word/random.wav
|
||||||
|
```
|
||||||
|
|
||||||
|
Collect ~200-500 samples of:
|
||||||
|
- Normal conversation
|
||||||
|
- TV/music in background
|
||||||
|
- Similar sounding phrases ("hey commuter", "they computed", etc.)
|
||||||
|
- Ambient noise
|
||||||
|
- Other household sounds
|
||||||
|
|
||||||
|
### Step 3: Generate Training Data

```bash
# Organize samples
mkdir -p hey-computer/{wake-word,not-wake-word,test/wake-word,test/not-wake-word}

# Split samples (80% train, 20% test):
# - 80% of wake-word samples   -> hey-computer/wake-word/
# - 20% of wake-word samples   -> hey-computer/test/wake-word/
# - 80% of not-wake-word       -> hey-computer/not-wake-word/
# - 20% of not-wake-word       -> hey-computer/test/not-wake-word/

# Train incrementally (generates additional negatives as it goes)
precise-train-incremental hey-computer.net hey-computer/
```
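The 80/20 split described above can be scripted rather than done by hand. A minimal sketch using only the standard library; the `raw-wake-word/` and `raw-not-wake-word/` source directories are illustrative names for wherever `precise-collect` saved your recordings:

```python
import random
import shutil
from pathlib import Path

def split_samples(src_dir, train_dir, test_dir, test_fraction=0.2, seed=0):
    """Move a random test_fraction of .wav files from src_dir into
    test_dir and the remainder into train_dir."""
    wavs = sorted(Path(src_dir).glob("*.wav"))
    random.Random(seed).shuffle(wavs)  # fixed seed for a reproducible split
    n_test = int(len(wavs) * test_fraction)
    for dest, files in ((test_dir, wavs[:n_test]), (train_dir, wavs[n_test:])):
        Path(dest).mkdir(parents=True, exist_ok=True)
        for wav in files:
            shutil.move(str(wav), str(Path(dest) / wav.name))

# Example layout (source directory names are assumptions):
split_samples("raw-wake-word", "hey-computer/wake-word",
              "hey-computer/test/wake-word")
split_samples("raw-not-wake-word", "hey-computer/not-wake-word",
              "hey-computer/test/not-wake-word")
```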

### Step 4: Train the Model

```bash
# Basic training (takes roughly 30-60 minutes)
precise-train -e 60 hey-computer.net hey-computer/

# For better accuracy, train longer
precise-train -e 120 hey-computer.net hey-computer/

# Watch for overfitting: validation loss should keep decreasing.
# Stop if validation loss starts increasing.
```

Training output will look like:

```
Epoch 1/60
loss: 0.4523 - val_loss: 0.3891
Epoch 2/60
loss: 0.3102 - val_loss: 0.2845
...
```

### Step 5: Test the Model

```bash
# Test with a microphone
precise-listen hey-computer.net

# Speak your wake word - you should see "!" on each detection
# Speak other phrases - the model should not trigger

# Test with the held-out audio files
precise-test hey-computer.net hey-computer/test/

# Target metrics:
#   Wake word accuracy: 95%+
#   False positive rate: <5%
```

### Step 6: Optimize Sensitivity

```bash
# Adjust the detection sensitivity (-s, range 0.0-1.0)
precise-listen hey-computer.net -s 0.5  # Default
precise-listen hey-computer.net -s 0.7  # More conservative
precise-listen hey-computer.net -s 0.3  # More aggressive

# Find the optimal value for your environment:
# Higher = fewer false positives, more missed detections
# Lower  = more false positives, fewer missed detections
```
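The trade-off above can be made concrete by sweeping candidate thresholds over per-clip model scores. A stdlib-only sketch; the score lists are hypothetical model outputs, not produced by any Precise tool:

```python
def sweep_thresholds(pos_scores, neg_scores, steps=9):
    """For each candidate threshold, report the miss (false-negative)
    rate on wake-word clips and the false-positive rate on negatives."""
    results = []
    for i in range(1, steps + 1):
        t = i / (steps + 1)                                  # 0.1 ... 0.9
        fn = sum(s < t for s in pos_scores) / len(pos_scores)
        fp = sum(s >= t for s in neg_scores) / len(neg_scores)
        results.append((round(t, 2), fn, fp))
    return results

# Hypothetical scores: wake-word clips score high, negatives score low
pos = [0.92, 0.85, 0.77, 0.64, 0.95]
neg = [0.10, 0.35, 0.55, 0.05, 0.20]
for t, fn, fp in sweep_thresholds(pos, neg):
    print(f"threshold {t}: miss rate {fn:.0%}, false positive rate {fp:.0%}")
```

Pick the threshold whose miss/false-positive balance matches your tolerance, then set it via `-s` (or the server's sensitivity option).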

## Phase 3: Integration with Voice Server

### Update voice_server.py

Add Mycroft Precise support to the server:

```python
# Add to imports
import os

from precise_runner import PreciseEngine, PreciseRunner
import pyaudio  # used internally by PreciseRunner for microphone capture

# Add to configuration
PRECISE_MODEL = os.getenv(
    "PRECISE_MODEL", "/home/alan/precise-models/hey-computer.net")
PRECISE_SENSITIVITY = float(os.getenv("PRECISE_SENSITIVITY", "0.5"))

# Global Precise runner
precise_runner = None

def on_activation():
    """Called when the wake word is detected."""
    print("Wake word detected!")
    # Trigger recording and processing here
    # (implementation depends on your audio streaming setup)

def start_precise_listener():
    """Start Mycroft Precise wake word detection."""
    global precise_runner

    engine = PreciseEngine('/usr/local/bin/precise-engine', PRECISE_MODEL)
    precise_runner = PreciseRunner(
        engine,
        sensitivity=PRECISE_SENSITIVITY,
        on_activation=on_activation,
    )
    precise_runner.start()
    print(f"Precise listening with model: {PRECISE_MODEL}")
```

### Server-Side Wake Word Detection Architecture

For server-side detection, you need continuous audio streaming from the Maix Duino:

```python
# New endpoint for audio streaming
@app.route('/stream', methods=['POST'])
def stream_audio():
    """
    Receive a continuous audio stream for wake word detection.

    This endpoint processes incoming audio chunks and runs them
    through Mycroft Precise for wake word detection.
    """
    # Implementation here
    pass
```
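The endpoint stub above needs somewhere to park incoming chunks before a detector loop consumes them. One way to do that is a thread-safe rolling buffer between the HTTP handler (producer) and the detection thread (consumer); a stdlib-only sketch, where the class name and sizes are illustrative and not part of `precise_runner`:

```python
import threading

class AudioChunkBuffer:
    """Thread-safe rolling byte buffer between the HTTP handler
    (producer) and the wake-word detector loop (consumer)."""

    def __init__(self, max_bytes=16000 * 2 * 10):  # ~10 s of 16 kHz 16-bit mono
        self._buf = bytearray()
        self._lock = threading.Lock()
        self._max = max_bytes

    def write(self, chunk: bytes):
        with self._lock:
            self._buf.extend(chunk)
            if len(self._buf) > self._max:          # drop the oldest audio
                del self._buf[:len(self._buf) - self._max]

    def read(self, n: int) -> bytes:
        """Return up to n buffered bytes and remove them."""
        with self._lock:
            out = bytes(self._buf[:n])
            del self._buf[:n]
            return out

buf = AudioChunkBuffer()
buf.write(b"\x00\x01" * 1024)  # the HTTP handler pushes each POST body
frame = buf.read(2048)         # the detector loop pulls fixed-size frames
```

Bounding the buffer means a stalled consumer loses old audio instead of growing memory without limit.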

## Phase 4: Maix Duino Integration (Server-Side Detection)

### Update maix_voice_client.py

For server-side detection, stream audio continuously:

```python
# Add to configuration
STREAM_ENDPOINT = "/stream"
WAKE_WORD_CHECK_INTERVAL = 0.1  # Check every 100 ms

def stream_audio_continuous():
    """
    Stream audio to the server for wake word detection.

    The server notifies us when the wake word is detected.
    """
    import socket
    import struct
    import time

    # Create the socket connection
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_addr = (VOICE_SERVER_URL.replace('http://', '').split(':')[0], 8888)

    try:
        sock.connect(server_addr)
        print("Connected to wake word server")

        while True:
            # Capture an audio chunk
            chunk = i2s_dev.record(CHUNK_SIZE)

            if chunk:
                # Send the chunk length first, then the chunk itself
                sock.sendall(struct.pack('>I', len(chunk)))
                sock.sendall(chunk)

            # Check for a wake word detection signal
            # (simplified - a real implementation needs a non-blocking socket)

            time.sleep(0.01)

    except Exception as e:
        print(f"Streaming error: {e}")
    finally:
        sock.close()
```
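The client above frames each chunk with a 4-byte big-endian length prefix, so the server side must reassemble chunks from the TCP byte stream. A minimal, stdlib-only sketch of both halves of that framing (function names are illustrative):

```python
import io
import struct

def frame_chunk(chunk: bytes) -> bytes:
    """Prefix a chunk with its 4-byte big-endian length (matches the
    struct.pack('>I', ...) framing used by the client)."""
    return struct.pack(">I", len(chunk)) + chunk

def read_frames(stream):
    """Yield the original chunks back out of a length-prefixed stream."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return                        # clean end of stream
        (length,) = struct.unpack(">I", header)
        chunk = stream.read(length)
        if len(chunk) < length:
            return                        # truncated frame; stop
        yield chunk

# Round-trip two chunks through an in-memory "socket"
wire = frame_chunk(b"audio-1") + frame_chunk(b"audio-22")
chunks = list(read_frames(io.BytesIO(wire)))
print(chunks)  # [b'audio-1', b'audio-22']
```

On the real server, `stream.read` would wrap the accepted socket (e.g. `sock.makefile('rb')`), and each recovered chunk would be appended to the detector's audio buffer.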

## Phase 5: Edge Detection on Maix Duino (Advanced)

### Convert Precise Model to KMODEL

This is complex and requires several conversion steps:

```bash
# Step 1: Convert the Keras model to ONNX
# (Precise .net files are Keras HDF5 models, so use tf2onnx's --keras input)
pip install tf2onnx --break-system-packages

python -m tf2onnx.convert \
    --keras hey-computer.net \
    --output hey-computer.onnx

# Step 2: Optimize the ONNX model
# (the optimizer moved out of the onnx package into onnxoptimizer)
pip install onnx onnxoptimizer --break-system-packages

python -c "
import onnx
import onnxoptimizer

model = onnx.load('hey-computer.onnx')
passes = ['eliminate_deadend', 'eliminate_identity',
          'eliminate_nop_dropout', 'eliminate_nop_pad']
optimized = onnxoptimizer.optimize(model, passes)
onnx.save(optimized, 'hey-computer-opt.onnx')
"

# Step 3: Convert ONNX to KMODEL (for the K210)
# Use nncase (https://github.com/kendryte/nncase)
# This step is hardware-specific and complex

# Install nncase
pip install nncase --break-system-packages

# Convert (adjust parameters to match your model)
ncc compile hey-computer-opt.onnx \
    -i onnx \
    --dataset calibration_data \
    -o hey-computer.kmodel \
    --target k210
```

**Note:** KMODEL conversion is non-trivial and may require model architecture adjustments. The K210 has limitations:

- Max model size: ~6 MB
- Limited operator support
- Quantization required for acceptable performance

### Testing KMODEL on Maix Duino

```python
# Load the model in maix_voice_client.py
import KPU as kpu

def load_wake_word_model_kmodel():
    """Load the converted KMODEL for wake word detection."""
    global kpu_task

    try:
        kpu_task = kpu.load("/sd/models/hey-computer.kmodel")
        print("Wake word model loaded on K210")
        return True
    except Exception as e:
        print(f"Failed to load model: {e}")
        return False

def detect_wake_word_kmodel():
    """Run wake word detection on the K210 KPU."""
    global kpu_task

    # Capture audio
    audio_chunk = i2s_dev.record(CHUNK_SIZE)

    # Preprocess for the model (depends on its input format)
    # This is model-specific - adjust to match your training pipeline
    features = preprocess_audio(audio_chunk)

    # Run inference (kpu.forward is MaixPy's generic KPU inference call;
    # how you read the output depends on your model's head)
    output = kpu.forward(kpu_task, features)

    # Check confidence
    if output[0] > WAKE_WORD_THRESHOLD:
        return True
    return False
```

## Recommended Wake Words

Based on testing and community feedback:

**Best performers:**
1. "Hey Computer" - clear and distinct, with hard consonants
2. "Okay Jarvis" - pop culture reference, easy to say
3. "Hey Mycroft" - the original Mycroft wake word (lots of training data available)

**Avoid:**
- Single-syllable words (too easy to trigger)
- Common phrases ("okay", "hey there")
- Names of people in your household
- Words that sound like common speech patterns

## Training Tips

### For Best Accuracy

1. **Diverse training data:**
   - Multiple speakers
   - Various distances (1 ft to 15 ft)
   - Different noise conditions
   - Accent variations

2. **Quality over quantity:**
   - 50 good samples beat 200 poor ones
   - Clear pronunciation
   - Consistent volume

3. **Hard negatives:**
   - Include similar-sounding phrases
   - Include partial wake words
   - Include common false triggers you notice

4. **Regular retraining:**
   - Add false positives to the training set
   - Add missed detections
   - Retrain every few weeks initially

### Collecting Hard Negatives

```bash
# Run precise-listen and save audio for each activation so you can
# review false positives (-d/--save-dir; check `precise-listen --help`
# for the exact options in your version)
precise-listen hey-computer.net -d false-activations/

# Add clips that triggered incorrectly to your not-wake-word set,
# then retrain to reduce false positives
```

## Performance Benchmarks

### Server-Side Detection (Heimdall)
- **Latency:** 100-200 ms from utterance to detection
- **Accuracy:** 95%+ with good training
- **False positive rate:** <1 per hour with tuning
- **CPU usage:** ~5-10% (single core)
- **Network:** ~128 kbps continuous stream

### Edge Detection (Maix Duino)
- **Latency:** 50-100 ms
- **Accuracy:** 80-90% (limited by K210 quantization)
- **False positive rate:** varies with model optimization
- **CPU usage:** ~30% of the K210 (leaves room for other tasks)
- **Network:** none until the wake word is detected

## Monitoring and Debugging

### Log Wake Word Detections

```python
# Add to voice_server.py
import datetime

def log_wake_word(confidence, timestamp=None):
    """Log wake word detections for later analysis."""
    if timestamp is None:
        timestamp = datetime.datetime.now()

    log_file = "/home/alan/voice-assistant/logs/wake_words.log"
    with open(log_file, 'a') as f:
        f.write(f"{timestamp.isoformat()},{confidence}\n")
```

### Analyze False Positives

```bash
# Follow the wake word log
tail -f ~/voice-assistant/logs/wake_words.log

# Histogram of detection confidences
# (log lines are "timestamp,confidence", so field 2 is the confidence)
awk -F',' '{print $2}' ~/voice-assistant/logs/wake_words.log | \
    sort -n | uniq -c
```

## Production Deployment

### Systemd Service with Precise

Update the systemd service to include Precise:

```ini
[Unit]
Description=Voice Assistant with Wake Word Detection
After=network.target

[Service]
Type=simple
User=alan
WorkingDirectory=/home/alan/voice-assistant
Environment="PATH=/home/alan/miniconda3/envs/precise/bin:/usr/local/bin:/usr/bin:/bin"
EnvironmentFile=/home/alan/voice-assistant/config/.env
ExecStart=/home/alan/miniconda3/envs/precise/bin/python voice_server.py --enable-precise
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
```

## Troubleshooting

### Precise Won't Start

```bash
# Check the TensorFlow version
python -c "import tensorflow as tf; print(tf.__version__)"
# Should be 1.15.x

# Check the model file (.net models are Keras HDF5 files)
file hey-computer.net
# Should report: Hierarchical Data Format (version 5) data

# Test the model directly
precise-engine hey-computer.net
# Should load without errors
```

### Low Accuracy

1. **Collect more training data** - especially hard negatives
2. **Increase training epochs** - try 200-300 epochs
3. **Verify the train/test split** - should be 80/20
4. **Check audio quality** - the sample rate must match (16 kHz)
5. **Try a different wake word** - some are easier to detect

### High False Positive Rate

1. **Increase the threshold** - try 0.6, 0.7, 0.8
2. **Add false positives to training** - retrain with the false triggers
3. **Collect more negative samples** - expand the not-wake-word set
4. **Use ensemble models** - run multiple models and require agreement

### KMODEL Conversion Fails

This is expected - K210 conversion is complex:

1. **Simplify the model architecture** - reduce the layer count
2. **Use quantization-aware training** - train with quantization in mind
3. **Check operator support** - the K210 does not support all TF ops
4. **Consider alternatives:**
   - Use pre-trained models for the K210
   - Stick with server-side detection
   - Use Porcupine instead (it has K210 support)

## Alternative: Use Pre-trained Models

Mycroft provides some pre-trained models:

```bash
# Download the Hey Mycroft model
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz

# Test it
precise-listen hey-mycroft.net
```

Then train your own wake word starting from this base:

```bash
# Fine-tune from the pre-trained model: precise-train resumes training
# when the model file already exists, so copy it to the new name first
cp hey-mycroft.net my-wake-word.net
precise-train -e 60 my-wake-word.net my-wake-word/
```

## Next Steps

1. **Start with server-side** - get it working on Heimdall first
2. **Collect good training data** - quality samples are key
3. **Test and tune the threshold** - find the sweet spot for your environment
4. **Monitor performance** - track false positives and misses
5. **Iterate on training** - add hard examples and retrain
6. **Consider edge deployment** - once server-side is solid

## Resources

- Mycroft Precise docs: https://github.com/MycroftAI/mycroft-precise
- Training guide: https://mycroft-ai.gitbook.io/docs/mycroft-technologies/precise
- Community models: https://github.com/MycroftAI/precise-data
- K210 docs: https://canaan-creative.com/developer
- nncase: https://github.com/kendryte/nncase

## Conclusion

Mycroft Precise gives you full control over wake word detection with complete privacy. Start with server-side detection for easier development, then move to edge detection once you have a well-trained model.

The key to success is good training data - invest the time to collect diverse, high-quality samples!

---

**File: `docs/PRECISE_DEPLOYMENT.md`** (new executable file, 577 lines)

# Mycroft Precise Deployment Guide

## Quick Reference: Server vs Edge Detection

### Server-Side Detection (Recommended for Start)

**Setup:**
```bash
# 1. On Heimdall: set up Precise
./setup_precise.sh --wake-word "hey computer"

# 2. Train your model (follow the scripts in ~/precise-models/hey-computer/)
cd ~/precise-models/hey-computer
./1-record-wake-word.sh
./2-record-not-wake-word.sh
# Organize samples, then:
./3-train-model.sh
./4-test-model.sh

# 3. Start the voice server with Precise
cd ~/voice-assistant
conda activate precise
python voice_server.py \
    --enable-precise \
    --precise-model ~/precise-models/hey-computer/hey-computer.net \
    --precise-sensitivity 0.5
```

**Architecture:**
- Maix Duino → continuous audio stream → Heimdall
- Heimdall runs Precise on the audio stream
- On wake word: process the command with Whisper
- Response → TTS → stream back to the Maix Duino

**Pros:** easier setup, better accuracy, simple updates
**Cons:** more network traffic, requires a stable connection

### Edge Detection (Advanced - Future Phase)

**Setup:**
```bash
# 1. Train the model on Heimdall (same as above)
# 2. Convert it to KMODEL for the K210
# 3. Deploy to the Maix Duino
# (See MYCROFT_PRECISE_GUIDE.md for detailed conversion steps)
```

**Architecture:**
- The Maix Duino runs Precise locally on the K210
- Audio is only sent after the wake word is detected
- Lower latency, less network traffic

**Pros:** lower latency, less bandwidth, works offline
**Cons:** complex conversion, lower accuracy, harder updates

## Phase-by-Phase Deployment

### Phase 1: Server Setup (Day 1)

```bash
# On Heimdall
ssh alan@10.1.10.71

# 1. Set up the voice assistant base
./setup_voice_assistant.sh

# 2. Set up Mycroft Precise
./setup_precise.sh --wake-word "hey computer"

# 3. Configure the environment
vim ~/voice-assistant/config/.env
```

Update `.env`:
```bash
HA_URL=http://your-home-assistant:8123
HA_TOKEN=your_token_here
PRECISE_MODEL=/home/alan/precise-models/hey-computer/hey-computer.net
PRECISE_SENSITIVITY=0.5
```

### Phase 2: Wake Word Training (Day 1-2)

```bash
# Navigate to the training directory
cd ~/precise-models/hey-computer
conda activate precise

# Record samples (30-60 minutes)
./1-record-wake-word.sh      # Record 50-100 wake word samples
./2-record-not-wake-word.sh  # Record 200-500 negative samples

# Organize samples (80/20 train/test split):
# - 80% of wake-word recordings     -> wake-word/
# - 20% of wake-word recordings     -> test/wake-word/
# - 80% of not-wake-word recordings -> not-wake-word/
# - 20% of not-wake-word recordings -> test/not-wake-word/

# Train the model (30-60 minutes)
./3-train-model.sh

# Test the model
./4-test-model.sh

# Evaluate on the test set
./5-evaluate-model.sh

# Tune the threshold
./6-tune-threshold.sh
```

### Phase 3: Server Integration (Day 2)

#### Option A: Manual Testing

```bash
cd ~/voice-assistant
conda activate precise

# Start the server with Precise enabled
python voice_server.py \
    --enable-precise \
    --precise-model ~/precise-models/hey-computer/hey-computer.net \
    --precise-sensitivity 0.5 \
    --ha-url http://your-ha:8123 \
    --ha-token your_token
```

#### Option B: Systemd Service

Update the systemd service to use the Precise environment:

```bash
sudo vim /etc/systemd/system/voice-assistant.service
```

```ini
[Unit]
Description=Voice Assistant with Wake Word Detection
After=network.target

[Service]
Type=simple
User=alan
WorkingDirectory=/home/alan/voice-assistant
Environment="PATH=/home/alan/miniconda3/envs/precise/bin:/usr/local/bin:/usr/bin:/bin"
EnvironmentFile=/home/alan/voice-assistant/config/.env
ExecStart=/home/alan/miniconda3/envs/precise/bin/python voice_server.py \
    --enable-precise \
    --precise-model /home/alan/precise-models/hey-computer/hey-computer.net \
    --precise-sensitivity 0.5
Restart=on-failure
RestartSec=10
StandardOutput=append:/home/alan/voice-assistant/logs/voice_assistant.log
StandardError=append:/home/alan/voice-assistant/logs/voice_assistant_error.log

[Install]
WantedBy=multi-user.target
```

Enable and start:
```bash
sudo systemctl daemon-reload
sudo systemctl enable voice-assistant
sudo systemctl start voice-assistant
sudo systemctl status voice-assistant
```

### Phase 4: Maix Duino Setup (Day 2-3)

For server-side wake word detection, the Maix Duino streams audio.

Update `maix_voice_client.py`:

```python
# Use the simplified mode - just stream audio;
# the server handles wake word detection
CONTINUOUS_STREAM = True      # Enable continuous streaming
WAKE_WORD_CHECK_INTERVAL = 0  # 0 = server-side detection
```

Flash and test:
1. Copy the updated script to the SD card
2. Boot the Maix Duino
3. Check the serial console for the connection
4. Speak the wake word
5. Verify the server logs show the detection

### Phase 5: Testing & Tuning (Day 3-7)

#### Test Wake Word Detection

```bash
# Monitor server logs
journalctl -u voice-assistant -f

# Or check detections via the API
curl http://10.1.10.71:5000/wake-word/detections
```

#### Test the End-to-End Flow

1. Say the wake word: "Hey Computer"
2. Wait for the LED/beep on the Maix Duino
3. Say a command: "Turn on the living room lights"
4. Verify the HA command executes
5. Hear the TTS response

#### Monitor Performance

```bash
# Follow the wake word log
tail -f ~/voice-assistant/logs/wake_words.log

# Count logged detections (the log has one "timestamp,confidence" line each)
wc -l < ~/voice-assistant/logs/wake_words.log

# Sanity check:
# You should see detections when you say the wake word,
# and none during normal conversation
```
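Since the log is one `timestamp,confidence` line per detection (the format written by the `log_wake_word` snippet in these docs), a short script can turn it into detections per hour for spotting false-positive bursts. A stdlib-only sketch; the log path in the usage comment is the example path used throughout this guide:

```python
import datetime
from collections import Counter

def detections_per_hour(log_path):
    """Count wake word detections per hour from a
    'timestamp,confidence' CSV log."""
    counts = Counter()
    with open(log_path) as f:
        for line in f:
            ts_str, _, _conf = line.strip().partition(",")
            ts = datetime.datetime.fromisoformat(ts_str)
            counts[ts.strftime("%Y-%m-%d %H:00")] += 1
    return counts

# Example usage:
# for hour, n in sorted(detections_per_hour(
#         "/home/alan/voice-assistant/logs/wake_words.log").items()):
#     print(hour, n)
```

An hour with many detections while nobody was speaking the wake word is a clear signal to raise the sensitivity threshold or collect hard negatives.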

#### Tune Sensitivity

If you get too many false positives:
```bash
# Increase the threshold (more conservative)
# Edit the systemd service, or restart with:
python voice_server.py --precise-sensitivity 0.7
```

If wake words are being missed:
```bash
# Decrease the threshold (more aggressive)
python voice_server.py --precise-sensitivity 0.3
```

#### Collect Hard Examples

```bash
# When you notice a false positive, record it
cd ~/precise-models/hey-computer
precise-collect -f not-wake-word/false-positive-$(date +%s).wav

# When a wake word is missed, record it
precise-collect -f wake-word/missed-$(date +%s).wav

# After collecting 10-20 examples, retrain
./3-train-model.sh
```

## Monitoring Commands

### Check System Status

```bash
# Service status
sudo systemctl status voice-assistant

# Server health
curl http://10.1.10.71:5000/health

# Wake word status
curl http://10.1.10.71:5000/wake-word/status

# Recent detections
curl http://10.1.10.71:5000/wake-word/detections
```

### View Logs

```bash
# Real-time server logs
journalctl -u voice-assistant -f

# Last 50 lines
journalctl -u voice-assistant -n 50

# Specific log file
tail -f ~/voice-assistant/logs/voice_assistant.log

# Wake word detections
tail -f ~/voice-assistant/logs/wake_words.log

# Maix Duino serial console
screen /dev/ttyUSB0 115200
```

### Performance Metrics

```bash
# CPU usage (should be ~5-10% idle, with spikes during processing)
top -p $(pgrep -f voice_server.py)

# Memory usage
ps aux | grep voice_server.py

# Network traffic (if streaming audio)
iftop -i eth0  # or your network interface
```

## Troubleshooting

### Wake Word Not Detecting

**Check that the model is loaded:**
```bash
curl http://10.1.10.71:5000/wake-word/status
# Should show: "enabled": true
```

**Test the model directly:**
```bash
conda activate precise
precise-listen ~/precise-models/hey-computer/hey-computer.net
# Speak the wake word - you should see "!"
```

**Check the sensitivity:**
```bash
# Try a lower threshold
precise-listen ~/precise-models/hey-computer/hey-computer.net -s 0.3
```

**Verify the audio input:**
```bash
# Test the microphone
arecord -d 5 test.wav
aplay test.wav
```

### Too Many False Positives

**Increase the threshold:**
```bash
# Edit the service, or restart with a higher threshold
python voice_server.py --precise-sensitivity 0.7
```

**Retrain with the false positives:**
```bash
cd ~/precise-models/hey-computer
# Record the false triggers into not-wake-word/
precise-collect -f not-wake-word/false-triggers.wav
# Then retrain with the expanded not-wake-word set
./3-train-model.sh
```

### Server Won't Start with Precise

**Check the Precise installation:**
```bash
conda activate precise
python -c "from precise_runner import PreciseRunner; print('OK')"
```

**Check the engine binary:**
```bash
command -v precise-engine
# Should print the path to the precise-engine binary
```

**Check the model file:**
```bash
ls -lh ~/precise-models/hey-computer/hey-computer.net
file ~/precise-models/hey-computer/hey-computer.net
```

**Check permissions:**
```bash
chmod +x /usr/local/bin/precise-engine
chmod 644 ~/precise-models/hey-computer/hey-computer.net
```

### Audio Quality Issues

**Test the audio path:**
```bash
# Record a test clip on the server
arecord -f S16_LE -r 16000 -c 1 -d 5 test.wav

# Transcribe it with Whisper
conda activate voice-assistant
python -c "
import whisper
model = whisper.load_model('base')
result = model.transcribe('test.wav')
print(result['text'])
"
```

**If the quality is poor:**
- Check the microphone connection
- Verify the sample rate (16 kHz)
- Test with a USB microphone
- Check for interference/noise

### Maix Duino Connection Issues

**Check WiFi:**
```python
# In the Maix Duino serial console
import network
wlan = network.WLAN(network.STA_IF)
print(wlan.isconnected())
print(wlan.ifconfig())
```

**Check server reachability:**
```python
# From the Maix Duino
import urequests
response = urequests.get('http://10.1.10.71:5000/health')
print(response.json())
```

**Check audio streaming:**
```bash
# On Heimdall, monitor the network
sudo tcpdump -i any -n host <maix-duino-ip>
# You should see a continuous packet flow while streaming
```

## Optimization Tips

### Reduce Latency

1. **Use a smaller Whisper model:**
   ```bash
   # Edit .env
   WHISPER_MODEL=base  # or tiny
   ```

2. **Optimize the Precise sensitivity:**
   ```bash
   # Find the sweet spot between false positives and latency:
   # a lower threshold triggers faster but misfires more often
   ```

3. **Pre-load models:**
   ```python
   # Load models on startup rather than on the first request.
   # This adds ~30 s to startup but eliminates the first-request delay.
   ```

### Improve Accuracy

1. **Use a larger Whisper model:**
   ```bash
   WHISPER_MODEL=large
   ```

2. **Record more wake word samples:**
   ```bash
   # Aim for 100+ high-quality samples with
   # diverse speakers, conditions, and distances
   ```

3. **Increase the training epochs:**
   ```bash
   # In 3-train-model.sh
   precise-train -e 120 hey-computer.net .  # vs. the default 60
   ```

### Reduce False Positives

1. **Collect hard negatives:**
   ```bash
   # Record TV, music, and similar-sounding phrases,
   # then add them to the not-wake-word training set
   ```

2. **Increase the threshold:**
   ```bash
   --precise-sensitivity 0.7  # vs. the default 0.5
   ```

3. **Use an ensemble:**
   ```python
   # Run multiple models and require agreement.
   # Advanced - requires code modification.
   ```
|
||||||
|
|
||||||
|
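The ensemble idea in item 3 boils down to a voting rule. This is a hedged sketch only: the probability inputs are placeholders for the activation scores each loaded Precise model would report for the same audio chunk.

```python
def ensemble_detect(probabilities, threshold=0.5, min_agree=2):
    """Trigger only when at least `min_agree` models exceed the threshold."""
    votes = sum(1 for p in probabilities if p >= threshold)
    return votes >= min_agree

# Two of three models agree -> accept the wake word
print(ensemble_detect([0.8, 0.6, 0.2]))  # True
# Only one model fires -> reject (likely a false positive)
print(ensemble_detect([0.8, 0.3, 0.2]))  # False
```

Requiring agreement trades a little recall for a sharp drop in false positives, which is usually the right trade in a noisy living room.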
## Production Checklist

- [ ] Wake word model trained with 50+ samples
- [ ] Model tested with a <5% false positive rate
- [ ] Server service enabled and auto-starting
- [ ] Home Assistant token configured
- [ ] Maix Duino WiFi configured
- [ ] End-to-end test successful
- [ ] Logs rotating properly
- [ ] Monitoring in place
- [ ] Backup of the trained model
- [ ] Documentation updated

## Backup and Recovery

### Backup Trained Model

```bash
# Back up the model locally
cp ~/precise-models/hey-computer/hey-computer.net \
   ~/precise-models/hey-computer/hey-computer.net.backup

# Back up to another host
scp ~/precise-models/hey-computer/hey-computer.net \
    user@backup-host:/path/to/backups/
```

### Restore from Backup

```bash
# Restore the model
cp ~/precise-models/hey-computer/hey-computer.net.backup \
   ~/precise-models/hey-computer/hey-computer.net

# Restart the service
sudo systemctl restart voice-assistant
```

## Next Steps

Once basic server-side detection is working:

1. **Add more intents** - expand Home Assistant control
2. **Implement TTS playback** - complete the audio response loop
3. **Multi-room support** - deploy multiple Maix Duino units
4. **Voice profiles** - train the model on family members
5. **Edge deployment** - convert the model for the K210 (advanced)

## Resources

- Main guide: MYCROFT_PRECISE_GUIDE.md
- Quick start: QUICKSTART.md
- Architecture: maix-voice-assistant-architecture.md
- Mycroft Precise: https://github.com/MycroftAI/mycroft-precise
- Community: https://community.mycroft.ai/

## Support

### Log an Issue

```bash
# Collect debug info
echo "=== System Info ===" > debug.log
uname -a >> debug.log
conda list >> debug.log

echo "=== Service Status ===" >> debug.log
systemctl status voice-assistant >> debug.log

echo "=== Recent Logs ===" >> debug.log
journalctl -u voice-assistant -n 100 >> debug.log

echo "=== Wake Word Status ===" >> debug.log
curl http://10.1.10.71:5000/wake-word/status >> debug.log
```

Then share `debug.log` when asking for help.

### Common Issues Database

| Symptom | Likely Cause | Solution |
|---------|--------------|----------|
| No wake detection | Model not loaded | Check `/wake-word/status` |
| Service won't start | Missing dependencies | Reinstall Precise |
| High false positives | Threshold too low | Increase to 0.7+ |
| Missed wake words | Threshold too high | Decrease to 0.3-0.4 |
| Poor transcription | Bad audio quality | Check the microphone |
| HA commands fail | Wrong token | Update .env |
| High CPU usage | Large Whisper model | Use a smaller model |

## Conclusion

With Mycroft Precise, you have complete control over your wake word detection. Start with server-side detection for easier debugging, collect good training data, and tune the threshold for your environment. Once it's working well, you can optionally move detection to the edge for lower latency.

The key to success: **quality of training data beats quantity**.

Happy voice assisting! 🎙️

470  docs/QUESTIONS_ANSWERED.md  Executable file
@@ -0,0 +1,470 @@

# Your Questions Answered - Quick Reference

## TL;DR: Yes, Yes, and Multiple Options!

### Q1: Pre-trained "Hey Mycroft" Model?

**Answer: YES! ✅**

Download and use it immediately:

```bash
./quick_start_hey_mycroft.sh
# Done in 5 minutes - no training!
```

The pre-trained model works well and saves you 1-2 hours of training time.

### Q2: Multiple Wake Words?

**Answer: YES! ✅ (with considerations)**

**Server-side (Heimdall):** easy; run 3-5 wake words:

```bash
python voice_server_enhanced.py \
    --enable-precise \
    --multi-wake-word
```

**Edge (K210):** feasible for 1-2 wake words, challenging for 3+.

### Q3: Adopting New Users' Voices?

**Answer: Multiple approaches ✅**

**Best option:** train one model with everyone's voices upfront
**Alternative:** incremental retraining as new users join
**Advanced:** speaker identification with personalization

---

## Detailed Answers

### 1. Pre-trained "Hey Mycroft" Model

#### Where to Get It

```bash
# The quick start script does this for you
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz
```

#### How to Use

**Instant deployment:**

```bash
python voice_server.py \
    --enable-precise \
    --precise-model ~/precise-models/pretrained/hey-mycroft.net
```

**Fine-tune with your voice:**

```bash
# Record 20-30 samples of your voice saying "Hey Mycroft"
precise-collect

# Fine-tune from the pre-trained checkpoint
precise-train -e 30 my-hey-mycroft.net . \
    --from-checkpoint ~/precise-models/pretrained/hey-mycroft.net
```

#### Advantages

✅ **Zero training time** - works immediately
✅ **Proven accuracy** - tested by thousands of users
✅ **Good baseline** - already trained on diverse voices
✅ **Easy fine-tuning** - add your voice in ~30 minutes vs 60+ minutes from scratch

#### When to Use Pre-trained vs Custom

**Use pre-trained "Hey Mycroft" when:**
- You want to test quickly
- "Hey Mycroft" is an acceptable wake word
- You want proven accuracy out of the box

**Train a custom model when:**
- You want a different wake word ("Hey Computer", "Jarvis", etc.)
- You need maximum accuracy for your specific environment
- You want a family-specific wake word

**Hybrid (recommended):**
- Start with pre-trained "Hey Mycroft"
- Test and learn the system
- Fine-tune with your own samples
- Or add a custom wake word later

---

### 2. Multiple Wake Words

#### Can You Have Multiple?

**Yes!** Your options:

#### Option A: Server-Side (Recommended)

**Easy implementation:**

```bash
# Use the enhanced server
python voice_server_enhanced.py \
    --enable-precise \
    --multi-wake-word
```

**Configured wake words:**
- "Hey Mycroft" (pre-trained)
- "Hey Computer" (custom)
- "Jarvis" (custom)

**Resource impact:**
- 3 models ≈ 15-30% CPU (Heimdall handles this easily)
- ~300-600MB RAM
- Each model runs independently

**Example use cases:**

```python
"Hey Mycroft, what's the time?"   # -> general assistant
"Jarvis, run diagnostics"         # -> personal assistant mode
"Emergency, call help"            # -> priority/emergency mode
```

#### Option B: Edge (K210)

**Feasible for 1-2 wake words:**

```python
# Sequential checking
def check_wake_words():
    for model in ['hey-mycroft.kmodel', 'emergency.kmodel']:
        if detect_wake_word(model):
            return model
```

**Limitations:**
- +50-100ms latency per additional model
- Memory constraints (6MB total for all models)
- More models = more power consumption

**Recommendation:**
- K210 with 1 wake word: optimal
- K210 with 2 wake words: acceptable
- K210 with 3+ wake words: not recommended

#### Option C: Contextual Wake Words

Different wake words for different purposes:

```python
wake_word_contexts = {
    'hey_mycroft': 'general_assistant',
    'emergency': 'priority_emergency',
    'goodnight': 'bedtime_routine',
}
```

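Routing a detected wake word to its context can then be a plain dictionary lookup. A minimal sketch using the mapping above; the fallback to the general assistant for unknown wake words is an assumption, not existing server behavior:

```python
wake_word_contexts = {
    'hey_mycroft': 'general_assistant',
    'emergency': 'priority_emergency',
    'goodnight': 'bedtime_routine',
}

def context_for(wake_word):
    # Fall back to the general assistant for unrecognized wake words.
    return wake_word_contexts.get(wake_word, 'general_assistant')

print(context_for('emergency'))  # priority_emergency
print(context_for('jarvis'))     # general_assistant (fallback)
```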
#### Should You Use Multiple?

**One wake word is usually enough!**

Commercial products (Alexa, Google) use a single wake word, and they work fine.

**Use multiple when:**
- Different family members want different wake words
- You want context-specific behaviors (emergency vs. general)
- You enjoy the flexibility

**Start with one, add more later if needed.**

---

### 3. Adopting New Users' Voices

#### Challenge

Same wake word, different voices:
- Mom says "Hey Mycroft" (soprano)
- Dad says "Hey Mycroft" (bass)
- The kids say "Hey Mycroft" (high-pitched)

All of them need to work!

#### Solution 1: Diverse Training (Recommended)

**During initial training, have everyone record samples:**

```bash
cd ~/precise-models/family-hey-mycroft

# Session 1: Mom records 30 samples
precise-collect  # Mom says "Hey Mycroft" 30 times

# Session 2: Dad records 30 samples
precise-collect  # Dad says "Hey Mycroft" 30 times

# Session 3: the kids record 20 samples each
precise-collect  # Kids say "Hey Mycroft" 40 times total

# Train one model with all voices
precise-train -e 60 family-hey-mycroft.net .

# Deploy
python voice_server.py \
    --enable-precise \
    --precise-model family-hey-mycroft.net
```

**Pros:**
✅ One model works for everyone
✅ Simple deployment
✅ No switching needed
✅ Works from day one

**Cons:**
❌ Needs everyone's time upfront
❌ Slightly lower per-person accuracy than individual models

#### Solution 2: Incremental Training

**Start with one person, add others over time:**

```bash
# Week 1: train with Dad's voice
precise-train -e 60 hey-mycroft.net .

# Week 2: Mom wants to use it
# Collect Mom's samples
precise-collect  # Mom records 20-30 samples

# Add them to the training set
cp mom-samples/* wake-word/

# Retrain from the checkpoint (faster!)
precise-train -e 30 hey-mycroft.net . \
    --from-checkpoint hey-mycroft.net

# Now it works for both Dad and Mom!

# Week 3: the kids want in
# Repeat the process...
```

**Pros:**
✅ No need for everyone upfront
✅ Easy to add new users
✅ The model improves gradually

**Cons:**
❌ New users may have issues initially
❌ Requires periodic retraining

#### Solution 3: Speaker Identification (Advanced)

**Identify who's speaking and use a personalized model/settings:**

```bash
# Install speaker ID dependencies
pip install pyannote.audio scipy --break-system-packages

# Use the enhanced server
python voice_server_enhanced.py \
    --enable-precise \
    --enable-speaker-id \
    --hf-token YOUR_HF_TOKEN
```

**Enroll users:**

```bash
# Record a 30-second voice sample from each person,
# then POST it to /speakers/enroll with the audio + name

curl -F "name=alan" \
    -F "audio=@alan_voice.wav" \
    http://localhost:5000/speakers/enroll

curl -F "name=sarah" \
    -F "audio=@sarah_voice.wav" \
    http://localhost:5000/speakers/enroll
```

**Benefits:**

```python
# Different responses per user
if speaker == 'alan':
    turn_on('light.alan_office')
elif speaker == 'sarah':
    turn_on('light.sarah_office')

# Different permissions
if speaker == 'kids' and command.startswith('buy'):
    return "Sorry, kids can't make purchases"
```

**Pros:**
✅ Personalized responses
✅ User-specific settings
✅ Better accuracy (optimized per voice)
✅ Can track who said what

**Cons:**
❌ More complex
❌ Privacy considerations
❌ Additional CPU/RAM (~10% + 200MB)
❌ Requires voice enrollment

#### Solution 4: Pre-trained Model (Easiest)

**"Hey Mycroft" already includes diverse voices!**

```bash
# Just use it - it's already trained on many voices
./quick_start_hey_mycroft.sh
```

The community model was trained with:
- Male and female voices
- Different accents
- Different ages
- Various environments

**It should work for most family members out of the box!**

Then fine-tune if needed.

---

## Recommended Path for Your Situation

### Scenario: Family of 3-4 People

**Week 1: Quick Start**

```bash
# Use the pre-trained "Hey Mycroft"
./quick_start_hey_mycroft.sh

# Test with all family members -
# it likely works for everyone already!
```

**Week 2: Fine-tune if Needed**

```bash
# If someone has issues, have them record 20 samples
# and fine-tune the model:

precise-train -e 30 family-hey-mycroft.net . \
    --from-checkpoint ~/precise-models/pretrained/hey-mycroft.net
```

**Week 3: Add Features**

```bash
# If you want personalization:
python voice_server_enhanced.py \
    --enable-speaker-id

# Then enroll each family member
```

### Scenario: Just You (or 1-2 People)

**Option 1: Pre-trained**

```bash
./quick_start_hey_mycroft.sh
# Done!
```

**Option 2: Custom Wake Word**

```bash
# Train a custom "Hey Computer"
cd ~/precise-models/hey-computer
./1-record-wake-word.sh      # 50 samples
./2-record-not-wake-word.sh  # 200 samples
./3-train-model.sh
```

### Scenario: Multiple People + Multiple Wake Words

**Full setup:**

```bash
# Pre-trained for the family
./quick_start_hey_mycroft.sh

# Personal wake word for Dad
cd ~/precise-models/jarvis
# Train the custom wake word

# Emergency wake word
cd ~/precise-models/emergency
# Train the emergency wake word

# Run the multi-wake-word server
python voice_server_enhanced.py \
    --enable-precise \
    --multi-wake-word \
    --enable-speaker-id
```

---

## Quick Decision Matrix

| Your Situation | Recommendation |
|----------------|----------------|
| **Just getting started** | Pre-trained "Hey Mycroft" |
| **Want a different wake word** | Train a custom model |
| **Family of 3-4** | Pre-trained + fine-tune if needed |
| **Want personalization** | Add speaker ID |
| **Multiple purposes** | Multiple wake words (server-side) |
| **Deploying to K210** | 1 wake word, no speaker ID |

---

## Files to Use

**Quick start with the pre-trained model:**
- `quick_start_hey_mycroft.sh` - zero training, 5 minutes!

**Multiple wake words:**
- `voice_server_enhanced.py` - multi-wake-word + speaker ID support

**Training a custom model:**
- `setup_precise.sh` - sets up the training environment
- Scripts in `~/precise-models/your-wake-word/`

**Documentation:**
- `WAKE_WORD_ADVANCED.md` - detailed, comprehensive guide
- `PRECISE_DEPLOYMENT.md` - production deployment

---

## Summary

✅ **Yes**, a pre-trained "Hey Mycroft" model exists and works well
✅ **Yes**, you can have multiple wake words (server-side is easy)
✅ **Yes**, there are multiple approaches to multi-user support

**Recommended approach:**
1. Start with `./quick_start_hey_mycroft.sh` (5 minutes)
2. Test with all family members
3. Fine-tune if anyone has issues
4. Add speaker ID later if you want personalization
5. Consider multiple wake words only if you have specific use cases

**Keep it simple!** One pre-trained wake word works for most people.

---

## Next Actions

**Ready to start?**

```bash
# 5-minute quick start
./quick_start_hey_mycroft.sh

# Or read more first
cat WAKE_WORD_ADVANCED.md
```

**Questions?**
- Pre-trained models: see WAKE_WORD_ADVANCED.md § Pre-trained
- Multiple wake words: see WAKE_WORD_ADVANCED.md § Multiple Wake Words
- Voice adaptation: see WAKE_WORD_ADVANCED.md § Voice Adaptation

**Happy voice assisting! 🎙️**

421  docs/QUICKSTART.md  Executable file
@@ -0,0 +1,421 @@

# Maix Duino Voice Assistant - Quick Start Guide

## Overview

This guide walks you through setting up a local, privacy-focused voice assistant using your Maix Duino board and a Home Assistant integration. All processing happens on your local network - no cloud services required.

## What You'll Build

- Wake word detection on the Maix Duino (edge device)
- Speech-to-text using Whisper on Heimdall
- Home Assistant integration for smart home control
- Text-to-speech responses using Piper
- All processing local to your 10.1.10.0/24 network

## Hardware Requirements

- [x] Sipeed Maix Duino board (you have this!)
- [ ] I2S MEMS microphone (or microphone array)
- [ ] Small speaker (3-5W) or audio output
- [ ] MicroSD card (4GB+) formatted as FAT32
- [ ] USB-C cable for power and programming

## Network Prerequisites

- The Maix Duino needs WiFi access to your 10.1.10.0/24 network
- Heimdall (10.1.10.71) for AI processing
- A Home Assistant instance (configure its URL during setup)

## Setup Process

### Phase 1: Server Setup (Heimdall)

#### Step 1: Run the setup script

```bash
# Transfer files to Heimdall
scp setup_voice_assistant.sh voice_server.py alan@10.1.10.71:~/

# SSH to Heimdall
ssh alan@10.1.10.71

# Make the setup script executable and run it
chmod +x setup_voice_assistant.sh
./setup_voice_assistant.sh
```

#### Step 2: Configure Home Assistant access

```bash
# Edit the config file
vim ~/voice-assistant/config/.env
```

Update these values:

```env
HA_URL=http://your-home-assistant:8123
HA_TOKEN=your_long_lived_access_token_here
```

To get a long-lived access token:
1. Open Home Assistant
2. Click your profile (bottom left)
3. Scroll to "Long-Lived Access Tokens"
4. Click "Create Token"
5. Copy the token and paste it into .env

#### Step 3: Test the server

```bash
cd ~/voice-assistant
./test_server.sh
```

You should see:

```
Loading Whisper model: medium
Whisper model loaded successfully
Starting voice processing server on 0.0.0.0:5000
```

#### Step 4: Test with curl (from another terminal)

```bash
# Test the health endpoint
curl http://10.1.10.71:5000/health

# Should return:
# {"status":"healthy","whisper_loaded":true,"ha_connected":true}
```

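If you script this check, a small validator for the response shape shown above can fail fast on a half-started server. The key names are taken from the example response; this is a hedged sketch, not part of the shipped code:

```python
def healthy(payload):
    # Expect the exact shape from the example:
    # {"status":"healthy","whisper_loaded":true,"ha_connected":true}
    return (payload.get("status") == "healthy"
            and payload.get("whisper_loaded") is True
            and payload.get("ha_connected") is True)

print(healthy({"status": "healthy", "whisper_loaded": True, "ha_connected": True}))  # True
print(healthy({"status": "healthy", "whisper_loaded": False, "ha_connected": True}))  # False
```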
### Phase 2: Maix Duino Setup

#### Step 1: Flash MaixPy firmware

1. Download the latest MaixPy firmware from: https://dl.sipeed.com/MAIX/MaixPy/release/
2. Download the Kflash GUI: https://github.com/sipeed/kflash_gui
3. Connect the Maix Duino via USB
4. Flash the firmware using the Kflash GUI

#### Step 2: Prepare the SD card

```bash
# Format the SD card as FAT32,
# then create the directory structure:
mkdir -p /path/to/sdcard/models

# Copy the client script
cp maix_voice_client.py /path/to/sdcard/main.py
```

#### Step 3: Configure WiFi settings

Edit `/path/to/sdcard/main.py`:

```python
# WiFi settings
WIFI_SSID = "YourNetworkName"
WIFI_PASSWORD = "YourPassword"

# Server settings
VOICE_SERVER_URL = "http://10.1.10.71:5000"
```

#### Step 4: Test the board

1. Insert the SD card into the Maix Duino
2. Connect to the serial console (115200 baud):

```bash
screen /dev/ttyUSB0 115200
# or
minicom -D /dev/ttyUSB0 -b 115200
```

3. Power on the board
4. Watch the serial output for connection status

### Phase 3: Integration & Testing

#### Test 1: Basic connectivity

1. The Maix Duino should connect to WiFi and display its IP on the LCD
2. The server logs should show when the Maix connects

#### Test 2: Audio capture

The current implementation uses amplitude-based wake word detection as a placeholder. To test:

1. Clap loudly near the microphone
2. Speak a command (e.g., "turn on the living room lights")
3. Watch the LCD for the transcription and response

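The placeholder detection amounts to a peak-amplitude threshold. This sketch is illustrative rather than the client's actual code, and `THRESHOLD` is a made-up value you would tune for your microphone:

```python
THRESHOLD = 20000  # hypothetical peak for 16-bit PCM samples; tune per mic

def amplitude_triggered(samples, threshold=THRESHOLD):
    # "Wake" when the chunk's loudest sample crosses the threshold.
    return max(abs(s) for s in samples) >= threshold

print(amplitude_triggered([100, -250, 30000]))  # loud clap -> True
print(amplitude_triggered([100, -250, 500]))    # quiet room -> False
```

This explains why a loud clap works as a stand-in wake word: any sufficiently loud sound triggers it, which is also why it must eventually be replaced by a trained model.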
#### Test 3: Home Assistant control

Supported commands (add more in voice_server.py):
- "Turn on the living room lights"
- "Turn off the bedroom lights"
- "What's the temperature?"
- "Toggle the kitchen lights"

### Phase 4: Wake Word Training (Advanced)

The placeholder wake word detection uses simple amplitude triggering. For production use:

#### Option A: Use Porcupine (easiest)

1. Sign up at: https://console.picovoice.ai/
2. Train a custom wake word
3. Download the .ppn model
4. Convert it to .kmodel for the K210

#### Option B: Use Mycroft Precise (FOSS)

```bash
# On a machine with a GPU
conda create -n precise python=3.6
conda activate precise
pip install precise-runner

# Record wake word samples
precise-collect

# Train the model
precise-train -e 60 my-wake-word.net my-wake-word/

# Convert to .kmodel
# (requires additional tools - see the MaixPy docs)
```

## Architecture Diagram

```
┌─────────────────────────────────────────────────────────────┐
│              Your Home Network (10.1.10.0/24)               │
│                                                             │
│  ┌──────────────┐          ┌──────────────┐                 │
│  │  Maix Duino  │─────────>│   Heimdall   │                 │
│  │  10.1.10.xxx │   Audio  │  10.1.10.71  │                 │
│  │              │<─────────│              │                 │
│  │ - Wake Word  │  Response│ - Whisper    │                 │
│  │ - Mic Input  │          │ - Piper TTS  │                 │
│  │ - Speaker    │          │ - Flask API  │                 │
│  └──────────────┘          └──────┬───────┘                 │
│                                   │                         │
│                                   │ REST API                │
│                                   v                         │
│                            ┌──────────────┐                 │
│                            │  Home Asst.  │                 │
│                            │ homeassistant│                 │
│                            │              │                 │
│                            │ - Devices    │                 │
│                            │ - Automation │                 │
│                            └──────────────┘                 │
└─────────────────────────────────────────────────────────────┘
```

## Troubleshooting

### Maix Duino won't connect to WiFi

```python
# Check the serial output for errors.
# Common issues:
# - Wrong SSID/password
# - WPA3 not supported (use WPA2)
# - 5GHz network (use 2.4GHz)
```

### Whisper transcription is slow

```bash
# Use a smaller model on Heimdall.
# Edit ~/voice-assistant/config/.env:
WHISPER_MODEL=base  # or tiny for the fastest
```

### Home Assistant commands don't work

```bash
# Check the server logs
journalctl -u voice-assistant -f

# Test the HA connection manually
curl -H "Authorization: Bearer YOUR_TOKEN" \
    http://your-ha:8123/api/states
```

### Audio quality is poor

1. Check the microphone connections
2. Adjust `SAMPLE_RATE` in maix_voice_client.py
3. Test with a USB microphone first
4. Consider a microphone array for better pickup

### Out of memory on the Maix Duino

```python
# In main_loop(), collect garbage more often:
if gc.mem_free() < 200000:  # increase the threshold
    gc.collect()
```

## Adding New Intents

Edit `voice_server.py` and add patterns to `IntentParser.PATTERNS`:

```python
PATTERNS = {
    # Existing patterns...

    'set_temperature': [
        r'set (?:the )?temperature to (\d+)',
        r'make it (\d+) degrees',
    ],
}
```

Then add the handler in `execute_intent()`:

```python
elif intent == 'set_temperature':
    temp = params.get('temperature')
    success = ha_client.call_service(
        'climate', 'set_temperature',
        entity_id, temperature=temp
    )
    return f"Set temperature to {temp} degrees"
```

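Before wiring a new pattern into the server, it helps to sanity-check it against the phrasings you expect. A standalone sketch using the two patterns above; the helper name and the lowercasing step are illustrative assumptions, not the parser's actual behavior:

```python
import re

# Patterns copied from the 'set_temperature' example above.
patterns = [
    r'set (?:the )?temperature to (\d+)',
    r'make it (\d+) degrees',
]

def parse_temperature(text):
    # Lowercase the input, assuming transcripts are normalized before matching.
    for pattern in patterns:
        match = re.search(pattern, text.lower())
        if match:
            return int(match.group(1))
    return None

print(parse_temperature("Set the temperature to 72"))  # 72
print(parse_temperature("make it 68 degrees"))         # 68
print(parse_temperature("turn on the lights"))         # None
```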
## Entity Mapping

Add your Home Assistant entities to `IntentParser.ENTITY_MAP`:

```python
ENTITY_MAP = {
    # Lights
    'living room light': 'light.living_room',
    'bedroom light': 'light.bedroom',

    # Climate
    'thermostat': 'climate.main_floor',
    'temperature': 'sensor.main_floor_temperature',

    # Switches
    'coffee maker': 'switch.coffee_maker',
    'fan': 'switch.bedroom_fan',

    # Media
    'tv': 'media_player.living_room_tv',
    'music': 'media_player.whole_house',
}
```

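A hypothetical sketch of how a transcribed command might be resolved against this map; the real `IntentParser` may differ. Matching the longest friendly names first prevents a short name like "fan" from shadowing a longer one that contains it:

```python
ENTITY_MAP = {
    'living room light': 'light.living_room',
    'bedroom light': 'light.bedroom',
    'coffee maker': 'switch.coffee_maker',
}

def resolve_entity(command):
    # Longest match first, so "living room light" wins over shorter substrings.
    for name in sorted(ENTITY_MAP, key=len, reverse=True):
        if name in command.lower():
            return ENTITY_MAP[name]
    return None

print(resolve_entity("Turn on the living room light"))  # light.living_room
print(resolve_entity("open the garage"))                # None
```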
## Performance Tuning

### Reduce latency

1. Use the Whisper `tiny` or `base` model
2. Implement streaming audio (currently batch)
3. Pre-load the TTS models
4. Use a faster TTS engine (e.g., espeak)

### Improve accuracy

1. Use the Whisper `large` model (slower)
2. Train a custom wake word
3. Add an NLU layer (Rasa, spaCy)
4. Collect samples and fine-tune on your voice

## Next Steps

### Short term

- [ ] Add more Home Assistant entity mappings
- [ ] Implement Piper TTS playback on the Maix Duino
- [ ] Train a custom wake word model
- [ ] Add LED animations for better feedback
- [ ] Implement conversation context

### Medium term

- [ ] Multi-room support (multiple Maix Duino units)
- [ ] Voice profiles for different users
- [ ] Integration with Plex for media control
- [ ] Calendar and reminder functionality
- [ ] Weather updates from a local weather station

### Long term

- [ ] Custom skills/plugins system
- [ ] Integration with other services (Nextcloud, Matrix)
- [ ] Sound event detection (doorbell, smoke alarm)
- [ ] Intercom functionality between rooms
- [ ] Voice-controlled automation creation

## Alternatives & Fallbacks

If the Maix Duino proves limiting:

### Raspberry Pi Zero 2 W

- More processing power
- Better software support
- USB audio support
- Cost: ~$15

### ESP32-S3

- Better WiFi
- More RAM (8MB)
- Cheaper (~$10)
- Good community support

### Orange Pi Zero 2

- Quad-core ARM Cortex-A53
- 512MB-1GB RAM
- Full Linux support
- Cost: ~$20

## Resources
|
||||||
|
|
||||||
|
### Documentation
|
||||||
|
- Maix Duino: https://wiki.sipeed.com/hardware/en/maix/
|
||||||
|
- MaixPy: https://maixpy.sipeed.com/
|
||||||
|
- Whisper: https://github.com/openai/whisper
|
||||||
|
- Piper TTS: https://github.com/rhasspy/piper
|
||||||
|
- Home Assistant API: https://developers.home-assistant.io/
|
||||||
|
|
||||||
|
### Community Projects
|
||||||
|
- Rhasspy: https://rhasspy.readthedocs.io/
|
||||||
|
- Willow: https://github.com/toverainc/willow
|
||||||
|
- Mycroft: https://mycroft.ai/
|
||||||
|
|
||||||
|
### Wake Word Tools
|
||||||
|
- Porcupine: https://picovoice.ai/platform/porcupine/
|
||||||
|
- Mycroft Precise: https://github.com/MycroftAI/mycroft-precise
|
||||||
|
- Snowboy (archived): https://github.com/Kitt-AI/snowboy
|
||||||
|
|
||||||
|
## Getting Help
|
||||||
|
|
||||||
|
### Check logs
|
||||||
|
```bash
|
||||||
|
# Server logs (if using systemd)
|
||||||
|
sudo journalctl -u voice-assistant -f
|
||||||
|
|
||||||
|
# Or manual log file
|
||||||
|
tail -f ~/voice-assistant/logs/voice_assistant.log
|
||||||
|
|
||||||
|
# Maix Duino serial console
|
||||||
|
screen /dev/ttyUSB0 115200
|
||||||
|
```
|
||||||
|
|
||||||
|
### Common issues and solutions
|
||||||
|
See the Troubleshooting section above
|
||||||
|
|
||||||
|
### Useful commands
|
||||||
|
```bash
|
||||||
|
# Restart service
|
||||||
|
sudo systemctl restart voice-assistant
|
||||||
|
|
||||||
|
# Check service status
|
||||||
|
sudo systemctl status voice-assistant
|
||||||
|
|
||||||
|
# Test HA connection
|
||||||
|
curl http://10.1.10.71:5000/health
|
||||||
|
|
||||||
|
# Monitor Maix Duino
|
||||||
|
minicom -D /dev/ttyUSB0 -b 115200
|
||||||
|
```
|
||||||
|
|
||||||
|
## Cost Breakdown
|
||||||
|
|
||||||
|
| Item | Cost | Status |
|
||||||
|
|------|------|--------|
|
||||||
|
| Maix Duino | $30 | Have it! |
|
||||||
|
| I2S Microphone | $5-10 | Need |
|
||||||
|
| Speaker | $10 | Need (or use existing) |
|
||||||
|
| MicroSD Card | $5 | Have it? |
|
||||||
|
| **Total** | **$15-25** | (vs $50+ commercial) |
|
||||||
|
|
||||||
|
**Benefits of local solution:**
|
||||||
|
- No subscription fees
|
||||||
|
- Complete privacy (no cloud)
|
||||||
|
- Customizable to your needs
|
||||||
|
- Integration with existing infrastructure
|
||||||
|
- Learning experience!
|
||||||
|
|
||||||
|
## Conclusion
|
||||||
|
|
||||||
|
You now have everything you need to build a local, privacy-focused voice assistant! The setup leverages your existing infrastructure (Heimdall for processing, Home Assistant for automation) while keeping costs minimal.
|
||||||
|
|
||||||
|
Start with the basic setup, test each component, then iterate and improve. The beauty of this approach is you can enhance it over time without being locked into a commercial platform.
|
||||||
|
|
||||||
|
Good luck, and enjoy your new voice assistant! 🎙️
|
723 docs/WAKE_WORD_ADVANCED.md Executable file
@ -0,0 +1,723 @@
# Wake Word Models: Pre-trained, Multiple, and Voice Adaptation

## Pre-trained Wake Word Models

### Yes! "Hey Mycroft" Already Exists

Mycroft provides several pre-trained models that you can use immediately:

#### Available Pre-trained Models

**Hey Mycroft** (Official)

```bash
# Download from Mycroft's model repository
cd ~/precise-models/pretrained
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz

# Test immediately
conda activate precise
precise-listen hey-mycroft.net

# Should detect "Hey Mycroft" right away!
```

**Other Available Models:**

- **Hey Mycroft** - Best tested, most reliable
- **Christopher** - Alternative wake word
- **Hey Jarvis** - Community contributed
- **Computer** - Star Trek style

#### Using Pre-trained Models

**Option 1: Use as-is**

```bash
# Just point your server to the pre-trained model
python voice_server.py \
    --enable-precise \
    --precise-model ~/precise-models/pretrained/hey-mycroft.net \
    --precise-sensitivity 0.5
```

**Option 2: Fine-tune for your voice**

```bash
# Use pre-trained as starting point, add your samples
cd ~/precise-models/my-hey-mycroft

# Record additional samples
precise-collect

# Train from checkpoint (much faster than from scratch!)
precise-train -e 30 my-hey-mycroft.net . \
    --from-checkpoint ~/precise-models/pretrained/hey-mycroft.net

# This adds your voice/environment while keeping the base model
```

**Option 3: Ensemble with custom**

```python
# Use both pre-trained and custom model
# Require both to agree (reduces false positives)
# See implementation below
```
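The ensemble idea can be sketched in a few lines, assuming each model exposes a confidence score per audio window (function and threshold names are illustrative, not an existing API):

```python
def ensemble_detect(pretrained_score: float,
                    custom_score: float,
                    threshold: float = 0.5) -> bool:
    """Accept a detection only when both models agree.

    Requiring agreement trades a little sensitivity for far fewer
    false positives.
    """
    return pretrained_score >= threshold and custom_score >= threshold

print(ensemble_detect(0.9, 0.7))  # True: both models fire
print(ensemble_detect(0.9, 0.2))  # False: custom model disagrees
```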

### Advantages of Pre-trained Models

✅ **Instant deployment** - No training required
✅ **Proven accuracy** - Tested by thousands of users
✅ **Good starting point** - Fine-tune rather than train from scratch
✅ **Multiple speakers** - Already includes diverse voices
✅ **Save time** - Skip 1-2 hours of training

### Disadvantages

❌ **Generic** - Not optimized for your voice/environment
❌ **May need tuning** - Threshold adjustment required
❌ **Limited choice** - Only a few wake words available

### Recommendation

**Start with the "Hey Mycroft"** pre-trained model:

1. Deploy immediately (zero training time)
2. Test in your environment
3. Collect false positives/negatives
4. Fine-tune with your examples
5. Best of both worlds!

## Multiple Wake Words

### Can You Have Multiple Wake Words?

**Short answer:** Yes, but with tradeoffs.

### Implementation Approaches

#### Approach 1: Server-Side Multiple Models (Recommended)

Run multiple Precise models in parallel on Heimdall:

```python
# In voice_server.py
import os
import queue
import time

from precise_runner import PreciseEngine, PreciseRunner

# Queue consumed by the command-processing loop
wake_word_queue = queue.Queue()

# Global runners for each wake word
precise_runners = {}
wake_word_configs = {
    'hey_mycroft': {
        'model': '~/precise-models/pretrained/hey-mycroft.net',
        'sensitivity': 0.5,
        'response': 'Yes?'
    },
    'hey_computer': {
        'model': '~/precise-models/hey-computer/hey-computer.net',
        'sensitivity': 0.5,
        'response': "I'm listening"
    },
    'jarvis': {
        'model': '~/precise-models/jarvis/jarvis.net',
        'sensitivity': 0.6,
        'response': 'At your service, sir'
    }
}

def on_wake_word_detected(wake_word_name):
    """Build a callback that tags detections with the wake word's name"""
    def callback():
        print(f"Wake word detected: {wake_word_name}")
        wake_word_queue.put({
            'timestamp': time.time(),
            'wake_word': wake_word_name,
            'response': wake_word_configs[wake_word_name]['response']
        })
    return callback

def start_multiple_wake_words():
    """Start one Precise listener per configured wake word"""
    for name, config in wake_word_configs.items():
        engine = PreciseEngine(
            '/usr/local/bin/precise-engine',
            os.path.expanduser(config['model'])
        )

        runner = PreciseRunner(
            engine,
            sensitivity=config['sensitivity'],
            on_activation=on_wake_word_detected(name)
        )

        runner.start()
        precise_runners[name] = runner
        print(f"Started wake word listener: {name}")
```

**Resource Usage:**

- CPU: ~5-10% per model (3 models = ~15-30%)
- RAM: ~100-200MB per model
- Still very manageable on Heimdall

**Pros:**

✅ Different wake words for different purposes
✅ Family members can choose preferred wake word
✅ Context-aware responses
✅ Easy to add/remove models

**Cons:**

❌ Higher CPU usage (scales linearly)
❌ Increased false positive risk (3x models = 3x chance)
❌ More complex configuration

#### Approach 2: Edge Multiple Models (K210)

**Challenge:** The K210 has limited resources.

**Option A: Sequential checking** (Feasible)

```python
# Check each model in sequence
models = ['hey-mycroft.kmodel', 'hey-computer.kmodel']

for model in models:
    kpu_task = kpu.load(f"/sd/models/{model}")
    result = kpu.run(kpu_task, audio_features)
    if result > threshold:
        return model  # Wake word detected
```

**Resource impact:**

- Latency: +50-100ms per additional model
- Memory: Models must fit in 6MB total
- CPU: ~30% per model check

**Option B: Combined model** (Advanced)

```python
# Train a single model that recognizes multiple phrases
# Each phrase maps to a different output class
# More complex training but a single inference per audio window
```
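The decode step for such a combined model can be sketched as follows, with made-up class scores and a hypothetical `none` background class:

```python
from typing import Optional

# One output class per phrase, plus a background class
PHRASES = ["hey-mycroft", "hey-computer", "none"]

def decode(scores: list, threshold: float = 0.6) -> Optional[str]:
    """Return the detected phrase, or None if nothing clears the threshold."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    if PHRASES[best] == "none" or scores[best] < threshold:
        return None
    return PHRASES[best]

print(decode([0.1, 0.85, 0.05]))  # hey-computer
print(decode([0.2, 0.3, 0.5]))    # None (background class wins)
```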

**Recommendation for edge:**

- **1-2 wake words max** on K210
- **Server-side** for 3+ wake words

#### Approach 3: Contextual Wake Words

Different wake words trigger different behaviors:

```python
wake_word_contexts = {
    'hey_mycroft': 'general',    # General commands
    'hey_assistant': 'general',  # Alternative general
    'emergency': 'priority',     # High priority
    'goodnight': 'bedtime',      # Bedtime routine
}

def handle_wake_word(wake_word, command):
    context = wake_word_contexts[wake_word]

    if context == 'priority':
        # Skip queue, process immediately
        # Maybe call emergency contact
        pass
    elif context == 'bedtime':
        # Trigger bedtime automation
        # Lower volume for responses
        pass
    else:
        # Normal processing
        pass
```

### Best Practices for Multiple Wake Words

1. **Start with one** - Get it working well first
2. **Add gradually** - One at a time, test thoroughly
3. **Different purposes** - Each wake word should have a reason
4. **Monitor performance** - Track false positives per wake word
5. **User preference** - Let family members choose their favorite
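Point 4 (tracking false positives per wake word) can be as simple as a counter. A hypothetical monitoring helper:

```python
from collections import Counter

# Count false positives per wake word so you can raise the
# sensitivity threshold of the noisy ones.
false_positives = Counter()

def record_false_positive(wake_word: str) -> None:
    false_positives[wake_word] += 1

def noisiest(n: int = 1):
    """Return the n wake words with the most false positives."""
    return false_positives.most_common(n)

record_false_positive("jarvis")
record_false_positive("jarvis")
record_false_positive("hey_mycroft")
print(noisiest())  # [('jarvis', 2)]
```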
### Recommended Configuration

**For most users:**

```python
wake_words = {
    'hey_mycroft': 'primary',      # Main wake word (pre-trained)
    'hey_computer': 'alternative'  # Custom trained for your voice
}
```

**For power users:**

```python
wake_words = {
    'hey_mycroft': 'general',
    'jarvis': 'personal_assistant',   # Custom responses
    'computer': 'technical_queries',  # Different intent parser
}
```

**For families:**

```python
wake_words = {
    'hey_mycroft': 'shared',  # Everyone can use
    'dad': 'user_alan',       # Personalized
    'mom': 'user_sarah',      # Personalized
    'kids': 'user_children',  # Kid-safe responses
}
```

## Voice Adaptation and Multi-User Support

### Challenge: Different Voices, Same Wake Word

When multiple people use the system:

- Different accents
- Different speech patterns
- Different pronunciations
- Different vocal characteristics

### Solution Approaches

#### Approach 1: Diverse Training Data (Recommended)

**During initial training:**

```bash
# Have everyone in the household record samples
cd ~/precise-models/hey-computer

# Alan records 30 samples
precise-collect  # Record as user 1

# Sarah records 30 samples
precise-collect  # Record as user 2

# Kids record 20 samples
precise-collect  # Record as user 3

# Combine all samples in the training set and
# train one model that works for everyone
./3-train-model.sh
```

**Pros:**

✅ Single model for everyone
✅ No user switching needed
✅ Simple to maintain
✅ Works immediately for all users

**Cons:**

❌ May have lower per-person accuracy
❌ Requires upfront time from everyone
❌ Hard to add new users later

#### Approach 2: Incremental Training

Start with your voice, add others over time:

```bash
# Week 1: Train with Alan's voice
cd ~/precise-models/hey-computer
# Record and train with Alan's samples
precise-train -e 60 hey-computer.net .

# Week 2: Sarah wants to use it
# Collect Sarah's samples
mkdir -p sarah-samples/wake-word
precise-collect  # Sarah records 20-30 samples

# Add to existing training set
cp sarah-samples/wake-word/* wake-word/

# Retrain (continue from checkpoint)
precise-train -e 30 hey-computer.net . \
    --from-checkpoint hey-computer.net

# Now works for both Alan and Sarah!
```

**Pros:**

✅ Gradual improvement
✅ Don't need everyone upfront
✅ Easy to add new users
✅ Maintains accuracy for existing users

**Cons:**

❌ May not work well for new users initially
❌ Requires retraining periodically

#### Approach 3: Per-User Models with Speaker Identification

Train separate models + identify who's speaking:

**Step 1: Train per-user wake word models**

```bash
# Alan's model
~/precise-models/hey-computer-alan/

# Sarah's model
~/precise-models/hey-computer-sarah/

# Kids' model
~/precise-models/hey-computer-kids/
```

**Step 2: Use speaker identification**

```python
# Pseudo-code for speaker identification
def identify_speaker(audio):
    """
    Identify the speaker from voice characteristics
    using speaker embeddings (x-vectors, d-vectors)
    """
    # Extract a speaker embedding
    embedding = speaker_encoder.encode(audio)

    # Compare to known users
    similarities = {
        'alan': cosine_similarity(embedding, alan_embedding),
        'sarah': cosine_similarity(embedding, sarah_embedding),
        'kids': cosine_similarity(embedding, kids_embedding),
    }

    # Return the most similar
    return max(similarities, key=similarities.get)

def process_command(audio):
    # Detect the wake word with all models
    wake_detected = check_all_models(audio)

    if wake_detected:
        # Identify the speaker
        speaker = identify_speaker(audio)

        # Use the speaker-specific model for better accuracy
        model = f'~/precise-models/hey-computer-{speaker}/'

        # Continue with speaker context
        process_with_context(audio, speaker)
```

**Speaker identification libraries:**

- **Resemblyzer** - Simple speaker verification
- **speechbrain** - Complete toolkit
- **pyannote.audio** - You already use this for diarization!

**Implementation:**

```bash
# You already have pyannote for diarization!
conda activate voice-assistant
pip install pyannote.audio --break-system-packages

# You can use its speaker embeddings for identification
```

```python
from pyannote.audio import Inference

# Load the speaker embedding model
# (hf_token is your Hugging Face access token)
inference = Inference(
    "pyannote/embedding",
    use_auth_token=hf_token
)

# Extract embeddings for known users
alan_embedding = inference("alan_voice_sample.wav")
sarah_embedding = inference("sarah_voice_sample.wav")

# Compare with incoming audio
unknown_embedding = inference(audio_buffer)

from scipy.spatial.distance import cosine
alan_similarity = 1 - cosine(unknown_embedding, alan_embedding)
sarah_similarity = 1 - cosine(unknown_embedding, sarah_embedding)

if alan_similarity > 0.8:
    user = 'alan'
elif sarah_similarity > 0.8:
    user = 'sarah'
else:
    user = 'unknown'
```

**Pros:**

✅ Personalized responses per user
✅ Better accuracy (model optimized for each voice)
✅ User-specific preferences/permissions
✅ Can track who said what

**Cons:**

❌ More complex setup
❌ Higher resource usage
❌ Requires voice samples from each user
❌ Privacy considerations

#### Approach 4: Adaptive/Online Learning

The model improves automatically based on usage:

```python
class AdaptiveWakeWord:
    def __init__(self, base_model):
        self.base_model = base_model
        self.user_samples = []
        self.retrain_threshold = 50  # Retrain after N samples

    def on_detection(self, audio, user_confirmed=True):
        """User confirms this was a correct detection"""
        if user_confirmed:
            self.user_samples.append(audio)

        # Periodically retrain
        if len(self.user_samples) >= self.retrain_threshold:
            self.retrain_with_samples()
            self.user_samples = []

    def retrain_with_samples(self):
        """Background retraining with collected samples"""
        # Add samples to the training set
        # Retrain the model
        # Swap in the new model
        pass
```

**Pros:**

✅ Automatic improvement
✅ Adapts to the user's voice over time
✅ No manual retraining
✅ Gets better with use

**Cons:**

❌ Complex implementation
❌ Requires a user feedback mechanism
❌ Risk of drift/degradation
❌ Background training overhead

## Recommended Strategy

### Phase 1: Single Wake Word, Single Model

```bash
# Week 1-2
# Use pre-trained "Hey Mycroft"
# OR train custom "Hey Computer" with all family members' voices
# Keep it simple, get it working
```

### Phase 2: Add Fine-tuning

```bash
# Week 3-4
# Collect false positives/negatives
# Retrain with household-specific data
# Optimize threshold
```

### Phase 3: Consider Multiple Wake Words

```bash
# Month 2
# If needed, add a second wake word:
# "Hey Mycroft" for general use
# "Jarvis" for personal assistant tasks
```

### Phase 4: Personalization

```bash
# Month 3+
# If desired, add speaker identification,
# personalized responses, and
# user-specific preferences
```
## Practical Examples

### Example 1: Family of 4, Single Model

```bash
# Training session with everyone
cd ~/precise-models/hey-mycroft-family

# Dad records 25 samples
precise-collect

# Mom records 25 samples
precise-collect

# Kid 1 records 15 samples
precise-collect

# Kid 2 records 15 samples
precise-collect

# Collect shared negative samples (200+):
# TV, music, conversation, etc.
precise-collect -f not-wake-word/household.wav

# Train a single model for everyone
precise-train -e 60 hey-mycroft-family.net .

# Deploy
python voice_server.py \
    --enable-precise \
    --precise-model hey-mycroft-family.net
```

**Result:** Everyone can use it, one model, simple.

### Example 2: Two Wake Words, Different Purposes

```python
# voice_server.py configuration
wake_words = {
    'hey_mycroft': {
        'model': 'hey-mycroft.net',
        'sensitivity': 0.5,
        'intent_parser': 'general',  # All commands
        'response': 'Yes?'
    },
    'emergency': {
        'model': 'emergency.net',
        'sensitivity': 0.7,  # Higher threshold
        'intent_parser': 'emergency',  # Limited commands
        'response': 'Emergency mode activated'
    }
}

# "Hey Mycroft, turn on the lights" - works
# "Emergency, call for help" - triggers emergency protocol
```

### Example 3: Speaker Identification + Personalization

```python
# Enhanced processing with speaker ID
def process_with_speaker_id(audio, speaker):
    # Different HA entity based on the speaker
    entity_maps = {
        'alan': {
            'bedroom_light': 'light.master_bedroom',
            'office_light': 'light.alan_office',
        },
        'sarah': {
            'bedroom_light': 'light.master_bedroom',
            'office_light': 'light.sarah_office',
        },
        'kids': {
            'bedroom_light': 'light.kids_bedroom',
            'tv': None,  # Kids can't control the TV
        }
    }

    # Transcribe the command
    text = whisper_transcribe(audio)

    # "Turn on bedroom light"
    if 'bedroom light' in text:
        entity = entity_maps[speaker]['bedroom_light']
        ha_client.turn_on(entity)

        response = "Turned on your bedroom light"

    return response
```

## Resource Requirements

### Single Wake Word

- **CPU:** 5-10% (Heimdall)
- **RAM:** 100-200MB
- **Model size:** 1-3MB
- **Training time:** 30-60 min

### Multiple Wake Words (3 models)

- **CPU:** 15-30% (Heimdall)
- **RAM:** 300-600MB
- **Model size:** 3-9MB total
- **Training time:** 90-180 min

### With Speaker Identification

- **CPU:** +5-10% for speaker ID
- **RAM:** +200-300MB for the embedding model
- **Model size:** +50MB for the speaker model
- **Setup time:** +30-60 min for voice enrollment

### K210 Edge (Maix Duino)

- **Single model:** Feasible, ~30% CPU
- **2 models:** Feasible, ~60% CPU, higher latency
- **3+ models:** Not recommended
- **Speaker ID:** Not feasible (limited RAM/compute)

## Quick Decision Guide

**Just getting started?**
→ Use pre-trained "Hey Mycroft"

**Want a custom wake word?**
→ Train one model with all family voices

**Need multiple wake words?**
→ Start server-side with 2-3 models

**Want personalization?**
→ Add speaker identification

**Deploying to edge (K210)?**
→ Stick to 1-2 wake words maximum

**Family of 4+ people?**
→ Train a single model with everyone's voice

**Privacy is paramount?**
→ Skip speaker ID, use a single universal model

## Testing Multiple Wake Words

```bash
# Test all wake words quickly
conda activate precise

# Terminal 1: Hey Mycroft
precise-listen hey-mycroft.net

# Terminal 2: Hey Computer
precise-listen hey-computer.net

# Terminal 3: Emergency
precise-listen emergency.net

# Say each wake word, verify correct detection
```
## Conclusion

### For Your Maix Duino Project:

**Recommended approach:**

1. **Start with "Hey Mycroft"** - Use the pre-trained model
2. **Fine-tune if needed** - Add your household's voices
3. **Consider a 2nd wake word** - Only if you have a specific use case
4. **Speaker ID** - Phase 2/3 enhancement, not critical for the MVP
5. **Keep it simple** - One wake word works great for most users

**The pre-trained "Hey Mycroft" model saves you 1-2 hours** and works immediately. You can always fine-tune or add custom wake words later!

**Multiple wake words are cool but not necessary** - Most commercial products use just one. Focus on making one wake word work really well before adding more.

**Voice adaptation** - Training with multiple voices upfront is simpler than per-user models. Save speaker ID for later if you need personalization.

## Quick Start with Pre-trained

```bash
# On Heimdall
cd ~/precise-models/pretrained
wget https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz
tar xzf hey-mycroft.tar.gz

# Test it
conda activate precise
precise-listen hey-mycroft.net

# Deploy
cd ~/voice-assistant
python voice_server.py \
    --enable-precise \
    --precise-model ~/precise-models/pretrained/hey-mycroft.net

# You're done! No training needed!
```

**That's it - you have a working wake word in 5 minutes!** 🎉
411 docs/WAKE_WORD_QUICK_REF.md Executable file
@ -0,0 +1,411 @@
# Wake Word Quick Reference Card

## 🎯 TL;DR: What Should I Do?

### Recommendation for Your Setup

**Week 1:** Use pre-trained "Hey Mycroft"

```bash
./download_pretrained_models.sh --model hey-mycroft
precise-listen ~/precise-models/pretrained/hey-mycroft.net
```

**Week 2-3:** Fine-tune with all family members' voices

```bash
cd ~/precise-models/hey-mycroft-family
precise-train -e 30 custom.net . --from-checkpoint ../pretrained/hey-mycroft.net
```

**Week 4+:** Add speaker identification

```bash
pip install resemblyzer
python enroll_speaker.py --name Alan --duration 20
python enroll_speaker.py --name [Family] --duration 20
```

**Month 2+:** Add a second wake word (Hey Jarvis for Plex?)

```bash
./download_pretrained_models.sh --model hey-jarvis
# Run both in parallel on the server
```

---

## 📋 Pre-trained Models

### Available Models (Ready to Use!)

| Wake Word | Download | Best For |
|-----------|----------|----------|
| **Hey Mycroft** ⭐ | `--model hey-mycroft` | Default choice, most data |
| **Hey Jarvis** | `--model hey-jarvis` | Pop culture, media control |
| **Christopher** | `--model christopher` | Unique, less common |
| **Hey Ezra** | `--model hey-ezra` | Alternative option |

### Quick Download

```bash
# Download one
./download_pretrained_models.sh --model hey-mycroft

# Download all
./download_pretrained_models.sh --test-all

# Test immediately
precise-listen ~/precise-models/pretrained/hey-mycroft.net
```

---

## 🔢 Multiple Wake Words

### Option 1: Multiple Models (Server-Side) ⭐ RECOMMENDED

**What:** Run 2-3 different wake word models simultaneously
**Where:** Heimdall (server)
**Performance:** ~15-30% CPU for 3 models

```bash
# Start with multiple wake words
python voice_server.py \
    --enable-precise \
    --precise-models "\
hey-mycroft:~/models/hey-mycroft.net:0.5,\
hey-jarvis:~/models/hey-jarvis.net:0.5"
```
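The spec string above uses a `name:path:sensitivity` format, which the server would need to parse (note this flag format is project-specific, not a stock Precise option); a minimal parser sketch:

```python
def parse_model_spec(spec: str) -> dict:
    """Parse a comma-separated "name:path:sensitivity" model spec."""
    models = {}
    for entry in spec.split(","):
        name, path, sensitivity = entry.strip().split(":")
        models[name] = {"path": path, "sensitivity": float(sensitivity)}
    return models

spec = ("hey-mycroft:~/models/hey-mycroft.net:0.5,"
        "hey-jarvis:~/models/hey-jarvis.net:0.5")
print(parse_model_spec(spec)["hey-jarvis"]["sensitivity"])  # 0.5
```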
|
||||||
|
|
||||||
|
**Pros:**
|
||||||
|
- ✅ Can identify which wake word was used
|
||||||
|
- ✅ Different contexts (Mycroft=commands, Jarvis=media)
|
||||||
|
- ✅ Easy to add/remove wake words
|
||||||
|
- ✅ Each can have different sensitivity
|
||||||
|
|
||||||
|
**Cons:**
|
||||||
|
- ❌ Only works server-side (not on Maix Duino)
|
||||||
|
- ❌ Higher CPU usage (but still reasonable)
|
||||||
|
|
||||||
|
**Use When:**
|
||||||
|
- You want different wake words for different purposes
|
||||||
|
- Server has CPU to spare (yours does!)
|
||||||
|
- Want flexibility to add wake words later
|
||||||
|
|
||||||
|
### Option 2: Single Multi-Phrase Model (Edge-Compatible)
|
||||||
|
|
||||||
|
**What:** One model responds to multiple phrases
|
||||||
|
**Where:** Server OR Maix Duino
|
||||||
|
**Performance:** Same as single model
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Train on multiple phrases
|
||||||
|
cd ~/precise-models/multi-wake
|
||||||
|
# Record "Hey Mycroft" samples → wake-word/
|
||||||
|
# Record "Hey Computer" samples → wake-word/
|
||||||
|
# Record negatives → not-wake-word/
|
||||||
|
precise-train -e 60 multi-wake.net .
|
||||||
|
```
|
||||||
|
|
||||||
|
**Pros:**
|
||||||
|
- ✅ Single model = less compute
|
||||||
|
- ✅ Works on edge (K210)
|
||||||
|
- ✅ Simple deployment
|
||||||
|
|
||||||
|
**Cons:**
|
||||||
|
- ❌ Can't tell which wake word was used
|
||||||
|
- ❌ May reduce accuracy
|
||||||
|
- ❌ Higher false positive risk
|
||||||
|
|
||||||
|
**Use When:**
|
||||||
|
- Deploying to Maix Duino (edge)
|
||||||
|
- Want backup wake words
|
||||||
|
- Don't care which was used
|
||||||
|
|
||||||
|
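The `--precise-models` value above packs each model into a comma-separated `name:path:sensitivity` triple. How `voice_server.py` actually parses that string is an assumption; a minimal sketch of such a parser:

```python
def parse_model_spec(spec):
    """Parse a comma-separated list of name:path:sensitivity triples.

    Hypothetical helper -- illustrates the --precise-models format,
    not the actual voice_server.py implementation.
    """
    models = {}
    for entry in spec.split(","):
        name, path, sensitivity = entry.strip().split(":")
        models[name] = {"path": path, "sensitivity": float(sensitivity)}
    return models

spec = "hey-mycroft:~/models/hey-mycroft.net:0.5,hey-jarvis:~/models/hey-jarvis.net:0.5"
print(parse_model_spec(spec))
```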
---
## 👥 Multi-User Support

### Option 1: Inclusive Training ⭐ START HERE

**What:** One model, all voices
**How:** All family members record samples

```bash
cd ~/precise-models/family-wake
# Alice records 30 samples
# Bob records 30 samples
# You record 30 samples
precise-train -e 60 family-wake.net .
```

**Pros:**
- ✅ Everyone can use it
- ✅ Simple deployment
- ✅ Single model

**Cons:**
- ❌ Can't identify who spoke
- ❌ No personalization

**Use When:**
- Just getting started
- Don't need to know who spoke
- Want simplicity

### Option 2: Speaker Identification (Week 4+)

**What:** Detect wake word, then identify speaker
**How:** Voice embeddings (resemblyzer or pyannote)

```bash
# Install
pip install resemblyzer

# Enroll users
python enroll_speaker.py --name Alan --duration 20
python enroll_speaker.py --name Alice --duration 20
python enroll_speaker.py --name Bob --duration 20

# Server identifies speaker automatically
```

**Pros:**
- ✅ Personalized responses
- ✅ User-specific permissions
- ✅ Better privacy
- ✅ Track preferences

**Cons:**
- ❌ More complex
- ❌ Requires enrollment
- ❌ +100-200ms latency
- ❌ May fail with similar voices

**Use When:**
- Want personalization
- Need user-specific commands
- Ready for advanced features

### Option 3: Per-User Wake Words (Advanced)

**What:** Each person has their own wake word
**How:** Multiple models, one per person

```bash
# Alice: "Hey Mycroft"
# Bob: "Hey Jarvis"
# You: "Hey Computer"

# Run all 3 models in parallel
```

**Pros:**
- ✅ Automatic user ID
- ✅ Highest accuracy per user
- ✅ Clear separation

**Cons:**
- ❌ 3x models = 3x CPU
- ❌ Users must remember their word
- ❌ Server-only (not edge)

**Use When:**
- Need automatic user ID
- Have CPU to spare
- Users want their own wake word
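Both resemblyzer and pyannote reduce a voice clip to a fixed-length embedding vector; speaker identification then reduces to nearest-neighbour matching against the enrolled embeddings. A dependency-free sketch of that matching step (the 3-element vectors stand in for real ~256-dimensional embeddings, and the 0.75 threshold is an illustrative starting point, not a tuned value):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def identify(embedding, enrolled, threshold=0.75):
    """Return the best-matching enrolled name, or None if nothing beats the threshold."""
    best_name, best_score = None, threshold
    for name, ref in enrolled.items():
        score = cosine(embedding, ref)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Toy enrollment database -- real embeddings come from the enrollment step.
enrolled = {"Alice": [0.9, 0.1, 0.0], "Bob": [0.1, 0.9, 0.1]}
print(identify([0.85, 0.15, 0.05], enrolled))  # → Alice
```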
---
## 🎯 Decision Tree

```
START: Want to use voice assistant
│
├─ Single user or don't care who spoke?
│   └─ Use: Inclusive Training (Option 1)
│       └─ Download: Hey Mycroft (pre-trained)
│
├─ Multiple users AND need to know who spoke?
│   └─ Use: Speaker Identification (Option 2)
│       └─ Start with: Hey Mycroft + resemblyzer
│
├─ Want different wake words for different purposes?
│   └─ Use: Multiple Models (Option 1)
│       └─ Download: Hey Mycroft + Hey Jarvis
│
└─ Deploying to Maix Duino (edge)?
    └─ Use: Single Multi-Phrase Model (Option 2)
        └─ Train: Custom model with 2-3 phrases
```

---
## 📊 Comparison Table

| Feature | Inclusive | Speaker ID | Per-User Wake | Multiple Wake |
|---------|-----------|------------|---------------|---------------|
| **Setup Time** | 2 hours | 4 hours | 6 hours | 3 hours |
| **Complexity** | ⭐ Easy | ⭐⭐⭐ Medium | ⭐⭐⭐⭐ Hard | ⭐⭐ Easy |
| **CPU Usage** | 5-10% | 10-15% | 15-30% | 15-30% |
| **Latency** | 100ms | 300ms | 100ms | 100ms |
| **User ID** | ❌ No | ✅ Yes | ✅ Yes | ❌ No |
| **Edge Deploy** | ✅ Yes | ⚠️ Maybe | ❌ No | ⚠️ Partial |
| **Personalize** | ❌ No | ✅ Yes | ✅ Yes | ⚠️ Partial |

---
## 🚀 Recommended Timeline

### Week 1: Get It Working
```bash
# Use pre-trained Hey Mycroft
./download_pretrained_models.sh --model hey-mycroft

# Test it
precise-listen ~/precise-models/pretrained/hey-mycroft.net

# Deploy to server
python voice_server.py --enable-precise \
  --precise-model ~/precise-models/pretrained/hey-mycroft.net
```

### Week 2-3: Make It Yours
```bash
# Fine-tune with your family's voices
cd ~/precise-models/hey-mycroft-family

# Have everyone record 20-30 samples
precise-collect  # Alice
precise-collect  # Bob
precise-collect  # You

# Train
precise-train -e 30 custom.net . \
  --from-checkpoint ../pretrained/hey-mycroft.net
```

### Week 4+: Add Intelligence
```bash
# Speaker identification
pip install resemblyzer
python enroll_speaker.py --name Alan --duration 20
python enroll_speaker.py --name Alice --duration 20

# Now server knows who's speaking!
```

### Month 2+: Expand Features
```bash
# Add second wake word for media control
./download_pretrained_models.sh --model hey-jarvis

# Run both: Mycroft for commands, Jarvis for Plex
python voice_server.py --enable-precise \
  --precise-models "mycroft:hey-mycroft.net:0.5,jarvis:hey-jarvis.net:0.5"
```

---
## 💡 Pro Tips

### Wake Word Selection
- ✅ **DO:** Choose clear, distinct wake words
- ✅ **DO:** Test in your environment
- ❌ **DON'T:** Use similar-sounding words
- ❌ **DON'T:** Use common phrases

### Training
- ✅ **DO:** Include all intended users
- ✅ **DO:** Record in various conditions
- ✅ **DO:** Add false positives to training
- ❌ **DON'T:** Rush the training process

### Deployment
- ✅ **DO:** Start simple (one wake word)
- ✅ **DO:** Test thoroughly before adding features
- ✅ **DO:** Monitor false positive rate
- ❌ **DON'T:** Deploy too many wake words at once

### Speaker ID
- ✅ **DO:** Use 20+ seconds for enrollment
- ✅ **DO:** Re-enroll if accuracy drops
- ✅ **DO:** Test threshold values
- ❌ **DON'T:** Expect 100% accuracy

---
## 🔧 Quick Commands

```bash
# Download pre-trained model
./download_pretrained_models.sh --model hey-mycroft

# Test model
precise-listen ~/precise-models/pretrained/hey-mycroft.net

# Fine-tune from pre-trained
precise-train -e 30 custom.net . \
  --from-checkpoint ~/precise-models/pretrained/hey-mycroft.net

# Enroll speaker
python enroll_speaker.py --name Alan --duration 20

# Start with single wake word
python voice_server.py --enable-precise \
  --precise-model hey-mycroft.net

# Start with multiple wake words
python voice_server.py --enable-precise \
  --precise-models "mycroft:hey-mycroft.net:0.5,jarvis:hey-jarvis.net:0.5"

# Check status
curl http://10.1.10.71:5000/wake-word/status

# Monitor detections
curl http://10.1.10.71:5000/wake-word/detections
```

---
## 📚 See Also

- **Full guide:** [ADVANCED_WAKE_WORD_TOPICS.md](ADVANCED_WAKE_WORD_TOPICS.md)
- **Training:** [MYCROFT_PRECISE_GUIDE.md](MYCROFT_PRECISE_GUIDE.md)
- **Deployment:** [PRECISE_DEPLOYMENT.md](PRECISE_DEPLOYMENT.md)
- **Getting started:** [QUICKSTART.md](QUICKSTART.md)

---
## ❓ FAQ

**Q: Can I use "Hey Mycroft" right away?**
A: Yes! Download it with `./download_pretrained_models.sh --model hey-mycroft`.

**Q: How many wake words can I run at once?**
A: 2-3 comfortably on the server. The Maix Duino can handle 1.

**Q: Can I train my own custom wake word?**
A: Yes! See MYCROFT_PRECISE_GUIDE.md Phase 2.

**Q: Does speaker ID work with multiple wake words?**
A: Yes! Wake word detected → speaker identified → personalized response.

**Q: Can I use this on Maix Duino?**
A: Start server-side, then convert the model to KMODEL for on-device use (advanced).

**Q: How accurate is speaker identification?**
A: 85-95% with good enrollment. Re-enroll if accuracy drops.

**Q: What if someone has a cold?**
A: Accuracy may drop temporarily. The system should recover when their voice returns to normal.

**Q: Can kids use it?**
A: Yes! Include their voices in training or enroll them separately.

---
**Quick Decision:** Start with pre-trained Hey Mycroft. Add features later!

```bash
./download_pretrained_models.sh --model hey-mycroft
precise-listen ~/precise-models/pretrained/hey-mycroft.net
# It just works! ✨
```
347
docs/maix-voice-assistant-architecture.md
Executable file
# Maix Duino Voice Assistant - System Architecture

## Overview
Local voice assistant using a Sipeed Maix Duino board integrated with Home Assistant, leveraging existing home lab infrastructure for AI processing.

## Hardware Components

### Maix Duino Board
- **Processor**: K210 dual-core RISC-V @ 400MHz
- **AI Accelerator**: KPU for neural network inference
- **Audio**: I2S microphone + speaker output
- **Connectivity**: ESP32 for WiFi/BLE
- **Programming**: MaixPy (MicroPython)

### Recommended Accessories
- I2S MEMS microphone (or microphone array for better pickup)
- Small speaker (3-5W) or audio output to existing speakers
- USB-C power supply (5V/2A minimum)
## Software Architecture

### Edge Layer (Maix Duino)
```
┌─────────────────────────────────────┐
│ Maix Duino (MaixPy)                 │
├─────────────────────────────────────┤
│ • Wake Word Detection (KPU)         │
│ • Audio Capture (I2S)               │
│ • Audio Streaming → Heimdall        │
│ • Audio Playback ← Heimdall         │
│ • LED Feedback (listening status)   │
└─────────────────────────────────────┘
          ↕ WiFi/HTTP/WebSocket
┌─────────────────────────────────────┐
│ Voice Processing Server             │
│ (Heimdall - 10.1.10.71)             │
├─────────────────────────────────────┤
│ • Whisper STT (existing setup!)     │
│ • Intent Recognition (Rasa/custom)  │
│ • Piper TTS                         │
│ • Home Assistant API Client         │
└─────────────────────────────────────┘
          ↕ REST API/MQTT
┌─────────────────────────────────────┐
│ Home Assistant                      │
│ (Your HA instance)                  │
├─────────────────────────────────────┤
│ • Device Control                    │
│ • State Management                  │
│ • Automation Triggers               │
└─────────────────────────────────────┘
```
## Communication Flow

### 1. Wake Word Detection (Local)
```
User says "Hey Assistant"
    ↓
Maix Duino KPU detects wake word
    ↓
LED turns on (listening mode)
    ↓
Start audio streaming to Heimdall
```

### 2. Speech Processing (Heimdall)
```
Audio stream received
    ↓
Whisper transcribes to text
    ↓
Intent parser extracts command
    ↓
Query Home Assistant API
    ↓
Generate response text
    ↓
Piper TTS creates audio
    ↓
Stream audio back to Maix Duino
```

### 3. Playback & Feedback
```
Receive audio stream
    ↓
Play through speaker
    ↓
LED indicates completion
    ↓
Return to wake word detection
```
## Network Configuration

### Maix Duino Network Settings
- **IP**: 10.1.10.xxx (assign static via DHCP reservation)
- **Gateway**: 10.1.10.1
- **DNS**: 10.1.10.4 (Pi-hole)

### Service Endpoints
- **Voice Processing Server**: http://10.1.10.71:5000
- **Home Assistant**: (your existing HA URL)
- **MQTT Broker**: (optional, if using MQTT)

### Caddy Reverse Proxy Entry
Add to `/mnt/project/epona_-_Caddyfile`:
```caddy
# Voice Assistant API
handle /voice-assistant* {
    uri strip_prefix /voice-assistant
    reverse_proxy http://10.1.10.71:5000
}
```
## Software Stack

### Maix Duino (MaixPy)
- **Firmware**: Latest MaixPy release
- **Libraries**:
  - `Maix.KPU` - Neural network inference
  - `Maix.I2S` - Audio capture/playback
  - `socket` - Network communication
  - `ujson` - JSON handling

### Heimdall Server (Python)
- **Environment**: Create a new conda env
  ```bash
  conda create -n voice-assistant python=3.10
  conda activate voice-assistant
  ```
- **Dependencies**:
  - `openai-whisper` (already installed!)
  - `piper-tts` - Text-to-speech
  - `flask` - REST API server
  - `requests` - HTTP client
  - `pyaudio` - Audio handling
  - `websockets` - Real-time streaming

### Optional: Intent Recognition
- **Rasa** - Full NLU framework (heavier but powerful)
- **Simple pattern matching** - Lightweight; start here
- **LLM-based** - Use your existing LLM setup on Heimdall
## Data Flow Examples

### Example 1: Turn on lights
```
User: "Hey Assistant, turn on the living room lights"
    ↓
Wake word detected → Start recording
    ↓
Whisper STT: "turn on the living room lights"
    ↓
Intent Parser: {
    "action": "turn_on",
    "entity": "light.living_room"
}
    ↓
Home Assistant API:
    POST /api/services/light/turn_on
    {"entity_id": "light.living_room"}
    ↓
Response: "Living room lights turned on"
    ↓
Piper TTS → Audio playback
```

### Example 2: Get status
```
User: "What's the temperature?"
    ↓
Whisper STT: "what's the temperature"
    ↓
Intent Parser: {
    "action": "get_state",
    "entity": "sensor.temperature"
}
    ↓
Home Assistant API:
    GET /api/states/sensor.temperature
    ↓
Response: "The temperature is 72 degrees"
    ↓
Piper TTS → Audio playback
```
|
||||||
|
|
||||||
|
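The intent-parser step shown in both examples can start as the "simple pattern matching" option from the software stack. A sketch of that approach — the rules and entity ids here are illustrative, not a full grammar, and a real deployment would map phrases to your actual Home Assistant entity ids:

```python
import re

# Hypothetical rule table: (pattern, builder) pairs tried in order.
RULES = [
    (re.compile(r"turn (on|off) the (.+?) lights?"),
     lambda m: {"action": "turn_" + m.group(1),
                "entity": "light." + m.group(2).replace(" ", "_")}),
    (re.compile(r"what'?s the temperature"),
     lambda m: {"action": "get_state", "entity": "sensor.temperature"}),
]

def parse_intent(text):
    """Return the first matching intent dict, or None when nothing matches."""
    text = text.lower().strip()
    for pattern, build in RULES:
        m = pattern.search(text)
        if m:
            return build(m)
    return None

print(parse_intent("turn on the living room lights"))
# → {'action': 'turn_on', 'entity': 'light.living_room'}
```

Falling through to `None` is the hook for escalating unmatched utterances to a heavier NLU backend (Rasa or an LLM) later.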
## Phase 1 Implementation Plan

### Step 1: Maix Duino Setup (Week 1)
- [ ] Flash latest MaixPy firmware
- [ ] Test audio input/output
- [ ] Implement basic network communication
- [ ] Test streaming audio to server

### Step 2: Server Setup (Week 1-2)
- [ ] Create conda environment on Heimdall
- [ ] Set up Flask API server
- [ ] Integrate Whisper (already have this!)
- [ ] Install and test Piper TTS
- [ ] Create basic Home Assistant API client

### Step 3: Wake Word Training (Week 2)
- [ ] Record wake word samples
- [ ] Train custom wake word model
- [ ] Convert model for K210 KPU
- [ ] Test on-device detection

### Step 4: Integration (Week 3)
- [ ] Connect all components
- [ ] Test end-to-end flow
- [ ] Add error handling
- [ ] Implement fallbacks

### Step 5: Enhancement (Week 4+)
- [ ] Add more intents
- [ ] Improve NLU accuracy
- [ ] Add multi-room support
- [ ] Implement conversation context
## Development Tools

### Testing Wake Word
```bash
# Use existing diarization.py for testing audio quality
python3 /path/to/diarization.py test_audio.wav \
    --format vtt \
    --model medium
```

### Monitoring
- Heimdall logs: `/var/log/voice-assistant/`
- Maix Duino serial console: 115200 baud
- Home Assistant logs: Standard HA logging
## Security Considerations

1. **No external cloud services** - Everything local
2. **Network isolation** - Keep on 10.1.10.0/24
3. **Authentication** - Use HA long-lived tokens
4. **Rate limiting** - Prevent abuse
5. **Audio privacy** - Only stream after wake word
## Resource Requirements

### Heimdall
- **CPU**: Minimal (< 5% idle, spikes during STT)
- **RAM**: ~2GB for Whisper medium model
- **Storage**: ~5GB for models
- **Network**: Low bandwidth (16kHz audio stream)

### Maix Duino
- **Power**: ~1-2W typical
- **Storage**: 16MB flash (plenty for wake word model)
- **RAM**: 8MB SRAM (sufficient for audio buffering)
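The "low bandwidth" claim is easy to sanity-check: 16 kHz, 16-bit mono PCM works out to 32 KB/s (256 kbps) before any compression:

```python
sample_rate = 16_000   # Hz
sample_width = 2       # bytes per sample (16-bit PCM)
channels = 1           # mono

bytes_per_second = sample_rate * sample_width * channels
kbps = bytes_per_second * 8 / 1000
print(bytes_per_second, "bytes/s =", kbps, "kbps")  # 32000 bytes/s = 256.0 kbps
```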
## Alternative Architectures

### Option A: Fully On-Device (Limited)
- Everything on Maix Duino
- Very limited vocabulary
- No internet required
- Lower accuracy

### Option B: Hybrid (Recommended)
- Wake word on Maix Duino
- Processing on Heimdall
- Best balance of speed/accuracy

### Option C: Raspberry Pi Alternative
- If K210 proves limiting
- More processing power
- Still local/FOSS
- Higher cost
## Expansion Ideas

### Future Enhancements
1. **Multi-room**: Deploy multiple Maix Duino units
2. **Music playback**: Integrate with Plex
3. **Timers/Reminders**: Local scheduling
4. **Weather**: Pull from local weather station
5. **Calendar**: Sync with Nextcloud
6. **Intercom**: Room-to-room communication
7. **Sound events**: Doorbell, smoke alarm detection

### Integration with Existing Infrastructure
- **Plex**: Voice control for media playback
- **qBittorrent**: Status queries, torrent management
- **Nextcloud**: Calendar/contact queries
- **Matrix**: Send messages via voice
## Cost Estimate

- Maix Duino board: ~$20-30 (already have!)
- Microphone: ~$5-10 (if not included)
- Speaker: ~$10-15 (or use existing)
- **Total**: $0-55 (mostly already have)

Compare to commercial solutions:
- Google Home Mini: $50 (requires cloud)
- Amazon Echo Dot: $50 (requires cloud)
- Apple HomePod Mini: $99 (requires cloud)
## Success Criteria

### Minimum Viable Product (MVP)
- ✓ Wake word detection < 1 second
- ✓ Speech-to-text accuracy > 90%
- ✓ Home Assistant command execution
- ✓ Response time < 3 seconds total
- ✓ All processing local (no cloud)

### Enhanced Version
- ✓ Multi-intent conversations
- ✓ Context awareness
- ✓ Multiple wake words
- ✓ Room-aware responses
- ✓ Custom voice training
## Resources & Documentation

### Official Documentation
- Maix Duino: https://wiki.sipeed.com/hardware/en/maix/
- MaixPy: https://maixpy.sipeed.com/
- Home Assistant API: https://developers.home-assistant.io/

### Wake Word Tools
- Mycroft Precise: https://github.com/MycroftAI/mycroft-precise
- Porcupine: https://github.com/Picovoice/porcupine

### TTS Options
- Piper: https://github.com/rhasspy/piper
- Coqui TTS: https://github.com/coqui-ai/TTS

### Community Projects
- Rhasspy: https://rhasspy.readthedocs.io/ (full voice assistant framework)
- Willow: https://github.com/toverainc/willow (ESP32-based alternative)
## Next Steps

1. **Test current setup**: Verify the Maix Duino boots and can connect to WiFi
2. **Audio test**: Record and play back a test clip on the board
3. **Server setup**: Create the conda environment and install dependencies
4. **Simple prototype**: Wake word → beep (no processing yet)
5. **Iterate**: Add complexity step by step
348
hardware/maixduino/MICROPYTHON_QUIRKS.md
Executable file
# MicroPython/MaixPy Quirks and Compatibility Notes

**Date:** 2025-12-03
**MicroPython Version:** v0.6.2-89-gd8901fd22 on 2024-06-17
**Hardware:** Sipeed Maixduino (K210)

This document captures the compatibility issues and workarounds discovered while developing the voice assistant client for the Maixduino.

---
## String Formatting

### ❌ F-strings NOT supported
```python
# WRONG - SyntaxError
message = f"IP: {ip}"
temperature = f"Temp: {temp}°C"
```

### ✅ Use string concatenation
```python
# CORRECT
message = "IP: " + str(ip)
temperature = "Temp: " + str(temp) + "°C"
```

---
## Conditional Expressions (Ternary Operator)

### ❌ Inline ternary expressions NOT supported
```python
# WRONG - SyntaxError
plural = "s" if count > 1 else ""
message = "Found " + str(count) + " item" + ("s" if count > 1 else "")
```

### ✅ Use explicit if/else blocks
```python
# CORRECT
if count > 1:
    plural = "s"
else:
    plural = ""
message = "Found " + str(count) + " item" + plural
```

---
## String Methods

### ❌ decode() doesn't accept keyword arguments
```python
# WRONG - TypeError: function doesn't take keyword arguments
text = response.decode('utf-8', errors='ignore')
```

### ✅ Use positional arguments only (or catch exceptions)
```python
# CORRECT
try:
    text = response.decode('utf-8')
except:
    text = str(response)
```

---
## Display/LCD Color Format

### ❌ RGB tuples NOT accepted
```python
# WRONG - TypeError: can't convert tuple to int
COLOR_RED = (255, 0, 0)
lcd.draw_string(10, 50, "Hello", COLOR_RED, 0)
```

### ✅ Use bit-packed integers
```python
# CORRECT - Pack RGB into a 24-bit integer
def rgb_to_int(r, g, b):
    return (r << 16) | (g << 8) | b

COLOR_RED = rgb_to_int(255, 0, 0)
lcd.draw_string(10, 50, "Hello", COLOR_RED, 0)
```

---
## Network - WiFi Module

### ❌ Standard network.WLAN NOT available
```python
# WRONG - AttributeError: 'module' object has no attribute 'WLAN'
import network
nic = network.WLAN(network.STA_IF)
```

### ✅ Use network.ESP32_SPI for Maixduino
```python
# CORRECT - Requires full pin configuration
from network import ESP32_SPI
from fpioa_manager import fm

# Register all 6 SPI pins
fm.register(25, fm.fpioa.GPIOHS10, force=True)  # CS
fm.register(8, fm.fpioa.GPIOHS11, force=True)   # RST
fm.register(9, fm.fpioa.GPIOHS12, force=True)   # RDY
fm.register(28, fm.fpioa.GPIOHS13, force=True)  # MOSI
fm.register(26, fm.fpioa.GPIOHS14, force=True)  # MISO
fm.register(27, fm.fpioa.GPIOHS15, force=True)  # SCLK

nic = ESP32_SPI(
    cs=fm.fpioa.GPIOHS10,
    rst=fm.fpioa.GPIOHS11,
    rdy=fm.fpioa.GPIOHS12,
    mosi=fm.fpioa.GPIOHS13,
    miso=fm.fpioa.GPIOHS14,
    sclk=fm.fpioa.GPIOHS15
)

nic.connect(SSID, PASSWORD)
```

### ❌ active() method NOT available
```python
# WRONG - AttributeError: 'ESP32_SPI' object has no attribute 'active'
nic.active(True)
```

### ✅ Just use connect() directly
```python
# CORRECT
nic.connect(SSID, PASSWORD)
```

---
## I2S Audio

### ❌ record() doesn't accept a size parameter only
```python
# WRONG - TypeError: object with buffer protocol required
chunk = i2s_dev.record(1024)
```

### ✅ Returns an Audio object; use to_bytes()
```python
# CORRECT
audio_obj = i2s_dev.record(total_bytes)
audio_data = audio_obj.to_bytes()
```

**Note:** Audio data often comes in unexpected formats:
- Expected: 16-bit mono PCM
- Reality: Often 32-bit or stereo (4x expected size)
- Solution: Implement format detection and conversion
---
## Memory Management

### Memory is VERY limited (~6MB total, much less available)

**Problems encountered:**
- Creating large bytearrays fails (>100KB can fail)
- Multiple allocations cause fragmentation
- In-place operations preferred over creating new buffers

### ❌ Creating new buffers
```python
# WRONG - MemoryError on large data
compressed = bytearray()
for i in range(0, len(data), 4):
    compressed.extend(data[i:i+2])  # Allocates new memory
```

### ✅ Work with smaller chunks or compress during transmission
```python
# CORRECT - Process in smaller pieces
chunk_size = 512
for i in range(0, len(data), chunk_size):
    chunk = data[i:i+chunk_size]
    process_chunk(chunk)  # Handle incrementally
```

**Solutions implemented:**
1. Reduce recording duration (3s → 1s)
2. Compress audio (μ-law: 50% size reduction)
3. Stream transmission in small chunks (512 bytes)
4. Add delays between sends to prevent buffer overflow
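Solution 2 halves the payload by mapping each 16-bit sample to 8 bits on a logarithmic scale. A host-side sketch of standard G.711 μ-law encoding (whether the project's own compressor matches G.711 exactly is an assumption; on the K210 the loop would run over small chunks, per the notes above):

```python
def ulaw_encode(sample):
    """Encode one signed 16-bit PCM sample to 8-bit mu-law (G.711)."""
    BIAS, CLIP = 0x84, 32635
    sign = 0x80 if sample < 0 else 0x00
    if sample < 0:
        sample = -sample
    sample = min(sample, CLIP) + BIAS
    # Find the exponent: position of the highest set bit above bit 7.
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (sample & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (sample >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF  # bits are inverted on the wire

def compress(pcm16):
    """Halve a buffer of little-endian 16-bit samples to 8-bit mu-law."""
    out = bytearray(len(pcm16) // 2)
    for i in range(0, len(pcm16), 2):
        sample = int.from_bytes(pcm16[i:i+2], "little", signed=True)
        out[i // 2] = ulaw_encode(sample)
    return bytes(out)
```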
---
## String Operations

### ❌ Arithmetic in string concatenation
```python
# WRONG - SyntaxError (sometimes)
message = "Count: #" + str(count + 1)
```

### ✅ Separate arithmetic from concatenation
```python
# CORRECT
next_count = count + 1
message = "Count: #" + str(next_count)
```

---
## Bytearray Operations

### ❌ Item deletion NOT supported
```python
# WRONG - TypeError: 'bytearray' object doesn't support item deletion
del audio_data[expected_size:]
```

### ✅ Create a new bytearray with a slice
```python
# CORRECT
audio_data = audio_data[:expected_size]
# Or create a new buffer
trimmed = bytearray(expected_size)
trimmed[:] = audio_data[:expected_size]
```

---
## HTTP Requests
|
||||||
|
|
||||||
|
### ❌ urequests module NOT available
|
||||||
|
```python
|
||||||
|
# WRONG - ImportError: no module named 'urequests'
|
||||||
|
import urequests
|
||||||
|
response = urequests.post(url, data=data)
|
||||||
|
```
|
||||||
|
|
||||||
|
### ✅ Use raw socket HTTP
|
||||||
|
```python
|
||||||
|
# CORRECT
|
||||||
|
import socket
|
||||||
|
|
||||||
|
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
|
||||||
|
s.connect((host, port))
|
||||||
|
|
||||||
|
# Manual HTTP headers
|
||||||
|
headers = "POST /path HTTP/1.1\r\n"
|
||||||
|
headers += "Host: " + host + "\r\n"
|
||||||
|
headers += "Content-Type: audio/wav\r\n"
|
||||||
|
headers += "Content-Length: " + str(len(data)) + "\r\n"
|
||||||
|
headers += "Connection: close\r\n\r\n"
|
||||||
|
|
||||||
|
s.send(headers.encode())
|
||||||
|
s.send(data)
|
||||||
|
|
||||||
|
response = s.recv(1024)
|
||||||
|
s.close()
|
||||||
|
```
|
||||||
|
|
||||||
|
**Socket I/O errors common:**
|
||||||
|
- `[Errno 5] EIO` - Buffer overflow or disconnect
|
||||||
|
- Solutions:
|
||||||
|
- Send smaller chunks (512-1024 bytes)
|
||||||
|
- Add delays between sends (`time.sleep_ms(10)`)
|
||||||
|
- Enable keepalive if supported
|
||||||
|
|
||||||
|
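The chunked-send pattern those bullets describe can be sketched as follows; the `sleep_ms` shim keeps it runnable on both MicroPython and CPython, and the names are illustrative:

```python
import time

try:
    sleep_ms = time.sleep_ms          # MicroPython
except AttributeError:
    def sleep_ms(ms):                 # CPython fallback
        time.sleep(ms / 1000)

def send_chunked(sock, data, chunk_size=512, delay_ms=10):
    # Send in small pieces with a pause between them, to avoid
    # overrunning the ESP32 SPI buffer ([Errno 5] EIO)
    for i in range(0, len(data), chunk_size):
        sock.send(data[i:i + chunk_size])
        sleep_ms(delay_ms)
```

For a 16KB payload this issues 32 sends of 512 bytes with a 10ms gap, about 0.3s of pacing overhead.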
---

## Best Practices for MaixPy

1. **Avoid complex expressions** - Break them into simple steps
2. **Pre-allocate when possible** - Reduces fragmentation
3. **Use small buffers** - 512-1024 byte chunks work well
4. **Add delays in loops** - Prevents watchdog/buffer issues
5. **Explicit type conversions** - Always use `str()`, `int()`, etc.
6. **Test incrementally** - Memory errors appear suddenly
7. **Monitor serial output** - Errors often give hints
8. **Simplify, simplify** - Complexity = bugs in MicroPython

---

## Testing Methodology

When porting Python code to MaixPy:

1. Start with the simplest version (hardcoded values)
2. Test each function individually via the REPL
3. Add features incrementally
4. Watch for memory errors (usually allocation failures)
5. If an error occurs, simplify the last change
6. Use print statements liberally (no debugger available)

---

## Hardware-Specific Notes

### Maixduino ESP32 WiFi
- Requires manual pin registration
- 6 pins must be configured (CS, RST, RDY, MOSI, MISO, SCLK)
- Connection can be slow (20+ seconds)
- Stability improves with smaller packet sizes

### I2S Microphone
- Returns Audio objects, not raw bytes
- The actual format often differs from the configured one
- May return stereo when mono is requested
- May return 32-bit when 16-bit is requested
- Always implement format detection/conversion
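The detection/conversion the last bullet recommends can look like this. A sketch only: it assumes interleaved little-endian 32-bit stereo frames and keeps the top 16 bits of the left channel (a 4x size reduction):

```python
import struct

def stereo32_to_mono16(raw):
    # Each frame is 8 bytes: 32-bit left sample followed by 32-bit right.
    # Keep the left channel and its most significant 16 bits.
    n_frames = len(raw) // 8
    out = bytearray(n_frames * 2)
    for i in range(n_frames):
        left = struct.unpack_from("<i", raw, i * 8)[0]
        struct.pack_into("<h", out, i * 2, left >> 16)
    return out
```

In practice you would first compare `len(raw)` against the expected mono 16-bit size and only convert when the 4x mismatch is detected.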
### BOOT Button (GPIO 16)
- Active low (0 = pressed, 1 = released)
- Requires pull-up configuration
- Debounce by waiting for release
- Can be used without interrupts (polling is fine)
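A polling sketch of that press-and-release handling, with pin access abstracted as a callable since the exact GPIO API varies by firmware (names are illustrative):

```python
def wait_for_press(read_pin, sleep_ms):
    # BOOT button is active low: 0 = pressed, 1 = released.
    while read_pin() == 1:   # wait until pressed
        sleep_ms(10)
    while read_pin() == 0:   # debounce by waiting for release
        sleep_ms(10)
```

On the Maixduino, `read_pin` would wrap `GPIO.value()` on the pull-up-configured pin 16 and `sleep_ms` would be `time.sleep_ms`.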
---

## Resources

- **MaixPy Documentation:** https://maixpy.sipeed.com/
- **K210 Datasheet:** https://canaan.io/product/kendryteai
- **ESP32 SPI Firmware:** https://github.com/sipeed/MaixPy_scripts/tree/master/network

---

## Summary of Successful Patterns

```python
# Audio recording and transmission pipeline
1. Record audio → Audio object (128KB for 1 second)
2. Convert to bytes → to_bytes() (still 128KB)
3. Detect format → Check size vs expected
4. Convert to mono 16-bit → In-place copy (32KB)
5. Compress with μ-law → 50% reduction (16KB)
6. Send in chunks → 512 bytes at a time with delays
7. Parse response → Simple string operations

# Total: ~85% size reduction, fits in memory!
```
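Step 7's "simple string operations" can be as small as this: a naive extraction of one string value from a flat JSON body, for firmware builds where the `json` module is unavailable (function name is illustrative):

```python
def extract_json_field(body, key):
    # Find '"key":' and return the quoted string value after it.
    # Naive: no escapes, no nesting - fine for a flat {"text": "..."} reply.
    marker = '"' + key + '":'
    start = body.find(marker)
    if start < 0:
        return None
    start = body.find('"', start + len(marker))
    if start < 0:
        return None
    end = body.find('"', start + 1)
    return body[start + 1:end]
```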
This approach works reliably on the K210 with ~6MB RAM.

---

**Last Updated:** 2025-12-03
**Status:** Fully tested and working
hardware/maixduino/README.md (Executable file, 184 lines)
@@ -0,0 +1,184 @@
|
||||||
|
# Maixduino Scripts
|
||||||
|
|
||||||
|
Scripts to copy/paste into MaixPy IDE for running on the Maix Duino board.
|
||||||
|
|
||||||
|
## Files
|
||||||
|
|
||||||
|
### 1. maix_test_simple.py
|
||||||
|
**Purpose:** Hardware and connectivity test
|
||||||
|
**Use:** Copy/paste into MaixPy IDE to test before deploying full application
|
||||||
|
|
||||||
|
**Tests:**
|
||||||
|
- LCD display functionality
|
||||||
|
- WiFi connection
|
||||||
|
- Network connection to Heimdall server (port 3006)
|
||||||
|
- I2S audio hardware initialization
|
||||||
|
|
||||||
|
**Before running:**
|
||||||
|
1. Edit WiFi credentials (lines 16-17):
|
||||||
|
```python
|
||||||
|
WIFI_SSID = "YourNetworkName"
|
||||||
|
WIFI_PASSWORD = "YourPassword"
|
||||||
|
```
|
||||||
|
2. Verify server URL is correct (line 18):
|
||||||
|
```python
|
||||||
|
SERVER_URL = "http://10.1.10.71:3006"
|
||||||
|
```
|
||||||
|
3. Copy entire file contents
|
||||||
|
4. Paste into MaixPy IDE
|
||||||
|
5. Click RUN button
|
||||||
|
|
||||||
|
**Expected output:**
|
||||||
|
- Display will show test results
|
||||||
|
- Serial console will print detailed progress
|
||||||
|
- Will report OK/FAIL for each test
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 2. maix_voice_client.py
|
||||||
|
**Purpose:** Full voice assistant client
|
||||||
|
**Use:** Copy/paste into MaixPy IDE after test passes
|
||||||
|
|
||||||
|
**Features:**
|
||||||
|
- Wake word detection (placeholder - uses amplitude trigger)
|
||||||
|
- Audio recording after wake word
|
||||||
|
- Sends audio to Heimdall server for processing
|
||||||
|
- Displays transcription and response on LCD
|
||||||
|
- LED feedback for status
|
||||||
|
|
||||||
|
**Before running:**
|
||||||
|
1. Edit WiFi credentials (lines 38-39)
|
||||||
|
2. Verify server URL (line 42)
|
||||||
|
3. Adjust audio settings if needed (lines 45-62)
|
||||||
|
|
||||||
|
**For SD card deployment:**
|
||||||
|
1. Copy this file to SD card as `main.py`
|
||||||
|
2. Board will auto-run on boot
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Deployment Workflow
|
||||||
|
|
||||||
|
### Step 1: Test Hardware (maix_test_simple.py)
|
||||||
|
```
|
||||||
|
1. Edit WiFi settings
|
||||||
|
2. Paste into MaixPy IDE
|
||||||
|
3. Click RUN
|
||||||
|
4. Verify all tests pass
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 2: Deploy Full Client (maix_voice_client.py)
|
||||||
|
**Option A - IDE Testing:**
|
||||||
|
```
|
||||||
|
1. Edit WiFi settings
|
||||||
|
2. Paste into MaixPy IDE
|
||||||
|
3. Click RUN for testing
|
||||||
|
```
|
||||||
|
|
||||||
|
**Option B - Permanent SD Card:**
|
||||||
|
```
|
||||||
|
1. Edit WiFi settings
|
||||||
|
2. Save to SD card as: /sd/main.py
|
||||||
|
3. Reboot board - auto-runs on boot
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Hardware Requirements
|
||||||
|
|
||||||
|
### Maix Duino Board
|
||||||
|
- K210 processor with KPU
|
||||||
|
- LCD display (built-in)
|
||||||
|
- I2S microphone (check connections)
|
||||||
|
- ESP32 WiFi module (built-in)
|
||||||
|
|
||||||
|
### I2S Pin Configuration (Default)
|
||||||
|
```python
|
||||||
|
Pin 20: I2S0_IN_D0 (Data)
|
||||||
|
Pin 19: I2S0_WS (Word Select)
|
||||||
|
Pin 18: I2S0_SCLK (Clock)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Note:** If your microphone uses different pins, edit the pin assignments in the scripts.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
### WiFi Won't Connect
|
||||||
|
- Verify SSID and password are correct
|
||||||
|
- Ensure WiFi is 2.4GHz (not 5GHz - Maix doesn't support 5GHz)
|
||||||
|
- Check signal strength
|
||||||
|
- Try moving closer to router
|
||||||
|
|
||||||
|
### Server Connection Fails
|
||||||
|
- Verify Heimdall server is running on port 3006
|
||||||
|
- Check firewall allows port 3006
|
||||||
|
- Ensure Maix is on same network (10.1.10.0/24)
|
||||||
|
- Test from another device: `curl http://10.1.10.71:3006/health`
|
||||||
|
|
||||||
|
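The curl check has a Python equivalent using only raw sockets, which is also roughly what the Maix client itself can do; `check_health` is an illustrative name, and the host/port defaults are the ones from this document:

```python
import socket

def check_health(host="10.1.10.71", port=3006):
    # Minimal raw-socket GET /health; returns the start of the response
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5)
    try:
        s.connect((host, port))
        s.send(b"GET /health HTTP/1.1\r\nHost: " + host.encode() +
               b"\r\nConnection: close\r\n\r\n")
        return s.recv(1024).decode()
    finally:
        s.close()
```

A healthy server answers with an `HTTP/1.1 200` status line; anything else (or a timeout) points at the firewall or the server process.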
### Audio Initialization Fails
- Check the microphone is properly connected
- Verify the I2S pins match your hardware
- Try an alternate pin configuration if needed
- Check the microphone is powered at 3.3V (not 5V)

### Script Errors in MaixPy IDE
- Ensure you are using the latest MaixPy firmware
- Check for typos when editing WiFi credentials
- Verify the entire script was copied (check for truncation)
- Look at the serial console for detailed error messages

---

## MaixPy IDE Tips

### Running Scripts
1. Connect the board via USB
2. Select the correct board model: Tools → Select Board
3. Click the connect button (turns red when connected)
4. Paste code into the editor
5. Click the run button (red triangle)
6. Watch the serial console and LCD for output

### Stopping Scripts
- Click the run button again to stop
- Or press the reset button on the board

### Serial Console
- Shows detailed debug output
- Useful for troubleshooting
- Errors can be copied out for debugging

---

## Network Configuration

- **Heimdall Server:** 10.1.10.71:3006
- **Maix Duino:** Gets an IP via DHCP (shown on LCD during the test)
- **Network:** 10.1.10.0/24

---

## Next Steps

After both scripts work:
1. Verify the Heimdall server is processing audio
2. Test wake word detection
3. Integrate with Home Assistant (optional)
4. Train a custom wake word (optional)
5. Deploy to the SD card for permanent installation

---

## Related Documentation

- **Project overview:** `../PROJECT_SUMMARY.md`
- **Heimdall setup:** `../QUICKSTART.md`
- **Wake word training:** `../MYCROFT_PRECISE_GUIDE.md`
- **Server deployment:** `../docs/PRECISE_DEPLOYMENT.md`

---

**Last Updated:** 2025-12-03
**Location:** `/Library/Development/devl/Devops/projects/mycroft-precise/maixduino-scripts/`
hardware/maixduino/SESSION_PROGRESS_2025-12-03.md (Executable file, 376 lines)
@@ -0,0 +1,376 @@
|
||||||
|
# Maixduino Voice Assistant - Session Progress
|
||||||
|
|
||||||
|
**Date:** 2025-12-03
|
||||||
|
**Session Duration:** ~4 hours
|
||||||
|
**Goal:** Get audio recording and transcription working on Maixduino → Heimdall server
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🎉 Major Achievements
|
||||||
|
|
||||||
|
### ✅ Full Audio Pipeline Working!
|
||||||
|
We successfully built and tested the complete audio capture → compression → transmission → transcription pipeline:
|
||||||
|
|
||||||
|
1. **WiFi Connection** - Maixduino connects to network (10.1.10.98)
|
||||||
|
2. **Audio Recording** - I2S microphone captures audio (MSM261S4030H0 MEMS mic)
|
||||||
|
3. **Format Conversion** - Converts 32-bit stereo to 16-bit mono (4x size reduction)
|
||||||
|
4. **μ-law Compression** - Compresses PCM audio by 50%
|
||||||
|
5. **HTTP Transmission** - Sends compressed WAV to Heimdall server
|
||||||
|
6. **Whisper Transcription** - Server transcribes and returns text
|
||||||
|
7. **LCD Display** - Shows transcription on Maixduino screen
|
||||||
|
8. **Button Loop** - Press BOOT button for repeated recordings
|
||||||
|
|
||||||
|
**Total size reduction:** 128KB → 32KB (mono) → 16KB (compressed) = **87.5% reduction!**
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🔧 Technical Accomplishments
|
||||||
|
|
||||||
|
### Audio Recording Pipeline
|
||||||
|
- **Initial Problem:** `i2s_dev.record()` returned immediately (1ms instead of 1000ms)
|
||||||
|
- **Root Cause:** Recording API is asynchronous/non-blocking
|
||||||
|
- **Solution:** Use chunked recording with `wait_record()` blocking calls
|
||||||
|
- **Pattern:**
|
||||||
|
```python
|
||||||
|
for i in range(frame_cnt):
|
||||||
|
audio_chunk = i2s_dev.record(chunk_size)
|
||||||
|
i2s_dev.wait_record() # CRITICAL: blocks until complete
|
||||||
|
chunks.append(audio_chunk.to_bytes())
|
||||||
|
```
|
||||||
|
|
||||||
|
### Memory Management
|
||||||
|
- **K210 has very limited RAM** (~6MB total, much less available)
|
||||||
|
- Successfully handled 128KB → 16KB data transformation without OOM errors
|
||||||
|
- Techniques used:
|
||||||
|
- Record in small chunks (2048 samples)
|
||||||
|
- Stream HTTP transmission (512-byte chunks with delays)
|
||||||
|
- In-place data conversion where possible
|
||||||
|
- Explicit garbage collection hints (`audio_data = None`)
|
||||||
|
|
||||||
|
### Network Communication
|
||||||
|
- **Raw socket HTTP** (no urequests library available)
|
||||||
|
- **Chunked streaming** with flow control (10ms delays)
|
||||||
|
- **Simple WAV format** with μ-law compression (format code 7)
|
||||||
|
- **Robust error handling** with serial output debugging
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🐛 MicroPython/MaixPy Quirks Discovered
|
||||||
|
|
||||||
|
### String Operations
|
||||||
|
- ❌ **F-strings NOT supported** - Must use `"text " + str(var)` concatenation
|
||||||
|
- ❌ **Ternary operators fail** - Use explicit `if/else` blocks instead
|
||||||
|
- ❌ **`split()` needs explicit delimiter** - `text.split(" ")` not `text.split()`
|
||||||
|
- ❌ **Escape sequences problematic** - Avoid `\n` in strings, causes syntax errors
|
||||||
|
|
||||||
|
### Data Types & Methods
|
||||||
|
- ❌ **`decode()` doesn't accept kwargs** - Use `decode('utf-8')` not `decode('utf-8', errors='ignore')`
|
||||||
|
- ❌ **RGB tuples not accepted** - Must convert to packed integers: `(r << 16) | (g << 8) | b`
|
||||||
|
- ❌ **Bytearray item deletion unsupported** - `del arr[n:]` fails, use slicing instead
|
||||||
|
- ❌ **Arithmetic in string concat** - Separate calculations: `next = count + 1; "text" + str(next)`
|
||||||
|
|
||||||
|
### I2S Audio Specific
|
||||||
|
- ❌ **`record()` is non-blocking** - Returns immediately, must use `wait_record()`
|
||||||
|
- ❌ **Audio object not directly iterable** - Must call `.to_bytes()` first
|
||||||
|
- ⚠️ **Data format mismatch** - Hardware returns 32-bit stereo even when configured for 16-bit mono (4x expected size)
|
||||||
|
|
||||||
|
### Network/WiFi
|
||||||
|
- ❌ **`network.WLAN` not available** - Must use `network.ESP32_SPI` with full pin config
|
||||||
|
- ❌ **`active()` method doesn't exist** - Just call `connect()` directly
|
||||||
|
- ⚠️ **Requires ALL 6 pins configured** - CS, RST, RDY, MOSI, MISO, SCLK
|
||||||
|
|
||||||
|
### General Syntax
|
||||||
|
- ⚠️ **`if __name__ == "__main__"` sometimes causes syntax errors** - Safer to just call `main()` directly
|
||||||
|
- ⚠️ **Import statements mid-function can cause syntax errors** - Keep imports at top of file
|
||||||
|
- ⚠️ **Some valid Python causes "invalid syntax" for unknown reasons** - Simplify complex expressions
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 📊 Current Status
|
||||||
|
|
||||||
|
### ✅ Working
|
||||||
|
- WiFi connectivity (ESP32 SPI)
|
||||||
|
- I2S audio initialization
|
||||||
|
- Chunked audio recording with `wait_record()`
|
||||||
|
- Audio format detection and conversion (32-bit stereo → 16-bit mono)
|
||||||
|
- μ-law compression (50% size reduction)
|
||||||
|
- HTTP transmission to server (chunked streaming)
|
||||||
|
- Whisper transcription (server-side)
|
||||||
|
- JSON response parsing
|
||||||
|
- LCD display (with word wrapping)
|
||||||
|
- Button-triggered recording loop
|
||||||
|
- Countdown timer before recording
|
||||||
|
|
||||||
|
### ⚠️ Partially Working
|
||||||
|
- **Recording duration** - Currently getting ~0.9 seconds instead of full 1 second
|
||||||
|
- Formula: `frame_cnt = seconds * sample_rate // chunk_size`
|
||||||
|
- Current: `7 frames × (2048/16000) = 0.896s`
|
||||||
|
- May need to increase `frame_cnt` or adjust chunk size
|
||||||
|
|
||||||
|
### ❌ Not Yet Implemented
|
||||||
|
- Mycroft Precise wake word detection
|
||||||
|
- Full voice assistant loop
|
||||||
|
- Command processing
|
||||||
|
- Home Assistant integration
|
||||||
|
- Multi-second recording support
|
||||||
|
- Real-time audio streaming
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🔬 Technical Details
|
||||||
|
|
||||||
|
### Hardware Configuration
|
||||||
|
|
||||||
|
**Maixduino Board:**
|
||||||
|
- Processor: K210 dual-core RISC-V @ 400MHz
|
||||||
|
- RAM: ~6MB total (limited available memory)
|
||||||
|
- WiFi: ESP32 module via SPI
|
||||||
|
- Microphone: MSM261S4030H0 MEMS (onboard)
|
||||||
|
- IP Address: 10.1.10.98
|
||||||
|
|
||||||
|
**I2S Pins:**
|
||||||
|
- Pin 20: I2S0_IN_D0 (data)
|
||||||
|
- Pin 19: I2S0_WS (word select)
|
||||||
|
- Pin 18: I2S0_SCLK (clock)
|
||||||
|
|
||||||
|
**ESP32 SPI Pins:**
|
||||||
|
- Pin 25: CS (chip select)
|
||||||
|
- Pin 8: RST (reset)
|
||||||
|
- Pin 9: RDY (ready)
|
||||||
|
- Pin 28: MOSI (master out)
|
||||||
|
- Pin 26: MISO (master in)
|
||||||
|
- Pin 27: SCLK (clock)
|
||||||
|
|
||||||
|
**GPIO:**
|
||||||
|
- Pin 16: BOOT button (active low, pull-up)
|
||||||
|
|
||||||
|
### Server Configuration
|
||||||
|
|
||||||
|
**Heimdall Server:**
|
||||||
|
- IP: 10.1.10.71
|
||||||
|
- Port: 3006
|
||||||
|
- Framework: Flask
|
||||||
|
- Model: Whisper base
|
||||||
|
- Environment: Conda `whisper_cli`
|
||||||
|
|
||||||
|
**Endpoints:**
|
||||||
|
- `/health` - Health check
|
||||||
|
- `/transcribe` - POST audio for transcription
|
||||||
|
|
||||||
|
### Audio Format
|
||||||
|
|
||||||
|
**Recording:**
|
||||||
|
- Sample Rate: 16kHz
|
||||||
|
- Hardware Output: 32-bit stereo (128KB for 1 second)
|
||||||
|
- After Conversion: 16-bit mono (32KB for 1 second)
|
||||||
|
- After Compression: 8-bit μ-law (16KB for 1 second)
|
||||||
|
|
||||||
|
**WAV Header:**
|
||||||
|
- Format Code: 7 (μ-law)
|
||||||
|
- Channels: 1 (mono)
|
||||||
|
- Sample Rate: 16000 Hz
|
||||||
|
- Bits per Sample: 8
|
||||||
|
- Includes `fact` chunk (required for μ-law)
|
||||||
|
|
||||||
|
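That header layout can be sketched with `struct` (format code 7, mono, 16 kHz, 8 bits, plus the `fact` chunk non-PCM formats require). This is illustrative, not the project's exact `create_wav_header`:

```python
import struct

def ulaw_wav_header(num_samples, sample_rate=16000):
    # mu-law is 1 byte per sample, mono
    data_size = num_samples
    # RIFF size = "WAVE" + fmt chunk (8+18) + fact chunk (8+4) + data chunk (8+n)
    riff_size = 4 + (8 + 18) + (8 + 4) + (8 + data_size)
    h = b"RIFF" + struct.pack("<I", riff_size) + b"WAVE"
    # fmt chunk is 18 bytes: non-PCM formats carry a trailing cbSize field (0)
    h += b"fmt " + struct.pack("<IHHIIHHH", 18, 7, 1, sample_rate,
                               sample_rate, 1, 8, 0)
    h += b"fact" + struct.pack("<II", 4, num_samples)
    h += b"data" + struct.pack("<I", data_size)
    return h
```

The resulting header is 58 bytes; the compressed μ-law bytes are appended directly after it.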
---

## 📝 Code Files

### Main Script
**File:** `/Library/Development/devl/Devops/projects/mycroft-precise/maixduino-scripts/maix_simple_record_test.py`

**Key Functions:**
- `init_wifi()` - ESP32 SPI WiFi connection
- `init_audio()` - I2S microphone setup
- `record_audio()` - Chunked recording with `wait_record()`
- `convert_to_mono_16bit()` - Format conversion (32-bit stereo → 16-bit mono)
- `compress_ulaw()` - μ-law compression
- `create_wav_header()` - WAV file header generation
- `send_to_server()` - HTTP POST with chunked streaming
- `display_transcription()` - LCD output with word wrapping
- `main()` - Button loop for repeated recordings

### Server Script
**File:** `/devl/voice-assistant/simple_transcribe_server.py`

**Features:**
- Accepts raw WAV or multipart uploads
- Whisper base model transcription
- JSON response with the transcription text
- Handles μ-law compressed audio

### Documentation
**File:** `/Library/Development/devl/Devops/projects/mycroft-precise/maixduino-scripts/MICROPYTHON_QUIRKS.md`

A complete reference of all MicroPython compatibility issues discovered during development.

---

## 🎯 Next Steps

### Immediate (Tonight)
1. ✅ Switch to the Linux laptop with direct serial access
2. ⏭️ Tune the recording duration to get the full 1 second
   - Try `frame_cnt = 8` instead of 7
   - Or adjust the chunk size to get exact timing
3. ⏭️ Test transcription quality with proper-length recordings

### Short Term (This Week)
1. Increase the recording duration to 2-3 seconds for better transcription
2. Test memory limits with longer recordings
3. Optimize compression/transmission for speed
4. Add visual feedback during transmission

### Medium Term (Next Week)
1. Install Mycroft Precise in the `whisper_cli` environment
2. Test "hey mycroft" wake word detection on the server
3. Integrate the wake word into the recording loop
4. Add command processing and Home Assistant integration

### Long Term (Future)
1. Explore edge wake word detection (Precise on the K210)
2. Multi-device deployment
3. Continuous listening mode
4. Voice profiles and speaker identification

---

## 🐛 Known Issues

### Recording Duration
- **Issue:** Recording is ~0.9 seconds instead of 1.0 seconds
- **Cause:** Integer division `16000 // 2048` truncates 7.8125 down to 7 frames
- **Impact:** Minor - transcription still works
- **Fix:** Increase `frame_cnt` to 8 or adjust the chunk size
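The proposed fix is just ceiling division instead of floor division; the arithmetic works out as:

```python
chunk_size = 2048
sample_rate = 16000
seconds = 1

target = seconds * sample_rate                          # 16000 samples wanted
frames_floor = target // chunk_size                     # 7 frames  -> 14336 samples (~0.896 s)
frames_ceil = (target + chunk_size - 1) // chunk_size   # 8 frames  -> 16384 samples (~1.024 s)
```

Rounding up overshoots slightly (~1.024 s), which is harmless for transcription; the extra samples can be trimmed by slicing the buffer to `target * 2` bytes.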
### Data Format Mismatch
- **Issue:** Hardware returns 4x the expected data (128KB vs 32KB)
- **Cause:** I2S outputs 32-bit stereo despite the 16-bit mono config
- **Impact:** None - the conversion function handles it
- **Status:** Working as intended

### Syntax Error Sensitivity
- **Issue:** Some valid Python causes "invalid syntax" in MicroPython
- **Patterns:** Import statements mid-function, certain arithmetic expressions
- **Workaround:** Simplify code, avoid complex expressions
- **Status:** Documented in MICROPYTHON_QUIRKS.md

---

## 💡 Key Learnings

### I2S Recording Pattern
The correct pattern for MaixPy I2S recording:
```python
chunk_size = 2048
frame_cnt = seconds * sample_rate // chunk_size

for i in range(frame_cnt):
    audio_chunk = i2s_dev.record(chunk_size)
    i2s_dev.wait_record()  # BLOCKS until recording complete
    data.append(audio_chunk.to_bytes())
```

**Critical:** `wait_record()` is REQUIRED or the recording returns immediately!

### Memory Management
The K210 has very limited RAM. Successful strategies:
- Work in small chunks (512-2048 bytes)
- Stream data instead of buffering
- Free variables explicitly when done
- Avoid creating large intermediate buffers

### MicroPython Compatibility
MicroPython is NOT Python. Many standard features are missing:
- F-strings, ternary operators, keyword arguments
- Some string methods, complex expressions
- Standard libraries (urequests, json parsing)

**Rule:** Test incrementally, simplify everything, check the quirks doc.

---

## 📚 Resources Used

### Documentation
- [MaixPy I2S API Reference](https://wiki.sipeed.com/soft/maixpy/en/api_reference/Maix/i2s.html)
- [MaixPy I2S Usage Guide](https://wiki.sipeed.com/soft/maixpy/en/modules/on_chip/i2s.html)
- [Maixduino Hardware Wiki](https://wiki.sipeed.com/hardware/en/maix/maixpy_develop_kit_board/maix_duino.html)

### Code Examples
- [Official record_wav.py](https://github.com/sipeed/MaixPy-v1_scripts/blob/master/multimedia/audio/record_wav.py)
- [MaixPy Scripts Repository](https://github.com/sipeed/MaixPy-v1_scripts)

### Tools
- MaixPy IDE (copy/paste to board)
- Serial monitor (debugging)
- Heimdall server (Whisper transcription)

---

## 🔄 Ready for Next Session

### Current State
- ✅ Code is working and stable
- ✅ Can record, compress, transmit, transcribe, display
- ✅ Button loop allows repeated testing
- ⚠️ Recording duration slightly short (~0.9s)

### Files Ready
- `/Library/Development/devl/Devops/projects/mycroft-precise/maixduino-scripts/maix_simple_record_test.py`
- `/Library/Development/devl/Devops/projects/mycroft-precise/maixduino-scripts/MICROPYTHON_QUIRKS.md`
- `/devl/voice-assistant/simple_transcribe_server.py`

### For Serial Access Session
1. Connect the Maixduino via USB to the Linux laptop
2. Install pyserial: `pip install pyserial`
3. Find the device: `ls /dev/ttyUSB*` or `/dev/ttyACM*`
4. Connect: `screen /dev/ttyUSB0 115200` or use the MaixPy IDE
5. Modify code directly, test immediately, and watch the serial output

### Quick Test Commands
```python
# Test WiFi
from network import ESP32_SPI
# ... (full init code in maix_test_simple.py)

# Test I2S
from Maix import I2S
rx = I2S(I2S.DEVICE_0)
# ...

# Test recording
audio = rx.record(2048)
rx.wait_record()
print(len(audio.to_bytes()))
```

---

## 🎊 Success Metrics

Today we achieved:
- ✅ WiFi connection working
- ✅ Audio recording working (with proper blocking)
- ✅ Format conversion working (4x reduction)
- ✅ Compression working (2x reduction)
- ✅ Network transmission working (chunked streaming)
- ✅ Server transcription working
- ✅ Display output working
- ✅ Button loop working
- ✅ End-to-end pipeline complete!

**Total:** 9/9 core features working! 🚀

Minor tuning is still needed, but the foundation is solid and ready for wake word integration.

---

**Session Summary:** Massive progress! From zero to a working audio transcription pipeline in one session. Overcame significant MicroPython compatibility challenges and memory limitations. Ready for the next phase: wake word detection.

**Status:** ✅ Ready for Linux serial access and fine-tuning
**Next Session:** Tune the recording duration, then integrate Mycroft Precise wake word detection

---

*End of Session Report - 2025-12-03*
hardware/maixduino/maix_debug_wifi.py (Executable file, 41 lines)
@@ -0,0 +1,41 @@
|
||||||
|
# Debug script to discover WiFi module methods
|
||||||
|
# This will help us figure out the correct API
|
||||||
|
|
||||||
|
import lcd
|
||||||
|
|
||||||
|
lcd.init()
|
||||||
|
lcd.clear()
|
||||||
|
|
||||||
|
print("=" * 40)
|
||||||
|
print("WiFi Module Debug")
|
||||||
|
print("=" * 40)
|
||||||
|
|
||||||
|
# Try to import WiFi module
|
||||||
|
try:
|
||||||
|
from network_esp32 import wifi
|
||||||
|
print("SUCCESS: Imported network_esp32.wifi")
|
||||||
|
lcd.draw_string(10, 10, "WiFi module found!", 0xFFFF, 0x0000)
|
||||||
|
|
||||||
|
# List all attributes/methods
|
||||||
|
print("\nAvailable methods:")
|
||||||
|
lcd.draw_string(10, 30, "Checking methods...", 0xFFFF, 0x0000)
|
||||||
|
|
||||||
|
attrs = dir(wifi)
|
||||||
|
y = 50
|
||||||
|
for i, attr in enumerate(attrs):
|
||||||
|
if not attr.startswith('_'):
|
||||||
|
print(" - " + attr)
|
||||||
|
if i < 10: # Only show first 10 on screen
|
||||||
|
lcd.draw_string(10, y, attr[:20], 0x07E0, 0x0000)
|
||||||
|
y += 15
|
||||||
|
|
||||||
|
print("\nTotal methods: " + str(len(attrs)))
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print("ERROR importing wifi: " + str(e))
|
||||||
|
lcd.draw_string(10, 10, "WiFi import failed!", 0xF800, 0x0000)
|
||||||
|
lcd.draw_string(10, 30, str(e)[:30], 0xF800, 0x0000)
|
||||||
|
|
||||||
|
print("\n" + "=" * 40)
|
||||||
|
print("Debug complete - check serial output")
|
||||||
|
print("=" * 40)
|
||||||
hardware/maixduino/maix_discover_modules.py (Executable file, 51 lines)
@@ -0,0 +1,51 @@
|
||||||
|
# Discover what network/WiFi modules are actually available
|
||||||
|
import lcd
|
||||||
|
import sys
|
||||||
|
|
||||||
|
lcd.init()
|
||||||
|
lcd.clear()
|
||||||
|
|
||||||
|
print("=" * 40)
|
||||||
|
print("Module Discovery")
|
||||||
|
print("=" * 40)
|
||||||
|
|
||||||
|
# Try different possible module names
|
||||||
|
modules_to_try = [
|
||||||
|
"network",
|
||||||
|
"network_esp32",
|
||||||
|
"network_esp8285",
|
||||||
|
"esp32_spi",
|
||||||
|
"esp8285",
|
||||||
|
"wifi",
|
||||||
|
"ESP32_SPI",
|
||||||
|
"WIFI"
|
||||||
|
]
|
||||||
|
|
||||||
|
found = []
|
||||||
|
y = 10
|
||||||
|
|
||||||
|
for module_name in modules_to_try:
|
||||||
|
try:
|
||||||
|
mod = __import__(module_name)
|
||||||
|
msg = "FOUND: " + module_name
|
||||||
|
print(msg)
|
||||||
|
lcd.draw_string(10, y, msg[:25], 0x07E0, 0x0000) # Green
|
||||||
|
y += 15
|
||||||
|
found.append(module_name)
|
||||||
|
|
||||||
|
# Show methods
|
||||||
|
print(" Methods: " + str(dir(mod)))
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
msg = "NONE: " + module_name
|
||||||
|
print(msg + " (" + str(e) + ")")
|
||||||
|
|
||||||
|
print("\n" + "=" * 40)
|
||||||
|
if found:
|
||||||
|
print("Found modules: " + str(found))
|
||||||
|
lcd.draw_string(10, y + 20, "Found: " + str(len(found)), 0xFFFF, 0x0000)
|
||||||
|
else:
|
||||||
|
print("No WiFi modules found!")
|
||||||
|
lcd.draw_string(10, y + 20, "No WiFi found!", 0xF800, 0x0000)
|
||||||
|
|
||||||
|
print("=" * 40)
|
||||||
hardware/maixduino/maix_simple_record_test.py (Normal file, 461 lines)
@@ -0,0 +1,461 @@
|
||||||
|
# Simple Audio Recording and Transcription Test
|
||||||
|
# Record audio for 3 seconds, send to server, display transcription
|
||||||
|
#
|
||||||
|
# This tests the full audio pipeline without wake word detection
|
||||||
|
|
||||||
|
import time
|
||||||
|
import lcd
|
||||||
|
import socket
|
||||||
|
import struct
|
||||||
|
from Maix import GPIO, I2S
|
||||||
|
from fpioa_manager import fm
|
||||||
|
|
||||||
|
# ===== CONFIGURATION =====
|
||||||
|
# Load credentials from secrets.py (gitignored)
|
||||||
|
try:
|
||||||
|
from secrets import SECRETS
|
||||||
|
except ImportError:
|
||||||
|
SECRETS = {}
|
||||||
|
|
||||||
|
WIFI_SSID = "Tell My WiFi Love Her"
|
||||||
|
WIFI_PASSWORD = SECRETS.get("wifi_password", "") # set in secrets.py
|
||||||
|
SERVER_HOST = "10.1.10.71"
|
||||||
|
SERVER_PORT = 3006
|
||||||
|
RECORD_SECONDS = 1 # Reduced to 1 second to save memory
|
||||||
|
SAMPLE_RATE = 16000
|
||||||
|
# ==========================
|
||||||
|
|
||||||
|
# Colors
|
||||||
|
def rgb_to_int(r, g, b):
|
||||||
|
return (r << 16) | (g << 8) | b
|
||||||
|
|
||||||
|
COLOR_BLACK = 0
|
||||||
|
COLOR_WHITE = rgb_to_int(255, 255, 255)
|
||||||
|
COLOR_RED = rgb_to_int(255, 0, 0)
|
||||||
|
COLOR_GREEN = rgb_to_int(0, 255, 0)
|
||||||
|
COLOR_BLUE = rgb_to_int(0, 0, 255)
|
||||||
|
COLOR_YELLOW = rgb_to_int(255, 255, 0)
|
||||||
|
COLOR_CYAN = 0x00FFFF # Cyan: rgb_to_int(0, 255, 255)
|
||||||
|
|
||||||
|
def display_msg(msg, color=COLOR_WHITE, y=50, clear=False):
    """Display message on LCD"""
    if clear:
        lcd.clear(COLOR_BLACK)
    lcd.draw_string(10, y, msg[:30], color, COLOR_BLACK)
    print(msg)


def init_wifi():
    """Initialize WiFi connection"""
    from network import ESP32_SPI

    lcd.init()
    lcd.clear(COLOR_BLACK)
    display_msg("Connecting WiFi...", COLOR_BLUE, 10)

    # Register ESP32 SPI pins
    fm.register(25, fm.fpioa.GPIOHS10, force=True)  # CS
    fm.register(8, fm.fpioa.GPIOHS11, force=True)   # RST
    fm.register(9, fm.fpioa.GPIOHS12, force=True)   # RDY
    fm.register(28, fm.fpioa.GPIOHS13, force=True)  # MOSI
    fm.register(26, fm.fpioa.GPIOHS14, force=True)  # MISO
    fm.register(27, fm.fpioa.GPIOHS15, force=True)  # SCLK

    nic = ESP32_SPI(
        cs=fm.fpioa.GPIOHS10, rst=fm.fpioa.GPIOHS11, rdy=fm.fpioa.GPIOHS12,
        mosi=fm.fpioa.GPIOHS13, miso=fm.fpioa.GPIOHS14, sclk=fm.fpioa.GPIOHS15
    )

    nic.connect(WIFI_SSID, WIFI_PASSWORD)

    # Wait for connection
    timeout = 20
    while timeout > 0:
        time.sleep(1)
        if nic.isconnected():
            ip = nic.ifconfig()[0]
            display_msg("WiFi OK: " + str(ip), COLOR_GREEN, 30)
            return nic
        timeout -= 1

    display_msg("WiFi FAILED!", COLOR_RED, 30)
    return None
def init_audio():
    """Initialize I2S audio"""
    display_msg("Init audio...", COLOR_BLUE, 50)

    # Register I2S pins
    fm.register(20, fm.fpioa.I2S0_IN_D0, force=True)
    fm.register(19, fm.fpioa.I2S0_WS, force=True)
    fm.register(18, fm.fpioa.I2S0_SCLK, force=True)

    # Initialize I2S
    rx = I2S(I2S.DEVICE_0)
    rx.channel_config(rx.CHANNEL_0, rx.RECEIVER, align_mode=I2S.STANDARD_MODE)
    rx.set_sample_rate(SAMPLE_RATE)

    display_msg("Audio OK!", COLOR_GREEN, 70)
    return rx


def convert_to_mono_16bit(audio_data):
    """Convert audio to mono 16-bit by copying the first 2 bytes of each frame"""
    expected_size = SAMPLE_RATE * RECORD_SECONDS * 2  # 16-bit mono
    actual_size = len(audio_data)

    print("Expected size: " + str(expected_size) + ", Actual: " + str(actual_size))

    # If we got 4x the expected data (stereo/32-bit), extract mono 16-bit
    if actual_size == expected_size * 4:
        print("Extracting mono from stereo/32-bit...")
        # Create a new buffer with only the data we need (first 2 bytes of every 4)
        mono_data = bytearray(expected_size)
        write_pos = 0
        for read_pos in range(0, actual_size, 4):
            if write_pos + 1 < expected_size and read_pos + 1 < actual_size:
                mono_data[write_pos] = audio_data[read_pos]
                mono_data[write_pos + 1] = audio_data[read_pos + 1]
                write_pos += 2

        # Free original buffer explicitly
        audio_data = None
        return mono_data

    # If we got 2x the expected data (16-bit stereo), extract mono
    elif actual_size == expected_size * 2:
        print("Extracting mono from stereo...")
        mono_data = bytearray(expected_size)
        write_pos = 0
        for read_pos in range(0, actual_size, 4):
            if write_pos + 1 < expected_size and read_pos + 1 < actual_size:
                mono_data[write_pos] = audio_data[read_pos]
                mono_data[write_pos + 1] = audio_data[read_pos + 1]
                write_pos += 2

        # Free original
        audio_data = None
        return mono_data

    # Otherwise assume it's already in the correct format
    print("Audio data appears to be in the correct format")
    return audio_data
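The 4x branch above keeps the first 16-bit word of every 4-byte frame. That slicing logic can be verified on the host with a condensed copy (`extract_mono` is our name for this test helper, not a function in the repo):

```python
def extract_mono(audio_data, expected_size):
    """Keep the first 2 bytes of every 4-byte frame (mirrors the 4x branch)."""
    mono = bytearray(expected_size)
    write_pos = 0
    for read_pos in range(0, len(audio_data), 4):
        if write_pos + 1 < expected_size and read_pos + 1 < len(audio_data):
            mono[write_pos] = audio_data[read_pos]
            mono[write_pos + 1] = audio_data[read_pos + 1]
            write_pos += 2
    return mono

# Two 4-byte frames; only the leading 16-bit word of each should survive
raw = bytes([1, 2, 0, 0, 3, 4, 0, 0])
print(bytes(extract_mono(raw, 4)))  # -> b'\x01\x02\x03\x04'
```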
def record_audio(i2s_dev, seconds):
    """Record audio for the specified number of seconds using chunked recording"""
    # Clear screen and show big recording indicator
    lcd.clear(COLOR_BLACK)

    # Show large "RECORDING" text
    display_msg("*** RECORDING ***", COLOR_RED, 60)
    display_msg("Speak now!", COLOR_YELLOW, 100)
    display_msg("(listening...)", COLOR_WHITE, 130)

    chunk_size = 2048
    channels = 1

    # Calculate number of chunks needed
    frame_cnt = seconds * SAMPLE_RATE // chunk_size
    print("Recording " + str(frame_cnt) + " frames...")

    # Recording loop with wait
    all_chunks = []
    for i in range(frame_cnt):
        # Start recording this chunk
        audio_chunk = i2s_dev.record(chunk_size * channels)

        # CRITICAL: Wait for recording to complete
        i2s_dev.wait_record()

        # Convert to bytes and store
        chunk_bytes = audio_chunk.to_bytes()
        all_chunks.append(chunk_bytes)

    # Combine all chunks
    print("Combining " + str(len(all_chunks)) + " chunks...")
    audio_data = bytearray()
    for chunk in all_chunks:
        audio_data.extend(chunk)

    print("Recorded " + str(len(audio_data)) + " bytes")

    # Convert to mono 16-bit if needed
    audio_data = convert_to_mono_16bit(audio_data)
    print("Final size: " + str(len(audio_data)) + " bytes")

    return audio_data
def compress_ulaw(data):
    """Compress 16-bit PCM to 8-bit μ-law (50% size reduction)"""
    # Standard G.711 μ-law encoding constants
    BIAS = 0x84
    CLIP = 32635

    compressed = bytearray()

    # Process 16-bit samples (2 bytes each)
    for i in range(0, len(data), 2):
        # Get 16-bit sample (little endian)
        sample = struct.unpack('<h', data[i:i+2])[0]

        # Get sign and magnitude
        sign = 0x80 if sample < 0 else 0x00
        if sample < 0:
            sample = -sample
        if sample > CLIP:
            sample = CLIP

        # Add bias
        sample = sample + BIAS

        # Find exponent (position of highest bit)
        exponent = 7
        for exp in range(7, -1, -1):
            if sample & (1 << (exp + 7)):
                exponent = exp
                break

        # Get mantissa (top 4 bits after the exponent)
        mantissa = (sample >> (exponent + 3)) & 0x0F

        # Combine: sign (1 bit) + exponent (3 bits) + mantissa (4 bits)
        ulaw_byte = sign | (exponent << 4) | mantissa

        # Invert bits (μ-law standard)
        compressed.append(ulaw_byte ^ 0xFF)

    return compressed
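The encoder above only compresses; for host-side debugging of the stream the server receives, a matching expander can be sketched from the same G.711 rules (`decode_ulaw` is our name, not part of this repo):

```python
import struct

def decode_ulaw(compressed):
    """Expand 8-bit μ-law bytes back to little-endian signed 16-bit PCM."""
    BIAS = 0x84  # Same bias the encoder added
    pcm = bytearray()
    for b in compressed:
        b = b ^ 0xFF                        # undo the bit inversion
        sign = b & 0x80
        exponent = (b >> 4) & 0x07
        mantissa = b & 0x0F
        # Rebuild the biased magnitude, then remove the bias
        sample = ((mantissa << 3) + BIAS) << exponent
        sample -= BIAS
        if sign:
            sample = -sample
        pcm.extend(struct.pack('<h', sample))
    return bytes(pcm)

# μ-law 0xFF is +0; 0x80 decodes to the maximum magnitude 32124
print(struct.unpack('<h', decode_ulaw(b'\x80'))[0])  # -> 32124
```

Round-tripping a sample through `compress_ulaw` and this expander should reproduce it to within half of the segment's step size, which is a quick way to validate the device-side encoder.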
def create_wav_header(data_size, sample_rate=16000, is_ulaw=False):
    """Create a WAV file header (PCM, or μ-law with a fact chunk)"""
    header = bytearray()

    # RIFF header (μ-law header is 58 bytes, PCM header is 46)
    header.extend(b'RIFF')
    header.extend(struct.pack('<I', (50 if is_ulaw else 38) + data_size))
    header.extend(b'WAVE')

    # fmt chunk
    header.extend(b'fmt ')
    header.extend(struct.pack('<I', 18))                   # Chunk size (with extension)
    header.extend(struct.pack('<H', 7 if is_ulaw else 1))  # 7 = μ-law, 1 = PCM
    header.extend(struct.pack('<H', 1))                    # Mono
    header.extend(struct.pack('<I', sample_rate))
    header.extend(struct.pack('<I', sample_rate * (1 if is_ulaw else 2)))  # Byte rate
    header.extend(struct.pack('<H', 1 if is_ulaw else 2))  # Block align
    header.extend(struct.pack('<H', 8 if is_ulaw else 16)) # Bits per sample
    header.extend(struct.pack('<H', 0))                    # Extension size

    # fact chunk (required for μ-law)
    if is_ulaw:
        header.extend(b'fact')
        header.extend(struct.pack('<I', 4))
        header.extend(struct.pack('<I', data_size))  # Sample count (1 byte per sample)

    # data chunk
    header.extend(b'data')
    header.extend(struct.pack('<I', data_size))

    return header
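The header layout can be sanity-checked on the host. This sketch rebuilds the μ-law header with the same field order and then reads the fields back at their RIFF offsets (`build_ulaw_header` and `parse_wav_header` are condensed test helpers, not functions in the repo):

```python
import struct

def build_ulaw_header(data_size, sample_rate=16000):
    """Condensed copy of create_wav_header(..., is_ulaw=True) for testing."""
    h = bytearray()
    h += b'RIFF' + struct.pack('<I', 50 + data_size) + b'WAVE'
    # fmt chunk: size 18, format 7 (mu-law), mono, rate, byte rate, align 1, 8-bit, ext 0
    h += b'fmt ' + struct.pack('<IHHIIHHH', 18, 7, 1, sample_rate,
                               sample_rate, 1, 8, 0)
    h += b'fact' + struct.pack('<II', 4, data_size)
    h += b'data' + struct.pack('<I', data_size)
    return h

def parse_wav_header(h):
    """Read back the fields a decoder would check (fixed RIFF offsets)."""
    fmt_code, channels, rate = struct.unpack_from('<HHI', h, 20)
    data_size = struct.unpack_from('<I', h, len(h) - 4)[0]
    return {'format': fmt_code, 'channels': channels,
            'rate': rate, 'data_size': data_size}
```

For one second of 8-bit μ-law at 16 kHz, the header is 58 bytes, the format code is 7, and the data chunk announces 16000 bytes.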
def send_to_server(audio_data):
    """Send audio to the server and get a transcription"""
    lcd.clear(COLOR_BLACK)
    display_msg("Processing...", COLOR_BLUE, 60)
    display_msg("Compressing audio", COLOR_WHITE, 100)
    print("Sending to server...")

    try:
        # Compress audio using μ-law (50% size reduction)
        print("Compressing audio...")
        compressed_data = compress_ulaw(audio_data)
        print("Compressed: " + str(len(audio_data)) + " -> " + str(len(compressed_data)) + " bytes")

        # Update display
        display_msg("Sending to server", COLOR_WHITE, 130)

        # Create WAV file with μ-law format
        wav_header = create_wav_header(len(compressed_data), is_ulaw=True)
        wav_size = len(wav_header) + len(compressed_data)

        # Simple HTTP POST with raw WAV data
        headers = "POST /transcribe HTTP/1.1\r\n"
        headers += "Host: " + SERVER_HOST + "\r\n"
        headers += "Content-Type: audio/wav\r\n"
        headers += "Content-Length: " + str(wav_size) + "\r\n"
        headers += "Connection: close\r\n\r\n"

        # Connect with better socket settings
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(30)

        # Try to set socket options for better stability
        try:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        except Exception:
            pass  # Some MicroPython builds don't support this

        print("Connecting to " + SERVER_HOST + ":" + str(SERVER_PORT))
        s.connect((SERVER_HOST, SERVER_PORT))

        # Send headers
        print("Sending headers...")
        sent = s.send(headers.encode())
        print("Sent " + str(sent) + " bytes of headers")

        # Send WAV header
        print("Sending WAV header...")
        sent = s.send(wav_header)
        print("Sent " + str(sent) + " bytes of WAV header")

        # Send audio data in small chunks with a delay
        print("Sending audio data (" + str(len(compressed_data)) + " bytes)...")
        chunk_size = 512  # Small chunks for stability

        bytes_sent = 0
        for i in range(0, len(compressed_data), chunk_size):
            chunk = compressed_data[i:i+chunk_size]
            try:
                sent = s.send(chunk)
                bytes_sent += sent
                chunk_num = i // chunk_size + 1
                if chunk_num % 10 == 0:  # Progress update every 10 chunks
                    print("Sent " + str(bytes_sent) + "/" + str(len(compressed_data)) + " bytes")
                # Small delay to let the socket buffer drain
                time.sleep_ms(10)
            except Exception as e:
                print("Send error at byte " + str(bytes_sent) + ": " + str(e))
                raise

        print("All data sent! Total: " + str(bytes_sent) + " bytes")

        # Update display for waiting
        lcd.clear(COLOR_BLACK)
        display_msg("Transcribing...", COLOR_CYAN, 60)
        display_msg("Please wait", COLOR_WHITE, 100)

        # Read response
        response = b""
        while True:
            chunk = s.recv(1024)
            if not chunk:
                break
            response += chunk

        s.close()

        # Parse response (MicroPython decode doesn't accept keyword args)
        try:
            response_str = response.decode('utf-8')
        except Exception:
            response_str = str(response)
        print("Response: " + response_str[:200])

        # Extract JSON from response
        if '{"' in response_str:
            json_start = response_str.index('{"')
            json_str = response_str[json_start:]

            # Minimal hand-rolled JSON string extraction (avoids a json module dependency)
            if '"text":' in json_str:
                text_start = json_str.index('"text":') + 7
                text_str = json_str[text_start:]
                # Find the value between quotes
                if '"' in text_str:
                    quote_start = text_str.index('"') + 1
                    quote_end = text_str.index('"', quote_start)
                    transcription = text_str[quote_start:quote_end]
                    return transcription

        return "Error parsing response"

    except Exception as e:
        print("Error: " + str(e))
        return "Error: " + str(e)
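The hand-rolled `"text"` extraction in `send_to_server` can be exercised on the host. The helper below mirrors that logic (`extract_text` is our name; the HTTP body shape is an assumption about what the transcription server returns). Note it does not handle escaped quotes inside the transcription; that is a known limitation of the minimal parser:

```python
def extract_text(response_str):
    """Mirror of the minimal '"text"' extraction used on the device."""
    if '{"' in response_str:
        json_str = response_str[response_str.index('{"'):]
        if '"text":' in json_str:
            text_str = json_str[json_str.index('"text":') + 7:]
            if '"' in text_str:
                start = text_str.index('"') + 1
                end = text_str.index('"', start)
                return text_str[start:end]
    return None

# Assumed response shape: HTTP headers, blank line, then a small JSON object
resp = 'HTTP/1.1 200 OK\r\nContent-Type: application/json\r\n\r\n{"text": "turn on the lights"}'
print(extract_text(resp))  # -> turn on the lights
```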
def display_transcription(text):
    """Display transcription on LCD"""
    lcd.clear(COLOR_BLACK)
    display_msg("TRANSCRIPTION:", COLOR_GREEN, 10)

    # Simple line wrapping every 20 chars (keep the full text for logging,
    # since the loop consumes `text`)
    full_text = text
    y = 40
    while len(text) > 0:
        chunk = text[:20]
        display_msg(chunk, COLOR_WHITE, y)
        text = text[20:]
        y += 20
        if y > 200:
            break

    print("Transcription: " + full_text)
def main():
    """Main program with a loop for multiple recordings"""
    print("=" * 40)
    print("Simple Audio Recording Test")
    print("=" * 40)

    # Initialize
    nic = init_wifi()
    if not nic:
        return

    i2s = init_audio()

    # Setup button (boot button on GPIO 16)
    fm.register(16, fm.fpioa.GPIOHS0, force=True)
    button = GPIO(GPIO.GPIOHS0, GPIO.IN, GPIO.PULL_UP)

    display_msg("Ready!", COLOR_GREEN, 110, clear=True)
    display_msg("Press BOOT button", COLOR_WHITE, 130)
    display_msg("to record", COLOR_WHITE, 150)
    print("Press BOOT button to record, or Ctrl+C to exit")

    recording_count = 0

    # Main loop
    while True:
        # Wait for button press (button is active low)
        if button.value() == 0:
            recording_count += 1
            print("\n--- Recording #" + str(recording_count) + " ---")

            # Debounce - wait for button release
            while button.value() == 0:
                time.sleep_ms(10)

            # Give the user time to prepare (countdown)
            lcd.clear(COLOR_BLACK)
            display_msg("GET READY!", COLOR_YELLOW, 80)
            display_msg("3...", COLOR_WHITE, 120)
            time.sleep(1)
            display_msg("2...", COLOR_WHITE, 140)
            time.sleep(1)
            display_msg("1...", COLOR_WHITE, 160)
            time.sleep(1)

            # Record
            audio_data = record_audio(i2s, RECORD_SECONDS)

            # Send to server
            transcription = send_to_server(audio_data)

            # Display result
            display_transcription(transcription)

            # Wait a bit before showing ready again
            time.sleep(2)

            # Show ready for next recording
            display_msg("Ready!", COLOR_GREEN, 110, clear=True)
            display_msg("Press BOOT button", COLOR_WHITE, 130)
            next_count = recording_count + 1
            display_msg("to record (#" + str(next_count) + ")", COLOR_WHITE, 150)
            print("Ready for next recording. Press BOOT button.")

        time.sleep_ms(50)  # Small delay to reduce CPU usage


# Run main
main()
252
hardware/maixduino/maix_test_simple.py
Normal file
@@ -0,0 +1,252 @@
# Maix Duino - Simple Test Script
# Copy/paste this into MaixPy IDE and click RUN
#
# This script tests:
# 1. LCD display
# 2. WiFi connectivity
# 3. Network connection to the Heimdall server
# 4. I2S audio initialization (without recording yet)

import time
import lcd
from Maix import GPIO, I2S
from fpioa_manager import fm

# Import the correct network module
try:
    import network
    # ESP32_SPI instance (for Maix Duino with ESP32) is created in test_wifi
    nic = None
except Exception as e:
    print("Network module import error: " + str(e))
    nic = None

# ===== CONFIGURATION - EDIT THESE =====
# Load credentials from secrets.py (gitignored)
try:
    from secrets import SECRETS
except ImportError:
    SECRETS = {}

WIFI_SSID = "Tell My WiFi Love Her"               # <<< CHANGE THIS
WIFI_PASSWORD = SECRETS.get("wifi_password", "")  # set in secrets.py
SERVER_URL = "http://10.1.10.71:3006"             # Heimdall voice server
# =======================================

# Colors (as tuples for easy reference)
COLOR_BLACK = (0, 0, 0)
COLOR_WHITE = (255, 255, 255)
COLOR_RED = (255, 0, 0)
COLOR_GREEN = (0, 255, 0)
COLOR_BLUE = (0, 0, 255)
COLOR_YELLOW = (255, 255, 0)
def display_msg(msg, color=COLOR_WHITE, y=50):
    """Display message on LCD"""
    # lcd.draw_string takes colors as single ints: lcd.draw_string(x, y, text, color_int, bg_int)
    # Convert the RGB tuple to one integer: (R << 16) | (G << 8) | B
    color_int = (color[0] << 16) | (color[1] << 8) | color[2]
    bg_int = 0  # Black background
    lcd.draw_string(10, y, msg, color_int, bg_int)
    print(msg)


def test_lcd():
    """Test LCD display"""
    lcd.init()
    lcd.clear(COLOR_BLACK)
    display_msg("MaixDuino Test", COLOR_YELLOW, 10)
    display_msg("Initializing...", COLOR_WHITE, 30)
    time.sleep(1)
    return True
def test_wifi():
    """Test WiFi connection"""
    global nic
    display_msg("Connecting WiFi...", COLOR_BLUE, 50)

    try:
        # Initialize ESP32_SPI network interface
        print("Initializing ESP32_SPI...")

        # Create network interface instance with Maix Duino pins
        # Maix Duino ESP32 default pins:
        # CS=25, RST=8, RDY=9, MOSI=28, MISO=26, SCLK=27
        from network import ESP32_SPI

        # Register pins for ESP32 SPI communication
        fm.register(25, fm.fpioa.GPIOHS10, force=True)  # CS
        fm.register(8, fm.fpioa.GPIOHS11, force=True)   # RST
        fm.register(9, fm.fpioa.GPIOHS12, force=True)   # RDY
        fm.register(28, fm.fpioa.GPIOHS13, force=True)  # MOSI
        fm.register(26, fm.fpioa.GPIOHS14, force=True)  # MISO
        fm.register(27, fm.fpioa.GPIOHS15, force=True)  # SCLK

        nic = ESP32_SPI(
            cs=fm.fpioa.GPIOHS10,
            rst=fm.fpioa.GPIOHS11,
            rdy=fm.fpioa.GPIOHS12,
            mosi=fm.fpioa.GPIOHS13,
            miso=fm.fpioa.GPIOHS14,
            sclk=fm.fpioa.GPIOHS15
        )

        print("Connecting to " + WIFI_SSID + "...")

        # Connect to WiFi (no need to call active() first)
        nic.connect(WIFI_SSID, WIFI_PASSWORD)

        # Wait for connection
        timeout = 20
        while timeout > 0:
            time.sleep(1)
            timeout -= 1

            if nic.isconnected():
                # Successfully connected!
                ip_info = nic.ifconfig()
                ip = ip_info[0] if ip_info else "Unknown"
                display_msg("WiFi OK!", COLOR_GREEN, 70)
                display_msg("IP: " + str(ip), COLOR_WHITE, 90)
                print("Connected! IP: " + str(ip))
                time.sleep(2)
                return True
            else:
                print("Waiting... " + str(timeout) + "s")

        # Timeout reached
        display_msg("WiFi FAILED!", COLOR_RED, 70)
        print("Connection timeout")
        return False

    except Exception as e:
        display_msg("WiFi error!", COLOR_RED, 70)
        print("WiFi error: " + str(e))
        import sys
        sys.print_exception(e)
        return False
def test_server():
    """Test connection to the Heimdall server"""
    display_msg("Testing server...", COLOR_BLUE, 110)

    try:
        # Try a raw socket connection to the server
        import socket

        url = SERVER_URL + "/health"
        print("Trying: " + url)

        # Host and port (hardcoded to match SERVER_URL)
        host = "10.1.10.71"
        port = 3006

        # Create socket
        s = socket.socket()
        s.settimeout(5)

        print("Connecting to " + host + ":" + str(port))
        s.connect((host, port))

        # Send HTTP GET request
        request = "GET /health HTTP/1.1\r\nHost: " + host + "\r\nConnection: close\r\n\r\n"
        s.send(request.encode())

        # Read response
        response = s.recv(1024).decode()
        s.close()

        print("Server response received")

        if "200" in response or "OK" in response:
            display_msg("Server OK!", COLOR_GREEN, 130)
            print("Server is reachable!")
            time.sleep(2)
            return True
        else:
            display_msg("Server responded", COLOR_YELLOW, 130)
            print("Response: " + response[:100])
            return True  # Still counts as success if we got a response

    except Exception as e:
        display_msg("Server FAILED!", COLOR_RED, 130)
        error_msg = str(e)[:30]
        display_msg(error_msg, COLOR_RED, 150)
        print("Server connection failed: " + str(e))
        return False
def test_audio():
    """Test I2S audio initialization"""
    display_msg("Testing audio...", COLOR_BLUE, 170)

    try:
        # Register I2S pins (Maix Duino pinout)
        fm.register(20, fm.fpioa.I2S0_IN_D0, force=True)
        fm.register(19, fm.fpioa.I2S0_WS, force=True)
        fm.register(18, fm.fpioa.I2S0_SCLK, force=True)

        # Initialize I2S
        rx = I2S(I2S.DEVICE_0)
        rx.channel_config(rx.CHANNEL_0, rx.RECEIVER, align_mode=I2S.STANDARD_MODE)
        rx.set_sample_rate(16000)

        display_msg("Audio OK!", COLOR_GREEN, 190)
        print("I2S initialized: " + str(rx))
        time.sleep(2)
        return True
    except Exception as e:
        display_msg("Audio FAILED!", COLOR_RED, 190)
        print("Audio init failed: " + str(e))
        return False
def main():
    """Run all tests"""
    print("=" * 40)
    print("MaixDuino Voice Assistant Test")
    print("=" * 40)

    # Test LCD
    if not test_lcd():
        print("LCD test failed!")
        return

    # Test WiFi
    if not test_wifi():
        print("WiFi test failed!")
        red_int = (255 << 16) | (0 << 8) | 0  # Red color
        lcd.draw_string(10, 210, "STOPPED - Check WiFi", red_int, 0)
        return

    # Test server connection
    server_ok = test_server()

    # Test audio
    audio_ok = test_audio()

    # Summary
    lcd.clear(COLOR_BLACK)
    display_msg("=== TEST RESULTS ===", COLOR_YELLOW, 10)
    display_msg("LCD: OK", COLOR_GREEN, 40)
    display_msg("WiFi: OK", COLOR_GREEN, 60)

    if server_ok:
        display_msg("Server: OK", COLOR_GREEN, 80)
    else:
        display_msg("Server: FAIL", COLOR_RED, 80)

    if audio_ok:
        display_msg("Audio: OK", COLOR_GREEN, 100)
    else:
        display_msg("Audio: FAIL", COLOR_RED, 100)

    if server_ok and audio_ok:
        display_msg("Ready for voice app!", COLOR_GREEN, 140)
    else:
        display_msg("Fix errors first", COLOR_YELLOW, 140)

    print("\nTest complete!")


# Run the test
if __name__ == "__main__":
    main()
465
hardware/maixduino/maix_voice_client.py
Executable file
@@ -0,0 +1,465 @@
# Maix Duino Voice Assistant Client
# Path: maix_voice_client.py (upload to Maix Duino SD card)
#
# Purpose and usage:
# This script runs on the Maix Duino board and handles:
# - Wake word detection using KPU
# - Audio capture from I2S microphone
# - Streaming audio to voice processing server
# - Playing back TTS responses
# - LED feedback for user interaction
#
# Requirements:
# - MaixPy firmware (latest version)
# - I2S microphone connected
# - Speaker or audio output connected
# - WiFi configured (see config below)
#
# Upload to board:
# 1. Copy this file to the SD card as boot.py or main.py
# 2. Update WiFi credentials below
# 3. Update the server URL to your Heimdall IP
# 4. Power cycle the board

import time
import audio
import image
from Maix import GPIO
from fpioa_manager import fm
from machine import I2S
import KPU as kpu
import sensor
import lcd
import gc

# ----- Configuration -----

# WiFi Settings
WIFI_SSID = "YourSSID"
WIFI_PASSWORD = "YourPassword"

# Server Settings
VOICE_SERVER_URL = "http://10.1.10.71:5000"
PROCESS_ENDPOINT = "/process"

# Audio Settings
SAMPLE_RATE = 16000  # 16kHz for Whisper
CHANNELS = 1         # Mono
SAMPLE_WIDTH = 2     # 16-bit
CHUNK_SIZE = 1024

# Wake Word Settings
WAKE_WORD_THRESHOLD = 0.7  # Confidence threshold (0.0-1.0)
WAKE_WORD_MODEL = "/sd/models/wake_word.kmodel"  # Path to wake word model

# LED Pin for feedback
LED_PIN = 13  # Onboard LED (adjust if needed)

# Recording Settings
MAX_RECORD_TIME = 10     # Maximum seconds to record after wake word
SILENCE_THRESHOLD = 500  # Amplitude threshold for silence detection
SILENCE_DURATION = 2     # Seconds of silence before stopping recording

# ----- Color definitions for LCD -----
COLOR_RED = (255, 0, 0)
COLOR_GREEN = (0, 255, 0)
COLOR_BLUE = (0, 0, 255)
COLOR_YELLOW = (255, 255, 0)
COLOR_BLACK = (0, 0, 0)
COLOR_WHITE = (255, 255, 255)

# ----- Global Variables -----
led = None
i2s_dev = None
kpu_task = None
listening = False
def init_hardware():
    """Initialize hardware components"""
    global led, i2s_dev

    # Initialize LED
    fm.register(LED_PIN, fm.fpioa.GPIO0)
    led = GPIO(GPIO.GPIO0, GPIO.OUT)
    led.value(0)  # Turn off initially

    # Initialize LCD
    lcd.init()
    lcd.clear(COLOR_BLACK)
    lcd.draw_string(lcd.width()//2 - 50, lcd.height()//2,
                    "Initializing...",
                    lcd.WHITE, lcd.BLACK)

    # Initialize I2S for audio (microphone)
    # Note: Pin configuration may vary based on your specific hardware
    fm.register(20, fm.fpioa.I2S0_IN_D0)
    fm.register(19, fm.fpioa.I2S0_WS)
    fm.register(18, fm.fpioa.I2S0_SCLK)

    i2s_dev = I2S(I2S.DEVICE_0)
    i2s_dev.channel_config(I2S.CHANNEL_0, I2S.RECEIVER,
                           align_mode=I2S.STANDARD_MODE,
                           data_width=I2S.RESOLUTION_16_BIT)
    i2s_dev.set_sample_rate(SAMPLE_RATE)

    print("Hardware initialized")
def init_network():
    """Initialize WiFi connection"""
    import network

    lcd.clear(COLOR_BLACK)
    lcd.draw_string(10, 50, "Connecting to WiFi...", COLOR_WHITE, COLOR_BLACK)

    wlan = network.WLAN(network.STA_IF)
    wlan.active(True)

    if not wlan.isconnected():
        print(f"Connecting to {WIFI_SSID}...")
        wlan.connect(WIFI_SSID, WIFI_PASSWORD)

        # Wait for connection
        timeout = 20
        while not wlan.isconnected() and timeout > 0:
            time.sleep(1)
            timeout -= 1
            print(f"Waiting for connection... {timeout}s")

    if not wlan.isconnected():
        print("Failed to connect to WiFi")
        lcd.clear(COLOR_BLACK)
        lcd.draw_string(10, 50, "WiFi Failed!", COLOR_RED, COLOR_BLACK)
        return False

    print("Network connected:", wlan.ifconfig())
    lcd.clear(COLOR_BLACK)
    lcd.draw_string(10, 50, "WiFi Connected", COLOR_GREEN, COLOR_BLACK)
    lcd.draw_string(10, 70, f"IP: {wlan.ifconfig()[0]}", COLOR_WHITE, COLOR_BLACK)
    time.sleep(2)

    return True
def load_wake_word_model():
    """Load wake word detection model"""
    global kpu_task

    try:
        # This is a placeholder - you'll need to train and convert a wake word model
        # For now, we skip KPU wake word detection and use a simpler approach
        print("Wake word model loading skipped (implement after model training)")
        return True
    except Exception as e:
        print(f"Failed to load wake word model: {e}")
        return False


def detect_wake_word():
    """
    Detect wake word in the audio stream

    Returns:
        True if wake word detected, False otherwise

    Note: This is a simplified version. For production, you should:
    1. Train a wake word model using Mycroft Precise or similar
    2. Convert the model to .kmodel format for the K210
    3. Load and run inference using the KPU

    For now, we use a simple amplitude-based trigger.
    """
    # Simple amplitude-based detection (placeholder)
    # Replace with actual KPU inference

    audio_data = i2s_dev.record(CHUNK_SIZE)

    if audio_data:
        # Calculate mean absolute amplitude
        amplitude = 0
        for i in range(0, len(audio_data), 2):
            sample = int.from_bytes(audio_data[i:i+2], 'little')
            if sample >= 0x8000:  # Interpret as signed 16-bit
                sample -= 0x10000
            amplitude += abs(sample)

        amplitude = amplitude / (len(audio_data) // 2)

        # Simple threshold detection (replace with KPU inference)
        if amplitude > 3000:  # Adjust threshold based on your microphone
            return True

    return False
def record_audio(max_duration=MAX_RECORD_TIME):
|
||||||
|
"""
|
||||||
|
Record audio until silence or max duration
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
bytes: Recorded audio data in WAV format
|
||||||
|
"""
|
||||||
|
print(f"Recording audio (max {max_duration}s)...")
|
||||||
|
|
||||||
|
audio_buffer = bytearray()
|
||||||
|
start_time = time.time()
|
||||||
|
silence_start = None
|
||||||
|
|
||||||
|
# Record in chunks
|
||||||
|
while True:
|
||||||
|
elapsed = time.time() - start_time
|
||||||
|
|
||||||
|
# Check max duration
|
||||||
|
if elapsed > max_duration:
|
||||||
|
print("Max recording duration reached")
|
||||||
|
break
|
||||||
|
|
||||||
|
# Record chunk
|
||||||
|
chunk = i2s_dev.record(CHUNK_SIZE)
|
||||||
|
|
||||||
|
if chunk:
|
||||||
|
audio_buffer.extend(chunk)
|
||||||
|
|
||||||
|
# Calculate amplitude for silence detection
|
||||||
|
amplitude = 0
|
||||||
|
for i in range(0, len(chunk), 2):
|
||||||
|
sample = int.from_bytes(chunk[i:i+2], 'little', True)
|
||||||
|
amplitude += abs(sample)
|
||||||
|
|
||||||
|
amplitude = amplitude / (len(chunk) // 2)
|
||||||
|
|
||||||
|
# Silence detection
|
||||||
|
if amplitude < SILENCE_THRESHOLD:
|
||||||
|
if silence_start is None:
|
||||||
|
silence_start = time.time()
|
||||||
|
elif time.time() - silence_start > SILENCE_DURATION:
|
||||||
|
print("Silence detected, stopping recording")
|
||||||
|
break
|
||||||
|
else:
|
||||||
|
silence_start = None
|
||||||
|
|
||||||
|
# Update LCD with recording time
|
||||||
|
if int(elapsed) % 1 == 0:
|
||||||
|
lcd.clear(COLOR_BLACK)
|
||||||
|
lcd.draw_string(10, 50, f"Recording... {int(elapsed)}s",
|
||||||
|
COLOR_RED, COLOR_BLACK)
|
||||||
|
|
||||||
|
print(f"Recorded {len(audio_buffer)} bytes")
|
||||||
|
|
||||||
|
# Convert to WAV format
|
||||||
|
return create_wav(audio_buffer)
|
||||||
|
|
||||||
|
|
||||||
|
def create_wav(audio_data):
|
||||||
|
"""Create WAV file header and combine with audio data"""
|
||||||
|
import struct
|
||||||
|
|
||||||
|
# WAV header
|
||||||
|
sample_rate = SAMPLE_RATE
|
||||||
|
channels = CHANNELS
|
||||||
|
sample_width = SAMPLE_WIDTH
|
||||||
|
data_size = len(audio_data)
|
||||||
|
|
||||||
|
# RIFF header
|
||||||
|
wav = bytearray(b'RIFF')
|
||||||
|
wav.extend(struct.pack('<I', 36 + data_size)) # File size - 8
|
||||||
|
wav.extend(b'WAVE')
|
||||||
|
|
||||||
|
# fmt chunk
|
||||||
|
wav.extend(b'fmt ')
|
||||||
|
wav.extend(struct.pack('<I', 16)) # fmt chunk size
|
||||||
|
wav.extend(struct.pack('<H', 1)) # PCM format
|
||||||
|
wav.extend(struct.pack('<H', channels))
|
||||||
|
wav.extend(struct.pack('<I', sample_rate))
|
||||||
|
wav.extend(struct.pack('<I', sample_rate * channels * sample_width))
|
||||||
|
wav.extend(struct.pack('<H', channels * sample_width))
|
||||||
|
wav.extend(struct.pack('<H', sample_width * 8))
|
||||||
|
|
||||||
|
# data chunk
|
||||||
|
wav.extend(b'data')
|
||||||
|
wav.extend(struct.pack('<I', data_size))
|
||||||
|
wav.extend(audio_data)
|
||||||
|
|
||||||
|
return bytes(wav)
|
||||||
|
|
||||||
|
|
||||||
|
def send_audio_to_server(audio_data):
|
||||||
|
"""
|
||||||
|
Send audio to voice processing server and get response
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
dict: Response from server or None on failure
|
||||||
|
"""
|
||||||
|
import urequests
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Prepare multipart form data
|
||||||
|
url = f"{VOICE_SERVER_URL}{PROCESS_ENDPOINT}"
|
||||||
|
|
||||||
|
print(f"Sending audio to {url}...")
|
||||||
|
lcd.clear(COLOR_BLACK)
|
||||||
|
lcd.draw_string(10, 50, "Processing...", COLOR_YELLOW, COLOR_BLACK)
|
||||||
|
|
||||||
|
# Send POST request with audio file
|
||||||
|
# Note: MaixPy's urequests doesn't support multipart, so we need a workaround
|
||||||
|
# For now, send raw audio with appropriate headers
|
||||||
|
headers = {
|
||||||
|
'Content-Type': 'audio/wav',
|
||||||
|
}
|
||||||
|
|
||||||
|
response = urequests.post(url, data=audio_data, headers=headers)
|
||||||
|
|
||||||
|
if response.status_code == 200:
|
||||||
|
result = response.json()
|
||||||
|
response.close()
|
||||||
|
return result
|
||||||
|
else:
|
||||||
|
print(f"Server error: {response.status_code}")
|
||||||
|
response.close()
|
||||||
|
return None
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Error sending audio: {e}")
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
def display_response(response_text):
|
||||||
|
"""Display response on LCD"""
|
||||||
|
lcd.clear(COLOR_BLACK)
|
||||||
|
|
||||||
|
# Word wrap for LCD
|
||||||
|
words = response_text.split()
|
||||||
|
lines = []
|
||||||
|
current_line = ""
|
||||||
|
|
||||||
|
for word in words:
|
||||||
|
test_line = current_line + word + " "
|
||||||
|
if len(test_line) * 8 > lcd.width() - 20: # Rough character width
|
||||||
|
if current_line:
|
||||||
|
lines.append(current_line.strip())
|
||||||
|
current_line = word + " "
|
||||||
|
else:
|
||||||
|
current_line = test_line
|
||||||
|
|
||||||
|
if current_line:
|
||||||
|
lines.append(current_line.strip())
|
||||||
|
|
||||||
|
# Display lines
|
||||||
|
y = 30
|
||||||
|
for line in lines[:5]: # Max 5 lines
|
||||||
|
lcd.draw_string(10, y, line, COLOR_GREEN, COLOR_BLACK)
|
||||||
|
y += 20
|
||||||
|
|
||||||
|
|
||||||
|
def set_led(state):
|
||||||
|
"""Control LED state"""
|
||||||
|
if led:
|
||||||
|
led.value(1 if state else 0)
|
||||||
|
|
||||||
|
|
||||||
|
def main_loop():
|
||||||
|
"""Main voice assistant loop"""
|
||||||
|
global listening
|
||||||
|
|
||||||
|
# Show ready status
|
||||||
|
lcd.clear(COLOR_BLACK)
|
||||||
|
lcd.draw_string(10, lcd.height()//2 - 10, "Say wake word...",
|
||||||
|
COLOR_BLUE, COLOR_BLACK)
|
||||||
|
|
||||||
|
print("Voice assistant ready. Listening for wake word...")
|
||||||
|
|
||||||
|
while True:
|
||||||
|
try:
|
||||||
|
# Listen for wake word
|
||||||
|
if detect_wake_word():
|
||||||
|
print("Wake word detected!")
|
||||||
|
|
||||||
|
# Visual feedback
|
||||||
|
set_led(True)
|
||||||
|
lcd.clear(COLOR_BLACK)
|
||||||
|
lcd.draw_string(10, 50, "Listening...", COLOR_RED, COLOR_BLACK)
|
||||||
|
|
||||||
|
# Small delay to skip the wake word itself
|
||||||
|
time.sleep(0.5)
|
||||||
|
|
||||||
|
# Record command
|
||||||
|
audio_data = record_audio()
|
||||||
|
|
||||||
|
# Send to server
|
||||||
|
response = send_audio_to_server(audio_data)
|
||||||
|
|
||||||
|
if response and response.get('success'):
|
||||||
|
transcription = response.get('transcription', '')
|
||||||
|
response_text = response.get('response', 'No response')
|
||||||
|
|
||||||
|
print(f"You said: {transcription}")
|
||||||
|
print(f"Response: {response_text}")
|
||||||
|
|
||||||
|
# Display response
|
||||||
|
display_response(response_text)
|
||||||
|
|
||||||
|
# TODO: Play TTS audio response
|
||||||
|
|
||||||
|
else:
|
||||||
|
lcd.clear(COLOR_BLACK)
|
||||||
|
lcd.draw_string(10, 50, "Error processing",
|
||||||
|
COLOR_RED, COLOR_BLACK)
|
||||||
|
|
||||||
|
# Turn off LED
|
||||||
|
set_led(False)
|
||||||
|
|
||||||
|
# Pause before listening again
|
||||||
|
time.sleep(2)
|
||||||
|
|
||||||
|
# Reset display
|
||||||
|
lcd.clear(COLOR_BLACK)
|
||||||
|
lcd.draw_string(10, lcd.height()//2 - 10, "Say wake word...",
|
||||||
|
COLOR_BLUE, COLOR_BLACK)
|
||||||
|
|
||||||
|
# Small delay to prevent tight loop
|
||||||
|
time.sleep(0.1)
|
||||||
|
|
||||||
|
# Garbage collection
|
||||||
|
if gc.mem_free() < 100000: # If free memory < 100KB
|
||||||
|
gc.collect()
|
||||||
|
|
||||||
|
except KeyboardInterrupt:
|
||||||
|
print("Exiting...")
|
||||||
|
break
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Error in main loop: {e}")
|
||||||
|
time.sleep(1)
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
"""Main entry point"""
|
||||||
|
print("=" * 40)
|
||||||
|
print("Maix Duino Voice Assistant")
|
||||||
|
print("=" * 40)
|
||||||
|
|
||||||
|
# Initialize hardware
|
||||||
|
init_hardware()
|
||||||
|
|
||||||
|
# Connect to network
|
||||||
|
if not init_network():
|
||||||
|
print("Failed to initialize network. Exiting.")
|
||||||
|
return
|
||||||
|
|
||||||
|
# Load wake word model (optional)
|
||||||
|
load_wake_word_model()
|
||||||
|
|
||||||
|
# Start main loop
|
||||||
|
try:
|
||||||
|
main_loop()
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Fatal error: {e}")
|
||||||
|
finally:
|
||||||
|
# Cleanup
|
||||||
|
set_led(False)
|
||||||
|
lcd.clear(COLOR_BLACK)
|
||||||
|
lcd.draw_string(10, lcd.height()//2, "Stopped",
|
||||||
|
COLOR_RED, COLOR_BLACK)
|
||||||
|
|
||||||
|
|
||||||
|
# Run main program
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
7
hardware/maixduino/secrets.py.example
Normal file
7
hardware/maixduino/secrets.py.example
Normal file
@ -0,0 +1,7 @@
# Copy this file to secrets.py and fill in your values
# secrets.py is gitignored — never commit it
SECRETS = {
    "wifi_ssid": "YourNetworkName",
    "wifi_password": "YourWiFiPassword",
    "voice_server_url": "http://10.1.10.71:5000",  # replace with your Minerva server IP
}
409
scripts/download_pretrained_models.sh
Executable file
409
scripts/download_pretrained_models.sh
Executable file
@ -0,0 +1,409 @@
#!/usr/bin/env bash
#
# Path: download_pretrained_models.sh
#
# Purpose and usage:
#   Downloads and sets up pre-trained Mycroft Precise wake word models
#   - Downloads Hey Mycroft, Hey Jarvis, and other available models
#   - Tests each model with microphone
#   - Configures voice server to use them
#
# Requirements:
#   - Mycroft Precise installed (run setup_precise.sh first)
#   - Internet connection for downloads
#   - Microphone for testing
#
# Usage:
#   ./download_pretrained_models.sh [--test-all] [--model MODEL_NAME]
#
# Author: PRbL Library
# Created: $(date +"%Y-%m-%d")

# ----- PRbL Color and output functions -----
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[0;33m'
BLUE='\033[0;34m'
PURPLE='\033[0;35m'
CYAN='\033[0;36m'
NC='\033[0m' # No Color

print_status() {
    local level="$1"
    shift
    case "$level" in
        "info")    echo -e "${BLUE}[INFO]${NC} $*" >&2 ;;
        "success") echo -e "${GREEN}[SUCCESS]${NC} $*" >&2 ;;
        "warning") echo -e "${YELLOW}[WARNING]${NC} $*" >&2 ;;
        "error")   echo -e "${RED}[ERROR]${NC} $*" >&2 ;;
        "debug")   [[ "$VERBOSE" == "true" ]] && echo -e "${PURPLE}[DEBUG]${NC} $*" >&2 ;;
        *)         echo -e "$*" >&2 ;;
    esac
}

# ----- Configuration -----
MODELS_DIR="$HOME/precise-models/pretrained"
TEST_ALL=false
SPECIFIC_MODEL=""
VERBOSE=false

# Available pre-trained models
declare -A MODELS=(
    ["hey-mycroft"]="https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz"
    ["hey-jarvis"]="https://github.com/MycroftAI/precise-data/raw/models-dev/hey-jarvis.tar.gz"
    ["christopher"]="https://github.com/MycroftAI/precise-data/raw/models-dev/christopher.tar.gz"
    ["hey-ezra"]="https://github.com/MycroftAI/precise-data/raw/models-dev/hey-ezra.tar.gz"
)

# ----- Dependency checking -----
command_exists() {
    command -v "$1" &> /dev/null
}

check_dependencies() {
    local missing=()

    if ! command_exists wget; then
        missing+=("wget")
    fi

    if ! command_exists precise-listen; then
        missing+=("precise-listen (run setup_precise.sh first)")
    fi

    if [[ ${#missing[@]} -gt 0 ]]; then
        print_status error "Missing dependencies: ${missing[*]}"
        return 1
    fi

    return 0
}

# ----- Parse arguments -----
parse_args() {
    while [[ $# -gt 0 ]]; do
        case "$1" in
            --test-all)
                TEST_ALL=true
                shift
                ;;
            --model)
                SPECIFIC_MODEL="$2"
                shift 2
                ;;
            -v|--verbose)
                VERBOSE=true
                shift
                ;;
            -h|--help)
                cat << EOF
Usage: $(basename "$0") [OPTIONS]

Download and test pre-trained Mycroft Precise wake word models

Options:
  --test-all       Download and test all available models
  --model NAME     Download and test specific model
  -v, --verbose    Enable verbose output
  -h, --help       Show this help message

Available models:
  hey-mycroft      Original Mycroft wake word (most data)
  hey-jarvis       Popular alternative
  christopher      Alternative wake word
  hey-ezra         Another option

Examples:
  $(basename "$0") --model hey-mycroft
  $(basename "$0") --test-all

EOF
                exit 0
                ;;
            *)
                print_status error "Unknown option: $1"
                exit 1
                ;;
        esac
    done
}

# ----- Functions -----

create_models_directory() {
    print_status info "Creating models directory: $MODELS_DIR"
    mkdir -p "$MODELS_DIR" || {
        print_status error "Failed to create directory"
        return 1
    }
    return 0
}

download_model() {
    local model_name="$1"
    local model_url="${MODELS[${model_name}]}"

    if [[ -z "$model_url" ]]; then
        print_status error "Unknown model: $model_name"
        return 1
    fi

    # Check if already downloaded
    if [[ -f "$MODELS_DIR/${model_name}.net" ]]; then
        print_status info "Model already exists: $model_name"
        return 0
    fi

    print_status info "Downloading $model_name..."

    local temp_file="/tmp/${model_name}-$$.tar.gz"

    wget -q --show-progress -O "$temp_file" "$model_url" || {
        print_status error "Failed to download $model_name"
        rm -f "$temp_file"
        return 1
    }

    # Extract
    print_status info "Extracting $model_name..."
    tar xzf "$temp_file" -C "$MODELS_DIR" || {
        print_status error "Failed to extract $model_name"
        rm -f "$temp_file"
        return 1
    }

    rm -f "$temp_file"

    # Verify extraction
    if [[ -f "$MODELS_DIR/${model_name}.net" ]]; then
        print_status success "Downloaded: $model_name"
        return 0
    else
        print_status error "Extraction failed for $model_name"
        return 1
    fi
}

test_model() {
    local model_name="$1"
    local model_file="$MODELS_DIR/${model_name}.net"

    if [[ ! -f "$model_file" ]]; then
        print_status error "Model file not found: $model_file"
        return 1
    fi

    print_status info "Testing model: $model_name"
    echo ""
    echo -e "${CYAN}Instructions:${NC}"
    echo "  - Speak the wake word: '$model_name'"
    echo "  - You should see '!' when detected"
    echo "  - Press Ctrl+C to stop testing"
    echo ""
    read -p "Press Enter to start test..."

    # Activate conda environment if needed
    if command_exists conda; then
        eval "$(conda shell.bash hook)"
        conda activate precise 2>/dev/null || true
    fi

    precise-listen "$model_file" || {
        print_status warning "Test interrupted or failed"
        return 1
    }

    return 0
}

create_multi_wake_config() {
    print_status info "Creating multi-wake-word configuration..."

    local config_file="$MODELS_DIR/multi-wake-config.sh"

    cat > "$config_file" << 'EOF'
#!/bin/bash
# Multi-wake-word configuration
# Generated by download_pretrained_models.sh

# Start voice server with multiple wake words
cd ~/voice-assistant

# List of wake word models
MODELS=""

EOF

    # Add each downloaded model to config
    for model_name in "${!MODELS[@]}"; do
        if [[ -f "$MODELS_DIR/${model_name}.net" ]]; then
            echo "# Found: $model_name" >> "$config_file"
            echo "MODELS=\"\${MODELS}${model_name}:$MODELS_DIR/${model_name}.net:0.5,\"" >> "$config_file"
        fi
    done

    cat >> "$config_file" << 'EOF'

# Remove trailing comma
MODELS="${MODELS%,}"

# Activate environment
eval "$(conda shell.bash hook)"
conda activate precise

# Start server
python voice_server.py \
    --enable-precise \
    --precise-models "$MODELS" \
    --ha-token "$HA_TOKEN"

EOF

    chmod +x "$config_file"

    print_status success "Created: $config_file"
    echo ""
    print_status info "To use multiple wake words, run:"
    print_status info "  $config_file"

    return 0
}

list_downloaded_models() {
    print_status info "Downloaded models in $MODELS_DIR:"
    echo ""

    local count=0
    for model_name in "${!MODELS[@]}"; do
        if [[ -f "$MODELS_DIR/${model_name}.net" ]]; then
            local size=$(du -h "$MODELS_DIR/${model_name}.net" | cut -f1)
            echo -e "  ${GREEN}✓${NC} ${model_name}.net (${size})"
            ((count++))
        else
            echo -e "  ${YELLOW}○${NC} ${model_name}.net (not downloaded)"
        fi
    done

    echo ""
    print_status success "Total downloaded: $count"

    return 0
}

compare_models() {
    print_status info "Model comparison:"
    echo ""

    cat << 'EOF'
┌─────────────────┬──────────────┬─────────────┬─────────────────┐
│ Wake Word       │ Popularity   │ Difficulty  │ Recommended For │
├─────────────────┼──────────────┼─────────────┼─────────────────┤
│ Hey Mycroft     │ ★★★★★        │ Easy        │ Default choice  │
│ Hey Jarvis      │ ★★★★☆        │ Easy        │ Pop culture     │
│ Christopher     │ ★★☆☆☆        │ Medium      │ Unique name     │
│ Hey Ezra        │ ★★☆☆☆        │ Medium      │ Alternative     │
└─────────────────┴──────────────┴─────────────┴─────────────────┘

Recommendations:
  - Start with: Hey Mycroft (most training data)
  - For media: Hey Jarvis (Plex/entertainment)
  - For uniqueness: Christopher or Hey Ezra

Multiple wake words:
  - Use different wake words for different contexts
  - Example: "Hey Mycroft" for commands, "Hey Jarvis" for media
  - Server can run 2-3 models simultaneously

EOF
}

# ----- Main -----
main() {
    print_status info "Mycroft Precise Pre-trained Model Downloader"
    echo ""

    # Parse arguments
    parse_args "$@"

    # Check dependencies
    check_dependencies || exit 1

    # Create directory
    create_models_directory || exit 1

    # Show comparison
    if [[ -z "$SPECIFIC_MODEL" && "$TEST_ALL" != "true" ]]; then
        compare_models
        echo ""
        print_status info "Use --model <name> to download a specific model"
        print_status info "Use --test-all to download all models"
        echo ""
        list_downloaded_models
        exit 0
    fi

    # Download models
    if [[ -n "$SPECIFIC_MODEL" ]]; then
        # Download specific model
        download_model "$SPECIFIC_MODEL" || exit 1

        # Offer to test
        echo ""
        read -p "Test this model now? (y/N): " -n 1 -r
        echo
        if [[ $REPLY =~ ^[Yy]$ ]]; then
            test_model "$SPECIFIC_MODEL"
        fi

    elif [[ "$TEST_ALL" == "true" ]]; then
        # Download all models
        for model_name in "${!MODELS[@]}"; do
            download_model "$model_name"
            echo ""
        done

        # Offer to test each
        echo ""
        print_status success "All models downloaded"
        echo ""
        read -p "Test each model? (y/N): " -n 1 -r
        echo
        if [[ $REPLY =~ ^[Yy]$ ]]; then
            for model_name in "${!MODELS[@]}"; do
                if [[ -f "$MODELS_DIR/${model_name}.net" ]]; then
                    echo ""
                    test_model "$model_name"
                fi
            done
        fi
    fi

    # List results
    echo ""
    list_downloaded_models

    # Create multi-wake config if multiple models
    local model_count=$(find "$MODELS_DIR" -name "*.net" | wc -l)
    if [[ $model_count -gt 1 ]]; then
        echo ""
        create_multi_wake_config
    fi

    # Final instructions
    echo ""
    print_status success "Setup complete!"
    echo ""
    print_status info "Next steps:"
    print_status info "1. Test a model: precise-listen $MODELS_DIR/hey-mycroft.net"
    print_status info "2. Use in server: python voice_server.py --enable-precise --precise-model $MODELS_DIR/hey-mycroft.net"
    print_status info "3. Fine-tune: precise-train -e 30 custom.net . --from-checkpoint $MODELS_DIR/hey-mycroft.net"

    if [[ $model_count -gt 1 ]]; then
        echo ""
        print_status info "For multiple wake words:"
        print_status info "  $MODELS_DIR/multi-wake-config.sh"
    fi
}

# Run main
main "$@"
456
scripts/quick_start_hey_mycroft.sh
Executable file
456
scripts/quick_start_hey_mycroft.sh
Executable file
@ -0,0 +1,456 @@
#!/usr/bin/env bash
#
# Path: quick_start_hey_mycroft.sh
#
# Purpose and usage:
#   Zero-training quick start using pre-trained "Hey Mycroft" model
#   Gets you a working voice assistant in 5 minutes!
#
# Requirements:
#   - Heimdall already setup (ran setup_voice_assistant.sh)
#   - Mycroft Precise installed (ran setup_precise.sh)
#
# Usage:
#   ./quick_start_hey_mycroft.sh [--test-only]
#
# Author: PRbL Library

# ----- PRbL Color and output functions -----
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[0;33m'
BLUE='\033[0;34m'
PURPLE='\033[0;35m'
CYAN='\033[0;36m'
NC='\033[0m'

print_status() {
    local level="$1"
    shift
    case "$level" in
        "info")    echo -e "${BLUE}[INFO]${NC} $*" >&2 ;;
        "success") echo -e "${GREEN}[SUCCESS]${NC} $*" >&2 ;;
        "warning") echo -e "${YELLOW}[WARNING]${NC} $*" >&2 ;;
        "error")   echo -e "${RED}[ERROR]${NC} $*" >&2 ;;
        *)         echo -e "$*" >&2 ;;
    esac
}

# ----- Configuration -----
MODELS_DIR="$HOME/precise-models/pretrained"
MODEL_URL="https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz"
MODEL_NAME="hey-mycroft"
TEST_ONLY=false

# ----- Parse arguments -----
parse_args() {
    while [[ $# -gt 0 ]]; do
        case "$1" in
            --test-only)
                TEST_ONLY=true
                shift
                ;;
            -h|--help)
                cat << EOF
Usage: $(basename "$0") [OPTIONS]

Quick start with pre-trained "Hey Mycroft" wake word model.
No training required!

Options:
  --test-only    Just test the model, don't start server
  -h, --help     Show this help

Examples:
  $(basename "$0")              # Download, test, and run server
  $(basename "$0") --test-only  # Just download and test

EOF
                exit 0
                ;;
            *)
                print_status error "Unknown option: $1"
                exit 1
                ;;
        esac
    done
}

# ----- Functions -----

check_prerequisites() {
    print_status info "Checking prerequisites..."

    # Check conda
    if ! command -v conda &> /dev/null; then
        print_status error "conda not found"
        return 1
    fi

    # Check precise environment
    if ! conda env list | grep -q "^precise\s"; then
        print_status error "Precise environment not found"
        print_status info "Run: ./setup_precise.sh first"
        return 1
    fi

    # Check voice-assistant directory
    if [[ ! -d "$HOME/voice-assistant" ]]; then
        print_status error "Voice assistant not setup"
        print_status info "Run: ./setup_voice_assistant.sh first"
        return 1
    fi

    print_status success "Prerequisites OK"
    return 0
}

download_pretrained_model() {
    print_status info "Downloading pre-trained 'Hey Mycroft' model..."

    # Create directory
    mkdir -p "$MODELS_DIR"

    # Check if already downloaded
    if [[ -f "$MODELS_DIR/${MODEL_NAME}.net" ]]; then
        print_status info "Model already downloaded"
        return 0
    fi

    # Download
    cd "$MODELS_DIR" || return 1

    print_status info "Fetching from GitHub..."
    wget -q --show-progress "$MODEL_URL" || {
        print_status error "Failed to download model"
        return 1
    }

    # Extract
    print_status info "Extracting model..."
    tar xzf hey-mycroft.tar.gz || {
        print_status error "Failed to extract model"
        return 1
    }

    # Verify
    if [[ ! -f "${MODEL_NAME}.net" ]]; then
        print_status error "Model file not found after extraction"
        return 1
    fi

    print_status success "Model downloaded: $MODELS_DIR/${MODEL_NAME}.net"
    return 0
}

test_model() {
    print_status info "Testing wake word model..."

    cd "$MODELS_DIR" || return 1

    # Activate conda
    eval "$(conda shell.bash hook)"
    conda activate precise || {
        print_status error "Failed to activate precise environment"
        return 1
    }

    cat << EOF

${CYAN}═══════════════════════════════════════════════════${NC}
${CYAN}  Wake Word Test: "Hey Mycroft"${NC}
${CYAN}═══════════════════════════════════════════════════${NC}

${YELLOW}Instructions:${NC}
  1. Speak "Hey Mycroft" into your microphone
  2. You should see ${GREEN}"!"${NC} when detected
  3. Try other phrases - should ${RED}not${NC} trigger
  4. Press ${RED}Ctrl+C${NC} when done testing

${CYAN}Starting in 3 seconds...${NC}

EOF

    sleep 3

    # Test the model
    precise-listen "${MODEL_NAME}.net" || {
        print_status error "Model test failed"
        return 1
    }

    print_status success "Model test complete!"
    return 0
}

update_config() {
    print_status info "Updating voice assistant configuration..."

    local config_file="$HOME/voice-assistant/config/.env"

    if [[ ! -f "$config_file" ]]; then
        print_status error "Config file not found: $config_file"
        return 1
    fi

    # Update PRECISE_MODEL if exists, otherwise add it
    if grep -q "^PRECISE_MODEL=" "$config_file"; then
        sed -i "s|^PRECISE_MODEL=.*|PRECISE_MODEL=$MODELS_DIR/${MODEL_NAME}.net|" "$config_file"
    else
        echo "PRECISE_MODEL=$MODELS_DIR/${MODEL_NAME}.net" >> "$config_file"
    fi

    # Update sensitivity if not set
    if ! grep -q "^PRECISE_SENSITIVITY=" "$config_file"; then
        echo "PRECISE_SENSITIVITY=0.5" >> "$config_file"
    fi

    print_status success "Configuration updated"
    return 0
}

start_server() {
    print_status info "Starting voice assistant server..."

    cd "$HOME/voice-assistant" || return 1

    # Activate conda
    eval "$(conda shell.bash hook)"
    conda activate precise || {
        print_status error "Failed to activate environment"
        return 1
    }

    cat << EOF

${CYAN}═══════════════════════════════════════════════════${NC}
${GREEN}  Starting Voice Assistant Server${NC}
${CYAN}═══════════════════════════════════════════════════${NC}

${YELLOW}Configuration:${NC}
  Wake word: ${GREEN}Hey Mycroft${NC}
  Model:     ${MODEL_NAME}.net
  Server:    http://0.0.0.0:5000

${YELLOW}What to do next:${NC}
  1. Wait for "Precise listening started" message
  2. Say ${GREEN}"Hey Mycroft"${NC} to test wake word
  3. Say a command like ${GREEN}"turn on the lights"${NC}
  4. Check server logs for activity

${YELLOW}Press Ctrl+C to stop the server${NC}

${CYAN}Starting server...${NC}

EOF

    # Check if HA token is set
    if ! grep -q "^HA_TOKEN=..*" config/.env; then
        print_status warning "Home Assistant token not set!"
        print_status warning "Commands won't execute without it."
        print_status info "Edit config/.env and add your HA token"
        echo
        read -p "Continue anyway? (y/N): " -n 1 -r
        echo
        if [[ ! $REPLY =~ ^[Yy]$ ]]; then
            return 1
        fi
    fi

    # Start server
    python voice_server.py \
        --enable-precise \
        --precise-model "$MODELS_DIR/${MODEL_NAME}.net" \
        --precise-sensitivity 0.5

    return $?
}

create_systemd_service() {
    print_status info "Creating systemd service..."

    local service_file="/etc/systemd/system/voice-assistant.service"

    # Check if we should update existing service
    if [[ -f "$service_file" ]]; then
        print_status warning "Service file already exists"
        read -p "Update with Hey Mycroft configuration? (y/N): " -n 1 -r
        echo
        if [[ ! $REPLY =~ ^[Yy]$ ]]; then
            return 0
        fi
    fi

    # Create service file
    sudo tee "$service_file" > /dev/null << EOF
[Unit]
Description=Voice Assistant with Hey Mycroft Wake Word
After=network.target

[Service]
Type=simple
User=$USER
WorkingDirectory=$HOME/voice-assistant
Environment="PATH=$HOME/miniconda3/envs/precise/bin:/usr/local/bin:/usr/bin:/bin"
EnvironmentFile=$HOME/voice-assistant/config/.env
ExecStart=$HOME/miniconda3/envs/precise/bin/python voice_server.py \\
    --enable-precise \\
    --precise-model $MODELS_DIR/${MODEL_NAME}.net \\
    --precise-sensitivity 0.5
Restart=on-failure
RestartSec=10
StandardOutput=append:$HOME/voice-assistant/logs/voice_assistant.log
StandardError=append:$HOME/voice-assistant/logs/voice_assistant_error.log

[Install]
WantedBy=multi-user.target
EOF

    # Reload systemd
    sudo systemctl daemon-reload

    print_status success "Systemd service created"
|
||||||
|
|
||||||
|
cat << EOF
|
||||||
|
|
||||||
|
${CYAN}To enable and start the service:${NC}
|
||||||
|
sudo systemctl enable voice-assistant
|
||||||
|
sudo systemctl start voice-assistant
|
||||||
|
sudo systemctl status voice-assistant
|
||||||
|
|
||||||
|
${CYAN}To view logs:${NC}
|
||||||
|
journalctl -u voice-assistant -f
|
||||||
|
|
||||||
|
EOF
|
||||||
|
|
||||||
|
read -p "Enable service now? (y/N): " -n 1 -r
|
||||||
|
echo
|
||||||
|
if [[ $REPLY =~ ^[Yy]$ ]]; then
|
||||||
|
sudo systemctl enable voice-assistant
|
||||||
|
sudo systemctl start voice-assistant
|
||||||
|
sleep 2
|
||||||
|
sudo systemctl status voice-assistant
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
print_next_steps() {
|
||||||
|
cat << EOF
|
||||||
|
|
||||||
|
${GREEN}═══════════════════════════════════════════════════${NC}
|
||||||
|
${GREEN} Success! Your voice assistant is ready!${NC}
|
||||||
|
${GREEN}═══════════════════════════════════════════════════${NC}
|
||||||
|
|
||||||
|
${CYAN}What you have:${NC}
|
||||||
|
✓ Pre-trained "Hey Mycroft" wake word
|
||||||
|
✓ Voice assistant server configured
|
||||||
|
✓ Ready to control Home Assistant
|
||||||
|
|
||||||
|
${CYAN}Quick test:${NC}
|
||||||
|
1. Say: ${GREEN}"Hey Mycroft"${NC}
|
||||||
|
2. Say: ${GREEN}"Turn on the living room lights"${NC}
|
||||||
|
3. Check if command executed
|
||||||
|
|
||||||
|
${CYAN}Next steps:${NC}
|
||||||
|
1. ${YELLOW}Configure Home Assistant entities${NC}
|
||||||
|
Edit: ~/voice-assistant/config/.env
|
||||||
|
Add: HA_TOKEN=your_token_here
|
||||||
|
|
||||||
|
2. ${YELLOW}Add more entity mappings${NC}
|
||||||
|
Edit: voice_server.py
|
||||||
|
Update: IntentParser.ENTITY_MAP
|
||||||
|
|
||||||
|
3. ${YELLOW}Fine-tune for your voice (optional)${NC}
|
||||||
|
cd ~/precise-models/hey-mycroft-custom
|
||||||
|
./1-record-wake-word.sh
|
||||||
|
# Record 20-30 samples
|
||||||
|
precise-train -e 30 hey-mycroft-custom.net . \\
|
||||||
|
--from-checkpoint $MODELS_DIR/${MODEL_NAME}.net
|
||||||
|
|
||||||
|
4. ${YELLOW}Setup Maix Duino${NC}
|
||||||
|
See: QUICKSTART.md Phase 2
|
||||||
|
|
||||||
|
${CYAN}Useful commands:${NC}
|
||||||
|
# Test wake word only
|
||||||
|
cd $MODELS_DIR && conda activate precise
|
||||||
|
precise-listen ${MODEL_NAME}.net
|
||||||
|
|
||||||
|
# Check server health
|
||||||
|
curl http://localhost:5000/health
|
||||||
|
|
||||||
|
# Monitor logs
|
||||||
|
journalctl -u voice-assistant -f
|
||||||
|
|
||||||
|
${CYAN}Documentation:${NC}
|
||||||
|
README.md - Project overview
|
||||||
|
WAKE_WORD_ADVANCED.md - Multiple wake words guide
|
||||||
|
QUICKSTART.md - Complete setup guide
|
||||||
|
|
||||||
|
${GREEN}Happy voice assisting! 🎙️${NC}
|
||||||
|
|
||||||
|
EOF
|
||||||
|
}
|
||||||
|
|
||||||
|
# ----- Main -----
|
||||||
|
main() {
|
||||||
|
cat << EOF
|
||||||
|
${CYAN}═══════════════════════════════════════════════════${NC}
|
||||||
|
${CYAN} Quick Start: Hey Mycroft Wake Word${NC}
|
||||||
|
${CYAN}═══════════════════════════════════════════════════${NC}
|
||||||
|
|
||||||
|
${YELLOW}This script will:${NC}
|
||||||
|
1. Download pre-trained "Hey Mycroft" model
|
||||||
|
2. Test wake word detection
|
||||||
|
3. Configure voice assistant server
|
||||||
|
4. Start the server (optional)
|
||||||
|
|
||||||
|
${YELLOW}Total time: ~5 minutes (no training!)${NC}
|
||||||
|
|
||||||
|
EOF
|
||||||
|
|
||||||
|
# Parse arguments
|
||||||
|
parse_args "$@"
|
||||||
|
|
||||||
|
# Check prerequisites
|
||||||
|
check_prerequisites || exit 1
|
||||||
|
|
||||||
|
# Download model
|
||||||
|
download_pretrained_model || exit 1
|
||||||
|
|
||||||
|
# Test model
|
||||||
|
print_status info "Ready to test wake word"
|
||||||
|
read -p "Test now? (Y/n): " -n 1 -r
|
||||||
|
echo
|
||||||
|
if [[ ! $REPLY =~ ^[Nn]$ ]]; then
|
||||||
|
test_model
|
||||||
|
fi
|
||||||
|
|
||||||
|
# If test-only mode, stop here
|
||||||
|
if [[ "$TEST_ONLY" == "true" ]]; then
|
||||||
|
print_status success "Test complete!"
|
||||||
|
print_status info "Model location: $MODELS_DIR/${MODEL_NAME}.net"
|
||||||
|
exit 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Update configuration
|
||||||
|
update_config || exit 1
|
||||||
|
|
||||||
|
# Start server
|
||||||
|
read -p "Start voice assistant server now? (Y/n): " -n 1 -r
|
||||||
|
echo
|
||||||
|
if [[ ! $REPLY =~ ^[Nn]$ ]]; then
|
||||||
|
start_server
|
||||||
|
else
|
||||||
|
# Offer to create systemd service
|
||||||
|
read -p "Create systemd service instead? (y/N): " -n 1 -r
|
||||||
|
echo
|
||||||
|
if [[ $REPLY =~ ^[Yy]$ ]]; then
|
||||||
|
create_systemd_service
|
||||||
|
fi
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Print next steps
|
||||||
|
print_next_steps
|
||||||
|
}
|
||||||
|
|
||||||
|
# Run main
|
||||||
|
main "$@"
|
||||||
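The `update_config` logic above stays idempotent by checking `grep -q "^KEY="` before appending, so reruns never duplicate or clobber settings the user has already edited. A minimal, self-contained sketch of that pattern (the `ensure_setting` helper name and the values are illustrative, not part of the script above):

```shell
#!/usr/bin/env bash
# Sketch of the append-if-missing pattern used by update_config.
set -eu

config_file=$(mktemp)

ensure_setting() {
    local key="$1" value="$2"
    # Only append if the key is not already present in the file
    if ! grep -q "^${key}=" "$config_file"; then
        echo "${key}=${value}" >> "$config_file"
    fi
}

ensure_setting PRECISE_SENSITIVITY 0.5
ensure_setting PRECISE_SENSITIVITY 0.9   # no-op: key already set
ensure_setting PRECISE_MODEL "/tmp/hey-mycroft.net"

cat "$config_file"
```

Because existing keys are never overwritten, running the setup script a second time preserves any manual tuning in `config/.env`.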
scripts/setup_precise.sh (executable file, 630 lines)
@@ -0,0 +1,630 @@
#!/usr/bin/env bash
#
# Path: setup_precise.sh
#
# Purpose and usage:
#   Sets up Mycroft Precise wake word detection on Heimdall
#   - Creates conda environment for Precise
#   - Installs TensorFlow 1.x and dependencies
#   - Downloads precise-engine
#   - Sets up training directories
#   - Provides helper scripts for training
#
# Requirements:
#   - conda/miniconda installed
#   - Internet connection for downloads
#   - Microphone for recording samples
#
# Usage:
#   ./setup_precise.sh [--wake-word "phrase"] [--env-name NAME]
#
# Author: PRbL Library
# Created: $(date +"%Y-%m-%d")

# ----- PRbL Color and output functions -----
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[0;33m'
BLUE='\033[0;34m'
PURPLE='\033[0;35m'
CYAN='\033[0;36m'
NC='\033[0m' # No Color

print_status() {
    local level="$1"
    shift
    case "$level" in
        "info")    echo -e "${BLUE}[INFO]${NC} $*" >&2 ;;
        "success") echo -e "${GREEN}[SUCCESS]${NC} $*" >&2 ;;
        "warning") echo -e "${YELLOW}[WARNING]${NC} $*" >&2 ;;
        "error")   echo -e "${RED}[ERROR]${NC} $*" >&2 ;;
        "debug")   [[ "$VERBOSE" == "true" ]] && echo -e "${PURPLE}[DEBUG]${NC} $*" >&2 ;;
        *)         echo -e "$*" >&2 ;;
    esac
}

# ----- Configuration -----
CONDA_ENV_NAME="precise"
WAKE_WORD="hey computer"
MODELS_DIR="$HOME/precise-models"
VERBOSE=false

# ----- Dependency checking -----
command_exists() {
    command -v "$1" &> /dev/null
}

check_conda() {
    if ! command_exists conda; then
        print_status error "conda not found. Please install miniconda first."
        return 1
    fi
    return 0
}

# ----- Parse arguments -----
parse_args() {
    while [[ $# -gt 0 ]]; do
        case "$1" in
            --wake-word)
                WAKE_WORD="$2"
                shift 2
                ;;
            --env-name)
                CONDA_ENV_NAME="$2"
                shift 2
                ;;
            -v|--verbose)
                VERBOSE=true
                shift
                ;;
            -h|--help)
                cat << EOF
Usage: $(basename "$0") [OPTIONS]

Options:
  --wake-word "phrase"   Wake word to train (default: "hey computer")
  --env-name NAME        Custom conda environment name (default: precise)
  -v, --verbose          Enable verbose output
  -h, --help             Show this help message

Examples:
  $(basename "$0") --wake-word "hey jarvis"
  $(basename "$0") --env-name mycroft-precise

EOF
                exit 0
                ;;
            *)
                print_status error "Unknown option: $1"
                exit 1
                ;;
        esac
    done
}

# ----- Setup functions -----

create_conda_environment() {
    print_status info "Creating conda environment: $CONDA_ENV_NAME"

    # Check if environment already exists
    if conda env list | grep -q "^${CONDA_ENV_NAME}\s"; then
        print_status warning "Environment $CONDA_ENV_NAME already exists"
        read -p "Remove and recreate? (y/N): " -n 1 -r
        echo
        if [[ $REPLY =~ ^[Yy]$ ]]; then
            print_status info "Removing existing environment..."
            conda env remove -n "$CONDA_ENV_NAME" -y
        else
            print_status info "Using existing environment"
            return 0
        fi
    fi

    # Create new environment with Python 3.7 (required for TF 1.15)
    print_status info "Creating Python 3.7 environment..."
    conda create -n "$CONDA_ENV_NAME" python=3.7 -y || {
        print_status error "Failed to create conda environment"
        return 1
    }

    print_status success "Conda environment created"
    return 0
}

install_tensorflow() {
    print_status info "Installing TensorFlow 1.15..."

    # Activate conda environment
    eval "$(conda shell.bash hook)"
    conda activate "$CONDA_ENV_NAME" || {
        print_status error "Failed to activate conda environment"
        return 1
    }

    # Install TensorFlow 1.15 (last 1.x version).
    # No --break-system-packages: conda environments are not
    # externally managed, and the py3.7 pip rejects the flag.
    pip install tensorflow==1.15.5 || {
        print_status error "Failed to install TensorFlow"
        return 1
    }

    # Verify installation
    python -c "import tensorflow as tf; print(f'TensorFlow {tf.__version__} installed')" || {
        print_status error "TensorFlow installation verification failed"
        return 1
    }

    print_status success "TensorFlow 1.15 installed"
    return 0
}

install_precise() {
    print_status info "Installing Mycroft Precise..."

    # Activate conda environment
    eval "$(conda shell.bash hook)"
    conda activate "$CONDA_ENV_NAME" || {
        print_status error "Failed to activate conda environment"
        return 1
    }

    # Install audio dependencies
    print_status info "Installing system audio dependencies..."
    if command_exists apt-get; then
        sudo apt-get update
        sudo apt-get install -y portaudio19-dev sox libatlas-base-dev || {
            print_status warning "Some audio dependencies failed to install"
        }
    fi

    # Install Python audio libraries
    pip install pyaudio || {
        print_status warning "PyAudio installation failed (may need manual installation)"
    }

    # Install Precise
    pip install mycroft-precise || {
        print_status error "Failed to install Mycroft Precise"
        return 1
    }

    # Verify installation
    python -c "import precise_runner; print('Precise installed successfully')" || {
        print_status error "Precise installation verification failed"
        return 1
    }

    print_status success "Mycroft Precise installed"
    return 0
}

download_precise_engine() {
    print_status info "Downloading precise-engine..."

    local engine_version="0.3.0"
    local engine_url="https://github.com/MycroftAI/mycroft-precise/releases/download/v${engine_version}/precise-engine_${engine_version}_x86_64.tar.gz"
    local temp_dir=$(mktemp -d)

    # Download engine
    wget -q --show-progress -O "$temp_dir/precise-engine.tar.gz" "$engine_url" || {
        print_status error "Failed to download precise-engine"
        rm -rf "$temp_dir"
        return 1
    }

    # Extract
    tar xzf "$temp_dir/precise-engine.tar.gz" -C "$temp_dir" || {
        print_status error "Failed to extract precise-engine"
        rm -rf "$temp_dir"
        return 1
    }

    # Install to /usr/local/bin
    sudo cp "$temp_dir/precise-engine/precise-engine" /usr/local/bin/ || {
        print_status error "Failed to install precise-engine"
        rm -rf "$temp_dir"
        return 1
    }

    sudo chmod +x /usr/local/bin/precise-engine

    # Clean up
    rm -rf "$temp_dir"

    # Verify installation
    precise-engine --version || {
        print_status error "precise-engine installation verification failed"
        return 1
    }

    print_status success "precise-engine installed"
    return 0
}

create_training_directory() {
    print_status info "Creating training directory structure..."

    # Sanitize wake word for directory name
    local wake_word_dir=$(echo "$WAKE_WORD" | tr ' ' '-' | tr '[:upper:]' '[:lower:]')
    local project_dir="$MODELS_DIR/$wake_word_dir"

    mkdir -p "$project_dir"/{wake-word,not-wake-word,test/wake-word,test/not-wake-word}

    print_status success "Training directory created: $project_dir"

    # Store project path for later use
    echo "$project_dir" > "$MODELS_DIR/.current_project"

    return 0
}

create_training_scripts() {
    print_status info "Creating training helper scripts..."

    local wake_word_dir=$(echo "$WAKE_WORD" | tr ' ' '-' | tr '[:upper:]' '[:lower:]')
    local project_dir="$MODELS_DIR/$wake_word_dir"

    # Create recording script
    cat > "$project_dir/1-record-wake-word.sh" << 'EOF'
#!/bin/bash
# Step 1: Record wake word samples
# Run this script and follow the prompts to record ~50-100 samples

eval "$(conda shell.bash hook)"
conda activate precise

echo "Recording wake word samples..."
echo "Press SPACE to start/stop recording"
echo "Press Ctrl+C when done (aim for 50-100 samples)"
echo ""

precise-collect
EOF

    # Create not-wake-word recording script
    cat > "$project_dir/2-record-not-wake-word.sh" << 'EOF'
#!/bin/bash
# Step 2: Record "not wake word" samples
# Record random speech, TV, music, similar-sounding phrases

eval "$(conda shell.bash hook)"
conda activate precise

echo "Recording not-wake-word samples..."
echo "Record:"
echo "  - Normal conversation"
echo "  - TV/music background"
echo "  - Similar sounding phrases"
echo "  - Ambient noise"
echo ""
echo "Press SPACE to start/stop recording"
echo "Press Ctrl+C when done (aim for 200-500 samples)"
echo ""

precise-collect -f not-wake-word/samples.wav
EOF

    # Create training script
    cat > "$project_dir/3-train-model.sh" << EOF
#!/bin/bash
# Step 3: Train the model
# This will train for 60 epochs (adjust the -e parameter for more/less)

eval "\$(conda shell.bash hook)"
conda activate precise

echo "Training wake word model..."
echo "This will take 30-60 minutes..."
echo ""

# Train model
precise-train -e 60 ${wake_word_dir}.net .

echo ""
echo "Training complete!"
echo "Test with: precise-listen ${wake_word_dir}.net"
EOF

    # Create testing script
    cat > "$project_dir/4-test-model.sh" << EOF
#!/bin/bash
# Step 4: Test the model with a live microphone

eval "\$(conda shell.bash hook)"
conda activate precise

echo "Testing wake word model..."
echo "Speak your wake word - you should see '!' when detected"
echo "Speak other phrases - they should not trigger"
echo ""
echo "Press Ctrl+C to exit"
echo ""

precise-listen ${wake_word_dir}.net
EOF

    # Create evaluation script
    cat > "$project_dir/5-evaluate-model.sh" << EOF
#!/bin/bash
# Step 5: Evaluate the model on the test set

eval "\$(conda shell.bash hook)"
conda activate precise

echo "Evaluating wake word model on test set..."
echo ""

precise-test ${wake_word_dir}.net test/

echo ""
echo "Check metrics above:"
echo "  - Wake word accuracy should be >95%"
echo "  - False positive rate should be <5%"
EOF

    # Create tuning script
    cat > "$project_dir/6-tune-threshold.sh" << EOF
#!/bin/bash
# Step 6: Tune the activation threshold

eval "\$(conda shell.bash hook)"
conda activate precise

echo "Testing different thresholds..."
echo ""
echo "Default threshold: 0.5"
echo "Higher = fewer false positives, may miss some wake words"
echo "Lower  = catch more wake words, more false positives"
echo ""

for threshold in 0.3 0.5 0.7; do
    echo "Testing threshold: \$threshold"
    echo "Press Ctrl+C to try next threshold"
    precise-listen ${wake_word_dir}.net -t \$threshold
done
EOF

    # Make all scripts executable
    chmod +x "$project_dir"/*.sh

    print_status success "Training scripts created in $project_dir"
    return 0
}

create_readme() {
    print_status info "Creating README..."

    local wake_word_dir=$(echo "$WAKE_WORD" | tr ' ' '-' | tr '[:upper:]' '[:lower:]')
    local project_dir="$MODELS_DIR/$wake_word_dir"

    cat > "$project_dir/README.md" << EOF
# Wake Word Training: "$WAKE_WORD"

## Quick Start

Follow these steps in order:

### 1. Record Wake Word Samples
\`\`\`bash
./1-record-wake-word.sh
\`\`\`

Record 50-100 samples:
- Vary your tone and speed
- Different distances from microphone
- Different background noise levels
- Have family members record too

### 2. Record Not-Wake-Word Samples
\`\`\`bash
./2-record-not-wake-word.sh
\`\`\`

Record 200-500 samples of:
- Normal conversation
- TV/music in background
- Similar sounding phrases
- Ambient household noise

### 3. Organize Samples

Split each sample set roughly 80/20 between training and test.
Use a glob that selects ~80% of the files first - the same
pattern cannot be reused, or the first \`mv\` takes everything:
\`\`\`bash
# ~80% of wake-word samples go to training:
mv wake-word-samples-[0-7]* wake-word/

# The remaining ~20% go to test:
mv wake-word-samples-* test/wake-word/

# ~80% of not-wake-word samples go to training:
mv not-wake-word-samples-[0-7]* not-wake-word/

# The remaining ~20% go to test:
mv not-wake-word-samples-* test/not-wake-word/
\`\`\`

### 4. Train Model
\`\`\`bash
./3-train-model.sh
\`\`\`

Wait 30-60 minutes for training to complete.

### 5. Test Model
\`\`\`bash
./4-test-model.sh
\`\`\`

Speak your wake word and verify detection.

### 6. Evaluate Model
\`\`\`bash
./5-evaluate-model.sh
\`\`\`

Check accuracy metrics on the test set.

### 7. Tune Threshold
\`\`\`bash
./6-tune-threshold.sh
\`\`\`

Find the best threshold for your environment.

## Tips for Good Training

1. **Quality over quantity** - Clear samples are better than many poor ones
2. **Diverse conditions** - Different noise levels, distances, speakers
3. **Hard negatives** - Include similar-sounding phrases in the not-wake-word set
4. **Regular updates** - Add false positives/negatives and retrain

## Next Steps

Once trained and tested:

1. Copy the model to the voice assistant server:
\`\`\`bash
cp ${wake_word_dir}.net ~/voice-assistant/models/
\`\`\`

2. Update the voice assistant config:
\`\`\`bash
vim ~/voice-assistant/config/.env
# Set: PRECISE_MODEL=~/voice-assistant/models/${wake_word_dir}.net
\`\`\`

3. Restart the voice assistant service:
\`\`\`bash
sudo systemctl restart voice-assistant
\`\`\`

## Troubleshooting

**Low accuracy?**
- Collect more training samples
- Increase training epochs (edit 3-train-model.sh, change -e 60 to -e 120)
- Verify the 80/20 train/test split

**Too many false positives?**
- Increase the threshold (use 6-tune-threshold.sh)
- Add false-trigger audio to the not-wake-word set
- Retrain with more diverse negative samples

**Misses wake words?**
- Lower the threshold
- Add missed samples to the training set
- Ensure good audio quality

## Resources

- Mycroft Precise Docs: https://github.com/MycroftAI/mycroft-precise
- Training Guide: https://mycroft-ai.gitbook.io/docs/mycroft-technologies/precise
- Community Models: https://github.com/MycroftAI/precise-data
EOF

    print_status success "README created in $project_dir"
    return 0
}

download_pretrained_models() {
    print_status info "Downloading pre-trained models..."

    # Create models directory
    mkdir -p "$MODELS_DIR/pretrained"

    # Download Hey Mycroft model (as example/base)
    local model_url="https://github.com/MycroftAI/precise-data/raw/models-dev/hey-mycroft.tar.gz"

    if [[ ! -f "$MODELS_DIR/pretrained/hey-mycroft.net" ]]; then
        print_status info "Downloading Hey Mycroft model..."
        wget -q --show-progress -O "$MODELS_DIR/pretrained/hey-mycroft.tar.gz" "$model_url" || {
            print_status warning "Failed to download pre-trained model (optional)"
            return 0
        }

        tar xzf "$MODELS_DIR/pretrained/hey-mycroft.tar.gz" -C "$MODELS_DIR/pretrained/" || {
            print_status warning "Failed to extract pre-trained model"
            return 0
        }

        print_status success "Pre-trained model downloaded"
    else
        print_status info "Pre-trained model already exists"
    fi

    return 0
}

print_next_steps() {
    local wake_word_dir=$(echo "$WAKE_WORD" | tr ' ' '-' | tr '[:upper:]' '[:lower:]')
    local project_dir="$MODELS_DIR/$wake_word_dir"

    cat << EOF

${GREEN}Setup complete!${NC}

Wake word: "$WAKE_WORD"
Project directory: $project_dir

${BLUE}Next steps:${NC}

1. ${CYAN}Activate conda environment:${NC}
   conda activate $CONDA_ENV_NAME

2. ${CYAN}Navigate to project directory:${NC}
   cd $project_dir

3. ${CYAN}Follow the README or run scripts in order:${NC}
   ./1-record-wake-word.sh      # Record wake word samples
   ./2-record-not-wake-word.sh  # Record negative samples
   # Organize samples into train/test directories
   ./3-train-model.sh           # Train the model (30-60 min)
   ./4-test-model.sh            # Test with microphone
   ./5-evaluate-model.sh        # Check accuracy metrics
   ./6-tune-threshold.sh        # Find best threshold

${BLUE}Helpful commands:${NC}

Test pre-trained model:
  conda activate $CONDA_ENV_NAME
  precise-listen $MODELS_DIR/pretrained/hey-mycroft.net

Check precise-engine:
  precise-engine --version

${BLUE}Resources:${NC}

  Full guide:     See MYCROFT_PRECISE_GUIDE.md
  Project README: $project_dir/README.md
  Mycroft Docs:   https://github.com/MycroftAI/mycroft-precise

EOF
}

# ----- Main -----
main() {
    print_status info "Starting Mycroft Precise setup..."

    # Parse arguments
    parse_args "$@"

    # Check dependencies
    check_conda || exit 1

    # Setup steps
    create_conda_environment || exit 1
    install_tensorflow || exit 1
    install_precise || exit 1
    download_precise_engine || exit 1
    create_training_directory || exit 1
    create_training_scripts || exit 1
    create_readme || exit 1
    download_pretrained_models || exit 1

    # Print next steps
    print_next_steps
}

# Run main
main "$@"
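The generated README tells the user to split recordings roughly 80/20 between the training and test directories but leaves the exact `mv` commands to the reader. A self-contained sketch of one way to do that split mechanically; the `sample-NN.wav` names are placeholders for whatever `precise-collect` actually produced:

```shell
#!/usr/bin/env bash
# Sketch: split recorded samples ~80/20 into train and test sets.
set -eu

work_dir=$(mktemp -d)
cd "$work_dir"
mkdir -p wake-word test/wake-word

# Stand-in for ten recorded samples
for i in $(seq -w 1 10); do
    touch "sample-${i}.wav"
done

total=$(ls sample-*.wav | wc -l)
train_count=$(( total * 80 / 100 ))

# Shuffle so the test set is not biased toward the last recordings,
# then move the first ~80% into the training directory
ls sample-*.wav | shuf | head -n "$train_count" \
    | xargs -I{} mv {} wake-word/

# Whatever remains becomes the test set
mv sample-*.wav test/wake-word/

echo "train: $(ls wake-word | wc -l)  test: $(ls test/wake-word | wc -l)"
```

Each file is moved exactly once, so the same glob can safely be used for the second `mv` - by then it only matches the leftover 20%.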
scripts/setup_voice_assistant.sh (executable file, 429 lines)
@@ -0,0 +1,429 @@
#!/usr/bin/env bash
#
# Path: setup_voice_assistant.sh
#
# Purpose and usage:
#   Sets up the voice assistant server environment on Heimdall
#   - Creates conda environment
#   - Installs dependencies (Whisper, Flask, Piper TTS)
#   - Downloads and configures TTS models
#   - Sets up systemd service (optional)
#   - Configures environment variables
#
# Requirements:
#   - conda/miniconda installed
#   - Internet connection for downloads
#   - Sudo access (for systemd service setup)
#
# Usage:
#   ./setup_voice_assistant.sh [--no-service] [--env-name NAME]
#
# Author: PRbL Library
# Created: $(date +"%Y-%m-%d")

# ----- PRbL Color and output functions -----
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[0;33m'
BLUE='\033[0;34m'
PURPLE='\033[0;35m'
CYAN='\033[0;36m'
NC='\033[0m' # No Color

print_status() {
    local level="$1"
    shift
    case "$level" in
        "info")    echo -e "${BLUE}[INFO]${NC} $*" >&2 ;;
        "success") echo -e "${GREEN}[SUCCESS]${NC} $*" >&2 ;;
        "warning") echo -e "${YELLOW}[WARNING]${NC} $*" >&2 ;;
        "error")   echo -e "${RED}[ERROR]${NC} $*" >&2 ;;
        "debug")   [[ "$VERBOSE" == "true" ]] && echo -e "${PURPLE}[DEBUG]${NC} $*" >&2 ;;
        *)         echo -e "$*" >&2 ;;
    esac
}

# ----- Configuration -----
CONDA_ENV_NAME="voice-assistant"
PROJECT_DIR="$HOME/voice-assistant"
INSTALL_SYSTEMD=true
VERBOSE=false

# ----- Dependency checking -----
command_exists() {
    command -v "$1" &> /dev/null
}

check_conda() {
    if ! command_exists conda; then
        print_status error "conda not found. Please install miniconda first."
        print_status info "Install with: wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh"
        print_status info "              bash Miniconda3-latest-Linux-x86_64.sh"
        return 1
    fi
    return 0
}

# ----- Parse arguments -----
parse_args() {
    while [[ $# -gt 0 ]]; do
        case "$1" in
            --no-service)
                INSTALL_SYSTEMD=false
                shift
                ;;
            --env-name)
                CONDA_ENV_NAME="$2"
                shift 2
                ;;
            -v|--verbose)
                VERBOSE=true
                shift
                ;;
            -h|--help)
                cat << EOF
Usage: $(basename "$0") [OPTIONS]

Options:
  --no-service     Don't install systemd service
  --env-name NAME  Custom conda environment name (default: voice-assistant)
  -v, --verbose    Enable verbose output
  -h, --help       Show this help message

EOF
                exit 0
                ;;
            *)
                print_status error "Unknown option: $1"
                exit 1
                ;;
        esac
    done
}

# ----- Setup functions -----

create_project_directory() {
    print_status info "Creating project directory: $PROJECT_DIR"

    if [[ ! -d "$PROJECT_DIR" ]]; then
        mkdir -p "$PROJECT_DIR" || {
            print_status error "Failed to create project directory"
            return 1
        }
    fi

    # Create subdirectories
    mkdir -p "$PROJECT_DIR"/{logs,models,config}

    print_status success "Project directory created"
    return 0
}

create_conda_environment() {
    print_status info "Creating conda environment: $CONDA_ENV_NAME"

    # Check if environment already exists
    if conda env list | grep -q "^${CONDA_ENV_NAME}\s"; then
        print_status warning "Environment $CONDA_ENV_NAME already exists"
        read -p "Remove and recreate? (y/N): " -n 1 -r
        echo
        if [[ $REPLY =~ ^[Yy]$ ]]; then
            print_status info "Removing existing environment..."
            conda env remove -n "$CONDA_ENV_NAME" -y
        else
            print_status info "Using existing environment"
            return 0
        fi
    fi

    # Create new environment
    print_status info "Creating Python 3.10 environment..."
    conda create -n "$CONDA_ENV_NAME" python=3.10 -y || {
        print_status error "Failed to create conda environment"
        return 1
    }

    print_status success "Conda environment created"
    return 0
}

install_python_dependencies() {
    print_status info "Installing Python dependencies..."
|
# Activate conda environment
|
||||||
|
eval "$(conda shell.bash hook)"
|
||||||
|
conda activate "$CONDA_ENV_NAME" || {
|
||||||
|
print_status error "Failed to activate conda environment"
|
||||||
|
return 1
|
||||||
|
}
|
||||||
|
|
||||||
|
# Install base dependencies
|
||||||
|
print_status info "Installing base packages..."
|
||||||
|
pip install --upgrade pip --break-system-packages || true
|
||||||
|
|
||||||
|
# Install Whisper (OpenAI)
|
||||||
|
print_status info "Installing OpenAI Whisper..."
|
||||||
|
pip install -U openai-whisper --break-system-packages || {
|
||||||
|
print_status error "Failed to install Whisper"
|
||||||
|
return 1
|
||||||
|
}
|
||||||
|
|
||||||
|
# Install Flask
|
||||||
|
print_status info "Installing Flask..."
|
||||||
|
pip install flask --break-system-packages || {
|
||||||
|
print_status error "Failed to install Flask"
|
||||||
|
return 1
|
||||||
|
}
|
||||||
|
|
||||||
|
# Install requests
|
||||||
|
print_status info "Installing requests..."
|
||||||
|
pip install requests --break-system-packages || {
|
||||||
|
print_status error "Failed to install requests"
|
||||||
|
return 1
|
||||||
|
}
|
||||||
|
|
||||||
|
# Install python-dotenv
|
||||||
|
print_status info "Installing python-dotenv..."
|
||||||
|
pip install python-dotenv --break-system-packages || {
|
||||||
|
print_status warning "Failed to install python-dotenv (optional)"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Install Piper TTS
|
||||||
|
print_status info "Installing Piper TTS..."
|
||||||
|
# Note: Piper TTS installation method varies, adjust as needed
|
||||||
|
# For now, we'll install the Python package if available
|
||||||
|
pip install piper-tts --break-system-packages || {
|
||||||
|
print_status warning "Piper TTS pip package not found"
|
||||||
|
print_status info "You may need to install Piper manually from: https://github.com/rhasspy/piper"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Install PyAudio for audio handling
|
||||||
|
print_status info "Installing PyAudio dependencies..."
|
||||||
|
if command_exists apt-get; then
|
||||||
|
sudo apt-get install -y portaudio19-dev python3-pyaudio || {
|
||||||
|
print_status warning "Failed to install portaudio dev packages"
|
||||||
|
}
|
||||||
|
fi
|
||||||
|
|
||||||
|
pip install pyaudio --break-system-packages || {
|
||||||
|
print_status warning "Failed to install PyAudio (may need manual installation)"
|
||||||
|
}
|
||||||
|
|
||||||
|
print_status success "Python dependencies installed"
|
||||||
|
return 0
|
||||||
|
}
|
||||||
|
|
||||||
|
download_piper_models() {
|
||||||
|
print_status info "Downloading Piper TTS models..."
|
||||||
|
|
||||||
|
local models_dir="$PROJECT_DIR/models/piper"
|
||||||
|
mkdir -p "$models_dir"
|
||||||
|
|
||||||
|
# Download a default voice model
|
||||||
|
# Example: en_US-lessac-medium
|
||||||
|
local model_url="https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx"
|
||||||
|
local config_url="https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json"
|
||||||
|
|
||||||
|
if [[ ! -f "$models_dir/en_US-lessac-medium.onnx" ]]; then
|
||||||
|
print_status info "Downloading voice model..."
|
||||||
|
wget -q --show-progress -O "$models_dir/en_US-lessac-medium.onnx" "$model_url" || {
|
||||||
|
print_status warning "Failed to download Piper model (manual download may be needed)"
|
||||||
|
}
|
||||||
|
|
||||||
|
wget -q --show-progress -O "$models_dir/en_US-lessac-medium.onnx.json" "$config_url" || {
|
||||||
|
print_status warning "Failed to download Piper config"
|
||||||
|
}
|
||||||
|
else
|
||||||
|
print_status info "Piper model already downloaded"
|
||||||
|
fi
|
||||||
|
|
||||||
|
print_status success "Piper models ready"
|
||||||
|
return 0
|
||||||
|
}
|
||||||
|
|
||||||
|
create_config_file() {
|
||||||
|
print_status info "Creating configuration file..."
|
||||||
|
|
||||||
|
local config_file="$PROJECT_DIR/config/.env"
|
||||||
|
|
||||||
|
if [[ -f "$config_file" ]]; then
|
||||||
|
print_status warning "Config file already exists: $config_file"
|
||||||
|
return 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
cat > "$config_file" << 'EOF'
|
||||||
|
# Voice Assistant Configuration
|
||||||
|
# Path: ~/voice-assistant/config/.env
|
||||||
|
|
||||||
|
# Home Assistant Configuration
|
||||||
|
HA_URL=http://homeassistant.local:8123
|
||||||
|
HA_TOKEN=your_long_lived_access_token_here
|
||||||
|
|
||||||
|
# Server Configuration
|
||||||
|
SERVER_HOST=0.0.0.0
|
||||||
|
SERVER_PORT=5000
|
||||||
|
|
||||||
|
# Whisper Configuration
|
||||||
|
WHISPER_MODEL=medium
|
||||||
|
|
||||||
|
# Piper TTS Configuration
|
||||||
|
PIPER_MODEL=/path/to/piper/model.onnx
|
||||||
|
PIPER_CONFIG=/path/to/piper/model.onnx.json
|
||||||
|
|
||||||
|
# Logging
|
||||||
|
LOG_LEVEL=INFO
|
||||||
|
LOG_FILE=/home/$USER/voice-assistant/logs/voice_assistant.log
|
||||||
|
EOF
|
||||||
|
|
||||||
|
# Update paths in config
|
||||||
|
sed -i "s|/path/to/piper/model.onnx|$PROJECT_DIR/models/piper/en_US-lessac-medium.onnx|g" "$config_file"
|
||||||
|
sed -i "s|/path/to/piper/model.onnx.json|$PROJECT_DIR/models/piper/en_US-lessac-medium.onnx.json|g" "$config_file"
|
||||||
|
sed -i "s|/home/\$USER|$HOME|g" "$config_file"
|
||||||
|
|
||||||
|
chmod 600 "$config_file"
|
||||||
|
|
||||||
|
print_status success "Config file created: $config_file"
|
||||||
|
print_status warning "Please edit $config_file and add your Home Assistant token"
|
||||||
|
|
||||||
|
return 0
|
||||||
|
}
|
||||||
|
|
||||||
|
create_systemd_service() {
|
||||||
|
if [[ "$INSTALL_SYSTEMD" != "true" ]]; then
|
||||||
|
print_status info "Skipping systemd service installation"
|
||||||
|
return 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
print_status info "Creating systemd service..."
|
||||||
|
|
||||||
|
local service_file="/etc/systemd/system/voice-assistant.service"
|
||||||
|
|
||||||
|
# Create service file
|
||||||
|
sudo tee "$service_file" > /dev/null << EOF
|
||||||
|
[Unit]
|
||||||
|
Description=Voice Assistant Server
|
||||||
|
After=network.target
|
||||||
|
|
||||||
|
[Service]
|
||||||
|
Type=simple
|
||||||
|
User=$USER
|
||||||
|
WorkingDirectory=$PROJECT_DIR
|
||||||
|
Environment="PATH=$HOME/miniconda3/envs/$CONDA_ENV_NAME/bin:/usr/local/bin:/usr/bin:/bin"
|
||||||
|
EnvironmentFile=$PROJECT_DIR/config/.env
|
||||||
|
ExecStart=$HOME/miniconda3/envs/$CONDA_ENV_NAME/bin/python $PROJECT_DIR/voice_server.py
|
||||||
|
Restart=on-failure
|
||||||
|
RestartSec=10
|
||||||
|
StandardOutput=append:$PROJECT_DIR/logs/voice_assistant.log
|
||||||
|
StandardError=append:$PROJECT_DIR/logs/voice_assistant_error.log
|
||||||
|
|
||||||
|
[Install]
|
||||||
|
WantedBy=multi-user.target
|
||||||
|
EOF
|
||||||
|
|
||||||
|
# Reload systemd
|
||||||
|
sudo systemctl daemon-reload
|
||||||
|
|
||||||
|
print_status success "Systemd service created"
|
||||||
|
print_status info "To enable and start the service:"
|
||||||
|
print_status info " sudo systemctl enable voice-assistant"
|
||||||
|
print_status info " sudo systemctl start voice-assistant"
|
||||||
|
|
||||||
|
return 0
|
||||||
|
}
|
||||||
|
|
||||||
|
create_test_script() {
|
||||||
|
print_status info "Creating test script..."
|
||||||
|
|
||||||
|
local test_script="$PROJECT_DIR/test_server.sh"
|
||||||
|
|
||||||
|
cat > "$test_script" << 'EOF'
|
||||||
|
#!/bin/bash
|
||||||
|
# Test script for voice assistant server
|
||||||
|
|
||||||
|
# Activate conda environment
|
||||||
|
eval "$(conda shell.bash hook)"
|
||||||
|
conda activate voice-assistant
|
||||||
|
|
||||||
|
# Load environment variables
|
||||||
|
if [[ -f ~/voice-assistant/config/.env ]]; then
|
||||||
|
export $(grep -v '^#' ~/voice-assistant/config/.env | xargs)
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Run server
|
||||||
|
cd ~/voice-assistant
|
||||||
|
python voice_server.py --verbose
|
||||||
|
EOF
|
||||||
|
|
||||||
|
chmod +x "$test_script"
|
||||||
|
|
||||||
|
print_status success "Test script created: $test_script"
|
||||||
|
return 0
|
||||||
|
}
|
||||||
|
|
||||||
|
install_voice_server_script() {
|
||||||
|
print_status info "Installing voice_server.py..."
|
||||||
|
|
||||||
|
# Check if voice_server.py exists in outputs
|
||||||
|
if [[ -f "$HOME/voice_server.py" ]]; then
|
||||||
|
cp "$HOME/voice_server.py" "$PROJECT_DIR/voice_server.py"
|
||||||
|
print_status success "voice_server.py installed"
|
||||||
|
elif [[ -f "./voice_server.py" ]]; then
|
||||||
|
cp "./voice_server.py" "$PROJECT_DIR/voice_server.py"
|
||||||
|
print_status success "voice_server.py installed"
|
||||||
|
else
|
||||||
|
print_status warning "voice_server.py not found in current directory"
|
||||||
|
print_status info "Please copy voice_server.py to $PROJECT_DIR manually"
|
||||||
|
fi
|
||||||
|
|
||||||
|
return 0
|
||||||
|
}
|
||||||
|
|
||||||
|
# ----- Main -----
|
||||||
|
main() {
|
||||||
|
print_status info "Starting voice assistant setup..."
|
||||||
|
|
||||||
|
# Parse arguments
|
||||||
|
parse_args "$@"
|
||||||
|
|
||||||
|
# Check dependencies
|
||||||
|
check_conda || exit 1
|
||||||
|
|
||||||
|
# Setup steps
|
||||||
|
create_project_directory || exit 1
|
||||||
|
create_conda_environment || exit 1
|
||||||
|
install_python_dependencies || exit 1
|
||||||
|
download_piper_models || exit 1
|
||||||
|
create_config_file || exit 1
|
||||||
|
install_voice_server_script || exit 1
|
||||||
|
create_test_script || exit 1
|
||||||
|
|
||||||
|
if [[ "$INSTALL_SYSTEMD" == "true" ]]; then
|
||||||
|
create_systemd_service || exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Final instructions
|
||||||
|
print_status success "Setup complete!"
|
||||||
|
echo
|
||||||
|
print_status info "Next steps:"
|
||||||
|
print_status info "1. Edit config file: vim $PROJECT_DIR/config/.env"
|
||||||
|
print_status info "2. Add your Home Assistant long-lived access token"
|
||||||
|
print_status info "3. Test the server: $PROJECT_DIR/test_server.sh"
|
||||||
|
print_status info "4. Configure your Maix Duino device"
|
||||||
|
|
||||||
|
if [[ "$INSTALL_SYSTEMD" == "true" ]]; then
|
||||||
|
echo
|
||||||
|
print_status info "To run as a service:"
|
||||||
|
print_status info " sudo systemctl enable voice-assistant"
|
||||||
|
print_status info " sudo systemctl start voice-assistant"
|
||||||
|
print_status info " sudo systemctl status voice-assistant"
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo
|
||||||
|
print_status info "Project directory: $PROJECT_DIR"
|
||||||
|
print_status info "Conda environment: $CONDA_ENV_NAME"
|
||||||
|
print_status info "Activate with: conda activate $CONDA_ENV_NAME"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Run main
|
||||||
|
main "$@"
|
||||||
700  scripts/voice_server.py  (Executable file)

@@ -0,0 +1,700 @@
#!/usr/bin/env python3
"""
Voice Processing Server for Maix Duino Voice Assistant

Purpose and usage:
This server runs on Heimdall (10.1.10.71) and handles:
- Audio stream reception from Maix Duino
- Speech-to-text using Whisper
- Intent recognition and Home Assistant API calls
- Text-to-speech using Piper
- Audio response streaming back to device

Path: /home/alan/voice-assistant/voice_server.py

Requirements:
- whisper (already installed)
- piper-tts
- flask
- requests
- python-dotenv

Usage:
    python3 voice_server.py [--host HOST] [--port PORT] [--ha-url URL]
"""

import os
import sys
import argparse
import tempfile
import wave
import io
import re
import threading
import queue
from pathlib import Path
from typing import Optional, Dict, Any, Tuple

import whisper
import requests
from flask import Flask, request, jsonify, send_file
from werkzeug.exceptions import BadRequest

# Try to load environment variables
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    print("Warning: python-dotenv not installed. Using environment variables only.")

# Try to import Mycroft Precise
PRECISE_AVAILABLE = False
try:
    from precise_runner import PreciseEngine, PreciseRunner
    import pyaudio
    PRECISE_AVAILABLE = True
except ImportError:
    print("Warning: Mycroft Precise not installed. Wake word detection disabled.")
    print("Install with: pip install mycroft-precise pyaudio")

# Configuration
DEFAULT_HOST = "0.0.0.0"
DEFAULT_PORT = 5000
DEFAULT_WHISPER_MODEL = "medium"
DEFAULT_HA_URL = os.getenv("HA_URL", "http://homeassistant.local:8123")
DEFAULT_HA_TOKEN = os.getenv("HA_TOKEN", "")
DEFAULT_PRECISE_MODEL = os.getenv("PRECISE_MODEL", "")
DEFAULT_PRECISE_SENSITIVITY = float(os.getenv("PRECISE_SENSITIVITY", "0.5"))
DEFAULT_PRECISE_ENGINE = "/usr/local/bin/precise-engine"

# Initialize Flask app
app = Flask(__name__)
app.config['MAX_CONTENT_LENGTH'] = 16 * 1024 * 1024  # 16MB max audio file

# Global variables for loaded models
whisper_model = None
ha_client = None
precise_runner = None
precise_enabled = False
wake_word_queue = queue.Queue()  # Queue for wake word detections


class HomeAssistantClient:
    """Client for interacting with the Home Assistant API"""

    def __init__(self, base_url: str, token: str):
        self.base_url = base_url.rstrip('/')
        self.token = token
        self.session = requests.Session()
        self.session.headers.update({
            'Authorization': f'Bearer {token}',
            'Content-Type': 'application/json'
        })

    def get_state(self, entity_id: str) -> Optional[Dict[str, Any]]:
        """Get the state of an entity"""
        try:
            response = self.session.get(f'{self.base_url}/api/states/{entity_id}')
            response.raise_for_status()
            return response.json()
        except requests.RequestException as e:
            print(f"Error getting state for {entity_id}: {e}")
            return None

    def call_service(self, domain: str, service: str, entity_id: str,
                     **kwargs) -> bool:
        """Call a Home Assistant service"""
        try:
            data = {'entity_id': entity_id}
            data.update(kwargs)

            response = self.session.post(
                f'{self.base_url}/api/services/{domain}/{service}',
                json=data
            )
            response.raise_for_status()
            return True
        except requests.RequestException as e:
            print(f"Error calling service {domain}.{service}: {e}")
            return False

    def turn_on(self, entity_id: str, **kwargs) -> bool:
        """Turn on an entity"""
        domain = entity_id.split('.')[0]
        return self.call_service(domain, 'turn_on', entity_id, **kwargs)

    def turn_off(self, entity_id: str, **kwargs) -> bool:
        """Turn off an entity"""
        domain = entity_id.split('.')[0]
        return self.call_service(domain, 'turn_off', entity_id, **kwargs)

    def toggle(self, entity_id: str, **kwargs) -> bool:
        """Toggle an entity"""
        domain = entity_id.split('.')[0]
        return self.call_service(domain, 'toggle', entity_id, **kwargs)


class IntentParser:
    """Simple pattern-based intent recognition"""

    # Intent patterns (can be expanded or replaced with ML-based NLU)
    PATTERNS = {
        'turn_on': [
            r'turn on (the )?(.+)',
            r'switch on (the )?(.+)',
            r'enable (the )?(.+)',
        ],
        'turn_off': [
            r'turn off (the )?(.+)',
            r'switch off (the )?(.+)',
            r'disable (the )?(.+)',
        ],
        'toggle': [
            r'toggle (the )?(.+)',
        ],
        'get_state': [
            r'what(?:\'s| is) (the )?(.+)',
            r'how is (the )?(.+)',
            r'status of (the )?(.+)',
        ],
        'get_temperature': [
            r'what(?:\'s| is) the temperature',
            r'how (?:warm|cold|hot) is it',
        ],
    }

    # Entity name mapping (friendly names to entity IDs)
    ENTITY_MAP = {
        'living room light': 'light.living_room',
        'living room lights': 'light.living_room',
        'bedroom light': 'light.bedroom',
        'bedroom lights': 'light.bedroom',
        'kitchen light': 'light.kitchen',
        'kitchen lights': 'light.kitchen',
        'all lights': 'group.all_lights',
        'temperature': 'sensor.temperature',
        'thermostat': 'climate.thermostat',
    }

    def parse(self, text: str) -> Optional[Tuple[str, str, Dict[str, Any]]]:
        """
        Parse text into intent, entity, and parameters

        Returns:
            (intent, entity_id, params) or None if no match
        """
        text = text.lower().strip()

        for intent, patterns in self.PATTERNS.items():
            for pattern in patterns:
                match = re.match(pattern, text, re.IGNORECASE)
                if match:
                    # Extract entity name from match groups.
                    # Strip before comparing: the '(the )?' group captures a
                    # trailing space, so an unstripped check would treat
                    # 'the ' as the entity name and break the lookup.
                    entity_name = None
                    for group in match.groups():
                        if group and group.lower().strip() not in ('the', 'a', 'an'):
                            entity_name = group.lower().strip()
                            break

                    # Map entity name to entity ID
                    entity_id = None
                    if entity_name:
                        entity_id = self.ENTITY_MAP.get(entity_name)

                    # For get_temperature, use the default sensor
                    if intent == 'get_temperature':
                        entity_id = self.ENTITY_MAP.get('temperature')

                    if entity_id:
                        return (intent, entity_id, {})

        return None
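The pattern-and-map approach above can be exercised in isolation. The following self-contained sketch copies one `turn_on` regex and one `ENTITY_MAP` entry from the class; `match_command` is a hypothetical helper for illustration, not part of the server:

```python
import re

# Mirrors IntentParser: one 'turn_on' pattern plus one ENTITY_MAP entry.
ENTITY_MAP = {'kitchen lights': 'light.kitchen'}

def match_command(text):
    """Return (intent, entity_id) for a 'turn on ...' command, else None."""
    match = re.match(r'turn on (the )?(.+)', text.lower().strip(), re.IGNORECASE)
    if not match:
        return None
    # Skip filler groups like 'the ' and strip whitespace before the lookup,
    # as IntentParser.parse does.
    for group in match.groups():
        if group and group.lower().strip() not in ('the', 'a', 'an'):
            entity_id = ENTITY_MAP.get(group.lower().strip())
            if entity_id:
                return ('turn_on', entity_id)
    return None

print(match_command("Turn on the kitchen lights"))  # ('turn_on', 'light.kitchen')
```

Note that without stripping, the optional `(the )?` group (which captures `'the '` with a trailing space) would be taken as the entity name and every lookup would miss.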
def load_whisper_model(model_name: str = DEFAULT_WHISPER_MODEL):
    """Load the Whisper model (cached after first load)"""
    global whisper_model

    if whisper_model is None:
        print(f"Loading Whisper model: {model_name}")
        whisper_model = whisper.load_model(model_name)
        print("Whisper model loaded successfully")

    return whisper_model


def transcribe_audio(audio_file_path: str) -> Optional[str]:
    """Transcribe an audio file using Whisper"""
    try:
        model = load_whisper_model()
        result = model.transcribe(audio_file_path)
        return result['text'].strip()
    except Exception as e:
        print(f"Error transcribing audio: {e}")
        return None


def generate_tts(text: str) -> Optional[bytes]:
    """
    Generate speech from text using Piper TTS

    TODO: Implement Piper TTS integration
    For now, returns None - implement based on Piper installation
    """
    # Placeholder for TTS implementation
    print(f"TTS requested for: {text}")

    # You'll need to add Piper TTS integration here
    # Example command: piper --model <model> --output_file <file> < text

    return None
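One possible shape for the TODO above, hedged as a sketch: it assumes the `piper` CLI from rhasspy/piper is on `PATH` and accepts the `--model` and `--output_file` flags named in the comment above, reading text on stdin. `build_piper_command` and `piper_tts` are illustrative names, not part of the server yet.

```python
import subprocess
import tempfile
from pathlib import Path

def build_piper_command(model_path, output_path):
    """Assemble the piper CLI invocation suggested in the comment above."""
    return ["piper", "--model", str(model_path), "--output_file", str(output_path)]

def piper_tts(text, model_path):
    """Run piper, feeding text on stdin; returns WAV bytes or None on failure."""
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        out_path = Path(tmp.name)
    try:
        result = subprocess.run(
            build_piper_command(model_path, out_path),
            input=text.encode("utf-8"),
            capture_output=True,
        )
        if result.returncode != 0:
            return None
        return out_path.read_bytes()
    finally:
        out_path.unlink(missing_ok=True)
```

Swapping this into `generate_tts` would let the `/tts` and `/process` routes return real audio once a voice model is downloaded.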
def on_wake_word_detected():
    """
    Callback when Mycroft Precise detects the wake word.

    This function is called by the Precise runner when the wake word
    is detected. It signals the main application to start recording
    and processing the user's command.
    """
    print("Wake word detected by Precise!")
    wake_word_queue.put({
        'timestamp': time.time(),
        'source': 'precise'
    })


def start_precise_listener(model_path: str, sensitivity: float = 0.5,
                           engine_path: str = DEFAULT_PRECISE_ENGINE):
    """
    Start Mycroft Precise wake word detection

    Args:
        model_path: Path to .net model file
        sensitivity: Detection threshold (0.0-1.0, default 0.5)
        engine_path: Path to precise-engine binary

    Returns:
        PreciseRunner instance if successful, None otherwise
    """
    global precise_runner, precise_enabled

    if not PRECISE_AVAILABLE:
        print("Error: Mycroft Precise not available")
        return None

    # Verify model exists
    if not os.path.exists(model_path):
        print(f"Error: Precise model not found: {model_path}")
        return None

    # Verify engine exists
    if not os.path.exists(engine_path):
        print(f"Error: precise-engine not found: {engine_path}")
        print("Download from: https://github.com/MycroftAI/mycroft-precise/releases")
        return None

    try:
        # Create Precise engine
        engine = PreciseEngine(engine_path, model_path)

        # Create runner with callback
        precise_runner = PreciseRunner(
            engine,
            sensitivity=sensitivity,
            on_activation=on_wake_word_detected
        )

        # Start listening
        precise_runner.start()
        precise_enabled = True

        print("Precise listening started:")
        print(f"  Model: {model_path}")
        print(f"  Sensitivity: {sensitivity}")
        print(f"  Engine: {engine_path}")

        return precise_runner

    except Exception as e:
        print(f"Error starting Precise: {e}")
        return None


def stop_precise_listener():
    """Stop Mycroft Precise wake word detection"""
    global precise_runner, precise_enabled

    if precise_runner:
        try:
            precise_runner.stop()
            precise_enabled = False
            print("Precise listener stopped")
        except Exception as e:
            print(f"Error stopping Precise: {e}")


def record_audio_after_wake(duration: int = 5) -> Optional[bytes]:
    """
    Record audio after the wake word is detected

    Args:
        duration: Maximum recording duration in seconds

    Returns:
        WAV audio data or None

    Note: This is for server-side wake word detection where
    the server is also doing audio capture. For Maix Duino
    client-side wake detection, audio comes from the client.
    """
    if not PRECISE_AVAILABLE:
        return None

    try:
        # Audio settings
        CHUNK = 1024
        FORMAT = pyaudio.paInt16
        CHANNELS = 1
        RATE = 16000

        p = pyaudio.PyAudio()

        # Open stream
        stream = p.open(
            format=FORMAT,
            channels=CHANNELS,
            rate=RATE,
            input=True,
            frames_per_buffer=CHUNK
        )

        print(f"Recording for {duration} seconds...")

        frames = []
        for _ in range(0, int(RATE / CHUNK * duration)):
            data = stream.read(CHUNK)
            frames.append(data)

        # Stop and close stream
        stream.stop_stream()
        stream.close()
        p.terminate()

        # Convert to WAV
        wav_buffer = io.BytesIO()
        with wave.open(wav_buffer, 'wb') as wf:
            wf.setnchannels(CHANNELS)
            wf.setsampwidth(p.get_sample_size(FORMAT))
            wf.setframerate(RATE)
            wf.writeframes(b''.join(frames))

        return wav_buffer.getvalue()

    except Exception as e:
        print(f"Error recording audio: {e}")
        return None
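The frames-to-WAV step above can be exercised without a microphone. This self-contained sketch packs raw PCM frames at the same settings (16 kHz, mono, 16-bit) into an in-memory WAV container; `frames_to_wav` is an illustrative helper, not part of the server:

```python
import io
import wave

def frames_to_wav(frames, channels=1, sample_width=2, rate=16000):
    """Pack raw PCM frames into an in-memory WAV container,
    mirroring the conversion in record_audio_after_wake()."""
    buf = io.BytesIO()
    with wave.open(buf, 'wb') as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)   # 2 bytes == pyaudio.paInt16
        wf.setframerate(rate)
        wf.writeframes(b''.join(frames))
    return buf.getvalue()

# Roughly one second of 16 kHz mono silence: 16 chunks of 1024 zero samples.
wav_bytes = frames_to_wav([b'\x00\x00' * 1024] * 16)
```

Feeding such synthetic WAV bytes to the `/transcribe` endpoint is a cheap way to smoke-test the upload path end to end.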
import time # Add this import at the top if not already there
|
||||||
|
|
||||||
|
|
||||||
|
def execute_intent(intent: str, entity_id: str, params: Dict[str, Any]) -> str:
|
||||||
|
"""Execute an intent and return response text"""
|
||||||
|
|
||||||
|
if intent == 'turn_on':
|
||||||
|
success = ha_client.turn_on(entity_id)
|
||||||
|
if success:
|
||||||
|
entity_name = entity_id.split('.')[-1].replace('_', ' ')
|
||||||
|
return f"Turned on {entity_name}"
|
||||||
|
else:
|
||||||
|
return "Sorry, I couldn't turn that on"
|
||||||
|
|
||||||
|
elif intent == 'turn_off':
|
||||||
|
success = ha_client.turn_off(entity_id)
|
||||||
|
if success:
|
||||||
|
entity_name = entity_id.split('.')[-1].replace('_', ' ')
|
||||||
|
return f"Turned off {entity_name}"
|
||||||
|
else:
|
||||||
|
return "Sorry, I couldn't turn that off"
|
||||||
|
|
||||||
|
elif intent == 'toggle':
|
||||||
|
success = ha_client.toggle(entity_id)
|
||||||
|
if success:
|
||||||
|
entity_name = entity_id.split('.')[-1].replace('_', ' ')
|
||||||
|
return f"Toggled {entity_name}"
|
||||||
|
else:
|
||||||
|
return "Sorry, I couldn't toggle that"
|
||||||
|
|
||||||
|
elif intent in ['get_state', 'get_temperature']:
|
||||||
|
state = ha_client.get_state(entity_id)
|
||||||
|
if state:
|
||||||
|
entity_name = entity_id.split('.')[-1].replace('_', ' ')
|
||||||
|
value = state.get('state', 'unknown')
|
||||||
|
unit = state.get('attributes', {}).get('unit_of_measurement', '')
|
||||||
|
|
||||||
|
return f"The {entity_name} is {value} {unit}".strip()
|
||||||
|
else:
|
||||||
|
return "Sorry, I couldn't get that information"
|
||||||
|
|
||||||
|
return "I didn't understand that command"
|
||||||
|
|
||||||
|
|
||||||
|
# Flask routes
|
||||||
|
|
||||||
|
@app.route('/health', methods=['GET'])
|
||||||
|
def health():
|
||||||
|
"""Health check endpoint"""
|
||||||
|
return jsonify({
|
||||||
|
'status': 'healthy',
|
||||||
|
'whisper_loaded': whisper_model is not None,
|
||||||
|
'ha_connected': ha_client is not None,
|
||||||
|
'precise_enabled': precise_enabled,
|
||||||
|
'precise_available': PRECISE_AVAILABLE
|
||||||
|
})
|
||||||
|
|
||||||
|
|
||||||
|
@app.route('/wake-word/status', methods=['GET'])
|
||||||
|
def wake_word_status():
|
||||||
|
"""Get wake word detection status"""
|
||||||
|
return jsonify({
|
||||||
|
'enabled': precise_enabled,
|
||||||
|
'available': PRECISE_AVAILABLE,
|
||||||
|
'model': DEFAULT_PRECISE_MODEL if precise_enabled else None,
|
||||||
|
'sensitivity': DEFAULT_PRECISE_SENSITIVITY if precise_enabled else None
|
||||||
|
})
|
||||||
|
|
||||||
|
|
||||||
|
@app.route('/wake-word/detections', methods=['GET'])
|
||||||
|
def wake_word_detections():
|
||||||
|
"""
|
||||||
|
Get recent wake word detections (non-blocking)
|
||||||
|
|
||||||
|
Returns any wake word detections in the queue.
|
||||||
|
Used for testing and monitoring.
|
||||||
|
"""
|
||||||
|
detections = []
|
||||||
|
|
||||||
|
try:
|
||||||
|
while not wake_word_queue.empty():
|
||||||
|
detections.append(wake_word_queue.get_nowait())
|
||||||
|
except queue.Empty:
|
||||||
|
pass
|
||||||
|
|
||||||
|
return jsonify({
|
||||||
|
'detections': detections,
|
||||||
|
'count': len(detections)
|
||||||
|
})
|
||||||
|
|
||||||
|
|
||||||
|
@app.route('/transcribe', methods=['POST'])
|
||||||
|
def transcribe():
|
||||||
|
"""
|
||||||
|
Transcribe audio file
|
||||||
|
|
||||||
|
Expects: WAV audio file in request body
|
||||||
|
Returns: JSON with transcribed text
|
||||||
|
"""
|
||||||
|
if 'audio' not in request.files:
|
||||||
|
raise BadRequest('No audio file provided')
|
||||||
|
|
||||||
|
audio_file = request.files['audio']
|
||||||
|
|
||||||
|
# Save to temporary file
|
||||||
|
with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as temp_file:
|
||||||
|
audio_file.save(temp_file.name)
|
||||||
|
temp_path = temp_file.name
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Transcribe
|
||||||
|
text = transcribe_audio(temp_path)
|
||||||
|
|
||||||
|
if text:
|
||||||
|
return jsonify({
|
||||||
|
'success': True,
|
||||||
|
'text': text
|
||||||
|
})
|
||||||
|
else:
|
||||||
|
return jsonify({
|
||||||
|
'success': False,
|
||||||
|
'error': 'Transcription failed'
|
||||||
|
}), 500
|
||||||
|
|
||||||
|
finally:
|
||||||
|
# Clean up temp file
|
||||||
|
if os.path.exists(temp_path):
|
||||||
|
os.remove(temp_path)
|
||||||
|
|
||||||
|
|
||||||
|
@app.route('/process', methods=['POST'])
def process():
    """
    Process complete voice command

    Expects: WAV audio file in request body
    Returns: JSON with response and audio file
    """
    if 'audio' not in request.files:
        raise BadRequest('No audio file provided')

    audio_file = request.files['audio']

    # Save to temporary file
    with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as temp_file:
        audio_file.save(temp_file.name)
        temp_path = temp_file.name

    try:
        # Step 1: Transcribe
        text = transcribe_audio(temp_path)

        if not text:
            return jsonify({
                'success': False,
                'error': 'Transcription failed'
            }), 500

        print(f"Transcribed: {text}")

        # Step 2: Parse intent
        parser = IntentParser()
        intent_result = parser.parse(text)

        if not intent_result:
            response_text = "I didn't understand that command"
        else:
            intent, entity_id, params = intent_result
            print(f"Intent: {intent}, Entity: {entity_id}")

            # Step 3: Execute intent
            response_text = execute_intent(intent, entity_id, params)

        print(f"Response: {response_text}")

        # Step 4: Generate TTS (placeholder for now)
        # audio_response = generate_tts(response_text)

        return jsonify({
            'success': True,
            'transcription': text,
            'response': response_text,
            # 'audio_available': audio_response is not None
        })

    finally:
        # Clean up temp file
        if os.path.exists(temp_path):
            os.remove(temp_path)

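`IntentParser` and `execute_intent` are defined earlier in `voice_server.py`, outside this excerpt. Purely to illustrate the parse step of the transcribe → parse → execute pipeline, a keyword matcher in the same spirit might look like this (all names here are hypothetical, not the real implementation):

```python
import re
from typing import Optional, Tuple

def parse_command(text: str) -> Optional[Tuple[str, str]]:
    """Tiny stand-in for IntentParser.parse(): keyword matching only."""
    text = text.lower().strip()
    match = re.search(r'\b(turn on|turn off)\b (?:the )?(.+)', text)
    if not match:
        return None
    action, device = match.groups()
    intent = 'turn_on' if action == 'turn on' else 'turn_off'
    # Map a spoken name to a hypothetical Home Assistant entity_id
    entity_id = 'light.' + device.replace(' ', '_')
    return intent, entity_id

print(parse_command("Turn on the kitchen light"))
# ('turn_on', 'light.kitchen_light')
print(parse_command("what time is it"))  # None
```

Returning `None` for unmatched text is what lets the route above fall back to the "I didn't understand that command" response.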
@app.route('/tts', methods=['POST'])
def tts():
    """
    Generate TTS audio

    Expects: JSON with 'text' field
    Returns: WAV audio file
    """
    data = request.get_json()

    if not data or 'text' not in data:
        raise BadRequest('No text provided')

    text = data['text']

    # Generate TTS
    audio_data = generate_tts(text)

    if audio_data:
        return send_file(
            io.BytesIO(audio_data),
            mimetype='audio/wav',
            as_attachment=True,
            download_name='response.wav'
        )
    else:
        return jsonify({
            'success': False,
            'error': 'TTS generation not implemented yet'
        }), 501

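`generate_tts()` is still a placeholder (the route returns 501), but whatever backend eventually fills it in needs to hand back raw WAV bytes that `send_file` can wrap in `io.BytesIO`. A sketch of producing bytes of that shape with the stdlib `wave` module, here just a stretch of 16 kHz mono silence:

```python
import io
import wave

def silent_wav(seconds: float = 1.0, rate: int = 16000) -> bytes:
    """Build a minimal mono 16-bit WAV entirely in memory."""
    buf = io.BytesIO()
    with wave.open(buf, 'wb') as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(b'\x00\x00' * int(rate * seconds))
    return buf.getvalue()

audio = silent_wav(0.1)
print(audio[:4])    # b'RIFF'
print(audio[8:12])  # b'WAVE'
```

Any real TTS engine's output can be returned the same way as long as it yields a complete RIFF/WAVE byte string.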
def main():
    parser = argparse.ArgumentParser(
        description="Voice Processing Server for Maix Duino Voice Assistant"
    )
    parser.add_argument('--host', default=DEFAULT_HOST,
                        help=f'Server host (default: {DEFAULT_HOST})')
    parser.add_argument('--port', type=int, default=DEFAULT_PORT,
                        help=f'Server port (default: {DEFAULT_PORT})')
    parser.add_argument('--whisper-model', default=DEFAULT_WHISPER_MODEL,
                        help=f'Whisper model to use (default: {DEFAULT_WHISPER_MODEL})')
    parser.add_argument('--ha-url', default=DEFAULT_HA_URL,
                        help=f'Home Assistant URL (default: {DEFAULT_HA_URL})')
    parser.add_argument('--ha-token', default=DEFAULT_HA_TOKEN,
                        help='Home Assistant long-lived access token')
    parser.add_argument('--enable-precise', action='store_true',
                        help='Enable Mycroft Precise wake word detection')
    parser.add_argument('--precise-model', default=DEFAULT_PRECISE_MODEL,
                        help='Path to Precise .net model file')
    parser.add_argument('--precise-sensitivity', type=float,
                        default=DEFAULT_PRECISE_SENSITIVITY,
                        help='Precise sensitivity threshold (0.0-1.0, default: 0.5)')
    parser.add_argument('--precise-engine', default=DEFAULT_PRECISE_ENGINE,
                        help=f'Path to precise-engine binary (default: {DEFAULT_PRECISE_ENGINE})')

    args = parser.parse_args()

    # Validate HA configuration
    if not args.ha_token:
        print("Warning: No Home Assistant token provided!")
        print("Set HA_TOKEN environment variable or use --ha-token")
        print("Commands will not execute without authentication.")

    # Initialize global clients
    global ha_client
    ha_client = HomeAssistantClient(args.ha_url, args.ha_token)

    # Load Whisper model
    print(f"Starting voice processing server on {args.host}:{args.port}")
    load_whisper_model(args.whisper_model)

    # Start Precise if enabled
    if args.enable_precise:
        if not PRECISE_AVAILABLE:
            print("Error: --enable-precise specified but Mycroft Precise not installed")
            print("Install with: pip install mycroft-precise pyaudio")
            sys.exit(1)

        if not args.precise_model:
            print("Error: --enable-precise requires --precise-model")
            sys.exit(1)

        print("\nStarting Mycroft Precise wake word detection...")
        precise_result = start_precise_listener(
            args.precise_model,
            args.precise_sensitivity,
            args.precise_engine
        )

        if not precise_result:
            print("Error: Failed to start Precise listener")
            sys.exit(1)

        print("\nWake word detection active!")
        print("The server will detect wake words and queue them for processing.")
        print("Use /wake-word/detections endpoint to check for detections.\n")

    # Start Flask server
    try:
        app.run(host=args.host, port=args.port, debug=False)
    except KeyboardInterrupt:
        print("\nShutting down...")
        if args.enable_precise:
            stop_precise_listener()
        sys.exit(0)


if __name__ == '__main__':
    main()
580 scripts/voice_server_enhanced.py Executable file

@@ -0,0 +1,580 @@
#!/usr/bin/env python3
"""
Enhanced Voice Server with Multiple Wake Words and Speaker Identification

Path: /home/alan/voice-assistant/voice_server_enhanced.py

This enhanced version adds:
- Multiple wake word support
- Speaker identification using pyannote.audio
- Per-user customization
- Wake word-specific responses

Usage:
    python3 voice_server_enhanced.py \
        --enable-precise \
        --multi-wake-word \
        --enable-speaker-id
"""

import os
import sys
import json
import argparse
import tempfile
import wave
import io
import re
import threading
import queue
import time
from pathlib import Path
from typing import Optional, Dict, Any, Tuple, List

import whisper
import requests
from flask import Flask, request, jsonify, send_file
from werkzeug.exceptions import BadRequest

try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass

# Mycroft Precise
PRECISE_AVAILABLE = False
try:
    from precise_runner import PreciseEngine, PreciseRunner
    import pyaudio
    PRECISE_AVAILABLE = True
except ImportError:
    print("Warning: Mycroft Precise not installed")

# Speaker identification
SPEAKER_ID_AVAILABLE = False
try:
    from pyannote.audio import Inference
    from scipy.spatial.distance import cosine
    import numpy as np
    SPEAKER_ID_AVAILABLE = True
except ImportError:
    print("Warning: Speaker ID not available. Install: pip install pyannote.audio scipy")

# Configuration
DEFAULT_HOST = "0.0.0.0"
DEFAULT_PORT = 5000
DEFAULT_WHISPER_MODEL = "medium"
DEFAULT_HA_URL = os.getenv("HA_URL", "http://homeassistant.local:8123")
DEFAULT_HA_TOKEN = os.getenv("HA_TOKEN", "")
DEFAULT_PRECISE_ENGINE = "/usr/local/bin/precise-engine"
DEFAULT_HF_TOKEN = os.getenv("HF_TOKEN", "")

# Wake word configurations
WAKE_WORD_CONFIGS = {
    'hey_mycroft': {
        'model': os.path.expanduser('~/precise-models/pretrained/hey-mycroft.net'),
        'sensitivity': 0.5,
        'response': 'Yes?',
        'enabled': True,
        'context': 'general'
    },
    'hey_computer': {
        'model': os.path.expanduser('~/precise-models/hey-computer/hey-computer.net'),
        'sensitivity': 0.5,
        'response': 'I\'m listening',
        'enabled': False,  # Disabled by default (requires training)
        'context': 'general'
    },
    'jarvis': {
        'model': os.path.expanduser('~/precise-models/jarvis/jarvis.net'),
        'sensitivity': 0.6,
        'response': 'At your service',
        'enabled': False,
        'context': 'personal'
    },
}

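`start_multiple_wake_words()` further down skips any entry that is disabled or whose `.net` model file is missing on disk. That selection step in isolation (using `os.devnull` as a stand-in for an existing model file, purely for demonstration):

```python
import os
from typing import Dict

def select_wake_words(configs: Dict[str, dict]) -> Dict[str, dict]:
    """Keep only enabled wake words whose model file exists on disk."""
    return {
        name: cfg for name, cfg in configs.items()
        if cfg.get('enabled', False) and os.path.exists(cfg['model'])
    }

demo = {
    'hey_mycroft': {'model': os.devnull, 'enabled': True},           # existing path stands in for a .net model
    'jarvis': {'model': '/nonexistent/jarvis.net', 'enabled': True}, # enabled but model missing
    'hey_computer': {'model': os.devnull, 'enabled': False},         # model present but disabled
}
print(sorted(select_wake_words(demo)))  # ['hey_mycroft']
```

Checking the file up front keeps one missing model from aborting the whole startup; the runner simply logs a warning and moves on.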
# Speaker profiles (stored in JSON file)
SPEAKER_PROFILES_FILE = os.path.expanduser('~/voice-assistant/config/speaker_profiles.json')

# Flask app
app = Flask(__name__)
app.config['MAX_CONTENT_LENGTH'] = 16 * 1024 * 1024

# Global state
whisper_model = None
ha_client = None
precise_runners = {}
precise_enabled = False
speaker_id_enabled = False
speaker_inference = None
speaker_profiles = {}
wake_word_queue = queue.Queue()


class HomeAssistantClient:
    """Client for Home Assistant API"""

    def __init__(self, base_url: str, token: str):
        self.base_url = base_url.rstrip('/')
        self.token = token
        self.session = requests.Session()
        self.session.headers.update({
            'Authorization': f'Bearer {token}',
            'Content-Type': 'application/json'
        })

    def get_state(self, entity_id: str) -> Optional[Dict[str, Any]]:
        try:
            response = self.session.get(f'{self.base_url}/api/states/{entity_id}')
            response.raise_for_status()
            return response.json()
        except requests.RequestException as e:
            print(f"Error getting state for {entity_id}: {e}")
            return None

    def call_service(self, domain: str, service: str, entity_id: str, **kwargs) -> bool:
        try:
            data = {'entity_id': entity_id}
            data.update(kwargs)
            response = self.session.post(
                f'{self.base_url}/api/services/{domain}/{service}',
                json=data
            )
            response.raise_for_status()
            return True
        except requests.RequestException as e:
            print(f"Error calling service {domain}.{service}: {e}")
            return False

    def turn_on(self, entity_id: str, **kwargs) -> bool:
        domain = entity_id.split('.')[0]
        return self.call_service(domain, 'turn_on', entity_id, **kwargs)

    def turn_off(self, entity_id: str, **kwargs) -> bool:
        domain = entity_id.split('.')[0]
        return self.call_service(domain, 'turn_off', entity_id, **kwargs)


class SpeakerIdentification:
    """Speaker identification using pyannote.audio"""

    def __init__(self, hf_token: str):
        if not SPEAKER_ID_AVAILABLE:
            raise ImportError("Speaker ID dependencies not available")

        self.inference = Inference(
            "pyannote/embedding",
            use_auth_token=hf_token
        )
        self.profiles = {}

    def enroll_speaker(self, name: str, audio_file: str):
        """Enroll a speaker from audio file"""
        embedding = self.inference(audio_file)
        self.profiles[name] = {
            'embedding': embedding.tolist(),  # Convert to list for JSON
            'enrolled': time.time()
        }
        print(f"Enrolled speaker: {name}")

    def identify_speaker(self, audio_file: str, threshold: float = 0.7) -> Optional[str]:
        """Identify speaker from audio file"""
        if not self.profiles:
            return None

        unknown_embedding = self.inference(audio_file)

        best_match = None
        best_similarity = 0.0

        for name, profile in self.profiles.items():
            known_embedding = np.array(profile['embedding'])
            similarity = 1 - cosine(unknown_embedding, known_embedding)

            if similarity > best_similarity:
                best_similarity = similarity
                best_match = name

        if best_similarity >= threshold:
            return best_match

        return 'unknown'

    def load_profiles(self, filepath: str):
        """Load speaker profiles from JSON"""
        if os.path.exists(filepath):
            with open(filepath, 'r') as f:
                self.profiles = json.load(f)
            print(f"Loaded {len(self.profiles)} speaker profiles")

    def save_profiles(self, filepath: str):
        """Save speaker profiles to JSON"""
        os.makedirs(os.path.dirname(filepath), exist_ok=True)
        with open(filepath, 'w') as f:
            json.dump(self.profiles, f, indent=2)
        print(f"Saved {len(self.profiles)} speaker profiles")

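`identify_speaker()` above compares pyannote embeddings with scipy's cosine distance and picks the nearest enrolled profile, falling back to `'unknown'` below the threshold. The same matching rule on plain Python lists, with a hand-rolled cosine so the logic is visible without numpy/scipy (toy 3-dimensional embeddings, purely illustrative):

```python
import math
from typing import Dict, List, Optional

def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def best_speaker(embedding: List[float],
                 profiles: Dict[str, List[float]],
                 threshold: float = 0.7) -> Optional[str]:
    """Return the closest enrolled speaker, or 'unknown' below the threshold."""
    if not profiles:
        return None
    name, score = max(
        ((n, cosine_similarity(embedding, e)) for n, e in profiles.items()),
        key=lambda pair: pair[1],
    )
    return name if score >= threshold else 'unknown'

profiles = {'alan': [1.0, 0.0, 0.0], 'dana': [0.0, 1.0, 0.0]}
print(best_speaker([0.9, 0.1, 0.0], profiles))  # alan
print(best_speaker([0.5, 0.5, 0.7], profiles))  # unknown
```

Real pyannote embeddings are high-dimensional float vectors, but the decision rule is exactly this: highest cosine similarity wins if it clears the threshold.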
def load_whisper_model(model_name: str = DEFAULT_WHISPER_MODEL):
    """Load Whisper model"""
    global whisper_model
    if whisper_model is None:
        print(f"Loading Whisper model: {model_name}")
        whisper_model = whisper.load_model(model_name)
        print("Whisper model loaded")
    return whisper_model


def transcribe_audio(audio_file_path: str) -> Optional[str]:
    """Transcribe audio file"""
    try:
        model = load_whisper_model()
        result = model.transcribe(audio_file_path)
        return result['text'].strip()
    except Exception as e:
        print(f"Error transcribing: {e}")
        return None


def on_wake_word_detected(wake_word_name: str):
    """Callback factory for wake word detection"""
    def callback():
        config = WAKE_WORD_CONFIGS.get(wake_word_name, {})
        print(f"Wake word detected: {wake_word_name}")

        wake_word_queue.put({
            'timestamp': time.time(),
            'wake_word': wake_word_name,
            'response': config.get('response', 'Yes?'),
            'context': config.get('context', 'general')
        })

    return callback

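`on_wake_word_detected()` is a callback factory: each Precise runner gets a zero-argument closure that remembers which wake word it belongs to and drops an event onto the shared queue. The pattern in miniature, with the assistant-specific details stripped away:

```python
import queue
import time

events: "queue.Queue[dict]" = queue.Queue()

def make_callback(name: str, response: str):
    """Return a zero-argument callback bound to one wake word's settings."""
    def callback():
        events.put({'timestamp': time.time(),
                    'wake_word': name,
                    'response': response})
    return callback

# One callback per configured wake word, matching PreciseRunner's
# on_activation signature (no arguments)
on_mycroft = make_callback('hey_mycroft', 'Yes?')
on_jarvis = make_callback('jarvis', 'At your service')

on_jarvis()
on_mycroft()
print(events.get()['wake_word'])  # jarvis
print(events.get()['response'])   # Yes?
```

The closure is what lets a no-argument activation callback still know which of several concurrent listeners fired.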
def start_multiple_wake_words(configs: Dict[str, Dict], engine_path: str):
    """Start multiple Precise wake word listeners"""
    global precise_runners, precise_enabled

    if not PRECISE_AVAILABLE:
        print("Error: Precise not available")
        return False

    active_count = 0

    for name, config in configs.items():
        if not config.get('enabled', False):
            continue

        model_path = config['model']
        if not os.path.exists(model_path):
            print(f"Warning: Model not found: {model_path} (skipping {name})")
            continue

        try:
            engine = PreciseEngine(engine_path, model_path)
            runner = PreciseRunner(
                engine,
                sensitivity=config.get('sensitivity', 0.5),
                on_activation=on_wake_word_detected(name)
            )
            runner.start()
            precise_runners[name] = runner
            active_count += 1

            print(f"✓ Started wake word: {name}")
            print(f"  Model: {model_path}")
            print(f"  Sensitivity: {config.get('sensitivity', 0.5)}")

        except Exception as e:
            print(f"✗ Failed to start {name}: {e}")

    if active_count > 0:
        precise_enabled = True
        print(f"\nTotal active wake words: {active_count}")
        return True

    return False


def stop_all_wake_words():
    """Stop all wake word listeners"""
    global precise_runners, precise_enabled

    for name, runner in precise_runners.items():
        try:
            runner.stop()
            print(f"Stopped wake word: {name}")
        except Exception as e:
            print(f"Error stopping {name}: {e}")

    precise_runners = {}
    precise_enabled = False


def init_speaker_identification(hf_token: str) -> Optional[SpeakerIdentification]:
    """Initialize speaker identification"""
    global speaker_inference, speaker_id_enabled

    if not SPEAKER_ID_AVAILABLE:
        print("Speaker ID not available")
        return None

    try:
        speaker_inference = SpeakerIdentification(hf_token)

        # Load existing profiles
        if os.path.exists(SPEAKER_PROFILES_FILE):
            speaker_inference.load_profiles(SPEAKER_PROFILES_FILE)

        speaker_id_enabled = True
        print("Speaker identification initialized")
        return speaker_inference

    except Exception as e:
        print(f"Error initializing speaker ID: {e}")
        return None


# Flask routes
@app.route('/health', methods=['GET'])
def health():
    """Health check"""
    return jsonify({
        'status': 'healthy',
        'whisper_loaded': whisper_model is not None,
        'ha_connected': ha_client is not None,
        'precise_enabled': precise_enabled,
        'active_wake_words': list(precise_runners.keys()),
        'speaker_id_enabled': speaker_id_enabled,
        'enrolled_speakers': list(speaker_inference.profiles.keys()) if speaker_inference else []
    })


@app.route('/wake-words', methods=['GET'])
def list_wake_words():
    """List all configured wake words"""
    wake_words = []

    for name, config in WAKE_WORD_CONFIGS.items():
        wake_words.append({
            'name': name,
            'enabled': config.get('enabled', False),
            'active': name in precise_runners,
            'model': config['model'],
            'sensitivity': config.get('sensitivity', 0.5),
            'response': config.get('response', ''),
            'context': config.get('context', 'general')
        })

    return jsonify({
        'wake_words': wake_words,
        'total': len(wake_words),
        'active': len(precise_runners)
    })


@app.route('/wake-words/<name>/enable', methods=['POST'])
def enable_wake_word(name):
    """Enable a wake word"""
    if name not in WAKE_WORD_CONFIGS:
        return jsonify({'error': 'Wake word not found'}), 404

    config = WAKE_WORD_CONFIGS[name]
    config['enabled'] = True

    # Start the wake word if not already running
    if name not in precise_runners:
        # Restart all wake words to pick up changes
        # (simpler than starting individual ones)
        return jsonify({
            'message': f'Enabled {name}. Restart server to activate.'
        })

    return jsonify({'message': f'Wake word {name} enabled'})


@app.route('/speakers/enroll', methods=['POST'])
def enroll_speaker():
    """Enroll a new speaker"""
    if not speaker_id_enabled or not speaker_inference:
        return jsonify({'error': 'Speaker ID not enabled'}), 400

    if 'audio' not in request.files:
        return jsonify({'error': 'No audio file'}), 400

    name = request.form.get('name')
    if not name:
        return jsonify({'error': 'No speaker name provided'}), 400

    audio_file = request.files['audio']

    # Save temporarily
    with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as temp:
        audio_file.save(temp.name)
        temp_path = temp.name

    try:
        speaker_inference.enroll_speaker(name, temp_path)
        speaker_inference.save_profiles(SPEAKER_PROFILES_FILE)

        return jsonify({
            'message': f'Enrolled speaker: {name}',
            'total_speakers': len(speaker_inference.profiles)
        })

    except Exception as e:
        return jsonify({'error': str(e)}), 500

    finally:
        if os.path.exists(temp_path):
            os.remove(temp_path)

@app.route('/speakers', methods=['GET'])
def list_speakers():
    """List enrolled speakers"""
    if not speaker_id_enabled or not speaker_inference:
        return jsonify({'error': 'Speaker ID not enabled'}), 400

    speakers = []
    for name, profile in speaker_inference.profiles.items():
        speakers.append({
            'name': name,
            'enrolled': profile.get('enrolled', 0)
        })

    return jsonify({
        'speakers': speakers,
        'total': len(speakers)
    })


@app.route('/process-enhanced', methods=['POST'])
def process_enhanced():
    """
    Enhanced processing with speaker ID and wake word context
    """
    if 'audio' not in request.files:
        return jsonify({'error': 'No audio file'}), 400

    wake_word = request.form.get('wake_word', 'unknown')

    audio_file = request.files['audio']

    with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as temp:
        audio_file.save(temp.name)
        temp_path = temp.name

    try:
        # Identify speaker (if enabled)
        speaker = 'unknown'
        if speaker_id_enabled and speaker_inference:
            speaker = speaker_inference.identify_speaker(temp_path)
            print(f"Identified speaker: {speaker}")

        # Transcribe
        text = transcribe_audio(temp_path)
        if not text:
            return jsonify({'error': 'Transcription failed'}), 500

        print(f"[{speaker}] via [{wake_word}]: {text}")

        # Get wake word config
        config = WAKE_WORD_CONFIGS.get(wake_word, {})
        context = config.get('context', 'general')

        # Process based on context and speaker
        response = f"Heard via {wake_word}: {text}"

        return jsonify({
            'success': True,
            'transcription': text,
            'speaker': speaker,
            'wake_word': wake_word,
            'context': context,
            'response': response
        })

    finally:
        if os.path.exists(temp_path):
            os.remove(temp_path)


def main():
    parser = argparse.ArgumentParser(
        description="Enhanced Voice Server with Multi-Wake-Word and Speaker ID"
    )
    parser.add_argument('--host', default=DEFAULT_HOST)
    parser.add_argument('--port', type=int, default=DEFAULT_PORT)
    parser.add_argument('--whisper-model', default=DEFAULT_WHISPER_MODEL)
    parser.add_argument('--ha-url', default=DEFAULT_HA_URL)
    parser.add_argument('--ha-token', default=DEFAULT_HA_TOKEN)
    parser.add_argument('--enable-precise', action='store_true',
                        help='Enable wake word detection')
    parser.add_argument('--multi-wake-word', action='store_true',
                        help='Enable multiple wake words')
    parser.add_argument('--precise-engine', default=DEFAULT_PRECISE_ENGINE)
    parser.add_argument('--enable-speaker-id', action='store_true',
                        help='Enable speaker identification')
    parser.add_argument('--hf-token', default=DEFAULT_HF_TOKEN,
                        help='HuggingFace token for speaker ID')

    args = parser.parse_args()

    # Initialize HA client
    global ha_client
    ha_client = HomeAssistantClient(args.ha_url, args.ha_token)

    # Load Whisper
    print(f"Starting enhanced voice server on {args.host}:{args.port}")
    load_whisper_model(args.whisper_model)

    # Start Precise (multiple wake words)
    if args.enable_precise:
        if not PRECISE_AVAILABLE:
            print("Error: Precise not available")
            sys.exit(1)

        # Enable all or just first wake word
        if args.multi_wake_word:
            # Enable all configured wake words
            enabled_count = sum(1 for c in WAKE_WORD_CONFIGS.values() if c.get('enabled'))
            print(f"\nStarting {enabled_count} wake words...")
        else:
            # Enable only first wake word
            first_key = list(WAKE_WORD_CONFIGS.keys())[0]
            WAKE_WORD_CONFIGS[first_key]['enabled'] = True
            for key in list(WAKE_WORD_CONFIGS.keys())[1:]:
                WAKE_WORD_CONFIGS[key]['enabled'] = False

        if not start_multiple_wake_words(WAKE_WORD_CONFIGS, args.precise_engine):
            print("Error: No wake words started")
            sys.exit(1)

    # Initialize speaker ID
    if args.enable_speaker_id:
        if not args.hf_token:
            print("Error: --hf-token required for speaker ID")
            sys.exit(1)

        if not init_speaker_identification(args.hf_token):
            print("Warning: Speaker ID initialization failed")

    # Start server
    try:
        print("\n" + "="*50)
        print("Server ready!")
        print("="*50 + "\n")
        app.run(host=args.host, port=args.port, debug=False)
    except KeyboardInterrupt:
        print("\nShutting down...")
        stop_all_wake_words()
        sys.exit(0)


if __name__ == '__main__':
    main()