# Maix Duino Voice Assistant - System Architecture

## Overview

Local voice assistant using a Sipeed Maix Duino board integrated with Home Assistant, leveraging existing home lab infrastructure for AI processing.

## Hardware Components

### Maix Duino Board

- **Processor**: K210 dual-core RISC-V @ 400MHz
- **AI Accelerator**: KPU for neural network inference
- **Audio**: I2S microphone input + speaker output
- **Connectivity**: ESP32 module for WiFi/BLE
- **Programming**: MaixPy (MicroPython)

### Recommended Accessories

- I2S MEMS microphone (or a microphone array for better pickup)
- Small speaker (3-5W) or audio output to existing speakers
- USB-C power supply (5V/2A minimum)
## Software Architecture

### Edge Layer (Maix Duino)

```
┌─────────────────────────────────────┐
│ Maix Duino (MaixPy)                 │
├─────────────────────────────────────┤
│ • Wake Word Detection (KPU)         │
│ • Audio Capture (I2S)               │
│ • Audio Streaming → Heimdall        │
│ • Audio Playback ← Heimdall         │
│ • LED Feedback (listening status)   │
└─────────────────────────────────────┘
          ↕ WiFi/HTTP/WebSocket
┌─────────────────────────────────────┐
│ Voice Processing Server             │
│ (Heimdall - 10.1.10.71)             │
├─────────────────────────────────────┤
│ • Whisper STT (existing setup!)     │
│ • Intent Recognition (Rasa/custom)  │
│ • Piper TTS                         │
│ • Home Assistant API Client         │
└─────────────────────────────────────┘
          ↕ REST API/MQTT
┌─────────────────────────────────────┐
│ Home Assistant                      │
│ (Your HA instance)                  │
├─────────────────────────────────────┤
│ • Device Control                    │
│ • State Management                  │
│ • Automation Triggers               │
└─────────────────────────────────────┘
```
## Communication Flow

### 1. Wake Word Detection (Local)

```
User says "Hey Assistant"
        ↓
Maix Duino KPU detects wake word
        ↓
LED turns on (listening mode)
        ↓
Start audio streaming to Heimdall
```

### 2. Speech Processing (Heimdall)

```
Audio stream received
        ↓
Whisper transcribes to text
        ↓
Intent parser extracts command
        ↓
Query Home Assistant API
        ↓
Generate response text
        ↓
Piper TTS creates audio
        ↓
Stream audio back to Maix Duino
```
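
The server-side steps above can be sketched as one handler. Every stage below is a stub with a hypothetical name (`transcribe`, `parse_intent`, `call_home_assistant`, `synthesize`) standing in for Whisper, the intent parser, the HA client, and Piper; this is not the actual `voice_server.py` code, just the shape of the pipeline.

```python
# Sketch of the Heimdall processing pipeline. Each stage is a placeholder
# returning canned data so the control flow can be seen end to end.

def transcribe(audio: bytes) -> str:
    # Placeholder for Whisper STT.
    return "turn on the living room lights"

def parse_intent(text: str) -> dict:
    # Placeholder for the intent parser.
    return {"action": "turn_on", "entity": "light.living_room"}

def call_home_assistant(intent: dict) -> str:
    # Placeholder for the HA API client; returns the response text.
    return "Living room lights turned on"

def synthesize(text: str) -> bytes:
    # Placeholder for Piper TTS; would return WAV/PCM audio.
    return text.encode()

def handle_utterance(audio: bytes) -> bytes:
    """Run the full pipeline: audio in, spoken-response audio out."""
    text = transcribe(audio)
    intent = parse_intent(text)
    reply = call_home_assistant(intent)
    return synthesize(reply)

response = handle_utterance(b"\x00" * 320)  # audio bytes of the reply
```

In the real server each stub becomes a call into the corresponding library, but the data handed between stages (bytes → text → intent dict → text → bytes) stays the same.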

### 3. Playback & Feedback

```
Receive audio stream
        ↓
Play through speaker
        ↓
LED indicates completion
        ↓
Return to wake word detection
```
## Network Configuration

### Maix Duino Network Settings

- **IP**: 10.1.10.xxx (assign static via DHCP reservation)
- **Gateway**: 10.1.10.1
- **DNS**: 10.1.10.4 (Pi-hole)

### Service Endpoints

- **Voice Processing Server**: http://10.1.10.71:5000
- **Home Assistant**: (your existing HA URL)
- **MQTT Broker**: (optional, if using MQTT)

### Caddy Reverse Proxy Entry

Add to `/mnt/project/epona_-_Caddyfile`:

```caddy
# Voice Assistant API
handle /voice-assistant* {
    uri strip_prefix /voice-assistant
    reverse_proxy http://10.1.10.71:5000
}
```
## Software Stack

### Maix Duino (MaixPy)

- **Firmware**: Latest MaixPy release
- **Libraries**:
  - `Maix.KPU` - Neural network inference
  - `Maix.I2S` - Audio capture/playback
  - `socket` - Network communication
  - `ujson` - JSON handling
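
As a rough illustration of what the edge client sends over the socket, the helper below frames 16-bit, 16 kHz mono PCM into fixed-duration chunks. It is plain Python for clarity; the on-device version would read from `Maix.I2S` and write to a socket, and this sketch makes no claims about that API's exact signatures.

```python
# Frame raw PCM into fixed-duration chunks for network streaming.
# Assumes 16 kHz sample rate, 16-bit (2-byte) mono samples.

SAMPLE_RATE = 16000
BYTES_PER_SAMPLE = 2

def chunk_pcm(pcm: bytes, chunk_ms: int = 20) -> list[bytes]:
    """Split a PCM buffer into chunk_ms frames (last frame may be short)."""
    chunk_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    return [pcm[i:i + chunk_bytes] for i in range(0, len(pcm), chunk_bytes)]

# One second of silence -> fifty 20 ms frames of 640 bytes each.
frames = chunk_pcm(b"\x00" * (SAMPLE_RATE * BYTES_PER_SAMPLE))
```

Small fixed-size frames keep latency low and let the server start Whisper transcription before the utterance ends.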

### Heimdall Server (Python)

- **Environment**: Create a new conda env

```bash
conda create -n voice-assistant python=3.10
conda activate voice-assistant
```

- **Dependencies**:
  - `openai-whisper` (already installed!)
  - `piper-tts` - Text-to-speech
  - `flask` - REST API server
  - `requests` - HTTP client
  - `pyaudio` - Audio handling
  - `websockets` - Real-time streaming

### Optional: Intent Recognition

- **Rasa** - Full NLU framework (heavier but powerful)
- **Simple pattern matching** - Lightweight, start here
- **LLM-based** - Use your existing LLM setup on Heimdall

## Data Flow Examples

### Example 1: Turn on lights

```
User: "Hey Assistant, turn on the living room lights"
        ↓
Wake word detected → Start recording
        ↓
Whisper STT: "turn on the living room lights"
        ↓
Intent Parser: {
  "action": "turn_on",
  "entity": "light.living_room"
}
        ↓
Home Assistant API:
  POST /api/services/light/turn_on
  {"entity_id": "light.living_room"}
        ↓
Response: "Living room lights turned on"
        ↓
Piper TTS → Audio playback
```
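
The Home Assistant call in the example uses the standard REST service endpoint. The helper below only *builds* the request so it can be inspected without a live HA instance; the base URL and token are placeholders for your instance's values.

```python
# Build a Home Assistant service-call request without sending it.

HA_URL = "http://homeassistant.local:8123"  # placeholder base URL

def build_service_call(domain: str, service: str, entity_id: str, token: str):
    """Return (url, headers, payload) for POST /api/services/<domain>/<service>."""
    url = f"{HA_URL}/api/services/{domain}/{service}"
    headers = {
        "Authorization": f"Bearer {token}",  # HA long-lived access token
        "Content-Type": "application/json",
    }
    return url, headers, {"entity_id": entity_id}

url, headers, payload = build_service_call(
    "light", "turn_on", "light.living_room", "YOUR_TOKEN"
)
# then: requests.post(url, headers=headers, json=payload)
```

Keeping request construction separate from the `requests.post` call also makes the client trivially unit-testable.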

### Example 2: Get status

```
User: "What's the temperature?"
        ↓
Whisper STT: "what's the temperature"
        ↓
Intent Parser: {
  "action": "get_state",
  "entity": "sensor.temperature"
}
        ↓
Home Assistant API:
  GET /api/states/sensor.temperature
        ↓
Response: "The temperature is 72 degrees"
        ↓
Piper TTS → Audio playback
```
## Phase 1 Implementation Plan

### Step 1: Maix Duino Setup (Week 1)

- [ ] Flash latest MaixPy firmware
- [ ] Test audio input/output
- [ ] Implement basic network communication
- [ ] Test streaming audio to server

### Step 2: Server Setup (Week 1-2)

- [ ] Create conda environment on Heimdall
- [ ] Set up Flask API server
- [ ] Integrate Whisper (already have this!)
- [ ] Install and test Piper TTS
- [ ] Create basic Home Assistant API client

### Step 3: Wake Word Training (Week 2)

- [ ] Record wake word samples
- [ ] Train custom wake word model
- [ ] Convert model for K210 KPU
- [ ] Test on-device detection

### Step 4: Integration (Week 3)

- [ ] Connect all components
- [ ] Test end-to-end flow
- [ ] Add error handling
- [ ] Implement fallbacks

### Step 5: Enhancement (Week 4+)

- [ ] Add more intents
- [ ] Improve NLU accuracy
- [ ] Add multi-room support
- [ ] Implement conversation context
## Development Tools

### Testing Wake Word

```bash
# Use existing diarization.py for testing audio quality
python3 /path/to/diarization.py test_audio.wav \
    --format vtt \
    --model medium
```

### Monitoring

- Heimdall logs: `/var/log/voice-assistant/`
- Maix Duino serial console: 115200 baud
- Home Assistant logs: Standard HA logging
## Security Considerations

1. **No external cloud services** - Everything local
2. **Network isolation** - Keep on 10.1.10.0/24
3. **Authentication** - Use HA long-lived tokens
4. **Rate limiting** - Prevent abuse
5. **Audio privacy** - Only stream after wake word
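
Rate limiting (item 4) can be as simple as a token bucket checked at the top of the Flask endpoint. This standalone sketch is not from the ported scripts; it allows a burst of `capacity` requests, refilled at `rate` tokens per second.

```python
import time

class TokenBucket:
    """Allow `capacity` burst requests, refilled at `rate` tokens/second."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill based on elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=2.0, capacity=5)  # 5-request burst, 2 req/s sustained
# In the endpoint: if not bucket.allow(): return "Too Many Requests", 429
```

A single in-process bucket is enough here since there is one server and a handful of edge devices on a trusted VLAN.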

## Resource Requirements

### Heimdall

- **CPU**: Minimal (< 5% idle, spikes during STT)
- **RAM**: ~2GB for Whisper medium model
- **Storage**: ~5GB for models
- **Network**: Low bandwidth (16kHz audio stream)
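
The bandwidth claim is easy to check with back-of-envelope arithmetic: uncompressed 16 kHz, 16-bit mono PCM is a fixed 32,000 bytes per second.

```python
# Back-of-envelope bandwidth for the raw audio stream.
sample_rate = 16_000   # Hz
bytes_per_sample = 2   # 16-bit mono PCM
bytes_per_second = sample_rate * bytes_per_sample  # 32000 B/s, about 31.25 KiB/s
```

That is trivial for a wired or WiFi home LAN, and only flows while a device is actively streaming after a wake word.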

### Maix Duino

- **Power**: ~1-2W typical
- **Storage**: 16MB flash (plenty for wake word model)
- **RAM**: 8MB SRAM (sufficient for audio buffering)
## Alternative Architectures

### Option A: Fully On-Device (Limited)

- Everything on Maix Duino
- Very limited vocabulary
- No internet required
- Lower accuracy

### Option B: Hybrid (Recommended)

- Wake word on Maix Duino
- Processing on Heimdall
- Best balance of speed/accuracy

### Option C: Raspberry Pi Alternative

- If K210 proves limiting
- More processing power
- Still local/FOSS
- Higher cost
## Expansion Ideas

### Future Enhancements

1. **Multi-room**: Deploy multiple Maix Duino units
2. **Music playback**: Integrate with Plex
3. **Timers/Reminders**: Local scheduling
4. **Weather**: Pull from local weather station
5. **Calendar**: Sync with Nextcloud
6. **Intercom**: Room-to-room communication
7. **Sound events**: Doorbell, smoke alarm detection

### Integration with Existing Infrastructure

- **Plex**: Voice control for media playback
- **qBittorrent**: Status queries, torrent management
- **Nextcloud**: Calendar/contact queries
- **Matrix**: Send messages via voice
## Cost Estimate

- Maix Duino board: ~$20-30 (already have!)
- Microphone: ~$5-10 (if not included)
- Speaker: ~$10-15 (or use existing)
- **Total**: $0-55 (mostly already have)

Compare to commercial solutions:

- Google Home Mini: $50 (requires cloud)
- Amazon Echo Dot: $50 (requires cloud)
- Apple HomePod Mini: $99 (requires cloud)
## Success Criteria

### Minimum Viable Product (MVP)

- ✓ Wake word detection < 1 second
- ✓ Speech-to-text accuracy > 90%
- ✓ Home Assistant command execution
- ✓ Response time < 3 seconds total
- ✓ All processing local (no cloud)

### Enhanced Version

- ✓ Multi-intent conversations
- ✓ Context awareness
- ✓ Multiple wake words
- ✓ Room-aware responses
- ✓ Custom voice training
## Resources & Documentation

### Official Documentation

- Maix Duino: https://wiki.sipeed.com/hardware/en/maix/
- MaixPy: https://maixpy.sipeed.com/
- Home Assistant API: https://developers.home-assistant.io/

### Wake Word Tools

- Mycroft Precise: https://github.com/MycroftAI/mycroft-precise
- Porcupine: https://github.com/Picovoice/porcupine

### TTS Options

- Piper: https://github.com/rhasspy/piper
- Coqui TTS: https://github.com/coqui-ai/TTS

### Community Projects

- Rhasspy: https://rhasspy.readthedocs.io/ (full voice assistant framework)
- Willow: https://github.com/toverainc/willow (ESP32-based alternative)
## Next Steps

1. **Test current setup**: Verify the Maix Duino boots and can connect to WiFi
2. **Audio test**: Record and playback test on the board
3. **Server setup**: Create conda environment and install dependencies
4. **Simple prototype**: Wake word → beep (no processing yet)
5. **Iterate**: Add complexity step by step
|