
# Maix Duino Voice Assistant - System Architecture
## Overview
A local voice assistant built on the Sipeed Maix Duino board and integrated with Home Assistant, leveraging existing home-lab infrastructure for AI processing.
## Hardware Components
### Maix Duino Board
- **Processor**: K210 dual-core RISC-V @ 400MHz
- **AI Accelerator**: KPU for neural network inference
- **Audio**: I2S microphone + speaker output
- **Connectivity**: ESP32 for WiFi/BLE
- **Programming**: MaixPy (MicroPython)
### Recommended Accessories
- I2S MEMS microphone (or microphone array for better pickup)
- Small speaker (3-5W) or audio output to existing speakers
- USB-C power supply (5V/2A minimum)
## Software Architecture
### Edge Layer (Maix Duino)
```
┌─────────────────────────────────────┐
│ Maix Duino (MaixPy) │
├─────────────────────────────────────┤
│ • Wake Word Detection (KPU) │
│ • Audio Capture (I2S) │
│ • Audio Streaming → Heimdall │
│ • Audio Playback ← Heimdall │
│ • LED Feedback (listening status) │
└─────────────────────────────────────┘
↕ WiFi/HTTP/WebSocket
┌─────────────────────────────────────┐
│ Voice Processing Server │
│ (Heimdall - 10.1.10.71) │
├─────────────────────────────────────┤
│ • Whisper STT (existing setup!) │
│ • Intent Recognition (Rasa/custom) │
│ • Piper TTS │
│ • Home Assistant API Client │
└─────────────────────────────────────┘
↕ REST API/MQTT
┌─────────────────────────────────────┐
│ Home Assistant │
│ (Your HA instance) │
├─────────────────────────────────────┤
│ • Device Control │
│ • State Management │
│ • Automation Triggers │
└─────────────────────────────────────┘
```
## Communication Flow
### 1. Wake Word Detection (Local)
```
User says "Hey Assistant"
Maix Duino KPU detects wake word
LED turns on (listening mode)
Start audio streaming to Heimdall
```
### 2. Speech Processing (Heimdall)
```
Audio stream received
Whisper transcribes to text
Intent parser extracts command
Query Home Assistant API
Generate response text
Piper TTS creates audio
Stream audio back to Maix Duino
```
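On the server side, the pipeline above reduces to a chain of four calls. A minimal sketch, with the Whisper/intent/HA/Piper stages passed in as plain callables (the names and signatures here are illustrative, not the actual project API):

```python
def handle_utterance(audio, stt, parse_intent, call_ha, tts):
    """Run one voice command through the full server pipeline.

    audio        -- raw audio bytes streamed from the Maix Duino
    stt          -- callable: audio bytes -> transcript string (e.g. Whisper)
    parse_intent -- callable: transcript -> intent dict, or None if unrecognized
    call_ha      -- callable: intent dict -> response text (Home Assistant)
    tts          -- callable: response text -> audio bytes (e.g. Piper)
    """
    text = stt(audio)
    intent = parse_intent(text)
    if intent is None:
        # Unrecognized command: still return speakable audio to the device.
        return tts("Sorry, I didn't understand that.")
    reply = call_ha(intent)
    return tts(reply)
```

Keeping the stages as injected callables makes each one swappable (e.g. pattern matching vs. Rasa for intents) and testable with stubs.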
### 3. Playback & Feedback
```
Receive audio stream
Play through speaker
LED indicates completion
Return to wake word detection
```
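The three flows above amount to a small state machine on the device. A hedged, host-Python sketch (the real firmware would wire these transitions to MaixPy's KPU and I2S APIs, which are not shown here; the event names are assumptions for illustration):

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()       # running wake word detection on the KPU
    LISTENING = auto()  # wake word heard, streaming audio to Heimdall
    PLAYING = auto()    # playing the TTS response from Heimdall

def step(state, event):
    """Advance the device state machine on one event.

    Events (illustrative): 'wake' (KPU detected the wake word),
    'response' (TTS audio arrived), 'done' (playback finished),
    'timeout' (no speech or a server error).
    """
    transitions = {
        (State.IDLE, "wake"): State.LISTENING,     # LED on
        (State.LISTENING, "response"): State.PLAYING,
        (State.LISTENING, "timeout"): State.IDLE,  # fall back to detection
        (State.PLAYING, "done"): State.IDLE,       # LED off, resume detection
    }
    return transitions.get((state, event), state)  # ignore irrelevant events
```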
## Network Configuration
### Maix Duino Network Settings
- **IP**: 10.1.10.xxx (assign static via DHCP reservation)
- **Gateway**: 10.1.10.1
- **DNS**: 10.1.10.4 (Pi-hole)
### Service Endpoints
- **Voice Processing Server**: http://10.1.10.71:5000
- **Home Assistant**: (your existing HA URL)
- **MQTT Broker**: (optional, if using MQTT)
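These endpoints are best kept out of source and loaded from the environment, mirroring the `config/.env.example` pattern (the variable names below are illustrative, not the actual template's):

```python
import os

# Defaults match the addresses documented above; override via environment.
VOICE_SERVER_URL = os.environ.get("VOICE_SERVER_URL", "http://10.1.10.71:5000")
HA_URL = os.environ.get("HA_URL", "")    # your existing HA URL
HA_TOKEN = os.environ.get("HA_TOKEN", "")  # long-lived token; never hardcode
```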
### Caddy Reverse Proxy Entry
Add to `/mnt/project/epona_-_Caddyfile`:
```caddy
# Voice Assistant API
handle /voice-assistant* {
    uri strip_prefix /voice-assistant
    reverse_proxy http://10.1.10.71:5000
}
```
## Software Stack
### Maix Duino (MaixPy)
- **Firmware**: Latest MaixPy release
- **Libraries**:
  - `Maix.KPU` - Neural network inference
  - `Maix.I2S` - Audio capture/playback
  - `socket` - Network communication
  - `ujson` - JSON handling
### Heimdall Server (Python)
- **Environment**: Create a new conda env:
```bash
conda create -n voice-assistant python=3.10
conda activate voice-assistant
```
- **Dependencies**:
  - `openai-whisper` (already installed!)
  - `piper-tts` - Text-to-speech
  - `flask` - REST API server
  - `requests` - HTTP client
  - `pyaudio` - Audio handling
  - `websockets` - Real-time streaming
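Inside the activated environment, the dependencies above can be installed in one go (package names as listed; `piper-tts` is the assumed PyPI name):

```shell
pip install openai-whisper piper-tts flask requests pyaudio websockets
```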
### Optional: Intent Recognition
- **Rasa** - Full NLU framework (heavier but powerful)
- **Simple pattern matching** - Lightweight, start here
- **LLM-based** - Use your existing LLM setup on Heimdall
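The "simple pattern matching" option is surprisingly serviceable as a starting point. A minimal sketch (the patterns and entity-naming convention are examples, not a fixed grammar):

```python
import re

# Each rule: compiled pattern -> function building an intent dict from the match.
RULES = [
    (re.compile(r"turn (on|off) the (.+)"),
     lambda m: {"action": f"turn_{m.group(1)}",
                "entity": "light." + m.group(2).strip().replace(" ", "_")}),
    (re.compile(r"what'?s the (.+)"),
     lambda m: {"action": "get_state",
                "entity": "sensor." + m.group(1).strip().replace(" ", "_")}),
]

def parse_intent(text):
    """Return an intent dict for the first matching rule, else None."""
    text = text.lower().strip()
    for pattern, build in RULES:
        m = pattern.match(text)
        if m:
            return build(m)
    return None
```

A real deployment would map spoken names to actual entity IDs (e.g. via a lookup table) rather than deriving them mechanically as done here.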
## Data Flow Examples
### Example 1: Turn on lights
```
User: "Hey Assistant, turn on the living room lights"
Wake word detected → Start recording
Whisper STT: "turn on the living room lights"
Intent Parser: {
  "action": "turn_on",
  "entity": "light.living_room"
}
Home Assistant API:
  POST /api/services/light/turn_on
  {"entity_id": "light.living_room"}
Response: "Living room lights turned on"
Piper TTS → Audio playback
```
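The Home Assistant call in this example maps onto HA's standard REST API. A sketch of the request construction (separated from the actual HTTP send so it can be tested; authentication uses a long-lived access token, as noted under Security):

```python
import json

def build_service_call(ha_url, token, intent):
    """Build (url, headers, body) for a HA service call like the one above.

    Maps an intent such as {"action": "turn_on", "entity": "light.living_room"}
    to POST /api/services/light/turn_on with {"entity_id": ...}.
    """
    domain = intent["entity"].split(".", 1)[0]  # e.g. "light"
    url = f"{ha_url}/api/services/{domain}/{intent['action']}"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"entity_id": intent["entity"]})
    return url, headers, body
```

The returned triple can then be handed to `requests.post(url, headers=headers, data=body)`.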
### Example 2: Get status
```
User: "What's the temperature?"
Whisper STT: "what's the temperature"
Intent Parser: {
  "action": "get_state",
  "entity": "sensor.temperature"
}
Home Assistant API:
  GET /api/states/sensor.temperature
Response: "The temperature is 72 degrees"
Piper TTS → Audio playback
```
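Turning the raw state object into the spoken sentence in this example is just string templating. A hedged sketch (the input shape follows HA's standard `GET /api/states/<entity_id>` response; the phrasing and unit mapping are illustrative):

```python
def spoken_state(state):
    """Render a HA state dict as a sentence for TTS.

    Expects the shape returned by GET /api/states/<entity_id>, e.g.
    {"entity_id": "sensor.temperature", "state": "72",
     "attributes": {"friendly_name": "Temperature",
                    "unit_of_measurement": "°F"}}
    """
    attrs = state.get("attributes", {})
    name = attrs.get("friendly_name", state["entity_id"])
    unit = attrs.get("unit_of_measurement", "")
    # Speak common units instead of reading symbols aloud.
    unit = {"°F": "degrees", "°C": "degrees"}.get(unit, unit)
    return f"The {name.lower()} is {state['state']} {unit}".strip()
```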
## Phase 1 Implementation Plan
### Step 1: Maix Duino Setup (Week 1)
- [ ] Flash latest MaixPy firmware
- [ ] Test audio input/output
- [ ] Implement basic network communication
- [ ] Test streaming audio to server
### Step 2: Server Setup (Week 1-2)
- [ ] Create conda environment on Heimdall
- [ ] Set up Flask API server
- [ ] Integrate Whisper (already have this!)
- [ ] Install and test Piper TTS
- [ ] Create basic Home Assistant API client
### Step 3: Wake Word Training (Week 2)
- [ ] Record wake word samples
- [ ] Train custom wake word model
- [ ] Convert model for K210 KPU
- [ ] Test on-device detection
### Step 4: Integration (Week 3)
- [ ] Connect all components
- [ ] Test end-to-end flow
- [ ] Add error handling
- [ ] Implement fallbacks
### Step 5: Enhancement (Week 4+)
- [ ] Add more intents
- [ ] Improve NLU accuracy
- [ ] Add multi-room support
- [ ] Implement conversation context
## Development Tools
### Testing Wake Word
```bash
# Use existing diarization.py for testing audio quality
python3 /path/to/diarization.py test_audio.wav \
    --format vtt \
    --model medium
```
### Monitoring
- Heimdall logs: `/var/log/voice-assistant/`
- Maix Duino serial console: 115200 baud
- Home Assistant logs: Standard HA logging
## Security Considerations
1. **No external cloud services** - Everything local
2. **Network isolation** - Keep on 10.1.10.0/24
3. **Authentication** - Use HA long-lived tokens
4. **Rate limiting** - Prevent abuse
5. **Audio privacy** - Only stream after wake word
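Rate limiting (point 4) can be a simple token bucket in front of the Flask endpoint. A minimal sketch (the capacity and refill rate are arbitrary example values; a clock parameter is included so it can be tested deterministically):

```python
import time

class TokenBucket:
    """Allow at most `capacity` requests per burst, refilled at `rate`/second."""

    def __init__(self, capacity=5, rate=1.0, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```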
## Resource Requirements
### Heimdall
- **CPU**: Minimal (< 5% idle, spikes during STT)
- **RAM**: ~2GB for Whisper medium model
- **Storage**: ~5GB for models
- **Network**: Low bandwidth (16kHz audio stream)
### Maix Duino
- **Power**: ~1-2W typical
- **Storage**: 16MB flash (plenty for wake word model)
- **RAM**: 8MB SRAM (sufficient for audio buffering)
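The bandwidth and buffering figures above check out with simple arithmetic (assuming 16-bit mono PCM at 16 kHz, which is an assumption of this estimate, not a spec from the firmware):

```python
SAMPLE_RATE = 16_000   # Hz, as stated above
BYTES_PER_SAMPLE = 2   # 16-bit mono PCM

bytes_per_second = SAMPLE_RATE * BYTES_PER_SAMPLE      # 32,000 B/s ~ 256 kbit/s
seconds_in_8mb = (8 * 1024 * 1024) / bytes_per_second  # audio that fits in SRAM

print(f"{bytes_per_second} B/s, {seconds_in_8mb:.0f} s of audio in 8 MB")
```

So even a few seconds of pre-roll buffering uses well under 1% of the board's SRAM, and the stream is trivial for a home LAN.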
## Alternative Architectures
### Option A: Fully On-Device (Limited)
- Everything on Maix Duino
- Very limited vocabulary
- No internet required
- Lower accuracy
### Option B: Hybrid (Recommended)
- Wake word on Maix Duino
- Processing on Heimdall
- Best balance of speed/accuracy
### Option C: Raspberry Pi Alternative
- If K210 proves limiting
- More processing power
- Still local/FOSS
- Higher cost
## Expansion Ideas
### Future Enhancements
1. **Multi-room**: Deploy multiple Maix Duino units
2. **Music playback**: Integrate with Plex
3. **Timers/Reminders**: Local scheduling
4. **Weather**: Pull from local weather station
5. **Calendar**: Sync with Nextcloud
6. **Intercom**: Room-to-room communication
7. **Sound events**: Doorbell, smoke alarm detection
### Integration with Existing Infrastructure
- **Plex**: Voice control for media playback
- **qBittorrent**: Status queries, torrent management
- **Nextcloud**: Calendar/contact queries
- **Matrix**: Send messages via voice
## Cost Estimate
- Maix Duino board: ~$20-30 (already have!)
- Microphone: ~$5-10 (if not included)
- Speaker: ~$10-15 (or use existing)
- **Total**: $0-55 (most parts already on hand)
Compare to commercial solutions:
- Google Home Mini: $50 (requires cloud)
- Amazon Echo Dot: $50 (requires cloud)
- Apple HomePod Mini: $99 (requires cloud)
## Success Criteria
### Minimum Viable Product (MVP)
- Wake word detection in < 1 second
- Speech-to-text accuracy > 90%
- Home Assistant command execution
- Total response time < 3 seconds
- All processing local (no cloud)
### Enhanced Version
- Multi-intent conversations
- Context awareness
- Multiple wake words
- Room-aware responses
- Custom voice training
## Resources & Documentation
### Official Documentation
- Maix Duino: https://wiki.sipeed.com/hardware/en/maix/
- MaixPy: https://maixpy.sipeed.com/
- Home Assistant API: https://developers.home-assistant.io/
### Wake Word Tools
- Mycroft Precise: https://github.com/MycroftAI/mycroft-precise
- Porcupine: https://github.com/Picovoice/porcupine
### TTS Options
- Piper: https://github.com/rhasspy/piper
- Coqui TTS: https://github.com/coqui-ai/TTS
### Community Projects
- Rhasspy: https://rhasspy.readthedocs.io/ (full voice assistant framework)
- Willow: https://github.com/toverainc/willow (ESP32-based alternative)
## Next Steps
1. **Test current setup**: Verify Maix Duino boots and can connect to WiFi
2. **Audio test**: Record and playback test on the board
3. **Server setup**: Create conda environment and install dependencies
4. **Simple prototype**: Wake word beep (no processing yet)
5. **Iterate**: Add complexity step by step