Ports prior voice assistant research and prototypes from devl/Devops into the Minerva repo. Includes:
- docs/: architecture, wake word guides, ESP32-S3 spec, hardware buying guide
- scripts/: voice_server.py, voice_server_enhanced.py, setup scripts
- hardware/maixduino/: edge device scripts with WiFi credentials scrubbed (replaced hardcoded password with secrets.py pattern)
- config/.env.example: server config template
- .gitignore: excludes .env, secrets.py, model blobs, ELF firmware
- CLAUDE.md: Minerva product context and connection to cf-voice roadmap
Maix Duino Voice Assistant - System Architecture
Overview
A local voice assistant built on the Sipeed Maix Duino board, integrated with Home Assistant and leveraging existing home lab infrastructure for AI processing.
Hardware Components
Maix Duino Board
- Processor: K210 dual-core RISC-V @ 400MHz
- AI Accelerator: KPU for neural network inference
- Audio: I2S microphone + speaker output
- Connectivity: ESP32 for WiFi/BLE
- Programming: MaixPy (MicroPython)
Recommended Accessories
- I2S MEMS microphone (or microphone array for better pickup)
- Small speaker (3-5W) or audio output to existing speakers
- USB-C power supply (5V/2A minimum)
Software Architecture
Edge Layer (Maix Duino)
┌─────────────────────────────────────┐
│ Maix Duino (MaixPy) │
├─────────────────────────────────────┤
│ • Wake Word Detection (KPU) │
│ • Audio Capture (I2S) │
│ • Audio Streaming → Heimdall │
│ • Audio Playback ← Heimdall │
│ • LED Feedback (listening status) │
└─────────────────────────────────────┘
↕ WiFi/HTTP/WebSocket
┌─────────────────────────────────────┐
│ Voice Processing Server │
│ (Heimdall - 10.1.10.71) │
├─────────────────────────────────────┤
│ • Whisper STT (existing setup!) │
│ • Intent Recognition (Rasa/custom) │
│ • Piper TTS │
│ • Home Assistant API Client │
└─────────────────────────────────────┘
↕ REST API/MQTT
┌─────────────────────────────────────┐
│ Home Assistant │
│ (Your HA instance) │
├─────────────────────────────────────┤
│ • Device Control │
│ • State Management │
│ • Automation Triggers │
└─────────────────────────────────────┘
Communication Flow
1. Wake Word Detection (Local)
User says "Hey Assistant"
↓
Maix Duino KPU detects wake word
↓
LED turns on (listening mode)
↓
Start audio streaming to Heimdall
2. Speech Processing (Heimdall)
Audio stream received
↓
Whisper transcribes to text
↓
Intent parser extracts command
↓
Query Home Assistant API
↓
Generate response text
↓
Piper TTS creates audio
↓
Stream audio back to Maix Duino
3. Playback & Feedback
Receive audio stream
↓
Play through speaker
↓
LED indicates completion
↓
Return to wake word detection
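The "audio streaming" steps above need a framing convention so the server can split the incoming TCP byte stream back into discrete PCM chunks. The sketch below assumes a hypothetical length-prefixed wire format (this document does not specify one); it uses only struct, which is available both in CPython on Heimdall and in MicroPython on the Maix Duino.

```python
import struct

# Hypothetical wire format for the audio streaming step: each frame is a
# 4-byte big-endian length prefix followed by raw PCM bytes.

def pack_frame(pcm_chunk: bytes) -> bytes:
    """Prefix a PCM chunk with its length so the server can reframe the stream."""
    return struct.pack(">I", len(pcm_chunk)) + pcm_chunk

def unpack_frames(buffer: bytes):
    """Extract complete PCM chunks from a receive buffer.

    Returns (chunks, leftover) where leftover is the tail of a partial frame
    that should be kept until more data arrives on the socket.
    """
    chunks = []
    while len(buffer) >= 4:
        (length,) = struct.unpack(">I", buffer[:4])
        if len(buffer) < 4 + length:
            break  # incomplete frame; wait for more data
        chunks.append(buffer[4:4 + length])
        buffer = buffer[4 + length:]
    return chunks, buffer
```

The edge device would call pack_frame on each I2S capture buffer before writing it to the socket; the server accumulates received bytes and calls unpack_frames to recover the chunks.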
Network Configuration
Maix Duino Network Settings
- IP: 10.1.10.xxx (assign static via DHCP reservation)
- Gateway: 10.1.10.1
- DNS: 10.1.10.4 (Pi-hole)
Service Endpoints
- Voice Processing Server: http://10.1.10.71:5000
- Home Assistant: (your existing HA URL)
- MQTT Broker: (optional, if using MQTT)
Caddy Reverse Proxy Entry
Add to /mnt/project/epona_-_Caddyfile:
# Voice Assistant API
handle /voice-assistant* {
    uri strip_prefix /voice-assistant
    reverse_proxy http://10.1.10.71:5000
}
Software Stack
Maix Duino (MaixPy)
- Firmware: Latest MaixPy release
- Libraries:
  - Maix.KPU - Neural network inference
  - Maix.I2S - Audio capture/playback
  - socket - Network communication
  - ujson - JSON handling
Heimdall Server (Python)
- Environment: Create new conda env
  conda create -n voice-assistant python=3.10
  conda activate voice-assistant
- Dependencies:
  - openai-whisper (already installed!)
  - piper-tts - Text-to-speech
  - flask - REST API server
  - requests - HTTP client
  - pyaudio - Audio handling
  - websockets - Real-time streaming
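With the dependencies installed, the Flask API server can be sketched as a single transcription endpoint. This is a minimal skeleton, not the actual voice_server.py: the /transcribe route name is an assumption, and the Whisper call is stubbed out so the sketch runs without a model loaded.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def transcribe(audio_bytes: bytes) -> str:
    # Placeholder: the real server would call
    # whisper.load_model("medium").transcribe(...) on the received audio.
    return "stub transcript (%d bytes received)" % len(audio_bytes)

@app.route("/transcribe", methods=["POST"])
def transcribe_route():
    audio = request.get_data()
    if not audio:
        return jsonify({"error": "empty audio payload"}), 400
    return jsonify({"text": transcribe(audio)})

if __name__ == "__main__":
    # Bind on all interfaces so the Maix Duino can reach 10.1.10.71:5000.
    app.run(host="0.0.0.0", port=5000)
```

Intent parsing, the Home Assistant call, and Piper TTS would hang off this same app as further routes or as steps inside the handler.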
Optional: Intent Recognition
- Rasa - Full NLU framework (heavier but powerful)
- Simple pattern matching - Lightweight, start here
- LLM-based - Use your existing LLM setup on Heimdall
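The "start here" pattern-matching option can be sketched in a few lines. The entity IDs match the data flow examples later in this document; the regex patterns themselves are illustrative assumptions, to be extended one intent at a time.

```python
import re

# Minimal pattern-matching intent parser: a list of (regex, builder) pairs
# tried in order. Returns an intent dict like the ones in the examples below.
PATTERNS = [
    (re.compile(r"turn (on|off) the (.+?) lights?"),
     lambda m: {"action": "turn_%s" % m.group(1),
                "entity": "light." + m.group(2).replace(" ", "_")}),
    (re.compile(r"what'?s the temperature"),
     lambda m: {"action": "get_state", "entity": "sensor.temperature"}),
]

def parse_intent(text: str):
    """Return an intent dict for a transcript, or None if nothing matches."""
    text = text.lower().strip()
    for pattern, build in PATTERNS:
        match = pattern.search(text)
        if match:
            return build(match)
    return None
```

A None result is the natural hook for a fallback response ("Sorry, I didn't understand that") or for escalating to the LLM-based option.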
Data Flow Examples
Example 1: Turn on lights
User: "Hey Assistant, turn on the living room lights"
↓
Wake word detected → Start recording
↓
Whisper STT: "turn on the living room lights"
↓
Intent Parser: {
"action": "turn_on",
"entity": "light.living_room"
}
↓
Home Assistant API:
POST /api/services/light/turn_on
{"entity_id": "light.living_room"}
↓
Response: "Living room lights turned on"
↓
Piper TTS → Audio playback
Example 2: Get status
User: "What's the temperature?"
↓
Whisper STT: "what's the temperature"
↓
Intent Parser: {
"action": "get_state",
"entity": "sensor.temperature"
}
↓
Home Assistant API:
GET /api/states/sensor.temperature
↓
Response: "The temperature is 72 degrees"
↓
Piper TTS → Audio playback
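Both examples end in a Home Assistant REST API call: POST /api/services/&lt;domain&gt;/&lt;service&gt; for actions, GET /api/states/&lt;entity_id&gt; for queries, authenticated with a long-lived access token in a Bearer header. A small helper can translate an intent dict into the right request; the base URL and token in the usage comment are placeholders.

```python
def ha_request(base_url: str, token: str, intent: dict):
    """Translate an intent dict into a Home Assistant REST API request.

    Returns (method, url, headers, json_body) using HA's documented
    endpoints: POST /api/services/<domain>/<service> for actions and
    GET /api/states/<entity_id> for state queries.
    """
    headers = {
        "Authorization": "Bearer %s" % token,  # HA long-lived access token
        "Content-Type": "application/json",
    }
    if intent["action"] == "get_state":
        return ("GET", "%s/api/states/%s" % (base_url, intent["entity"]),
                headers, None)
    domain = intent["entity"].split(".")[0]  # e.g. "light" from "light.living_room"
    return ("POST", "%s/api/services/%s/%s" % (base_url, domain, intent["action"]),
            headers, {"entity_id": intent["entity"]})

# Sending the request (base URL and token are placeholders):
#   import requests
#   method, url, headers, body = ha_request("http://homeassistant.local:8123",
#                                           "YOUR_LONG_LIVED_TOKEN",
#                                           {"action": "turn_on",
#                                            "entity": "light.living_room"})
#   requests.request(method, url, headers=headers, json=body, timeout=5)
```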
Phase 1 Implementation Plan
Step 1: Maix Duino Setup (Week 1)
- Flash latest MaixPy firmware
- Test audio input/output
- Implement basic network communication
- Test streaming audio to server
Step 2: Server Setup (Week 1-2)
- Create conda environment on Heimdall
- Set up Flask API server
- Integrate Whisper (already have this!)
- Install and test Piper TTS
- Create basic Home Assistant API client
Step 3: Wake Word Training (Week 2)
- Record wake word samples
- Train custom wake word model
- Convert model for K210 KPU
- Test on-device detection
Step 4: Integration (Week 3)
- Connect all components
- Test end-to-end flow
- Add error handling
- Implement fallbacks
Step 5: Enhancement (Week 4+)
- Add more intents
- Improve NLU accuracy
- Add multi-room support
- Implement conversation context
Development Tools
Testing Wake Word
# Use existing diarization.py for testing audio quality
python3 /path/to/diarization.py test_audio.wav \
--format vtt \
--model medium
Monitoring
- Heimdall logs: /var/log/voice-assistant/
- Maix Duino serial console: 115200 baud
- Home Assistant logs: Standard HA logging
Security Considerations
- No external cloud services - Everything local
- Network isolation - Keep on 10.1.10.0/24
- Authentication - Use HA long-lived tokens
- Rate limiting - Prevent abuse
- Audio privacy - Only stream after wake word
Resource Requirements
Heimdall
- CPU: Minimal (< 5% idle, spikes during STT)
- RAM: ~2GB for Whisper medium model
- Storage: ~5GB for models
- Network: Low bandwidth (16kHz audio stream)
Maix Duino
- Power: ~1-2W typical
- Storage: 16MB flash (plenty for wake word model)
- RAM: 8MB SRAM (sufficient for audio buffering)
Alternative Architectures
Option A: Fully On-Device (Limited)
- Everything on Maix Duino
- Very limited vocabulary
- No internet required
- Lower accuracy
Option B: Hybrid (Recommended)
- Wake word on Maix Duino
- Processing on Heimdall
- Best balance of speed/accuracy
Option C: Raspberry Pi Alternative
- If K210 proves limiting
- More processing power
- Still local/FOSS
- Higher cost
Expansion Ideas
Future Enhancements
- Multi-room: Deploy multiple Maix Duino units
- Music playback: Integrate with Plex
- Timers/Reminders: Local scheduling
- Weather: Pull from local weather station
- Calendar: Sync with Nextcloud
- Intercom: Room-to-room communication
- Sound events: Doorbell, smoke alarm detection
Integration with Existing Infrastructure
- Plex: Voice control for media playback
- qBittorrent: Status queries, torrent management
- Nextcloud: Calendar/contact queries
- Matrix: Send messages via voice
Cost Estimate
- Maix Duino board: ~$20-30 (already have!)
- Microphone: ~$5-10 (if not included)
- Speaker: ~$10-15 (or use existing)
- Total: $0-55 (mostly already have)
Compare to commercial solutions:
- Google Home Mini: $50 (requires cloud)
- Amazon Echo Dot: $50 (requires cloud)
- Apple HomePod Mini: $99 (requires cloud)
Success Criteria
Minimum Viable Product (MVP)
- ✓ Wake word detection < 1 second
- ✓ Speech-to-text accuracy > 90%
- ✓ Home Assistant command execution
- ✓ Response time < 3 seconds total
- ✓ All processing local (no cloud)
Enhanced Version
- ✓ Multi-intent conversations
- ✓ Context awareness
- ✓ Multiple wake words
- ✓ Room-aware responses
- ✓ Custom voice training
Resources & Documentation
Official Documentation
- Maix Duino: https://wiki.sipeed.com/hardware/en/maix/
- MaixPy: https://maixpy.sipeed.com/
- Home Assistant API: https://developers.home-assistant.io/
Wake Word Tools
- Mycroft Precise: https://github.com/MycroftAI/mycroft-precise
- Porcupine: https://github.com/Picovoice/porcupine
TTS Options
- Piper: https://github.com/rhasspy/piper
- Coqui TTS: https://github.com/coqui-ai/TTS
Community Projects
- Rhasspy: https://rhasspy.readthedocs.io/ (full voice assistant framework)
- Willow: https://github.com/toverainc/willow (ESP32-based alternative)
Next Steps
- Test current setup: Verify Maix Duino boots and can connect to WiFi
- Audio test: Record and playback test on the board
- Server setup: Create conda environment and install dependencies
- Simple prototype: Wake word → beep (no processing yet)
- Iterate: Add complexity step by step