# Maix Duino Voice Assistant - System Architecture

## Overview

Local voice assistant using a Sipeed Maix Duino board integrated with Home Assistant, leveraging existing home lab infrastructure for AI processing.

## Hardware Components

### Maix Duino Board

- **Processor**: K210 dual-core RISC-V @ 400 MHz
- **AI Accelerator**: KPU for neural network inference
- **Audio**: I2S microphone + speaker output
- **Connectivity**: ESP32 for WiFi/BLE
- **Programming**: MaixPy (MicroPython)

### Recommended Accessories

- I2S MEMS microphone (or microphone array for better pickup)
- Small speaker (3-5 W) or audio output to existing speakers
- USB-C power supply (5 V/2 A minimum)

## Software Architecture

### Edge Layer (Maix Duino)

```
┌─────────────────────────────────────┐
│ Maix Duino (MaixPy)                 │
├─────────────────────────────────────┤
│ • Wake Word Detection (KPU)         │
│ • Audio Capture (I2S)               │
│ • Audio Streaming → Heimdall        │
│ • Audio Playback ← Heimdall         │
│ • LED Feedback (listening status)   │
└─────────────────────────────────────┘
           ↕ WiFi/HTTP/WebSocket
┌─────────────────────────────────────┐
│ Voice Processing Server             │
│ (Heimdall - 10.1.10.71)             │
├─────────────────────────────────────┤
│ • Whisper STT (existing setup!)     │
│ • Intent Recognition (Rasa/custom)  │
│ • Piper TTS                         │
│ • Home Assistant API Client         │
└─────────────────────────────────────┘
           ↕ REST API/MQTT
┌─────────────────────────────────────┐
│ Home Assistant                      │
│ (Your HA instance)                  │
├─────────────────────────────────────┤
│ • Device Control                    │
│ • State Management                  │
│ • Automation Triggers               │
└─────────────────────────────────────┘
```

## Communication Flow

### 1. Wake Word Detection (Local)

```
User says "Hey Assistant"
        ↓
Maix Duino KPU detects wake word
        ↓
LED turns on (listening mode)
        ↓
Start audio streaming to Heimdall
```
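The wake-word flow above can be sketched as a small control-loop state machine. This is a pure-Python sketch of the control logic only; the `detector`, `streamer`, and `led` callbacks are hypothetical stand-ins for the KPU wake-word model, the streaming client, and the status LED, not real MaixPy APIs:

```python
# Sketch of the device-side control loop: idle → listening/streaming → idle.
# All three callbacks are assumed interfaces, injected so the same logic can
# run on the board (MaixPy) or be exercised on a desktop.

IDLE, STREAMING = "idle", "streaming"

class WakeWordLoop:
    def __init__(self, detector, streamer, led):
        self.detector = detector  # audio_frame -> True when wake word heard
        self.streamer = streamer  # audio_frame -> True when utterance is done
        self.led = led            # led(on: bool), listening indicator
        self.state = IDLE

    def step(self, audio_frame):
        """Consume one captured audio frame and advance the state machine."""
        if self.state == IDLE:
            if self.detector(audio_frame):
                self.led(True)           # entering listening mode
                self.state = STREAMING
        elif self.state == STREAMING:
            if self.streamer(audio_frame):
                self.led(False)          # server signalled end of utterance
                self.state = IDLE
```

On the board, `step()` would be called once per I2S buffer; the streamer callback is where frames get pushed to Heimdall over WiFi.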
### 2. Speech Processing (Heimdall)

```
Audio stream received
        ↓
Whisper transcribes to text
        ↓
Intent parser extracts command
        ↓
Query Home Assistant API
        ↓
Generate response text
        ↓
Piper TTS creates audio
        ↓
Stream audio back to Maix Duino
```

### 3. Playback & Feedback

```
Receive audio stream
        ↓
Play through speaker
        ↓
LED indicates completion
        ↓
Return to wake word detection
```

## Network Configuration

### Maix Duino Network Settings

- **IP**: 10.1.10.xxx (assign static via DHCP reservation)
- **Gateway**: 10.1.10.1
- **DNS**: 10.1.10.4 (Pi-hole)

### Service Endpoints

- **Voice Processing Server**: http://10.1.10.71:5000
- **Home Assistant**: (your existing HA URL)
- **MQTT Broker**: (optional, if using MQTT)

### Caddy Reverse Proxy Entry

Add to `/mnt/project/epona_-_Caddyfile`:

```caddy
# Voice Assistant API
handle /voice-assistant* {
    uri strip_prefix /voice-assistant
    reverse_proxy http://10.1.10.71:5000
}
```

## Software Stack

### Maix Duino (MaixPy)

- **Firmware**: Latest MaixPy release
- **Libraries**:
  - `Maix.KPU` - Neural network inference
  - `Maix.I2S` - Audio capture/playback
  - `socket` - Network communication
  - `ujson` - JSON handling

### Heimdall Server (Python)

- **Environment**: Create new conda env

  ```bash
  conda create -n voice-assistant python=3.10
  conda activate voice-assistant
  ```

- **Dependencies**:
  - `openai-whisper` (already installed!)
  - `piper-tts` - Text-to-speech
  - `flask` - REST API server
  - `requests` - HTTP client
  - `pyaudio` - Audio handling
  - `websockets` - Real-time streaming

### Optional: Intent Recognition

- **Rasa** - Full NLU framework (heavier but powerful)
- **Simple pattern matching** - Lightweight, start here
- **LLM-based** - Use your existing LLM setup on Heimdall

## Data Flow Examples

### Example 1: Turn on lights

```
User: "Hey Assistant, turn on the living room lights"
        ↓
Wake word detected → Start recording
        ↓
Whisper STT: "turn on the living room lights"
        ↓
Intent Parser: { "action": "turn_on", "entity": "light.living_room" }
        ↓
Home Assistant API: POST /api/services/light/turn_on
                    {"entity_id": "light.living_room"}
        ↓
Response: "Living room lights turned on"
        ↓
Piper TTS → Audio playback
```

### Example 2: Get status

```
User: "What's the temperature?"
        ↓
Whisper STT: "what's the temperature"
        ↓
Intent Parser: { "action": "get_state", "entity": "sensor.temperature" }
        ↓
Home Assistant API: GET /api/states/sensor.temperature
        ↓
Response: "The temperature is 72 degrees"
        ↓
Piper TTS → Audio playback
```

## Phase 1 Implementation Plan

### Step 1: Maix Duino Setup (Week 1)

- [ ] Flash latest MaixPy firmware
- [ ] Test audio input/output
- [ ] Implement basic network communication
- [ ] Test streaming audio to server

### Step 2: Server Setup (Week 1-2)

- [ ] Create conda environment on Heimdall
- [ ] Set up Flask API server
- [ ] Integrate Whisper (already have this!)
- [ ] Install and test Piper TTS
- [ ] Create basic Home Assistant API client

### Step 3: Wake Word Training (Week 2)

- [ ] Record wake word samples
- [ ] Train custom wake word model
- [ ] Convert model for K210 KPU
- [ ] Test on-device detection

### Step 4: Integration (Week 3)

- [ ] Connect all components
- [ ] Test end-to-end flow
- [ ] Add error handling
- [ ] Implement fallbacks

### Step 5: Enhancement (Week 4+)

- [ ] Add more intents
- [ ] Improve NLU accuracy
- [ ] Add multi-room support
- [ ] Implement conversation context

## Development Tools

### Testing Wake Word

```bash
# Use existing diarization.py for testing audio quality
python3 /path/to/diarization.py test_audio.wav \
    --format vtt \
    --model medium
```

### Monitoring

- Heimdall logs: `/var/log/voice-assistant/`
- Maix Duino serial console: 115200 baud
- Home Assistant logs: Standard HA logging

## Security Considerations

1. **No external cloud services** - Everything local
2. **Network isolation** - Keep on 10.1.10.0/24
3. **Authentication** - Use HA long-lived tokens
4. **Rate limiting** - Prevent abuse
5. **Audio privacy** - Only stream after wake word

## Resource Requirements

### Heimdall

- **CPU**: Minimal (< 5% idle, spikes during STT)
- **RAM**: ~2 GB for Whisper medium model
- **Storage**: ~5 GB for models
- **Network**: Low bandwidth (16 kHz audio stream)

### Maix Duino

- **Power**: ~1-2 W typical
- **Storage**: 16 MB flash (plenty for wake word model)
- **RAM**: 8 MB SRAM (sufficient for audio buffering)

## Alternative Architectures

### Option A: Fully On-Device (Limited)

- Everything on Maix Duino
- Very limited vocabulary
- No internet required
- Lower accuracy

### Option B: Hybrid (Recommended)

- Wake word on Maix Duino
- Processing on Heimdall
- Best balance of speed/accuracy

### Option C: Raspberry Pi Alternative

- If K210 proves limiting
- More processing power
- Still local/FOSS
- Higher cost

## Expansion Ideas

### Future Enhancements

1. **Multi-room**: Deploy multiple Maix Duino units
2. **Music playback**: Integrate with Plex
3. **Timers/Reminders**: Local scheduling
4. **Weather**: Pull from local weather station
5. **Calendar**: Sync with Nextcloud
6. **Intercom**: Room-to-room communication
7. **Sound events**: Doorbell, smoke alarm detection

### Integration with Existing Infrastructure

- **Plex**: Voice control for media playback
- **qBittorrent**: Status queries, torrent management
- **Nextcloud**: Calendar/contact queries
- **Matrix**: Send messages via voice

## Cost Estimate

- Maix Duino board: ~$20-30 (already have!)
- Microphone: ~$5-10 (if not included)
- Speaker: ~$10-15 (or use existing)
- **Total**: $0-55 (mostly already have)

Compared to commercial solutions:

- Google Home Mini: $50 (requires cloud)
- Amazon Echo Dot: $50 (requires cloud)
- Apple HomePod Mini: $99 (requires cloud)

## Success Criteria

### Minimum Viable Product (MVP)

- ✓ Wake word detection < 1 second
- ✓ Speech-to-text accuracy > 90%
- ✓ Home Assistant command execution
- ✓ Total response time < 3 seconds
- ✓ All processing local (no cloud)

### Enhanced Version

- ✓ Multi-intent conversations
- ✓ Context awareness
- ✓ Multiple wake words
- ✓ Room-aware responses
- ✓ Custom voice training

## Resources & Documentation

### Official Documentation

- Maix Duino: https://wiki.sipeed.com/hardware/en/maix/
- MaixPy: https://maixpy.sipeed.com/
- Home Assistant API: https://developers.home-assistant.io/

### Wake Word Tools

- Mycroft Precise: https://github.com/MycroftAI/mycroft-precise
- Porcupine: https://github.com/Picovoice/porcupine

### TTS Options

- Piper: https://github.com/rhasspy/piper
- Coqui TTS: https://github.com/coqui-ai/TTS

### Community Projects

- Rhasspy: https://rhasspy.readthedocs.io/ (full voice assistant framework)
- Willow: https://github.com/toverainc/willow (ESP32-based alternative)

## Next Steps

1. **Test current setup**: Verify the Maix Duino boots and can connect to WiFi
2. **Audio test**: Record and play back a test clip on the board
3. **Server setup**: Create conda environment and install dependencies
4. **Simple prototype**: Wake word → beep (no processing yet)
5. **Iterate**: Add complexity step by step
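For the prototype stage, the "simple pattern matching" option listed under intent recognition can start as a couple of regexes over the Whisper transcript, producing the `{"action", "entity"}` shape used in the data flow examples. A minimal sketch; the patterns and entity-id naming below are illustrative assumptions, not a fixed schema:

```python
import re

# Minimal pattern-matching intent parser over a Whisper transcript.
# One (regex, kind) row per supported intent family; extend as needed.
PATTERNS = [
    (re.compile(r"turn (on|off) the (.+)"), "toggle"),
    (re.compile(r"what'?s the (.+)"), "query"),
]

def parse_intent(text):
    """Map a transcript to the {"action", "entity"} dict used by the HA client."""
    text = text.lower().strip().rstrip("?.!")
    for pattern, kind in PATTERNS:
        match = pattern.match(text)
        if not match:
            continue
        if kind == "toggle":
            state, name = match.groups()
            # Naive domain guess; a real parser would consult the HA registry.
            return {"action": "turn_" + state,
                    "entity": "light." + name.replace(" ", "_")}
        if kind == "query":
            return {"action": "get_state",
                    "entity": "sensor." + match.group(1).replace(" ", "_")}
    return None  # unmatched: hand off to the Rasa or LLM-based option
```

In the pipeline this sits between the Whisper output and the Home Assistant API call; a `None` result is the natural hook for falling back to the heavier NLU options.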