# Maix Duino Voice Assistant - System Architecture

## Overview

Local voice assistant using a Sipeed Maix Duino board integrated with Home Assistant, leveraging existing home lab infrastructure for AI processing.

## Hardware Components

### Maix Duino Board

- **Processor**: K210 dual-core RISC-V @ 400MHz
- **AI Accelerator**: KPU for neural network inference
- **Audio**: I2S microphone input + speaker output
- **Connectivity**: ESP32 module for WiFi/BLE
- **Programming**: MaixPy (MicroPython)

### Recommended Accessories

- I2S MEMS microphone (or a microphone array for better pickup)
- Small speaker (3-5W) or audio output to existing speakers
- USB-C power supply (5V/2A minimum)
## Software Architecture

### Edge Layer (Maix Duino)

```
┌─────────────────────────────────────┐
│ Maix Duino (MaixPy)                 │
├─────────────────────────────────────┤
│ • Wake Word Detection (KPU)         │
│ • Audio Capture (I2S)               │
│ • Audio Streaming → Heimdall        │
│ • Audio Playback ← Heimdall         │
│ • LED Feedback (listening status)   │
└─────────────────────────────────────┘
          ↕ WiFi/HTTP/WebSocket
┌─────────────────────────────────────┐
│ Voice Processing Server             │
│ (Heimdall - 10.1.10.71)             │
├─────────────────────────────────────┤
│ • Whisper STT (existing setup!)     │
│ • Intent Recognition (Rasa/custom)  │
│ • Piper TTS                         │
│ • Home Assistant API Client         │
└─────────────────────────────────────┘
          ↕ REST API/MQTT
┌─────────────────────────────────────┐
│ Home Assistant                      │
│ (Your HA instance)                  │
├─────────────────────────────────────┤
│ • Device Control                    │
│ • State Management                  │
│ • Automation Triggers               │
└─────────────────────────────────────┘
```
## Communication Flow

### 1. Wake Word Detection (Local)

```
User says "Hey Assistant"
        ↓
Maix Duino KPU detects wake word
        ↓
LED turns on (listening mode)
        ↓
Start audio streaming to Heimdall
```

### 2. Speech Processing (Heimdall)

```
Audio stream received
        ↓
Whisper transcribes to text
        ↓
Intent parser extracts command
        ↓
Query Home Assistant API
        ↓
Generate response text
        ↓
Piper TTS creates audio
        ↓
Stream audio back to Maix Duino
```
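
The server-side steps above can be sketched as one handler. Every stage below is a stub with a hypothetical name (`transcribe`, `parse_intent`, `call_home_assistant`, `synthesize`) standing in for Whisper, the intent parser, the HA client, and Piper; this is not the actual `voice_server.py` code, just the shape of the pipeline.

```python
# Sketch of the Heimdall processing pipeline. Each stage is a placeholder
# returning canned data so the control flow can be seen end to end.

def transcribe(audio: bytes) -> str:
    # Placeholder for Whisper STT.
    return "turn on the living room lights"

def parse_intent(text: str) -> dict:
    # Placeholder for the intent parser.
    return {"action": "turn_on", "entity": "light.living_room"}

def call_home_assistant(intent: dict) -> str:
    # Placeholder for the HA API client; returns the response text.
    return "Living room lights turned on"

def synthesize(text: str) -> bytes:
    # Placeholder for Piper TTS; would return WAV/PCM audio.
    return text.encode()

def handle_utterance(audio: bytes) -> bytes:
    """Run the full pipeline: audio in, spoken-response audio out."""
    text = transcribe(audio)
    intent = parse_intent(text)
    reply = call_home_assistant(intent)
    return synthesize(reply)

response = handle_utterance(b"\x00" * 320)  # audio bytes of the reply
```

In the real server each stub becomes a call into the corresponding library, but the data handed between stages (bytes → text → intent dict → text → bytes) stays the same.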

### 3. Playback & Feedback

```
Receive audio stream
        ↓
Play through speaker
        ↓
LED indicates completion
        ↓
Return to wake word detection
```
## Network Configuration

### Maix Duino Network Settings

- **IP**: 10.1.10.xxx (assign static via DHCP reservation)
- **Gateway**: 10.1.10.1
- **DNS**: 10.1.10.4 (Pi-hole)

### Service Endpoints

- **Voice Processing Server**: http://10.1.10.71:5000
- **Home Assistant**: (your existing HA URL)
- **MQTT Broker**: (optional, if using MQTT)

### Caddy Reverse Proxy Entry

Add to `/mnt/project/epona_-_Caddyfile`:

```caddy
# Voice Assistant API
handle /voice-assistant* {
    uri strip_prefix /voice-assistant
    reverse_proxy http://10.1.10.71:5000
}
```
## Software Stack

### Maix Duino (MaixPy)

- **Firmware**: Latest MaixPy release
- **Libraries**:
  - `Maix.KPU` - Neural network inference
  - `Maix.I2S` - Audio capture/playback
  - `socket` - Network communication
  - `ujson` - JSON handling
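
As a rough illustration of what the edge client sends over the socket, the helper below frames 16-bit, 16 kHz mono PCM into fixed-duration chunks. It is plain Python for clarity; the on-device version would read from `Maix.I2S` and write to a socket, and this sketch makes no claims about that API's exact signatures.

```python
# Frame raw PCM into fixed-duration chunks for network streaming.
# Assumes 16 kHz sample rate, 16-bit (2-byte) mono samples.

SAMPLE_RATE = 16000
BYTES_PER_SAMPLE = 2

def chunk_pcm(pcm: bytes, chunk_ms: int = 20) -> list[bytes]:
    """Split a PCM buffer into chunk_ms frames (last frame may be short)."""
    chunk_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    return [pcm[i:i + chunk_bytes] for i in range(0, len(pcm), chunk_bytes)]

# One second of silence -> fifty 20 ms frames of 640 bytes each.
frames = chunk_pcm(b"\x00" * (SAMPLE_RATE * BYTES_PER_SAMPLE))
```

Small fixed-size frames keep latency low and let the server start Whisper transcription before the utterance ends.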

### Heimdall Server (Python)

- **Environment**: Create a new conda env

```bash
conda create -n voice-assistant python=3.10
conda activate voice-assistant
```

- **Dependencies**:
  - `openai-whisper` (already installed!)
  - `piper-tts` - Text-to-speech
  - `flask` - REST API server
  - `requests` - HTTP client
  - `pyaudio` - Audio handling
  - `websockets` - Real-time streaming

### Optional: Intent Recognition

- **Rasa** - Full NLU framework (heavier but powerful)
- **Simple pattern matching** - Lightweight, start here
- **LLM-based** - Use your existing LLM setup on Heimdall

## Data Flow Examples

### Example 1: Turn on lights

```
User: "Hey Assistant, turn on the living room lights"
        ↓
Wake word detected → Start recording
        ↓
Whisper STT: "turn on the living room lights"
        ↓
Intent Parser: {
  "action": "turn_on",
  "entity": "light.living_room"
}
        ↓
Home Assistant API:
  POST /api/services/light/turn_on
  {"entity_id": "light.living_room"}
        ↓
Response: "Living room lights turned on"
        ↓
Piper TTS → Audio playback
```
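
The Home Assistant call in the example uses the standard REST service endpoint. The helper below only *builds* the request so it can be inspected without a live HA instance; the base URL and token are placeholders for your instance's values.

```python
# Build a Home Assistant service-call request without sending it.

HA_URL = "http://homeassistant.local:8123"  # placeholder base URL

def build_service_call(domain: str, service: str, entity_id: str, token: str):
    """Return (url, headers, payload) for POST /api/services/<domain>/<service>."""
    url = f"{HA_URL}/api/services/{domain}/{service}"
    headers = {
        "Authorization": f"Bearer {token}",  # HA long-lived access token
        "Content-Type": "application/json",
    }
    return url, headers, {"entity_id": entity_id}

url, headers, payload = build_service_call(
    "light", "turn_on", "light.living_room", "YOUR_TOKEN"
)
# then: requests.post(url, headers=headers, json=payload)
```

Keeping request construction separate from the `requests.post` call also makes the client trivially unit-testable.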

### Example 2: Get status

```
User: "What's the temperature?"
        ↓
Whisper STT: "what's the temperature"
        ↓
Intent Parser: {
  "action": "get_state",
  "entity": "sensor.temperature"
}
        ↓
Home Assistant API:
  GET /api/states/sensor.temperature
        ↓
Response: "The temperature is 72 degrees"
        ↓
Piper TTS → Audio playback
```
## Phase 1 Implementation Plan

### Step 1: Maix Duino Setup (Week 1)

- [ ] Flash latest MaixPy firmware
- [ ] Test audio input/output
- [ ] Implement basic network communication
- [ ] Test streaming audio to server

### Step 2: Server Setup (Week 1-2)

- [ ] Create conda environment on Heimdall
- [ ] Set up Flask API server
- [ ] Integrate Whisper (already have this!)
- [ ] Install and test Piper TTS
- [ ] Create basic Home Assistant API client

### Step 3: Wake Word Training (Week 2)

- [ ] Record wake word samples
- [ ] Train custom wake word model
- [ ] Convert model for K210 KPU
- [ ] Test on-device detection

### Step 4: Integration (Week 3)

- [ ] Connect all components
- [ ] Test end-to-end flow
- [ ] Add error handling
- [ ] Implement fallbacks

### Step 5: Enhancement (Week 4+)

- [ ] Add more intents
- [ ] Improve NLU accuracy
- [ ] Add multi-room support
- [ ] Implement conversation context
## Development Tools

### Testing Wake Word

```bash
# Use existing diarization.py for testing audio quality
python3 /path/to/diarization.py test_audio.wav \
    --format vtt \
    --model medium
```

### Monitoring

- Heimdall logs: `/var/log/voice-assistant/`
- Maix Duino serial console: 115200 baud
- Home Assistant logs: Standard HA logging
## Security Considerations

1. **No external cloud services** - Everything local
2. **Network isolation** - Keep on 10.1.10.0/24
3. **Authentication** - Use HA long-lived tokens
4. **Rate limiting** - Prevent abuse
5. **Audio privacy** - Only stream after wake word
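
Rate limiting (item 4) can be as simple as a token bucket checked at the top of the Flask endpoint. This standalone sketch is not from the ported scripts; it allows a burst of `capacity` requests, refilled at `rate` tokens per second.

```python
import time

class TokenBucket:
    """Allow `capacity` burst requests, refilled at `rate` tokens/second."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill based on elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=2.0, capacity=5)  # 5-request burst, 2 req/s sustained
# In the endpoint: if not bucket.allow(): return "Too Many Requests", 429
```

A single in-process bucket is enough here since there is one server and a handful of edge devices on a trusted VLAN.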

## Resource Requirements

### Heimdall

- **CPU**: Minimal (< 5% idle, spikes during STT)
- **RAM**: ~2GB for Whisper medium model
- **Storage**: ~5GB for models
- **Network**: Low bandwidth (16kHz audio stream)
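
The bandwidth claim is easy to check with back-of-envelope arithmetic: uncompressed 16 kHz, 16-bit mono PCM is a fixed 32,000 bytes per second.

```python
# Back-of-envelope bandwidth for the raw audio stream.
sample_rate = 16_000   # Hz
bytes_per_sample = 2   # 16-bit mono PCM
bytes_per_second = sample_rate * bytes_per_sample  # 32000 B/s, about 31.25 KiB/s
```

That is trivial for a wired or WiFi home LAN, and only flows while a device is actively streaming after a wake word.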

### Maix Duino

- **Power**: ~1-2W typical
- **Storage**: 16MB flash (plenty for wake word model)
- **RAM**: 8MB SRAM (sufficient for audio buffering)
## Alternative Architectures

### Option A: Fully On-Device (Limited)

- Everything on Maix Duino
- Very limited vocabulary
- No internet required
- Lower accuracy

### Option B: Hybrid (Recommended)

- Wake word on Maix Duino
- Processing on Heimdall
- Best balance of speed/accuracy

### Option C: Raspberry Pi Alternative

- If K210 proves limiting
- More processing power
- Still local/FOSS
- Higher cost
## Expansion Ideas

### Future Enhancements

1. **Multi-room**: Deploy multiple Maix Duino units
2. **Music playback**: Integrate with Plex
3. **Timers/Reminders**: Local scheduling
4. **Weather**: Pull from local weather station
5. **Calendar**: Sync with Nextcloud
6. **Intercom**: Room-to-room communication
7. **Sound events**: Doorbell, smoke alarm detection

### Integration with Existing Infrastructure

- **Plex**: Voice control for media playback
- **qBittorrent**: Status queries, torrent management
- **Nextcloud**: Calendar/contact queries
- **Matrix**: Send messages via voice
## Cost Estimate

- Maix Duino board: ~$20-30 (already have!)
- Microphone: ~$5-10 (if not included)
- Speaker: ~$10-15 (or use existing)
- **Total**: $0-55 (mostly already have)

Compare to commercial solutions:

- Google Home Mini: $50 (requires cloud)
- Amazon Echo Dot: $50 (requires cloud)
- Apple HomePod Mini: $99 (requires cloud)
## Success Criteria

### Minimum Viable Product (MVP)

- ✓ Wake word detection < 1 second
- ✓ Speech-to-text accuracy > 90%
- ✓ Home Assistant command execution
- ✓ Response time < 3 seconds total
- ✓ All processing local (no cloud)

### Enhanced Version

- ✓ Multi-intent conversations
- ✓ Context awareness
- ✓ Multiple wake words
- ✓ Room-aware responses
- ✓ Custom voice training
## Resources & Documentation

### Official Documentation

- Maix Duino: https://wiki.sipeed.com/hardware/en/maix/
- MaixPy: https://maixpy.sipeed.com/
- Home Assistant API: https://developers.home-assistant.io/

### Wake Word Tools

- Mycroft Precise: https://github.com/MycroftAI/mycroft-precise
- Porcupine: https://github.com/Picovoice/porcupine

### TTS Options

- Piper: https://github.com/rhasspy/piper
- Coqui TTS: https://github.com/coqui-ai/TTS

### Community Projects

- Rhasspy: https://rhasspy.readthedocs.io/ (full voice assistant framework)
- Willow: https://github.com/toverainc/willow (ESP32-based alternative)
## Next Steps

1. **Test current setup**: Verify the Maix Duino boots and can connect to WiFi
2. **Audio test**: Record and playback test on the board
3. **Server setup**: Create conda environment and install dependencies
4. **Simple prototype**: Wake word → beep (no processing yet)
5. **Iterate**: Add complexity step by step
|