Maix Duino Voice Assistant - System Architecture

Overview

A fully local voice assistant built on the Sipeed Maix Duino board and integrated with Home Assistant, leveraging existing home lab infrastructure for AI processing.

Hardware Components

Maix Duino Board

  • Processor: K210 dual-core RISC-V @ 400MHz
  • AI Accelerator: KPU for neural network inference
  • Audio: I2S microphone + speaker output
  • Connectivity: ESP32 for WiFi/BLE
  • Programming: MaixPy (MicroPython)

Supporting Hardware

  • I2S MEMS microphone (or microphone array for better pickup)
  • Small speaker (3-5W) or audio output to existing speakers
  • USB-C power supply (5V/2A minimum)

Software Architecture

Edge Layer (Maix Duino)

┌─────────────────────────────────────┐
│   Maix Duino (MaixPy)              │
├─────────────────────────────────────┤
│ • Wake Word Detection (KPU)        │
│ • Audio Capture (I2S)               │
│ • Audio Streaming → Heimdall        │
│ • Audio Playback ← Heimdall         │
│ • LED Feedback (listening status)   │
└─────────────────────────────────────┘
           ↕ WiFi/HTTP/WebSocket
┌─────────────────────────────────────┐
│   Voice Processing Server           │
│   (Heimdall - 10.1.10.71)          │
├─────────────────────────────────────┤
│ • Whisper STT (existing setup!)     │
│ • Intent Recognition (Rasa/custom)  │
│ • Piper TTS                         │
│ • Home Assistant API Client         │
└─────────────────────────────────────┘
           ↕ REST API/MQTT
┌─────────────────────────────────────┐
│   Home Assistant                    │
│   (Your HA instance)                │
├─────────────────────────────────────┤
│ • Device Control                    │
│ • State Management                  │
│ • Automation Triggers               │
└─────────────────────────────────────┘

Communication Flow

1. Wake Word Detection (Local)

User says "Hey Assistant"
    ↓
Maix Duino KPU detects wake word
    ↓
LED turns on (listening mode)
    ↓
Start audio streaming to Heimdall

2. Speech Processing (Heimdall)

Audio stream received
    ↓
Whisper transcribes to text
    ↓
Intent parser extracts command
    ↓
Query Home Assistant API
    ↓
Generate response text
    ↓
Piper TTS creates audio
    ↓
Stream audio back to Maix Duino
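The server-side steps above can be sketched as a single handler. This is a minimal sketch with placeholder stages: the stt, parse_intent, call_home_assistant, and tts functions here are stubs standing in for Whisper, the intent parser, the HA client, and Piper, not the real integrations.

```python
# Minimal sketch of the Heimdall-side pipeline. Every stage is a stub;
# swap in the real Whisper/Piper/HA calls once each piece is tested.

def stt(audio_bytes: bytes) -> str:
    # Placeholder: the real version would call Whisper's transcribe()
    return "turn on the living room lights"

def parse_intent(text: str) -> dict:
    # Placeholder: a single hard-coded rule for illustration
    if "turn on" in text and "living room" in text:
        return {"action": "turn_on", "entity": "light.living_room"}
    return {"action": "unknown", "entity": None}

def call_home_assistant(intent: dict) -> str:
    # Placeholder: the real version would POST to the HA REST API
    if intent["action"] == "turn_on":
        return "Living room lights turned on"
    return "Sorry, I didn't understand that"

def tts(text: str) -> bytes:
    # Placeholder: the real version would synthesize audio with Piper
    return text.encode("utf-8")

def handle_utterance(audio_bytes: bytes) -> bytes:
    """Full round trip: audio in, spoken-response audio out."""
    text = stt(audio_bytes)
    intent = parse_intent(text)
    response_text = call_home_assistant(intent)
    return tts(response_text)
```

Keeping each stage behind its own function makes it easy to swap in the real Whisper and Piper calls one at a time while the rest of the pipeline keeps working.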

3. Playback & Feedback

Receive audio stream
    ↓
Play through speaker
    ↓
LED indicates completion
    ↓
Return to wake word detection
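Taken together, flows 1-3 amount to a small state machine on the edge device. A hedged sketch of the loop (the state and event names below are my own labels for the flow above, not anything from the MaixPy API):

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()       # waiting for wake word, LED off
    LISTENING = auto()  # wake word heard, LED on, streaming to Heimdall
    SPEAKING = auto()   # playing back the server's response audio

def next_state(state: State, event: str) -> State:
    """Transition table for the edge loop described above."""
    transitions = {
        (State.IDLE, "wake_word"): State.LISTENING,
        (State.LISTENING, "response_received"): State.SPEAKING,
        (State.SPEAKING, "playback_done"): State.IDLE,
    }
    # Events that don't apply in the current state are ignored.
    return transitions.get((state, event), state)
```

Driving the LED and the audio streaming off the current state (rather than ad hoc flags) keeps the device behavior predictable when, say, the server response arrives late.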

Network Configuration

Maix Duino Network Settings

  • IP: 10.1.10.xxx (assign static via DHCP reservation)
  • Gateway: 10.1.10.1
  • DNS: 10.1.10.4 (Pi-hole)

Service Endpoints

  • Voice Processing Server: http://10.1.10.71:5000
  • Home Assistant: (your existing HA URL)
  • MQTT Broker: (optional, if using MQTT)

Caddy Reverse Proxy Entry

Add to /mnt/project/epona_-_Caddyfile:

# Voice Assistant API
handle /voice-assistant* {
    uri strip_prefix /voice-assistant
    reverse_proxy http://10.1.10.71:5000
}

Software Stack

Maix Duino (MaixPy)

  • Firmware: Latest MaixPy release
  • Libraries:
    • Maix.KPU - Neural network inference
    • Maix.I2S - Audio capture/playback
    • socket - Network communication
    • ujson - JSON handling

Heimdall Server (Python)

  • Environment: Create new conda env
    conda create -n voice-assistant python=3.10
    conda activate voice-assistant
    
  • Dependencies:
    • openai-whisper (already installed!)
    • piper-tts - Text-to-speech
    • flask - REST API server
    • requests - HTTP client
    • pyaudio - Audio handling
    • websockets - Real-time streaming

Optional: Intent Recognition

  • Rasa - Full NLU framework (heavier but powerful)
  • Simple pattern matching - Lightweight, start here
  • LLM-based - Use your existing LLM setup on Heimdall
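For the "simple pattern matching" starting point, a handful of regex rules goes a long way. A minimal sketch (the entity IDs are illustrative examples, not a real Home Assistant inventory):

```python
import re

# Ordered (pattern, action, entity) rules; first match wins.
# Entity IDs here are placeholders -- map them to your real HA entities.
RULES = [
    (re.compile(r"turn on .*living room light", re.I), "turn_on", "light.living_room"),
    (re.compile(r"turn off .*living room light", re.I), "turn_off", "light.living_room"),
    (re.compile(r"temperature", re.I), "get_state", "sensor.temperature"),
]

def parse_intent(text: str) -> dict:
    """Map a transcript to an action/entity pair, or 'unknown'."""
    for pattern, action, entity in RULES:
        if pattern.search(text):
            return {"action": action, "entity": entity}
    return {"action": "unknown", "entity": None}
```

The "unknown" fallback is the natural hook for escalating to Rasa or an LLM later without changing the callers.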

Data Flow Examples

Example 1: Turn on lights

User: "Hey Assistant, turn on the living room lights"
    ↓
Wake word detected → Start recording
    ↓
Whisper STT: "turn on the living room lights"
    ↓
Intent Parser: {
  "action": "turn_on",
  "entity": "light.living_room"
}
    ↓
Home Assistant API:
  POST /api/services/light/turn_on
  {"entity_id": "light.living_room"}
    ↓
Response: "Living room lights turned on"
    ↓
Piper TTS → Audio playback

Example 2: Get status

User: "What's the temperature?"
    ↓
Whisper STT: "what's the temperature"
    ↓
Intent Parser: {
  "action": "get_state",
  "entity": "sensor.temperature"
}
    ↓
Home Assistant API:
  GET /api/states/sensor.temperature
    ↓
Response: "The temperature is 72 degrees"
    ↓
Piper TTS → Audio playback
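Both examples map directly onto the Home Assistant REST API, which authenticates with a long-lived access token sent as a Bearer header. A sketch of how the two requests would be assembled (HA_URL and TOKEN are placeholders, not values from this setup):

```python
import json

HA_URL = "http://homeassistant.local:8123"  # placeholder -- use your HA URL
TOKEN = "YOUR_LONG_LIVED_TOKEN"             # placeholder -- from your HA profile

def build_service_call(domain: str, service: str, entity_id: str) -> dict:
    """Pieces of e.g. POST /api/services/light/turn_on (Example 1)."""
    return {
        "method": "POST",
        "url": f"{HA_URL}/api/services/{domain}/{service}",
        "headers": {
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({"entity_id": entity_id}),
    }

def build_state_query(entity_id: str) -> dict:
    """Pieces of e.g. GET /api/states/sensor.temperature (Example 2)."""
    return {
        "method": "GET",
        "url": f"{HA_URL}/api/states/{entity_id}",
        "headers": {"Authorization": f"Bearer {TOKEN}"},
    }
```

A thin wrapper around the requests library can then execute these; keeping request construction separate makes it trivial to unit-test without a live HA instance.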

Phase 1 Implementation Plan

Step 1: Maix Duino Setup (Week 1)

  • Flash latest MaixPy firmware
  • Test audio input/output
  • Implement basic network communication
  • Test streaming audio to server

Step 2: Server Setup (Week 1-2)

  • Create conda environment on Heimdall
  • Set up Flask API server
  • Integrate Whisper (already have this!)
  • Install and test Piper TTS
  • Create basic Home Assistant API client

Step 3: Wake Word Training (Week 2)

  • Record wake word samples
  • Train custom wake word model
  • Convert model for K210 KPU
  • Test on-device detection

Step 4: Integration (Week 3)

  • Connect all components
  • Test end-to-end flow
  • Add error handling
  • Implement fallbacks

Step 5: Enhancement (Week 4+)

  • Add more intents
  • Improve NLU accuracy
  • Add multi-room support
  • Implement conversation context

Development Tools

Testing Wake Word

# Use existing diarization.py for testing audio quality
python3 /path/to/diarization.py test_audio.wav \
  --format vtt \
  --model medium

Monitoring

  • Heimdall logs: /var/log/voice-assistant/
  • Maix Duino serial console: 115200 baud
  • Home Assistant logs: Standard HA logging

Security Considerations

  1. No external cloud services - Everything local
  2. Network isolation - Keep on 10.1.10.0/24
  3. Authentication - Use HA long-lived tokens
  4. Rate limiting - Prevent abuse
  5. Audio privacy - Only stream after wake word

Resource Requirements

Heimdall

  • CPU: Minimal (< 5% idle, spikes during STT)
  • RAM: ~2GB for Whisper medium model
  • Storage: ~5GB for models
  • Network: Low bandwidth (16kHz audio stream)
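The "low bandwidth" claim is easy to verify. Assuming 16-bit mono PCM at 16 kHz (the sample rate Whisper expects; the bit depth and channel count are assumptions):

```python
sample_rate = 16_000   # Hz
bytes_per_sample = 2   # 16-bit PCM
channels = 1           # mono

bytes_per_second = sample_rate * bytes_per_sample * channels
print(bytes_per_second)             # 32000 bytes/s
print(bytes_per_second * 8 / 1000)  # 256.0 kbit/s
```

At 256 kbit/s uncompressed, a handful of simultaneous edge devices is negligible on a wired LAN.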

Maix Duino

  • Power: ~1-2W typical
  • Storage: 16MB flash (plenty for wake word model)
  • RAM: 8MB SRAM (sufficient for audio buffering)

Alternative Architectures

Option A: Fully On-Device (Limited)

  • Everything on Maix Duino
  • Very limited vocabulary
  • No internet required
  • Lower accuracy

Option B: Hybrid Edge + Server (Recommended)

  • Wake word on Maix Duino
  • Processing on Heimdall
  • Best balance of speed/accuracy

Option C: Raspberry Pi Alternative

  • If K210 proves limiting
  • More processing power
  • Still local/FOSS
  • Higher cost

Expansion Ideas

Future Enhancements

  1. Multi-room: Deploy multiple Maix Duino units
  2. Music playback: Integrate with Plex
  3. Timers/Reminders: Local scheduling
  4. Weather: Pull from local weather station
  5. Calendar: Sync with Nextcloud
  6. Intercom: Room-to-room communication
  7. Sound events: Doorbell, smoke alarm detection

Integration with Existing Infrastructure

  • Plex: Voice control for media playback
  • qBittorrent: Status queries, torrent management
  • Nextcloud: Calendar/contact queries
  • Matrix: Send messages via voice

Cost Estimate

  • Maix Duino board: ~$20-30 (already have!)
  • Microphone: ~$5-10 (if not included)
  • Speaker: ~$10-15 (or use existing)
  • Total: $0-55 (mostly already have)

Compare to commercial solutions:

  • Google Home Mini: $50 (requires cloud)
  • Amazon Echo Dot: $50 (requires cloud)
  • Apple HomePod Mini: $99 (requires cloud)

Success Criteria

Minimum Viable Product (MVP)

  • ✓ Wake word detection < 1 second
  • ✓ Speech-to-text accuracy > 90%
  • ✓ Home Assistant command execution
  • ✓ Response time < 3 seconds total
  • ✓ All processing local (no cloud)

Enhanced Version

  • ✓ Multi-intent conversations
  • ✓ Context awareness
  • ✓ Multiple wake words
  • ✓ Room-aware responses
  • ✓ Custom voice training


Next Steps

  1. Test current setup: Verify Maix Duino boots and can connect to WiFi
  2. Audio test: Record and playback test on the board
  3. Server setup: Create conda environment and install dependencies
  4. Simple prototype: Wake word → beep (no processing yet)
  5. Iterate: Add complexity step by step