Maix Duino Voice Assistant - System Architecture

Overview

A fully local voice assistant built on the Sipeed Maix Duino board and integrated with Home Assistant, leveraging existing home lab infrastructure for AI processing.

Hardware Components

Maix Duino Board

  • Processor: K210 dual-core RISC-V @ 400MHz
  • AI Accelerator: KPU for neural network inference
  • Audio: I2S microphone + speaker output
  • Connectivity: ESP32 for WiFi/BLE
  • Programming: MaixPy (MicroPython)

Supporting Hardware

  • I2S MEMS microphone (or microphone array for better pickup)
  • Small speaker (3-5W) or audio output to existing speakers
  • USB-C power supply (5V/2A minimum)

Software Architecture

Edge Layer (Maix Duino)

┌─────────────────────────────────────┐
│   Maix Duino (MaixPy)              │
├─────────────────────────────────────┤
│ • Wake Word Detection (KPU)        │
│ • Audio Capture (I2S)               │
│ • Audio Streaming → Heimdall        │
│ • Audio Playback ← Heimdall         │
│ • LED Feedback (listening status)   │
└─────────────────────────────────────┘
           ↕ WiFi/HTTP/WebSocket
┌─────────────────────────────────────┐
│   Voice Processing Server           │
│   (Heimdall - 10.1.10.71)          │
├─────────────────────────────────────┤
│ • Whisper STT (existing setup!)     │
│ • Intent Recognition (Rasa/custom)  │
│ • Piper TTS                         │
│ • Home Assistant API Client         │
└─────────────────────────────────────┘
           ↕ REST API/MQTT
┌─────────────────────────────────────┐
│   Home Assistant                    │
│   (Your HA instance)                │
├─────────────────────────────────────┤
│ • Device Control                    │
│ • State Management                  │
│ • Automation Triggers               │
└─────────────────────────────────────┘

Communication Flow

1. Wake Word Detection (Local)

User says "Hey Assistant"
    ↓
Maix Duino KPU detects wake word
    ↓
LED turns on (listening mode)
    ↓
Start audio streaming to Heimdall

2. Speech Processing (Heimdall)

Audio stream received
    ↓
Whisper transcribes to text
    ↓
Intent parser extracts command
    ↓
Query Home Assistant API
    ↓
Generate response text
    ↓
Piper TTS creates audio
    ↓
Stream audio back to Maix Duino
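The server-side steps above can be sketched as a single handler. This is a minimal sketch with placeholder stages: the stt, parse_intent, call_home_assistant, and tts functions here are stubs standing in for Whisper, the intent parser, the HA client, and Piper, not the real integrations.

```python
# Minimal sketch of the Heimdall-side pipeline. Every stage is a stub;
# swap in the real Whisper/Piper/HA calls once each piece is tested.

def stt(audio_bytes: bytes) -> str:
    # Placeholder: the real version would call Whisper's transcribe()
    return "turn on the living room lights"

def parse_intent(text: str) -> dict:
    # Placeholder: a single hard-coded rule for illustration
    if "turn on" in text and "living room" in text:
        return {"action": "turn_on", "entity": "light.living_room"}
    return {"action": "unknown", "entity": None}

def call_home_assistant(intent: dict) -> str:
    # Placeholder: the real version would POST to the HA REST API
    if intent["action"] == "turn_on":
        return "Living room lights turned on"
    return "Sorry, I didn't understand that"

def tts(text: str) -> bytes:
    # Placeholder: the real version would synthesize audio with Piper
    return text.encode("utf-8")

def handle_utterance(audio_bytes: bytes) -> bytes:
    """Full round trip: audio in, spoken-response audio out."""
    text = stt(audio_bytes)
    intent = parse_intent(text)
    response_text = call_home_assistant(intent)
    return tts(response_text)
```

Keeping each stage behind its own function makes it easy to swap in the real Whisper and Piper calls one at a time while the rest of the pipeline keeps working.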

3. Playback & Feedback

Receive audio stream
    ↓
Play through speaker
    ↓
LED indicates completion
    ↓
Return to wake word detection
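Taken together, flows 1-3 amount to a small state machine on the edge device. A hedged sketch of the loop (the state and event names below are my own labels for the flow above, not anything from the MaixPy API):

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()       # waiting for wake word, LED off
    LISTENING = auto()  # wake word heard, LED on, streaming to Heimdall
    SPEAKING = auto()   # playing back the server's response audio

def next_state(state: State, event: str) -> State:
    """Transition table for the edge loop described above."""
    transitions = {
        (State.IDLE, "wake_word"): State.LISTENING,
        (State.LISTENING, "response_received"): State.SPEAKING,
        (State.SPEAKING, "playback_done"): State.IDLE,
    }
    # Events that don't apply in the current state are ignored.
    return transitions.get((state, event), state)
```

Driving the LED and the audio streaming off the current state (rather than ad hoc flags) keeps the device behavior predictable when, say, the server response arrives late.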

Network Configuration

Maix Duino Network Settings

  • IP: 10.1.10.xxx (assign static via DHCP reservation)
  • Gateway: 10.1.10.1
  • DNS: 10.1.10.4 (Pi-hole)

Service Endpoints

  • Voice Processing Server: http://10.1.10.71:5000
  • Home Assistant: (your existing HA URL)
  • MQTT Broker: (optional, if using MQTT)

Caddy Reverse Proxy Entry

Add to /mnt/project/epona_-_Caddyfile:

# Voice Assistant API
handle /voice-assistant* {
    uri strip_prefix /voice-assistant
    reverse_proxy http://10.1.10.71:5000
}

Software Stack

Maix Duino (MaixPy)

  • Firmware: Latest MaixPy release
  • Libraries:
    • Maix.KPU - Neural network inference
    • Maix.I2S - Audio capture/playback
    • socket - Network communication
    • ujson - JSON handling

Heimdall Server (Python)

  • Environment: Create new conda env
    conda create -n voice-assistant python=3.10
    conda activate voice-assistant
    
  • Dependencies:
    • openai-whisper (already installed!)
    • piper-tts - Text-to-speech
    • flask - REST API server
    • requests - HTTP client
    • pyaudio - Audio handling
    • websockets - Real-time streaming

Optional: Intent Recognition

  • Rasa - Full NLU framework (heavier but powerful)
  • Simple pattern matching - Lightweight, start here
  • LLM-based - Use your existing LLM setup on Heimdall
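For the "simple pattern matching" starting point, a handful of regex rules goes a long way. A minimal sketch (the entity IDs are illustrative examples, not a real Home Assistant inventory):

```python
import re

# Ordered (pattern, action, entity) rules; first match wins.
# Entity IDs here are placeholders -- map them to your real HA entities.
RULES = [
    (re.compile(r"turn on .*living room light", re.I), "turn_on", "light.living_room"),
    (re.compile(r"turn off .*living room light", re.I), "turn_off", "light.living_room"),
    (re.compile(r"temperature", re.I), "get_state", "sensor.temperature"),
]

def parse_intent(text: str) -> dict:
    """Map a transcript to an action/entity pair, or 'unknown'."""
    for pattern, action, entity in RULES:
        if pattern.search(text):
            return {"action": action, "entity": entity}
    return {"action": "unknown", "entity": None}
```

The "unknown" fallback is the natural hook for escalating to Rasa or an LLM later without changing the callers.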

Data Flow Examples

Example 1: Turn on lights

User: "Hey Assistant, turn on the living room lights"
    ↓
Wake word detected → Start recording
    ↓
Whisper STT: "turn on the living room lights"
    ↓
Intent Parser: {
  "action": "turn_on",
  "entity": "light.living_room"
}
    ↓
Home Assistant API:
  POST /api/services/light/turn_on
  {"entity_id": "light.living_room"}
    ↓
Response: "Living room lights turned on"
    ↓
Piper TTS → Audio playback

Example 2: Get status

User: "What's the temperature?"
    ↓
Whisper STT: "what's the temperature"
    ↓
Intent Parser: {
  "action": "get_state",
  "entity": "sensor.temperature"
}
    ↓
Home Assistant API:
  GET /api/states/sensor.temperature
    ↓
Response: "The temperature is 72 degrees"
    ↓
Piper TTS → Audio playback
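Both examples map directly onto the Home Assistant REST API, which authenticates with a long-lived access token sent as a Bearer header. A sketch of how the two requests would be assembled (HA_URL and TOKEN are placeholders, not values from this setup):

```python
import json

HA_URL = "http://homeassistant.local:8123"  # placeholder -- use your HA URL
TOKEN = "YOUR_LONG_LIVED_TOKEN"             # placeholder -- from your HA profile

def build_service_call(domain: str, service: str, entity_id: str) -> dict:
    """Pieces of e.g. POST /api/services/light/turn_on (Example 1)."""
    return {
        "method": "POST",
        "url": f"{HA_URL}/api/services/{domain}/{service}",
        "headers": {
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({"entity_id": entity_id}),
    }

def build_state_query(entity_id: str) -> dict:
    """Pieces of e.g. GET /api/states/sensor.temperature (Example 2)."""
    return {
        "method": "GET",
        "url": f"{HA_URL}/api/states/{entity_id}",
        "headers": {"Authorization": f"Bearer {TOKEN}"},
    }
```

A thin wrapper around the requests library can then execute these; keeping request construction separate makes it trivial to unit-test without a live HA instance.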

Phase 1 Implementation Plan

Step 1: Maix Duino Setup (Week 1)

  • Flash latest MaixPy firmware
  • Test audio input/output
  • Implement basic network communication
  • Test streaming audio to server

Step 2: Server Setup (Week 1-2)

  • Create conda environment on Heimdall
  • Set up Flask API server
  • Integrate Whisper (already have this!)
  • Install and test Piper TTS
  • Create basic Home Assistant API client

Step 3: Wake Word Training (Week 2)

  • Record wake word samples
  • Train custom wake word model
  • Convert model for K210 KPU
  • Test on-device detection

Step 4: Integration (Week 3)

  • Connect all components
  • Test end-to-end flow
  • Add error handling
  • Implement fallbacks

Step 5: Enhancement (Week 4+)

  • Add more intents
  • Improve NLU accuracy
  • Add multi-room support
  • Implement conversation context

Development Tools

Testing Wake Word

# Use existing diarization.py for testing audio quality
python3 /path/to/diarization.py test_audio.wav \
  --format vtt \
  --model medium

Monitoring

  • Heimdall logs: /var/log/voice-assistant/
  • Maix Duino serial console: 115200 baud
  • Home Assistant logs: Standard HA logging

Security Considerations

  1. No external cloud services - Everything local
  2. Network isolation - Keep on 10.1.10.0/24
  3. Authentication - Use HA long-lived tokens
  4. Rate limiting - Prevent abuse
  5. Audio privacy - Only stream after wake word

Resource Requirements

Heimdall

  • CPU: Minimal (< 5% idle, spikes during STT)
  • RAM: ~2GB for Whisper medium model
  • Storage: ~5GB for models
  • Network: Low bandwidth (16kHz audio stream)
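The "low bandwidth" claim is easy to verify. Assuming 16-bit mono PCM at 16 kHz (the sample rate Whisper expects; the bit depth and channel count are assumptions):

```python
sample_rate = 16_000   # Hz
bytes_per_sample = 2   # 16-bit PCM
channels = 1           # mono

bytes_per_second = sample_rate * bytes_per_sample * channels
print(bytes_per_second)             # 32000 bytes/s
print(bytes_per_second * 8 / 1000)  # 256.0 kbit/s
```

At 256 kbit/s uncompressed, a handful of simultaneous edge devices is negligible on a wired LAN.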

Maix Duino

  • Power: ~1-2W typical
  • Storage: 16MB flash (plenty for wake word model)
  • RAM: 8MB SRAM (sufficient for audio buffering)

Alternative Architectures

Option A: Fully On-Device (Limited)

  • Everything on Maix Duino
  • Very limited vocabulary
  • No internet required
  • Lower accuracy

Option B: Hybrid Edge + Server (Recommended)

  • Wake word on Maix Duino
  • Processing on Heimdall
  • Best balance of speed/accuracy

Option C: Raspberry Pi Alternative

  • If K210 proves limiting
  • More processing power
  • Still local/FOSS
  • Higher cost

Expansion Ideas

Future Enhancements

  1. Multi-room: Deploy multiple Maix Duino units
  2. Music playback: Integrate with Plex
  3. Timers/Reminders: Local scheduling
  4. Weather: Pull from local weather station
  5. Calendar: Sync with Nextcloud
  6. Intercom: Room-to-room communication
  7. Sound events: Doorbell, smoke alarm detection

Integration with Existing Infrastructure

  • Plex: Voice control for media playback
  • qBittorrent: Status queries, torrent management
  • Nextcloud: Calendar/contact queries
  • Matrix: Send messages via voice

Cost Estimate

  • Maix Duino board: ~$20-30 (already have!)
  • Microphone: ~$5-10 (if not included)
  • Speaker: ~$10-15 (or use existing)
  • Total: $0-55 (mostly already have)

Compare to commercial solutions:

  • Google Home Mini: $50 (requires cloud)
  • Amazon Echo Dot: $50 (requires cloud)
  • Apple HomePod Mini: $99 (requires cloud)

Success Criteria

Minimum Viable Product (MVP)

  • ✓ Wake word detection < 1 second
  • ✓ Speech-to-text accuracy > 90%
  • ✓ Home Assistant command execution
  • ✓ Response time < 3 seconds total
  • ✓ All processing local (no cloud)

Enhanced Version

  • ✓ Multi-intent conversations
  • ✓ Context awareness
  • ✓ Multiple wake words
  • ✓ Room-aware responses
  • ✓ Custom voice training


Next Steps

  1. Test current setup: Verify Maix Duino boots and can connect to WiFi
  2. Audio test: Record and playback test on the board
  3. Server setup: Create conda environment and install dependencies
  4. Simple prototype: Wake word → beep (no processing yet)
  5. Iterate: Add complexity step by step