# Maix Duino LCD & Camera Feature Analysis

**Date:** 2025-11-29
**Hardware:** Sipeed Maix Duino (K210)
**Question:** What's the overhead for using LCD display and camera?

---

## Hardware Capabilities

### LCD Display
- **Resolution:** Typically 320x240 or 240x135 (depending on model)
- **Interface:** SPI
- **Color:** RGB565 (16-bit color)
- **Frame Rate:** Up to 60 FPS (limited by SPI bandwidth)
- **Status:** ✅ Included with most Maix Duino kits

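The RGB565 and frame-rate figures above imply a concrete framebuffer size and bus throughput. The following is a back-of-the-envelope check only, assuming the 320x240 panel and a full-frame redraw on every refresh; the function names are illustrative and not part of MaixPy.

```python
def framebuffer_bytes(width, height, bits_per_pixel=16):
    """Bytes needed for one full frame (RGB565 uses 16 bits per pixel)."""
    return width * height * bits_per_pixel // 8


def bus_mbit_per_s(width, height, fps, bits_per_pixel=16):
    """Raw pixel throughput for full-frame redraws, in Mbit/s (no protocol overhead)."""
    return width * height * bits_per_pixel * fps / 1000000


fb = framebuffer_bytes(320, 240)
print(fb, "bytes =", fb / 1024, "KiB")  # 153600 bytes = 150.0 KiB
print(bus_mbit_per_s(320, 240, 60), "Mbit/s at 60 FPS")
print(bus_mbit_per_s(320, 240, 30), "Mbit/s at 30 FPS")
```

One full frame is 150 KiB, which is consistent with the ~200KB "framebuffer + assets" estimate used for the status display. Sustained 60 FPS full redraws would need roughly 74 Mbit/s of raw pixel traffic, so lower frame rates or partial redraws are the more realistic target.
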
### Camera
- **Resolution:** Various (OV2640 common: 2MP, up to 1600x1200)
- **Interface:** DVP (Digital Video Port)
- **Frame Rate:** Up to 60 FPS (lower at high resolution)
- **Status:** ✅ Often included with Maix Duino kits

### K210 Resources
- **CPU:** Dual-core RISC-V @ 400MHz
- **KPU:** Neural network accelerator
- **SRAM:** 8MB total (6MB available for apps)
- **Flash:** 16MB

---

## LCD Usage for Voice Assistant

### Use Case 1: Status Display (Minimal Overhead)
**What to Show:**
- Current state (idle/listening/processing/responding)
- Wake word detected indicator
- WiFi status and signal strength
- Server connection status
- Volume level
- Time/date

**Overhead:**
- **CPU:** ~2-5% (simple text/icons)
- **RAM:** ~200KB (framebuffer + assets)
- **Power:** ~50mW additional
- **Complexity:** Low (MaixPy has built-in LCD support)

**Code Example:**
```python
import lcd
import image

lcd.init()
lcd.rotation(2)  # Rotate if needed

# Simple status display
img = image.Image(size=(320, 240))
img.draw_string(10, 10, "Listening...", color=(0, 255, 0), scale=3)
img.draw_circle(300, 20, 10, color=(0, 255, 0), fill=True)  # Status LED
lcd.display(img)
```

**Verdict:** ✅ **Very Low Overhead - Highly Recommended**

---

### Use Case 2: Audio Waveform Visualizer (Moderate Overhead)

#### Input Waveform (Microphone)
**What to Show:**
- Real-time audio level meter
- Waveform display (oscilloscope style)
- VU meter
- Frequency spectrum (simple bars)

**Overhead:**
- **CPU:** ~10-15% (real-time drawing)
- **RAM:** ~300KB (framebuffer + audio buffer)
- **Frame Rate:** 15-30 FPS (sufficient for audio visualization)
- **Complexity:** Moderate (drawing primitives + FFT)

**Implementation:**
```python
import lcd, image

lcd.init()

def draw_waveform(audio_buffer):
    """Draw signed 16-bit PCM samples as a waveform plus a level bar."""
    width = 320
    height = 240
    center = height // 2
    img = image.Image(size=(width, height))

    # Sample every Nth point to fit on screen (step is at least 1)
    step = max(1, len(audio_buffer) // width)
    count = min(width, len(audio_buffer) // step)

    # Scale samples (-32768..32767) so every y stays inside the screen
    ys = [center - (audio_buffer[x * step] * (center - 1)) // 32768
          for x in range(count)]
    for x in range(count - 1):
        img.draw_line(x, ys[x], x + 1, ys[x + 1], color=(0, 255, 0))

    # Add level meter (peak amplitude, clamped to full scale)
    level = min(32767, max(abs(min(audio_buffer)), abs(max(audio_buffer))))
    bar_height = (level * height) // 32768
    img.draw_rectangle(0, height - bar_height, 20, bar_height,
                       color=(0, 255, 0), fill=True)

    lcd.display(img)
```

**Verdict:** ✅ **Moderate Overhead - Feasible and Cool!**

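Picking every Nth sample can miss short peaks between the chosen points. One common alternative, shown here only as a sketch and not as the approach this analysis commits to, is min/max decimation: each screen column covers a bucket of samples and is drawn as a vertical line from the bucket's minimum to its maximum. The helper names below are hypothetical, and the code is plain Python so it can be checked off-device.

```python
def column_envelopes(samples, width):
    """Split samples into `width` buckets and return (min, max) per bucket."""
    n = len(samples)
    if n == 0:
        return []
    cols = []
    for x in range(width):
        start = x * n // width
        end = max(start + 1, (x + 1) * n // width)  # never an empty bucket
        chunk = samples[start:end]
        cols.append((min(chunk), max(chunk)))
    return cols


def to_screen_y(sample, height):
    """Map a signed 16-bit sample to a row; positive samples go upward."""
    center = height // 2
    return center - (sample * (center - 1)) // 32768


cols = column_envelopes([0, 32767, -32768, 0], width=2)
print(cols)
print([(to_screen_y(hi, 240), to_screen_y(lo, 240)) for lo, hi in cols])
```

On the device, each `(lo, hi)` pair would translate into one `img.draw_line(x, y_hi, x, y_lo, ...)` call per column.
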
---

#### Output Waveform (TTS Response)
**What to Show:**
- TTS audio being played back
- Speaking animation (mouth/sound waves)
- Response text scrolling

**Overhead:**
- **CPU:** ~10-15% (similar to input)
- **RAM:** ~300KB
- **Complexity:** Moderate

**Note:** Can reuse same visualization code as input waveform.

**Verdict:** ✅ **Same as Input - Totally Doable**

---

### Use Case 3: Spectrum Analyzer (Higher Overhead)
**What to Show:**
- Frequency bars (FFT visualization)
- 8-16 frequency bands
- Classic "equalizer" look

**Overhead:**
- **CPU:** ~20-30% (FFT computation + drawing)
- **RAM:** ~500KB (FFT buffers + framebuffer)
- **Complexity:** Moderate-High (FFT required)

**Implementation Note:**
- The K210 has a dedicated hardware FFT accelerator (separate from the KPU) that can offload the transform
- Can do simple 8-band analysis with minimal CPU
- More bands = more CPU

**Verdict:** ⚠️ **Higher Overhead - Use Sparingly**

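For a handful of bands, a full FFT is not strictly necessary. As an illustration only, the sketch below uses the Goertzel algorithm, a different technique from the FFT discussed above, to measure power at a few chosen frequencies. The 16 kHz sample rate, 256-sample window, and band centres are assumptions made for the example, not values from this project.

```python
import math


def goertzel_power(samples, sample_rate, freq):
    """Signal power at a single frequency using the Goertzel recurrence."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / sample_rate)
    s_prev = 0.0
    s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2 = s_prev
        s_prev = s
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2


def band_powers(samples, sample_rate, centers):
    """One Goertzel pass per band centre frequency."""
    return [goertzel_power(samples, sample_rate, f) for f in centers]


# Assumed values for illustration: 16 kHz audio, 256-sample window, 8 bands
SAMPLE_RATE = 16000
CENTERS = [125, 250, 500, 1000, 2000, 3000, 4000, 6000]

# Synthetic 1 kHz test tone
tone = [math.sin(2.0 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(256)]
powers = band_powers(tone, SAMPLE_RATE, CENTERS)
loudest = CENTERS[powers.index(max(powers))]
print(loudest)  # → 1000
```

Cost grows linearly with the number of bands, which matches the "more bands = more CPU" note above. Each band's power would then be scaled to a bar height before drawing.
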
---

### Use Case 4: Interactive UI (High Overhead)
**What to Show:**
- Touchscreen controls (if touchscreen available)
- Settings menu
- Volume slider
- Wake word selection
- Network configuration

**Overhead:**
- **CPU:** ~20-40% (touch detection + UI rendering)
- **RAM:** ~1MB (UI framework + assets)
- **Complexity:** High (need UI framework)

**Verdict:** ⚠️ **High Overhead - Nice-to-Have Later**

---

## Camera Usage for Voice Assistant

### Use Case 1: Person Detection (Wake on Face)
**What to Do:**
- Detect person in frame
- Only listen when someone present
- Privacy mode: disable when no one around

**Overhead:**
- **CPU:** ~30-40% (KPU handles inference)
- **RAM:** ~1.5MB (model + frame buffers)
- **Power:** ~200mW additional
- **Complexity:** Moderate (pre-trained models available)

**Pros:**
- ✅ Privacy enhancement (only listen when occupied)
- ✅ Power saving (sleep when empty room)
- ✅ Pre-trained models available for K210

**Cons:**
- ❌ Adds latency (check camera before listening)
- ❌ Privacy concerns (camera always on)
- ❌ Moderate resource usage

**Verdict:** 🤔 **Interesting but Complex - Phase 2+**

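The "only listen when someone is present" idea comes down to a small gate with a hold time, so that one missed detection does not immediately mute the microphone. A minimal sketch, assuming a hypothetical detector that reports one boolean per frame; the class name and the 10-second hold are illustrative choices, not values from this analysis.

```python
class PresenceGate:
    """Keep listening enabled for `hold_ms` after the last positive detection."""

    def __init__(self, hold_ms=10000):
        self.hold_ms = hold_ms
        self.last_seen_ms = None  # no person seen yet

    def update(self, person_detected, now_ms):
        """Feed one detection result; return True if the mic should be active."""
        if person_detected:
            self.last_seen_ms = now_ms
        if self.last_seen_ms is None:
            return False
        return now_ms - self.last_seen_ms <= self.hold_ms


gate = PresenceGate(hold_ms=10000)
print(gate.update(False, 0))      # False: nobody seen yet
print(gate.update(True, 1000))    # True: person detected
print(gate.update(False, 9000))   # True: still inside the hold window
print(gate.update(False, 12000))  # False: hold window expired
```

On the device, timestamps would typically come from MicroPython's `time.ticks_ms()`, which wraps around, so the subtraction would need `time.ticks_diff()` there.
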
---

### Use Case 2: Visual Context (Future AI Integration)
**What to Do:**
- "What am I holding?" queries
- Visual scene understanding
- QR code scanning
- Gesture control

**Overhead:**
- **CPU:** 40-60% (vision processing)
- **RAM:** 2-3MB (models + buffers)
- **Complexity:** High (requires vision models)

**Verdict:** ❌ **Too Complex for Initial Release - Future Feature**

---

### Use Case 3: Visual Wake Word (Gesture Detection)
**What to Do:**
- Wave hand to activate
- Thumbs up/down for feedback
- Alternative to voice wake word

**Overhead:**
- **CPU:** ~30-40% (gesture detection)
- **RAM:** ~1.5MB
- **Complexity:** Moderate-High

**Verdict:** 🤔 **Novel Idea - Phase 3+**

---

## Recommended LCD Implementation

### Phase 1: Basic Status Display (Recommended NOW)
```
┌─────────────────────────┐
│ Voice Assistant         │
│                         │
│ Status: Listening ●     │
│ WiFi: ████░░ 75%        │
│ Server: Connected       │
│                         │
│ Volume: [██████░░░]     │
│                         │
│ Time: 14:23             │
└─────────────────────────┘
```

**Features:**
- Current state indicator
- WiFi signal strength
- Server connection status
- Volume level bar
- Clock
- Wake word indicator (pulsing circle)

**Overhead:** ~2-5% CPU, 200KB RAM

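The WiFi and volume bars in the mock-up reduce to one small calculation: how many cells of a fixed-width bar to fill for a given percentage. The sketch below is a plain-Python illustration; `text_bar` is a hypothetical helper name, and on the LCD the same filled-cell count would drive a filled rectangle's width instead of characters.

```python
def text_bar(percent, cells, full="█", empty="░"):
    """Render a fixed-width bar; percent is clamped to 0..100."""
    percent = max(0, min(100, percent))
    filled = cells * percent // 100  # round down, as in the mock-up
    return full * filled + empty * (cells - filled)


print("WiFi: " + text_bar(75, 6) + " 75%")   # WiFi: ████░░ 75%
print("Volume: [" + text_bar(67, 9) + "]")   # Volume: [██████░░░]
```

Rounding down means a bar only shows as full at exactly 100%, which avoids overstating signal strength or volume.
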
---

### Phase 2: Waveform Visualization (Cool Addition)
```
┌─────────────────────────┐
│ Listening...        [●] │
├─────────────────────────┤
│    ╱╲  ╱╲    ╱╲  ╱╲     │
│   ╱  ╲╱  ╲  ╱  ╲╱  ╲    │
│                         │
│ Level: [████░░░░░░]     │
└─────────────────────────┘
```

**Features:**
- Real-time waveform (15-30 FPS)
- Audio level meter
- State indicator
- Simple and clean

**Overhead:** ~10-15% CPU, 300KB RAM

---

### Phase 3: Enhanced Visualizer (Polish)
```
┌─────────────────────────┐
│ Hey Computer!       [●] │
├─────────────────────────┤
│   ▁▂▃▄▅▆▇█   ▁▂▃▄▅▆▇█   │
│   ▁▂▃▄▅▆▇█   ▁▂▃▄▅▆▇█   │
│                         │
│ "Turn off the lights"   │
└─────────────────────────┘
```

**Features:**
- Spectrum analyzer (8-16 bands)
- Transcription display
- Animated response
- More polished UI

**Overhead:** ~20-30% CPU, 500KB RAM

---

## Resource Budget Analysis

### Total K210 Resources
- **CPU:** 2 cores @ 400MHz (assume ~100% available)
- **RAM:** 6MB available for app
- **Bandwidth:** SPI (LCD), I2S (audio), WiFi

### Current Voice Assistant Usage (Server-Side Wake Word)

| Component | CPU % | RAM (KB) |
|-----------|-------|----------|
| Audio Capture (I2S) | 5% | 128 |
| Audio Playback | 5% | 128 |
| WiFi Streaming | 10% | 256 |
| Network Stack | 5% | 512 |
| MaixPy Runtime | 10% | 1024 |
| **Base Total** | **35%** | **2048 (~2MB)** |

### With LCD Features

| Display Mode | CPU % | RAM (KB) | Total CPU | Total RAM |
|--------------|-------|----------|-----------|-----------|
| **None** | 0% | 0 | 35% | 2MB |
| **Status Only** | 2-5% | 200 | 37-40% | 2.2MB |
| **Waveform** | 10-15% | 300 | 45-50% | 2.3MB |
| **Spectrum** | 20-30% | 500 | 55-65% | 2.5MB |

### With Camera Features

| Feature | CPU % | RAM (KB) | Feasible? |
|---------|-------|----------|-----------|
| Person Detection | 30-40% | 1500 | ⚠️ Tight |
| Gesture Control | 30-40% | 1500 | ⚠️ Tight |
| Visual Context | 40-60% | 2500 | ❌ Too much |

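The totals in these tables are simple sums, and keeping them in code makes it easy to re-check the budget when an estimate changes. The sketch below copies the baseline figures and the upper-bound display estimates from the tables above; the dictionary and function names are illustrative.

```python
# (cpu_percent, ram_kb) per component, copied from the baseline table
BASE = {
    "audio_capture": (5, 128),
    "audio_playback": (5, 128),
    "wifi_streaming": (10, 256),
    "network_stack": (5, 512),
    "maixpy_runtime": (10, 1024),
}

# Upper-bound estimates per display mode, from the LCD table
DISPLAY_MODES = {
    "none": (0, 0),
    "status": (5, 200),
    "waveform": (15, 300),
    "spectrum": (30, 500),
}

APP_RAM_KB = 6 * 1024  # 6MB available for apps


def budget(mode):
    """Return (total CPU %, total RAM KB, free RAM KB) for a display mode."""
    cpu = sum(c for c, _ in BASE.values()) + DISPLAY_MODES[mode][0]
    ram = sum(r for _, r in BASE.values()) + DISPLAY_MODES[mode][1]
    return cpu, ram, APP_RAM_KB - ram


for mode in DISPLAY_MODES:
    cpu, ram, free = budget(mode)
    print(mode, cpu, "% CPU,", ram, "KB RAM,", free, "KB free")
```

Even the spectrum mode leaves about 3.5MB of the 6MB application RAM free in this estimate, so CPU rather than memory is the tighter constraint for the LCD features.
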
---

## Recommendations

### ✅ IMPLEMENT NOW: Basic Status Display
- **Why:** Very low overhead, huge UX improvement
- **Overhead:** 2-5% CPU, 200KB RAM
- **Benefit:** Users know what's happening at a glance
- **Difficulty:** Easy (MaixPy has good LCD support)

### ✅ IMPLEMENT SOON: Waveform Visualizer
- **Why:** Cool factor, moderate overhead
- **Overhead:** 10-15% CPU, 300KB RAM
- **Benefit:** Engaging, confirms mic is working, looks professional
- **Difficulty:** Moderate (simple drawing code)

### 🤔 CONSIDER LATER: Spectrum Analyzer
- **Why:** Higher overhead, diminishing returns
- **Overhead:** 20-30% CPU, 500KB RAM
- **Benefit:** Looks cool but not essential
- **Difficulty:** Moderate-High (FFT required)

### ❌ SKIP FOR NOW: Camera Features
- **Why:** High overhead, complex, privacy concerns
- **Overhead:** 30-60% CPU, 1.5-2.5MB RAM
- **Benefit:** Novel but not core functionality
- **Difficulty:** High (model integration, privacy handling)

---

## Implementation Priority

### Phase 1 (Week 1): Core Functionality
- [x] Audio capture and streaming
- [x] Server integration
- [ ] Basic LCD status display
  - Idle/Listening/Processing states
  - WiFi status
  - Connection indicator

### Phase 2 (Week 2-3): Visual Enhancement
- [ ] Audio waveform visualizer
  - Input (microphone) waveform
  - Output (TTS) waveform
  - Level meters
  - Clean, minimal design

### Phase 3 (Month 2): Polish
- [ ] Spectrum analyzer option
- [ ] Animated transitions
- [ ] Settings display
- [ ] Network configuration UI (optional)

### Phase 4 (Month 3+): Advanced Features
- [ ] Camera person detection (privacy mode)
- [ ] Gesture control experiments
- [ ] Visual wake word alternative

---

## Code Structure Recommendation

```python
# main.py structure with modular display
# display_manager, audio_processor and voice_client are the proposed
# project modules; their classes and methods are placeholders for now.

import lcd
from display_manager import DisplayManager
from audio_processor import AudioProcessor
from voice_client import VoiceClient

# Initialize
lcd.init()
display = DisplayManager(mode='waveform')  # or 'status' or 'spectrum'
audio_processor = AudioProcessor()
voice_client = VoiceClient()

# Main loop
while True:
    # Audio processing
    audio_buffer = audio_processor.capture()

    # Update display (non-blocking)
    if display.mode == 'status':
        display.show_status(state='listening', wifi_level=75)
    elif display.mode == 'waveform':
        display.show_waveform(audio_buffer)
    elif display.mode == 'spectrum':
        display.show_spectrum(audio_buffer)

    # Network communication
    voice_client.stream_audio(audio_buffer)
```

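Keeping the display update "non-blocking" in a loop like this usually means skipping redraws that arrive too soon, so audio capture and streaming are never held up by drawing. A minimal sketch of such a throttle, assuming timestamps in milliseconds are passed in by the caller; `FrameLimiter` is a hypothetical helper, not part of MaixPy.

```python
class FrameLimiter:
    """Allow a redraw only when at least 1/fps seconds have passed."""

    def __init__(self, fps):
        self.interval_ms = 1000 // fps
        self.next_due_ms = 0

    def ready(self, now_ms):
        """Return True (and schedule the next frame) if a redraw is due."""
        if now_ms < self.next_due_ms:
            return False
        self.next_due_ms = now_ms + self.interval_ms
        return True


limiter = FrameLimiter(fps=20)  # 50 ms between redraws
# Simulate a loop that iterates every 10 ms for 200 ms
frames = [t for t in range(0, 200, 10) if limiter.ready(t)]
print(frames)  # → [0, 50, 100, 150]
```

In the main loop, the display branch would only run when `limiter.ready(now)` is true, while audio capture and streaming run on every iteration. A target of 15-30 FPS matches the waveform estimate earlier in this document.
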
---

## Estimated Overhead (Not Yet Measured)

### Status Display Only
- **CPU:** 38% total (3% for display)
- **RAM:** 2.2MB total (200KB for display)
- **Battery Life:** -2% (minimal impact)
- **WiFi Latency:** No impact
- **Verdict:** ✅ Negligible impact, worth it!

### Waveform Visualizer
- **CPU:** 48% total (13% for display)
- **RAM:** 2.3MB total (300KB for display)
- **Battery Life:** -5% (minor impact)
- **WiFi Latency:** No impact (still <200ms)
- **Verdict:** ✅ Acceptable, looks great!

### Spectrum Analyzer
- **CPU:** 60% total (25% for display)
- **RAM:** 2.5MB total (500KB for display)
- **Battery Life:** -8% (noticeable)
- **WiFi Latency:** Possible minor impact
- **Verdict:** ⚠️ Usable but pushing limits

---

## Camera: Should You Use It?

### Pros
- ✅ Already have the hardware (free!)
- ✅ Novel features (person detection, gestures)
- ✅ Privacy enhancement potential
- ✅ Future-proofing

### Cons
- ❌ High resource usage (30-60% CPU, 1.5-2.5MB RAM)
- ❌ Complex implementation
- ❌ Privacy concerns (camera always on)
- ❌ Not core to voice assistant
- ❌ Competes with audio processing resources

### Recommendation
**Skip camera for initial implementation.** Focus on core voice assistant functionality. Revisit in Phase 3+ when:
1. Core features are stable
2. You want to experiment
3. You have time for optimization
4. You want to differentiate from commercial assistants

---

## Final Recommendations

### Start With (NOW):
```python
# Simple status display
# - State indicator
# - WiFi status
# - Connection status
# - Time/date
# Overhead: ~3% CPU, 200KB RAM
```

### Add Next (Week 2):
```python
# Waveform visualizer
# - Real-time audio waveform
# - Level meter
# - Clean design
# Overhead: +10% CPU, +100KB RAM (on top of the status display)
```

### Maybe Later (Month 2+):
```python
# Spectrum analyzer
# - 8-16 frequency bands
# - FFT visualization
# - Optional mode
# Overhead: +15% CPU, +200KB RAM (on top of the waveform visualizer)
```

### Skip (For Now):
```python
# Camera features
# - Person detection
# - Gestures
# - Visual context
# Too complex, revisit later
```

---

## Example: Combined Status + Waveform Display

```
┌───────────────────────────────┐
│ Voice Assistant    [LISTENING]│
├───────────────────────────────┤
│                               │
│    ╱╲    ╱╲  ╱╲    ╱╲  ╱╲     │
│   ╱  ╲  ╱  ╲╱  ╲  ╱  ╲╱  ╲    │
│       ╲╱        ╲╱            │
│                               │
│ Vol: [████████░░]  WiFi: ▂▃▅█ │
│                               │
│ Server: 10.1.10.71 ●    14:23 │
└───────────────────────────────┘
```

**Total Overhead:** ~15% CPU, 300KB RAM
**Impact:** Minimal, excellent UX improvement
**Coolness Factor:** 9/10

---

## Conclusion

### LCD: YES! Definitely Use It! ✅
- **Status display:** Low overhead, huge benefit
- **Waveform:** Moderate overhead, looks amazing
- **Spectrum:** Higher overhead, nice-to-have

**Recommendation:** Start with status, add waveform, consider spectrum later.

### Camera: Skip For Now ❌
- High overhead
- Complex implementation
- Not core functionality
- Revisit in Phase 3+

**Focus on nailing the voice assistant first, then add visual features incrementally!**

---

**TL;DR:** Use the LCD for status + waveform visualization (~15% overhead total). Skip the camera for now. Your K210 can easily handle this! 🎉