Maix Duino LCD & Camera Feature Analysis
Date: 2025-11-29
Hardware: Sipeed Maix Duino (K210)
Question: What's the overhead for using LCD display and camera?
Hardware Capabilities
LCD Display
- Resolution: Typically 320x240 or 240x135 (depending on model)
- Interface: SPI
- Color: RGB565 (16-bit color)
- Frame Rate: Up to 60 FPS (limited by SPI bandwidth)
- Status: ✅ Included with most Maix Duino kits
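To put the resolution, color depth, and frame rate figures above in context, here is a rough back-of-envelope calculation. It assumes the 320x240 panel and full-frame refreshes, and it ignores SPI command overhead; the function names are made up for this sketch.

```python
# Back-of-envelope numbers for a 320x240 RGB565 panel (values from the list above).
WIDTH, HEIGHT = 320, 240
BYTES_PER_PIXEL = 2  # RGB565 = 16 bits per pixel


def framebuffer_bytes(width=WIDTH, height=HEIGHT, bpp=BYTES_PER_PIXEL):
    """Size of one full frame in bytes."""
    return width * height * bpp


def spi_mbit_per_s(fps, width=WIDTH, height=HEIGHT, bpp=BYTES_PER_PIXEL):
    """Raw pixel throughput for full-frame refreshes (command overhead ignored)."""
    return framebuffer_bytes(width, height, bpp) * 8 * fps / 1_000_000


print(framebuffer_bytes())   # 153600 bytes, i.e. 150 KiB per frame
print(spi_mbit_per_s(60))    # 73.728
print(spi_mbit_per_s(20))    # 24.576
```

A single full framebuffer is about 150 KiB, which is where most of the display RAM estimates later in this document come from. Refreshing at 15-30 FPS needs far less bus bandwidth than the 60 FPS ceiling.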
Camera
- Resolution: Various (OV2640 common: 2MP, up to 1600x1200)
- Interface: DVP (Digital Video Port)
- Frame Rate: Up to 60 FPS (lower at high resolution)
- Status: ✅ Often included with Maix Duino kits
K210 Resources
- CPU: Dual-core RISC-V @ 400MHz
- KPU: Neural network accelerator
- SRAM: 8MB total (6MB general-purpose and available for apps, 2MB reserved for the KPU)
- Flash: 16MB
LCD Usage for Voice Assistant
Use Case 1: Status Display (Minimal Overhead)
What to Show:
- Current state (idle/listening/processing/responding)
- Wake word detected indicator
- WiFi status and signal strength
- Server connection status
- Volume level
- Time/date
Overhead:
- CPU: ~2-5% (simple text/icons)
- RAM: ~200KB (framebuffer + assets)
- Power: ~50mW additional
- Complexity: Low (MaixPy has built-in LCD support)
Code Example:
import lcd
import image
lcd.init()
lcd.rotation(2) # Rotate if needed
# Simple status display
img = image.Image(size=(320, 240))
img.draw_string(10, 10, "Listening...", color=(0, 255, 0), scale=3)
img.draw_circle(300, 20, 10, color=(0, 255, 0), fill=True) # Status LED
lcd.display(img)
Verdict: ✅ Very Low Overhead - Highly Recommended
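The four states listed under "What to Show" can be driven from a small lookup table so the drawing code stays in one place. This is only a sketch: the labels and RGB values are assumptions chosen for illustration, and `status_line` is a hypothetical helper name.

```python
# Assumed palette: these labels and RGB values are illustrative, not project settings.
STATE_STYLES = {
    "idle":       ("Idle",          (128, 128, 128)),
    "listening":  ("Listening...",  (0, 255, 0)),
    "processing": ("Processing...", (255, 165, 0)),
    "responding": ("Responding...", (0, 128, 255)),
}


def status_line(state):
    """Return (label, rgb) for a state, falling back to a visible error style."""
    return STATE_STYLES.get(state, ("Unknown: " + str(state), (255, 0, 0)))


label, color = status_line("listening")
print(label, color)  # Listening... (0, 255, 0)
```

On the device, the returned pair could be passed straight to the `img.draw_string(..., color=color, scale=3)` call from the example above.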
Use Case 2: Audio Waveform Visualizer (Moderate Overhead)
Input Waveform (Microphone)
What to Show:
- Real-time audio level meter
- Waveform display (oscilloscope style)
- VU meter
- Frequency spectrum (simple bars)
Overhead:
- CPU: ~10-15% (real-time drawing)
- RAM: ~300KB (framebuffer + audio buffer)
- Frame Rate: 15-30 FPS (sufficient for audio visualization)
- Complexity: Moderate (drawing primitives + FFT)
Implementation:
import lcd
import image

lcd.init()

def draw_waveform(audio_buffer):
    """Draw signed 16-bit samples (captured elsewhere) as a trace plus level bar."""
    width = 320
    height = 240
    center = height // 2
    img = image.Image(size=(width, height))
    if len(audio_buffer) < 2:
        lcd.display(img)  # Nothing to plot yet
        return
    # Sample every Nth point to fit on screen (step is at least 1)
    step = max(1, len(audio_buffer) // width)
    columns = min(width, len(audio_buffer) // step)
    for x in range(columns - 1):
        # 16-bit samples // 256 gives roughly +/-128 px around the center line
        y1 = center + (audio_buffer[x * step] // 256)
        y2 = center + (audio_buffer[(x + 1) * step] // 256)
        img.draw_line(x, y1, x + 1, y2, color=(0, 255, 0))
    # Add level meter: peak amplitude scaled to the screen height
    level = max(abs(min(audio_buffer)), abs(max(audio_buffer)))
    bar_height = min(height, (level * height) // 32768)
    img.draw_rectangle(0, height - bar_height, 20, bar_height,
                       color=(0, 255, 0), fill=True)
    lcd.display(img)
Verdict: ✅ Moderate Overhead - Feasible and Cool!
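Sampling every Nth point can miss short peaks between the sampled positions. One alternative, sketched below in plain Python so it can be tried off-device, keeps the minimum and maximum of each column's slice of samples. The function name and the clamping behavior are assumptions for this sketch, not part of the existing scripts.

```python
def column_envelope(samples, width=320, height=240, full_scale=32768):
    """Return one (y_top, y_bottom) pixel pair per screen column.

    samples: signed 16-bit values. Positive samples are drawn above the
    center line (smaller y), and results are clamped to the screen.
    """
    center = height // 2
    n = len(samples)
    cols = min(width, n)
    envelope = []
    for x in range(cols):
        start = x * n // cols
        end = max(start + 1, (x + 1) * n // cols)
        chunk = samples[start:end]
        hi, lo = max(chunk), min(chunk)
        y_top = max(0, min(height - 1, center - hi * center // full_scale))
        y_bottom = max(0, min(height - 1, center - lo * center // full_scale))
        envelope.append((y_top, y_bottom))
    return envelope


silence = column_envelope([0] * 640)
print(len(silence), silence[0])  # 320 (120, 120)
```

On the device, each pair could be drawn as a vertical `img.draw_line(x, y_top, x, y_bottom, ...)`, which costs the same number of draw calls as the line-per-column loop above.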
Output Waveform (TTS Response)
What to Show:
- TTS audio being played back
- Speaking animation (mouth/sound waves)
- Response text scrolling
Overhead:
- CPU: ~10-15% (similar to input)
- RAM: ~300KB
- Complexity: Moderate
Note: Can reuse same visualization code as input waveform.
Verdict: ✅ Same as Input - Totally Doable
Use Case 3: Spectrum Analyzer (Higher Overhead)
What to Show:
- Frequency bars (FFT visualization)
- 8-16 frequency bands
- Classic "equalizer" look
Overhead:
- CPU: ~20-30% (FFT computation + drawing)
- RAM: ~500KB (FFT buffers + framebuffer)
- Complexity: Moderate-High (FFT required)
Implementation Note:
- The K210 has a dedicated hardware FFT accelerator (separate from the KPU) that can offload FFT operations
- Can do simple 8-band analysis with minimal CPU
- More bands = more CPU
Verdict: ⚠️ Higher Overhead - Use Sparingly
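To make the "8-16 frequency bands" idea concrete, the sketch below groups FFT magnitudes into equal-width bands. It uses a naive pure-Python DFT so it runs anywhere; on the device the transform itself would come from hardware instead. The function names and the equal-width banding are assumptions for illustration (a real equalizer display would more likely use logarithmic band edges).

```python
import cmath
import math


def dft_magnitudes(samples):
    """Naive O(n^2) DFT; returns magnitudes for bins 0 .. n/2 - 1."""
    n = len(samples)
    mags = []
    for k in range(n // 2):
        acc = 0j
        for t, s in enumerate(samples):
            acc += s * cmath.exp(-2j * cmath.pi * k * t / n)
        mags.append(abs(acc) / n)
    return mags


def band_levels(mags, bands=8):
    """Average magnitudes into equal-width bands, skipping the DC bin.

    Leftover bins at the top that do not fill a whole band are ignored.
    """
    usable = mags[1:]
    per_band = max(1, len(usable) // bands)
    levels = []
    for b in range(bands):
        chunk = usable[b * per_band:(b + 1) * per_band]
        levels.append(sum(chunk) / len(chunk) if chunk else 0.0)
    return levels


# A pure tone in bin 10 of a 64-sample window should land in band 3.
tone = [math.sin(2 * math.pi * 10 * t / 64) for t in range(64)]
levels = band_levels(dft_magnitudes(tone))
print(levels.index(max(levels)))  # 3
```

Each band level can then be scaled to a bar height, which is the only per-frame drawing work. That is why fewer bands cost less CPU.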
Use Case 4: Interactive UI (High Overhead)
What to Show:
- Touchscreen controls (if touchscreen available)
- Settings menu
- Volume slider
- Wake word selection
- Network configuration
Overhead:
- CPU: ~20-40% (touch detection + UI rendering)
- RAM: ~1MB (UI framework + assets)
- Complexity: High (need UI framework)
Verdict: ⚠️ High Overhead - Nice-to-Have Later
Camera Usage for Voice Assistant
Use Case 1: Person Detection (Wake on Face)
What to Do:
- Detect person in frame
- Only listen when someone present
- Privacy mode: disable when no one around
Overhead:
- CPU: ~30-40% (KPU handles inference)
- RAM: ~1.5MB (model + frame buffers)
- Power: ~200mW additional
- Complexity: Moderate (pre-trained models available)
Pros:
- ✅ Privacy enhancement (only listen when occupied)
- ✅ Power saving (sleep when empty room)
- ✅ Pre-trained models available for K210
Cons:
- ❌ Adds latency (check camera before listening)
- ❌ Privacy concerns (camera always on)
- ❌ Moderate resource usage
Verdict: 🤔 Interesting but Complex - Phase 4 (Month 3+)
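The "only listen when someone is present" behavior needs some hysteresis, otherwise a single missed detection would cut the microphone mid-sentence. A minimal sketch of such a gate follows, assuming the detector reports a boolean per frame. The class name and the hold time are hypothetical.

```python
class PresenceGate:
    """Keeps listening enabled for hold_s seconds after the last detection."""

    def __init__(self, hold_s=10.0):
        self.hold_s = hold_s
        self._last_seen = None  # timestamp of the most recent detection

    def update(self, person_detected, now):
        """Feed one detector result; returns True while listening is allowed."""
        if person_detected:
            self._last_seen = now
        return self.listening(now)

    def listening(self, now):
        if self._last_seen is None:
            return False
        return (now - self._last_seen) <= self.hold_s


gate = PresenceGate(hold_s=5.0)
print(gate.update(False, now=0.0))  # False: nobody seen yet
print(gate.update(True, now=1.0))   # True: person in frame
print(gate.update(False, now=5.0))  # True: still inside the hold window
print(gate.update(False, now=6.5))  # False: hold window expired
```

Because the gate only needs an occasional detection, the camera model could run at a low frame rate, which would reduce the CPU share estimated above.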
Use Case 2: Visual Context (Future AI Integration)
What to Do:
- "What am I holding?" queries
- Visual scene understanding
- QR code scanning
- Gesture control
Overhead:
- CPU: 40-60% (vision processing)
- RAM: 2-3MB (models + buffers)
- Complexity: High (requires vision models)
Verdict: ❌ Too Complex for Initial Release - Future Feature
Use Case 3: Visual Wake Word (Gesture Detection)
What to Do:
- Wave hand to activate
- Thumbs up/down for feedback
- Alternative to voice wake word
Overhead:
- CPU: ~30-40% (gesture detection)
- RAM: ~1.5MB
- Complexity: Moderate-High
Verdict: 🤔 Novel Idea - Phase 3+
Recommended LCD Implementation
Phase 1: Basic Status Display (Recommended NOW)
┌─────────────────────────┐
│ Voice Assistant         │
│                         │
│ Status: Listening ●     │
│ WiFi: ████░░ 75%        │
│ Server: Connected       │
│                         │
│ Volume: [██████░░░]     │
│                         │
│ Time: 14:23             │
└─────────────────────────┘
Features:
- Current state indicator
- WiFi signal strength
- Server connection status
- Volume level bar
- Clock
- Wake word indicator (pulsing circle)
Overhead: ~2-5% CPU, 200KB RAM
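The WiFi and volume bars in the mock-up can be generated from a percentage with a few lines of integer math. A small sketch, with `text_bar` as a hypothetical helper name:

```python
def text_bar(percent, cells=6, full="█", empty="░"):
    """Render a fixed-width bar, e.g. 75% of 6 cells -> 4 filled cells."""
    percent = max(0, min(100, percent))  # clamp out-of-range readings
    filled = percent * cells // 100
    return full * filled + empty * (cells - filled)


print("WiFi: " + text_bar(75) + " 75%")          # WiFi: ████░░ 75%
print("Volume: [" + text_bar(67, cells=9) + "]")  # Volume: [██████░░░]
```

If the LCD font lacks block glyphs, the same `filled` count could set the width of a filled rectangle instead.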
Phase 2: Waveform Visualization (Cool Addition)
┌─────────────────────────┐
│ Listening...        [●] │
├─────────────────────────┤
│    ╱╲  ╱╲     ╱╲  ╱╲    │
│   ╱  ╲╱  ╲   ╱  ╲╱  ╲   │
│                         │
│ Level: [████░░░░░░]     │
└─────────────────────────┘
Features:
- Real-time waveform (15-30 FPS)
- Audio level meter
- State indicator
- Simple and clean
Overhead: ~10-15% CPU, 300KB RAM
Phase 3: Enhanced Visualizer (Polish)
┌─────────────────────────┐
│ Hey Computer!       [●] │
├─────────────────────────┤
│  ▁▂▃▄▅▆▇█  ▁▂▃▄▅▆▇█     │
│  ▁▂▃▄▅▆▇█  ▁▂▃▄▅▆▇█     │
│                         │
│ "Turn off the lights"   │
└─────────────────────────┘
Features:
- Spectrum analyzer (8-16 bands)
- Transcription display
- Animated response
- More polished UI
Overhead: ~20-30% CPU, 500KB RAM
Resource Budget Analysis
Total K210 Resources
- CPU: 2 cores @ 400MHz (assume ~100% available)
- RAM: 6MB available for app
- Bandwidth: SPI (LCD), I2S (audio), WiFi
Current Voice Assistant Usage (Server-Side Wake Word)
| Component | CPU % | RAM (KB) |
|---|---|---|
| Audio Capture (I2S) | 5% | 128 |
| Audio Playback | 5% | 128 |
| WiFi Streaming | 10% | 256 |
| Network Stack | 5% | 512 |
| MaixPy Runtime | 10% | 1024 |
| Base Total | 35% | 2048 (~2MB) |
With LCD Features
| Display Mode | CPU % | RAM (KB) | Total CPU | Total RAM |
|---|---|---|---|---|
| None | 0% | 0 | 35% | 2MB |
| Status Only | 2-5% | 200 | 37-40% | 2.2MB |
| Waveform | 10-15% | 300 | 45-50% | 2.3MB |
| Spectrum | 20-30% | 500 | 55-65% | 2.5MB |
With Camera Features
| Feature | CPU % | RAM (KB) | Feasible? |
|---|---|---|---|
| Person Detection | 30-40% | 1500 | ⚠️ Tight |
| Gesture Control | 30-40% | 1500 | ⚠️ Tight |
| Visual Context | 40-60% | 2500 | ❌ Too much |
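The totals in these tables are simple sums, so they are easy to recompute when an estimate changes. The sketch below copies the figures from the tables above and adds them up for any display and camera combination. The dictionary layout and the `budget` function are just one way to organize it.

```python
RAM_AVAILABLE_KB = 6 * 1024  # 6MB available for the app

# (cpu_low %, cpu_high %, ram KB), copied from the tables above
BASE = (35, 35, 2048)
DISPLAY = {
    "none": (0, 0, 0),
    "status": (2, 5, 200),
    "waveform": (10, 15, 300),
    "spectrum": (20, 30, 500),
}
CAMERA = {
    "none": (0, 0, 0),
    "person_detection": (30, 40, 1500),
    "gesture_control": (30, 40, 1500),
    "visual_context": (40, 60, 2500),
}


def budget(display="none", camera="none"):
    """Sum the base load with one display mode and one camera feature."""
    parts = (BASE, DISPLAY[display], CAMERA[camera])
    ram_kb = sum(p[2] for p in parts)
    return {
        "cpu_low": sum(p[0] for p in parts),
        "cpu_high": sum(p[1] for p in parts),
        "ram_kb": ram_kb,
        "ram_free_kb": RAM_AVAILABLE_KB - ram_kb,
    }


print(budget("waveform"))
# {'cpu_low': 45, 'cpu_high': 50, 'ram_kb': 2348, 'ram_free_kb': 3796}
print(budget("waveform", "person_detection"))
# {'cpu_low': 75, 'cpu_high': 90, 'ram_kb': 3848, 'ram_free_kb': 2296}
```

Combining the waveform display with person detection lands at 75-90% CPU, which is why the camera rows are marked as tight even though RAM still has headroom.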
Recommendations
✅ IMPLEMENT NOW: Basic Status Display
- Why: Very low overhead, huge UX improvement
- Overhead: 2-5% CPU, 200KB RAM
- Benefit: Users know what's happening at a glance
- Difficulty: Easy (MaixPy has good LCD support)
✅ IMPLEMENT SOON: Waveform Visualizer
- Why: Cool factor, moderate overhead
- Overhead: 10-15% CPU, 300KB RAM
- Benefit: Engaging, confirms mic is working, looks professional
- Difficulty: Moderate (simple drawing code)
🤔 CONSIDER LATER: Spectrum Analyzer
- Why: Higher overhead, diminishing returns
- Overhead: 20-30% CPU, 500KB RAM
- Benefit: Looks cool but not essential
- Difficulty: Moderate-High (FFT required)
❌ SKIP FOR NOW: Camera Features
- Why: High overhead, complex, privacy concerns
- Overhead: 30-60% CPU, 1.5-2.5MB RAM
- Benefit: Novel but not core functionality
- Difficulty: High (model integration, privacy handling)
Implementation Priority
Phase 1 (Week 1): Core Functionality
- Audio capture and streaming
- Server integration
- Basic LCD status display
- Idle/Listening/Processing states
- WiFi status
- Connection indicator
Phase 2 (Week 2-3): Visual Enhancement
- Audio waveform visualizer
- Input (microphone) waveform
- Output (TTS) waveform
- Level meters
- Clean, minimal design
Phase 3 (Month 2): Polish
- Spectrum analyzer option
- Animated transitions
- Settings display
- Network configuration UI (optional)
Phase 4 (Month 3+): Advanced Features
- Camera person detection (privacy mode)
- Gesture control experiments
- Visual wake word alternative
Code Structure Recommendation
# main.py structure with modular display
# display_manager, audio_processor and voice_client are proposed project modules
import lcd
from display_manager import DisplayManager
from audio_processor import AudioProcessor
from voice_client import VoiceClient
# Initialize
lcd.init()
display = DisplayManager(mode='waveform')  # or 'status' or 'spectrum'
audio_processor = AudioProcessor()
voice_client = VoiceClient()
# Main loop
while True:
    # Audio processing
    audio_buffer = audio_processor.capture()
    # Update display (non-blocking)
    if display.mode == 'status':
        display.show_status(state='listening', wifi_level=75)
    elif display.mode == 'waveform':
        display.show_waveform(audio_buffer)
    elif display.mode == 'spectrum':
        display.show_spectrum(audio_buffer)
    # Network communication
    voice_client.stream_audio(audio_buffer)
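The "non-blocking" display update in the loop above could be approximated by capping the redraw rate so that audio capture and streaming are never delayed by a frame that nobody would notice. Here is a hypothetical sketch of that part of `DisplayManager`, written for desktop Python with an injectable clock. The constructor signature and the `update` method are assumptions, not an existing API, and on MicroPython the clock would need to be built on `time.ticks_ms()` instead.

```python
import time


class DisplayManager:
    """Hypothetical sketch: dispatches to a per-mode renderer, capped at max_fps."""

    def __init__(self, renderers, mode="status", max_fps=20, clock=time.monotonic):
        if mode not in renderers:
            raise ValueError("unknown display mode: %r" % (mode,))
        self.renderers = renderers  # dict: mode name -> callable that draws a frame
        self.mode = mode
        self.min_interval = 1.0 / max_fps
        self.clock = clock
        self._last_draw = None

    def update(self, *args, **kwargs):
        """Draw if enough time has passed; return True when a frame was drawn."""
        now = self.clock()
        if self._last_draw is not None and now - self._last_draw < self.min_interval:
            return False  # skip this frame so audio/network work is not delayed
        self._last_draw = now
        self.renderers[self.mode](*args, **kwargs)
        return True


dm = DisplayManager({"status": lambda **kw: print("status", kw)}, mode="status")
dm.update(state="listening", wifi_level=75)
```

Passing the renderers in as callables keeps the rate limiting testable without an LCD attached, and adding a new mode only means adding one dictionary entry.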
Estimated Overhead (Not Yet Measured)
Status Display Only
- CPU: 38% total (3% for display)
- RAM: 2.2MB total (200KB for display)
- Battery Life: -2% (minimal impact)
- WiFi Latency: No impact
- Verdict: ✅ Negligible impact, worth it!
Waveform Visualizer
- CPU: 48% total (13% for display)
- RAM: 2.3MB total (300KB for display)
- Battery Life: -5% (minor impact)
- WiFi Latency: No impact (still <200ms)
- Verdict: ✅ Acceptable, looks great!
Spectrum Analyzer
- CPU: 60% total (25% for display)
- RAM: 2.5MB total (500KB for display)
- Battery Life: -8% (noticeable)
- WiFi Latency: Possible minor impact
- Verdict: ⚠️ Usable but pushing limits
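Since the figures above are estimates, it would be worth replacing them with measurements once the display code exists. One simple approach is to time a batch of draw calls and scale by the target frame rate. The sketch below assumes the draw call blocks for its full duration; `cpu_share` is a hypothetical helper, and on MicroPython the timer would need to be based on `time.ticks_us()` or similar.

```python
import time


def cpu_share(draw_fn, frames=50, target_fps=20, timer=time.perf_counter):
    """Estimate the fraction of one core spent in draw_fn at target_fps."""
    start = timer()
    for _ in range(frames):
        draw_fn()
    per_frame = (timer() - start) / frames  # seconds per frame
    return per_frame * target_fps           # 0.13 means roughly 13% of one core


def fake_draw():
    """Stand-in workload so the sketch runs without an LCD."""
    sum(range(1000))


print("share of one core: %.3f" % cpu_share(fake_draw))
```

If part of the frame transfer is handled by DMA rather than the CPU, this method overstates the real CPU cost, so the result is best read as an upper bound.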
Camera: Should You Use It?
Pros
- ✅ Already have the hardware (free!)
- ✅ Novel features (person detection, gestures)
- ✅ Privacy enhancement potential
- ✅ Future-proofing
Cons
- ❌ High resource usage (30-60% CPU, 1.5-2.5MB RAM)
- ❌ Complex implementation
- ❌ Privacy concerns (camera always on)
- ❌ Not core to voice assistant
- ❌ Competes with audio processing resources
Recommendation
Skip the camera for the initial implementation and focus on core voice assistant functionality. Revisit it in Phase 4 (Month 3+) when:
- Core features are stable
- You want to experiment
- You have time for optimization
- You want to differentiate from commercial assistants
Final Recommendations
Start With (NOW):
# Simple status display
# - State indicator
# - WiFi status
# - Connection status
# - Time/date
# Overhead: ~3% CPU, 200KB RAM
Add Next (Week 2):
# Waveform visualizer
# - Real-time audio waveform
# - Level meter
# - Clean design
# Overhead: +10% CPU, +100KB RAM
Maybe Later (Month 2+):
# Spectrum analyzer
# - 8-16 frequency bands
# - FFT visualization
# - Optional mode
# Overhead: +15% CPU, +200KB RAM
Skip (For Now):
# Camera features
# - Person detection
# - Gestures
# - Visual context
# Too complex, revisit later
Example: Combined Status + Waveform Display
┌───────────────────────────────┐
│ Voice Assistant    [LISTENING]│
├───────────────────────────────┤
│                               │
│    ╱╲    ╱╲  ╱╲    ╱╲  ╱╲     │
│   ╱  ╲  ╱  ╲╱  ╲  ╱  ╲╱  ╲    │
│       ╲╱        ╲╱            │
│                               │
│ Vol: [████████░░]  WiFi: ▂▃▅█ │
│                               │
│ Server: 10.1.10.71 ●    14:23 │
└───────────────────────────────┘
Total Overhead: ~15% CPU, 300KB RAM
Impact: Minimal, excellent UX improvement
Coolness Factor: 9/10
Conclusion
LCD: YES! Definitely Use It! ✅
- Status display: Low overhead, huge benefit
- Waveform: Moderate overhead, looks amazing
- Spectrum: Higher overhead, nice-to-have
Recommendation: Start with status, add waveform, consider spectrum later.
Camera: Skip For Now ❌
- High overhead
- Complex implementation
- Not core functionality
- Revisit in Phase 4 (Month 3+)
Focus on nailing the voice assistant first, then add visual features incrementally!
TL;DR: Use the LCD for status + waveform visualization (~15% overhead total). Skip the camera for now. Your K210 can easily handle this! 🎉