minerva/docs/LCD_CAMERA_FEATURES.md


Maix Duino LCD & Camera Feature Analysis

Date: 2025-11-29
Hardware: Sipeed Maix Duino (K210)
Question: What is the overhead of using the LCD display and the camera?


Hardware Capabilities

LCD Display

  • Resolution: Typically 320x240 or 240x135 (depending on model)
  • Interface: SPI
  • Color: RGB565 (16-bit color)
  • Frame Rate: Up to 60 FPS (limited by SPI bandwidth)
  • Status: Included with most Maix Duino kits

Camera

  • Resolution: Various (OV2640 common: 2MP, up to 1600x1200)
  • Interface: DVP (Digital Video Port)
  • Frame Rate: Up to 60 FPS (lower at high resolution)
  • Status: Often included with Maix Duino kits

K210 Resources

  • CPU: Dual-core RISC-V @ 400MHz
  • KPU: Neural network accelerator
  • SRAM: 8MB total (6MB general-purpose, available for apps; 2MB dedicated to the KPU)
  • Flash: 16MB
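
The RAM cost of a display or camera frame follows directly from resolution and pixel format. A rough sketch of that arithmetic, assuming RGB565 (2 bytes per pixel) for both the LCD and the camera; `frame_bytes`, `rgb565`, and `APP_RAM_BYTES` are illustrative names, not MaixPy APIs:

```python
# Illustrative arithmetic only; these names are not part of MaixPy.
APP_RAM_BYTES = 6 * 1024 * 1024  # the 6MB app budget quoted above

def frame_bytes(width, height, bytes_per_pixel=2):
    """Size of one uncompressed frame; RGB565 uses 2 bytes per pixel."""
    return width * height * bytes_per_pixel

def rgb565(r, g, b):
    """Pack 8-bit R, G, B into a 16-bit RGB565 value (5-6-5 bits)."""
    return ((r & 0xF8) << 8) | ((g & 0xFC) << 3) | (b >> 3)

lcd_frame = frame_bytes(320, 240)      # one full-screen LCD framebuffer
cam_frame = frame_bytes(1600, 1200)    # one full-resolution OV2640 frame

print(lcd_frame // 1024, "KB for one 320x240 LCD frame")
print(round(cam_frame * 100 / APP_RAM_BYTES), "% of app RAM for one 1600x1200 frame")
```

A 320x240 framebuffer works out to 150KB, which is where most of the ~200KB status-display estimate below comes from. A single uncompressed 1600x1200 frame would take well over half of the app RAM, so full-resolution capture leaves little room next to the base firmware.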

LCD Usage for Voice Assistant

Use Case 1: Status Display (Minimal Overhead)

What to Show:

  • Current state (idle/listening/processing/responding)
  • Wake word detected indicator
  • WiFi status and signal strength
  • Server connection status
  • Volume level
  • Time/date

Overhead:

  • CPU: ~2-5% (simple text/icons)
  • RAM: ~200KB (framebuffer + assets)
  • Power: ~50mW additional
  • Complexity: Low (MaixPy has built-in LCD support)

Code Example:

import lcd
import image

lcd.init()
lcd.rotation(2)  # Rotate if needed

# Simple status display
img = image.Image(size=(320, 240))
img.draw_string(10, 10, "Listening...", color=(0, 255, 0), scale=3)
img.draw_circle(300, 20, 10, color=(0, 255, 0), fill=True)  # Status LED
lcd.display(img)

Verdict: Very Low Overhead - Highly Recommended
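
The states listed above can be kept in a small lookup table so the drawing code stays the same for every state. This is a minimal sketch; the labels, colors, and the `status_style` helper are illustrative assumptions rather than part of the existing firmware:

```python
# Hypothetical state table; the colors and labels are illustrative choices.
STATE_STYLES = {
    "idle":       ("Idle",          (128, 128, 128)),
    "listening":  ("Listening...",  (0, 255, 0)),
    "processing": ("Processing...", (255, 255, 0)),
    "responding": ("Responding...", (0, 128, 255)),
}

def status_style(state):
    """Return (label, rgb_color) for a state, falling back to idle."""
    return STATE_STYLES.get(state, STATE_STYLES["idle"])

label, color = status_style("listening")
print(label, color)
```

On the device, `label` and `color` would be passed to the same `img.draw_string(...)` and `img.draw_circle(...)` calls used in the example above.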


Use Case 2: Audio Waveform Visualizer (Moderate Overhead)

Input Waveform (Microphone)

What to Show:

  • Real-time audio level meter
  • Waveform display (oscilloscope style)
  • VU meter
  • Frequency spectrum (simple bars)

Overhead:

  • CPU: ~10-15% (real-time drawing)
  • RAM: ~300KB (framebuffer + audio buffer)
  • Frame Rate: 15-30 FPS (sufficient for audio visualization)
  • Complexity: Moderate (drawing primitives + FFT)

Implementation:

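import lcd
import image

lcd.init()

WIDTH, HEIGHT = 320, 240

def draw_waveform(audio_buffer):
    """Draw an oscilloscope trace plus a peak level bar.

    audio_buffer: signed 16-bit samples captured elsewhere (for example over I2S).
    """
    if len(audio_buffer) == 0:
        return

    img = image.Image(size=(WIDTH, HEIGHT))
    center = HEIGHT // 2

    # Sample every Nth point to fit on screen (step is at least 1)
    step = max(1, len(audio_buffer) // WIDTH)
    points = min(WIDTH, len(audio_buffer) // step)

    def to_y(sample):
        # Scale int16 to +/- (center - 1) pixels; screen y grows downward
        return center - (sample * (center - 1)) // 32768

    for x in range(points - 1):
        y1 = to_y(audio_buffer[x * step])
        y2 = to_y(audio_buffer[(x + 1) * step])
        img.draw_line(x, y1, x + 1, y2, color=(0, 255, 0))

    # Add level meter (peak amplitude, clamped to the screen height)
    level = max(abs(min(audio_buffer)), abs(max(audio_buffer)))
    bar_height = min(HEIGHT, (level * HEIGHT) // 32768)
    img.draw_rectangle(0, HEIGHT - bar_height, 20, bar_height,
                       color=(0, 255, 0), fill=True)

    lcd.display(img)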

Verdict: Moderate Overhead - Feasible and Cool!
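
Since 15-30 FPS is enough for audio visualization, redraws can be throttled so the display never takes more CPU than it needs. The sketch below is one way to do that; `FrameLimiter` is a hypothetical helper, and it uses `time.monotonic` so it runs on desktop Python. On MicroPython a tick-based clock such as `time.ticks_ms()` would likely take its place, with the interval adapted to milliseconds.

```python
import time

class FrameLimiter:
    """Allow a redraw at most `fps` times per second."""

    def __init__(self, fps, clock=time.monotonic):
        self.interval = 1.0 / fps   # seconds between frames
        self.clock = clock          # injectable so the limiter can be tested
        self.next_due = None

    def ready(self):
        """Return True when enough time has passed for the next frame."""
        now = self.clock()
        if self.next_due is None or now >= self.next_due:
            self.next_due = now + self.interval
            return True
        return False

limiter = FrameLimiter(fps=20)
if limiter.ready():
    print("redraw waveform")   # on the device: draw_waveform(audio_buffer)
```

Audio capture and streaming keep running on every loop iteration; only the drawing call is skipped when `ready()` returns False.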


Output Waveform (TTS Response)

What to Show:

  • TTS audio being played back
  • Speaking animation (mouth/sound waves)
  • Response text scrolling

Overhead:

  • CPU: ~10-15% (similar to input)
  • RAM: ~300KB
  • Complexity: Moderate

Note: Can reuse same visualization code as input waveform.

Verdict: Same as Input - Totally Doable


Use Case 3: Spectrum Analyzer (Higher Overhead)

What to Show:

  • Frequency bars (FFT visualization)
  • 8-16 frequency bands
  • Classic "equalizer" look

Overhead:

  • CPU: ~20-30% (FFT computation + drawing)
  • RAM: ~500KB (FFT buffers + framebuffer)
  • Complexity: Moderate-High (FFT required)

Implementation Note:

  • The K210 has a dedicated hardware FFT accelerator (separate from the KPU) that can offload FFT computation
  • Can do simple 8-band analysis with minimal CPU
  • More bands = more CPU

Verdict: ⚠️ Higher Overhead - Use Sparingly


Use Case 4: Interactive UI (High Overhead)

What to Show:

  • Touchscreen controls (if touchscreen available)
  • Settings menu
  • Volume slider
  • Wake word selection
  • Network configuration

Overhead:

  • CPU: ~20-40% (touch detection + UI rendering)
  • RAM: ~1MB (UI framework + assets)
  • Complexity: High (need UI framework)

Verdict: ⚠️ High Overhead - Nice-to-Have Later


Camera Usage for Voice Assistant

Use Case 1: Person Detection (Wake on Face)

What to Do:

  • Detect person in frame
  • Only listen when someone is present
  • Privacy mode: disable when no one around

Overhead:

  • CPU: ~30-40% (KPU handles inference)
  • RAM: ~1.5MB (model + frame buffers)
  • Power: ~200mW additional
  • Complexity: Moderate (pre-trained models available)

Pros:

  • Privacy enhancement (only listen when the room is occupied)
  • Power saving (sleep when the room is empty)
  • Pre-trained models available for K210

Cons:

  • Adds latency (check camera before listening)
  • Privacy concerns (camera always on)
  • Moderate resource usage
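
The "only listen when someone is present" behavior comes down to a small gate around the detector output. The sketch below keeps listening enabled for a hold period after the last detection, so a few missed frames do not switch the microphone off mid-sentence. `PresenceGate` and the 30-second default are assumptions for illustration; the detection result itself would come from a KPU model that is not shown here.

```python
class PresenceGate:
    """Keep listening enabled for `hold_s` seconds after the last detection."""

    def __init__(self, hold_s=30.0):
        self.hold_s = hold_s
        self.last_seen = None   # timestamp of the most recent detection

    def update(self, person_detected, now):
        """Record a detection result; return True if listening should be on."""
        if person_detected:
            self.last_seen = now
        if self.last_seen is None:
            return False
        return (now - self.last_seen) <= self.hold_s

gate = PresenceGate(hold_s=10)
print(gate.update(True, now=0))    # person seen: listening on
print(gate.update(False, now=5))   # within hold period: still on
print(gate.update(False, now=20))  # hold period expired: off
```

A longer hold period means the camera can be polled less often, which trades some of the power saving for lower latency.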

Verdict: 🤔 Interesting but Complex - Phase 2+


Use Case 2: Visual Context (Future AI Integration)

What to Do:

  • "What am I holding?" queries
  • Visual scene understanding
  • QR code scanning
  • Gesture control

Overhead:

  • CPU: 40-60% (vision processing)
  • RAM: 2-3MB (models + buffers)
  • Complexity: High (requires vision models)

Verdict: Too Complex for Initial Release - Future Feature


Use Case 3: Visual Wake Word (Gesture Detection)

What to Do:

  • Wave hand to activate
  • Thumbs up/down for feedback
  • Alternative to voice wake word

Overhead:

  • CPU: ~30-40% (gesture detection)
  • RAM: ~1.5MB
  • Complexity: Moderate-High

Verdict: 🤔 Novel Idea - Phase 3+


Phase 1: Basic Status Display

┌─────────────────────────┐
│  Voice Assistant        │
│                         │
│  Status: Listening  ●   │
│  WiFi: ████░░  75%      │
│  Server: Connected      │
│                         │
│  Volume: [██████░░░]    │
│                         │
│  Time: 14:23            │
└─────────────────────────┘

Features:

  • Current state indicator
  • WiFi signal strength
  • Server connection status
  • Volume level bar
  • Clock
  • Wake word indicator (pulsing circle)

Overhead: ~2-5% CPU, 200KB RAM
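
The WiFi and volume bars in the mock-up above are just a percentage mapped to a number of filled cells. A minimal sketch, with `text_bar` as a hypothetical helper; on the LCD the same cell count would drive filled rectangles instead of text glyphs:

```python
def text_bar(percent, cells, filled="█", empty="░"):
    """Render a percentage as a fixed-width bar, rounding down."""
    percent = max(0, min(100, percent))   # clamp out-of-range input
    n = percent * cells // 100            # number of filled cells
    return filled * n + empty * (cells - n)

print("WiFi:", text_bar(75, 6), "75%")
```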


Phase 2: Waveform Visualization (Cool Addition)

┌─────────────────────────┐
│ Listening...       [●]  │
├─────────────────────────┤
│  ╱╲  ╱╲    ╱╲  ╱╲       │
│   ╲╱  ╲    ╲╱  ╲        │
│                         │
│ Level: [████░░░░░░]     │
└─────────────────────────┘

Features:

  • Real-time waveform (15-30 FPS)
  • Audio level meter
  • State indicator
  • Simple and clean

Overhead: ~10-15% CPU, 300KB RAM


Phase 3: Enhanced Visualizer (Polish)

┌─────────────────────────┐
│ Hey Computer!      [●]  │
├─────────────────────────┤
│ ▁▂▃▄▅▆▇█ ▁▂▃▄▅▆▇█      │
│ ▁▂▃▄▅▆▇█ ▁▂▃▄▅▆▇█      │
│                         │
│ "Turn off the lights"   │
└─────────────────────────┘

Features:

  • Spectrum analyzer (8-16 bands)
  • Transcription display
  • Animated response
  • More polished UI

Overhead: ~20-30% CPU, 500KB RAM


Resource Budget Analysis

Total K210 Resources

  • CPU: 2 cores @ 400MHz (assume ~100% available)
  • RAM: 6MB available for app
  • Bandwidth: SPI (LCD), I2S (audio), WiFi

Current Voice Assistant Usage (Server-Side Wake Word)

| Component | CPU % | RAM (KB) |
|---|---|---|
| Audio Capture (I2S) | 5% | 128 |
| Audio Playback | 5% | 128 |
| WiFi Streaming | 10% | 256 |
| Network Stack | 5% | 512 |
| MaixPy Runtime | 10% | 1024 |
| Base Total | 35% | ~2MB |

With LCD Features

| Display Mode | CPU % | RAM (KB) | Total CPU | Total RAM |
|---|---|---|---|---|
| None | 0% | 0 | 35% | 2MB |
| Status Only | 2-5% | 200 | 37-40% | 2.2MB |
| Waveform | 10-15% | 300 | 45-50% | 2.3MB |
| Spectrum | 20-30% | 500 | 55-65% | 2.5MB |

With Camera Features

| Feature | CPU % | RAM (KB) | Feasible? |
|---|---|---|---|
| Person Detection | 30-40% | 1500 | ⚠️ Tight |
| Gesture Control | 30-40% | 1500 | ⚠️ Tight |
| Visual Context | 40-60% | 2500 | Too much |
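
The totals in these tables are the base usage plus each feature's own cost. The sketch below reproduces that arithmetic and also reports the RAM headroom left in the 6MB app budget. The figures are copied from the tables above; the names are illustrative, and each feature is considered on its own rather than combined with a display mode.

```python
BASE_CPU = 35           # percent, base total from the table above
BASE_RAM_KB = 2048      # 2MB base total
APP_RAM_KB = 6 * 1024   # 6MB available to the app

# (cpu_low %, cpu_high %, ram KB), copied from the tables above
FEATURES = {
    "status":           (2, 5, 200),
    "waveform":         (10, 15, 300),
    "spectrum":         (20, 30, 500),
    "person_detection": (30, 40, 1500),
    "gesture_control":  (30, 40, 1500),
    "visual_context":   (40, 60, 2500),
}

def totals(feature):
    """Return (cpu_low, cpu_high, ram_kb, ram_headroom_kb) with the feature enabled."""
    lo, hi, ram = FEATURES[feature]
    total_ram = BASE_RAM_KB + ram
    return BASE_CPU + lo, BASE_CPU + hi, total_ram, APP_RAM_KB - total_ram

for name in FEATURES:
    print(name, totals(name))
```

Running a camera feature together with a display mode would add both rows to the base, which is where the CPU budget becomes the limiting factor.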

Recommendations

IMPLEMENT NOW: Basic Status Display

  • Why: Very low overhead, huge UX improvement
  • Overhead: 2-5% CPU, 200KB RAM
  • Benefit: Users know what's happening at a glance
  • Difficulty: Easy (MaixPy has good LCD support)

IMPLEMENT SOON: Waveform Visualizer

  • Why: Cool factor, moderate overhead
  • Overhead: 10-15% CPU, 300KB RAM
  • Benefit: Engaging, confirms mic is working, looks professional
  • Difficulty: Moderate (simple drawing code)

🤔 CONSIDER LATER: Spectrum Analyzer

  • Why: Higher overhead, diminishing returns
  • Overhead: 20-30% CPU, 500KB RAM
  • Benefit: Looks cool but not essential
  • Difficulty: Moderate-High (FFT required)

SKIP FOR NOW: Camera Features

  • Why: High overhead, complex, privacy concerns
  • Overhead: 30-60% CPU, 1.5-2.5MB RAM
  • Benefit: Novel but not core functionality
  • Difficulty: High (model integration, privacy handling)

Implementation Priority

Phase 1 (Week 1): Core Functionality

  • Audio capture and streaming
  • Server integration
  • Basic LCD status display
    • Idle/Listening/Processing states
    • WiFi status
    • Connection indicator

Phase 2 (Week 2-3): Visual Enhancement

  • Audio waveform visualizer
    • Input (microphone) waveform
    • Output (TTS) waveform
    • Level meters
    • Clean, minimal design

Phase 3 (Month 2): Polish

  • Spectrum analyzer option
  • Animated transitions
  • Settings display
  • Network configuration UI (optional)

Phase 4 (Month 3+): Advanced Features

  • Camera person detection (privacy mode)
  • Gesture control experiments
  • Visual wake word alternative

Code Structure Recommendation

# main.py structure with modular display
# DisplayManager, AudioProcessor, and VoiceClient are the proposed project
# modules; their constructors and methods here are illustrative.

import lcd
from display_manager import DisplayManager
from audio_processor import AudioProcessor
from voice_client import VoiceClient

# Initialize
lcd.init()
display = DisplayManager(mode='waveform')  # or 'status' or 'spectrum'
audio_processor = AudioProcessor()
voice_client = VoiceClient()

# Main loop
while True:
    # Audio processing
    audio_buffer = audio_processor.capture()

    # Update display (non-blocking)
    if display.mode == 'status':
        display.show_status(state='listening', wifi_level=75)
    elif display.mode == 'waveform':
        display.show_waveform(audio_buffer)
    elif display.mode == 'spectrum':
        display.show_spectrum(audio_buffer)

    # Network communication
    voice_client.stream_audio(audio_buffer)
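
One possible shape for the display module is sketched below. It swaps the if/elif chain for a table of renderers keyed by mode name, so adding a mode does not touch the main loop. This is an alternative design under stated assumptions, not the existing `display_manager` module; the `renderers` argument and the `update` method are hypothetical.

```python
class DisplayManager:
    """Hypothetical sketch: route each frame to the renderer for the active mode."""

    def __init__(self, mode="status", renderers=None):
        # renderers maps a mode name to a callable that draws one frame
        self.renderers = renderers or {}
        self.set_mode(mode)

    def set_mode(self, mode):
        """Switch modes, rejecting names that have no registered renderer."""
        if self.renderers and mode not in self.renderers:
            raise ValueError("unknown display mode: " + mode)
        self.mode = mode

    def update(self, data):
        """Render one frame; return False if the mode has no renderer."""
        render = self.renderers.get(self.mode)
        if render is None:
            return False
        render(data)
        return True

# Stand-in renderers that print instead of drawing to the LCD
display = DisplayManager("waveform", {
    "status": lambda data: print("status:", data),
    "waveform": lambda data: print("waveform with", len(data), "samples"),
})
display.update([0, 120, -340, 90])
```

With this layout the main loop calls `display.update(...)` once per iteration, and mode switching (for example from a settings screen) becomes a single `set_mode` call.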

Estimated Overhead (Not Yet Measured)

Status Display Only

  • CPU: 38% total (3% for display)
  • RAM: 2.2MB total (200KB for display)
  • Battery Life: -2% (minimal impact)
  • WiFi Latency: No impact
  • Verdict: Negligible impact, worth it!

Waveform Visualizer

  • CPU: 48% total (13% for display)
  • RAM: 2.3MB total (300KB for display)
  • Battery Life: -5% (minor impact)
  • WiFi Latency: No impact (still <200ms)
  • Verdict: Acceptable, looks great!

Spectrum Analyzer

  • CPU: 60% total (25% for display)
  • RAM: 2.5MB total (500KB for display)
  • Battery Life: -8% (noticeable)
  • WiFi Latency: Possible minor impact
  • Verdict: ⚠️ Usable but pushing limits

Camera: Should You Use It?

Pros

  • Already have the hardware (free!)
  • Novel features (person detection, gestures)
  • Privacy enhancement potential
  • Future-proofing

Cons

  • High resource usage (30-60% CPU, 1.5-2.5MB RAM)
  • Complex implementation
  • Privacy concerns (camera always on)
  • Not core to voice assistant
  • Competes with audio processing resources

Recommendation

Skip the camera for the initial implementation. Focus on core voice assistant functionality. Revisit in Phase 4 (Month 3+) when:

  1. Core features are stable
  2. You want to experiment
  3. You have time for optimization
  4. You want to differentiate from commercial assistants

Final Recommendations

Start With (NOW):

# Simple status display
# - State indicator
# - WiFi status  
# - Connection status
# - Time/date
# Overhead: ~3% CPU, 200KB RAM

Add Next (Week 2):

# Waveform visualizer
# - Real-time audio waveform
# - Level meter
# - Clean design
# Overhead: +10% CPU, +100KB RAM

Maybe Later (Month 2+):

# Spectrum analyzer
# - 8-16 frequency bands
# - FFT visualization
# - Optional mode
# Overhead: +15% CPU, +200KB RAM

Skip (For Now):

# Camera features
# - Person detection
# - Gestures
# - Visual context
# Too complex, revisit later

Example: Combined Status + Waveform Display

┌───────────────────────────────┐
│ Voice Assistant    [LISTENING]│
├───────────────────────────────┤
│                               │
│  ╱╲    ╱╲  ╱╲    ╱╲  ╱╲       │
│   ╲╱  ╲    ╲╱  ╲              │
│      ╲╱          ╲╱           │
│                               │
│ Vol: [████████░░] WiFi: ▂▃▅█ │
│                               │
│ Server: 10.1.10.71 ● 14:23   │
└───────────────────────────────┘

Total Overhead: ~15% CPU, 300KB RAM
Impact: Minimal, excellent UX improvement
Coolness Factor: 9/10


Conclusion

LCD: YES! Definitely Use It!

  • Status display: Low overhead, huge benefit
  • Waveform: Moderate overhead, looks amazing
  • Spectrum: Higher overhead, nice-to-have

Recommendation: Start with status, add waveform, consider spectrum later.

Camera: Skip For Now

  • High overhead
  • Complex implementation
  • Not core functionality
  • Revisit in Phase 4 (Month 3+)

Focus on nailing the voice assistant first, then add visual features incrementally!


TL;DR: Use the LCD for status + waveform visualization (~15% CPU and ~300KB RAM in total). Skip the camera for now. Your K210 can easily handle this! 🎉