pyr0ball 173f7f37d4 feat: import mycroft-precise work as Minerva foundation

Ports prior voice assistant research and prototypes from devl/Devops
into the Minerva repo. Includes:

- docs/: architecture, wake word guides, ESP32-S3 spec, hardware buying guide
- scripts/: voice_server.py, voice_server_enhanced.py, setup scripts
- hardware/maixduino/: edge device scripts with WiFi credentials scrubbed
  (replaced hardcoded password with secrets.py pattern)
- config/.env.example: server config template
- .gitignore: excludes .env, secrets.py, model blobs, ELF firmware
- CLAUDE.md: Minerva product context and connection to cf-voice roadmap

2026-04-06 22:21:12 -07:00


Maix Duino LCD & Camera Feature Analysis

Date: 2025-11-29
Hardware: Sipeed Maix Duino (K210)
Question: What's the overhead for using LCD display and camera?

Hardware Capabilities

LCD Display

Resolution: Typically 320x240 or 240x135 (depending on model)
Interface: SPI
Color: RGB565 (16-bit color)
Frame Rate: Up to 60 FPS (limited by SPI bandwidth)
Status: ✅ Included with most Maix Duino kits
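
The "limited by SPI bandwidth" note can be turned into a rough upper bound on frame rate. The sketch below is only arithmetic; the 20 MHz clock and the number of data lines are illustrative assumptions, not values taken from the Maix Duino documentation, and real rates will be lower once command overhead and drawing time are included.

```python
def max_fps(width, height, bits_per_pixel, bus_hz, data_lines=1):
    """Upper bound on full-frame refreshes per second over a serial bus.

    Ignores command overhead and CPU drawing time.
    """
    bits_per_frame = width * height * bits_per_pixel
    return (bus_hz * data_lines) / bits_per_frame


# Assumed example: 320x240 RGB565 over a 20 MHz bus with one data line.
single_line_fps = max_fps(320, 240, 16, 20000000)
# Same clock with eight data lines (also an assumption).
eight_line_fps = max_fps(320, 240, 16, 20000000, data_lines=8)
print(round(single_line_fps, 1), round(eight_line_fps, 1))
```

If the bound comes out below the target frame rate, partial redraws (updating only the region that changed) are the usual workaround.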

Camera

Resolution: Various (OV2640 common: 2MP, up to 1600x1200)
Interface: DVP (Digital Video Port)
Frame Rate: Up to 60 FPS (lower at high resolution)
Status: ✅ Often included with Maix Duino kits

K210 Resources

CPU: Dual-core RISC-V @ 400MHz
KPU: Neural network accelerator
SRAM: 8MB total (6MB general-purpose, available for apps; 2MB reserved for the KPU)
Flash: 16MB
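
To put the RAM figures in context, here is a quick calculation of how large a single uncompressed frame is at RGB565 (2 bytes per pixel). The resolutions come from the sections above; comparing against the 6MB app figure is just a sanity check, not a measurement.

```python
KIB = 1024


def frame_bytes(width, height, bytes_per_pixel=2):
    """Size of one uncompressed frame (RGB565 uses 2 bytes per pixel)."""
    return width * height * bytes_per_pixel


lcd_qvga = frame_bytes(320, 240)      # LCD framebuffer
cam_uxga = frame_bytes(1600, 1200)    # OV2640 at full resolution
app_ram = 6 * 1024 * KIB              # 6MB app figure from this document

print(lcd_qvga // KIB, "KiB for one LCD frame")
print(round(cam_uxga / app_ram, 2), "of app RAM for one full-res camera frame")
```

One 320x240 frame is small, but a single full-resolution camera frame would take well over half of the app RAM, which is why camera use cases normally run at much lower resolutions.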

LCD Usage for Voice Assistant

Use Case 1: Status Display (Minimal Overhead)

What to Show:

Current state (idle/listening/processing/responding)
Wake word detected indicator
WiFi status and signal strength
Server connection status
Volume level
Time/date

Overhead:

CPU: ~2-5% (simple text/icons)
RAM: ~200KB (framebuffer + assets)
Power: ~50mW additional
Complexity: Low (MaixPy has built-in LCD support)

Code Example:

import lcd
import image

lcd.init()
lcd.rotation(2)  # Rotate if needed

# Simple status display
img = image.Image(size=(320, 240))
img.draw_string(10, 10, "Listening...", color=(0, 255, 0), scale=3)
img.draw_circle(300, 20, 10, color=(0, 255, 0), fill=True)  # Status LED
lcd.display(img)
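
The states listed above (idle/listening/processing/responding) can be kept in one lookup table so the drawing code never hardcodes labels or colors. This is a minimal sketch; the `STATES` table and `status_style` helper are hypothetical names, and the colors are arbitrary choices.

```python
# Hypothetical state table: label text and RGB color per assistant state.
STATES = {
    "idle":       ("Idle",          (128, 128, 128)),
    "listening":  ("Listening...",  (0, 255, 0)),
    "processing": ("Processing...", (255, 165, 0)),
    "responding": ("Responding...", (0, 128, 255)),
}


def status_style(state):
    """Return (label, rgb_color) for a state, falling back to idle."""
    return STATES.get(state, STATES["idle"])


label, color = status_style("listening")
print(label, color)
```

The returned pair can be passed straight into the `draw_string` and `draw_circle` calls from the example above.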

Verdict: ✅ Very Low Overhead - Highly Recommended

Use Case 2: Audio Waveform Visualizer (Moderate Overhead)

Input Waveform (Microphone)

What to Show:

Real-time audio level meter
Waveform display (oscilloscope style)
VU meter
Frequency spectrum (simple bars)

Overhead:

CPU: ~10-15% (real-time drawing)
RAM: ~300KB (framebuffer + audio buffer)
Frame Rate: 15-30 FPS (sufficient for audio visualization)
Complexity: Moderate (drawing primitives + FFT)

Implementation:

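import lcd, image

lcd.init()
# Audio capture setup is omitted here; audio_buffer is assumed to be a
# non-empty sequence of signed 16-bit samples.

def draw_waveform(audio_buffer):
    img = image.Image(size=(320, 240))

    width = 320
    height = 240
    center = height // 2

    # Sample every Nth point to fit on screen (step of at least 1)
    step = max(1, len(audio_buffer) // width)
    columns = min(width, len(audio_buffer) // step)

    # Draw waveform: scale +/-32768 to +/-center so the trace stays on
    # screen, and subtract so positive samples are drawn upward
    for x in range(columns - 1):
        y1 = center - (audio_buffer[x * step] * center) // 32768
        y2 = center - (audio_buffer[(x + 1) * step] * center) // 32768
        img.draw_line(x, y1, x + 1, y2, color=(0, 255, 0))

    # Add level meter (peak absolute sample, scaled to screen height)
    level = max(abs(min(audio_buffer)), abs(max(audio_buffer)))
    bar_height = (level * height) // 32768
    img.draw_rectangle(0, height - bar_height, 20, bar_height,
                       color=(0, 255, 0), fill=True)

    lcd.display(img)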
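
Taking every Nth sample can miss short peaks between the sampled points. An alternative is to keep the min and max of each column's chunk and draw a vertical line between them. The sketch below is plain Python with no MaixPy calls; `column_peaks` and `to_pixel_y` are hypothetical helper names, and 16-bit signed samples are assumed.

```python
def column_peaks(samples, width):
    """Reduce samples to `width` (min, max) pairs, one per screen column."""
    n = len(samples)
    if n == 0 or width <= 0:
        return []
    peaks = []
    for x in range(width):
        start = x * n // width
        # Guarantee at least one sample per column when n < width.
        end = max(start + 1, (x + 1) * n // width)
        chunk = samples[start:end]
        peaks.append((min(chunk), max(chunk)))
    return peaks


def to_pixel_y(sample, height, full_scale=32768):
    """Map a signed sample to a screen row, clamped to the display."""
    center = height // 2
    y = center - (sample * center) // full_scale
    return min(height - 1, max(0, y))


demo = [0, 100, -200, 50, 300, -50, 0, 10]
print(column_peaks(demo, 4))
```

Each `(lo, hi)` pair would then become one `img.draw_line(x, to_pixel_y(hi, 240), x, to_pixel_y(lo, 240))` call.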

Verdict: ✅ Moderate Overhead - Feasible and Cool!

Output Waveform (TTS Response)

What to Show:

TTS audio being played back
Speaking animation (mouth/sound waves)
Response text scrolling

Overhead:

CPU: ~10-15% (similar to input)
RAM: ~300KB
Complexity: Moderate

Note: Can reuse same visualization code as input waveform.

Verdict: ✅ Same as Input - Totally Doable

Use Case 3: Spectrum Analyzer (Higher Overhead)

What to Show:

Frequency bars (FFT visualization)
8-16 frequency bands
Classic "equalizer" look

Overhead:

CPU: ~20-30% (FFT computation + drawing)
RAM: ~500KB (FFT buffers + framebuffer)
Complexity: Moderate-High (FFT required)

Implementation Note:

K210 has a dedicated hardware FFT accelerator (separate from the KPU) that can offload FFT operations
Can do simple 8-band analysis with minimal CPU
More bands = more CPU
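
The 8-band analysis mentioned above can be sketched in plain Python to show the data flow: transform a block of samples, then collapse the frequency bins into a handful of bars. The naive DFT here is only a stand-in so the example runs anywhere; on the device a hardware-accelerated FFT would replace it. Equal-width bands are an assumption, and log-spaced bands usually look better for audio.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Naive O(n^2) DFT; returns magnitudes for the first n/2 bins."""
    n = len(samples)
    mags = []
    for k in range(n // 2):
        acc = 0j
        for t, s in enumerate(samples):
            acc += s * cmath.exp(-2j * math.pi * k * t / n)
        mags.append(abs(acc) / n)
    return mags


def band_levels(mags, bands=8):
    """Group bins into equal-width bands, keeping the peak of each band."""
    per_band = max(1, len(mags) // bands)
    return [max(mags[b * per_band:(b + 1) * per_band], default=0.0)
            for b in range(bands)]


# Test tone sitting exactly on bin 5 of a 64-point transform.
tone = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
levels = band_levels(dft_magnitudes(tone))
print(levels.index(max(levels)))
```

With 32 usable bins and 8 bands, each bar covers 4 bins, so the test tone lands in the second bar.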

Verdict: ⚠️ Higher Overhead - Use Sparingly

Use Case 4: Interactive UI (High Overhead)

What to Show:

Touchscreen controls (if touchscreen available)
Settings menu
Volume slider
Wake word selection
Network configuration

Overhead:

CPU: ~20-40% (touch detection + UI rendering)
RAM: ~1MB (UI framework + assets)
Complexity: High (need UI framework)

Verdict: ⚠️ High Overhead - Nice-to-Have Later

Camera Usage for Voice Assistant

Use Case 1: Person Detection (Wake on Face)

What to Do:

Detect person in frame
Only listen when someone present
Privacy mode: disable when no one around

Overhead:

CPU: ~30-40% (KPU handles inference)
RAM: ~1.5MB (model + frame buffers)
Power: ~200mW additional
Complexity: Moderate (pre-trained models available)

Pros:

✅ Privacy enhancement (only listen when occupied)
✅ Power saving (sleep when empty room)
✅ Pre-trained models available for K210

Cons:

❌ Adds latency (check camera before listening)
❌ Privacy concerns (camera always on)
❌ Moderate resource usage
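
The "only listen when someone is present" logic is mostly a timer: keep listening enabled for a while after the last detection so one missed frame does not cut the user off. A minimal sketch, with the detector itself left out; `PresenceGate` and the hold time are hypothetical, and timestamps are passed in so the logic can be tested off-device.

```python
class PresenceGate:
    """Keep listening enabled for `hold_s` seconds after the last detection."""

    def __init__(self, hold_s=10.0):
        self.hold_s = hold_s
        self._last_seen = None

    def update(self, person_detected, now):
        """Record a detection result and return whether to listen."""
        if person_detected:
            self._last_seen = now
        return self.listening(now)

    def listening(self, now):
        if self._last_seen is None:
            return False
        return (now - self._last_seen) <= self.hold_s


gate = PresenceGate(hold_s=5)
print(gate.update(False, 0), gate.update(True, 1), gate.update(False, 7))
```

A longer hold time reduces the added latency listed under Cons, at the cost of listening longer after the room empties.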

Verdict: 🤔 Interesting but Complex - Phase 2+

Use Case 2: Visual Context (Future AI Integration)

What to Do:

"What am I holding?" queries
Visual scene understanding
QR code scanning
Gesture control

Overhead:

CPU: 40-60% (vision processing)
RAM: 2-3MB (models + buffers)
Complexity: High (requires vision models)

Verdict: ❌ Too Complex for Initial Release - Future Feature

Use Case 3: Visual Wake Word (Gesture Detection)

What to Do:

Wave hand to activate
Thumbs up/down for feedback
Alternative to voice wake word

Overhead:

CPU: ~30-40% (gesture detection)
RAM: ~1.5MB
Complexity: Moderate-High

Verdict: 🤔 Novel Idea - Phase 3+

Recommended LCD Implementation

Phase 1: Basic Status Display (Recommended NOW)

┌─────────────────────────┐
│  Voice Assistant        │
│                         │
│  Status: Listening  ●   │
│  WiFi: ████░░  75%      │
│  Server: Connected      │
│                         │
│  Volume: [██████░░░]    │
│                         │
│  Time: 14:23            │
└─────────────────────────┘

Features:

Current state indicator
WiFi signal strength
Server connection status
Volume level bar
Clock
Wake word indicator (pulsing circle)

Overhead: ~2-5% CPU, 200KB RAM
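
The bars in the mock-up above are easy to generate from a percentage. The sketch below builds the text form; `bar` and `rssi_to_percent` are hypothetical helpers, and the -100 to -50 dBm mapping is a common rule of thumb rather than anything specific to this hardware.

```python
def bar(percent, cells, full="█", empty="░"):
    """Render a percentage as a fixed-width text bar."""
    percent = min(100, max(0, percent))
    filled = percent * cells // 100
    return full * filled + empty * (cells - filled)


def rssi_to_percent(rssi_dbm, floor=-100, ceil=-50):
    """Linearly map an RSSI reading to 0-100 (assumed range)."""
    clamped = min(ceil, max(floor, rssi_dbm))
    return (clamped - floor) * 100 // (ceil - floor)


print("WiFi:", bar(75, 6), "75%")
print("Volume: [" + bar(67, 9) + "]")
```

On the LCD the same `filled` count would set the width of a filled rectangle instead of a character count.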

Phase 2: Waveform Visualization (Cool Addition)

┌─────────────────────────┐
│ Listening...       [●]  │
├─────────────────────────┤
│  ╱╲  ╱╲    ╱╲  ╱╲      │
│ ╱  ╲╱  ╲  ╱  ╲╱  ╲     │
│                         │
│ Level: [████░░░░░░]     │
└─────────────────────────┘

Features:

Real-time waveform (15-30 FPS)
Audio level meter
State indicator
Simple and clean

Overhead: ~10-15% CPU, 300KB RAM

Phase 3: Enhanced Visualizer (Polish)

┌─────────────────────────┐
│ Hey Computer!      [●]  │
├─────────────────────────┤
│ ▁▂▃▄▅▆▇█ ▁▂▃▄▅▆▇█      │
│ ▁▂▃▄▅▆▇█ ▁▂▃▄▅▆▇█      │
│                         │
│ "Turn off the lights"   │
└─────────────────────────┘

Features:

Spectrum analyzer (8-16 bands)
Transcription display
Animated response
More polished UI

Overhead: ~20-30% CPU, 500KB RAM

Resource Budget Analysis

Total K210 Resources

CPU: 2 cores @ 400MHz (assume ~100% available)
RAM: 6MB available for app
Bandwidth: SPI (LCD), I2S (audio), WiFi

Current Voice Assistant Usage (Server-Side Wake Word)

Component	CPU %	RAM (KB)
Audio Capture (I2S)	5%	128
Audio Playback	5%	128
WiFi Streaming	10%	256
Network Stack	5%	512
MaixPy Runtime	10%	1024
Base Total	35%	~2MB

With LCD Features

Display Mode	CPU %	RAM (KB)	Total CPU	Total RAM
None	0%	0	35%	2MB
Status Only	2-5%	200	37-40%	2.2MB
Waveform	10-15%	300	45-50%	2.3MB
Spectrum	20-30%	500	55-65%	2.5MB
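
The totals in the two tables above can be reproduced with a few lines, which makes it easier to re-run the budget when an estimate changes. All figures are the estimates from this document; the dictionary layout is just one way to hold them.

```python
# component -> (cpu_percent, ram_kb), from the base usage table
BASE = {
    "Audio Capture (I2S)": (5, 128),
    "Audio Playback": (5, 128),
    "WiFi Streaming": (10, 256),
    "Network Stack": (5, 512),
    "MaixPy Runtime": (10, 1024),
}

# mode -> (cpu_low, cpu_high, ram_kb), from the LCD features table
DISPLAY = {
    "None": (0, 0, 0),
    "Status Only": (2, 5, 200),
    "Waveform": (10, 15, 300),
    "Spectrum": (20, 30, 500),
}


def base_totals():
    """Sum CPU % and RAM KB across the base components."""
    cpu = sum(c for c, _ in BASE.values())
    ram = sum(r for _, r in BASE.values())
    return cpu, ram


def with_display(mode):
    """Return (cpu_low, cpu_high, ram_kb) for base plus one display mode."""
    base_cpu, base_ram = base_totals()
    lo, hi, ram = DISPLAY[mode]
    return base_cpu + lo, base_cpu + hi, base_ram + ram


for mode in DISPLAY:
    print(mode, with_display(mode))
```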

With Camera Features

Feature	CPU %	RAM (KB)	Feasible?
Person Detection	30-40%	1500	⚠️ Tight
Gesture Control	30-40%	1500	⚠️ Tight
Visual Context	40-60%	2500	❌ Too much
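
To see why the camera rows are marked tight, it helps to stack a camera feature on top of the base load and a display mode. The sketch below uses the worst-case CPU estimates from this document; the 80% CPU ceiling is an assumed safety margin, not a measured limit.

```python
BASE_CPU, BASE_RAM_KB = 35, 2048   # base totals from this document
APP_RAM_KB = 6 * 1024              # 6MB app RAM figure from this document
CPU_CEILING = 80                   # assumed headroom target

# feature -> (worst-case cpu_percent, ram_kb)
DISPLAY = {"status": (5, 200), "waveform": (15, 300), "spectrum": (30, 500)}
CAMERA = {"person": (40, 1500), "gesture": (40, 1500),
          "visual_context": (60, 2500)}


def combined(display=None, camera=None):
    """Worst-case (cpu_percent, ram_kb) for base + optional features."""
    cpu, ram = BASE_CPU, BASE_RAM_KB
    for table, key in ((DISPLAY, display), (CAMERA, camera)):
        if key is not None:
            c, r = table[key]
            cpu += c
            ram += r
    return cpu, ram


def fits(display=None, camera=None):
    """True if the combination stays within the assumed limits."""
    cpu, ram = combined(display, camera)
    return cpu <= CPU_CEILING and ram <= APP_RAM_KB


print(combined("waveform", "person"), fits("waveform", "person"))
```

Under these assumptions CPU is the limiting factor, not RAM: person detection fits alongside the status display but not alongside the waveform.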

Recommendations

✅ IMPLEMENT NOW: Basic Status Display

Why: Very low overhead, huge UX improvement
Overhead: 2-5% CPU, 200KB RAM
Benefit: Users know what's happening at a glance
Difficulty: Easy (MaixPy has good LCD support)

✅ IMPLEMENT SOON: Waveform Visualizer

Why: Cool factor, moderate overhead
Overhead: 10-15% CPU, 300KB RAM
Benefit: Engaging, confirms mic is working, looks professional
Difficulty: Moderate (simple drawing code)

🤔 CONSIDER LATER: Spectrum Analyzer

Why: Higher overhead, diminishing returns
Overhead: 20-30% CPU, 500KB RAM
Benefit: Looks cool but not essential
Difficulty: Moderate-High (FFT required)

❌ SKIP FOR NOW: Camera Features

Why: High overhead, complex, privacy concerns
Overhead: 30-60% CPU, 1.5-2.5MB RAM
Benefit: Novel but not core functionality
Difficulty: High (model integration, privacy handling)

Implementation Priority

Phase 1 (Week 1): Core Functionality

Audio capture and streaming
Server integration
Basic LCD status display
- Idle/Listening/Processing states
- WiFi status
- Connection indicator

Phase 2 (Week 2-3): Visual Enhancement

Audio waveform visualizer
- Input (microphone) waveform
- Output (TTS) waveform
- Level meters
- Clean, minimal design

Phase 3 (Month 2): Polish

Spectrum analyzer option
Animated transitions
Settings display
Network configuration UI (optional)

Phase 4 (Month 3+): Advanced Features

Camera person detection (privacy mode)
Gesture control experiments
Visual wake word alternative

Code Structure Recommendation

# main.py structure with modular display
# display_manager, audio_processor, and voice_client are project modules
# still to be written; the interfaces shown here are placeholders.

import lcd
from display_manager import DisplayManager
from audio_processor import AudioProcessor
from voice_client import VoiceClient

# Initialize
lcd.init()
display = DisplayManager(mode='waveform')  # or 'status' or 'spectrum'
audio_processor = AudioProcessor()
voice_client = VoiceClient()

# Main loop
while True:
    # Audio processing
    audio_buffer = audio_processor.capture()

    # Update display (non-blocking)
    if display.mode == 'status':
        display.show_status(state='listening', wifi_level=75)
    elif display.mode == 'waveform':
        display.show_waveform(audio_buffer)
    elif display.mode == 'spectrum':
        display.show_spectrum(audio_buffer)

    # Network communication
    voice_client.stream_audio(audio_buffer)
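
One way to keep the display update from starving audio streaming is to redraw at a capped frame rate and skip the draw on all other loop iterations. A minimal sketch, assuming a millisecond timestamp is passed in; `FrameThrottle` is a hypothetical helper, not part of MaixPy.

```python
class FrameThrottle:
    """Allow a redraw at most `max_fps` times per second."""

    def __init__(self, max_fps):
        self.interval_ms = 1000 // max_fps
        self._last_ms = None

    def ready(self, now_ms):
        """Return True (and restart the interval) when a redraw is due."""
        if self._last_ms is None or now_ms - self._last_ms >= self.interval_ms:
            self._last_ms = now_ms
            return True
        return False


throttle = FrameThrottle(max_fps=20)
print([throttle.ready(t) for t in (0, 30, 50, 99, 100)])
```

In the main loop this would wrap the display call, for example `if throttle.ready(now_ms): display.show_waveform(audio_buffer)`. On MicroPython the timestamp would likely come from `time.ticks_ms()`, with `time.ticks_diff()` replacing the plain subtraction to handle counter wraparound.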

Estimated Overhead (Not Yet Measured)

Status Display Only

CPU: 38% total (3% for display)
RAM: 2.2MB total (200KB for display)
Battery Life: -2% (minimal impact)
WiFi Latency: No impact
Verdict: ✅ Negligible impact, worth it!

Waveform Visualizer

CPU: 48% total (13% for display)
RAM: 2.3MB total (300KB for display)
Battery Life: -5% (minor impact)
WiFi Latency: No impact (still <200ms)
Verdict: ✅ Acceptable, looks great!

Spectrum Analyzer

CPU: 60% total (25% for display)
RAM: 2.5MB total (500KB for display)
Battery Life: -8% (noticeable)
WiFi Latency: Possible minor impact
Verdict: ⚠️ Usable but pushing limits

Camera: Should You Use It?

Pros

✅ Already have the hardware (free!)
✅ Novel features (person detection, gestures)
✅ Privacy enhancement potential
✅ Future-proofing

Cons

❌ High resource usage (30-60% CPU, 1.5-2.5MB RAM)
❌ Complex implementation
❌ Privacy concerns (camera always on)
❌ Not core to voice assistant
❌ Competes with audio processing resources

Recommendation

Skip the camera for the initial implementation and focus on core voice assistant functionality. Revisit in Phase 4 (Month 3+) when:

Core features are stable
You want to experiment
You have time for optimization
You want to differentiate from commercial assistants

Final Recommendations

Start With (NOW):

# Simple status display
# - State indicator
# - WiFi status  
# - Connection status
# - Time/date
# Overhead: ~3% CPU, 200KB RAM

Add Next (Week 2):

# Waveform visualizer
# - Real-time audio waveform
# - Level meter
# - Clean design
# Overhead: +10% CPU, +100KB RAM

Maybe Later (Month 2+):

# Spectrum analyzer
# - 8-16 frequency bands
# - FFT visualization
# - Optional mode
# Overhead: +15% CPU, +200KB RAM

Skip (For Now):

# Camera features
# - Person detection
# - Gestures
# - Visual context
# Too complex, revisit later

Example: Combined Status + Waveform Display

┌───────────────────────────────┐
│ Voice Assistant    [LISTENING]│
├───────────────────────────────┤
│                               │
│  ╱╲    ╱╲  ╱╲    ╱╲  ╱╲      │
│ ╱  ╲  ╱  ╲╱  ╲  ╱  ╲╱  ╲     │
│      ╲╱          ╲╱           │
│                               │
│ Vol: [████████░░] WiFi: ▂▃▅█ │
│                               │
│ Server: 10.1.10.71 ● 14:23   │
└───────────────────────────────┘

Total Overhead: ~15% CPU, 300KB RAM
Impact: Minimal, excellent UX improvement
Coolness Factor: 9/10

Conclusion

LCD: YES! Definitely Use It! ✅

Status display: Low overhead, huge benefit
Waveform: Moderate overhead, looks amazing
Spectrum: Higher overhead, nice-to-have

Recommendation: Start with status, add waveform, consider spectrum later.

Camera: Skip For Now ❌

High overhead
Complex implementation
Not core functionality
Revisit in Phase 4 (Month 3+)

Focus on nailing the voice assistant first, then add visual features incrementally!

TL;DR: Use the LCD for status + waveform visualization (~15% overhead total). Skip the camera for now. Your K210 can easily handle this! 🎉
