minerva/hardware/maixduino/SESSION_PROGRESS_2025-12-03.md
pyr0ball 173f7f37d4 feat: import mycroft-precise work as Minerva foundation
Ports prior voice assistant research and prototypes from devl/Devops
into the Minerva repo. Includes:

- docs/: architecture, wake word guides, ESP32-S3 spec, hardware buying guide
- scripts/: voice_server.py, voice_server_enhanced.py, setup scripts
- hardware/maixduino/: edge device scripts with WiFi credentials scrubbed
  (replaced hardcoded password with secrets.py pattern)
- config/.env.example: server config template
- .gitignore: excludes .env, secrets.py, model blobs, ELF firmware
- CLAUDE.md: Minerva product context and connection to cf-voice roadmap
2026-04-06 22:21:12 -07:00


Maixduino Voice Assistant - Session Progress

Date: 2025-12-03
Session Duration: ~4 hours
Goal: Get audio recording and transcription working on Maixduino → Heimdall server


🎉 Major Achievements

Full Audio Pipeline Working!

We successfully built and tested the complete audio capture → compression → transmission → transcription pipeline:

  1. WiFi Connection - Maixduino connects to network (10.1.10.98)
  2. Audio Recording - I2S microphone captures audio (MSM261S4030H0 MEMS mic)
  3. Format Conversion - Converts 32-bit stereo to 16-bit mono (4x size reduction)
  4. μ-law Compression - Compresses PCM audio by 50%
  5. HTTP Transmission - Sends compressed WAV to Heimdall server
  6. Whisper Transcription - Server transcribes and returns text
  7. LCD Display - Shows transcription on Maixduino screen
  8. Button Loop - Press BOOT button for repeated recordings

Total size reduction: 128KB → 32KB (mono) → 16KB (compressed) = 87.5% reduction!
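The arithmetic behind these figures can be checked with a short sketch. It assumes one second of audio at 16kHz (the sample rate listed under Audio Format) and counts 1KB as 1000 bytes; the function name `pipeline_sizes` is illustrative and not part of the project code.

```python
def pipeline_sizes(sample_rate, seconds):
    """Return (raw, mono, ulaw) byte counts for each stage of the pipeline."""
    samples = sample_rate * seconds
    raw = samples * 2 * 4    # 2 channels x 4 bytes: 32-bit stereo from the I2S hardware
    mono = samples * 1 * 2   # 1 channel x 2 bytes: 16-bit mono after conversion
    ulaw = samples * 1 * 1   # 1 channel x 1 byte: 8-bit mu-law after compression
    return raw, mono, ulaw


raw, mono, ulaw = pipeline_sizes(16000, 1)
reduction = 100.0 * (raw - ulaw) / raw
print(raw, mono, ulaw, reduction)  # 128000 32000 16000 87.5
```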


🔧 Technical Accomplishments

Audio Recording Pipeline

  • Initial Problem: i2s_dev.record() returned immediately (1ms instead of 1000ms)
  • Root Cause: Recording API is asynchronous/non-blocking
  • Solution: Use chunked recording with wait_record() blocking calls
  • Pattern:
    chunks = []
    for i in range(frame_cnt):
        audio_chunk = i2s_dev.record(chunk_size)
        i2s_dev.wait_record()  # CRITICAL: blocks until this chunk is complete
        chunks.append(audio_chunk.to_bytes())

Memory Management

  • K210 has very limited RAM (~6MB total, much less available)
  • Successfully handled 128KB → 16KB data transformation without OOM errors
  • Techniques used:
    • Record in small chunks (2048 samples)
    • Stream HTTP transmission (512-byte chunks with delays)
    • In-place data conversion where possible
    • Releasing large buffers explicitly (audio_data = None) so the garbage collector can reclaim them

Network Communication

  • Raw socket HTTP (no urequests library available)
  • Chunked streaming with flow control (10ms delays)
  • Simple WAV format with μ-law compression (format code 7)
  • Robust error handling with serial output debugging
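A minimal sketch of the raw-socket approach follows. It is not the code from send_to_server(); the `Content-Type` value, the `Connection: close` header, and the helper names are assumptions. "Chunked" here means sending the payload in 512-byte pieces with a short pause, not HTTP chunked transfer encoding, so a plain `Content-Length` header is used. CRLF is built from byte values because the quirks list reports trouble with escape sequences. `RecordingSocket` is a hypothetical stand-in so the sketch runs without hardware.

```python
import time

CRLF = bytes([13, 10])  # carriage return + line feed, built without escape sequences


def build_post_header(host, port, path, body_len):
    """Build a minimal HTTP/1.1 POST header as bytes (header set is an assumption)."""
    lines = [
        "POST " + path + " HTTP/1.1",
        "Host: " + host + ":" + str(port),
        "Content-Type: audio/wav",
        "Content-Length: " + str(body_len),
        "Connection: close",
    ]
    header = b""
    for line in lines:
        header += line.encode("utf-8") + CRLF
    return header + CRLF  # blank line terminates the header section


def send_in_chunks(sock, payload, chunk_size=512, delay_s=0.01):
    """Send payload in small pieces, pausing between them as crude flow control."""
    sent = 0
    while sent < len(payload):
        chunk = payload[sent:sent + chunk_size]
        n = sock.send(chunk)
        if n is None:          # some socket implementations return nothing
            n = len(chunk)
        if n == 0:
            raise OSError("socket closed during send")
        sent += n
        time.sleep(delay_s)
    return sent


class RecordingSocket:
    """Hypothetical stand-in for a real socket; records what was sent."""

    def __init__(self):
        self.buf = b""
        self.calls = 0

    def send(self, data):
        self.buf += data
        self.calls += 1
        return len(data)


demo_sock = RecordingSocket()
demo_body = bytes(1300)
demo_header = build_post_header("10.1.10.71", 3006, "/transcribe", len(demo_body))
send_in_chunks(demo_sock, demo_header + demo_body, 512, 0)
```

On the device, `sock` would be a connected socket and the delay would stay at roughly 10ms, matching the flow-control delay noted above.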

🐛 MicroPython/MaixPy Quirks Discovered

String Operations

  • F-strings NOT supported - Must use "text " + str(var) concatenation
  • Ternary operators fail - Use explicit if/else blocks instead
  • split() needs explicit delimiter - text.split(" ") not text.split()
  • Escape sequences problematic - Avoid \n in strings, causes syntax errors

Data Types & Methods

  • decode() doesn't accept kwargs - Use decode('utf-8') not decode('utf-8', errors='ignore')
  • RGB tuples not accepted - Must convert to packed integers: (r << 16) | (g << 8) | b
  • Bytearray item deletion unsupported - del arr[n:] fails, use slicing instead
  • Arithmetic inside string concatenation fails - Compute first, then concatenate: next_count = count + 1; "text" + str(next_count)
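The workarounds in this list can be sketched as small helpers. The function names are illustrative only, and the code deliberately avoids f-strings, ternaries, and keyword arguments so that it stays within the subset described here.

```python
def pack_rgb(r, g, b):
    """Pack three 0-255 components into one integer: (r << 16) | (g << 8) | b."""
    return (r << 16) | (g << 8) | b


def truncate(buf, n):
    """Keep the first n bytes by slicing, instead of 'del buf[n:]'."""
    return buf[:n]


def safe_decode(raw):
    """Decode UTF-8 without keyword arguments; fall back to an empty string."""
    try:
        return raw.decode("utf-8")
    except UnicodeError:
        return ""


def make_label(count):
    """Do the arithmetic first, then concatenate."""
    next_count = count + 1
    return "Recording " + str(next_count)


print(hex(pack_rgb(255, 128, 0)))  # 0xff8000
print(make_label(2))               # Recording 3
```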

I2S Audio Specific

  • record() is non-blocking - Returns immediately, must use wait_record()
  • Audio object not directly iterable - Must call .to_bytes() first
  • ⚠️ Data format mismatch - Hardware returns 32-bit stereo even when configured for 16-bit mono (4x expected size)
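One way to handle the format mismatch is sketched below. It is not the project's convert_to_mono_16bit(); the byte layout is an assumption (little-endian 32-bit samples, left channel first, audio in the upper 16 bits) and should be checked against real captures. Under that assumption, each 8-byte stereo frame is reduced to the two high bytes of the left sample.

```python
import struct  # only used for the demonstration data below


def stereo32_to_mono16(raw):
    """Reduce 32-bit stereo frames to 16-bit mono (assumed layout, see lead-in)."""
    frame_bytes = 8                      # 2 channels x 4 bytes
    n_frames = len(raw) // frame_bytes   # ignore any trailing partial frame
    out = bytearray(n_frames * 2)
    for i in range(n_frames):
        src = i * frame_bytes
        out[2 * i] = raw[src + 2]        # low byte of the upper 16 bits (left channel)
        out[2 * i + 1] = raw[src + 3]    # high byte of the upper 16 bits (left channel)
    return out


# Two stereo frames: left = 0x12340000 and -65536, right channel ignored
demo_raw = struct.pack("<iiii", 0x12340000, 0, -65536, 0)
demo_mono = stereo32_to_mono16(demo_raw)
print(struct.unpack("<hh", bytes(demo_mono)))  # (4660, -1)
```

The output is one quarter the size of the input, which matches the 4x reduction reported for this step.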

Network/WiFi

  • network.WLAN not available - Must use network.ESP32_SPI with full pin config
  • active() method doesn't exist - Just call connect() directly
  • ⚠️ Requires ALL 6 pins configured - CS, RST, RDY, MOSI, MISO, SCLK

General Syntax

  • ⚠️ if __name__ == "__main__" sometimes causes syntax errors - Safer to just call main() directly
  • ⚠️ Import statements mid-function can cause syntax errors - Keep imports at top of file
  • ⚠️ Some valid Python causes "invalid syntax" for unknown reasons - Simplify complex expressions

📊 Current Status

Working

  • WiFi connectivity (ESP32 SPI)
  • I2S audio initialization
  • Chunked audio recording with wait_record()
  • Audio format detection and conversion (32-bit stereo → 16-bit mono)
  • μ-law compression (50% size reduction)
  • HTTP transmission to server (chunked streaming)
  • Whisper transcription (server-side)
  • JSON response parsing
  • LCD display (with word wrapping)
  • Button-triggered recording loop
  • Countdown timer before recording

⚠️ Partially Working

  • Recording duration - Currently getting ~0.9 seconds instead of full 1 second
    • Formula: frame_cnt = seconds * sample_rate // chunk_size
    • Current: 7 frames × (2048/16000) = 0.896s
    • May need to increase frame_cnt or adjust chunk size

Not Yet Implemented

  • Mycroft Precise wake word detection
  • Full voice assistant loop
  • Command processing
  • Home Assistant integration
  • Multi-second recording support
  • Real-time audio streaming

🔬 Technical Details

Hardware Configuration

Maixduino Board:

  • Processor: K210 dual-core RISC-V @ 400MHz
  • RAM: ~6MB total (limited available memory)
  • WiFi: ESP32 module via SPI
  • Microphone: MSM261S4030H0 MEMS (onboard)
  • IP Address: 10.1.10.98

I2S Pins:

  • Pin 20: I2S0_IN_D0 (data)
  • Pin 19: I2S0_WS (word select)
  • Pin 18: I2S0_SCLK (clock)

ESP32 SPI Pins:

  • Pin 25: CS (chip select)
  • Pin 8: RST (reset)
  • Pin 9: RDY (ready)
  • Pin 28: MOSI (master out)
  • Pin 26: MISO (master in)
  • Pin 27: SCLK (clock)

GPIO:

  • Pin 16: BOOT button (active low, pull-up)

Server Configuration

Heimdall Server:

  • IP: 10.1.10.71
  • Port: 3006
  • Framework: Flask
  • Model: Whisper base
  • Environment: Conda whisper_cli

Endpoints:

  • /health - Health check
  • /transcribe - POST audio for transcription

Audio Format

Recording:

  • Sample Rate: 16kHz
  • Hardware Output: 32-bit stereo (128KB for 1 second)
  • After Conversion: 16-bit mono (32KB for 1 second)
  • After Compression: 8-bit μ-law (16KB for 1 second)
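The last step, 16-bit PCM to 8-bit μ-law, can be sketched with the commonly used G.711 μ-law encoding (bias 0x84, clip at 32635). This is a generic sketch under that assumption, not the project's compress_ulaw(); it assumes little-endian signed 16-bit input.

```python
ULAW_BIAS = 0x84
ULAW_CLIP = 32635


def linear_to_ulaw(sample):
    """Encode one signed 16-bit sample as a mu-law byte."""
    sign = 0
    if sample < 0:
        sign = 0x80
        sample = -sample
    if sample > ULAW_CLIP:
        sample = ULAW_CLIP
    sample = sample + ULAW_BIAS
    # Find the segment (exponent): position of the highest set bit above bit 7
    exponent = 7
    mask = 0x4000
    while exponent > 0 and (sample & mask) == 0:
        exponent -= 1
        mask >>= 1
    mantissa = (sample >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def pcm16_to_ulaw(pcm):
    """Compress little-endian 16-bit PCM bytes to mu-law (one byte per sample)."""
    n = len(pcm) // 2
    out = bytearray(n)
    for i in range(n):
        value = pcm[2 * i] | (pcm[2 * i + 1] << 8)
        if value >= 32768:          # reinterpret as signed
            value -= 65536
        out[i] = linear_to_ulaw(value)
    return out


# Samples 0, 32767, -32768 as little-endian 16-bit values
demo_pcm = bytes([0x00, 0x00, 0xFF, 0x7F, 0x00, 0x80])
print(list(pcm16_to_ulaw(demo_pcm)))  # [255, 128, 0]
```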

WAV Header:

  • Format Code: 7 (μ-law)
  • Channels: 1 (mono)
  • Sample Rate: 16000 Hz
  • Bits per Sample: 8
  • Includes fact chunk (required for μ-law)
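A header with these fields could be built as sketched below. It is an illustration, not the project's create_wav_header(); it assumes an 18-byte fmt chunk (with a zero extension size, as used for non-PCM formats) followed by a fact chunk holding the sample count, then the data chunk. The function name is hypothetical.

```python
try:
    import ustruct as struct  # MicroPython name (assumed available on the board)
except ImportError:
    import struct             # CPython fallback


def ulaw_wav_header(data_len, sample_rate=16000):
    """Build a 58-byte WAV header for 8-bit mono mu-law audio."""
    channels = 1
    bits = 8
    block_align = channels * bits // 8          # 1 byte per sample frame
    byte_rate = sample_rate * block_align
    fmt = struct.pack("<HHIIHHH", 7, channels, sample_rate,
                      byte_rate, block_align, bits, 0)   # 7 = mu-law, 0 = no extension
    fact = struct.pack("<I", data_len // block_align)    # number of sample frames
    riff_size = 4 + (8 + len(fmt)) + (8 + len(fact)) + (8 + data_len)
    header = b"RIFF" + struct.pack("<I", riff_size) + b"WAVE"
    header += b"fmt " + struct.pack("<I", len(fmt)) + fmt
    header += b"fact" + struct.pack("<I", len(fact)) + fact
    header += b"data" + struct.pack("<I", data_len)
    return header


demo_wav_header = ulaw_wav_header(16000)
print(len(demo_wav_header))  # 58
```

The compressed audio bytes follow the header directly, so one second at 16kHz gives a 16058-byte file under these assumptions.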

📝 Code Files

Main Script

File: /Library/Development/devl/Devops/projects/mycroft-precise/maixduino-scripts/maix_simple_record_test.py

Key Functions:

  • init_wifi() - ESP32 SPI WiFi connection
  • init_audio() - I2S microphone setup
  • record_audio() - Chunked recording with wait_record()
  • convert_to_mono_16bit() - Format conversion (32-bit stereo → 16-bit mono)
  • compress_ulaw() - μ-law compression
  • create_wav_header() - WAV file header generation
  • send_to_server() - HTTP POST with chunked streaming
  • display_transcription() - LCD output with word wrapping
  • main() - Button loop for repeated recordings
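As an illustration of the word wrapping that display_transcription() performs, here is a minimal greedy wrapper. It is a sketch, not the project's implementation; the line width is a hypothetical parameter, and split(" ") is used with an explicit delimiter as the quirks list recommends.

```python
def wrap_words(text, max_chars):
    """Greedy word wrap: fill each line up to max_chars, breaking on spaces."""
    lines = []
    current = ""
    for word in text.split(" "):
        if word == "":
            continue                      # skip empty pieces from repeated spaces
        if current == "":
            current = word                # an over-long word gets its own line
        elif len(current) + 1 + len(word) <= max_chars:
            current = current + " " + word
        else:
            lines.append(current)
            current = word
    if current != "":
        lines.append(current)
    return lines


print(wrap_words("the quick brown fox", 10))  # ['the quick', 'brown fox']
```

Each returned line could then be drawn at an increasing y offset on the LCD.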

Server Script

File: /devl/voice-assistant/simple_transcribe_server.py

Features:

  • Accepts raw WAV or multipart uploads
  • Whisper base model transcription
  • JSON response with transcription text
  • Handles μ-law compressed audio
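On the client side, the JSON response has to be pulled out of the raw HTTP reply. The sketch below assumes that a `json` or `ujson` module is importable and that the transcription is returned under a `"text"` key; both are assumptions. The Key Learnings section lists JSON parsing among the missing pieces, so the on-device code may need manual string extraction instead.

```python
try:
    import ujson as json   # MicroPython name (assumed available)
except ImportError:
    import json            # CPython fallback

CRLF = bytes([13, 10])
HEADER_END = CRLF + CRLF   # blank line separating headers from body


def parse_transcription(response, field="text"):
    """Return the transcription string from a raw HTTP response, or None."""
    idx = response.find(HEADER_END)
    if idx < 0:
        return None                       # no header/body separator found
    body = response[idx + len(HEADER_END):]
    try:
        payload = json.loads(body.decode("utf-8"))
    except ValueError:
        return None                       # body was not valid UTF-8 JSON
    if not isinstance(payload, dict):
        return None
    return payload.get(field)


demo_response = (b"HTTP/1.1 200 OK" + CRLF +
                 b"Content-Type: application/json" + HEADER_END +
                 b'{"text": "hello world"}')
print(parse_transcription(demo_response))  # hello world
```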

Documentation

File: /Library/Development/devl/Devops/projects/mycroft-precise/maixduino-scripts/MICROPYTHON_QUIRKS.md

Complete reference of all MicroPython compatibility issues discovered during development.


🎯 Next Steps

Immediate (Tonight)

  1. Switch to Linux laptop with direct serial access
  2. ⏭️ Tune recording duration to get full 1 second
    • Try frame_cnt = 8 instead of 7
    • Or adjust chunk size to get exact timing
  3. ⏭️ Test transcription quality with proper-length recordings

Short Term (This Week)

  1. Increase recording duration to 2-3 seconds for better transcription
  2. Test memory limits with longer recordings
  3. Optimize compression/transmission for speed
  4. Add visual feedback during transmission

Medium Term (Next Week)

  1. Install Mycroft Precise in whisper_cli environment
  2. Test "hey mycroft" wake word detection on server
  3. Integrate wake word into recording loop
  4. Add command processing and Home Assistant integration

Long Term (Future)

  1. Explore edge wake word detection (Precise on K210)
  2. Multi-device deployment
  3. Continuous listening mode
  4. Voice profiles and speaker identification

🐛 Known Issues

Recording Duration

  • Issue: Recording is ~0.9 seconds instead of 1.0 seconds
  • Cause: 16000 / 2048 = 7.8125, and integer division (16000 // 2048) truncates this to 7 frames
  • Impact: Minor - transcription still works
  • Fix: Increase frame_cnt to 8 or adjust chunk size
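The fix can be expressed as ceiling division, so the frame count always covers at least the requested duration. This is a sketch with illustrative function names; it assumes `seconds` is a whole number.

```python
def frames_floor(seconds, sample_rate, chunk_size):
    """Current formula: floor division, which can fall short of the target."""
    return seconds * sample_rate // chunk_size


def frames_ceil(seconds, sample_rate, chunk_size):
    """Ceiling division: round up so the recording is never too short."""
    return (seconds * sample_rate + chunk_size - 1) // chunk_size


def duration_s(frame_cnt, sample_rate, chunk_size):
    """Actual recorded duration for a given frame count."""
    return frame_cnt * chunk_size / sample_rate


print(frames_floor(1, 16000, 2048), duration_s(7, 16000, 2048))  # 7 0.896
print(frames_ceil(1, 16000, 2048), duration_s(8, 16000, 2048))   # 8 1.024
```

With 8 frames the recording runs slightly long (1.024s) rather than short; the extra 384 samples could be trimmed after capture if an exact length matters.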

Data Format Mismatch

  • Issue: Hardware returns 4x expected data (128KB vs 32KB)
  • Cause: I2S outputting 32-bit stereo despite 16-bit mono config
  • Impact: None - conversion function handles it
  • Status: Workaround in place (handled by convert_to_mono_16bit())

Syntax Error Sensitivity

  • Issue: Some valid Python causes "invalid syntax" in MicroPython
  • Patterns: Import statements mid-function, certain arithmetic expressions
  • Workaround: Simplify code, avoid complex expressions
  • Status: Documented in MICROPYTHON_QUIRKS.md

💡 Key Learnings

I2S Recording Pattern

The correct pattern for MaixPy I2S recording:

chunk_size = 2048
frame_cnt = seconds * sample_rate // chunk_size  # floor division: 7 frames for 1s at 16kHz
data = []

for i in range(frame_cnt):
    audio_chunk = i2s_dev.record(chunk_size)
    i2s_dev.wait_record()  # BLOCKS until recording complete
    data.append(audio_chunk.to_bytes())

Critical: wait_record() is REQUIRED. Without it, record() returns immediately (about 1ms instead of the expected duration)!

Memory Management

K210 has very limited RAM. Successful strategies:

  • Work in small chunks (2048-sample recording chunks, 512-byte network chunks)
  • Stream data instead of buffering
  • Free variables explicitly when done
  • Avoid creating large intermediate buffers

MicroPython Compatibility

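MicroPython is NOT Python. On this MaixPy build, many standard features were missing or unreliable:

  • F-strings, ternary operators, keyword arguments on some methods (e.g. decode())
  • Some string method defaults (e.g. split() with no delimiter), complex expressions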
  • Standard libraries (urequests, json parsing)

Rule: Test incrementally, simplify everything, check quirks doc.


📚 Resources Used

Tools

  • MaixPy IDE (copy/paste to board)
  • Serial monitor (debugging)
  • Heimdall server (Whisper transcription)

🔄 Ready for Next Session

Current State

  • Code is working and stable
  • Can record, compress, transmit, transcribe, display
  • Button loop allows repeated testing
  • ⚠️ Recording duration slightly short (~0.9s)

Files Ready

  • /Library/Development/devl/Devops/projects/mycroft-precise/maixduino-scripts/maix_simple_record_test.py
  • /Library/Development/devl/Devops/projects/mycroft-precise/maixduino-scripts/MICROPYTHON_QUIRKS.md
  • /devl/voice-assistant/simple_transcribe_server.py

For Serial Access Session

  1. Connect Maixduino via USB to Linux laptop
  2. Install pyserial: pip install pyserial
  3. Find device: ls /dev/ttyUSB* or /dev/ttyACM*
  4. Connect: screen /dev/ttyUSB0 115200 or use MaixPy IDE
  5. Can directly modify code, test immediately, see serial output

Quick Test Commands

# Test WiFi
from network import ESP32_SPI
# ... (full init code in maix_test_simple.py)

# Test I2S
from Maix import I2S
rx = I2S(I2S.DEVICE_0)
# ...

# Test recording
audio = rx.record(2048)
rx.wait_record()
print(len(audio.to_bytes()))

🎊 Success Metrics

Today we achieved:

  • WiFi connection working
  • Audio recording working (with proper blocking)
  • Format conversion working (4x reduction)
  • Compression working (2x reduction)
  • Network transmission working (chunked streaming)
  • Server transcription working
  • Display output working
  • Button loop working
  • End-to-end pipeline complete!

Total: 9/9 core features working! 🚀

Minor tuning needed, but the foundation is solid and ready for wake word integration.


Session Summary: Massive progress! From zero to working audio transcription pipeline in one session. Overcame significant MicroPython compatibility challenges and memory limitations. Ready for next phase: wake word detection.

Status: Ready for Linux serial access and fine-tuning
Next Session: Tune recording duration, then integrate Mycroft Precise wake word detection


End of Session Report - 2025-12-03