Ports prior voice assistant research and prototypes from devl/Devops into the Minerva repo. Includes:
- docs/: architecture, wake word guides, ESP32-S3 spec, hardware buying guide
- scripts/: voice_server.py, voice_server_enhanced.py, setup scripts
- hardware/maixduino/: edge device scripts with WiFi credentials scrubbed (replaced hardcoded password with secrets.py pattern)
- config/.env.example: server config template
- .gitignore: excludes .env, secrets.py, model blobs, ELF firmware
- CLAUDE.md: Minerva product context and connection to cf-voice roadmap
Maixduino Voice Assistant - Session Progress
Date: 2025-12-03
Session Duration: ~4 hours
Goal: Get audio recording and transcription working on Maixduino → Heimdall server
🎉 Major Achievements
✅ Full Audio Pipeline Working!
We successfully built and tested the complete audio capture → compression → transmission → transcription pipeline:
- WiFi Connection - Maixduino connects to network (10.1.10.98)
- Audio Recording - I2S microphone captures audio (MSM261S4030H0 MEMS mic)
- Format Conversion - Converts 32-bit stereo to 16-bit mono (4x size reduction)
- μ-law Compression - Compresses PCM audio by 50%
- HTTP Transmission - Sends compressed WAV to Heimdall server
- Whisper Transcription - Server transcribes and returns text
- LCD Display - Shows transcription on Maixduino screen
- Button Loop - Press BOOT button for repeated recordings
Total size reduction: 128KB → 32KB (mono) → 16KB (compressed) = 87.5% reduction!
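The size figures above follow from the sample format at each stage. A minimal sketch of the arithmetic, assuming exactly 16,000 samples per second and treating 1KB as 1,000 bytes (the function name `stage_sizes` is illustrative and not taken from the project code):

```python
def stage_sizes(sample_rate, seconds):
    """Return (raw, mono, ulaw) byte counts for one recording."""
    samples = sample_rate * seconds
    raw_bytes = samples * 4 * 2   # 32-bit samples, 2 channels (as delivered by the hardware)
    mono_bytes = samples * 2      # 16-bit samples, 1 channel
    ulaw_bytes = samples * 1      # 8-bit mu-law, 1 channel
    return raw_bytes, mono_bytes, ulaw_bytes


raw_bytes, mono_bytes, ulaw_bytes = stage_sizes(16000, 1)
reduction = 1 - ulaw_bytes / raw_bytes
print(raw_bytes, mono_bytes, ulaw_bytes, reduction)
```

For a one-second recording this gives 128,000, 32,000 and 16,000 bytes, which matches the 87.5% figure.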
🔧 Technical Accomplishments
Audio Recording Pipeline
- Initial Problem: `i2s_dev.record()` returned immediately (1ms instead of 1000ms)
- Root Cause: Recording API is asynchronous/non-blocking
- Solution: Use chunked recording with `wait_record()` blocking calls
- Pattern:

      for i in range(frame_cnt):
          audio_chunk = i2s_dev.record(chunk_size)
          i2s_dev.wait_record()  # CRITICAL: blocks until complete
          chunks.append(audio_chunk.to_bytes())
Memory Management
- K210 has very limited RAM (~6MB total, much less available)
- Successfully handled 128KB → 16KB data transformation without OOM errors
- Techniques used:
  - Record in small chunks (2048 samples)
  - Stream HTTP transmission (512-byte chunks with delays)
  - In-place data conversion where possible
  - Explicit garbage collection hints (`audio_data = None`)
Network Communication
- Raw socket HTTP (no urequests library available)
- Chunked streaming with flow control (10ms delays)
- Simple WAV format with μ-law compression (format code 7)
- Robust error handling with serial output debugging
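The raw-socket approach with chunked streaming can be sketched as below. This is a desktop-Python illustration, not the code from `send_to_server()`: the function names, the `audio/wav` content type, and the assumption that `sock.send()` returns the number of bytes written are all mine. On the board, the delay call and the way the CRLF bytes are built may need adjusting for the quirks listed in this report.

```python
import time


def build_post_header(host, port, path, body_len):
    """Build a minimal HTTP/1.1 POST header as bytes (no f-strings, per the quirks list)."""
    lines = [
        "POST " + path + " HTTP/1.1",
        "Host: " + host + ":" + str(port),
        "Content-Type: audio/wav",           # assumed content type
        "Content-Length: " + str(body_len),
        "Connection: close",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("utf-8")


def send_in_chunks(sock, payload, chunk_size=512, delay_s=0.01):
    """Send payload in small pieces, pausing between pieces as crude flow control."""
    view = memoryview(payload)
    sent = 0
    while sent < len(payload):
        n = sock.send(view[sent:sent + chunk_size])
        if not n:
            raise OSError("socket send made no progress")
        sent += n
        if delay_s > 0:
            time.sleep(delay_s)
    return sent
```

A caller would send the header first, then the WAV header and the compressed audio through `send_in_chunks()`, so that no single large buffer has to be handed to the WiFi module at once.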
🐛 MicroPython/MaixPy Quirks Discovered
String Operations
- ❌ F-strings NOT supported - Must use `"text " + str(var)` concatenation
- ❌ Ternary operators fail - Use explicit `if/else` blocks instead
- ❌ `split()` needs explicit delimiter - `text.split(" ")` not `text.split()`
- ❌ Escape sequences problematic - Avoid `\n` in strings, causes syntax errors
Data Types & Methods
- ❌ `decode()` doesn't accept kwargs - Use `decode('utf-8')` not `decode('utf-8', errors='ignore')`
- ❌ RGB tuples not accepted - Must convert to packed integers: `(r << 16) | (g << 8) | b`
- ❌ Bytearray item deletion unsupported - `del arr[n:]` fails, use slicing instead
- ❌ Arithmetic in string concat - Separate calculations: `next = count + 1; "text" + str(next)`
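The packed-integer color workaround can be wrapped in a small helper. A sketch, assuming 8-bit channels and the 24-bit layout given in the list above; `pack_rgb` is a hypothetical name, and the masking is an extra safety step that the original expression does not include:

```python
def pack_rgb(r, g, b):
    """Pack three 0-255 channel values into one integer: (r << 16) | (g << 8) | b."""
    r = r & 0xFF  # mask so out-of-range inputs cannot spill into another channel
    g = g & 0xFF
    b = b & 0xFF
    return (r << 16) | (g << 8) | b


RED = pack_rgb(255, 0, 0)
WHITE = pack_rgb(255, 255, 255)
```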
I2S Audio Specific
- ❌ `record()` is non-blocking - Returns immediately, must use `wait_record()`
- ❌ Audio object not directly iterable - Must call `.to_bytes()` first
- ⚠️ Data format mismatch - Hardware returns 32-bit stereo even when configured for 16-bit mono (4x expected size)
Network/WiFi
- ❌ `network.WLAN` not available - Must use `network.ESP32_SPI` with full pin config
- ❌ `active()` method doesn't exist - Just call `connect()` directly
- ⚠️ Requires ALL 6 pins configured - CS, RST, RDY, MOSI, MISO, SCLK
General Syntax
- ⚠️ `if __name__ == "__main__"` sometimes causes syntax errors - Safer to just call `main()` directly
- ⚠️ Import statements mid-function can cause syntax errors - Keep imports at top of file
- ⚠️ Some valid Python causes "invalid syntax" for unknown reasons - Simplify complex expressions
📊 Current Status
✅ Working
- WiFi connectivity (ESP32 SPI)
- I2S audio initialization
- Chunked audio recording with `wait_record()`
- Audio format detection and conversion (32-bit stereo → 16-bit mono)
- μ-law compression (50% size reduction)
- HTTP transmission to server (chunked streaming)
- Whisper transcription (server-side)
- JSON response parsing
- LCD display (with word wrapping)
- Button-triggered recording loop
- Countdown timer before recording
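Word wrapping for the LCD can be done with plain string operations that stay inside the quirks noted earlier (explicit `split(" ")`, no ternaries, no f-strings). A minimal sketch, not the code from `display_transcription()`; the function name and the line width in characters are assumptions:

```python
def wrap_words(text, width):
    """Greedy word wrap: return a list of lines no longer than width where possible."""
    lines = []
    current = ""
    for word in text.split(" "):
        if word == "":
            continue  # skip empty pieces produced by repeated spaces
        if current == "":
            current = word
        elif len(current) + 1 + len(word) <= width:
            current = current + " " + word
        else:
            lines.append(current)
            current = word
    if current != "":
        lines.append(current)
    return lines


for line in wrap_words("the quick brown fox", 10):
    print(line)
```

A single word longer than `width` is kept whole on its own line, so a caller that needs hard clipping would have to truncate it separately.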
⚠️ Partially Working
- Recording duration - Currently getting ~0.9 seconds instead of full 1 second
  - Formula: `frame_cnt = seconds * sample_rate // chunk_size`
  - Current: `7 frames × (2048/16000) = 0.896s`
  - May need to increase `frame_cnt` or adjust chunk size
❌ Not Yet Implemented
- Mycroft Precise wake word detection
- Full voice assistant loop
- Command processing
- Home Assistant integration
- Multi-second recording support
- Real-time audio streaming
🔬 Technical Details
Hardware Configuration
Maixduino Board:
- Processor: K210 dual-core RISC-V @ 400MHz
- RAM: ~6MB total (limited available memory)
- WiFi: ESP32 module via SPI
- Microphone: MSM261S4030H0 MEMS (onboard)
- IP Address: 10.1.10.98
I2S Pins:
- Pin 20: I2S0_IN_D0 (data)
- Pin 19: I2S0_WS (word select)
- Pin 18: I2S0_SCLK (clock)
ESP32 SPI Pins:
- Pin 25: CS (chip select)
- Pin 8: RST (reset)
- Pin 9: RDY (ready)
- Pin 28: MOSI (master out)
- Pin 26: MISO (master in)
- Pin 27: SCLK (clock)
GPIO:
- Pin 16: BOOT button (active low, pull-up)
Server Configuration
Heimdall Server:
- IP: 10.1.10.71
- Port: 3006
- Framework: Flask
- Model: Whisper base
- Environment: Conda `whisper_cli`
Endpoints:
- `/health` - Health check
- `/transcribe` - POST audio for transcription
Audio Format
Recording:
- Sample Rate: 16kHz
- Hardware Output: 32-bit stereo (128KB for 1 second)
- After Conversion: 16-bit mono (32KB for 1 second)
- After Compression: 8-bit μ-law (16KB for 1 second)
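The 16-bit to 8-bit step can be illustrated with the standard G.711 μ-law encoding. This sketch is not copied from `compress_ulaw()` in the project script, which may be implemented differently; the function names are hypothetical, and the input is assumed to be little-endian signed 16-bit mono PCM:

```python
ULAW_BIAS = 0x84   # 132, added before finding the segment
ULAW_CLIP = 32635  # largest magnitude that still fits after adding the bias


def linear_to_ulaw(sample):
    """Encode one signed 16-bit sample as one mu-law byte (G.711)."""
    sign = 0
    if sample < 0:
        sign = 0x80
        sample = -sample
    if sample > ULAW_CLIP:
        sample = ULAW_CLIP
    sample = sample + ULAW_BIAS
    # Find the segment (exponent): position of the highest set bit, from bit 14 down to bit 7.
    exponent = 7
    mask = 0x4000
    while exponent > 0 and (sample & mask) == 0:
        exponent -= 1
        mask >>= 1
    mantissa = (sample >> (exponent + 3)) & 0x0F
    return (~(sign | (exponent << 4) | mantissa)) & 0xFF


def pcm16_to_ulaw(pcm):
    """Convert little-endian 16-bit PCM bytes to mu-law bytes (half the size)."""
    out = bytearray(len(pcm) // 2)
    for i in range(len(out)):
        value = pcm[2 * i] | (pcm[2 * i + 1] << 8)
        if value >= 0x8000:
            value -= 0x10000  # reinterpret as signed
        out[i] = linear_to_ulaw(value)
    return bytes(out)
```

Each 2-byte sample becomes 1 byte, which is where the 32KB → 16KB step comes from.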
WAV Header:
- Format Code: 7 (μ-law)
- Channels: 1 (mono)
- Sample Rate: 16000 Hz
- Bits per Sample: 8
- Includes `fact` chunk (required for μ-law)
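A header with these fields can be laid out as follows. This is a sketch under assumptions, not the output of `create_wav_header()`: it uses an 18-byte `fmt ` chunk with a zero extension size, which is the usual layout for non-PCM formats, and the function name is hypothetical. It is written for desktop Python with the `struct` module.

```python
import struct


def ulaw_wav_header(data_len, sample_rate=16000):
    """Return a 58-byte WAV header for 8-bit mono mu-law audio of data_len bytes."""
    channels = 1
    bits_per_sample = 8
    block_align = channels * bits_per_sample // 8   # 1 byte per sample frame
    byte_rate = sample_rate * block_align
    pad = data_len & 1                               # chunks are padded to even length
    riff_size = 4 + (8 + 18) + (8 + 4) + (8 + data_len + pad)

    header = struct.pack("<4sI4s", b"RIFF", riff_size, b"WAVE")
    header += struct.pack("<4sIHHIIHHH", b"fmt ", 18,
                          7,                # format code 7 = mu-law
                          channels, sample_rate, byte_rate,
                          block_align, bits_per_sample,
                          0)                # cbSize: no extra format bytes
    header += struct.pack("<4sII", b"fact", 4, data_len)  # samples per channel
    header += struct.pack("<4sI", b"data", data_len)
    return header


header = ulaw_wav_header(16000)
print(len(header))
```

With one byte per sample and one channel, the sample count stored in the `fact` chunk equals the data length. If `data_len` is odd, the caller is expected to append one zero pad byte after the audio data.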
📝 Code Files
Main Script
File: /Library/Development/devl/Devops/projects/mycroft-precise/maixduino-scripts/maix_simple_record_test.py
Key Functions:
- `init_wifi()` - ESP32 SPI WiFi connection
- `init_audio()` - I2S microphone setup
- `record_audio()` - Chunked recording with `wait_record()`
- `convert_to_mono_16bit()` - Format conversion (32-bit stereo → 16-bit mono)
- `compress_ulaw()` - μ-law compression
- `create_wav_header()` - WAV file header generation
- `send_to_server()` - HTTP POST with chunked streaming
- `display_transcription()` - LCD output with word wrapping
- `main()` - Button loop for repeated recordings
Server Script
File: /devl/voice-assistant/simple_transcribe_server.py
Features:
- Accepts raw WAV or multipart uploads
- Whisper base model transcription
- JSON response with transcription text
- Handles μ-law compressed audio
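On the client side, the JSON reply has to be pulled out of the raw HTTP response because the request goes over a plain socket. A minimal sketch, assuming the server answers with a JSON object whose transcription is stored under a `"text"` key (the key name and the function name are assumptions, not confirmed by the server script) and that the response is not chunk-encoded:

```python
import json


def extract_transcription(raw_response, key="text"):
    """Return the transcription string from a raw HTTP response, or None on any problem."""
    split_at = raw_response.find(b"\r\n\r\n")   # blank line between headers and body
    if split_at < 0:
        return None
    status_line = raw_response[:split_at].split(b"\r\n")[0]
    parts = status_line.split(b" ")
    if len(parts) < 2 or parts[1] != b"200":
        return None
    body = raw_response[split_at + 4:]
    try:
        data = json.loads(body.decode("utf-8"))
    except ValueError:
        return None
    if not isinstance(data, dict):
        return None
    return data.get(key)


reply = b'HTTP/1.1 200 OK\r\nContent-Type: application/json\r\n\r\n{"text": "hello world"}'
print(extract_transcription(reply))
```

Returning `None` for every failure keeps the caller's logic to a single check before writing to the LCD.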
Documentation
File: /Library/Development/devl/Devops/projects/mycroft-precise/maixduino-scripts/MICROPYTHON_QUIRKS.md
Complete reference of all MicroPython compatibility issues discovered during development.
🎯 Next Steps
Immediate (Tonight)
- ✅ Switch to Linux laptop with direct serial access
- ⏭️ Tune recording duration to get full 1 second
  - Try `frame_cnt = 8` instead of 7
  - Or adjust chunk size to get exact timing
- ⏭️ Test transcription quality with proper-length recordings
Short Term (This Week)
- Increase recording duration to 2-3 seconds for better transcription
- Test memory limits with longer recordings
- Optimize compression/transmission for speed
- Add visual feedback during transmission
Medium Term (Next Week)
- Install Mycroft Precise in `whisper_cli` environment
- Test "hey mycroft" wake word detection on server
- Integrate wake word into recording loop
- Add command processing and Home Assistant integration
Long Term (Future)
- Explore edge wake word detection (Precise on K210)
- Multi-device deployment
- Continuous listening mode
- Voice profiles and speaker identification
🐛 Known Issues
Recording Duration
- Issue: Recording is ~0.9 seconds instead of 1.0 seconds
- Cause: `16000 / 2048 = 7.8125`, and integer division (`//`) rounds this down to 7 frames
- Impact: Minor - transcription still works
- Fix: Increase `frame_cnt` to 8 or adjust chunk size
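The shortfall and the proposed fix can be checked with a few lines of arithmetic. A sketch comparing floor division with ceiling division for the frame count; the function names are illustrative:

```python
def frames_floor(seconds, sample_rate, chunk_size):
    """Frame count as currently computed: rounds down, so the recording can run short."""
    return seconds * sample_rate // chunk_size


def frames_ceil(seconds, sample_rate, chunk_size):
    """Frame count rounded up, so the recording is never shorter than requested."""
    return (seconds * sample_rate + chunk_size - 1) // chunk_size


def duration_s(frame_cnt, sample_rate, chunk_size):
    """Actual recorded time for a given number of frames."""
    return frame_cnt * chunk_size / sample_rate


short = frames_floor(1, 16000, 2048)   # 7 frames
full = frames_ceil(1, 16000, 2048)     # 8 frames
print(short, duration_s(short, 16000, 2048))
print(full, duration_s(full, 16000, 2048))
```

Rounding up gives 8 frames and 1.024 seconds, slightly more than requested. A chunk size that divides the sample count evenly (2000 samples, for example) would give exactly 1 second, provided the I2S driver accepts that size, which has not been tested here.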
Data Format Mismatch
- Issue: Hardware returns 4x expected data (128KB vs 32KB)
- Cause: I2S outputting 32-bit stereo despite 16-bit mono config
- Impact: None - conversion function handles it
- Status: Working as intended
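One way such a conversion can work is sketched below. The byte layout is an assumption, since the report does not describe it: each frame is taken to be two little-endian 32-bit words (left, then right), the first channel is kept, and the upper 16 bits of that word become the output sample. `stereo32_to_mono16` is a hypothetical name and not the project's `convert_to_mono_16bit()`.

```python
def stereo32_to_mono16(raw):
    """Keep the first channel of 32-bit stereo frames and reduce it to 16-bit samples."""
    frame_bytes = 8                      # 2 channels x 4 bytes (assumed layout)
    frames = len(raw) // frame_bytes     # any trailing partial frame is ignored
    out = bytearray(frames * 2)
    for n in range(frames):
        base = n * frame_bytes
        # Bytes 2 and 3 of a little-endian 32-bit word are its upper 16 bits.
        out[2 * n] = raw[base + 2]
        out[2 * n + 1] = raw[base + 3]
    return bytes(out)


one_second = bytes(128000)               # silent placeholder for 1 s of hardware output
print(len(stereo32_to_mono16(one_second)))
```

Preallocating the output `bytearray` avoids growing a buffer sample by sample, which matters on a device with as little free RAM as the K210.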
Syntax Error Sensitivity
- Issue: Some valid Python causes "invalid syntax" in MicroPython
- Patterns: Import statements mid-function, certain arithmetic expressions
- Workaround: Simplify code, avoid complex expressions
- Status: Documented in MICROPYTHON_QUIRKS.md
💡 Key Learnings
I2S Recording Pattern
The correct pattern for MaixPy I2S recording:
    chunk_size = 2048
    frame_cnt = seconds * sample_rate // chunk_size
    data = []  # collected chunks, as bytes
    for i in range(frame_cnt):
        audio_chunk = i2s_dev.record(chunk_size)
        i2s_dev.wait_record()  # BLOCKS until recording complete
        data.append(audio_chunk.to_bytes())
Critical: wait_record() is REQUIRED or recording returns immediately!
Memory Management
K210 has very limited RAM. Successful strategies:
- Work in small chunks (512-2048 bytes)
- Stream data instead of buffering
- Free variables explicitly when done
- Avoid creating large intermediate buffers
MicroPython Compatibility
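MicroPython, at least as shipped in this MaixPy firmware, is NOT Python. Many standard features were missing or unreliable during this session:
- F-strings, ternary operators, keyword arguments on some built-in methods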
- Some string methods, complex expressions
- Standard libraries (urequests, json parsing)
Rule: Test incrementally, simplify everything, check quirks doc.
📚 Resources Used
Documentation
Code Examples
Tools
- MaixPy IDE (copy/paste to board)
- Serial monitor (debugging)
- Heimdall server (Whisper transcription)
🔄 Ready for Next Session
Current State
- ✅ Code is working and stable
- ✅ Can record, compress, transmit, transcribe, display
- ✅ Button loop allows repeated testing
- ⚠️ Recording duration slightly short (~0.9s)
Files Ready
- /Library/Development/devl/Devops/projects/mycroft-precise/maixduino-scripts/maix_simple_record_test.py
- /Library/Development/devl/Devops/projects/mycroft-precise/maixduino-scripts/MICROPYTHON_QUIRKS.md
- /devl/voice-assistant/simple_transcribe_server.py
For Serial Access Session
- Connect Maixduino via USB to Linux laptop
- Install pyserial: `pip install pyserial`
- Find device: `ls /dev/ttyUSB*` or `/dev/ttyACM*`
- Connect: `screen /dev/ttyUSB0 115200` or use MaixPy IDE
- Can directly modify code, test immediately, see serial output
Quick Test Commands
# Test WiFi
from network import ESP32_SPI
# ... (full init code in maix_test_simple.py)
# Test I2S
from Maix import I2S
rx = I2S(I2S.DEVICE_0)
# ...
# Test recording
audio = rx.record(2048)
rx.wait_record()
print(len(audio.to_bytes()))
🎊 Success Metrics
Today we achieved:
- ✅ WiFi connection working
- ✅ Audio recording working (with proper blocking)
- ✅ Format conversion working (4x reduction)
- ✅ Compression working (2x reduction)
- ✅ Network transmission working (chunked streaming)
- ✅ Server transcription working
- ✅ Display output working
- ✅ Button loop working
- ✅ End-to-end pipeline complete!
Total: 9/9 core features working! 🚀
Minor tuning needed, but the foundation is solid and ready for wake word integration.
Session Summary: Massive progress! From zero to working audio transcription pipeline in one session. Overcame significant MicroPython compatibility challenges and memory limitations. Ready for next phase: wake word detection.
Status: ✅ Ready for Linux serial access and fine-tuning
Next Session: Tune recording duration, then integrate Mycroft Precise wake word detection
End of Session Report - 2025-12-03