# Maixduino Voice Assistant - Session Progress

**Date:** 2025-12-03
**Session Duration:** ~4 hours
**Goal:** Get audio recording and transcription working on Maixduino → Heimdall server

---

## 🎉 Major Achievements

### ✅ Full Audio Pipeline Working!
We successfully built and tested the complete audio capture → compression → transmission → transcription pipeline:

1. **WiFi Connection** - Maixduino connects to network (10.1.10.98)
2. **Audio Recording** - I2S microphone captures audio (MSM261S4030H0 MEMS mic)
3. **Format Conversion** - Converts 32-bit stereo to 16-bit mono (4x size reduction)
4. **μ-law Compression** - Compresses 16-bit PCM to 8-bit μ-law (50% smaller)
5. **HTTP Transmission** - Sends compressed WAV to Heimdall server
6. **Whisper Transcription** - Server transcribes and returns text
7. **LCD Display** - Shows transcription on Maixduino screen
8. **Button Loop** - Press BOOT button for repeated recordings

**Total size reduction (1 second of audio):** 128KB → 32KB (mono) → 16KB (compressed) = **87.5% reduction!**

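
The size figures follow directly from sample rate, sample width, and channel count. A quick check, assuming "KB" here means 1000 bytes and a clip of exactly one second (`bytes_per_second` is a helper name made up for this sketch):

```python
SAMPLE_RATE = 16000  # Hz, the recording rate used in this session
SECONDS = 1

def bytes_per_second(bits_per_sample, channels, sample_rate=SAMPLE_RATE):
    """Raw audio data rate in bytes per second."""
    return sample_rate * channels * bits_per_sample // 8

raw = bytes_per_second(32, 2) * SECONDS    # what the I2S hardware returns
mono = bytes_per_second(16, 1) * SECONDS   # after conversion to 16-bit mono
ulaw = bytes_per_second(8, 1) * SECONDS    # after μ-law compression
reduction = 100.0 * (raw - ulaw) / raw     # percent saved end to end

print(raw, mono, ulaw, reduction)
```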

---

## 🔧 Technical Accomplishments

### Audio Recording Pipeline
- **Initial Problem:** `i2s_dev.record()` returned immediately (1ms instead of 1000ms)
- **Root Cause:** Recording API is asynchronous/non-blocking
- **Solution:** Use chunked recording with `wait_record()` blocking calls
- **Pattern:**
  ```python
  chunks = []
  for i in range(frame_cnt):
      audio_chunk = i2s_dev.record(chunk_size)  # starts a non-blocking capture
      i2s_dev.wait_record()  # CRITICAL: blocks until complete
      chunks.append(audio_chunk.to_bytes())
  ```

### Memory Management
- **K210 has very limited RAM** (~6MB total, much less available)
- Successfully handled 128KB → 16KB data transformation without OOM errors
- Techniques used:
  - Record in small chunks (2048 samples)
  - Stream HTTP transmission (512-byte chunks with delays)
  - In-place data conversion where possible
  - Explicit garbage collection hints (`audio_data = None`)

### Network Communication
- **Raw socket HTTP** (no urequests library available)
- **Chunked streaming** with flow control (10ms delays)
- **Simple WAV format** with μ-law compression (format code 7)
- **Robust error handling** with serial output debugging

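
A minimal sketch of the raw-socket upload idea: build the POST header by hand, then push the payload in 512-byte slices with a 10 ms pause after each. This is not the session's `send_to_server()`; the helper names, the `audio/wav` content type, and the assumption that `sock.send()` returns the number of bytes written are all illustrative. It is written to run on desktop Python, so a stand-in socket class replaces the real connection, and the `\r\n` escapes may need different handling on the board given the escape-sequence quirk noted in this report.

```python
import time

CHUNK_SIZE = 512        # bytes per send, from the notes above
CHUNK_DELAY_S = 0.01    # 10 ms pause between sends

def build_post_header(host, port, path, body_len):
    """Build a minimal HTTP/1.1 POST header using plain concatenation."""
    header = "POST " + path + " HTTP/1.1\r\n"
    header = header + "Host: " + host + ":" + str(port) + "\r\n"
    header = header + "Content-Type: audio/wav\r\n"  # assumed content type
    header = header + "Content-Length: " + str(body_len) + "\r\n"
    header = header + "Connection: close\r\n\r\n"
    return header.encode("utf-8")

def send_in_chunks(sock, payload, chunk_size=CHUNK_SIZE, delay=CHUNK_DELAY_S):
    """Send payload in small slices, pausing after each. Returns bytes sent."""
    sent = 0
    while sent < len(payload):
        n = sock.send(payload[sent:sent + chunk_size])
        sent = sent + n
        time.sleep(delay)
    return sent

class RecordingSocket:
    """Stand-in for a real socket so the sketch runs off-device."""
    def __init__(self):
        self.sends = []

    def send(self, data):
        self.sends.append(bytes(data))
        return len(data)

wav = bytes(1300)  # placeholder body; a real call would pass WAV header + audio
sock = RecordingSocket()
header = build_post_header("10.1.10.71", 3006, "/transcribe", len(wav))
send_in_chunks(sock, header)
body_sent = send_in_chunks(sock, wav)
print(len(sock.sends), body_sent)
```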

---

## 🐛 MicroPython/MaixPy Quirks Discovered

### String Operations
- ❌ **F-strings NOT supported** - Must use `"text " + str(var)` concatenation
- ❌ **Ternary operators fail** - Use explicit `if/else` blocks instead
- ❌ **`split()` needs explicit delimiter** - `text.split(" ")` not `text.split()`
- ❌ **Escape sequences problematic** - Avoid `\n` in strings, causes syntax errors

### Data Types & Methods
- ❌ **`decode()` doesn't accept kwargs** - Use `decode('utf-8')` not `decode('utf-8', errors='ignore')`
- ❌ **RGB tuples not accepted** - Must convert to packed integers: `(r << 16) | (g << 8) | b`
- ❌ **Bytearray item deletion unsupported** - `del arr[n:]` fails, use slicing instead
- ❌ **Arithmetic inside string concatenation fails** - Compute first, then concatenate: `next_count = count + 1; "text" + str(next_count)`

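
The workarounds above are small enough to wrap in helpers. A sketch, with function names invented for illustration; the packing formula is the one from the notes, and whether a given display call expects this exact layout is not covered here:

```python
def pack_rgb(r, g, b):
    """Pack three 8-bit channels into one integer laid out as 0xRRGGBB."""
    return (r << 16) | (g << 8) | b

def truncate(buf, n):
    """Keep the first n bytes by slicing, instead of `del buf[n:]`."""
    return buf[:n]

def numbered_label(prefix, count):
    """Do the arithmetic first, then concatenate (no f-string)."""
    next_count = count + 1
    return prefix + str(next_count)

orange = pack_rgb(255, 128, 0)
short = truncate(bytearray(b"abcdef"), 3)
label = numbered_label("Recording #", 2)
print(hex(orange), short, label)
```

Slicing allocates a new buffer, so on the K210 it is worth dropping the reference to the old one once the copy exists.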
### I2S Audio Specific
- ❌ **`record()` is non-blocking** - Returns immediately, must use `wait_record()`
- ❌ **Audio object not directly iterable** - Must call `.to_bytes()` first
- ⚠️ **Data format mismatch** - Hardware returns 32-bit stereo even when configured for 16-bit mono (4x expected size)

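
One way the 32-bit stereo to 16-bit mono step could look. This is a sketch, not the session's `convert_to_mono_16bit()`: it assumes little-endian data, interleaved frames of two signed 32-bit samples, that the first channel is the one worth keeping, and that the useful audio sits in the upper 16 bits of each sample. Any of those may differ on real hardware, so check against a known signal first. `stereo32_to_mono16` is a made-up name.

```python
import struct

def stereo32_to_mono16(raw):
    """Keep the first channel of each 8-byte frame, reduced to its top 16 bits."""
    frames = len(raw) // 8          # 2 channels x 4 bytes per frame
    out = bytearray(frames * 2)     # 1 channel x 2 bytes per frame
    for i in range(frames):
        base = i * 8
        # Little-endian: bytes 2 and 3 of the first sample hold its top 16 bits.
        out[2 * i] = raw[base + 2]
        out[2 * i + 1] = raw[base + 3]
    return out

# Two frames of (first, second) channel samples; the second channel is dropped.
raw = struct.pack("<iiii", 0x12345678, 0, -65536, 0)
mono = stereo32_to_mono16(raw)
samples = struct.unpack("<hh", bytes(mono))
print(len(raw), len(mono), samples)
```

Copying bytes rather than unpacking and repacking integers keeps the loop cheap and writes into a single preallocated buffer, which suits the memory limits described in this report.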
### Network/WiFi
- ❌ **`network.WLAN` not available** - Must use `network.ESP32_SPI` with full pin config
- ❌ **`active()` method doesn't exist** - Just call `connect()` directly
- ⚠️ **Requires ALL 6 pins configured** - CS, RST, RDY, MOSI, MISO, SCLK

### General Syntax
- ⚠️ **`if __name__ == "__main__"` sometimes causes syntax errors** - Safer to just call `main()` directly
- ⚠️ **Import statements mid-function can cause syntax errors** - Keep imports at top of file
- ⚠️ **Some valid Python causes "invalid syntax" for unknown reasons** - Simplify complex expressions

---

## 📊 Current Status

### ✅ Working
- WiFi connectivity (ESP32 SPI)
- I2S audio initialization
- Chunked audio recording with `wait_record()`
- Audio format detection and conversion (32-bit stereo → 16-bit mono)
- μ-law compression (50% size reduction)
- HTTP transmission to server (chunked streaming)
- Whisper transcription (server-side)
- JSON response parsing
- LCD display (with word wrapping)
- Button-triggered recording loop
- Countdown timer before recording

### ⚠️ Partially Working
- **Recording duration** - Currently getting ~0.9 seconds instead of full 1 second
  - Formula: `frame_cnt = seconds * sample_rate // chunk_size`
  - Current: `7 frames × (2048/16000) = 0.896s`
  - May need to increase `frame_cnt` or adjust chunk size

### ❌ Not Yet Implemented
- Mycroft Precise wake word detection
- Full voice assistant loop
- Command processing
- Home Assistant integration
- Multi-second recording support
- Real-time audio streaming

---

## 🔬 Technical Details

### Hardware Configuration

**Maixduino Board:**
- Processor: K210 dual-core RISC-V @ 400MHz
- RAM: ~6MB total (limited available memory)
- WiFi: ESP32 module via SPI
- Microphone: MSM261S4030H0 MEMS (onboard)
- IP Address: 10.1.10.98

**I2S Pins:**
- Pin 20: I2S0_IN_D0 (data)
- Pin 19: I2S0_WS (word select)
- Pin 18: I2S0_SCLK (clock)

**ESP32 SPI Pins:**
- Pin 25: CS (chip select)
- Pin 8: RST (reset)
- Pin 9: RDY (ready)
- Pin 28: MOSI (master out)
- Pin 26: MISO (master in)
- Pin 27: SCLK (clock)

**GPIO:**
- Pin 16: BOOT button (active low, pull-up)

### Server Configuration

**Heimdall Server:**
- IP: 10.1.10.71
- Port: 3006
- Framework: Flask
- Model: Whisper base
- Environment: Conda `whisper_cli`

**Endpoints:**
- `/health` - Health check
- `/transcribe` - POST audio for transcription

### Audio Format

**Recording:**
- Sample Rate: 16kHz
- Hardware Output: 32-bit stereo (128KB for 1 second)
- After Conversion: 16-bit mono (32KB for 1 second)
- After Compression: 8-bit μ-law (16KB for 1 second)

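
The last step above (16-bit PCM to 8-bit μ-law) can be sketched with the common G.711-style bit-twiddling encoder. This is an illustration under assumptions, not the session's `compress_ulaw()`: the input is taken to be little-endian signed 16-bit mono, and `linear_to_ulaw` / `ulaw_encode` are names chosen for this example. It avoids ternaries to stay close to what ran on the board.

```python
import struct

ULAW_BIAS = 0x84    # 132, added before finding the segment
ULAW_CLIP = 32635   # largest magnitude that still fits after the bias

def linear_to_ulaw(sample):
    """Encode one signed 16-bit sample as one μ-law byte."""
    sign = 0
    if sample < 0:
        sign = 0x80
        sample = -sample
    if sample > ULAW_CLIP:
        sample = ULAW_CLIP
    sample = sample + ULAW_BIAS
    # Find the segment: position of the highest set bit, counted from bit 14.
    exponent = 7
    mask = 0x4000
    while exponent > 0 and (sample & mask) == 0:
        exponent -= 1
        mask >>= 1
    mantissa = (sample >> (exponent + 3)) & 0x0F
    # μ-law bytes are stored bit-inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF

def ulaw_encode(pcm16):
    """Encode little-endian 16-bit PCM bytes to μ-law, one byte per sample."""
    count = len(pcm16) // 2
    out = bytearray(count)
    for i in range(count):
        out[i] = linear_to_ulaw(struct.unpack_from("<h", pcm16, i * 2)[0])
    return out

pcm = struct.pack("<hhh", 0, 32767, -32768)
encoded = ulaw_encode(pcm)
print(len(pcm), len(encoded), list(encoded))
```

The output is exactly half the input size, which is where the 32KB → 16KB figure comes from.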
**WAV Header:**
- Format Code: 7 (μ-law)
- Channels: 1 (mono)
- Sample Rate: 16000 Hz
- Bits per Sample: 8
- Includes `fact` chunk (required for μ-law)

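
A sketch of a header matching the fields above, built with `struct`. It is not the session's `create_wav_header()`; `ulaw_wav_header` is a made-up name, and the layout assumed is the usual non-PCM one: an 18-byte `fmt ` chunk (with a zero extension size), a 4-byte `fact` chunk holding the per-channel sample count, then the `data` chunk header.

```python
import struct

WAVE_FORMAT_MULAW = 7

def ulaw_wav_header(num_samples, sample_rate=16000, channels=1):
    """Return the bytes that precede the μ-law sample data in a WAV file."""
    data_len = num_samples * channels            # one byte per sample
    fmt = struct.pack(
        "<HHIIHHH",
        WAVE_FORMAT_MULAW,
        channels,
        sample_rate,
        sample_rate * channels,                  # byte rate at 8 bits/sample
        channels,                                # block align
        8,                                       # bits per sample
        0,                                       # extension size
    )
    fact = struct.pack("<I", num_samples)
    riff_len = 4 + (8 + len(fmt)) + (8 + len(fact)) + (8 + data_len)
    return (
        b"RIFF" + struct.pack("<I", riff_len) + b"WAVE"
        + b"fmt " + struct.pack("<I", len(fmt)) + fmt
        + b"fact" + struct.pack("<I", len(fact)) + fact
        + b"data" + struct.pack("<I", data_len)
    )

header = ulaw_wav_header(16000)   # one second at 16 kHz
print(len(header))
```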

---

## 📝 Code Files

### Main Script
**File:** `/Library/Development/devl/Devops/projects/mycroft-precise/maixduino-scripts/maix_simple_record_test.py`

**Key Functions:**
- `init_wifi()` - ESP32 SPI WiFi connection
- `init_audio()` - I2S microphone setup
- `record_audio()` - Chunked recording with `wait_record()`
- `convert_to_mono_16bit()` - Format conversion (32-bit stereo → 16-bit mono)
- `compress_ulaw()` - μ-law compression
- `create_wav_header()` - WAV file header generation
- `send_to_server()` - HTTP POST with chunked streaming
- `display_transcription()` - LCD output with word wrapping
- `main()` - Button loop for repeated recordings

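
As an illustration of the word-wrapping part of `display_transcription()`, here is a greedy wrapper that sticks to the constructs that worked on the board (explicit `split(" ")`, plain concatenation, no ternaries). The function name and the `max_chars` parameter are assumptions for this sketch; the real limit depends on the LCD font and width.

```python
def wrap_words(text, max_chars):
    """Greedy word wrap: returns a list of lines no longer than max_chars,
    except that a single over-long word is left whole on its own line."""
    lines = []
    current = ""
    for word in text.split(" "):
        if word == "":
            continue  # skip empty pieces produced by repeated spaces
        if current == "":
            current = word
        elif len(current) + 1 + len(word) <= max_chars:
            current = current + " " + word
        else:
            lines.append(current)
            current = word
    if current != "":
        lines.append(current)
    return lines

wrapped = wrap_words("the quick brown fox jumps", 10)
print(wrapped)
```

Each returned line can then be drawn at an increasing y offset on the display.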
### Server Script
**File:** `/devl/voice-assistant/simple_transcribe_server.py`

**Features:**
- Accepts raw WAV or multipart uploads
- Whisper base model transcription
- JSON response with transcription text
- Handles μ-law compressed audio

### Documentation
**File:** `/Library/Development/devl/Devops/projects/mycroft-precise/maixduino-scripts/MICROPYTHON_QUIRKS.md`

Complete reference of all MicroPython compatibility issues discovered during development.

---

## 🎯 Next Steps

### Immediate (Tonight)
1. ✅ Switch to Linux laptop with direct serial access
2. ⏭️ Tune recording duration to get full 1 second
   - Try `frame_cnt = 8` instead of 7
   - Or adjust chunk size to get exact timing
3. ⏭️ Test transcription quality with proper-length recordings

### Short Term (This Week)
1. Increase recording duration to 2-3 seconds for better transcription
2. Test memory limits with longer recordings
3. Optimize compression/transmission for speed
4. Add visual feedback during transmission

### Medium Term (Next Week)
1. Install Mycroft Precise in `whisper_cli` environment
2. Test "hey mycroft" wake word detection on server
3. Integrate wake word into recording loop
4. Add command processing and Home Assistant integration

### Long Term (Future)
1. Explore edge wake word detection (Precise on K210)
2. Multi-device deployment
3. Continuous listening mode
4. Voice profiles and speaker identification

---

## 🐛 Known Issues

### Recording Duration
- **Issue:** Recording is ~0.9 seconds instead of 1.0 seconds
- **Cause:** `16000 / 2048 ≈ 7.8`, which integer division (`//`) truncates to 7 frames
- **Impact:** Minor - transcription still works
- **Fix:** Increase `frame_cnt` to 8 or adjust chunk size

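
One possible fix is to round the frame count up instead of down. A sketch using ceiling division (`frames_needed` is a hypothetical helper, not a function from the script):

```python
def frames_needed(seconds, sample_rate, chunk_size):
    """Smallest number of chunks that covers the requested duration."""
    total_samples = seconds * sample_rate
    return (total_samples + chunk_size - 1) // chunk_size

floor_frames = 1 * 16000 // 2048               # what the current formula gives
ceil_frames = frames_needed(1, 16000, 2048)    # rounded up instead

floor_seconds = floor_frames * 2048 / 16000
ceil_seconds = ceil_frames * 2048 / 16000
print(floor_frames, floor_seconds, ceil_frames, ceil_seconds)
```

Rounding up overshoots slightly (about 1.02 s for these values), so the extra samples can be trimmed after recording if an exact length matters. Picking a chunk size that divides the sample count evenly avoids the issue altogether.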
### Data Format Mismatch
- **Issue:** Hardware returns 4x expected data (128KB vs 32KB)
- **Cause:** I2S outputting 32-bit stereo despite 16-bit mono config
- **Impact:** None - conversion function handles it
- **Status:** Working as intended

### Syntax Error Sensitivity
- **Issue:** Some valid Python causes "invalid syntax" in MicroPython
- **Patterns:** Import statements mid-function, certain arithmetic expressions
- **Workaround:** Simplify code, avoid complex expressions
- **Status:** Documented in MICROPYTHON_QUIRKS.md

---

## 💡 Key Learnings

### I2S Recording Pattern
The correct pattern for MaixPy I2S recording:
```python
chunk_size = 2048
frame_cnt = seconds * sample_rate // chunk_size  # floor division drops any partial frame

data = []
for i in range(frame_cnt):
    audio_chunk = i2s_dev.record(chunk_size)
    i2s_dev.wait_record()  # BLOCKS until recording complete
    data.append(audio_chunk.to_bytes())
```

**Critical:** `wait_record()` is REQUIRED or recording returns immediately!

### Memory Management
K210 has very limited RAM. Successful strategies:
- Work in small chunks (2048-sample recordings, 512-byte network sends)
- Stream data instead of buffering
- Free variables explicitly when done
- Avoid creating large intermediate buffers

### MicroPython Compatibility
MicroPython (as shipped in MaixPy) is NOT full Python. Many familiar features were missing or unreliable in this session:
- F-strings, ternary operators, keyword arguments on some built-in methods (e.g. `decode()`)
- Some string methods, complex expressions
- Common helper libraries (urequests, json parsing)

**Rule:** Test incrementally, simplify everything, check quirks doc.

---

## 📚 Resources Used

### Documentation
- [MaixPy I2S API Reference](https://wiki.sipeed.com/soft/maixpy/en/api_reference/Maix/i2s.html)
- [MaixPy I2S Usage Guide](https://wiki.sipeed.com/soft/maixpy/en/modules/on_chip/i2s.html)
- [Maixduino Hardware Wiki](https://wiki.sipeed.com/hardware/en/maix/maixpy_develop_kit_board/maix_duino.html)

### Code Examples
- [Official record_wav.py](https://github.com/sipeed/MaixPy-v1_scripts/blob/master/multimedia/audio/record_wav.py)
- [MaixPy Scripts Repository](https://github.com/sipeed/MaixPy-v1_scripts)

### Tools
- MaixPy IDE (copy/paste to board)
- Serial monitor (debugging)
- Heimdall server (Whisper transcription)

---

## 🔄 Ready for Next Session

### Current State
- ✅ Code is working and stable
- ✅ Can record, compress, transmit, transcribe, display
- ✅ Button loop allows repeated testing
- ⚠️ Recording duration slightly short (~0.9s)

### Files Ready
- `/Library/Development/devl/Devops/projects/mycroft-precise/maixduino-scripts/maix_simple_record_test.py`
- `/Library/Development/devl/Devops/projects/mycroft-precise/maixduino-scripts/MICROPYTHON_QUIRKS.md`
- `/devl/voice-assistant/simple_transcribe_server.py`

### For Serial Access Session
1. Connect Maixduino via USB to Linux laptop
2. Install pyserial: `pip install pyserial`
3. Find device: `ls /dev/ttyUSB*` or `ls /dev/ttyACM*`
4. Connect: `screen /dev/ttyUSB0 115200` or use MaixPy IDE
5. Modify code directly, test immediately, and watch the serial output

### Quick Test Commands
```python
# Test WiFi
from network import ESP32_SPI
# ... (full init code in maix_test_simple.py)

# Test I2S
from Maix import I2S
rx = I2S(I2S.DEVICE_0)
# ...

# Test recording
audio = rx.record(2048)
rx.wait_record()
print(len(audio.to_bytes()))
```

---

## 🎊 Success Metrics

Today we achieved:
- ✅ WiFi connection working
- ✅ Audio recording working (with proper blocking)
- ✅ Format conversion working (4x reduction)
- ✅ Compression working (2x reduction)
- ✅ Network transmission working (chunked streaming)
- ✅ Server transcription working
- ✅ Display output working
- ✅ Button loop working
- ✅ End-to-end pipeline complete!

**Total:** 9/9 core features working! 🚀

Minor tuning needed, but the foundation is solid and ready for wake word integration.

---

**Session Summary:** Massive progress! From zero to working audio transcription pipeline in one session. Overcame significant MicroPython compatibility challenges and memory limitations. Ready for next phase: wake word detection.

**Status:** ✅ Ready for Linux serial access and fine-tuning
**Next Session:** Tune recording duration, then integrate Mycroft Precise wake word detection

---

*End of Session Report - 2025-12-03*