minerva/docs/ESP32_S3_VOICE_ASSISTANT_SPEC.md
pyr0ball 173f7f37d4 feat: import mycroft-precise work as Minerva foundation
Ports prior voice assistant research and prototypes from devl/Devops
into the Minerva repo. Includes:

- docs/: architecture, wake word guides, ESP32-S3 spec, hardware buying guide
- scripts/: voice_server.py, voice_server_enhanced.py, setup scripts
- hardware/maixduino/: edge device scripts with WiFi credentials scrubbed
  (replaced hardcoded password with secrets.py pattern)
- config/.env.example: server config template
- .gitignore: excludes .env, secrets.py, model blobs, ELF firmware
- CLAUDE.md: Minerva product context and connection to cf-voice roadmap
2026-04-06 22:21:12 -07:00


ESP32-S3-Touch-LCD Voice Assistant - Technical Specification

Date: 2026-01-01
Hardware: Waveshare ESP32-S3-Touch-LCD-1.69
Display: 240×280 ST7789V2 with Capacitive Touch
Framework: ESP-IDF v5.3.1+ with LVGL 8.4.0+
Purpose: Voice assistant endpoint with real-time audio waveform visualization


Overview

Voice assistant client for ESP32-S3 with integrated LVGL-based visual feedback showing:

  • Real-time audio waveform during listening
  • Wake word detection animation
  • Processing/thinking state
  • Response state with audio output visualization
  • Touch controls for volume, sensitivity, settings

Architecture:

┌─────────────────────────────────┐
│  ESP32-S3-Touch-LCD-1.69        │
│                                 │
│  ┌──────────────────────────┐  │
│  │   LVGL UI (240×280)      │  │
│  │   - Waveform Canvas      │  │
│  │   - State Indicators     │  │──┐
│  │   - Touch Controls       │  │  │
│  └──────────────────────────┘  │  │
│                                 │  │
│  ┌──────────────────────────┐  │  │ WiFi
│  │   Audio Pipeline         │  │  │ Audio Stream
│  │   - I2S Mic Input        │  │  │
│  │   - I2S Speaker Output   │  │──┤
│  │   - Buffer Management    │  │  │
│  └──────────────────────────┘  │  │
│                                 │  │
│  ┌──────────────────────────┐  │  │
│  │   State Machine          │  │  │
│  │   - Idle → Listening     │  │  │
│  │   - Processing → Speaking│  │──┘
│  └──────────────────────────┘  │
└─────────────────────────────────┘
         │
         │ TCP/HTTP
         ↓
┌─────────────────────────────────┐
│  Heimdall Voice Server          │
│  (10.1.10.71:3006)              │
│                                 │
│  - Mycroft Precise Wake Word    │
│  - Whisper STT                  │
│  - Home Assistant Integration   │
│  - Piper TTS                    │
└─────────────────────────────────┘

Visual States & UI Design

State Machine

        ┌─────────┐
        │  IDLE   │ ◄──────────────┐
        └────┬────┘                │
             │                     │
    Wake Word Detected             │
             │                     │
             ↓                     │
      ┌──────────┐                │
      │LISTENING │                │
      └────┬─────┘                │
           │                      │
   End of Speech                  │
           │                      │
           ↓                      │
    ┌───────────┐                │
    │PROCESSING │                │
    └─────┬─────┘                │
          │                      │
    Response Ready               │
          │                      │
          ↓                      │
    ┌──────────┐                │
    │ SPEAKING │ ───────────────┘
    └──────────┘
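
The transitions in the diagram can be sketched as a small pure function, which makes them easy to check on a host before wiring up any tasks. This is a minimal sketch, not the project's implementation: the state names follow the diagram, while the event names and `va_next_state` are hypothetical. The cancel event is an assumption based on the tap-to-cancel gestures described later in this spec.

```c
// State names follow the diagram; event names are hypothetical.
typedef enum {
    STATE_IDLE,
    STATE_LISTENING,
    STATE_PROCESSING,
    STATE_SPEAKING
} va_state_t;

typedef enum {
    EVT_WAKE_WORD,       // wake word detected
    EVT_END_OF_SPEECH,   // end of speech
    EVT_RESPONSE_READY,  // response ready to play
    EVT_PLAYBACK_DONE,   // speaking finished
    EVT_CANCEL           // user tapped to cancel or skip (assumption)
} va_event_t;

// Returns the next state. Events that are not valid in the current
// state leave the state unchanged.
va_state_t va_next_state(va_state_t state, va_event_t event) {
    if (event == EVT_CANCEL) {
        return STATE_IDLE;
    }
    switch (state) {
    case STATE_IDLE:
        return (event == EVT_WAKE_WORD) ? STATE_LISTENING : state;
    case STATE_LISTENING:
        return (event == EVT_END_OF_SPEECH) ? STATE_PROCESSING : state;
    case STATE_PROCESSING:
        return (event == EVT_RESPONSE_READY) ? STATE_SPEAKING : state;
    case STATE_SPEAKING:
        return (event == EVT_PLAYBACK_DONE) ? STATE_IDLE : state;
    }
    return state;
}
```

Keeping the transition logic free of side effects means the task that owns the state only has to apply the result and switch screens.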

Visual Feedback Per State

1. IDLE State

Display:

  • Subtle pulsing ring animation (like Google Home)
  • Time display from RTC
  • Status icons (WiFi strength, battery level)
  • Dim backlight (30-50%)

Colors:

  • Background: Dark blue (#001F3F)
  • Pulse ring: Cyan (#00BFFF)
  • Text: White (#FFFFFF)

LVGL Widgets:

lv_obj_t *idle_screen;
lv_obj_t *pulse_ring;      // Arc widget, animated rotation
lv_obj_t *time_label;      // Label with RTC time
lv_obj_t *status_bar;      // Container for icons

Animation:

  • Slow pulse: 2-second breathing cycle
  • Rotation: 360° over 10 seconds
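
The two idle animation timings can be expressed as plain functions of elapsed time. The sketch below is only an illustration of the mapping: `idle_breath_opa` and `idle_ring_angle` are hypothetical helpers, and the triangle-wave shape of the breathing curve is an assumption. On the device, LVGL's own animation engine would normally drive these values.

```c
#include <stdint.h>

#define BREATH_PERIOD_MS   2000u   // 2-second breathing cycle
#define ROTATE_PERIOD_MS  10000u   // one full turn every 10 seconds

// Triangle-wave opacity: 0 at the start of a cycle, 255 at the midpoint,
// back to 0 at the end.
uint8_t idle_breath_opa(uint32_t t_ms) {
    uint32_t phase = t_ms % BREATH_PERIOD_MS;
    uint32_t half = BREATH_PERIOD_MS / 2u;
    uint32_t ramp = (phase < half) ? phase : (BREATH_PERIOD_MS - phase);
    return (uint8_t)((ramp * 255u) / half);
}

// Ring rotation in whole degrees (0 to 359).
uint16_t idle_ring_angle(uint32_t t_ms) {
    return (uint16_t)(((t_ms % ROTATE_PERIOD_MS) * 360u) / ROTATE_PERIOD_MS);
}
```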

2. LISTENING State

Display:

  • Real-time audio waveform visualization
  • Bright backlight (100%)
  • "Listening..." text
  • Cancel button (touch)

Waveform Visualization:

Option A: Canvas-Based Waveform (Recommended)

  • Use LVGL lv_canvas for custom drawing
  • Draw waveform from audio buffer samples
  • Scrolling waveform (left-to-right)
  • Update rate: 30-60 FPS

Option B: Bar Chart Spectrum

  • Use lv_chart with bar type
  • FFT-based spectrum analyzer
  • 8-16 bars for frequency bins
  • Update rate: 15-30 FPS

Colors:

  • Background: Dark gray (#1A1A1A)
  • Waveform: Green (#00FF00)
  • Peak indicators: Yellow (#FFFF00)
  • Clipping: Red (#FF0000)
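
One way to choose between the waveform, peak, and clipping colors is to classify each audio frame by its peak magnitude. The sketch below assumes 16-bit signed PCM. The two thresholds and the names `classify_frame` and `level_color_hex` are illustrative assumptions and do not come from the spec.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { LEVEL_NORMAL, LEVEL_PEAK, LEVEL_CLIPPING } level_t;

// Thresholds are assumptions chosen for illustration.
#define PEAK_THRESHOLD      24576   // about 75% of full scale
#define CLIPPING_THRESHOLD  32700   // within a few counts of int16 full scale

// Largest absolute sample value in the frame.
static int32_t frame_peak(const int16_t *samples, size_t count) {
    int32_t peak = 0;
    for (size_t i = 0; i < count; i++) {
        int32_t v = samples[i];      // widen first so -32768 negates safely
        if (v < 0) v = -v;
        if (v > peak) peak = v;
    }
    return peak;
}

level_t classify_frame(const int16_t *samples, size_t count) {
    int32_t peak = frame_peak(samples, count);
    if (peak >= CLIPPING_THRESHOLD) return LEVEL_CLIPPING;
    if (peak >= PEAK_THRESHOLD) return LEVEL_PEAK;
    return LEVEL_NORMAL;
}

// Maps a level to the hex colors listed above.
uint32_t level_color_hex(level_t level) {
    switch (level) {
    case LEVEL_CLIPPING: return 0xFF0000;  // Red
    case LEVEL_PEAK:     return 0xFFFF00;  // Yellow
    default:             return 0x00FF00;  // Green
    }
}
```

The result of `level_color_hex` can be passed to `lv_color_hex()` when setting up the line descriptor.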

LVGL Implementation:

// Canvas-based waveform
lv_obj_t *listening_screen;
lv_obj_t *waveform_canvas;    // 240×180 canvas
lv_obj_t *listening_label;    // "Listening..."
lv_obj_t *cancel_btn;         // Touch to cancel

// Waveform buffer (circular buffer)
#define WAVEFORM_WIDTH 240
#define WAVEFORM_HEIGHT 180
#define WAVEFORM_CENTER (WAVEFORM_HEIGHT / 2)
int16_t waveform_buffer[WAVEFORM_WIDTH];
uint16_t waveform_index = 0;

// Drawing function (called from the LVGL waveform task while LVGL is locked)
void draw_waveform(lv_obj_t *canvas, int16_t *audio_samples, size_t count) {
    lv_canvas_fill_bg(canvas, lv_color_hex(0x1A1A1A), LV_OPA_COVER);

    lv_draw_line_dsc_t line_dsc;
    lv_draw_line_dsc_init(&line_dsc);
    line_dsc.color = lv_color_hex(0x00FF00);
    line_dsc.width = 2;

    // Draw waveform line from the samples passed in. Full-scale int16 is scaled
    // to ±(WAVEFORM_CENTER - 1) so every point stays inside the canvas.
    for (int x = 0; x < WAVEFORM_WIDTH - 1 && (size_t)(x + 1) < count; x++) {
        lv_coord_t y1 = WAVEFORM_CENTER + (audio_samples[x] * (WAVEFORM_CENTER - 1)) / 32768;
        lv_coord_t y2 = WAVEFORM_CENTER + (audio_samples[x + 1] * (WAVEFORM_CENTER - 1)) / 32768;

        lv_point_t points[] = {{x, y1}, {x + 1, y2}};
        lv_canvas_draw_line(canvas, points, 2, &line_dsc);
    }
}

// Audio callback (I2S task)
void audio_i2s_callback(int16_t *samples, size_t count) {
    // Downsample audio for waveform display. The stride is kept at 1 or more so
    // the loop still advances when count < WAVEFORM_WIDTH.
    size_t step = count / WAVEFORM_WIDTH;
    if (step == 0) step = 1;
    for (size_t i = 0; i < count; i += step) {
        waveform_buffer[waveform_index] = samples[i];
        waveform_index = (waveform_index + 1) % WAVEFORM_WIDTH;
    }

    // Trigger LVGL update (use event or flag)
    xEventGroupSetBits(ui_event_group, WAVEFORM_UPDATE_BIT);
}
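
Taking every Nth sample for the display can miss short transients that fall between the sampled positions. A possible alternative, shown here as a sketch and not as the spec's method, is peak-preserving decimation: each display column keeps the sample with the largest magnitude in its bucket. `waveform_decimate_peak` is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>

// Reduce `count` input samples to `out_count` display columns. Each column
// keeps the sample with the largest magnitude in its bucket, so short peaks
// stay visible after downsampling.
void waveform_decimate_peak(const int16_t *in, size_t count,
                            int16_t *out, size_t out_count) {
    for (size_t col = 0; col < out_count; col++) {
        size_t start = (col * count) / out_count;
        size_t end = ((col + 1) * count) / out_count;
        if (end <= start) end = start + 1;  // fewer samples than columns
        if (end > count) end = count;       // also covers count == 0

        int16_t best = 0;
        int32_t best_mag = -1;
        for (size_t i = start; i < end; i++) {
            int32_t mag = (in[i] < 0) ? -(int32_t)in[i] : (int32_t)in[i];
            if (mag > best_mag) {
                best_mag = mag;
                best = in[i];
            }
        }
        out[col] = best;
    }
}
```

With a 256-sample I2S read and a 240-pixel canvas the buckets hold one or two samples each, so the benefit is largest when several reads are combined into one redraw.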

Touch Controls:

  • Tap anywhere: Cancel listening
  • Swipe down: Lower sensitivity
  • Swipe up: Increase sensitivity

3. PROCESSING State

Display:

  • Animated spinner/thinking indicator
  • "Processing..." text
  • Waveform fades out smoothly

Animation:

  • Circular spinner with gradient
  • Rotation: 360° per 1 second
  • Pulsing opacity

Colors:

  • Background: Dark gray (#1A1A1A)
  • Spinner: Blue (#0080FF)
  • Text: Light gray (#CCCCCC)

LVGL Implementation:

lv_obj_t *processing_screen;
lv_obj_t *spinner;           // lv_spinner widget
lv_obj_t *processing_label;  // "Processing..."

// lv_anim exec callbacks take (void *, int32_t), so the three-argument
// style setter needs a small wrapper
static void set_opa_anim_cb(void *obj, int32_t value) {
    lv_obj_set_style_opa((lv_obj_t *)obj, (lv_opa_t)value, 0);
}

// Transition from listening to processing
void transition_to_processing(void) {
    // Fade out waveform
    lv_anim_t fade_out;
    lv_anim_init(&fade_out);
    lv_anim_set_var(&fade_out, waveform_canvas);
    lv_anim_set_values(&fade_out, LV_OPA_COVER, LV_OPA_TRANSP);
    lv_anim_set_time(&fade_out, 300);
    lv_anim_set_exec_cb(&fade_out, set_opa_anim_cb);
    lv_anim_start(&fade_out);

    // Show spinner after fade
    lv_timer_t *timer = lv_timer_create(show_spinner_callback, 300, NULL);
    lv_timer_set_repeat_count(timer, 1);
}

4. SPEAKING State

Display:

  • Audio output waveform (TTS playback visualization)
  • "Speaking..." or response text snippet
  • Volume indicator

Waveform:

  • Same canvas as LISTENING but different color
  • Shows output audio being played
  • Synchronized with speaker output

Colors:

  • Background: Dark gray (#1A1A1A)
  • Waveform: Blue (#0080FF)
  • Text: White (#FFFFFF)

LVGL Implementation:

lv_obj_t *speaking_screen;
lv_obj_t *output_waveform_canvas;  // Same size as input waveform
lv_obj_t *response_label;          // Show part of response text
lv_obj_t *volume_bar;              // lv_bar widget for volume level

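// Similar drawing to listening state, but fed from speaker buffer
void draw_output_waveform(lv_obj_t *canvas, int16_t *speaker_samples, size_t count) {
    // Same logic as input waveform, different color
    lv_draw_line_dsc_t line_dsc;
    lv_draw_line_dsc_init(&line_dsc);
    line_dsc.color = lv_color_hex(0x0080FF);
    line_dsc.width = 2;
    // ... same fill-and-draw loop as draw_waveform(), reading speaker_samples
}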

Touch Controls:

  • Tap: Skip response (go back to idle)
  • Volume slider: Adjust speaker volume

Additional UI Elements

Status Bar (All States)

Location: Top 20 pixels

Contents:

  • WiFi icon + signal strength
  • Battery icon + percentage
  • Time (from RTC)
  • Mute icon (if muted)

LVGL Implementation:

lv_obj_t *status_bar;
lv_obj_t *wifi_icon;
lv_obj_t *battery_icon;
lv_obj_t *time_label;
lv_obj_t *mute_icon;

// Update every second
void update_status_bar(lv_timer_t *timer) {
    // Update WiFi strength
    int8_t rssi = wifi_get_rssi();
    lv_img_set_src(wifi_icon, get_wifi_icon(rssi));

    // Update battery
    uint8_t battery_pct = battery_get_percentage();
    lv_img_set_src(battery_icon, get_battery_icon(battery_pct));

    // Update time from RTC
    rtc_time_t time;
    pcf85063_get_time(&time);
    lv_label_set_text_fmt(time_label, "%02d:%02d", time.hour, time.min);
}

// Create timer for status bar updates
lv_timer_create(update_status_bar, 1000, NULL);

Settings Screen (Touch Access)

Trigger: Long-press on idle screen

Contents:

  • Volume slider
  • Brightness slider
  • Wake word sensitivity slider
  • WiFi settings button
  • About/Info button

LVGL Implementation:

lv_obj_t *settings_screen;
lv_obj_t *volume_slider;
lv_obj_t *brightness_slider;
lv_obj_t *sensitivity_slider;
lv_obj_t *wifi_btn;
lv_obj_t *about_btn;
lv_obj_t *back_btn;

// Slider event handler
static void slider_event_cb(lv_event_t *e) {
    lv_obj_t *slider = lv_event_get_target(e);
    int32_t value = lv_slider_get_value(slider);

    if (slider == volume_slider) {
        set_speaker_volume(value);
    } else if (slider == brightness_slider) {
        set_backlight_brightness(value);
    } else if (slider == sensitivity_slider) {
        set_wake_word_sensitivity(value);
    }
}

Audio Pipeline Integration

I2S Configuration

Microphone (INMP441):

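// NOTE: i2s_config_t and i2s_driver_install() belong to the legacy I2S driver
// (driver/i2s.h), which is deprecated in ESP-IDF v5.x in favor of driver/i2s_std.h
#define I2S_MIC_NUM         I2S_NUM_0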
#define I2S_MIC_BCLK_PIN    GPIO_NUM_4   // Verify with board schematic
#define I2S_MIC_WS_PIN      GPIO_NUM_5
#define I2S_MIC_DIN_PIN     GPIO_NUM_6
#define I2S_MIC_SAMPLE_RATE 16000
#define I2S_MIC_BITS        16
#define I2S_MIC_CHANNELS    1

i2s_config_t i2s_mic_config = {
    .mode = I2S_MODE_MASTER | I2S_MODE_RX,
    .sample_rate = I2S_MIC_SAMPLE_RATE,
    .bits_per_sample = I2S_BITS_PER_SAMPLE_16BIT,
    .channel_format = I2S_CHANNEL_FMT_ONLY_LEFT,
    .communication_format = I2S_COMM_FORMAT_STAND_I2S,
    .intr_alloc_flags = ESP_INTR_FLAG_LEVEL1,
    .dma_buf_count = 8,
    .dma_buf_len = 256,
    .use_apll = false,
    .tx_desc_auto_clear = false,
    .fixed_mclk = 0
};

i2s_pin_config_t i2s_mic_pins = {
    .bck_io_num = I2S_MIC_BCLK_PIN,
    .ws_io_num = I2S_MIC_WS_PIN,
    .data_out_num = I2S_PIN_NO_CHANGE,
    .data_in_num = I2S_MIC_DIN_PIN
};

void audio_init_microphone(void) {
    i2s_driver_install(I2S_MIC_NUM, &i2s_mic_config, 0, NULL);
    i2s_set_pin(I2S_MIC_NUM, &i2s_mic_pins);
    i2s_zero_dma_buffer(I2S_MIC_NUM);
}

Speaker (MAX98357A I2S Amp):

#define I2S_SPK_NUM         I2S_NUM_1
#define I2S_SPK_BCLK_PIN    GPIO_NUM_7   // Verify with board schematic
#define I2S_SPK_WS_PIN      GPIO_NUM_8
#define I2S_SPK_DOUT_PIN    GPIO_NUM_9
#define I2S_SPK_SAMPLE_RATE 16000
#define I2S_SPK_BITS        16
#define I2S_SPK_CHANNELS    1

i2s_config_t i2s_spk_config = {
    .mode = I2S_MODE_MASTER | I2S_MODE_TX,
    .sample_rate = I2S_SPK_SAMPLE_RATE,
    .bits_per_sample = I2S_BITS_PER_SAMPLE_16BIT,
    .channel_format = I2S_CHANNEL_FMT_ONLY_LEFT,
    .communication_format = I2S_COMM_FORMAT_STAND_I2S,
    .intr_alloc_flags = ESP_INTR_FLAG_LEVEL1,
    .dma_buf_count = 8,
    .dma_buf_len = 256,
    .use_apll = false,
    .tx_desc_auto_clear = true,
    .fixed_mclk = 0
};

i2s_pin_config_t i2s_spk_pins = {
    .bck_io_num = I2S_SPK_BCLK_PIN,
    .ws_io_num = I2S_SPK_WS_PIN,
    .data_out_num = I2S_SPK_DOUT_PIN,
    .data_in_num = I2S_PIN_NO_CHANGE
};

void audio_init_speaker(void) {
    i2s_driver_install(I2S_SPK_NUM, &i2s_spk_config, 0, NULL);
    i2s_set_pin(I2S_SPK_NUM, &i2s_spk_pins);
    i2s_zero_dma_buffer(I2S_SPK_NUM);
}
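
The DMA settings above bound how much audio can queue up inside the driver, which matters for the end-to-end latency target in the testing plan. The arithmetic can be sketched as below, assuming `dma_buf_len` counts mono frames. `i2s_dma_latency_ms` and `pcm_bytes_per_second` are hypothetical helper names.

```c
#include <stdint.h>

// Duration of audio, in milliseconds, held by `buf_count` DMA buffers of
// `buf_len` frames each at the given sample rate.
uint32_t i2s_dma_latency_ms(uint32_t buf_count, uint32_t buf_len,
                            uint32_t sample_rate_hz) {
    return (buf_count * buf_len * 1000u) / sample_rate_hz;
}

// Raw PCM data rate in bytes per second.
uint32_t pcm_bytes_per_second(uint32_t sample_rate_hz, uint32_t bits_per_sample,
                              uint32_t channels) {
    return sample_rate_hz * (bits_per_sample / 8u) * channels;
}
```

With 8 buffers of 256 frames at 16 kHz this gives 128 ms when every buffer is full, and a single 256-frame buffer covers 16 ms. A 16 kHz, 16-bit mono stream needs 32000 bytes per second on the network.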

Audio Buffer Management

Circular Buffer for Waveform:

#define AUDIO_BUFFER_SIZE 2048
#define WAVEFORM_DECIMATION 8  // Downsample for display

typedef struct {
    int16_t samples[AUDIO_BUFFER_SIZE];
    uint16_t write_idx;
    uint16_t read_idx;
    SemaphoreHandle_t mutex;
} audio_buffer_t;

audio_buffer_t mic_buffer;
audio_buffer_t spk_buffer;

void audio_buffer_init(audio_buffer_t *buf) {
    memset(buf->samples, 0, sizeof(buf->samples));
    buf->write_idx = 0;
    buf->read_idx = 0;
    buf->mutex = xSemaphoreCreateMutex();
}

void audio_buffer_write(audio_buffer_t *buf, int16_t *samples, size_t count) {
    xSemaphoreTake(buf->mutex, portMAX_DELAY);
    for (size_t i = 0; i < count; i++) {
        buf->samples[buf->write_idx] = samples[i];
        buf->write_idx = (buf->write_idx + 1) % AUDIO_BUFFER_SIZE;
    }
    xSemaphoreGive(buf->mutex);
}

// Get downsampled samples for waveform display
void audio_buffer_get_waveform(audio_buffer_t *buf, int16_t *out, size_t out_count) {
    xSemaphoreTake(buf->mutex, portMAX_DELAY);
    for (size_t i = 0; i < out_count; i++) {
        size_t src_idx = (buf->write_idx + (i * WAVEFORM_DECIMATION)) % AUDIO_BUFFER_SIZE;
        out[i] = buf->samples[src_idx];
    }
    xSemaphoreGive(buf->mutex);
}
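
The wraparound arithmetic of the circular buffer is easy to get wrong, so it can help to check it on a host first. The sketch below is a simplified, single-threaded variant with the mutex omitted, and it reads the newest window of samples, ending just before the write index. `ring_t`, `ring_write`, and `ring_read_newest` are hypothetical names; on the device the mutex from the code above would still be needed.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE  2048   // same value as AUDIO_BUFFER_SIZE
#define DECIMATION 8      // same value as WAVEFORM_DECIMATION

typedef struct {
    int16_t samples[RING_SIZE];
    size_t write_idx;     // next slot to be written
} ring_t;

void ring_write(ring_t *r, const int16_t *in, size_t count) {
    for (size_t i = 0; i < count; i++) {
        r->samples[r->write_idx] = in[i];
        r->write_idx = (r->write_idx + 1) % RING_SIZE;
    }
}

// Copy every DECIMATION-th sample from the newest out_count * DECIMATION
// samples into out, oldest first. The last output falls within one
// decimation step of the newest sample.
// Requires out_count * DECIMATION <= RING_SIZE.
void ring_read_newest(const ring_t *r, int16_t *out, size_t out_count) {
    size_t span = out_count * DECIMATION;
    size_t start = (r->write_idx + RING_SIZE - span) % RING_SIZE;
    for (size_t i = 0; i < out_count; i++) {
        out[i] = r->samples[(start + i * DECIMATION) % RING_SIZE];
    }
}
```

For a 240-column display the window covers 1920 samples, or 120 ms of audio at 16 kHz.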

Audio Streaming Task

Microphone Input Task:

void audio_mic_task(void *pvParameters) {
    int16_t i2s_buffer[256];
    size_t bytes_read;

    while (1) {
        // Read from I2S microphone
        i2s_read(I2S_MIC_NUM, i2s_buffer, sizeof(i2s_buffer), &bytes_read, portMAX_DELAY);
        size_t samples_read = bytes_read / sizeof(int16_t);

        if (current_state == STATE_LISTENING) {
            // Write to circular buffer for waveform display
            audio_buffer_write(&mic_buffer, i2s_buffer, samples_read);

            // Send to Heimdall server via WiFi
            audio_send_to_server(i2s_buffer, samples_read);

            // Trigger waveform update
            xEventGroupSetBits(ui_event_group, WAVEFORM_UPDATE_BIT);
        }
    }
}

Speaker Output Task:

void audio_speaker_task(void *pvParameters) {
    int16_t i2s_buffer[256];
    size_t bytes_written;

    while (1) {
        // Receive audio from Heimdall server
        size_t samples_received = audio_receive_from_server(i2s_buffer, 256);

        if (samples_received > 0 && current_state == STATE_SPEAKING) {
            // Write to circular buffer for waveform display
            audio_buffer_write(&spk_buffer, i2s_buffer, samples_received);

            // Play through I2S speaker
            i2s_write(I2S_SPK_NUM, i2s_buffer, samples_received * sizeof(int16_t),
                     &bytes_written, portMAX_DELAY);

            // Trigger waveform update
            xEventGroupSetBits(ui_event_group, WAVEFORM_UPDATE_BIT);
        } else {
            vTaskDelay(pdMS_TO_TICKS(10));
        }
    }
}

LVGL Update Task

Waveform Rendering Task:

void lvgl_waveform_task(void *pvParameters) {
    int16_t waveform_samples[WAVEFORM_WIDTH];

    while (1) {
        // Wait for waveform update event
        EventBits_t bits = xEventGroupWaitBits(ui_event_group, WAVEFORM_UPDATE_BIT,
                                               pdTRUE, pdFALSE, pdMS_TO_TICKS(50));

        if (bits & WAVEFORM_UPDATE_BIT) {
            if (current_state == STATE_LISTENING) {
                // Get downsampled mic data
                audio_buffer_get_waveform(&mic_buffer, waveform_samples, WAVEFORM_WIDTH);

                // Draw on LVGL canvas (must lock LVGL)
                lvgl_lock();
                draw_waveform(waveform_canvas, waveform_samples, WAVEFORM_WIDTH);
                lvgl_unlock();

            } else if (current_state == STATE_SPEAKING) {
                // Get downsampled speaker data
                audio_buffer_get_waveform(&spk_buffer, waveform_samples, WAVEFORM_WIDTH);

                lvgl_lock();
                draw_output_waveform(output_waveform_canvas, waveform_samples, WAVEFORM_WIDTH);
                lvgl_unlock();
            }
        }
    }
}

Touch Gesture Integration

Touch Controller (CST816D)

Gestures Supported:

  • Single tap
  • Long press
  • Swipe up/down/left/right

Implementation:

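// NOTE: GPIO6, GPIO7 and GPIO9 are also assigned to the I2S microphone and
// speaker in the audio section. These assignments conflict; resolve every pin
// against the board schematic before wiring.
#define TOUCH_I2C_NUM       I2C_NUM_0
#define TOUCH_SDA_PIN       GPIO_NUM_6
#define TOUCH_SCL_PIN       GPIO_NUM_7
#define TOUCH_INT_PIN       GPIO_NUM_9
#define TOUCH_RST_PIN       GPIO_NUM_10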

typedef enum {
    GESTURE_NONE = 0,
    GESTURE_TAP,
    GESTURE_LONG_PRESS,
    GESTURE_SWIPE_UP,
    GESTURE_SWIPE_DOWN,
    GESTURE_SWIPE_LEFT,
    GESTURE_SWIPE_RIGHT
} touch_gesture_t;

void touch_init(void) {
    // I2C init for CST816D
    i2c_config_t conf = {
        .mode = I2C_MODE_MASTER,
        .sda_io_num = TOUCH_SDA_PIN,
        .scl_io_num = TOUCH_SCL_PIN,
        .sda_pullup_en = GPIO_PULLUP_ENABLE,
        .scl_pullup_en = GPIO_PULLUP_ENABLE,
        .master.clk_speed = 100000,
    };
    i2c_param_config(TOUCH_I2C_NUM, &conf);
    i2c_driver_install(TOUCH_I2C_NUM, conf.mode, 0, 0, 0);

    // Reset touch controller
    gpio_set_direction(TOUCH_RST_PIN, GPIO_MODE_OUTPUT);
    gpio_set_level(TOUCH_RST_PIN, 0);
    vTaskDelay(pdMS_TO_TICKS(10));
    gpio_set_level(TOUCH_RST_PIN, 1);
    vTaskDelay(pdMS_TO_TICKS(50));

    // Configure interrupt pin
    gpio_set_direction(TOUCH_INT_PIN, GPIO_MODE_INPUT);
    gpio_set_intr_type(TOUCH_INT_PIN, GPIO_INTR_NEGEDGE);
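    gpio_install_isr_service(0);  // Returns ESP_ERR_INVALID_STATE if already installed (app_main() also calls it); install in one place only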
    gpio_isr_handler_add(TOUCH_INT_PIN, touch_isr_handler, NULL);
}

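touch_gesture_t touch_read_gesture(void) {
    uint8_t reg = 0x01;  // Gesture ID register
    uint8_t id = 0;
    // Write the register address, then read the gesture ID back
    if (i2c_master_write_read_device(TOUCH_I2C_NUM, CST816D_ADDR, &reg, 1,
                                     &id, 1, pdMS_TO_TICKS(100)) != ESP_OK) {
        return GESTURE_NONE;
    }
    // NOTE: raw IDs may not match the touch_gesture_t ordering; verify against
    // the CST816D datasheet and translate if needed
    return (touch_gesture_t)id;
}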
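
The value read from the gesture register is a controller-specific code, so casting it straight to `touch_gesture_t` only works if the enum happens to use the same numbering. A translation step can be sketched as follows. The raw codes used here are the values commonly reported for the CST816 family and are an assumption; they should be checked against the CST816D datasheet and the mounted screen orientation. `touch_map_gesture` is a hypothetical helper.

```c
#include <stdint.h>

typedef enum {
    GESTURE_NONE = 0,
    GESTURE_TAP,
    GESTURE_LONG_PRESS,
    GESTURE_SWIPE_UP,
    GESTURE_SWIPE_DOWN,
    GESTURE_SWIPE_LEFT,
    GESTURE_SWIPE_RIGHT
} touch_gesture_t;

// Translate a raw gesture ID into the application enum.
// Raw codes are assumed values; verify them against the datasheet.
touch_gesture_t touch_map_gesture(uint8_t raw_id) {
    switch (raw_id) {
    case 0x01: return GESTURE_SWIPE_UP;
    case 0x02: return GESTURE_SWIPE_DOWN;
    case 0x03: return GESTURE_SWIPE_LEFT;
    case 0x04: return GESTURE_SWIPE_RIGHT;
    case 0x05: return GESTURE_TAP;
    case 0x0C: return GESTURE_LONG_PRESS;
    default:   return GESTURE_NONE;   // includes 0x00 and unknown codes
    }
}
```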

Gesture Actions by State

IDLE State:

  • Tap: Wake up display (if dimmed)
  • Long Press: Open settings screen
  • Swipe Up: Show more info (weather, calendar)

LISTENING State:

  • Tap: Cancel listening, return to idle
  • Swipe Down: Lower wake word sensitivity
  • Swipe Up: Raise wake word sensitivity

SPEAKING State:

  • Tap: Skip response, return to idle
  • Swipe Left/Right: Volume down/up

PROCESSING State:

  • Tap: Cancel processing (if possible)

Network Communication

WiFi Configuration

Connection:

#define WIFI_SSID           "YourNetworkName"
#define WIFI_PASSWORD       "YourPassword"
#define SERVER_URL          "http://10.1.10.71:3006"

void wifi_init(void) {
    esp_netif_init();
    esp_event_loop_create_default();
    esp_netif_create_default_wifi_sta();

    wifi_init_config_t cfg = WIFI_INIT_CONFIG_DEFAULT();
    esp_wifi_init(&cfg);

    wifi_config_t wifi_config = {
        .sta = {
            .ssid = WIFI_SSID,
            .password = WIFI_PASSWORD,
        },
    };

    esp_wifi_set_mode(WIFI_MODE_STA);
    esp_wifi_set_config(WIFI_IF_STA, &wifi_config);
    esp_wifi_start();
    esp_wifi_connect();
}

Server Communication Protocol

Endpoints:

  • GET /health - Server health check
  • POST /audio/stream - Stream audio to server (multipart)
  • GET /audio/tts - Receive TTS audio response
  • GET /wake-word/status - Check wake word detection status

Audio Streaming (WebSockets Recommended):

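// NOTE: since ESP-IDF v5.0 the WebSocket client is a managed component
// (espressif/esp_websocket_client), not part of the core IDF
#include "esp_websocket_client.h"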

esp_websocket_client_handle_t ws_client;

void websocket_init(void) {
    esp_websocket_client_config_t ws_cfg = {
        .uri = "ws://10.1.10.71:3006/ws/audio",
        .buffer_size = 2048,
    };

    ws_client = esp_websocket_client_init(&ws_cfg);
    esp_websocket_register_events(ws_client, WEBSOCKET_EVENT_ANY,
                                   websocket_event_handler, NULL);
    esp_websocket_client_start(ws_client);
}

void audio_send_to_server(int16_t *samples, size_t count) {
    if (esp_websocket_client_is_connected(ws_client)) {
        esp_websocket_client_send_bin(ws_client, (char*)samples,
                                     count * sizeof(int16_t), portMAX_DELAY);
    }
}

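// esp_websocket_client has no blocking receive call: incoming frames are delivered
// as WEBSOCKET_EVENT_DATA to websocket_event_handler. This assumes the handler copies
// binary payloads into rx_stream, a hypothetical FreeRTOS stream buffer
// (freertos/stream_buffer.h) created at startup.
extern StreamBufferHandle_t rx_stream;

size_t audio_receive_from_server(int16_t *out_buffer, size_t max_samples) {
    // Receive audio from server (blocking with timeout)
    size_t len = xStreamBufferReceive(rx_stream, out_buffer,
                                      max_samples * sizeof(int16_t), pdMS_TO_TICKS(100));
    return len / sizeof(int16_t);
}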

Alternative: HTTP Chunked Transfer (Simpler):

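void audio_stream_http(void) {
    esp_http_client_config_t config = {
        .url = "http://10.1.10.71:3006/audio/stream",
        .method = HTTP_METHOD_POST,
    };
    esp_http_client_handle_t client = esp_http_client_init(&config);

    // Set headers
    esp_http_client_set_header(client, "Content-Type", "audio/pcm");
    esp_http_client_set_header(client, "Transfer-Encoding", "chunked");

    // -1 = chunked mode
    if (esp_http_client_open(client, -1) != ESP_OK) {
        esp_http_client_cleanup(client);
        return;
    }

    // Stream audio chunks
    int16_t buffer[256];
    char chunk_header[16];
    while (current_state == STATE_LISTENING) {
        // Read from mic
        size_t bytes_read;
        i2s_read(I2S_MIC_NUM, buffer, sizeof(buffer), &bytes_read, portMAX_DELAY);
        if (bytes_read == 0) continue;

        // Send to server. esp_http_client_write() is understood to send raw bytes,
        // so each chunk is framed by hand: hex length, CRLF, payload, CRLF
        int n = snprintf(chunk_header, sizeof(chunk_header), "%x\r\n", (unsigned)bytes_read);
        esp_http_client_write(client, chunk_header, n);
        esp_http_client_write(client, (char*)buffer, bytes_read);
        esp_http_client_write(client, "\r\n", 2);
    }

    esp_http_client_write(client, "0\r\n\r\n", 5);  // Terminating zero-length chunk
    esp_http_client_close(client);
    esp_http_client_cleanup(client);
}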

Power Management

Battery Monitoring

ETA6098 Charging Chip:

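#define BATTERY_ADC_CHANNEL ADC1_CHANNEL_0  // GPIO1 (example)
#define BATTERY_FULL_MV     4200
#define BATTERY_EMPTY_MV    3300

// Calibration data filled in by battery_init() and used by battery_get_percentage()
static esp_adc_cal_characteristics_t adc_chars;

void battery_init(void) {
    adc1_config_width(ADC_WIDTH_BIT_12);
    adc1_config_channel_atten(BATTERY_ADC_CHANNEL, ADC_ATTEN_DB_11);
    esp_adc_cal_characterize(ADC_UNIT_1, ADC_ATTEN_DB_11, ADC_WIDTH_BIT_12, 0, &adc_chars);
}

uint8_t battery_get_percentage(void) {
    int adc_reading = adc1_get_raw(BATTERY_ADC_CHANNEL);
    // NOTE: this is the voltage at the ADC pin. If the board feeds the battery through
    // a resistor divider, scale by the divider ratio before comparing (check schematic).
    int voltage_mv = esp_adc_cal_raw_to_voltage(adc_reading, &adc_chars);

    if (voltage_mv >= BATTERY_FULL_MV) return 100;
    if (voltage_mv <= BATTERY_EMPTY_MV) return 0;

    return ((voltage_mv - BATTERY_EMPTY_MV) * 100) / (BATTERY_FULL_MV - BATTERY_EMPTY_MV);
}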

bool battery_is_charging(void) {
    // Check SYS_OUT pin (GPIO36) - high when charging
    gpio_set_direction(GPIO_NUM_36, GPIO_MODE_INPUT);
    return gpio_get_level(GPIO_NUM_36);
}

Low Power Modes

Deep Sleep When Idle (Optional):

#define IDLE_TIMEOUT_MS 300000  // 5 minutes

void enter_deep_sleep(void) {
    // Save state to RTC memory
    RTC_DATA_ATTR static uint32_t boot_count = 0;
    boot_count++;

    // Configure wake sources
    esp_sleep_enable_ext0_wakeup(TOUCH_INT_PIN, 0);  // Wake on touch
    esp_sleep_enable_timer_wakeup(3600 * 1000000ULL); // Wake every hour

    // Turn off display
    gpio_set_level(LCD_BL_PIN, 0);

    // Enter deep sleep
    esp_deep_sleep_start();
}

Performance Optimization

LVGL Performance

Buffer Configuration:

#define LVGL_BUFFER_PIXELS (240 * 280 / 10)  // 1/10 of the screen, counted in pixels

static lv_color_t buf_1[LVGL_BUFFER_PIXELS];  // 1/10 screen buffer
static lv_color_t buf_2[LVGL_BUFFER_PIXELS];  // Double buffering

lv_disp_draw_buf_t draw_buf;
lv_disp_draw_buf_init(&draw_buf, buf_1, buf_2, LVGL_BUFFER_PIXELS);  // Size is in pixels, not bytes
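
The memory cost of the draw buffers follows from the panel size, the fraction of the screen each buffer covers, and the color depth. The arithmetic is sketched below, assuming 16-bit color (2 bytes per pixel). `lvgl_buf_pixels` and `lvgl_buf_bytes` are hypothetical helper names.

```c
#include <stddef.h>

// Pixels in one draw buffer that covers 1/fraction of the screen.
size_t lvgl_buf_pixels(size_t hor_res, size_t ver_res, size_t fraction) {
    return (hor_res * ver_res) / fraction;
}

// Total RAM in bytes for n_bufs draw buffers of the given pixel count.
size_t lvgl_buf_bytes(size_t pixels, size_t bytes_per_pixel, size_t n_bufs) {
    return pixels * bytes_per_pixel * n_bufs;
}
```

For a 240×280 panel, a 1/10 screen buffer is 6720 pixels, so two buffers at 2 bytes per pixel take 26880 bytes of RAM.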

Task Priority:

#define LVGL_TASK_PRIORITY      5
#define AUDIO_MIC_TASK_PRIORITY 10  // Higher priority for audio
#define AUDIO_SPK_TASK_PRIORITY 10
#define WIFI_TASK_PRIORITY      8
#define WAVEFORM_TASK_PRIORITY  4   // Lower priority for visuals

void app_main(void) {
    // Create tasks with priorities
    xTaskCreatePinnedToCore(lvgl_task, "LVGL", 8192, NULL, LVGL_TASK_PRIORITY, NULL, 1);
    xTaskCreatePinnedToCore(audio_mic_task, "MIC", 4096, NULL, AUDIO_MIC_TASK_PRIORITY, NULL, 0);
    xTaskCreatePinnedToCore(audio_speaker_task, "SPK", 4096, NULL, AUDIO_SPK_TASK_PRIORITY, NULL, 0);
    xTaskCreatePinnedToCore(lvgl_waveform_task, "WAVE", 4096, NULL, WAVEFORM_TASK_PRIORITY, NULL, 1);
}

Reduce Waveform Update Rate:

// Only update waveform at 30 FPS, not every audio sample
#define WAVEFORM_UPDATE_MS 33  // ~30 FPS

void lvgl_waveform_task(void *pvParameters) {
    TickType_t last_update = xTaskGetTickCount();

    while (1) {
        TickType_t now = xTaskGetTickCount();
        if ((now - last_update) >= pdMS_TO_TICKS(WAVEFORM_UPDATE_MS)) {
            // Update waveform
            last_update = now;
        }
        vTaskDelay(pdMS_TO_TICKS(10));
    }
}
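
The elapsed-time check in the loop above can be factored into a small helper that is testable off the device. In this sketch, `rate_limiter_t` and `rate_limiter_due` are hypothetical names, and ticks are modeled as a plain `uint32_t` counter.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t interval_ticks;  // minimum ticks between updates
    uint32_t last_tick;       // tick of the last accepted update
} rate_limiter_t;

// Returns true and records `now` when at least interval_ticks have elapsed
// since the last accepted update. Unsigned subtraction keeps the comparison
// correct when the tick counter wraps around.
bool rate_limiter_due(rate_limiter_t *rl, uint32_t now) {
    if ((uint32_t)(now - rl->last_tick) >= rl->interval_ticks) {
        rl->last_tick = now;
        return true;
    }
    return false;
}
```

With a 1 ms tick, an interval of 33 gives roughly 30 updates per second.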

Memory Management

PSRAM Usage:

// Allocate large buffers in PSRAM (8MB available)
#define AUDIO_LARGE_BUFFER_SIZE (16000 * 10)  // 10 seconds at 16kHz

int16_t *audio_history = heap_caps_malloc(AUDIO_LARGE_BUFFER_SIZE * sizeof(int16_t),
                                          MALLOC_CAP_SPIRAM);

// Check if allocation succeeded
if (audio_history == NULL) {
    ESP_LOGE(TAG, "Failed to allocate PSRAM buffer");
}

Heap Monitoring:

void log_memory_stats(void) {
    // The heap APIs return uint32_t/size_t, so cast to match the %u format
    ESP_LOGI(TAG, "Free heap: %u bytes", (unsigned)esp_get_free_heap_size());
    ESP_LOGI(TAG, "Free PSRAM: %u bytes", (unsigned)heap_caps_get_free_size(MALLOC_CAP_SPIRAM));
    ESP_LOGI(TAG, "Min free heap: %u bytes", (unsigned)esp_get_minimum_free_heap_size());
}

Example Code Structure

File Organization

esp32_voice_assistant/
├── main/
│   ├── main.c                  # Entry point, task creation
│   ├── audio/
│   │   ├── audio_input.c       # I2S microphone handling
│   │   ├── audio_output.c      # I2S speaker handling
│   │   ├── audio_buffer.c      # Circular buffer management
│   │   └── audio_network.c     # WebSocket/HTTP streaming
│   ├── ui/
│   │   ├── ui_init.c           # LVGL setup, screen creation
│   │   ├── ui_idle.c           # Idle screen UI
│   │   ├── ui_listening.c      # Listening screen + waveform
│   │   ├── ui_processing.c     # Processing screen + spinner
│   │   ├── ui_speaking.c       # Speaking screen + output waveform
│   │   ├── ui_settings.c       # Settings screen
│   │   └── ui_waveform.c       # Waveform drawing functions
│   ├── touch/
│   │   ├── touch_cst816d.c     # Touch controller driver
│   │   └── touch_gestures.c    # Gesture recognition
│   ├── network/
│   │   └── wifi_manager.c      # WiFi connection management
│   ├── power/
│   │   ├── battery.c           # Battery monitoring
│   │   └── power_mgmt.c        # Sleep modes
│   └── state_machine.c         # Voice assistant state machine
├── components/
│   └── lvgl/                   # LVGL library (ESP-IDF component)
├── CMakeLists.txt
└── sdkconfig                   # ESP-IDF configuration

Main Entry Point

// main/main.c
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_log.h"
#include "nvs_flash.h"      // nvs_flash_init()
#include "driver/gpio.h"    // gpio_install_isr_service()

static const char *TAG = "VOICE_ASSISTANT";

void app_main(void) {
    ESP_LOGI(TAG, "Voice Assistant Starting...");

    // Initialize hardware
    nvs_flash_init();           // Non-volatile storage
    gpio_install_isr_service(0);// GPIO interrupts

    // Power management
    battery_init();

    // Display and touch
    lcd_init();
    touch_init();
    ui_init();

    // Audio pipeline
    audio_init_microphone();
    audio_init_speaker();
    audio_buffer_init(&mic_buffer);
    audio_buffer_init(&spk_buffer);

    // Network
    wifi_init();
    websocket_init();

    // State machine
    state_machine_init();

    // Create FreeRTOS tasks
    xTaskCreatePinnedToCore(lvgl_task, "LVGL", 8192, NULL, 5, NULL, 1);
    xTaskCreatePinnedToCore(audio_mic_task, "MIC", 4096, NULL, 10, NULL, 0);
    xTaskCreatePinnedToCore(audio_speaker_task, "SPK", 4096, NULL, 10, NULL, 0);
    xTaskCreatePinnedToCore(lvgl_waveform_task, "WAVE", 4096, NULL, 4, NULL, 1);
    xTaskCreatePinnedToCore(state_machine_task, "STATE", 4096, NULL, 7, NULL, 0);

    ESP_LOGI(TAG, "Voice Assistant Running!");
}

Testing Plan

Phase 1: Hardware Validation

  • LCD display working (show test pattern)
  • Touch controller responding (log touch coordinates)
  • Buzzer working (play test tone)
  • WiFi connecting (check IP address)
  • Battery reading (log voltage)
  • RTC working (log time)
  • IMU working (log accelerometer values)

Phase 2: Audio Pipeline

  • I2S microphone reading audio (log levels)
  • Audio streaming to Heimdall server
  • I2S speaker playing audio (test tone)
  • TTS audio playback from server
  • Audio buffer management (no overflows)

Phase 3: LVGL UI

  • Idle screen displays correctly
  • State transitions smooth
  • Waveform renders at 30 FPS
  • Touch gestures recognized
  • Settings screen functional
  • Status bar updates correctly

Phase 4: Integration

  • Wake word detection triggers listening state
  • Waveform shows mic input in real-time
  • Processing state shows after speech ends
  • TTS response plays with output waveform
  • Touch cancel works in all states
  • Battery indicator accurate

Phase 5: Optimization

  • Memory usage stable (no leaks)
  • CPU usage acceptable (<80% average)
  • WiFi latency <100ms
  • Audio latency <200ms end-to-end
  • Display framerate stable (30 FPS)
  • Battery life >4 hours continuous

Bill of Materials (BOM)

Component                 Part / Vendor      Quantity   Unit Price   Total
------------------------  -----------------  --------   ----------   ------
ESP32-S3-Touch-LCD-1.69   Waveshare          1          $12.00       $12.00
I2S MEMS Microphone       INMP441            1          $3.50        $3.50
I2S Amplifier             MAX98357A          1          $3.50        $3.50
Speaker (3W 8Ω)           Generic            1          $5.00        $5.00
LiPo Battery (1000mAh)    503040 JST 1.25    1          $7.00        $7.00
MicroSD Card (8GB)        SanDisk            1          $5.00        $5.00
Breadboard + Wires        Generic            1          $5.00        $5.00
Total                                                                $41.00

Optional:

  • Enclosure/Case (3D printed or project box): $5-10
  • Backup battery: $7
  • USB-C cable: $3

Grand Total with Options: ~$56-61



Next Steps

  1. Order Hardware - ESP32-S3-Touch-LCD + audio components (~$41)
  2. Setup ESP-IDF - Install ESP-IDF v5.3.1+ on development machine
  3. Clone Examples - Get LVGL audio visualization examples for reference
  4. Start Simple - Begin with LCD + LVGL test (no audio)
  5. Add Audio - Wire I2S mic, test audio streaming
  6. Waveform MVP - Get basic waveform rendering working
  7. Full Integration - Connect to Heimdall voice server
  8. Polish - Add touch controls, settings, battery support

Version: 1.0
Created: 2026-01-01
Status: Specification Complete, Ready for Implementation