Research: cf_voice.acoustic — unified acoustic backbone (wake word + context classifier) #10

Open
opened 2026-04-06 22:27:55 -07:00 by pyr0ball · 0 comments
Owner

Question

Should CF build a native wake word engine, and if so, what form should it take?

Background

Minerva currently uses Mycroft Precise (Apache 2.0) for wake word detection. Precise works, but has friction points:

  • TensorFlow-based architecture makes edge deployment painful (TF → ONNX → KMODEL conversion for K210)
  • Maintained by a community fork after Mycroft AI's restructuring — long-term trajectory uncertain
  • Runs as a separate process from cf_voice.context (tone/environment classifier), meaning acoustic feature extraction (MFCCs, mel spectrograms) happens twice per audio frame

The architectural case for a unified backbone

cf_voice.context (circuitforge-core#34) and a wake word detector share significant infrastructure:

  • Both process a continuous audio stream at frame level
  • Both need low-latency classification (< 100ms)
  • Both extract the same acoustic features (MFCCs, mel spectrograms, filter banks)
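To make the shared frame-level constraints concrete, here is the timing arithmetic under common MFCC defaults (16 kHz sample rate, 25 ms window, 10 ms hop — illustrative assumptions, not confirmed cf_voice settings):

```python
# Frame timing for a shared acoustic front end.
# Assumed parameters (typical MFCC defaults, not confirmed cf_voice values):
SAMPLE_RATE = 16_000
WINDOW_MS, HOP_MS = 25, 10

window = SAMPLE_RATE * WINDOW_MS // 1000   # 400 samples per analysis window
hop = SAMPLE_RATE * HOP_MS // 1000         # 160 samples between frames

frames_per_second = SAMPLE_RATE // hop     # 100 classification opportunities/s
latency_budget_ms = 100                    # end-to-end target from above
max_hops_in_budget = latency_budget_ms // HOP_MS

# Every hop must finish feature extraction plus all classifier heads well
# inside the budget; a duplicated front end spends that budget twice.
print(window, hop, frames_per_second, max_hops_in_budget)  # prints: 400 160 100 10
```

At 100 frames per second, any per-frame work that runs twice (as in the current two-process setup) doubles the steady-state CPU cost on edge hardware.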

A cf_voice.acoustic module that runs a single shared feature extraction pass and feeds multiple classification heads would:

  • Eliminate duplicate feature extraction on every frame
  • Allow wake word and context classification to share a backbone (efficient on edge hardware)
  • Enable a single ONNX-native model exportable to K210 KMODEL, ESP32-S3 TFLite, etc.
  • Let CF own the full pipeline for privacy and training purposes
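As a minimal sketch of the shared-backbone idea: one feature pass per frame, two heads consuming it. The feature function and weights below are illustrative placeholders (a crude log-band spectrum standing in for the real MFCC/mel pipeline, random untrained weights), not the actual cf_voice design; a real model would be trained end to end and exported as a single ONNX graph with two outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def acoustic_features(frame: np.ndarray, n_bins: int = 40) -> np.ndarray:
    """Shared front end: log power spectrum pooled into n_bins bands.
    Placeholder for the real MFCC / mel spectrogram pipeline."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    bands = np.array_split(spectrum, n_bins)
    return np.log(np.array([b.mean() for b in bands]) + 1e-10)

# Two lightweight heads sharing one feature pass. Weights are random
# placeholders; shapes are illustrative (1 wake score, 4 context classes).
W_wake = rng.standard_normal((40, 1))
W_context = rng.standard_normal((40, 4))

def classify(frame: np.ndarray):
    feats = acoustic_features(frame)           # extracted ONCE per frame
    wake_score = 1.0 / (1.0 + np.exp(-(feats @ W_wake)[0]))   # sigmoid head
    ctx_logits = feats @ W_context
    ctx_logits = ctx_logits - ctx_logits.max() # numerically stable softmax
    context_probs = np.exp(ctx_logits) / np.exp(ctx_logits).sum()
    return wake_score, context_probs

wake_score, context_probs = classify(rng.standard_normal(400))  # one 25 ms frame
```

Because both heads are plain tensor ops on the same feature vector, the whole graph exports as one ONNX model with two outputs, which is what makes the single K210 KMODEL / ESP32-S3 TFLite artifact feasible.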

Evaluation steps before deciding to build

  1. Assess OpenWakeWord (MIT, by David Scripka) — built on pre-trained Google speech embeddings, more modern than Precise, and already ONNX-friendly. May be the right intermediate choice before building anything.
  2. Benchmark Precise vs OpenWakeWord on Heimdall for accuracy and latency
  3. Profile K210 edge deployment friction — how painful is the TF→ONNX→KMODEL path for Precise vs. a native ONNX model?
  4. Identify shared feature extraction surface with cf_voice.context — how much overlap is there really?
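For step 2, a small engine-agnostic harness keeps the Precise and OpenWakeWord numbers comparable. The stub `detect` callable below is a placeholder; on Heimdall it would be replaced by each engine's actual per-frame predict call, and the 1280-sample (80 ms at 16 kHz) chunk size is an assumption:

```python
import statistics
import time

def benchmark(detect, frames, warmup: int = 10) -> dict:
    """Measure per-frame latency of a wake word engine's detect(frame) callable.

    `detect` is a stand-in for the real engine's predict call; `frames` is a
    list of audio chunks in whatever format that engine expects.
    """
    for f in frames[:warmup]:
        detect(f)                               # let caches warm up
    timings_ms = []
    for f in frames:
        t0 = time.perf_counter()
        detect(f)
        timings_ms.append((time.perf_counter() - t0) * 1000.0)
    timings_ms.sort()
    return {
        "p50_ms": statistics.median(timings_ms),
        "p95_ms": timings_ms[int(0.95 * len(timings_ms)) - 1],
        "max_ms": timings_ms[-1],
    }

# Stub engine and synthetic silence; swap in real engines and recorded audio.
frames = [[0.0] * 1280 for _ in range(200)]     # assumed 80 ms chunks @ 16 kHz
stats = benchmark(lambda f: sum(f), frames)
```

Reporting p95 and max alongside the median matters here: a wake word engine that usually fits the 100 ms budget but occasionally spikes past it will still feel laggy.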

The long-term differentiator: federated training

A privacy-preserving, federated wake word training system:

  • Users train on-device with their own voice samples
  • Contribute anonymized model updates (gradients/weights, not audio) to a shared base model
  • No audio ever leaves the user's hardware
  • Community-improved models without a central audio corpus

No commercial wake word system offers this. It is directly on-brand for CF's privacy-first mission and would be a genuine differentiator for Minerva as a platform.
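The aggregation step in the bullets above can be sketched as FedAvg-style weighted averaging: clients send only weight deltas from on-device training, weighted by local sample count, and the server never sees audio. The shapes and counts below are toy values for illustration:

```python
import numpy as np

def federated_average(base: np.ndarray,
                      client_updates: list[np.ndarray],
                      counts: list[int]) -> np.ndarray:
    """FedAvg-style aggregation of on-device weight deltas.

    Each client trains locally on its own voice samples and contributes only
    a weight delta; deltas are weighted by the client's local sample count.
    No audio is ever transmitted.
    """
    total = sum(counts)
    delta = sum(n / total * upd for upd, n in zip(client_updates, counts))
    return base + delta

base = np.zeros(4)                        # toy base-model weights
updates = [np.array([0.4, 0.0, 0.0, 0.0]),   # client A's local delta
           np.array([0.0, 0.2, 0.0, 0.0])]   # client B's local delta
new_base = federated_average(base, updates, counts=[100, 300])
```

One caveat worth carrying into the design: raw gradients/weight deltas can still leak information about training data, so a production pipeline would pair this with secure aggregation or differential privacy rather than relying on "not audio" alone for anonymity.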

Decision criteria for building vs. adopting

Build a CF-native implementation when:

  • cf_voice.context is mature and the shared feature extraction surface is clear
  • Precise or OpenWakeWord is proving to be a friction point in real deployment
  • The federated training pipeline has community appetite to justify the ML investment
  • Minerva has enough deployment data to define what the wake word engine actually needs to do differently

Adopt/wrap OpenWakeWord when:

  • Alpha and Beta milestones need a working engine now
  • Edge deployment to K210/ESP32-S3 works acceptably with existing tools

Suggested sequencing

  1. Now (Alpha): Evaluate OpenWakeWord vs Precise — pick the better-fitting option
  2. Beta/Hardware v1: Revisit if K210 KMODEL conversion is still painful
  3. Platform milestone: Design cf_voice.acoustic as unified backbone; wake word as one output head
  4. Long term: Federated training pipeline if community demand exists

References

  • circuitforge-core#34: cf_voice module (context classifier shares acoustic infrastructure)
  • Minerva milestones: Alpha (64), Platform (68)
  • OpenWakeWord: https://github.com/dscripka/openWakeWord
  • Mycroft Precise community fork: https://github.com/OpenVoiceOS/ovos-precise-runner
pyr0ball added this to the Platform milestone 2026-04-06 22:27:55 -07:00
pyr0ball added the enhancement label 2026-04-06 22:27:55 -07:00
Reference: Circuit-Forge/minerva#10