Research: cf_voice.acoustic — unified acoustic backbone (wake word + context classifier) #10

Open
opened 2026-04-06 22:27:55 -07:00 by pyr0ball · 0 comments
Owner

Question

Should CF build a native wake word engine, and if so, what form should it take?

Background

Minerva currently uses Mycroft Precise (Apache 2.0) for wake word detection. Precise works, but has friction points:

  • TensorFlow-based architecture makes edge deployment painful (TF → ONNX → KMODEL conversion for K210)
  • Maintained by a community fork after Mycroft AI's restructuring — long-term trajectory uncertain
  • Runs as a separate process from cf_voice.context (tone/environment classifier), meaning acoustic feature extraction (MFCCs, mel spectrograms) happens twice per audio frame

The architectural case for a unified backbone

cf_voice.context (circuitforge-core#34) and a wake word detector share significant infrastructure:

  • Both process a continuous audio stream at frame level
  • Both need low-latency classification (< 100ms)
  • Both extract the same acoustic features (MFCCs, mel spectrograms, filter banks)
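To make the shared frame-level constraints concrete, here is the timing arithmetic under common MFCC defaults (16 kHz sample rate, 25 ms window, 10 ms hop — illustrative assumptions, not confirmed cf_voice settings):

```python
# Frame timing for a shared acoustic front end.
# Assumed parameters (typical MFCC defaults, not confirmed cf_voice values):
SAMPLE_RATE = 16_000
WINDOW_MS, HOP_MS = 25, 10

window = SAMPLE_RATE * WINDOW_MS // 1000   # 400 samples per analysis window
hop = SAMPLE_RATE * HOP_MS // 1000         # 160 samples between frames

frames_per_second = SAMPLE_RATE // hop     # 100 classification opportunities/s
latency_budget_ms = 100                    # end-to-end target from above
max_hops_in_budget = latency_budget_ms // HOP_MS

# Every hop must finish feature extraction plus all classifier heads well
# inside the budget; a duplicated front end spends that budget twice.
print(window, hop, frames_per_second, max_hops_in_budget)  # prints: 400 160 100 10
```

At 100 frames per second, any per-frame work that runs twice (as in the current two-process setup) doubles the steady-state CPU cost on edge hardware.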

A cf_voice.acoustic module that runs a single shared feature extraction pass and feeds multiple classification heads would:

  • Eliminate duplicate feature extraction on every frame
  • Allow wake word and context classification to share a backbone (efficient on edge hardware)
  • Enable a single ONNX-native model exportable to K210 KMODEL, ESP32-S3 TFLite, etc.
  • Let CF own the full pipeline for privacy and training purposes
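As a minimal sketch of the shared-backbone idea: one feature pass per frame, two heads consuming it. The feature function and weights below are illustrative placeholders (a crude log-band spectrum standing in for the real MFCC/mel pipeline, random untrained weights), not the actual cf_voice design; a real model would be trained end to end and exported as a single ONNX graph with two outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def acoustic_features(frame: np.ndarray, n_bins: int = 40) -> np.ndarray:
    """Shared front end: log power spectrum pooled into n_bins bands.
    Placeholder for the real MFCC / mel spectrogram pipeline."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    bands = np.array_split(spectrum, n_bins)
    return np.log(np.array([b.mean() for b in bands]) + 1e-10)

# Two lightweight heads sharing one feature pass. Weights are random
# placeholders; shapes are illustrative (1 wake score, 4 context classes).
W_wake = rng.standard_normal((40, 1))
W_context = rng.standard_normal((40, 4))

def classify(frame: np.ndarray):
    feats = acoustic_features(frame)           # extracted ONCE per frame
    wake_score = 1.0 / (1.0 + np.exp(-(feats @ W_wake)[0]))   # sigmoid head
    ctx_logits = feats @ W_context
    ctx_logits = ctx_logits - ctx_logits.max() # numerically stable softmax
    context_probs = np.exp(ctx_logits) / np.exp(ctx_logits).sum()
    return wake_score, context_probs

wake_score, context_probs = classify(rng.standard_normal(400))  # one 25 ms frame
```

Because both heads are plain tensor ops on the same feature vector, the whole graph exports as one ONNX model with two outputs, which is what makes the single K210 KMODEL / ESP32-S3 TFLite artifact feasible.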

Evaluation steps before deciding to build

  1. Assess OpenWakeWord (MIT, by David Scripka) — built on pre-trained Google speech embeddings, more modern than Precise, and already ONNX-friendly. May be the right intermediate choice before building anything.
  2. Benchmark Precise vs OpenWakeWord on Heimdall for accuracy and latency
  3. Profile K210 edge deployment friction — how painful is the TF→ONNX→KMODEL path for Precise vs. a native ONNX model?
  4. Identify shared feature extraction surface with cf_voice.context — how much overlap is there really?
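For step 2, a small engine-agnostic harness keeps the Precise and OpenWakeWord numbers comparable. The stub `detect` callable below is a placeholder; on Heimdall it would be replaced by each engine's actual per-frame predict call, and the 1280-sample (80 ms at 16 kHz) chunk size is an assumption:

```python
import statistics
import time

def benchmark(detect, frames, warmup: int = 10) -> dict:
    """Measure per-frame latency of a wake word engine's detect(frame) callable.

    `detect` is a stand-in for the real engine's predict call; `frames` is a
    list of audio chunks in whatever format that engine expects.
    """
    for f in frames[:warmup]:
        detect(f)                               # let caches warm up
    timings_ms = []
    for f in frames:
        t0 = time.perf_counter()
        detect(f)
        timings_ms.append((time.perf_counter() - t0) * 1000.0)
    timings_ms.sort()
    return {
        "p50_ms": statistics.median(timings_ms),
        "p95_ms": timings_ms[int(0.95 * len(timings_ms)) - 1],
        "max_ms": timings_ms[-1],
    }

# Stub engine and synthetic silence; swap in real engines and recorded audio.
frames = [[0.0] * 1280 for _ in range(200)]     # assumed 80 ms chunks @ 16 kHz
stats = benchmark(lambda f: sum(f), frames)
```

Reporting p95 and max alongside the median matters here: a wake word engine that usually fits the 100 ms budget but occasionally spikes past it will still feel laggy.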

The long-term differentiator: federated training

A privacy-preserving, federated wake word training system:

  • Users train on-device with their own voice samples
  • Contribute anonymized model updates (gradients/weights, not audio) to a shared base model
  • No audio ever leaves the user's hardware
  • Community-improved models without a central audio corpus

No commercial wake word system offers this. It is directly on-brand for CF's privacy-first mission and would be a genuine differentiator for Minerva as a platform.
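The aggregation step in the bullets above can be sketched as FedAvg-style weighted averaging: clients send only weight deltas from on-device training, weighted by local sample count, and the server never sees audio. The shapes and counts below are toy values for illustration:

```python
import numpy as np

def federated_average(base: np.ndarray,
                      client_updates: list[np.ndarray],
                      counts: list[int]) -> np.ndarray:
    """FedAvg-style aggregation of on-device weight deltas.

    Each client trains locally on its own voice samples and contributes only
    a weight delta; deltas are weighted by the client's local sample count.
    No audio is ever transmitted.
    """
    total = sum(counts)
    delta = sum(n / total * upd for upd, n in zip(client_updates, counts))
    return base + delta

base = np.zeros(4)                        # toy base-model weights
updates = [np.array([0.4, 0.0, 0.0, 0.0]),   # client A's local delta
           np.array([0.0, 0.2, 0.0, 0.0])]   # client B's local delta
new_base = federated_average(base, updates, counts=[100, 300])
```

One caveat worth carrying into the design: raw gradients/weight deltas can still leak information about training data, so a production pipeline would pair this with secure aggregation or differential privacy rather than relying on "not audio" alone for anonymity.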

Decision criteria for building vs. adopting

Build a CF-native implementation when:

  • cf_voice.context is mature and the shared feature extraction surface is clear
  • Precise or OpenWakeWord is proving to be a friction point in real deployment
  • The federated training pipeline has community appetite to justify the ML investment
  • Minerva has enough deployment data to define what the wake word engine actually needs to do differently

Adopt/wrap OpenWakeWord when:

  • Alpha and Beta milestones need a working engine now
  • Edge deployment to K210/ESP32-S3 works acceptably with existing tools

Suggested sequencing

  1. Now (Alpha): Evaluate OpenWakeWord vs Precise — pick the better-fitting option
  2. Beta/Hardware v1: Revisit if K210 KMODEL conversion is still painful
  3. Platform milestone: Design cf_voice.acoustic as unified backbone; wake word as one output head
  4. Long term: Federated training pipeline if community demand exists

References

  • circuitforge-core#34: cf_voice module (context classifier shares acoustic infrastructure)
  • Minerva milestones: Alpha (64), Platform (68)
  • OpenWakeWord: https://github.com/dscripka/openWakeWord
  • Mycroft Precise community fork: https://github.com/OpenVoiceOS/ovos-precise-runner
pyr0ball added this to the Platform milestone 2026-04-06 22:27:55 -07:00
pyr0ball added the enhancement label 2026-04-06 22:27:55 -07:00
Reference: Circuit-Forge/minerva#10