Parallel tone classifiers: audio + text with timestamp sync for semantic divergence detection #22

Open
opened 2026-04-12 17:42:11 -07:00 by pyr0ball · 0 comments
Owner

## Concept

Run two tone classifiers in parallel, synced by timestamp, and detect divergence between audio affect and text/linguistic affect. The divergence itself is the semantic signal.

## Architecture

**AudioToneClassifier** (existing): wav2vec2/emotion — classifies prosody, energy, affect from raw audio

**TextToneClassifier** (new): runs on Whisper transcript text — lightweight sentiment/emotion model (or fast local LLM) on the text content

**ToneSyncAnalyzer** (new): combines both streams with timestamp alignment, maps divergence patterns to semantic modifiers:

| Audio tone | Text tone | Divergence signal |
|---|---|---|
| calm / flat | distressed / urgent language | emotional suppression / masking |
| rising / warm | neutral content | emphasis / enthusiasm |
| flat / monotone | hyperbolic or positive phrasing | sarcasm / irony |
| urgent / tense | hedged / softened language | passive aggression / people-pleasing |
| aligned | aligned | literal / confident communication |
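The ToneSyncAnalyzer's core combine step could be sketched as a rule table keyed on the two tone labels, with a minimum temporal overlap between the audio and text windows before a divergence is declared. This is a minimal sketch, not the implementation: the `ToneFrame` type, the label vocabulary, and the `0.5 s` overlap threshold are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed label vocabulary; real classifier outputs will differ.
# Rules mirror the divergence table above.
DIVERGENCE_RULES = {
    ("calm", "distressed"): "suppression/masking",
    ("warm", "neutral"): "emphasis/enthusiasm",
    ("flat", "positive"): "sarcasm/irony",
    ("tense", "hedged"): "passive-aggression/people-pleasing",
}

@dataclass
class ToneFrame:
    start: float       # window start, seconds
    end: float         # window end, seconds
    label: str         # classifier output for this window
    confidence: float

def overlap(a: ToneFrame, b: ToneFrame) -> float:
    """Seconds of temporal overlap between two tone windows."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))

def classify_divergence(audio: ToneFrame, text: ToneFrame,
                        min_overlap: float = 0.5) -> Optional[str]:
    """Map an (audio, text) tone pair to a divergence label,
    or None when the windows don't line up in time."""
    if overlap(audio, text) < min_overlap:
        return None
    if audio.label == text.label:
        return "aligned"
    return DIVERGENCE_RULES.get((audio.label, text.label), "unclassified")
```

The timestamp gate matters because Whisper segment boundaries rarely coincide exactly with the audio classifier's 1-2 s windows; requiring real overlap avoids pairing a tone frame with the wrong words.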

## Why this matters for Elcor

Currently, Elcor annotation is "what does the audio feel like." With divergence detection it becomes "what does this *mean* given the gap between what is said and how it is said" — much stronger for ND use cases where tonal subtext is the hard part.

## Implementation notes

- Text classifier runs on the same 1-2s windows as audio, using the Whisper transcript from that window
- Requires Whisper output to be reliable enough to classify (accuracy gating: skip if STT confidence is low)
- Combiner emits a new `DivergenceEvent` with `audio_tone`, `text_tone`, `divergence_type`, `confidence`
- Divergence type becomes an input to the Elcor prefix generator
- Could start with a simple sentiment lexicon / rule-based text classifier before graduating to a model
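The notes above could be sketched together: the `DivergenceEvent` shape from the issue, STT-confidence gating, and a lexicon-based starter text classifier. Everything beyond the four named event fields is an assumption — the lexicon entries, the `0.6` confidence floor, and the helper names are placeholders, not the project's API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DivergenceEvent:
    # Fields named in the issue; ordering and types are assumptions.
    audio_tone: str
    text_tone: str
    divergence_type: str
    confidence: float

# Toy sentiment lexicon: a rule-based stand-in before graduating to a model.
LEXICON = {
    "help": "distressed", "urgent": "distressed", "now": "distressed",
    "great": "positive", "amazing": "positive",
    "maybe": "hedged", "sorry": "hedged",
}

STT_CONFIDENCE_FLOOR = 0.6  # assumed gating threshold

def text_tone(transcript: str) -> str:
    """Majority vote over lexicon hits; 'neutral' when nothing matches."""
    tokens = [t.strip(".,!?") for t in transcript.lower().split()]
    hits = [LEXICON[t] for t in tokens if t in LEXICON]
    return max(set(hits), key=hits.count) if hits else "neutral"

def maybe_emit(audio_tone: str, transcript: str,
               stt_confidence: float) -> Optional[DivergenceEvent]:
    """Accuracy gating: skip the window when STT confidence is too low."""
    if stt_confidence < STT_CONFIDENCE_FLOOR:
        return None
    tone = text_tone(transcript)
    dtype = "aligned" if tone == audio_tone else f"{audio_tone}-vs-{tone}"
    return DivergenceEvent(audio_tone, tone, dtype, stt_confidence)
```

Gating on STT confidence before classifying keeps garbage transcripts from generating spurious divergence events, which is cheaper than trying to filter them downstream.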

## Dependencies

- Whisper accuracy improvements (in progress)
- cf-voice `context.py` classify pipeline
- Elcor annotation layer
pyr0ball added this to the Navigation — v0.2.x milestone 2026-04-17 11:56:15 -07:00
Reference: Circuit-Forge/linnet#22