VAD & Turn Detection
How osmTalk decides when you've finished speaking — and how to fix it when it gets it wrong.
Voice agents need to know when you've stopped talking so the bot can respond. osmTalk uses two cooperating signals:
- Silero VAD v6 — a 2 MB neural model that scores each 32 ms audio frame for "is this speech, yes or no?"
- osmTalk Smart Turn v3.2 — a Whisper-encoder-based model that semantically detects "did the speaker finish a thought, or are they pausing mid-sentence?"
When VAD reports silence and Smart Turn agrees the utterance is complete, the bot starts generating its response. If either says "wait", the bot keeps listening.
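The interplay between the two signals can be sketched as a small decision function. This is an illustration only, not osmTalk's implementation: the `TurnState` fields and `should_respond` are hypothetical names, and the forced end-of-turn after a maximum wait is taken from the smartTurnStopSecs description in the defaults table.

```python
from dataclasses import dataclass


@dataclass
class TurnState:
    vad_silent: bool            # VAD currently reports silence
    smart_turn_complete: bool   # turn model judges the utterance finished
    secs_since_vad_stop: float  # time elapsed since VAD reported "stopped"


def should_respond(state: TurnState, smart_turn_stop_secs: float = 0.8) -> bool:
    """Return True when the bot should start generating a response."""
    if not state.vad_silent:
        return False  # caller is still speaking, keep listening
    if state.smart_turn_complete:
        return True   # both signals agree the turn is over
    # Smart Turn says "wait", but only up to the configured ceiling.
    return state.secs_since_vad_stop >= smart_turn_stop_secs
```

For example, `should_respond(TurnState(True, False, 0.3))` keeps listening, while the same state at 0.9 seconds forces the end of the turn under the web default of 0.8.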
Default tuning
Channel-aware defaults are applied automatically when an agent is created. Override them in Advanced → VAD & Turn Detection.
| Parameter | Web (mic) | Phone / WhatsApp call | What it does |
|---|---|---|---|
| vadConfidence | 0.70 | 0.75 | Higher = more strict; raises bar for "this is speech" |
| vadStartSecs | 0.20 | 0.20 | Time of speech before "user started speaking" fires |
| vadStopSecs | 0.20 | 0.20 | Silence required before "user stopped speaking" fires |
| vadMinVolume | 0.60 | 0.70 | Minimum normalized volume to consider speech |
| smartTurnStopSecs | 0.80 | 1.20 | Max wait after VAD-stop before forcing end-of-turn |
| smartTurnPreSpeechMs | 200 | 300 | Audio context window before the turn-detector model |
| audioIdleTimeoutSecs | 15 | 15 | Force "speech ended" if mic goes silent for N seconds |
| filterIncompleteTurns | true | true | Drop the user's turn if the LLM judges it incomplete |
These defaults reflect osmTalk's recommended values plus the May 2026 hardening pass. Phone audio (mu-law 8 kHz, upsampled to 16 kHz internally) carries more line noise than web audio, so vadConfidence and vadMinVolume are slightly higher; smartTurnStopSecs is longer to absorb dropouts.
"VAD is not working" — debug guide
If callers complain the bot doesn't hear them, interrupts them, or feels laggy, work through this in order. Each item lists the symptom first, then the cause or fix.
1. Sample rate is wrong
Symptom: Smart Turn silently misclassifies most utterances — 2× pitch in audio recordings is the giveaway. Common on Twilio/Plivo trunks if upsampling isn't enabled.
Fix: osmTalk forces audio_in_sample_rate=16000 everywhere as of May 2026. Verify in bot logs:
docker compose -f docker-compose.prod.yml logs bot --since 5m | grep "Pipeline configured"
# Expected: audio_in=16000Hz
2. Bot interrupts itself (no acoustic echo cancellation)
Symptom: The bot starts speaking, then immediately stops as if interrupted — its own output is being picked up by the caller's mic (typical when the caller is on speakerphone).
Fixes (any of):
- Web widget: Browser AEC is on by default. Confirm the widget is calling getUserMedia({ audio: { echoCancellation: true } }); the default in the osmTalk web client is on unless explicitly disabled.
- Phone calls: Enable background noise cancellation in the osmTalk media-layer config. Currently optional. For aggressive cafe / call-center noise, contact us about enabling Krisp VIVA on your account.
- As a fallback, raise vadMinVolume to 0.80 so the bot's own muffled echo doesn't cross the speech threshold.
3. Short utterances dropped ("OK", "yes", "no")
Symptom: Caller answers "yes" → bot responds with a generic greeting instead of acknowledging.
Fix: osmTalk's voice pipeline ships Smart Turn v3.2, which was specifically retrained on short-utterance datasets. Older deployments on v3.0/v3.1 see this regularly. Upgrade by rebuilding the bot container:
docker compose -f docker-compose.prod.yml up -d --build bot
If short words are still dropped, lower vadStartSecs from 0.2 to 0.1 so the caller needs only 0.1 s of speech before VAD declares them speaking.
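To see why a shorter vadStartSecs helps with one-word answers, here is a back-of-the-envelope calculation using the 32 ms frame length quoted for Silero VAD. It assumes the start event needs an unbroken run of speech frames covering vadStartSecs, which is a simplification; the helper names and the 180 ms utterance length are illustrative, not measured.

```python
import math

FRAME_MS = 32  # frame length quoted for Silero VAD earlier in this guide


def frames_needed(vad_start_secs: float) -> int:
    """Number of consecutive speech frames needed to cover vad_start_secs."""
    start_ms = round(vad_start_secs * 1000)
    return math.ceil(start_ms / FRAME_MS)


def utterance_triggers_vad(utterance_ms: int, vad_start_secs: float) -> bool:
    """True if the utterance spans enough whole frames to fire the start event."""
    speech_frames = utterance_ms // FRAME_MS  # whole frames covered by speech
    return speech_frames >= frames_needed(vad_start_secs)


# A hypothetical 180 ms "yes" covers 5 whole frames.
print(frames_needed(0.2), utterance_triggers_vad(180, 0.2))  # 7 False
print(frames_needed(0.1), utterance_triggers_vad(180, 0.1))  # 4 True
```

Under these assumptions the same short word is missed at 0.2 and caught at 0.1, which matches the advice above.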
4. STT misses what VAD captured
Symptom: Bot logs show VADUserStartedSpeaking followed quickly by VADUserStoppedSpeaking, but the assistant turn that follows has empty or filler user content such as "Hmm".
Cause: Your STT (for example, Sarvam saaras:v3 on English, or Deepgram on a very noisy line) is returning results too slowly or returning garbage.
Fixes:
- For English, use Deepgram nova-3-general, which is both the most accurate and the fastest.
- For Indian languages, use Sarvam saaras:v3 (it's tuned for these).
- For Hindi-English code-switching, use Deepgram nova-3-general with sttLanguage: "multi".
- Keep filterIncompleteTurns: true so noise-only "turns" don't fire the LLM.
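The recommendations above could be encoded as a small selection helper. This is a sketch under assumptions: `sttProvider` and `sttModel` are hypothetical setting names (only `sttLanguage` appears in this guide), and the set of Indian language codes is an illustrative subset rather than a supported-language list.

```python
# Illustrative subset of language codes; extend to match your deployment.
INDIAN_LANGUAGES = {"hi", "ta", "te", "bn", "mr", "kn"}


def pick_stt(language: str, code_switching: bool = False) -> dict:
    """Suggest STT settings following the recommendations in this section."""
    if code_switching:
        # Hindi-English code-switching: Deepgram in multilingual mode.
        return {"sttProvider": "deepgram", "sttModel": "nova-3-general",
                "sttLanguage": "multi"}
    if language in INDIAN_LANGUAGES:
        return {"sttProvider": "sarvam", "sttModel": "saaras:v3",
                "sttLanguage": language}
    if language == "en":
        return {"sttProvider": "deepgram", "sttModel": "nova-3-general",
                "sttLanguage": "en"}
    raise ValueError(f"no recommendation in this guide for language: {language}")
```

The helper raises for languages this guide does not cover instead of guessing a provider.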
5. VAD never fires — bot doesn't react at all
Symptom: Caller speaks but the bot stays silent. No VADUserStartedSpeaking log lines.
Causes (in order of frequency):
- Mic permission denied in the browser. Check the browser console for NotAllowedError. Tell the caller to enable the mic in their browser settings.
- Audio track never published: check the osmTalk media-layer logs for mediaTrack published from the user's participant. If it is missing, the session token might lack canPublish permission.
- vadMinVolume too high for a quiet mic. Drop to 0.5.
- Wrong language code on STT: Sarvam will reject English with saarika:v2.5 if sttLanguage is en (it expects en-IN). osmTalk maps these for you, but custom integrations may not.
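For custom integrations that bypass osmTalk's built-in mapping, the language-code fix might look like the sketch below. Only the `en` to `en-IN` pair comes from this guide; the provider string `"sarvam"` and the function name are assumptions made for illustration.

```python
# Only the en -> en-IN pair comes from this guide; add others as you verify them.
SARVAM_LANGUAGE_CODES = {"en": "en-IN"}


def normalize_stt_language(provider: str, language: str) -> str:
    """Map a bare language code to the form the STT provider expects."""
    if provider.lower() != "sarvam":
        return language  # other providers are passed through unchanged
    return SARVAM_LANGUAGE_CODES.get(language, language)
```

Codes that are already region-qualified, such as `en-IN`, pass through untouched.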
6. Bot reacts too slowly to end-of-turn
Symptom: Caller stops talking; bot waits 2-3 seconds before responding.
Fix: This is smartTurnStopSecs doing its job: it is the maximum wait after silence before the bot gives up on "maybe they have more to say". Drop it to 0.5 for snappier replies, at the cost of a few extra mid-sentence interruptions.
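A quick way to reason about this trade-off is to add up the two waits that precede end-of-turn. The sketch below assumes they run back to back (vadStopSecs of silence, then up to smartTurnStopSecs) and ignores every later stage such as STT, LLM, and TTS; the function name is hypothetical.

```python
def worst_case_end_of_turn_ms(vad_stop_secs: float, smart_turn_stop_secs: float) -> int:
    """Upper bound on the wait between the caller going quiet and end-of-turn."""
    return round((vad_stop_secs + smart_turn_stop_secs) * 1000)


print(worst_case_end_of_turn_ms(0.2, 0.8))  # web defaults   -> 1000
print(worst_case_end_of_turn_ms(0.2, 1.2))  # phone defaults -> 1400
print(worst_case_end_of_turn_ms(0.2, 0.5))  # tuned          -> 700
```

If callers still report multi-second pauses after lowering smartTurnStopSecs, the remaining delay lies outside these two parameters.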
7. Bot keeps cutting itself off mid-sentence
Symptom: Bot says "Hi, this is —" and stops. Logs show the user-aggregator firing during the bot's own audio.
Fix: Enable a mute strategy. In Advanced Settings:
- muteDuringWelcome: true (bot is unmuted only after the welcome message finishes)
- muteDuringFunctionCalls: true (bot stays muted while waiting for tool results)
Both default to true for new agents.
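The effect of the two flags can be pictured with a minimal predicate. This sketch assumes that "muted" means incoming caller audio is ignored while the condition holds, which this guide does not spell out; the function and argument names are hypothetical.

```python
def input_muted(
    welcome_playing: bool,
    function_call_pending: bool,
    mute_during_welcome: bool = True,
    mute_during_function_calls: bool = True,
) -> bool:
    """Return True while incoming audio should be ignored (assumed semantics)."""
    if mute_during_welcome and welcome_playing:
        return True  # welcome message has not finished yet
    if mute_during_function_calls and function_call_pending:
        return True  # still waiting for a tool result
    return False
```

With both flags off, the predicate never mutes, which corresponds to the self-interruption symptom described above.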
8. False interruptions during noisy environments
Symptom: Callers on cafe/car/airport audio say "It keeps interrupting me!"
Fix: Smart Turn v3.2 was specifically retrained on cafe/office noise to address this. Beyond that:
- Raise vadConfidence to 0.8 (default 0.75 on phone, 0.7 on web)
- Raise vadMinVolume to 0.75-0.80
- Enable osmTalk's web-side noise cancellation (web only); contact us about server-side Krisp for the loudest environments
What changed in May 2026
| Item | Before | After |
|---|---|---|
| osmTalk voice pipeline | v1.0 | v1.1 (Smart Turn v3.2 + Silero VAD v6.2) |
| Sample-rate enforcement | Implicit | Explicit 16 kHz in / variable out |
| audio_idle_timeout_secs | Not set | 15s default |
| Media-layer server | :latest | Pinned v1.11.0 |
| SIP gateway | :latest | Pinned v1.3.0 |
| Channel-aware VAD defaults | Same on web + phone | Higher confidence/volume on phone |
Reading bot logs
When debugging, these log lines are gold:
docker compose -f docker-compose.prod.yml logs -f bot --since 5m | \
  grep -E "VAD configured|Smart Turn|Pipeline configured|VADUser|SmartTurn|Transcript|filter"
Look for:
- VAD configured (channel=phone): confidence=0.75 ... (confirms your defaults)
- Smart Turn v3 (channel=phone): stop_secs=1.2 ... (confirms turn-detector)
- Pipeline configured: audio_in=16000Hz audio_out=24000Hz (confirms sample rate)
- VADUserStartedSpeaking followed by TranscriptionFrame (happy path)
- VADUserStartedSpeaking followed by LLMUserAggregator ... filtered incomplete turn (STT was empty/garbage)
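For longer captures, the same checks can be scripted. The sketch below scans saved log lines for the markers listed above; it matches on substrings only, since the full log line layout is not documented here, and the function names are hypothetical.

```python
import re

PIPELINE_RE = re.compile(r"Pipeline configured: audio_in=(\d+)Hz")


def unexpected_sample_rates(log_lines, expected_hz=16000):
    """Return every reported input sample rate that differs from expected_hz."""
    rates = [int(m.group(1)) for line in log_lines
             if (m := PIPELINE_RE.search(line))]
    return [rate for rate in rates if rate != expected_hz]


def classify_turns(log_lines):
    """Count turns that reached a transcript vs turns that were filtered out."""
    counts = {"transcribed": 0, "filtered": 0}
    awaiting_outcome = False
    for line in log_lines:
        if "VADUserStartedSpeaking" in line:
            awaiting_outcome = True
        elif awaiting_outcome and "TranscriptionFrame" in line:
            counts["transcribed"] += 1   # happy path
            awaiting_outcome = False
        elif awaiting_outcome and "filtered incomplete turn" in line:
            counts["filtered"] += 1      # STT was empty or garbage
            awaiting_outcome = False
    return counts
```

A high ratio of filtered to transcribed turns points at section 4 (STT), while a non-empty result from `unexpected_sample_rates` points at section 1.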
When to escalate
If you've tried all of the above and the issue persists, capture a stereo recording of one bad call (set enableMultiChannelRecording: true on the org), and share bot.log for that call_id. The stereo recording lets us isolate what the caller actually said vs what the agent heard — usually pinpoints the issue in under 5 minutes.