VAD & Turn Detection

How osmTalk decides when you've finished speaking — and how to fix it when it gets it wrong.

Voice agents need to know when you've stopped talking so the bot can respond. osmTalk uses two cooperating signals:

  1. Silero VAD v6 — a 2 MB neural model that scores each 32 ms audio frame for "is this speech, yes or no?"
  2. osmTalk Smart Turn v3.2 — a Whisper-encoder-based model that semantically detects "did the speaker finish a thought, or are they pausing mid-sentence?"

When VAD reports silence and Smart Turn agrees the utterance is complete, the bot starts generating its response. If either says "wait", the bot keeps listening.
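The decision rule above can be sketched as a single function. This is a minimal illustration and not osmTalk's actual implementation: the name `should_end_turn` and its arguments are hypothetical, and the forced-timeout branch assumes the smartTurnStopSecs behavior described in the defaults table (a cap on how long the bot waits after VAD reports silence).

```python
def should_end_turn(
    vad_silent: bool,
    smart_turn_complete: bool,
    secs_since_vad_stop: float,
    smart_turn_stop_secs: float = 0.8,  # web default from the table
) -> bool:
    """Return True when the bot should start generating its response."""
    if not vad_silent:
        # VAD still hears speech: keep listening.
        return False
    if smart_turn_complete:
        # Both signals agree the utterance is finished.
        return True
    # Smart Turn says "wait": only force end-of-turn once the cap is reached.
    return secs_since_vad_stop >= smart_turn_stop_secs
```

For example, `should_end_turn(True, False, 0.3)` keeps listening, while the same call with `secs_since_vad_stop=0.8` forces the end of the turn.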

Default tuning

Channel-aware defaults are applied automatically when an agent is created. Override them in Advanced → VAD & Turn Detection.

| Parameter | Web (mic) | Phone / WhatsApp call | What it does |
|---|---|---|---|
| vadConfidence | 0.70 | 0.75 | Higher = more strict; raises bar for "this is speech" |
| vadStartSecs | 0.20 | 0.20 | Time of speech before "user started speaking" fires |
| vadStopSecs | 0.20 | 0.20 | Silence required before "user stopped speaking" fires |
| vadMinVolume | 0.60 | 0.70 | Minimum normalized volume to consider speech |
| smartTurnStopSecs | 0.80 | 1.20 | Max wait after VAD-stop before forcing end-of-turn |
| smartTurnPreSpeechMs | 200 | 300 | Audio context window before the turn-detector model |
| audioIdleTimeoutSecs | 15 | 15 | Force "speech ended" if mic goes silent for N seconds |
| filterIncompleteTurns | true | true | Drop the user's turn if the LLM judges it incomplete |

These defaults combine osmTalk's recommended values with the May 2026 hardening pass. Phone audio (mu-law 8 kHz, upsampled to 16 kHz internally) carries more line noise than web audio, so vadConfidence and vadMinVolume are slightly higher; smartTurnStopSecs is longer to absorb dropouts.
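To make the override behavior concrete, here is a small sketch that layers per-agent overrides on top of the channel defaults listed above. The values come from the table; the `CHANNEL_DEFAULTS` dictionary and the `resolve_vad_config` helper are hypothetical names used for illustration, not part of the osmTalk API.

```python
from typing import Optional

# Values copied from the defaults table; the structure itself is an assumption.
CHANNEL_DEFAULTS = {
    "web": {
        "vadConfidence": 0.70, "vadStartSecs": 0.20, "vadStopSecs": 0.20,
        "vadMinVolume": 0.60, "smartTurnStopSecs": 0.80,
        "smartTurnPreSpeechMs": 200, "audioIdleTimeoutSecs": 15,
        "filterIncompleteTurns": True,
    },
    "phone": {
        "vadConfidence": 0.75, "vadStartSecs": 0.20, "vadStopSecs": 0.20,
        "vadMinVolume": 0.70, "smartTurnStopSecs": 1.20,
        "smartTurnPreSpeechMs": 300, "audioIdleTimeoutSecs": 15,
        "filterIncompleteTurns": True,
    },
}


def resolve_vad_config(channel: str, overrides: Optional[dict] = None) -> dict:
    """Return the channel defaults with any explicit overrides applied."""
    if channel not in CHANNEL_DEFAULTS:
        raise ValueError(f"unknown channel: {channel!r}")
    config = dict(CHANNEL_DEFAULTS[channel])  # copy so defaults stay untouched
    overrides = overrides or {}
    unknown = set(overrides) - set(config)
    if unknown:
        # Reject typos such as "vadConfidance" instead of silently ignoring them.
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    config.update(overrides)
    return config
```

Calling `resolve_vad_config("web", {"smartTurnStopSecs": 0.5})` would return the web defaults with only the turn-detector cap changed.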

"VAD is not working" — debug guide

If callers complain the bot doesn't hear them, interrupts them, or feels laggy, work through this in order. Each item lists the symptom first, then the fix.

1. Sample rate is wrong

Symptom: Smart Turn silently misclassifies most utterances — 2× pitch in audio recordings is the giveaway. Common on Twilio/Plivo trunks if upsampling isn't enabled.

Fix: osmTalk forces audio_in_sample_rate=16000 everywhere as of May 2026. Verify in bot logs:

docker compose -f docker-compose.prod.yml logs bot --since 5m | grep "Pipeline configured"
# Expected: audio_in=16000Hz
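If you prefer to check this from a script rather than by eye, the following sketch extracts the input rate from a log line. It assumes the line format shown in the "Reading bot logs" section (`Pipeline configured: audio_in=16000Hz audio_out=24000Hz`); the function names are hypothetical.

```python
import re
from typing import Optional

# Matches the audio_in value on a "Pipeline configured" log line.
_AUDIO_IN = re.compile(r"Pipeline configured.*?audio_in=(\d+)Hz")


def audio_in_rate(log_line: str) -> Optional[int]:
    """Return the input sample rate in Hz, or None if the line doesn't match."""
    match = _AUDIO_IN.search(log_line)
    return int(match.group(1)) if match else None


def is_sample_rate_ok(log_line: str, expected_hz: int = 16000) -> bool:
    """True only when the line reports the expected input sample rate."""
    return audio_in_rate(log_line) == expected_hz
```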

2. Bot interrupts itself (no acoustic echo cancellation)

Symptom: The bot starts speaking, then immediately stops as if interrupted — its own output is being picked up by the caller's mic (typical when the caller is on speakerphone).

Fixes (any of):

  • Web widget: Browser AEC is on by default in the osmTalk web client unless explicitly disabled. Confirm the widget is calling getUserMedia({ audio: { echoCancellation: true } }).
  • Phone calls: Enable background noise cancellation in the osmTalk media-layer config (currently optional). For aggressive cafe / call-center noise, contact us about enabling Krisp VIVA on your account.
  • As a fallback, raise vadMinVolume to 0.80 so the bot's own muffled echo doesn't cross the speech threshold.

3. Short utterances dropped ("OK", "yes", "no")

Symptom: Caller answers "yes" → bot responds with a generic greeting instead of acknowledging.

Fix: osmTalk's voice pipeline ships Smart Turn v3.2, which was specifically retrained on short-utterance datasets. Older deployments on v3.0/v3.1 see this regularly. Upgrade by rebuilding the bot container:

docker compose -f docker-compose.prod.yml up -d --build bot

If short words are still dropped, lower vadStartSecs from 0.2 to 0.1 so that a shorter burst of speech is enough for VAD to declare that the user started speaking.
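The effect of vadStartSecs on short words can be illustrated with a simplified model of the start trigger. This sketch assumes the trigger fires once a run of consecutive 32 ms frames scoring at or above vadConfidence lasts at least vadStartSecs; it ignores vadMinVolume, and the function name and the five-frame "yes" are hypothetical.

```python
import math

FRAME_SECS = 0.032  # one VAD frame, per the overview above


def fires_speech_start(frame_scores, confidence=0.70, start_secs=0.20):
    """True if a run of consecutive speech frames lasts at least start_secs."""
    needed = math.ceil(start_secs / FRAME_SECS)  # frames required to trigger
    run = 0
    for score in frame_scores:
        # Extend the run on a speech frame, reset it on a non-speech frame.
        run = run + 1 if score >= confidence else 0
        if run >= needed:
            return True
    return False


# A short "yes" of roughly 160 ms is about five frames of confident speech.
short_yes = [0.9] * 5
```

Under these assumptions, `fires_speech_start(short_yes)` is False at the 0.2 default (seven frames needed), while `fires_speech_start(short_yes, start_secs=0.1)` is True (four frames needed).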

4. STT misses what VAD captured

Symptom: Bot logs show VADUserStartedSpeaking followed quickly by VADUserStoppedSpeaking, but the user content in the turn that follows is empty or just a filler such as "Hmm".

Cause: Your STT is returning too slowly or returning garbage (for example, Sarvam saaras:v3 on English, or Deepgram on a very noisy line).

Fixes:

  • For English, use Deepgram nova-3-general — it's both the most accurate and the fastest.
  • For Indian languages, use Sarvam saaras:v3 (it's tuned for these).
  • For Hindi-English code-switching, use Deepgram nova-3-general with sttLanguage: "multi".
  • Keep filterIncompleteTurns: true so noise-only "turns" don't fire the LLM.
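The recommendations above amount to a small lookup. The sketch below encodes them; the model names and sttLanguage come from this guide, while the `pick_stt` helper, the `sttProvider` and `sttModel` keys, and the list of language codes are assumptions made for illustration.

```python
# Illustrative subset of Indian-language codes; not an exhaustive or official list.
INDIAN_LANGUAGES = {"hi", "ta", "te", "bn", "mr", "gu", "kn", "ml", "pa"}


def pick_stt(language: str, code_switching: bool = False) -> dict:
    """Suggest an STT setup following the recommendations in this section."""
    if code_switching:
        # Hindi-English code-switching: Deepgram in multilingual mode.
        return {"sttProvider": "deepgram", "sttModel": "nova-3-general",
                "sttLanguage": "multi"}
    if language == "en":
        return {"sttProvider": "deepgram", "sttModel": "nova-3-general"}
    if language in INDIAN_LANGUAGES:
        return {"sttProvider": "sarvam", "sttModel": "saaras:v3"}
    raise ValueError(f"no recommendation for language {language!r}")
```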

5. VAD never fires — bot doesn't react at all

Symptom: Caller speaks but the bot stays silent. No VADUserStartedSpeaking log lines.

Causes (in order of frequency):

  1. Mic permission denied in the browser. Check the browser console for NotAllowedError. Ask the caller to enable mic access in their browser settings.
  2. Audio track never published — check the osmTalk media-layer logs for mediaTrack published from the user's participant. If missing, the session token might lack canPublish permission.
  3. vadMinVolume too high — quiet mic. Drop to 0.5.
  4. Wrong language code on STT — Sarvam will reject English with saarika:v2.5 if sttLanguage is en (it expects en-IN). osmTalk maps these for you, but custom integrations may not.

6. Bot reacts too slowly to end-of-turn

Symptom: Caller stops talking; bot waits 2-3 seconds before responding.

Fix: This is smartTurnStopSecs doing its job: it is the maximum wait after silence before the bot gives up on "maybe they have more to say". Drop it to 0.5 for snappier replies, at the cost of a few extra mid-sentence interruptions.
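A rough upper bound on the end-of-turn detection delay can be worked out from the two timers. This sketch assumes they run back to back (vadStopSecs of silence, then up to smartTurnStopSecs of waiting); it does not model STT, LLM, or TTS latency, which add to what the caller perceives. The function name is hypothetical.

```python
def worst_case_end_of_turn_secs(vad_stop_secs: float, smart_turn_stop_secs: float) -> float:
    """Upper bound on detection delay when Smart Turn never says 'complete'."""
    return round(vad_stop_secs + smart_turn_stop_secs, 2)


web_default = worst_case_end_of_turn_secs(0.20, 0.80)    # 1.0 s
phone_default = worst_case_end_of_turn_secs(0.20, 1.20)  # 1.4 s
tuned = worst_case_end_of_turn_secs(0.20, 0.50)          # 0.7 s
```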

7. Bot keeps cutting itself off mid-sentence

Symptom: Bot says "Hi, this is —" and stops. Logs show the user-aggregator firing during the bot's own audio.

Fix: Enable a mute strategy. In Advanced Settings:

  • muteDuringWelcome: true. Caller audio is ignored until the welcome message finishes.
  • muteDuringFunctionCalls: true. Caller audio is ignored while the bot waits for tool results.

Both default to true for new agents.
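The two flags can be read as a simple gate on incoming audio. This sketch assumes that "muted" means caller audio is ignored while the corresponding condition holds; the function and its state arguments are hypothetical and only illustrate the logic.

```python
def should_ignore_user_audio(
    welcome_playing: bool,
    function_call_pending: bool,
    mute_during_welcome: bool = True,
    mute_during_function_calls: bool = True,
) -> bool:
    """True when incoming audio should not be treated as a user turn."""
    if mute_during_welcome and welcome_playing:
        return True  # echo of the welcome message must not interrupt it
    if mute_during_function_calls and function_call_pending:
        return True  # nothing useful to respond with until the tool returns
    return False
```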

8. False interruptions during noisy environments

Symptom: Callers on cafe/car/airport audio say "It keeps interrupting me!"

Fix: Smart Turn v3.2 was specifically retrained on cafe/office noise to address this. Beyond that:

  • Raise vadConfidence to 0.8 (default 0.75 on phone, 0.7 on web)
  • Raise vadMinVolume to 0.75-0.80
  • Enable osmTalk's web-side noise cancellation (web only); contact us about server-side Krisp for the loudest environments

What changed in May 2026

| Item | Before | After |
|---|---|---|
| osmTalk voice pipeline | v1.0 | v1.1 (Smart Turn v3.2 + Silero VAD v6.2) |
| Sample-rate enforcement | Implicit | Explicit 16 kHz in / variable out |
| audio_idle_timeout_secs | Not set | 15s default |
| Media-layer server | :latest | Pinned v1.11.0 |
| SIP gateway | :latest | Pinned v1.3.0 |
| Channel-aware VAD defaults | Same on web + phone | Higher confidence/volume on phone |

Reading bot logs

When debugging, these log lines are gold:

docker compose -f docker-compose.prod.yml logs -f bot --since 5m | \
  grep -E "VAD configured|Smart Turn|Pipeline configured|VADUser|SmartTurn|Transcript|filter"

Look for:

  • VAD configured (channel=phone): confidence=0.75 ... — confirms your defaults
  • Smart Turn v3 (channel=phone): stop_secs=1.2 ... — confirms turn-detector
  • Pipeline configured: audio_in=16000Hz audio_out=24000Hz — confirms sample rate
  • VADUserStartedSpeaking followed by TranscriptionFrame — happy path
  • VADUserStartedSpeaking followed by LLMUserAggregator ... filtered incomplete turn — STT was empty/garbage
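For longer logs, the last two patterns can be tallied with a short script. This sketch matches only the substrings quoted above (VADUserStartedSpeaking, TranscriptionFrame, "filtered incomplete turn"); the rest of each log line's format is unknown here, and `classify_turns` is a hypothetical helper.

```python
def classify_turns(log_lines):
    """Label each VADUserStartedSpeaking event as 'ok', 'filtered', or 'unknown'."""
    outcomes = []
    open_turn = False  # True between a speech start and its resolution
    for line in log_lines:
        if "VADUserStartedSpeaking" in line:
            if open_turn:
                outcomes.append("unknown")  # previous turn never resolved
            open_turn = True
        elif open_turn and "filtered incomplete turn" in line:
            outcomes.append("filtered")  # STT was empty or garbage
            open_turn = False
        elif open_turn and "TranscriptionFrame" in line:
            outcomes.append("ok")  # happy path
            open_turn = False
    if open_turn:
        outcomes.append("unknown")
    return outcomes
```

A high share of "filtered" outcomes points at the STT issues in item 4, while many "unknown" outcomes suggest that transcripts are not arriving at all.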

When to escalate

If you've tried all of the above and the issue persists, capture a stereo recording of one bad call (set enableMultiChannelRecording: true on the org), and share bot.log for that call_id. The stereo recording lets us isolate what the caller actually said vs what the agent heard — usually pinpoints the issue in under 5 minutes.
