Picking the Voice
How to make your agent sound the way you want — accent, language, and speed.
Your agent uses three AI services to have a conversation. You pick each one. Think of it like assembling a team:
| Service | What it does | Like a... |
|---|---|---|
| STT (Speech-to-Text) | Hears the caller and writes down what they said | Stenographer |
| LLM (Language Model) | Thinks about what to say back | Brain |
| TTS (Text-to-Speech) | Speaks the reply out loud | Voice actor |
Each one has a few choices with different trade-offs.
Quick recommendations
| Your situation | STT | LLM | TTS |
|---|---|---|---|
| English customers, want it cheap | Deepgram | OpenAI gpt-5.4-mini | Deepgram Helena |
| English customers, want top quality | Deepgram | Anthropic claude-sonnet-4-6 | ElevenLabs Sarah |
| Hindi / Tamil / regional language | Sarvam saaras:v3 | OpenAI gpt-5.4-mini | ElevenLabs Multilingual |
| Mixed Hindi-English ("Hinglish") | Deepgram nova-3 (multi) | OpenAI gpt-5.4-mini | ElevenLabs Multilingual |
| Highest speed, lowest cost | Groq Whisper Turbo | Groq llama-3.3-70b | Groq Orpheus |
If unsure, take row 1.
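The table can also be read as a simple lookup. The sketch below encodes it in Python. The `RECOMMENDED` dictionary, its situation keys, and the `recommend` helper are illustrative names made up for this example, not part of the product.

```python
# Hypothetical lookup mirroring the quick-recommendations table.
# The keys and the helper name are illustrative, not a platform API.
RECOMMENDED = {
    "english_cheap": ("Deepgram", "OpenAI gpt-5.4-mini", "Deepgram Helena"),
    "english_quality": ("Deepgram", "Anthropic claude-sonnet-4-6", "ElevenLabs Sarah"),
    "regional": ("Sarvam saaras:v3", "OpenAI gpt-5.4-mini", "ElevenLabs Multilingual"),
    "hinglish": ("Deepgram nova-3 (multi)", "OpenAI gpt-5.4-mini", "ElevenLabs Multilingual"),
    "fast_cheap": ("Groq Whisper Turbo", "Groq llama-3.3-70b", "Groq Orpheus"),
}


def recommend(situation: str = "english_cheap") -> dict:
    """Return the STT/LLM/TTS trio for a situation, falling back to row 1."""
    stt, llm, tts = RECOMMENDED.get(situation, RECOMMENDED["english_cheap"])
    return {"stt": stt, "llm": llm, "tts": tts}
```

An unknown situation falls back to the first row, matching the "if unsure" advice.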
The ears: Speech-to-Text (STT)
This service hears the caller and converts speech to text. It runs roughly 20-30 times per minute of conversation.
| Provider | Best for | Price (per minute) |
|---|---|---|
| Deepgram nova-3-general ⭐ | English (any accent) | $0.0077 |
| Deepgram nova-3-medical | Medical conversations | $0.0145 |
| Deepgram nova-2-phonecall | Bad-quality phone audio | $0.0058 |
| Sarvam saaras:v3 | Hindi, Tamil, Telugu, Kannada, etc. | $0.0083 |
| Sarvam saarika:v2.5 | Indian languages (older) | $0.0083 |
| Groq Whisper Turbo | Cheapest option, lower accuracy | $0.0006 |
| ElevenLabs Scribe v2 | High accuracy batch | $0.0083 |
English: Use Deepgram. It's faster and more accurate than the others.
Indian languages: Use Sarvam. Deepgram does NOT support Hindi/Tamil/etc.
Hinglish (code-switching): Set language to multi and use Deepgram nova-3.
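The three rules above can be sketched as a small selection function. This is a minimal illustration: `choose_stt`, its parameters, and the short language codes (`"hi"`, `"ta"`, and so on) are assumptions for the example, and the platform's own setting names may differ.

```python
# Illustrative sketch of the three STT rules above. The function name,
# parameters, and language codes are assumptions, not a platform API.
INDIAN_LANGUAGES = {"hi", "ta", "te", "kn"}  # Hindi, Tamil, Telugu, Kannada


def choose_stt(language: str, code_switching: bool = False) -> dict:
    """Pick an STT provider, model, and language setting for a call."""
    if code_switching:
        # Hinglish: Deepgram nova-3 with the language set to "multi".
        return {"provider": "Deepgram", "model": "nova-3", "language": "multi"}
    if language in INDIAN_LANGUAGES:
        return {"provider": "Sarvam", "model": "saaras:v3", "language": language}
    if language.startswith("en"):
        # English (any accent): the starred Deepgram model from the table.
        return {"provider": "Deepgram", "model": "nova-3-general", "language": language}
    raise ValueError(f"This guide has no recommendation for language {language!r}")
```

For example, `choose_stt("hi")` selects Sarvam, while `choose_stt("hi", code_switching=True)` selects Deepgram with the `multi` language setting.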
The brain: Language Model (LLM)
This is what decides what the bot says. It's by far the most important choice for quality.
| Model | Speed | Cost per 1M input tokens | When to pick |
|---|---|---|---|
| OpenAI gpt-5.4-nano | ⚡⚡⚡ | $0.20 | Simple FAQs, light dialog |
| OpenAI gpt-5.4-mini ⭐ | ⚡⚡⚡ | $0.40 | Default — most use cases |
| OpenAI gpt-5.4 | ⚡⚡ | $2.50 | Complex reasoning, agentic tasks |
| Anthropic claude-haiku-4-5 | ⚡⚡⚡ | $1.00 | Multilingual, formal tone |
| Anthropic claude-sonnet-4-6 | ⚡⚡ | $3.00 | Balanced quality + speed |
| Anthropic claude-opus-4-7 | ⚡ | $5.00 | Premium quality — long, complex calls |
| Groq llama-3.3-70b | ⚡⚡⚡⚡ | $0.59 | When you need very low latency |
| Groq gpt-oss-120b | ⚡⚡⚡⚡ | $0.15 | Cheap + open-source |
| Groq qwen3-32b | ⚡⚡⚡⚡ | $0.29 | Multi-language |
⭐ = default. Don't change unless you have a reason.
The voice: Text-to-Speech (TTS)
This is the voice the caller hears. Each TTS provider has multiple voices.
Deepgram (Aura-2) — best balance
Fast, natural-sounding English. 14 voices are included in the price; 11 are listed below.
Female:
- Helena ⭐ — Warm, professional (default)
- Asteria — Confident, articulate
- Luna — Friendly, casual
- Athena — Authoritative
- Aurora — Bright, energetic
- Iris — Gentle, soothing
Male:
- Orpheus — Smooth, deep
- Apollo — Professional
- Zeus — Commanding
- Hermes — Friendly
- Atlas — Strong, mature
Price: $15 per 1M characters (about ₹2.50 per minute of speech).
ElevenLabs — best quality, especially for non-English
Models:
- eleven_flash_v2_5 ⭐ — Best for voice agents, ~75ms latency
- eleven_turbo_v2_5 — Higher quality, ~250ms latency
- eleven_multilingual_v2 — 29 languages, highest quality, slower
Recommended voices:
- Sarah (EXAVITQu4vr4xnSDxMaL) ⭐ — Mature female, English
- Roger (CwhRBWXzGAHq8TQ4Fs17) — Casual male, English
- George — Warm British storyteller
- Daniel — Steady British broadcaster
For Hindi/Tamil/etc., use eleven_multilingual_v2 with any voice — they handle all 29 languages naturally.
Price: $50 per 1M characters for Flash, more for Turbo/Multilingual.
Groq (Orpheus) — cheapest
Six English-only voices: autumn, diana, hannah, austin, daniel, troy.
Price: $22 per 1M characters.
Setup note: First-time Orpheus use requires accepting Groq's terms once at console.groq.com/playground?model=canopylabs/orpheus-v1-english. One-time, per Groq org.
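To turn the per-character prices quoted in the three provider sections into a cost for a given amount of text, a rough calculation like the one below may help. The dictionary keys and the `tts_cost` helper are hypothetical names. The ElevenLabs figure is the Flash price only, since the guide does not give exact prices for Turbo or Multilingual.

```python
# Per-1M-character prices in USD, as quoted in the provider sections.
# The keys and the helper name are illustrative, not a platform API.
TTS_PRICE_PER_MILLION = {
    "deepgram_aura2": 15.0,
    "elevenlabs_flash": 50.0,
    "groq_orpheus": 22.0,
}


def tts_cost(provider: str, characters: int) -> float:
    """Cost in USD to synthesize `characters` characters of text."""
    return TTS_PRICE_PER_MILLION[provider] * characters / 1_000_000


# Example: 100,000 characters of spoken replies on each provider.
for name in TTS_PRICE_PER_MILLION:
    print(f"{name}: ${tts_cost(name, 100_000):.2f}")
```

Actual bills depend on how many characters your agent speaks per call, which varies with the prompt and the conversation.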
Voice speed and tuning (ElevenLabs only)
In Configure → Voice → Advanced, you can adjust:
| Setting | Range | Default | What it does |
|---|---|---|---|
| Speed | 0.7 – 1.2 | 1.0 | How fast the bot talks |
| Stability | 0 – 1 | 0.7 | Higher = more consistent. Lower = more emotional range. |
| Similarity boost | 0 – 1 | 0.75 | Higher = closer to the original voice |
| Style | 0 – 1 | 0 | Adds expressive style. Higher = more emotive but slower. |
| Speaker boost | on/off | off | Improves clarity (slight latency hit) |
90% of users only ever touch Speed.
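If you manage these settings outside the UI (for example in a config file you keep under version control), a range check along these lines can catch bad values early. This is a sketch only: the setting keys and the `build_voice_settings` helper are hypothetical, and the platform's real field names may differ. The ranges and defaults are copied from the table above.

```python
# (low, high, default) for each numeric setting, taken from the table above.
# The key names and this helper are hypothetical, not a platform API.
VOICE_SETTINGS = {
    "speed": (0.7, 1.2, 1.0),
    "stability": (0.0, 1.0, 0.7),
    "similarity_boost": (0.0, 1.0, 0.75),
    "style": (0.0, 1.0, 0.0),
}


def build_voice_settings(speaker_boost: bool = False, **overrides: float) -> dict:
    """Merge overrides onto the defaults, rejecting out-of-range values."""
    settings = {name: default for name, (_, _, default) in VOICE_SETTINGS.items()}
    for name, value in overrides.items():
        if name not in VOICE_SETTINGS:
            raise KeyError(f"Unknown setting: {name}")
        low, high, _ = VOICE_SETTINGS[name]
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}, got {value}")
        settings[name] = value
    settings["speaker_boost"] = speaker_boost
    return settings
```

Calling `build_voice_settings(speed=1.1)` changes only Speed and leaves every other setting at its default.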
Background sound (optional)
You can play a quiet ambient sound during calls so the agent feels more human:
| Sound | When it helps |
|---|---|
| None ⭐ | Default — most calls |
| Office | "Sales agent calling from an office" |
| Cafe | "Friend casually chatting" |
| Rain | Calming, late-night support |
| White noise | Hide your real environment |
| Nature | Outdoor / wellness brands |
| Keyboard | "Tech support typing while I talk" |
Volume slider goes 0-100. 40 is the right default — audible but not distracting.
Pronunciation tweaks (advanced)
If your brand name keeps getting mispronounced ("AIVF" said as "ay-vif" instead of "ay-eye-vee-eff"), add a pronunciation entry:
Configure → Advanced → Pronunciation Dictionary
[
  { "word": "AIVF", "pronunciation": "ay-eye-vee-eff" },
  { "word": "osmTalk", "pronunciation": "awsm-talk" }
]

Full guide: Pronunciation & Keyword Boost.
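To get a feel for what a pronunciation entry does, the sketch below applies the same two entries as whole-word substitutions on the reply text before it would reach TTS. How the platform applies the dictionary internally is not described here, so treat this as one plausible interpretation. The `apply_pronunciations` helper is a hypothetical name.

```python
import json
import re

# Same shape as the dictionary entries shown above.
ENTRIES = json.loads("""
[
  { "word": "AIVF", "pronunciation": "ay-eye-vee-eff" },
  { "word": "osmTalk", "pronunciation": "awsm-talk" }
]
""")


def apply_pronunciations(text: str, entries: list[dict]) -> str:
    """Replace each whole-word, case-sensitive match with its phonetic spelling."""
    for entry in entries:
        pattern = r"\b" + re.escape(entry["word"]) + r"\b"
        # A function replacement avoids re.sub interpreting backslashes.
        text = re.sub(pattern, lambda _m, rep=entry["pronunciation"]: rep, text)
    return text


print(apply_pronunciations("Welcome to AIVF, powered by osmTalk.", ENTRIES))
# Welcome to ay-eye-vee-eff, powered by awsm-talk.
```

Matching on word boundaries means a longer token such as "AIVFX" is left untouched.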
Per-agent provider keys (advanced)
If you want one agent to use YOUR Anthropic key and another to use the platform's, go to:
Settings → Provider Keys
Set a key globally there. Override per-agent in the agent's Voice tab if needed. Per-agent keys take priority over global keys.
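The priority order can be sketched as a small lookup. This assumes, based on the description above, that an agent with no key of its own and no global key falls back to the platform's key. The `resolve_provider_key` function and its arguments are hypothetical names for illustration.

```python
from typing import Optional


def resolve_provider_key(
    provider: str,
    agent_keys: dict[str, str],
    global_keys: dict[str, str],
) -> Optional[str]:
    """Return the key to use; None means fall back to the platform's key."""
    if provider in agent_keys:
        return agent_keys[provider]  # per-agent override wins
    return global_keys.get(provider)  # global key, or None if unset


# Example: one agent overrides the global Anthropic key, another does not.
global_keys = {"anthropic": "global-key"}
print(resolve_provider_key("anthropic", {"anthropic": "agent-key"}, global_keys))
print(resolve_provider_key("anthropic", {}, global_keys))
```

The first call prints the per-agent key and the second prints the global key.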
Common questions
"My bot sounds robotic."
Try ElevenLabs Sarah. It's the most natural English voice. For Hindi, try eleven_multilingual_v2.
"My Hindi callers complain the bot doesn't understand them."
Switch STT to Sarvam saaras:v3. Deepgram does not support Hindi.
"It's too slow."
Switch LLM to Groq llama-3.3-70b (4× faster than OpenAI). Quality is slightly lower but usually fine.
"It's expensive."
Switch to gpt-5.4-mini + Deepgram Helena (the defaults). You're probably overspending on a flagship model you don't need.
"It speaks in the wrong language."
Either your Language setting (Configure → Voice → Language) is wrong, or your system prompt says "respond in English". Check both.