Google Ships Its Most Expressive Gemini 3.1 Text-to-Speech Model Yet With 70+ Language Support
The Decoder, Wednesday, April 15th, 2026
Google is rolling out its new text-to-speech model based on Gemini 3.1 Flash. The company says it's the most natural and expressive voice output it has shipped to date. The big new feature is audio tags: simple text commands that let developers control the style, tempo, tone, and accent of the generated speech. The model supports over 70 languages and can handle multi-speaker dialogs.
On the Artificial Analysis leaderboard, the model reaches an Elo rating of 1,211 and stands out for its quality-to-price ratio. It beats ElevenLabs v3 in overall quality and sits just behind Inworld 1.5 Max.
Gemini 3.1 Flash TTS has a free tier, but Google uses the data to improve its products. The paid tier runs $1.00 per million tokens for text input and $20.00 per million tokens for audio output. Batch mode cuts those prices in half to $0.50 and $10.00, respectively. On the paid tier, Google doesn't use the data for product improvement.
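To make the pricing concrete, here is a rough cost estimator built only from the per-million-token rates quoted above. The function name, the dictionary layout, and the token counts in the usage example are illustrative assumptions, not part of Google's API, and the article does not say how many tokens a given length of audio produces.

```python
# Paid-tier prices in USD per million tokens, as quoted in the article.
PRICES_PER_MILLION = {
    "standard": {"text_input": 1.00, "audio_output": 20.00},
    "batch": {"text_input": 0.50, "audio_output": 10.00},
}


def estimate_cost(input_tokens: int, output_tokens: int, mode: str = "standard") -> float:
    """Return the estimated cost in USD for one workload.

    Hypothetical helper: it only applies the quoted rates and ignores
    anything the article does not mention (taxes, minimums, rounding).
    """
    rates = PRICES_PER_MILLION[mode]
    total = input_tokens * rates["text_input"] + output_tokens * rates["audio_output"]
    return total / 1_000_000


# Example with made-up token counts: 2,000 text tokens in, 50,000 audio tokens out.
standard = estimate_cost(2_000, 50_000)
batch = estimate_cost(2_000, 50_000, mode="batch")
print(f"standard: ${standard:.3f}, batch: ${batch:.3f}")
```

Because audio output costs 20 times as much per token as text input, the output side dominates the bill in almost any realistic workload, and batch mode halves the total.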