The AI startup NineNineSix has released Kani TTS 2, a next-generation open-source text-to-speech (TTS) model that significantly extends generation length, improves stability, and reinforces its mission to bring high-quality speech AI to underrepresented languages.
The new version introduces stable generation of up to 40 seconds of continuous speech in a single pass, more than doubling the practical limit of the previous release. The model is already trending on Hugging Face, currently ranking among the top TTS models on the platform.
From 15 to 40 Seconds: A Structural Upgrade
The original Kani TTS gained attention for its lightweight architecture, efficient deployment, and multilingual adaptability. It was adopted by developers beyond its core team and has already been used as a foundation for community-trained models in Urdu, Vietnamese, Turkish, and Creole, among others. Kani TTS 2 builds on that momentum.
The expanded generation window enables:
- long-form responses for conversational AI agents
- multi-turn dialogue synthesis
- extended narration and content production
- more natural prosodic flow in continuous speech
Importantly, the architecture remains optimized for efficiency: it requires approximately 3 GB of GPU memory, which makes it suitable for both local and server deployments.
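To make that footprint concrete, here is a minimal usage sketch. The package name, model ID, and `synthesize` signature are hypothetical stand-ins, since the public API is not documented here; the point is only that a ~400M-parameter model loaded in half precision fits comfortably in the stated ~3 GB budget.

```python
# Hypothetical usage sketch: the package name, model ID, and method
# signatures below are illustrative stand-ins, not a published API.
import torch
from kani_tts import KaniTTS  # hypothetical package

# Half precision keeps a ~400M-parameter model well within the
# ~3 GB GPU-memory budget described above.
model = KaniTTS.from_pretrained(
    "nineninesix/kani-tts-2",  # hypothetical model ID
    torch_dtype=torch.float16,
).to("cuda")

# Single pass over a long input: Kani TTS 2 targets up to ~40 seconds
# of continuous speech without chunking.
waveform = model.synthesize("A long paragraph of narration...", language="en")
```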
Zero-Shot Voice Cloning and Full Pretrain Code
Kani TTS 2 supports zero-shot voice cloning, allowing developers to replicate a speaker’s tone and style from a short audio reference without additional fine-tuning.
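A zero-shot cloning call might look like the sketch below, using the same hypothetical interface as above; the argument names are assumptions. The key property is that a few seconds of reference audio condition the model on speaker identity, with no fine-tuning step.

```python
# Hypothetical zero-shot cloning sketch: argument names are assumptions.
import soundfile as sf
from kani_tts import KaniTTS  # hypothetical package

model = KaniTTS.from_pretrained("nineninesix/kani-tts-2")  # hypothetical ID

# A short reference clip is enough to capture tone and style.
ref_audio, sr = sf.read("speaker_reference.wav")

waveform = model.synthesize(
    "Text rendered in the reference speaker's voice.",
    reference_audio=ref_audio,
    reference_sample_rate=sr,
)
sf.write("cloned.wav", waveform, model.sample_rate)
```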
One of the most consequential decisions by the team was releasing the full pretraining code. This enables organizations and research groups to train TTS systems from scratch for any language, dialect, or domain.
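For a sense of what "training from scratch" involves, the schematic below shows a single pretraining step for a toy TTS, assuming the language-model-over-codec-tokens formulation common among recent compact TTS systems. Everything here (model size, vocabulary, data) is a stand-in for illustration and does not reflect the released codebase.

```python
# Schematic pretraining step for an LM-style TTS: next-token prediction
# over a joint sequence of text tokens and audio-codec tokens.
# All components here are toy stand-ins, not the released training code.
import torch
import torch.nn as nn

class TinyTTSLM(nn.Module):
    """Toy decoder-only model over a shared text+audio token vocabulary."""
    def __init__(self, vocab_size=8192, d_model=512, n_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        x = self.embed(tokens)
        # Causal mask: each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.head(self.blocks(x, mask=mask))

model = TinyTTSLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

# One step on a fake batch laid out as [text tokens ... audio tokens];
# the loss is ordinary next-token cross-entropy.
batch = torch.randint(0, 8192, (4, 256))
logits = model(batch[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), batch[:, 1:].reshape(-1)
)
loss.backward()
opt.step()
```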
“Kani TTS 2 is the next step after our first release: we made speech generation more stable and enabled the model to produce longer audio segments. We focus on compact and open models - they are easier to deploy and adapt to different languages and accents, including low-resource ones. For us, it is important to demonstrate that world-class technologies can be built in Kyrgyzstan. That is why we released not only the model weights, but the entire pretraining code - so any team can train a TTS system from scratch for their own language,” said Nursultan Bakashov, co-founder of nineninesix.ai.
Language Expansion as a Core Philosophy
The model currently supports:
- English
- Spanish
- Kyrgyz
Support for Kyrgyz is particularly notable, as it demonstrates the feasibility of building high-quality TTS for low-resource languages.
The previous version of Kani TTS already proved its adaptability. Community contributors independently trained new language models, including Urdu and Vietnamese, using the open architecture. In several cases, these community-driven extensions achieved production-level quality.
This scalability suggests that Kani TTS is not just a single model, but a flexible foundation for speech generation in languages often overlooked by large AI providers.
Technical Overview
- ~400 million parameters
- Pretrained on ~10,000 hours of speech data
- Full training completed in approximately 6 hours on 8× NVIDIA H100 GPUs
- Optimized to run in ~3 GB of GPU memory
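As a quick sanity check on those numbers: ~400 million parameters stored in 16-bit precision occupy roughly 0.8 GB on their own (400M × 2 bytes), leaving headroom for activations and decoding buffers within the ~3 GB figure, assuming half-precision inference.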
The efficient training time and moderate hardware requirements underline the model’s architectural focus on practicality rather than brute-force scaling.
Why It Matters
As AI systems increasingly shift toward voice-based interaction, language inclusion becomes a structural issue. Many low-resource languages remain underrepresented in high-quality speech models, limiting access to voice AI in local contexts.
Kani TTS 2 addresses this gap by combining:
- extended generation length
- efficient architecture
- zero-shot cloning
- fully open training pipelines
Its rapid rise on Hugging Face signals growing demand for open, adaptable speech infrastructure beyond proprietary cloud APIs.
With Kani TTS 2, NineNineSix positions itself not merely as a model developer, but as a contributor to the global effort to democratize speech AI.
