Available APIs
Speech-to-Text: Gnani Prisma v2.5
| API | Description |
|---|---|
| STT REST | Transcribe short audio files (≤ 60s) via a single HTTP request |
| STT Realtime | Stream live audio over a WebSocket connection and receive transcript segments in real time |
| STT Batch | Submit long or multiple audio files for async transcription; poll for results via job_id |
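
The table above implies a simple routing rule: live audio goes to STT Realtime, short single files (60 s or less) go to STT REST, and long or multiple files go to STT Batch, whose results are polled by `job_id`. The following is a minimal sketch of that rule and of a polling loop. The function names, the `status` field, and its values (`completed`, `failed`) are assumptions made for illustration and are not taken from the Gnani API reference; the actual HTTP call is injected as `fetch_status` so that no endpoint is invented here.

```python
import time
from typing import Callable

REST_MAX_SECONDS = 60  # from the table: STT REST handles audio of 60 s or less


def choose_stt_api(duration_s: float, live: bool = False, n_files: int = 1) -> str:
    """Return which STT API fits the input: 'realtime', 'batch', or 'rest'."""
    if live:
        return "realtime"  # live audio is streamed over a WebSocket
    if n_files > 1 or duration_s > REST_MAX_SECONDS:
        return "batch"  # long or multiple files are transcribed asynchronously
    return "rest"  # a short single file fits one HTTP request


def poll_job(
    fetch_status: Callable[[str], dict],
    job_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
) -> dict:
    """Call fetch_status(job_id) until the job reaches a terminal state.

    fetch_status is a caller-supplied function that performs the real request.
    The 'status' key and its terminal values are hypothetical placeholders.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_status(job_id)
        if job.get("status") in ("completed", "failed"):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout_s} s")
        time.sleep(interval_s)
```

Injecting `fetch_status` keeps the loop independent of any particular HTTP client and makes it easy to exercise with a stub.
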
Text-to-Speech: Gnani Timbre v2.0
| API | Description |
|---|---|
| TTS REST | Synthesize text to audio in a single synchronous HTTP call |
| TTS Streaming | Submit text via an HTTP request and receive synthesized audio progressively as a Server-Sent Events (SSE) stream |
| TTS Realtime | Stream text incrementally and receive audio concurrently over a persistent WebSocket connection for low-latency playback |
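
TTS Streaming delivers audio as a Server-Sent Events stream, so a client has to split the response into events before it can use the audio. The sketch below parses the standard SSE wire format (`data:` fields, comment lines starting with `:`, and a blank line ending each event). How the audio is encoded inside each event is not stated in this overview; base64 in the `data` field is an assumption made only for this example, and `collect_audio` is a hypothetical helper name.

```python
import base64
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each event in a Server-Sent Events stream."""
    buf: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if buf:
                yield "\n".join(buf)
                buf = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space after the colon is dropped
        if field == "data":
            buf.append(value)
    # An event not terminated by a blank line is discarded, as in the SSE format.


def collect_audio(lines: Iterable[str]) -> bytes:
    """Concatenate audio chunks, ASSUMING each event carries base64 audio."""
    return b"".join(base64.b64decode(payload) for payload in iter_sse_data(lines))
```

In practice `lines` would be the line iterator of a streaming HTTP response, and each decoded chunk could be handed to an audio player as it arrives instead of being joined at the end.
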
Voice Cloning
| API | Description |
|---|---|
| VC Embeddings | Upload a reference audio file to generate a speaker_embedding for use in voice cloning |
| Voice Cloned TTS REST | Synthesize audio in your cloned voice via a single synchronous HTTP call |
| Voice Cloned TTS Streaming | Stream cloned voice audio progressively using Server-Sent Events |
| Voice Cloned TTS Realtime | Stream text and receive cloned voice audio in real time over a WebSocket connection |
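
Voice cloning is a two-step flow: VC Embeddings turns a reference audio file into a `speaker_embedding`, and that embedding is then passed to one of the Voice Cloned TTS APIs. The sketch below shows the order of operations only. Apart from `speaker_embedding`, every name in it is hypothetical: the paths `/vc/embeddings` and `/vc/tts`, the `audio` and `text` fields, and the injected `post` callable are placeholders, not the documented Gnani interface.

```python
from typing import Callable

# post(path, payload) -> parsed response; supplied by the caller.
Post = Callable[[str, dict], dict]


def clone_and_speak(post: Post, reference_audio: bytes, text: str) -> bytes:
    """Generate a speaker embedding, then synthesize text in that voice."""
    # Step 1: upload the reference audio and read back the embedding.
    # The path and the 'audio' field are hypothetical.
    embedding = post("/vc/embeddings", {"audio": reference_audio})["speaker_embedding"]

    # Step 2: request synthesis with the embedding attached.
    # The path and the 'text' and 'audio' fields are hypothetical.
    result = post("/vc/tts", {"text": text, "speaker_embedding": embedding})
    return result["audio"]
```

Keeping the transport behind `post` lets the same flow be tried against a stub before any real credentials or endpoints are wired in.
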
Key Capabilities
| Feature | Detail |
|---|---|
| 10+ Indian Languages | Native-script transcription and synthesis across 10+ Indian languages |
| Language Detection | Automatic by default; set language_code to target a specific language |
| Code-Switching | Handles code-mixed speech naturally |
| Audio Flexibility | STT accepts WAV, MP3, OGG, FLAC, AAC, M4A |
| Voice Cloning | Clone any voice from a short audio sample using speaker embeddings |
| Latency | P95 latency of 200 ms for STT; streaming TTS supported |
| Processing Modes | Real-time streaming and batch |
| Accuracy | Sub-4% WER on Indian English; 20–30% better accuracy on major Indian languages |
| Transcript Formatting | Auto-punctuation, inverse text normalization (numerals, dates, currency) |
| SSML Support | Full SSML for fine-grained speech synthesis control |
| Noise Robustness | Optimized for telephony-grade and noisy real-world audio |
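
The accuracy figure above is given as word error rate (WER): the number of word-level substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The sketch below is a generic implementation of that standard definition for checking transcripts against your own references. It is not Gnani's evaluation script, and it does no text normalization (case, punctuation, numerals), which a real evaluation would need to define.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)
```

A result below `0.04` corresponds to the "sub-4% WER" level quoted in the table. WER can exceed 1.0 when the hypothesis contains many inserted words.
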