Overview
Stream raw PCM audio frames and receive transcript segments as speech is detected. The server uses Voice Activity Detection (VAD) to identify speech boundaries and returns a transcript for each segment.
| Use case | Recommended endpoint |
| --- | --- |
| Live microphone / phone call / real-time audio | This endpoint |
| Short pre-recorded clips ≤ 60 s | STT REST |
| Large files or bulk jobs | STT Batch |
Endpoint
WSS wss://api.vachana.ai/stt/v3/stream
All configuration is passed as WebSocket upgrade headers at connection time. Headers cannot be changed mid-session — reconnect with new headers to change settings.
| Header | Required | Default | Description |
| --- | --- | --- | --- |
| x-api-key-id | ✅ | none | Your Vachana API key. |
| lang_code | ✅ | en-IN | BCP-47 language code for transcription. See Supported Languages. |
| x-sample-rate | ❌ | 16000 | Sample rate of the audio stream in Hz. Accepted values: 8000, 16000, 44100, 48000. Must match the actual sample rate of your audio source. |
| x-format | ❌ | verbatim | verbatim: raw spoken-form output. transcribe: enables Inverse Text Normalization (ITN). See ITN. |
| itn_native_numerals | ❌ | false | When x-format=transcribe, set true to render digits in the native script of the target language (e.g. ₹५,००० instead of ₹5,000 for Hindi). |
Choosing the right sample rate:
| Value | When to use |
| --- | --- |
| 48000 | Browser getUserMedia default; Mac microphone |
| 44100 | Mac microphone alternate; CD-quality audio |
| 16000 | Wideband telephony; sent as-is with no resampling |
| 8000 | Narrowband telephony (legacy PSTN / VoIP) |
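The header table above can be turned into a small validation step before opening the connection. The following is a minimal sketch, not part of the official SDK: build_stream_headers is a hypothetical helper, and sending booleans as the lowercase strings "true" and "false" is an assumption about the wire format.

```python
from typing import Dict

ACCEPTED_SAMPLE_RATES = (8000, 16000, 44100, 48000)
ACCEPTED_FORMATS = ("verbatim", "transcribe")


def build_stream_headers(
    api_key: str,
    lang_code: str = "en-IN",
    sample_rate: int = 16000,
    output_format: str = "verbatim",
    native_numerals: bool = False,
) -> Dict[str, str]:
    """Build the WebSocket upgrade headers (hypothetical helper)."""
    if sample_rate not in ACCEPTED_SAMPLE_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate}")
    if output_format not in ACCEPTED_FORMATS:
        raise ValueError(f"unsupported x-format: {output_format}")
    headers = {
        "x-api-key-id": api_key,
        "lang_code": lang_code,
        "x-sample-rate": str(sample_rate),
        "x-format": output_format,
    }
    # itn_native_numerals is only documented for x-format=transcribe,
    # so this sketch omits it otherwise.
    # Assumption: the boolean is sent as "true" / "false".
    if output_format == "transcribe":
        headers["itn_native_numerals"] = "true" if native_numerals else "false"
    return headers
```

Because headers cannot change mid-session, validating them up front avoids a connect, fail, reconnect cycle.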
Connection Flow
A WebSocket session follows a strict sequence:
1. Client connects: opens a WebSocket to /stt/v3/stream with all required headers.
2. Server confirms: immediately sends a connected message echoing the active configuration.
3. Client streams audio: continuously sends binary frames of raw PCM audio at a steady real-time cadence.
4. Server detects speech: VAD identifies end-of-speech boundaries and emits a processing message to acknowledge that a segment was captured.
5. Server returns transcript: sends a transcript message with the transcribed text, segment metadata, and latency.
6. Either side closes: client or server may close the connection at any time.
The processing message is a low-latency signal that audio was captured and transcription has begun. Expect a transcript message shortly after.
All audio must be sent as raw PCM binary frames over the WebSocket. No container format (WAV, MP3, etc.) is accepted mid-stream.
PCM Specification
| Property | 16 kHz | 8 kHz |
| --- | --- | --- |
| Encoding | PCM signed 16-bit little-endian | PCM signed 16-bit little-endian |
| Sample rate | 16,000 Hz | 8,000 Hz |
| Channels | 1 (mono) | 1 (mono) |
| Samples per chunk | 512 | 512 |
| Bytes per frame | 1,024 bytes (512 samples × 2 bytes) | 1,024 bytes (512 samples × 2 bytes) |
| Frame duration | 32 ms | 64 ms |
Sending Rules
Each binary frame must be exactly 1,024 bytes.
Frames must be sent at real-time cadence — one frame every 32 ms (16 kHz) or 64 ms (8 kHz). Do not buffer and burst; this degrades VAD accuracy.
For 44100 and 48000 Hz sources, the server resamples internally — still send 1,024-byte frames at the appropriate cadence.
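The frame size and cadence rules above can be sketched as two small helpers. This is an illustration only: the names are hypothetical, the interval for 44100 and 48000 Hz assumes that "appropriate cadence" means 512 samples at the source rate (roughly 11.6 ms and 10.7 ms), and padding a short final frame with zero bytes is an assumption, since the documentation only states that every frame must be exactly 1,024 bytes.

```python
from typing import Iterator

SAMPLES_PER_FRAME = 512
BYTES_PER_SAMPLE = 2  # signed 16-bit PCM, mono
FRAME_BYTES = SAMPLES_PER_FRAME * BYTES_PER_SAMPLE  # 1,024 bytes


def frame_interval_seconds(sample_rate: int) -> float:
    """Real-time duration of one 512-sample frame at the given rate."""
    return SAMPLES_PER_FRAME / sample_rate


def iter_frames(pcm: bytes) -> Iterator[bytes]:
    """Split raw PCM into 1,024-byte frames."""
    for start in range(0, len(pcm), FRAME_BYTES):
        frame = pcm[start:start + FRAME_BYTES]
        # Assumption: a short final frame is padded with silence (zero bytes)
        # so that every frame sent is exactly FRAME_BYTES long.
        yield frame.ljust(FRAME_BYTES, b"\x00")
```

A sender would sleep for frame_interval_seconds(rate) between frames rather than writing them in a burst.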
Server Messages
The server sends JSON text frames. All messages share a type discriminator field and an ISO-8601 timestamp.
connected
Sent once immediately after the WebSocket handshake succeeds.
{
  "type": "connected",
  "message": "STT service ready — VAD service connected",
  "timestamp": "2024-01-15T10:30:00.000Z",
  "config": {
    "sample_rate": 16000,
    "chunk_size": 512
  }
}
| Field | Type | Description |
| --- | --- | --- |
| type | string | Always "connected". |
| message | string | Human-readable status string. |
| timestamp | string | ISO-8601 server timestamp. |
| config.sample_rate | integer | Active sample rate in Hz, echoed from x-sample-rate. |
| config.chunk_size | integer | Expected chunk size in samples (always 512). |
processing
Emitted when VAD detects the end of a speech segment and transcription has begun. Use this as a low-latency acknowledgment that audio was captured.
{
  "type": "processing",
  "timestamp": "2024-01-15T10:30:05.123Z"
}
| Field | Type | Description |
| --- | --- | --- |
| type | string | Always "processing". |
| timestamp | string | ISO-8601 timestamp when speech-end was detected. |
transcript
Contains the transcribed text for a completed speech segment.
{
  "type": "transcript",
  "timestamp": "2024-01-15T10:30:05.987Z",
  "text": "Hello, how are you today?",
  "audio_duration_ms": 2340,
  "segment_id": "<segment_id>",
  "segment_index": 0,
  "latency": 320
}
| Field | Type | Description |
| --- | --- | --- |
| type | string | Always "transcript". |
| timestamp | string | ISO-8601 timestamp when the transcript was emitted. |
| text | string | Transcribed text. Format depends on the x-format header. |
| audio_duration_ms | integer | Duration of the captured speech segment in milliseconds. |
| segment_id | string | Unique identifier for this speech segment. Use for deduplication or support correlation. |
| segment_index | integer | Sequential index of this segment within the session, starting at 0. |
| latency | integer | Time in milliseconds from end-of-speech detection to transcript delivery. |
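The segment_id and segment_index fields make it straightforward to build a full session transcript on the client. Below is a minimal sketch under the assumption that the client has collected the raw JSON text frames in a list; assemble_transcript is a hypothetical helper, and keeping the first message per segment_id is one possible deduplication policy.

```python
import json
from typing import Dict, List


def assemble_transcript(raw_messages: List[str]) -> str:
    """Join transcript texts in segment order, skipping duplicate segment_ids."""
    segments: Dict[str, dict] = {}
    for raw in raw_messages:
        msg = json.loads(raw)
        if msg.get("type") != "transcript":
            continue  # ignore connected, processing, and error messages
        # Keep only the first message seen for each segment_id.
        segments.setdefault(msg["segment_id"], msg)
    ordered = sorted(segments.values(), key=lambda m: m["segment_index"])
    return " ".join(m["text"] for m in ordered)
```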
error
Sent when the server encounters a recoverable or fatal error. The connection may remain open after a recoverable error.
| Field | Type | Description |
| --- | --- | --- |
| type | string | Always "error". |
| timestamp | string | ISO-8601 timestamp of the error. |
| message | string | Human-readable description of the error. |
{
  "type": "error",
  "timestamp": "2024-01-15T10:30:10.000Z",
  "message": "STT engine failed to initialize"
}
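Since every server message carries a type discriminator, a client that talks to the WebSocket directly can route on that field. The sketch below is an illustration, not the SDK's parser: ServerMessage and parse_server_message are hypothetical names, and only a few of the documented fields are modelled.

```python
import json
from dataclasses import dataclass
from typing import Optional

KNOWN_TYPES = {"connected", "processing", "transcript", "error"}


@dataclass
class ServerMessage:
    type: str
    timestamp: str
    text: Optional[str] = None     # present on transcript messages
    message: Optional[str] = None  # present on connected and error messages


def parse_server_message(raw: str) -> ServerMessage:
    """Decode one JSON text frame and check its type discriminator."""
    data = json.loads(raw)
    msg_type = data.get("type")
    if msg_type not in KNOWN_TYPES:
        raise ValueError(f"unexpected message type: {msg_type!r}")
    return ServerMessage(
        type=msg_type,
        timestamp=data["timestamp"],
        text=data.get("text"),
        message=data.get("message"),
    )
```

Rejecting unknown types is a design choice for this sketch; a production client might prefer to log and ignore them so that new message types do not break older clients.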
Python SDK
The official Python SDK wraps the WebSocket connection, audio pacing, and event parsing into a clean async interface.
Installation
pip install gnani-vachana
Requires Python 3.9+.
Authentication
The streaming client requires your API key and language code.
Constructor argument
from gnani.stt import GnaniSTTStreamClient

stream = GnaniSTTStreamClient(
    api_key="your-api-key",
    language_code="hi-IN",
)
Stream Audio from a File
Use the async context manager and the stream_audio helper. It handles real-time pacing automatically so frames are sent at the correct cadence for VAD.
import asyncio
from gnani.stt import GnaniSTTStreamClient


async def main():
    async with GnaniSTTStreamClient(
        api_key="your-api-key",
        language_code="hi-IN",
        sample_rate=16000,
    ) as stream:
        with open("audio.pcm", "rb") as f:
            transcripts = await stream.stream_audio(
                f,
                on_transcript=lambda t: print(f"Transcript: {t.text}"),
                on_processing=lambda p: print("Processing..."),
                realtime_pace=True,  # sends frames at real-time cadence
            )
        print(f"Total segments: {len(transcripts)}")


asyncio.run(main())
Iterate Over Events Manually
For lower-level control — handling each event type differently or interleaving sending and receiving — iterate over the stream directly.
import asyncio
from gnani.stt import GnaniSTTStreamClient, StreamTranscriptEvent, StreamProcessingEvent


async def main():
    async with GnaniSTTStreamClient(
        api_key="your-api-key",
        language_code="hi-IN",
    ) as stream:
        with open("audio.pcm", "rb") as f:
            while chunk := f.read(1024):
                await stream.send_audio(chunk)
                await asyncio.sleep(0.032)  # 32 ms per frame at 16 kHz

        async for event in stream:
            if isinstance(event, StreamTranscriptEvent):
                print(f"[Segment {event.segment_index}] {event.text}")
                print(f"  Duration: {event.audio_duration_ms} ms  Latency: {event.latency} ms")
            elif isinstance(event, StreamProcessingEvent):
                print("Processing speech...")


asyncio.run(main())
Using 8 kHz Audio (Telephony)
stream = GnaniSTTStreamClient(
    api_key="your-api-key",
    language_code="en-IN",
    sample_rate=8000,
)
SDK Event Types
All events are typed dataclasses with a raw field containing the full server JSON.
| Event class | Key fields | Description |
| --- | --- | --- |
| StreamConnectedEvent | message, sample_rate, chunk_size | Sent once after the WebSocket handshake. Confirms the active config. |
| StreamProcessingEvent | timestamp | VAD detected end-of-speech; transcription has started. |
| StreamTranscriptEvent | text, segment_index, audio_duration_ms, latency | Completed transcript for a speech segment. |
| StreamErrorEvent | message, timestamp | Server-side error, recoverable or fatal. |
Error Handling
import asyncio
from gnani.stt import (
    GnaniSTTStreamClient,
    StreamConnectionError,  # Could not establish the WebSocket connection
    StreamClosedError,      # Attempted to send on an already-closed stream
    StreamError,            # Server returned an error message mid-session
)


async def main():
    chunk = b"\x00" * 1024  # placeholder: one 1,024-byte frame of silence
    try:
        async with GnaniSTTStreamClient(api_key="your-api-key", language_code="en-IN") as stream:
            await stream.send_audio(chunk)
    except StreamConnectionError as e:
        print(f"Could not connect: {e}")
    except StreamClosedError as e:
        print(f"Stream was already closed: {e}")
    except StreamError as e:
        print(f"Server error: {e.message} (at {e.timestamp})")


asyncio.run(main())
Supported Languages
| Language | Code | Native script | Example |
| --- | --- | --- | --- |
| Bengali | bn-IN | Bengali (বাংলা) | “আমি ভাত খাই” |
| English | en-IN | Latin | “I am going to the market” |
| Gujarati | gu-IN | Gujarati (ગુજરાતી) | “હું બજાર જાઉં છું” |
| Hindi | hi-IN | Devanagari (हिन्दी) | “मैं बाज़ार जा रहा हूँ” |
| Kannada | kn-IN | Kannada (ಕನ್ನಡ) | “ನಾನು ಮಾರುಕಟ್ಟೆಗೆ ಹೋಗುತ್ತೇನೆ” |
| Malayalam | ml-IN | Malayalam (മലയാളം) | “ഞാൻ ചന്തയിലേക്ക് പോകുന്നു” |
| Marathi | mr-IN | Devanagari (मराठी) | “मी बाजारात जातोय” |
| Punjabi | pa-IN | Gurmukhi (ਪੰਜਾਬੀ) | “ਮੈਂ ਬਾਜ਼ਾਰ ਜਾ ਰਿਹਾ ਹਾਂ” |
| Tamil | ta-IN | Tamil (தமிழ்) | “நான் சந்தைக்கு செல்கிறேன்” |
| Telugu | te-IN | Telugu (తెలుగు) | “నేను మార్కెట్కి వెళ్తున్నాను” |
| Hinglish (experimental) | en-hi-in-cm | Latin + Devanagari | “मैं market जा रहा हूँ” |
| Hinglish Latin (experimental) | en-hi-IN-latn | Latin only | “Main market ja raha hu” |
| Auto-detect (experimental) | en-IN,hi-IN,ta-IN,… | All supported | Pass all candidate codes comma-separated in lang_code |
For auto-detection, pass all desired language codes comma-separated in the lang_code header. For example: en-IN,hi-IN,ta-IN,te-IN,kn-IN,ml-IN,gu-IN,mr-IN,bn-IN,pa-IN.
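Building the comma-separated lang_code value can be sketched as follows. auto_detect_lang_code is a hypothetical helper; restricting candidates to the ten base language codes is an assumption of this sketch, since the documentation does not say whether the experimental Hinglish codes may be combined with auto-detection.

```python
from typing import Iterable

# The ten base codes from the Supported Languages table.
BASE_LANGUAGE_CODES = {
    "bn-IN", "en-IN", "gu-IN", "hi-IN", "kn-IN",
    "ml-IN", "mr-IN", "pa-IN", "ta-IN", "te-IN",
}


def auto_detect_lang_code(codes: Iterable[str]) -> str:
    """Return a lang_code header value listing all candidate languages."""
    # dict.fromkeys drops duplicates while preserving the caller's order.
    unique = list(dict.fromkeys(codes))
    if not unique:
        raise ValueError("at least one language code is required")
    unknown = [c for c in unique if c not in BASE_LANGUAGE_CODES]
    if unknown:
        raise ValueError(f"unsupported language codes: {unknown}")
    return ",".join(unique)
```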
Inverse Text Normalization (ITN)
When x-format: transcribe is set, ITN runs on every transcript segment immediately after recognition — converting spoken-form numbers, currency, dates, times, and phone numbers into the compact written form a reader expects.
Currently supported for Hindi (hi-IN) and English (en-IN) only. Enabling ITN for other languages has no effect; transcripts are returned verbatim.
What ITN Normalizes
1 — Cardinal & Ordinal Numbers
| Spoken input (ASR) | Written output (ITN) | Rule |
| --- | --- | --- |
| दो हज़ार | 2,000 | Indian comma grouping |
| पाँच लाख बीस हज़ार | 5,20,000 | Lakh-scale grouping |
| five lakh | 5,00,000 | English lakh convention |
| पहला / twenty first | 1st / 21st | Ordinal suffix |
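The Indian comma grouping used in these outputs places the last three digits in one group and every group to the left in pairs. The function below is a client-side illustration of that convention only, not the service's ITN implementation; indian_group is a hypothetical name.

```python
def indian_group(n: int) -> str:
    """Format an integer with Indian (lakh/crore) comma grouping."""
    digits = str(abs(n))
    if len(digits) <= 3:
        grouped = digits
    else:
        head, tail = digits[:-3], digits[-3:]
        # Split everything left of the last three digits into pairs,
        # working from the right.
        pairs = []
        while len(head) > 2:
            pairs.insert(0, head[-2:])
            head = head[:-2]
        pairs.insert(0, head)
        grouped = ",".join(pairs + [tail])
    return ("-" if n < 0 else "") + grouped
```

For example, indian_group(520000) yields "5,20,000", matching the lakh-scale row in the table.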
2 — Currency & Money
| Spoken input (ASR) | Written output (ITN) | Rule |
| --- | --- | --- |
| पाँच सौ रुपये | ₹500 | ₹ + amount |
| तीन रुपये पचास पैसे | ₹3.50 | ₹ + rupees.paise |
| I need five thousand rupees | ₹5,000 | English India pipeline |
| pay do lakh rupees | ₹2,00,000 | Code-mixed en/hi |
3 — Dates
| Spoken input (ASR) | Written output (ITN) | Rule |
| --- | --- | --- |
| बीस जनवरी दो हज़ार पच्चीस | 20 जनवरी 2025 | DD Month YYYY (hi) |
| fifteenth january twenty twenty five | 15th January 2025 | Ordinal Month YYYY (en) |
4 — Times
Indian time-of-day words (सुबह, दोपहर, शाम, रात) automatically map to 24-hour HH:MM output.
| Spoken input (ASR) | Written output (ITN) | Rule |
| --- | --- | --- |
| सुबह पाँच बजे | सुबह 05:00 | सुबह = AM |
| शाम पाँच बजे | शाम 17:00 | शाम = evening (16–20 h) |
| रात के दस बजे | रात 22:00 | रात = night (20–24 h) |
| meeting at five fifteen in the evening | meeting 17:15 in the evening | en: 24-hour |
5 — Phone Numbers & PIN Codes
| Spoken input (ASR) | Written output (ITN) | Rule |
| --- | --- | --- |
| नौ आठ सात छह पाँच चार तीन दो एक शून्य | 9876543210 | 10 digits → phone |
| एक एक शून्य शून्य शून्य एक | 110001 | 6 digits → PIN |
| one two three four five six | 123456 | English digit words |
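The digit-run rules in this table can be mimicked with a simple word lookup and a length check. This is a toy sketch for intuition, not how the service implements ITN: both function names are hypothetical, and it handles only utterances made up entirely of digit words, without any of the idiom handling the real pipeline performs.

```python
DIGIT_WORDS = {
    # English
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    # Hindi
    "शून्य": "0", "एक": "1", "दो": "2", "तीन": "3", "चार": "4",
    "पाँच": "5", "छह": "6", "सात": "7", "आठ": "8", "नौ": "9",
}


def digits_from_words(utterance: str) -> str:
    """Map a run of spoken digit words to a digit string."""
    words = utterance.lower().split()
    try:
        return "".join(DIGIT_WORDS[w] for w in words)
    except KeyError as exc:
        raise ValueError(f"not a digit word: {exc.args[0]}") from None


def classify_digit_run(digits: str) -> str:
    """Apply the length rules from the table: 10 digits is a phone, 6 a PIN."""
    if len(digits) == 10:
        return "phone"
    if len(digits) == 6:
        return "pin"
    return "other"
```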
6 — Mixed & Code-Mixed Utterances
| Spoken input (ASR) | Written output (ITN) |
| --- | --- |
| कल थ्री फिफ्टी पीएम को पाँच सौ रुपये transfer करना है | कल 15:50 को ₹500 transfer करना है |
| pay do lakh rupees by fifteenth march | pay ₹2,00,000 by 15th March |
Native Script Digits — itn_native_numerals
By default, ITN outputs Western Arabic digits (0–9). Set itn_native_numerals: true in the connection headers to render digits in the native script of the target language.
| Language | Spoken input | false (default) | true (native script) |
| --- | --- | --- | --- |
| Hindi hi-IN | पाँच हज़ार रुपये | ₹5,000 | ₹५,००० |
| English en-IN | five thousand rupees | ₹5,000 | ₹5,000 (Latin, no change) |
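The effect of itn_native_numerals on Hindi output amounts to a digit-for-digit substitution. The sketch below reproduces that mapping on the client side for illustration; to_native_numerals is a hypothetical helper and says nothing about how the server performs the conversion.

```python
# Western Arabic digits mapped to their Devanagari counterparts.
DEVANAGARI_DIGITS = str.maketrans("0123456789", "०१२३४५६७८९")


def to_native_numerals(text: str, lang_code: str) -> str:
    """Render digits in Devanagari for hi-IN; leave other languages unchanged."""
    if lang_code == "hi-IN":
        return text.translate(DEVANAGARI_DIGITS)
    return text
```

Punctuation and the currency sign pass through untouched, so "₹5,000" keeps its comma grouping in either script.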
What ITN Does Not Change
ITN intentionally preserves idiomatic and ambiguous phrases to avoid incorrect normalization.
दो तीन (meaning a few) stays as text, not 2 or 3
कर दो / ले दो (imperative verbs) are kept as words, not treated as cardinal 2