Overview

Stream raw PCM audio frames and receive transcript segments as speech is detected. The server uses Voice Activity Detection (VAD) to identify speech boundaries and returns a transcript for each segment.
| Use case | Recommended endpoint |
| --- | --- |
| Live microphone / phone call / real-time audio | This endpoint |
| Short pre-recorded clips ≤ 60 s | STT REST |
| Large files or bulk jobs | STT Batch |

Endpoint

WSS wss://api.vachana.ai/stt/v3/stream

Connection Headers

All configuration is passed as WebSocket upgrade headers at connection time. Headers cannot be changed mid-session — reconnect with new headers to change settings.
| Header | Required | Default | Description |
| --- | --- | --- | --- |
| x-api-key-id | Yes | (none) | Your Vachana API key. |
| lang_code | No | en-IN | BCP-47 language code for transcription. See Supported Languages. |
| x-sample-rate | No | 16000 | Sample rate of the audio stream in Hz. Accepted values: 8000, 16000, 44100, 48000. Must match the actual sample rate of your audio source. |
| x-format | No | verbatim | verbatim: raw spoken-form output. transcribe: enables Inverse Text Normalization (ITN). See ITN. |
| itn_native_numerals | No | false | When x-format=transcribe, set true to render digits in the native script of the target language (e.g. ₹५,००० instead of ₹5,000 for Hindi). |
Choosing the right sample rate:

| Value | When to use |
| --- | --- |
| 48000 | Browser getUserMedia default; Mac microphone |
| 44100 | Mac microphone alternate; CD-quality audio |
| 16000 | Wideband telephony; sent as-is with no resampling |
| 8000 | Narrowband telephony (legacy PSTN / VoIP) |
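
As a rough illustration of the header tables above, the following sketch assembles the upgrade headers and rejects unsupported values before a connection is attempted. The helper name build_stream_headers is hypothetical and not part of any SDK, and sending the boolean as the lowercase string "true"/"false" is an assumption; the header names and accepted values come from the tables.

```python
ACCEPTED_SAMPLE_RATES = {8000, 16000, 44100, 48000}
ACCEPTED_FORMATS = {"verbatim", "transcribe"}


def build_stream_headers(
    api_key: str,
    lang_code: str = "en-IN",
    sample_rate: int = 16000,
    output_format: str = "verbatim",
    native_numerals: bool = False,
) -> dict[str, str]:
    """Return WebSocket upgrade headers as strings (hypothetical helper)."""
    if sample_rate not in ACCEPTED_SAMPLE_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate}")
    if output_format not in ACCEPTED_FORMATS:
        raise ValueError(f"unsupported x-format: {output_format}")
    return {
        "x-api-key-id": api_key,
        "lang_code": lang_code,
        "x-sample-rate": str(sample_rate),
        "x-format": output_format,
        # Assumption: booleans travel as lowercase "true" / "false".
        "itn_native_numerals": "true" if native_numerals else "false",
    }
```

Validating on the client side gives a clear error early, since headers cannot be corrected once the session is open.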

Connection Flow

A WebSocket session follows a strict sequence:
  1. Client connects — opens a WebSocket to /stt/v3/stream with all required headers.
  2. Server confirms — immediately sends a connected message echoing the active configuration.
  3. Client streams audio — continuously sends binary frames of raw PCM audio at a steady real-time cadence.
  4. Server detects speech — VAD identifies end-of-speech boundaries and emits a processing message to acknowledge that a segment was captured.
  5. Server returns transcript — sends a transcript message with the transcribed text, segment metadata, and latency.
  6. Either side closes — client or server may close the connection at any time.
The processing message is a low-latency signal that audio was captured and transcription has begun. Expect a transcript message shortly after.

Audio Format & Sending Audio

All audio must be sent as raw PCM binary frames over the WebSocket. No container format (WAV, MP3, etc.) is accepted mid-stream.

PCM Specification

| Property | 16 kHz | 8 kHz |
| --- | --- | --- |
| Encoding | PCM signed 16-bit little-endian | PCM signed 16-bit little-endian |
| Sample rate | 16,000 Hz | 8,000 Hz |
| Channels | 1 (mono) | 1 (mono) |
| Samples per chunk | 512 | 512 |
| Bytes per frame | 1,024 bytes (512 samples × 2 bytes) | 1,024 bytes (512 samples × 2 bytes) |
| Frame duration | 32 ms | 64 ms |

Sending Rules

  • Each binary frame must be exactly 1,024 bytes.
  • Frames must be sent at real-time cadence — one frame every 32 ms (16 kHz) or 64 ms (8 kHz). Do not buffer and burst; this degrades VAD accuracy.
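  • For 44100 and 48000 Hz sources, the server resamples internally. Still send 1,024-byte frames (512 samples each) at real-time cadence, which by the same arithmetic is roughly 11.6 ms per frame at 44.1 kHz and 10.7 ms at 48 kHz.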
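
The framing rules above can be sketched as a small helper that cuts a PCM buffer into 1,024-byte frames and derives the send interval from the sample rate. The names are hypothetical, and padding a short final frame with zero samples (silence) is an assumption; the document only states that every frame must be exactly 1,024 bytes.

```python
from typing import Iterator

SAMPLES_PER_FRAME = 512
BYTES_PER_SAMPLE = 2  # signed 16-bit PCM
FRAME_BYTES = SAMPLES_PER_FRAME * BYTES_PER_SAMPLE  # 1,024 bytes


def frame_interval_seconds(sample_rate: int) -> float:
    """Real-time duration covered by one 512-sample frame."""
    return SAMPLES_PER_FRAME / sample_rate


def iter_frames(pcm: bytes) -> Iterator[bytes]:
    """Yield consecutive frames of exactly FRAME_BYTES bytes."""
    for start in range(0, len(pcm), FRAME_BYTES):
        frame = pcm[start:start + FRAME_BYTES]
        if len(frame) < FRAME_BYTES:
            # Assumption: pad the trailing partial frame with silence.
            frame += b"\x00" * (FRAME_BYTES - len(frame))
        yield frame
```

With these definitions, frame_interval_seconds(16000) gives 0.032 and frame_interval_seconds(8000) gives 0.064, matching the frame durations in the PCM table.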

Server Messages

The server sends JSON text frames. All messages share a type discriminator field and an ISO-8601 timestamp.

connected

Sent once immediately after the WebSocket handshake succeeds.
{
  "type": "connected",
  "message": "STT service ready — VAD service connected",
  "timestamp": "2024-01-15T10:30:00.000Z",
  "config": {
    "sample_rate": 16000,
    "chunk_size": 512
  }
}
| Field | Type | Description |
| --- | --- | --- |
| type | string | Always "connected". |
| message | string | Human-readable status string. |
| timestamp | string | ISO-8601 server timestamp. |
| config.sample_rate | integer | Active sample rate in Hz, echoed from x-sample-rate. |
| config.chunk_size | integer | Expected chunk size in samples (always 512). |

processing

Emitted when VAD detects the end of a speech segment and transcription has begun. Use this as a low-latency acknowledgment that audio was captured.
{
  "type": "processing",
  "timestamp": "2024-01-15T10:30:05.123Z"
}
| Field | Type | Description |
| --- | --- | --- |
| type | string | Always "processing". |
| timestamp | string | ISO-8601 timestamp when speech-end was detected. |

transcript

Contains the transcribed text for a completed speech segment.
{
  "type": "transcript",
  "timestamp": "2024-01-15T10:30:05.987Z",
  "text": "Hello, how are you today?",
  "audio_duration_ms": 2340,
  "segment_id": "<segment_id>",
  "segment_index": 0,
  "latency": 320
}
| Field | Type | Description |
| --- | --- | --- |
| type | string | Always "transcript". |
| timestamp | string | ISO-8601 timestamp when the transcript was emitted. |
| text | string | Transcribed text. Format depends on the x-format header. |
| audio_duration_ms | integer | Duration of the captured speech segment in milliseconds. |
| segment_id | string | Unique identifier for this speech segment. Use for deduplication or support correlation. |
| segment_index | integer | Sequential index of this segment within the session, starting at 0. |
| latency | integer | Time in milliseconds from end-of-speech detection to transcript delivery. |
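
One way to use segment_id and segment_index on the client is sketched below: duplicates are dropped by segment_id and the session text is assembled in segment_index order. TranscriptCollector is a hypothetical class for illustration, and it assumes each message has already been decoded from JSON into a dict.

```python
class TranscriptCollector:
    """Collect transcript messages, ignoring repeated segment_id values."""

    def __init__(self) -> None:
        self._by_id: dict[str, dict] = {}

    def add(self, message: dict) -> bool:
        """Store a transcript message; return False if it was already seen."""
        segment_id = message["segment_id"]
        if segment_id in self._by_id:
            return False
        self._by_id[segment_id] = message
        return True

    def full_text(self) -> str:
        """Join segment texts in segment_index order."""
        ordered = sorted(self._by_id.values(), key=lambda m: m["segment_index"])
        return " ".join(m["text"] for m in ordered)
```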

error

Sent when the server encounters a recoverable or fatal error. The connection may remain open after a recoverable error.
| Field | Type | Description |
| --- | --- | --- |
| type | string | Always "error". |
| timestamp | string | ISO-8601 timestamp of the error. |
| message | string | Human-readable description of the error. |
{
  "type": "error",
  "timestamp": "2024-01-15T10:30:10.000Z",
  "message": "STT engine failed to initialize"
}
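
For clients that talk to the WebSocket directly, the four message types can be dispatched on the type field. The sketch below is an assumption-laden illustration: the dataclass and function names are hypothetical (they are not the SDK's event classes), and only the fields documented above are read.

```python
import json
from dataclasses import dataclass
from typing import Union


@dataclass
class ConnectedMessage:
    sample_rate: int
    chunk_size: int


@dataclass
class ProcessingMessage:
    timestamp: str


@dataclass
class TranscriptMessage:
    text: str
    segment_id: str
    segment_index: int
    audio_duration_ms: int
    latency_ms: int


@dataclass
class ErrorMessage:
    message: str
    timestamp: str


ServerMessage = Union[ConnectedMessage, ProcessingMessage, TranscriptMessage, ErrorMessage]


def parse_server_message(raw: str) -> ServerMessage:
    """Decode one JSON text frame into a typed message."""
    data = json.loads(raw)
    kind = data.get("type")
    if kind == "connected":
        config = data["config"]
        return ConnectedMessage(config["sample_rate"], config["chunk_size"])
    if kind == "processing":
        return ProcessingMessage(data["timestamp"])
    if kind == "transcript":
        return TranscriptMessage(
            data["text"],
            data["segment_id"],
            data["segment_index"],
            data["audio_duration_ms"],
            data["latency"],
        )
    if kind == "error":
        return ErrorMessage(data["message"], data["timestamp"])
    raise ValueError(f"unknown message type: {kind!r}")
```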

Python SDK

The official Python SDK wraps the WebSocket connection, audio pacing, and event parsing into a clean async interface.

Installation

pip install gnani-vachana
Requires Python 3.9+.

Authentication

The streaming client requires your API key and language code.
from gnani.stt import GnaniSTTStreamClient

stream = GnaniSTTStreamClient(
    api_key="your-api-key",
    language_code="hi-IN",
)

Stream Audio from a File

Use the async context manager and the stream_audio helper. It handles real-time pacing automatically so frames are sent at the correct cadence for VAD.
import asyncio
from gnani.stt import GnaniSTTStreamClient

async def main():
    async with GnaniSTTStreamClient(
        api_key="your-api-key",
        language_code="hi-IN",
        sample_rate=16000,
    ) as stream:
        with open("audio.pcm", "rb") as f:
            transcripts = await stream.stream_audio(
                f,
                on_transcript=lambda t: print(f"Transcript: {t.text}"),
                on_processing=lambda p: print("Processing..."),
                realtime_pace=True,  # sends frames at real-time cadence
            )

    print(f"Total segments: {len(transcripts)}")

asyncio.run(main())

Iterate Over Events Manually

For lower-level control — handling each event type differently or interleaving sending and receiving — iterate over the stream directly.
import asyncio
from gnani.stt import GnaniSTTStreamClient, StreamTranscriptEvent, StreamProcessingEvent

async def main():
    async with GnaniSTTStreamClient(
        api_key="your-api-key",
        language_code="hi-IN",
    ) as stream:
        with open("audio.pcm", "rb") as f:
            while chunk := f.read(1024):
                await stream.send_audio(chunk)
                await asyncio.sleep(0.032)  # 32 ms per frame at 16 kHz

        async for event in stream:
            if isinstance(event, StreamTranscriptEvent):
                print(f"[Segment {event.segment_index}] {event.text}")
                print(f"  Duration: {event.audio_duration_ms} ms  Latency: {event.latency} ms")
            elif isinstance(event, StreamProcessingEvent):
                print("Processing speech...")

asyncio.run(main())
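
When pacing frames by hand, sleeping a fixed 32 ms after every send lets the time spent inside each send accumulate as drift. A minimal sketch of one alternative is to schedule each frame against a monotonic clock. The function send_paced is hypothetical; its send argument is any async callable that accepts one frame, such as stream.send_audio from the example above.

```python
import asyncio
import time
from typing import Awaitable, Callable, Iterable


async def send_paced(
    frames: Iterable[bytes],
    send: Callable[[bytes], Awaitable[None]],
    interval: float,
) -> int:
    """Send frames one per interval, measured from the first send."""
    start = time.monotonic()
    sent = 0
    for frame in frames:
        # Frame k is due at start + k * interval, regardless of how long
        # earlier sends took.
        delay = start + sent * interval - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)
        await send(frame)
        sent += 1
    return sent
```

For 16 kHz audio the interval would be 0.032 seconds, and for 8 kHz audio 0.064 seconds.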

Using 8 kHz Audio (Telephony)

stream = GnaniSTTStreamClient(
    api_key="your-api-key",
    language_code="en-IN",
    sample_rate=8000,
)

SDK Event Types

All events are typed dataclasses with a raw field containing the full server JSON.
| Event class | Key fields | Description |
| --- | --- | --- |
| StreamConnectedEvent | message, sample_rate, chunk_size | Sent once after the WebSocket handshake. Confirms the active config. |
| StreamProcessingEvent | timestamp | VAD detected end-of-speech; transcription has started. |
| StreamTranscriptEvent | text, segment_index, audio_duration_ms, latency | Completed transcript for a speech segment. |
| StreamErrorEvent | message, timestamp | Server-side error, recoverable or fatal. |

Error Handling

import asyncio
from gnani.stt import (
    GnaniSTTStreamClient,
    StreamConnectionError,  # Could not establish the WebSocket connection
    StreamClosedError,      # Attempted to send on an already-closed stream
    StreamError,            # Server returned an error message mid-session
)

async def main():
    chunk = b"\x00" * 1024  # placeholder: one 1,024-byte frame of silence
    try:
        async with GnaniSTTStreamClient(api_key="your-api-key") as stream:
            await stream.send_audio(chunk)
    except StreamConnectionError as e:
        print(f"Could not connect: {e}")
    except StreamClosedError as e:
        print(f"Stream was already closed: {e}")
    except StreamError as e:
        print(f"Server error: {e.message} (at {e.timestamp})")

asyncio.run(main())

Supported Languages

| Language | Code | Native Script | Example |
| --- | --- | --- | --- |
| Bengali | bn-IN | Bengali (বাংলা) | “আমি ভাত খাই” |
| English | en-IN | Latin | “I am going to the market” |
| Gujarati | gu-IN | Gujarati (ગુજરાતી) | “હું બજાર જાઉં છું” |
| Hindi | hi-IN | Devanagari (हिन्दी) | “मैं बाज़ार जा रहा हूँ” |
| Kannada | kn-IN | Kannada (ಕನ್ನಡ) | “ನಾನು ಮಾರುಕಟ್ಟೆಗೆ ಹೋಗುತ್ತೇನೆ” |
| Malayalam | ml-IN | Malayalam (മലയാളം) | “ഞാൻ ചന്തയിലേക്ക് പോകുന്നു” |
| Marathi | mr-IN | Devanagari (मराठी) | “मी बाजारात जातोय” |
| Punjabi | pa-IN | Gurmukhi (ਪੰਜਾਬੀ) | “ਮੈਂ ਬਾਜ਼ਾਰ ਜਾ ਰਿਹਾ ਹਾਂ” |
| Tamil | ta-IN | Tamil (தமிழ்) | “நான் சந்தைக்கு செல்கிறேன்” |
| Telugu | te-IN | Telugu (తెలుగు) | “నేను మార్కెట్‌కి వెళ్తున్నాను” |
| Hinglish (experimental) | en-hi-in-cm | Latin + Devanagari | “मैं market जा रहा हूँ” |
| Hinglish Latin (experimental) | en-hi-IN-latn | Latin only | “Main market ja raha hu” |
| Auto-detect (experimental) | en-IN,hi-IN,ta-IN,… | All supported | Pass all candidate codes comma-separated in lang_code |
For auto-detection, pass all desired language codes comma-separated in the lang_code header. For example: en-IN,hi-IN,ta-IN,te-IN,kn-IN,ml-IN,gu-IN,mr-IN,bn-IN,pa-IN.
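
A small sketch of building the auto-detect value from a list of candidates follows. The helper build_lang_code is hypothetical, and restricting candidates to the ten regular language codes (excluding the experimental Hinglish codes) is an assumption based on the example above.

```python
from typing import Iterable

# The ten regular codes from the Supported Languages table.
AUTO_DETECT_CODES = (
    "bn-IN", "en-IN", "gu-IN", "hi-IN", "kn-IN",
    "ml-IN", "mr-IN", "pa-IN", "ta-IN", "te-IN",
)


def build_lang_code(candidates: Iterable[str]) -> str:
    """Join candidate codes into a comma-separated lang_code value."""
    codes = list(dict.fromkeys(candidates))  # keep order, drop duplicates
    if not codes:
        raise ValueError("at least one language code is required")
    unknown = [c for c in codes if c not in AUTO_DETECT_CODES]
    if unknown:
        raise ValueError(f"unsupported language codes: {unknown}")
    return ",".join(codes)
```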

Inverse Text Normalization (ITN)

When x-format: transcribe is set, ITN runs on every transcript segment immediately after recognition — converting spoken-form numbers, currency, dates, times, and phone numbers into the compact written form a reader expects. Currently supported for Hindi (hi-IN) and English (en-IN) only. Enabling ITN for other languages has no effect; transcripts are returned verbatim.

What ITN Normalizes

1 — Cardinal & Ordinal Numbers

| Spoken input (ASR) | Written output (ITN) | Rule |
| --- | --- | --- |
| दो हज़ार | 2,000 | Indian comma grouping |
| पाँच लाख बीस हज़ार | 5,20,000 | Lakh-scale grouping |
| five lakh | 5,00,000 | English lakh convention |
| पहला / twenty first | 1st / 21st | Ordinal suffix |
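
The Indian grouping convention in the table places the last three digits in one group and every earlier pair of digits in its own group. The function below is an illustrative sketch of that rule only; it is not the service's ITN implementation, and the name indian_grouping is hypothetical.

```python
def indian_grouping(n: int) -> str:
    """Format an integer with Indian comma grouping (3 digits, then 2s)."""
    sign = "-" if n < 0 else ""
    digits = str(abs(n))
    if len(digits) <= 3:
        return sign + digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    # Peel two-digit groups off the right end of the remaining digits.
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return sign + ",".join(groups + [tail])
```

For example, indian_grouping(520000) returns "5,20,000", matching the lakh-scale row above.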

2 — Currency & Money

| Spoken input (ASR) | Written output (ITN) | Rule |
| --- | --- | --- |
| पाँच सौ रुपये | ₹500 | ₹ + amount |
| तीन रुपये पचास पैसे | ₹3.50 | ₹ + rupees.paise |
| I need five thousand rupees | ₹5,000 | English India pipeline |
| pay do lakh rupees | ₹2,00,000 | Code-mixed en/hi |

3 — Dates

| Spoken input (ASR) | Written output (ITN) | Rule |
| --- | --- | --- |
| बीस जनवरी दो हज़ार पच्चीस | 20 जनवरी 2025 | DD Month YYYY (hi) |
| fifteenth january twenty twenty five | 15th January 2025 | Ordinal Month YYYY (en) |

4 — Times

Indian time-of-day words (सुबह, दोपहर, शाम, रात) automatically map to 24-hour HH:MM output.
| Spoken input (ASR) | Written output (ITN) | Rule |
| --- | --- | --- |
| सुबह पाँच बजे | सुबह 05:00 | सुबह = AM |
| शाम पाँच बजे | शाम 17:00 | शाम = evening (16–20 h) |
| रात के दस बजे | रात 22:00 | रात = night (20–24 h) |
| meeting at five fifteen in the evening | meeting 17:15 in the evening | en, 24-hour |
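
To make the time-of-day mapping concrete, the sketch below picks whichever 24-hour reading of a spoken hour falls inside the window for its time-of-day word. It is an illustration, not the service's logic: the function name is hypothetical, the शाम and रात windows come from the table, सुबह is taken as 0–12 h, and the दोपहर window of 12–16 h is an assumption.

```python
from typing import Optional

# Hour windows as (inclusive start, exclusive end).
PERIOD_WINDOWS = {
    "सुबह": (0, 12),
    "दोपहर": (12, 16),  # assumption: not stated in the table
    "शाम": (16, 20),
    "रात": (20, 24),
}


def to_24_hour(period: str, hour: int, minute: int = 0) -> Optional[str]:
    """Return HH:MM for a spoken 12-hour time, or None if it is ambiguous."""
    low, high = PERIOD_WINDOWS[period]
    # A spoken hour can mean either the AM or the PM reading.
    for candidate in (hour % 12, hour % 12 + 12):
        if low <= candidate < high:
            return f"{candidate:02d}:{minute:02d}"
    return None  # outside the window: leave the phrase unresolved
```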

5 — Phone Numbers & PIN Codes

| Spoken input (ASR) | Written output (ITN) | Rule |
| --- | --- | --- |
| नौ आठ सात छह पाँच चार तीन दो एक शून्य | 9876543210 | 10 digits → phone |
| एक एक शून्य शून्य शून्य एक | 110001 | 6 digits → PIN |
| one two three four five six | 123456 | English digit words |
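
The digit-sequence rule can be sketched as a lookup from spoken digit words to characters, followed by a length check. This is an illustrative approximation rather than the service's implementation; the function names and the length-based labels are assumptions drawn from the table rows.

```python
from typing import Optional

HINDI_DIGITS = {
    "शून्य": "0", "एक": "1", "दो": "2", "तीन": "3", "चार": "4",
    "पाँच": "5", "छह": "6", "सात": "7", "आठ": "8", "नौ": "9",
}
ENGLISH_DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
DIGIT_WORDS = {**HINDI_DIGITS, **ENGLISH_DIGITS}


def digit_words_to_number(utterance: str) -> Optional[str]:
    """Convert a run of digit words to a digit string, else return None."""
    words = utterance.lower().split()
    if not words or any(word not in DIGIT_WORDS for word in words):
        return None
    return "".join(DIGIT_WORDS[word] for word in words)


def classify_digits(number: str) -> str:
    """Label a digit string by length: 10 digits phone, 6 digits PIN."""
    return {10: "phone", 6: "PIN"}.get(len(number), "digits")
```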

6 — Mixed & Code-Mixed Utterances

| Spoken input (ASR) | Written output (ITN) |
| --- | --- |
| कल थ्री फिफ्टी पीएम को पाँच सौ रुपये transfer करना है | कल 15:50 को ₹500 transfer करना है |
| pay do lakh rupees by fifteenth march | pay ₹2,00,000 by 15th March |

Native Script Digits — itn_native_numerals

By default, ITN outputs Western Arabic digits (0–9). Set itn_native_numerals: true in the connection headers to render digits in the native script of the target language.
| Language | Spoken input | false (default) | true (native script) |
| --- | --- | --- | --- |
| Hindi hi-IN | पाँच हज़ार रुपये | ₹5,000 | ₹५,००० |
| English en-IN | five thousand rupees | ₹5,000 | ₹5,000 (Latin, no change) |
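
The effect of itn_native_numerals can be approximated on the client with a character translation table, which may be useful when comparing outputs from the two modes. This is a sketch for Hindi only; the function name is hypothetical and the server's own behavior for other languages is not described here.

```python
# Map Western Arabic digits to Devanagari digits.
DEVANAGARI_DIGITS = str.maketrans("0123456789", "०१२३४५६७८९")


def to_native_digits(text: str, lang_code: str) -> str:
    """Render digits in Devanagari for hi-IN; leave other text unchanged."""
    if lang_code == "hi-IN":
        return text.translate(DEVANAGARI_DIGITS)
    return text
```

Only digit characters are translated, so the ₹ sign and comma grouping are preserved.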

What ITN Does Not Change

ITN intentionally preserves idiomatic and ambiguous phrases to avoid incorrect normalization.
  • दो तीन (meaning a few) stays as text, not 2 or 3
  • कर दो / ले दो (imperative verbs) are kept as words, not treated as cardinal 2