Speech-to-Text (Realtime)

Overview

Stream raw PCM audio frames and receive transcript segments as speech is detected. The server uses Voice Activity Detection (VAD) to identify speech boundaries and returns a transcript for each segment.

Use case	Recommended endpoint
Live microphone / phone call / real-time audio	This endpoint
Short pre-recorded clips ≤ 60 s	STT REST
Large files or bulk jobs	STT Batch

Endpoint

WSS wss://api.vachana.ai/stt/v3/stream

Connection Headers

All configuration is passed as WebSocket upgrade headers at connection time. Headers cannot be changed mid-session — reconnect with new headers to change settings.

Header	Required	Default	Description
`x-api-key-id`	✅	—	Your Gnani API key.
`lang_code`	✅	`en-IN`	BCP-47 language code for transcription. See Supported Languages.
`x-sample-rate`	❌	`16000`	Sample rate of the audio stream in Hz. Accepted values: `8000`, `16000`, `44100`, `48000`. Must match the actual sample rate of your audio source.
`x-format`	❌	`verbatim`	`verbatim` — raw spoken-form output. `transcribe` — enables Inverse Text Normalization (ITN). See ITN.
`itn_native_numerals`	❌	`false`	When `x-format=transcribe`, set `true` to render digits in the native script of the target language (e.g. `₹५,०००` instead of `₹5,000` for Hindi).

Choosing the right sample rate:

Value	When to use
`48000`	Browser `getUserMedia` default; Mac microphone
`44100`	Mac microphone alternate; CD-quality audio
`16000`	Wideband telephony; sent as-is with no resampling
`8000`	Narrowband telephony (legacy PSTN / VoIP)
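The header rules above can be collected into a small helper that validates values before a connection is attempted. This is a minimal sketch: `build_stream_headers` and its parameter names are hypothetical and not part of the SDK, and rejecting `itn_native_numerals` without `x-format=transcribe` is a client-side choice made here, not documented server behaviour. How the resulting dictionary is attached to the WebSocket upgrade request depends on the client library you use.

```python
ALLOWED_SAMPLE_RATES = {8000, 16000, 44100, 48000}
ALLOWED_FORMATS = {"verbatim", "transcribe"}


def build_stream_headers(
    api_key: str,
    lang_code: str,
    sample_rate: int = 16000,
    output_format: str = "verbatim",
    native_numerals: bool = False,
) -> dict[str, str]:
    """Return the upgrade headers listed in the table above as strings."""
    if sample_rate not in ALLOWED_SAMPLE_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate}")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported x-format: {output_format}")
    # Assumption: flag this combination early, since the header only
    # has an effect when ITN is enabled.
    if native_numerals and output_format != "transcribe":
        raise ValueError("itn_native_numerals requires x-format=transcribe")
    return {
        "x-api-key-id": api_key,
        "lang_code": lang_code,
        "x-sample-rate": str(sample_rate),
        "x-format": output_format,
        "itn_native_numerals": "true" if native_numerals else "false",
    }
```

Because headers cannot change mid-session, building them once per connection and reconnecting with a fresh dictionary keeps the configuration explicit.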

Connection Flow

A WebSocket session follows a strict sequence:

1. Client connects: opens a WebSocket to `/stt/v3/stream` with all required headers.
2. Server confirms: immediately sends a `connected` message echoing the active configuration.
3. Client streams audio: continuously sends binary frames of raw PCM audio at a steady real-time cadence.
4. Server detects speech: VAD identifies end-of-speech boundaries and emits a `processing` message to acknowledge that a segment was captured.
5. Server returns transcript: sends a `transcript` message with the transcribed text, segment metadata, and latency.
6. Either side closes: client or server may close the connection at any time.

The processing message is a low-latency signal that audio was captured and transcription has begun. Expect a transcript message shortly after.
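The ordering described in this section can be checked on the client with a small tracker. The sketch below is illustrative only: `SessionFlow` is a hypothetical name, and the one-to-one pairing of `processing` and `transcript` messages is an assumption drawn from the sequence above rather than a documented guarantee.

```python
class SessionFlow:
    """Tracks message order: `connected` first, then processing/transcript pairs."""

    def __init__(self) -> None:
        self.connected = False
        # Segments acknowledged by `processing` but not yet transcribed.
        self.pending_segments = 0

    def on_message(self, msg_type: str) -> None:
        if msg_type == "error":
            return  # errors may arrive at any point in the session
        if not self.connected:
            if msg_type != "connected":
                raise RuntimeError(f"expected 'connected' first, got {msg_type!r}")
            self.connected = True
        elif msg_type == "connected":
            raise RuntimeError("received 'connected' more than once")
        elif msg_type == "processing":
            self.pending_segments += 1
        elif msg_type == "transcript":
            self.pending_segments = max(0, self.pending_segments - 1)
```

A non-zero `pending_segments` when the connection closes suggests that a segment was captured but its transcript never arrived.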

Audio Format & Sending Audio

All audio must be sent as raw PCM binary frames over the WebSocket. No container format (WAV, MP3, etc.) is accepted mid-stream.
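If your source audio is a WAV file, the container header has to be removed before streaming. A minimal sketch using Python's standard `wave` module follows; `wav_to_raw_pcm` is a hypothetical helper name, and the checks mirror the mono, 16-bit requirements in the PCM specification below.

```python
import wave


def wav_to_raw_pcm(wav_file, expected_rate: int = 16000) -> bytes:
    """Return the raw PCM payload of a mono 16-bit WAV file.

    `wav_file` may be a path or a binary file-like object.
    """
    with wave.open(wav_file, "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError("expected mono audio")
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        if wav.getframerate() != expected_rate:
            raise ValueError(
                f"expected {expected_rate} Hz, got {wav.getframerate()} Hz"
            )
        # 16-bit WAV data is already signed little-endian PCM.
        return wav.readframes(wav.getnframes())
```

The returned bytes can then be split into frames and sent; compressed formats such as MP3 need a real decoder and are outside the scope of this sketch.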

PCM Specification

Property	16 kHz	8 kHz
Encoding	PCM signed 16-bit little-endian	PCM signed 16-bit little-endian
Sample Rate	16,000 Hz	8,000 Hz
Channels	1 (mono)	1 (mono)
Samples per chunk	512	512
Bytes per frame	1,024 bytes (512 samples × 2 bytes)	1,024 bytes (512 samples × 2 bytes)
Frame duration	32 ms	64 ms

Sending Rules

Each binary frame must be exactly 1,024 bytes.
Frames must be sent at real-time cadence — one frame every 32 ms (16 kHz) or 64 ms (8 kHz). Do not buffer and burst; this degrades VAD accuracy.
For 44100 Hz and 48000 Hz sources, the server resamples internally. Still send 1,024-byte frames (512 samples), paced to the source rate: roughly 11.6 ms per frame at 44.1 kHz and 10.7 ms at 48 kHz.
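The frame size and cadence follow directly from the specification: 512 samples of 2 bytes each, sent once per 512 / sample_rate seconds. The sketch below splits a PCM buffer into frames and computes the send interval. The helper names are hypothetical, and padding the final short frame with zeros (silence) is an assumption, since the rules above only state that each frame must be exactly 1,024 bytes.

```python
from typing import Iterator

SAMPLES_PER_FRAME = 512
BYTES_PER_SAMPLE = 2
FRAME_BYTES = SAMPLES_PER_FRAME * BYTES_PER_SAMPLE  # 1,024 bytes


def frame_interval_seconds(sample_rate: int) -> float:
    """Seconds of audio in one frame, which is also the send interval."""
    return SAMPLES_PER_FRAME / sample_rate


def iter_frames(pcm: bytes) -> Iterator[bytes]:
    """Yield consecutive 1,024-byte frames from a raw PCM buffer."""
    for start in range(0, len(pcm), FRAME_BYTES):
        frame = pcm[start:start + FRAME_BYTES]
        if len(frame) < FRAME_BYTES:
            # Assumption: pad the last partial frame with silence.
            frame = frame.ljust(FRAME_BYTES, b"\x00")
        yield frame
```

A sender would sleep for `frame_interval_seconds(sample_rate)` between frames, which gives 32 ms at 16 kHz and 64 ms at 8 kHz.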

Server Messages

The server sends JSON text frames. All messages share a type discriminator field and an ISO-8601 timestamp.

`connected`

Sent once immediately after the WebSocket handshake succeeds.

{
  "type": "connected",
  "message": "STT service ready — VAD service connected",
  "timestamp": "2024-01-15T10:30:00.000Z",
  "config": {
    "sample_rate": 16000,
    "chunk_size": 512
  }
}

Field	Type	Description
`type`	`string`	Always `"connected"`.
`message`	`string`	Human-readable status string.
`timestamp`	`string`	ISO-8601 server timestamp.
`config.sample_rate`	`integer`	Active sample rate in Hz, echoed from `x-sample-rate`.
`config.chunk_size`	`integer`	Expected chunk size in samples (always 512).

`processing`

Emitted when VAD detects the end of a speech segment and transcription has begun. Use this as a low-latency acknowledgment that audio was captured.

{
  "type": "processing",
  "timestamp": "2024-01-15T10:30:05.123Z"
}

Field	Type	Description
`type`	`string`	Always `"processing"`.
`timestamp`	`string`	ISO-8601 timestamp when speech-end was detected.

`transcript`

Contains the transcribed text for a completed speech segment.

{
  "type": "transcript",
  "timestamp": "2024-01-15T10:30:05.987Z",
  "text": "Hello, how are you today?",
  "audio_duration_ms": 2340,
  "segment_id": "<segment_id>",
  "segment_index": 0,
  "latency": 320
}

Field	Type	Description
`type`	`string`	Always `"transcript"`.
`timestamp`	`string`	ISO-8601 timestamp when the transcript was emitted.
`text`	`string`	Transcribed text. Format depends on the `x-format` header.
`audio_duration_ms`	`integer`	Duration of the captured speech segment in milliseconds.
`segment_id`	`string`	Unique identifier for this speech segment. Use for deduplication or support correlation.
`segment_index`	`integer`	Sequential index of this segment within the session, starting at `0`.
`latency`	`integer`	Time in milliseconds from end-of-speech detection to transcript delivery.
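The `segment_id` and `segment_index` fields make it straightforward to build a running transcript that ignores duplicates and keeps segments in order. The following is a minimal sketch; `TranscriptAssembler` is a hypothetical class, not part of the SDK, and joining segments with a single space is an arbitrary choice.

```python
class TranscriptAssembler:
    """Collects transcript messages, deduplicated by segment_id."""

    def __init__(self) -> None:
        # segment_id -> (segment_index, text)
        self._segments: dict[str, tuple[int, str]] = {}

    def add(self, message: dict) -> bool:
        """Store a transcript message; return False if it was a duplicate."""
        segment_id = message["segment_id"]
        if segment_id in self._segments:
            return False
        self._segments[segment_id] = (message["segment_index"], message["text"])
        return True

    def text(self) -> str:
        """Return all segment texts joined in segment_index order."""
        ordered = sorted(self._segments.values())
        return " ".join(text for _, text in ordered)
```

Calling `add` for each parsed `transcript` message and `text()` when the session ends yields the full session transcript.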

`error`

Sent when the server encounters a recoverable or fatal error. The connection may remain open after a recoverable error.

{
  "type": "error",
  "timestamp": "2024-01-15T10:30:10.000Z",
  "message": "STT engine failed to initialize"
}

Field	Type	Description
`type`	`string`	Always `"error"`.
`timestamp`	`string`	ISO-8601 timestamp of the error.
`message`	`string`	Human-readable description of the error.
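When working with the raw WebSocket rather than the SDK, the four message shapes above can be dispatched on the `type` field. The sketch below uses only the fields documented in this section. The class names are hypothetical and deliberately differ from the SDK's event classes.

```python
import json
from dataclasses import dataclass
from typing import Union


@dataclass
class ConnectedMsg:
    message: str
    timestamp: str
    sample_rate: int
    chunk_size: int


@dataclass
class ProcessingMsg:
    timestamp: str


@dataclass
class TranscriptMsg:
    timestamp: str
    text: str
    audio_duration_ms: int
    segment_id: str
    segment_index: int
    latency: int


@dataclass
class ErrorMsg:
    timestamp: str
    message: str


ServerMsg = Union[ConnectedMsg, ProcessingMsg, TranscriptMsg, ErrorMsg]


def parse_server_message(raw: str) -> ServerMsg:
    """Parse one JSON text frame into a typed message."""
    data = json.loads(raw)
    kind = data.get("type")
    if kind == "connected":
        config = data["config"]
        return ConnectedMsg(
            data["message"], data["timestamp"],
            config["sample_rate"], config["chunk_size"],
        )
    if kind == "processing":
        return ProcessingMsg(data["timestamp"])
    if kind == "transcript":
        return TranscriptMsg(
            data["timestamp"], data["text"], data["audio_duration_ms"],
            data["segment_id"], data["segment_index"], data["latency"],
        )
    if kind == "error":
        return ErrorMsg(data["timestamp"], data["message"])
    raise ValueError(f"unknown message type: {kind!r}")
```

Raising on an unknown `type` is a design choice for the sketch; a production client might log and skip such messages instead.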

Python SDK

The official Python SDK wraps the WebSocket connection, audio pacing, and event parsing into a clean async interface.

Installation

pip install gnani-vachana

Requires Python 3.10+.

Authentication

The streaming client requires your API key and language code.

from gnani.stt import GnaniSTTStreamClient

stream = GnaniSTTStreamClient(
    api_key="your-api-key",
    language_code="hi-IN",
)

Alternatively, set the `GNANI_API_KEY` environment variable and omit `api_key`:

export GNANI_API_KEY="your-api-key"

from gnani.stt import GnaniSTTStreamClient

# Picks up GNANI_API_KEY from environment automatically
stream = GnaniSTTStreamClient(language_code="hi-IN")

Stream Audio from a File

Use the async context manager and the stream_audio helper. It handles real-time pacing automatically so frames are sent at the correct cadence for VAD.

import asyncio
from gnani.stt import GnaniSTTStreamClient

async def main():
    async with GnaniSTTStreamClient(
        api_key="your-api-key",
        language_code="hi-IN",
        sample_rate=16000,
    ) as stream:
        with open("audio.pcm", "rb") as f:
            transcripts = await stream.stream_audio(
                f,
                on_transcript=lambda t: print(f"Transcript: {t.text}"),
                on_processing=lambda p: print("Processing..."),
                realtime_pace=True,  # sends frames at real-time cadence
            )

    print(f"Total segments: {len(transcripts)}")

asyncio.run(main())

Iterate Over Events Manually

For lower-level control — handling each event type differently or interleaving sending and receiving — iterate over the stream directly.

import asyncio
from gnani.stt import GnaniSTTStreamClient, StreamTranscriptEvent, StreamProcessingEvent

async def send_file(stream, path):
    # Send 1,024-byte frames at real-time cadence: 32 ms per frame at 16 kHz
    with open(path, "rb") as f:
        while chunk := f.read(1024):
            await stream.send_audio(chunk)
            await asyncio.sleep(0.032)

async def main():
    async with GnaniSTTStreamClient(
        api_key="your-api-key",
        language_code="hi-IN",
    ) as stream:
        # Send in a background task so events are handled while audio is still streaming
        sender = asyncio.create_task(send_file(stream, "audio.pcm"))

        async for event in stream:
            if isinstance(event, StreamTranscriptEvent):
                print(f"[Segment {event.segment_index}] {event.text}")
                print(f"  Duration: {event.audio_duration_ms} ms  Latency: {event.latency} ms")
            elif isinstance(event, StreamProcessingEvent):
                print("Processing speech...")

        await sender  # surface any exception raised while sending

asyncio.run(main())

Using 8 kHz Audio (Telephony)

stream = GnaniSTTStreamClient(
    api_key="your-api-key",
    language_code="en-IN",
    sample_rate=8000,
)

SDK Event Types

All events are typed dataclasses with a raw field containing the full server JSON.

Event class	Key fields	Description
`StreamConnectedEvent`	`message`, `sample_rate`, `chunk_size`	Sent once after the WebSocket handshake. Confirms the active config.
`StreamProcessingEvent`	`timestamp`	VAD detected end-of-speech; transcription has started.
`StreamTranscriptEvent`	`text`, `segment_index`, `audio_duration_ms`, `latency`	Completed transcript for a speech segment.
`StreamErrorEvent`	`message`, `timestamp`	Server-side error, recoverable or fatal.

Error Handling

import asyncio
from gnani.stt import (
    GnaniSTTStreamClient,
    StreamConnectionError,  # Could not establish the WebSocket connection
    StreamClosedError,      # Attempted to send on an already-closed stream
    StreamError,            # Server returned an error message mid-session
)

async def main():
    chunk = bytes(1024)  # one frame of silence, standing in for real PCM audio
    try:
        async with GnaniSTTStreamClient(
            api_key="your-api-key",
            language_code="hi-IN",
        ) as stream:
            await stream.send_audio(chunk)
    except StreamConnectionError as e:
        print(f"Could not connect: {e}")
    except StreamClosedError as e:
        print(f"Stream was already closed: {e}")
    except StreamError as e:
        print(f"Server error: {e.message} (at {e.timestamp})")

asyncio.run(main())

Supported Languages

Language	Code	Native Script	Example
Bengali	`bn-IN`	Bengali (বাংলা)	“আমি ভাত খাই”
English	`en-IN`	Latin	“I am going to the market”
Gujarati	`gu-IN`	Gujarati (ગુજરાતી)	“હું બજાર જાઉં છું”
Hindi	`hi-IN`	Devanagari (हिन्दी)	“मैं बाज़ार जा रहा हूँ”
Kannada	`kn-IN`	Kannada (ಕನ್ನಡ)	“ನಾನು ಮಾರುಕಟ್ಟೆಗೆ ಹೋಗುತ್ತೇನೆ”
Malayalam	`ml-IN`	Malayalam (മലയാളം)	“ഞാൻ ചന്തയിലേക്ക് പോകുന്നു”
Marathi	`mr-IN`	Devanagari (मराठी)	“मी बाजारात जातोय”
Punjabi	`pa-IN`	Gurmukhi (ਪੰਜਾਬੀ)	“ਮੈਂ ਬਾਜ਼ਾਰ ਜਾ ਰਿਹਾ ਹਾਂ”
Tamil	`ta-IN`	Tamil (தமிழ்)	“நான் சந்தைக்கு செல்கிறேன்”
Telugu	`te-IN`	Telugu (తెలుగు)	“నేను మార్కెట్‌కి వెళ్తున్నాను”

Inverse Text Normalization (ITN)

When the `x-format` header is set to `transcribe`, ITN runs on every transcript segment immediately after recognition, converting spoken-form numbers, currency, dates, times, and phone numbers into the compact written form a reader expects.

What ITN Normalizes

1 — Cardinal & Ordinal Numbers

Spoken input (ASR)	Written output (ITN)	Rule
दो हज़ार	2,000	Indian comma grouping
पाँच लाख बीस हज़ार	5,20,000	Lakh-scale grouping
five lakh	5,00,000	English lakh convention
पहला / twenty first	1st / 21st	Ordinal suffix
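Indian comma grouping places the last three digits in one group and every group before it in pairs. The sketch below reproduces the written forms in the table for plain integers. It illustrates the grouping rule only and is not the service's ITN implementation; `indian_grouping` is a hypothetical name.

```python
def indian_grouping(n: int) -> str:
    """Format an integer with Indian (lakh/crore) comma grouping."""
    digits = str(abs(n))
    if len(digits) <= 3:
        grouped = digits
    else:
        head, tail = digits[:-3], digits[-3:]
        groups = []
        # Peel two-digit groups off the right of the remaining digits.
        while len(head) > 2:
            groups.insert(0, head[-2:])
            head = head[:-2]
        groups.insert(0, head)
        grouped = ",".join(groups + [tail])
    return ("-" if n < 0 else "") + grouped
```

For example, `indian_grouping(520000)` returns `"5,20,000"`, matching the lakh-scale row above.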

2 — Currency & Money

Spoken input (ASR)	Written output (ITN)	Rule
पाँच सौ रुपये	₹500	₹ + amount
तीन रुपये पचास पैसे	₹3.50	₹ + rupees.paise
I need five thousand rupees	₹5,000	English India pipeline
pay do lakh rupees	₹2,00,000	Code-mixed en/hi

3 — Dates

Spoken input (ASR)	Written output (ITN)	Rule
बीस जनवरी दो हज़ार पच्चीस	20 जनवरी 2025	DD Month YYYY (hi)
fifteenth january twenty twenty five	15th January 2025	Ordinal Month YYYY (en)

4 — Times

Indian time-of-day words (सुबह, दोपहर, शाम, रात) automatically map to 24-hour HH:MM output.

Spoken input (ASR)	Written output (ITN)	Rule
सुबह पाँच बजे	सुबह 05:00	सुबह = AM
शाम पाँच बजे	शाम 17:00	शाम = evening (16–20 h)
रात के दस बजे	रात 22:00	रात = night (20–24 h)
meeting at five fifteen in the evening	meeting 17:15 in the evening	en — 24-hour

5 — Phone Numbers & PIN Codes

Spoken input (ASR)	Written output (ITN)	Rule
नौ आठ सात छह पाँच चार तीन दो एक शून्य	9876543210	10 digits → phone
एक एक शून्य शून्य शून्य एक	110001	6 digits → PIN
one two three four five six	123456	English digit words

6 — Mixed & Code-Mixed Utterances

Spoken input (ASR)	Written output (ITN)
कल थ्री फिफ्टी पीएम को पाँच सौ रुपये transfer करना है	कल 15:50 को ₹500 transfer करना है
pay do lakh rupees by fifteenth march	pay ₹2,00,000 by 15th March

Native Script Digits — `itn_native_numerals`

By default, ITN outputs Western Arabic digits (0–9). Set `itn_native_numerals` to `true` in the connection headers to render digits in the native script of the target language.

Language	Spoken input	`false` (default)	`true` — native script
Hindi `hi-IN`	पाँच हज़ार रुपये	₹5,000	₹५,०००
English `en-IN`	five thousand rupees	₹5,000	₹5,000 (Latin — no change)
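The difference between the two columns is a digit-for-digit substitution. The sketch below shows the mapping for Devanagari only, which is an illustration of the output form and not the service's implementation. Other scripts would need their own digit tables, and `to_devanagari_digits` is a hypothetical helper.

```python
# Map ASCII digits 0-9 to the Devanagari digits U+0966 to U+096F.
DEVANAGARI_DIGITS = str.maketrans("0123456789", "०१२३४५६७८९")


def to_devanagari_digits(text: str) -> str:
    """Replace Western Arabic digits with Devanagari digits; leave the rest."""
    return text.translate(DEVANAGARI_DIGITS)
```

Applied to the Hindi row above, `to_devanagari_digits("₹5,000")` returns `"₹५,०००"`. The currency sign and comma are unchanged.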

What ITN Does Not Change

ITN intentionally preserves idiomatic and ambiguous phrases to avoid incorrect normalization.

दो तीन (meaning a few) stays as text, not 2 or 3
कर दो / ले दो (imperative verbs) are kept as words, not treated as cardinal 2
