POST /stt/v3
Speech to Text (REST)
curl --request POST \
  --url https://api.vachana.ai/stt/v3 \
  --header 'Content-Type: multipart/form-data' \
  --header 'X-API-Key-ID: <api-key>' \
  --form audio_file='@example-file' \
  --form language_code=hi-IN

Overview

The REST endpoint transcribes an audio file in a single synchronous HTTP request and returns the transcript in the response body. It is best suited to short, pre-recorded audio clips.
| Use case | Recommended endpoint |
| --- | --- |
| Short clips ≤ 60 s (ideal ≤ 30 s) | This endpoint |
| Live microphone / real-time audio | STT Realtime (WebSocket) |
| Large files or bulk jobs | STT Batch |
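As a rough illustration of the table above, the following sketch measures a WAV clip's duration with Python's standard wave module and maps it to the recommended endpoint. The function names and returned labels are illustrative and not part of the Vachana API; only the 60-second limit comes from this page, and the silent clip is generated purely for the demo.

```python
import io
import wave

MAX_REST_SECONDS = 60.0  # maximum clip duration stated on this page


def wav_duration_seconds(source) -> float:
    """Return the duration of a WAV file (a path or a binary file object)."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def recommend_endpoint(duration_s: float, live: bool = False) -> str:
    """Map a clip to the endpoint the table recommends (labels are illustrative)."""
    if live:
        return "STT Realtime (WebSocket)"
    if duration_s <= MAX_REST_SECONDS:
        return "STT REST"
    return "STT Batch"


def make_silent_wav(seconds: float, rate: int = 8000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence, used only for the demo."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buf.seek(0)
    return buf


clip = make_silent_wav(12.5)
duration = wav_duration_seconds(clip)
print(duration, recommend_endpoint(duration))
```

Other formats (MP3, OGG, FLAC, AAC, M4A) need a third-party decoder to measure duration, which is outside the scope of this sketch.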

Endpoint

POST https://api.vachana.ai/stt/v3
Content-Type: multipart/form-data

Authentication

Pass your API key in the request header.
| Header | Type | Required | Description |
| --- | --- | --- | --- |
| X-API-Key-ID | string | Yes | Your Vachana API key. Obtain one from the Vachana dashboard. |

Request Parameters

All parameters are sent as multipart/form-data fields.
audio_file (file, required)
Audio file to transcribe. Supported formats: WAV, MP3, OGG, FLAC, AAC, M4A. Maximum duration: 60 seconds (ideal ≤ 30 s).
language_code (string, required)
BCP-47 language code. See Supported Languages below. Pass a comma-separated list of codes to enable auto-detection.
preferred_language (string, optional)
Forces processing with the single-language model for the specified code. Must be one of the values passed in language_code. Useful for improving accuracy when the audio is predominantly in one language.
format (enum, default: "verbatim")
verbatim: raw spoken-form output.
transcribe: enables Inverse Text Normalization (ITN), so numbers, currency, dates, and phone numbers are written in their conventional form. See ITN below.
itn_native_numerals (boolean, default: false)
When format=transcribe, set this to true to render digits in the native script of the target language (e.g. ₹५,००० instead of ₹5,000 for Hindi). Has no effect when format=verbatim. Currently supported for hi-IN and en-IN only.
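The constraints described for these parameters can be checked on the client before a request is sent. The sketch below is one possible way to do that; build_form_fields is a hypothetical helper, not part of the API or the SDK, and the code set covers only the ten standard codes (the experimental Hinglish code is left out and can be added if needed).

```python
SUPPORTED_CODES = {
    "bn-IN", "en-IN", "gu-IN", "hi-IN", "kn-IN",
    "ml-IN", "mr-IN", "pa-IN", "ta-IN", "te-IN",
}


def build_form_fields(language_code, preferred_language=None,
                      output_format="verbatim", itn_native_numerals=False):
    """Validate the documented constraints and return the text form fields."""
    codes = [c.strip() for c in language_code.split(",") if c.strip()]
    if not codes:
        raise ValueError("language_code is required")
    unknown = [c for c in codes if c not in SUPPORTED_CODES]
    if unknown:
        raise ValueError(f"unsupported language code(s): {unknown}")
    if output_format not in ("verbatim", "transcribe"):
        raise ValueError("format must be 'verbatim' or 'transcribe'")

    fields = {"language_code": ",".join(codes), "format": output_format}
    if preferred_language is not None:
        # preferred_language must be one of the codes passed in language_code.
        if preferred_language not in codes:
            raise ValueError("preferred_language must be one of language_code")
        fields["preferred_language"] = preferred_language
    # itn_native_numerals has no effect with format=verbatim, so it is only
    # sent when ITN is actually enabled.
    if itn_native_numerals and output_format == "transcribe":
        fields["itn_native_numerals"] = "true"
    return fields


print(build_form_fields("en-IN, hi-IN", preferred_language="hi-IN",
                        output_format="transcribe"))
```

Failing locally on an invalid combination avoids spending a request on something the server would presumably reject with a 400.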

Response

200 — Success

{
  "success": true,
  "request_id": "req_abc123",
  "timestamp": "20251226_143052.123",
  "transcript": "नमस्ते, आप कैसे हैं?"
}
| Field | Type | Description |
| --- | --- | --- |
| success | boolean | true when transcription completed without error. |
| request_id | string | Unique identifier for this request. Use it when contacting support or correlating logs. |
| timestamp | string | Server-side request timestamp in YYYYMMDD_HHMMSS.mmm format. |
| transcript | string | The transcribed text. Format depends on the format parameter. |
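A minimal sketch of reading these fields in Python, using the sample response shown above. parse_stt_response is a hypothetical helper; the page does not state the timestamp's time zone or the shape of a failed response, so the naive datetime and the success check are both assumptions.

```python
import json
from datetime import datetime

TIMESTAMP_FORMAT = "%Y%m%d_%H%M%S.%f"  # YYYYMMDD_HHMMSS.mmm


def parse_stt_response(body: str) -> dict:
    """Decode a 200 response body and convert its timestamp to a datetime."""
    data = json.loads(body)
    if not data.get("success"):
        # Assumption: a body without success=true should be treated as a failure.
        raise ValueError(f"transcription failed (request_id={data.get('request_id')})")
    return {
        "request_id": data.get("request_id"),
        # No time zone is documented, so this is a naive datetime.
        "timestamp": datetime.strptime(data["timestamp"], TIMESTAMP_FORMAT),
        "transcript": data["transcript"],
    }


sample = """{
  "success": true,
  "request_id": "req_abc123",
  "timestamp": "20251226_143052.123",
  "transcript": "नमस्ते, आप कैसे हैं?"
}"""
parsed = parse_stt_response(sample)
print(parsed["request_id"], parsed["timestamp"].isoformat(timespec="milliseconds"))
```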

Error Responses

| Status | Meaning |
| --- | --- |
| 400 | Bad request: invalid parameters or unsupported audio format. |
| 429 | Rate limit exceeded: slow down or contact support to increase limits. |
| 500 | Internal server error: transient issue on our side; retry with backoff. |
| 503 | Service unavailable: the STT service is temporarily down. |
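One way to act on this table is to retry only the statuses that look transient. The sketch below is an assumption-laden example rather than official guidance: the page recommends backoff explicitly only for 500, and treating 429 and 503 the same way is a choice made here. call_with_backoff is a hypothetical helper that accepts any zero-argument callable returning (status_code, body), so it works with whichever HTTP client you use.

```python
import random
import time

RETRYABLE_STATUSES = {429, 500, 503}  # 400 is a caller error and is never retried


def call_with_backoff(send, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    send is any zero-argument callable returning (status_code, body).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status not in RETRYABLE_STATUSES or attempt == max_attempts:
            return status, body
        # Exponential backoff with a little jitter: about 0.5 s, 1 s, 2 s, ...
        delay = base_delay * (2 ** (attempt - 1))
        sleep(delay + random.uniform(0, delay / 10))


# Demo with a fake sender that fails twice before succeeding.
fake_responses = iter([(503, ""), (429, ""), (200, "ok")])
print(call_with_backoff(lambda: next(fake_responses), sleep=lambda s: None))
```

The sleep parameter is injectable so the retry logic can be exercised in tests without real delays.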

Code Example

curl --request POST \
  --url https://api.vachana.ai/stt/v3 \
  --header 'Content-Type: multipart/form-data' \
  --header 'X-API-Key-ID: <api-key>' \
  --form audio_file='@recording.wav' \
  --form language_code=hi-IN \
  --form format=transcribe \
  --form itn_native_numerals=true
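For readers who prefer Python without extra dependencies, the same request can be assembled with the standard library. This is a sketch under stated assumptions: encode_multipart and build_stt_request are hypothetical helpers written for this example, the multipart body is hand-built (field names and filenames containing quotes are not escaped), and the request is only constructed here, not sent.

```python
import urllib.request
import uuid

STT_URL = "https://api.vachana.ai/stt/v3"


def encode_multipart(fields, file_field, filename, file_bytes,
                     file_type="application/octet-stream"):
    """Encode text fields plus one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'
        f'Content-Type: {file_type}\r\n\r\n'.encode("utf-8")
        + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_stt_request(api_key, audio_bytes, filename, fields):
    """Build (but do not send) the POST request for the REST endpoint."""
    body, content_type = encode_multipart(fields, "audio_file", filename, audio_bytes)
    return urllib.request.Request(
        STT_URL,
        data=body,
        method="POST",
        headers={"Content-Type": content_type, "X-API-Key-ID": api_key},
    )


request = build_stt_request(
    api_key="<api-key>",
    audio_bytes=b"RIFF",  # placeholder bytes; read your real file instead
    filename="recording.wav",
    fields={"language_code": "hi-IN", "format": "transcribe",
            "itn_native_numerals": "true"},
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request, timeout=30)
```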

Python SDK

The official Python SDK handles multipart construction, authentication headers, and retries automatically.

Installation

pip install gnani-vachana
Requires Python 3.9+.

Authentication

The client requires three credentials: organization_id, api_key, and user_id. You can pass them directly or load them from environment variables.
from gnani.stt import GnaniSTTClient

client = GnaniSTTClient(
    organization_id="your-organization-id",
    api_key="your-api-key",
    user_id="your-user-id",
)
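If you keep credentials in environment variables, a small loader can fail fast when one is missing. The variable names below (GNANI_ORGANIZATION_ID, GNANI_API_KEY, GNANI_USER_ID) are hypothetical, since this page does not say which names the SDK reads; load_credentials is likewise an illustrative helper, not an SDK function.

```python
import os

# Hypothetical variable names; use whatever your deployment defines.
ENV_VARS = {
    "organization_id": "GNANI_ORGANIZATION_ID",
    "api_key": "GNANI_API_KEY",
    "user_id": "GNANI_USER_ID",
}


def load_credentials(env=os.environ):
    """Collect the three client credentials, failing fast if any is missing."""
    missing = [var for var in ENV_VARS.values() if not env.get(var)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return {name: env[var] for name, var in ENV_VARS.items()}


# Usage with the client shown above:
#   client = GnaniSTTClient(**load_credentials())
```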

Transcribe Audio

result = client.transcribe("recording.wav", language_code="hi-IN")
print(result["transcript"])

Custom Request ID

Pass a request_id to correlate SDK calls with your own logs or support tickets.
result = client.transcribe(
    "call.flac",
    language_code="hi-IN",
    request_id="my-trace-123",
)

Error Handling

from gnani.stt import (
    AuthenticationError,
    InvalidAudioError,
    APIError,
)

try:
    result = client.transcribe("audio.wav", language_code="hi-IN")
    print(result["transcript"])
except AuthenticationError:
    print("Invalid credentials — check your organization_id, api_key, and user_id.")
except InvalidAudioError as e:
    print(f"Bad audio file: {e}")
except APIError as e:
    print(f"API error {e.status_code}: {e}")

Supported Languages

The Vachana API supports 10 Indian languages, plus two experimental modes: Hinglish (code-mixed) and auto-detect.
| Language | Code | Native script | Example |
| --- | --- | --- | --- |
| Bengali | bn-IN | Bengali (বাংলা) | “আমি ভাত খাই” |
| English | en-IN | Latin | “I am going to the market” |
| Gujarati | gu-IN | Gujarati (ગુજરાતી) | “હું બજાર જાઉં છું” |
| Hindi | hi-IN | Devanagari (हिन्दी) | “मैं बाज़ार जा रहा हूँ” |
| Kannada | kn-IN | Kannada (ಕನ್ನಡ) | “ನಾನು ಮಾರುಕಟ್ಟೆಗೆ ಹೋಗುತ್ತೇನೆ” |
| Malayalam | ml-IN | Malayalam (മലയാളം) | “ഞാൻ ചന്തയിലേക്ക് പോകുന്നു” |
| Marathi | mr-IN | Devanagari (मराठी) | “मी बाजारात जातोय” |
| Punjabi | pa-IN | Gurmukhi (ਪੰਜਾਬੀ) | “ਮੈਂ ਬਾਜ਼ਾਰ ਜਾ ਰਿਹਾ ਹਾਂ” |
| Tamil | ta-IN | Tamil (தமிழ்) | “நான் சந்தைக்கு செல்கிறேன்” |
| Telugu | te-IN | Telugu (తెలుగు) | “నేను మార్కెట్‌కి వెళ్తున్నాను” |
| Hinglish (experimental) | en-hi-in-cm | Latin + Devanagari | “मैं market जा रहा हूँ” |
| Auto-detect (experimental) | en-IN,hi-IN,ta-IN,… | All supported | Pass all desired codes comma-separated |
For auto-detection, pass the full set of candidate language codes as a comma-separated value in language_code. For example: en-IN,hi-IN,ta-IN,te-IN,kn-IN,ml-IN,gu-IN,mr-IN,bn-IN,pa-IN.

Inverse Text Normalization (ITN)

ITN converts the spoken-form output of the ASR engine into the conventional written form a reader expects: numbers become digits, currency gets the ₹ symbol, dates are formatted, and phone numbers are compacted. All of this happens in one pass, immediately after transcription.
How to enable: set format=transcribe in the request body.
ITN is currently supported for Hindi (hi-IN) and English (en-IN) only. All other languages use verbatim output regardless of the format value.

What ITN Normalizes

1 — Cardinal & Ordinal Numbers

Whole numbers and positional ranks are formatted using Indian comma grouping: the rightmost three digits form one group, and the remaining digits to their left are grouped in pairs.
| Spoken input (ASR) | Written output (ITN) | Rule |
| --- | --- | --- |
| दो हज़ार | 2,000 | Indian comma grouping |
| पाँच लाख बीस हज़ार | 5,20,000 | Lakh-scale grouping |
| उन्नीस सौ चौरानवे | 1,994 | Hundred-base year form |
| five lakh | 5,00,000 | English lakh convention |
| पहला / twenty first | 1st / 21st | Ordinal suffix |
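The grouping rule is easy to reproduce when you need to format your own numbers consistently with ITN output. The function below is an independent illustration of Indian comma grouping, not the service's implementation; indian_group is a name chosen for this example.

```python
def indian_group(n: int) -> str:
    """Format a non-negative integer with Indian comma grouping."""
    if n < 0:
        raise ValueError("expected a non-negative integer")
    digits = str(n)
    if len(digits) <= 3:
        return digits
    head, tail = digits[:-3], digits[-3:]
    # Split the digits left of the last three into pairs, working from the right.
    pairs = []
    while head:
        pairs.insert(0, head[-2:])
        head = head[:-2]
    return ",".join(pairs + [tail])


print(indian_group(2000))     # 2,000
print(indian_group(520000))   # 5,20,000
print(indian_group(1000000))  # 10,00,000
```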

2 — Currency & Money

All Indian currency expressions — including paise fractions and lakh/crore scales — are formatted with the ₹ symbol and Indian comma grouping.
| Spoken input (ASR) | Written output (ITN) | Rule |
| --- | --- | --- |
| पाँच सौ रुपये | ₹500 | ₹ + amount |
| तीन रुपये पचास पैसे | ₹3.50 | ₹ + rupees.paise |
| दस लाख रुपये | ₹10,00,000 | ₹ + lakh grouping |
| I need five thousand rupees | ₹5,000 | English India pipeline |

3 — Dates

| Spoken input (ASR) | Written output (ITN) | Rule |
| --- | --- | --- |
| बीस जनवरी दो हज़ार पच्चीस | 20 जनवरी 2025 | DD Month YYYY (hi) |
| fifteenth january twenty twenty five | 15th January 2025 | Ordinal Month YYYY (en) |

4 — Times

Indian time-of-day words (सुबह, दोपहर, शाम, रात) automatically map to 24-hour HH:MM output.
| Spoken input (ASR) | Written output (ITN) | Rule |
| --- | --- | --- |
| सुबह पाँच बजे | सुबह 05:00 | सुबह = AM |
| शाम पाँच बजे | शाम 17:00 | शाम = evening (16–20 h) |
| रात के दस बजे | रात 22:00 | रात = night (20–24 h) |
| meeting at five fifteen in the evening | meeting 17:15 in the evening | 24-hour (en) |

5 — Phone Numbers & PIN Codes

Digit streams are concatenated into compact numeric strings. 10-digit streams → mobile number; 6-digit streams → PIN. Repeat prefixes (double/डबल, triple/ट्रिपल) are expanded.
| Spoken input (ASR) | Written output (ITN) | Rule |
| --- | --- | --- |
| नौ आठ सात छह पाँच चार तीन दो एक शून्य | 9876543210 | 10 digits → phone |
| एक एक शून्य शून्य शून्य एक | 110001 | 6 digits → PIN |
| डबल आठ नौ शून्य एक दो तीन चार पाँच छह | 8890123456 | double prefix |
| one two three four five six | 123456 | English digit words |
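To make the concatenation and repeat-prefix rules concrete, here is a small standalone model of them. It is a sketch written for this page, not the ITN engine itself: compact_digits and classify are hypothetical names, the word lists cover only single digits, and real ASR output would need more robust tokenization.

```python
DIGIT_WORDS = {
    "शून्य": "0", "एक": "1", "दो": "2", "तीन": "3", "चार": "4",
    "पाँच": "5", "छह": "6", "सात": "7", "आठ": "8", "नौ": "9",
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
REPEAT_WORDS = {"double": 2, "डबल": 2, "triple": 3, "ट्रिपल": 3}


def compact_digits(utterance: str) -> str:
    """Concatenate spoken digit words, expanding double/triple prefixes."""
    out = []
    repeat = 1
    for token in utterance.lower().split():
        if token in REPEAT_WORDS:
            repeat = REPEAT_WORDS[token]
            continue
        if token not in DIGIT_WORDS:
            raise ValueError(f"not a digit word: {token!r}")
        out.append(DIGIT_WORDS[token] * repeat)
        repeat = 1
    if repeat != 1:
        raise ValueError("repeat prefix without a following digit")
    return "".join(out)


def classify(digits: str) -> str:
    """Label a digit string by length: 10 digits is a phone number, 6 a PIN."""
    return {10: "phone", 6: "PIN"}.get(len(digits), "other")


number = compact_digits("double eight nine zero one two three four five six")
print(number, classify(number))  # 8890123456 phone
```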

6 — Mixed & Code-Mixed Utterances

A single sentence may contain multiple entity types or blend Hindi and English. ITN handles all in one pass, normalizing each entity independently.
| Spoken input (ASR) | Written output (ITN) |
| --- | --- |
| कल थ्री फिफ्टी पीएम को पाँच सौ रुपये transfer करना है | कल 15:50 को ₹500 transfer करना है |

Native Script Digits — itn_native_numerals

By default, ITN outputs Western Arabic digits (0–9) regardless of language. Set itn_native_numerals=true to render digits in the native script of the target language.
| Language | Spoken input | false (default) | true (native script) |
| --- | --- | --- | --- |
| Hindi (hi-IN) | पाँच हज़ार रुपये | ₹5,000 | ₹५,००० |
| English (en-IN) | five thousand rupees | ₹5,000 | ₹5,000 (Latin, no change) |
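If you receive Western Arabic digits and want the native-script form on the client side (for instance, to display both variants), the digit mapping can be approximated with str.translate. This is an illustrative sketch only: to_native_numerals is a hypothetical helper, and it covers just the two languages for which the page documents this option.

```python
# Map Western Arabic digits 0-9 to Devanagari digits.
DEVANAGARI_DIGITS = str.maketrans("0123456789", "०१२३४५६७८९")


def to_native_numerals(text: str, language_code: str) -> str:
    """Render digits in the native script for hi-IN; leave other text untouched."""
    if language_code == "hi-IN":
        return text.translate(DEVANAGARI_DIGITS)
    # en-IN always uses Western Arabic digits, so the text is returned as-is.
    return text


hindi = to_native_numerals("₹5,000", "hi-IN")      # "₹५,०००"
english = to_native_numerals("₹5,000", "en-IN")    # "₹5,000"
```

Punctuation such as the ₹ symbol and the grouping commas passes through unchanged because only the ten digit characters are mapped.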

What ITN Does Not Change

ITN intentionally preserves idiomatic and ambiguous phrases to avoid incorrect normalization.
  • दो तीन (meaning a few) stays as text, not 2 or 3
  • कर दो / ले दो (imperative verbs) are kept as words, not treated as cardinal 2
If a word or phrase is unchanged in the output, treat it as a failure only when the input was unambiguously a numeric entity.

Authorizations

X-API-Key-ID
string
header
required

API key for authentication. Sign up for Vachana to get an API key.

Body

multipart/form-data
audio_file
file
required

Audio file to transcribe. Supported formats: WAV, MP3, OGG, FLAC, AAC, M4A. Maximum duration: 60 seconds (ideal ≤ 30 seconds).

language_code
enum<string>
required

Language code for transcription. Use one of the supported language codes.

Available options: bn-IN, en-IN, gu-IN, hi-IN, kn-IN, ml-IN, mr-IN, pa-IN, ta-IN, te-IN
Example:

"hi-IN"

preferred_language
enum<string>

Optional preferred language for processing when multiple languages are specified. Must be one of the languages in language_code. When set, forces processing with the single-language model for the specified language, which may improve accuracy for predominantly single-language audio.

Available options: bn-IN, en-IN, gu-IN, hi-IN, kn-IN, ml-IN, mr-IN, pa-IN, ta-IN, te-IN
Example:

"hi-IN"

format
enum<string>
default:verbatim

Output format for the transcript.

  • verbatim (default) — Returns the raw spoken-form transcript as recognised by the ASR engine. No post-processing is applied.
  • transcribe — Enables Inverse Text Normalization (ITN). Spoken numeric expressions, currency, dates, times, and phone numbers are automatically converted to their written form (e.g. "five thousand rupees" → "₹5,000"). Currently supported for hi-IN and en-IN only.
Available options: verbatim, transcribe
Example:

"transcribe"

itn_native_numerals
boolean
default:false

When format=transcribe, set to true to render digits in the native script of the target language instead of Western Arabic digits (0–9).

For example, with hi-IN: "पाँच हज़ार रुपये" → "₹५,०००" instead of "₹5,000".

Has no effect when format=verbatim. Currently supported for hi-IN and en-IN only (English always uses Western Arabic digits regardless of this setting).

Example:

true

Response

Successful transcription

success
boolean

Indicates if the transcription was successful

timestamp
string

Request timestamp in format YYYYMMDD_HHMMSS.mmm

transcript
string

The transcribed text from the audio