Whisper-Large-V3-Turbo

The Speech-to-text guide has runnable examples for basic transcription, multi-language usage, large-file chunking, and a transcription + summarisation pipeline.

Description

“Whisper-Large-V3-Turbo” is a multilingual automatic speech recognition (ASR) model developed by OpenAI and optimized for speed and efficiency. It is based on the architecture of the well-known “Whisper-Large-V3” model but uses a lighter decoder to significantly reduce latency with only a minimal loss in accuracy. The model supports over 99 languages and is well suited to transcribing speech input.

The following limitations apply to this model on our platform:

Maximum file size: 25 MB per upload
No explicit context length limit; the practical limit depends on audio duration and file size
Translation is currently not supported (parameter to_language)
Supported output formats: json, verbose_json
- response_format="text" is accepted but still returns a JSON body; use "json" instead
- srt and vtt are not supported; requests for them are rejected with HTTP 400

Supported Input Formats

mp3, ogg, wav, flac
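
The size limit and input formats listed above can be checked on the client before uploading. The following is a minimal sketch, not part of the platform's API: the function name `check_upload` is hypothetical, formats are recognized by file extension only, and "25 MB" is read conservatively as 25,000,000 bytes, which is an assumption.

```python
# Assumption: "25 MB" is interpreted as decimal megabytes (the stricter reading).
MAX_UPLOAD_BYTES = 25 * 1000 * 1000
SUPPORTED_EXTENSIONS = {".mp3", ".ogg", ".wav", ".flac"}


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    # Take the text after the last dot as the extension, case-insensitively.
    dot = filename.rfind(".")
    extension = filename[dot:].lower() if dot != -1 else ""
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or '(no extension)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append(
            f"file is {size_bytes} bytes, limit is {MAX_UPLOAD_BYTES} bytes"
        )
    return problems


if __name__ == "__main__":
    print(check_upload("interview.mp3", 4_200_000))
    print(check_upload("interview.m4a", 30_000_000))
```

For a file on disk, the two arguments can be taken from `pathlib.Path(path).name` and `pathlib.Path(path).stat().st_size`. Files that exceed the limit need to be split or re-encoded before upload.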

Supported values for parameter `language` (ISO-639-1 language codes)

af, ar, az, be, bg, bs, ca, cs, cy, da, de, el, en, es, et, fa, fi, fr, gl, he, hi, hr, hu, hy, id, is, it, ja, kk, kn, ko, lt, lv, mk, mi, mr, ms, ne, nl, no, pl, pt, ro, ru, sk, sl, sr, sv, sw, ta, th, tl, tr, uk, ur, vi, zh

Recommended Inference Parameters

temperature=1.0
top_p=1.0
response_format="json"
language: always set this explicitly (for example language="de") to maximize accuracy. If no value is provided, German ("de") is assumed by default, which may degrade results for audio in other languages.
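
The recommended parameters can be collected in one place so that `language` is never omitted by accident. This is a sketch under assumptions: the helper name `build_transcription_params` is hypothetical, and how the resulting dictionary is sent (client library, endpoint, model identifier) depends on the platform and is not shown here.

```python
# Language codes and output formats copied from the lists in this document.
SUPPORTED_LANGUAGES = frozenset(
    "af ar az be bg bs ca cs cy da de el en es et fa fi fr gl he hi hr hu hy "
    "id is it ja kk kn ko lt lv mk mi mr ms ne nl no pl pt ro ru sk sl sr sv "
    "sw ta th tl tr uk ur vi zh".split()
)
SUPPORTED_RESPONSE_FORMATS = ("json", "verbose_json")


def build_transcription_params(language: str, response_format: str = "json") -> dict:
    """Return request parameters using the recommended defaults.

    `language` is deliberately a required argument: if it were left out,
    the platform would assume German ("de").
    """
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    if response_format not in SUPPORTED_RESPONSE_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    return {
        "language": language,
        "response_format": response_format,
        "temperature": 1.0,
        "top_p": 1.0,
    }


if __name__ == "__main__":
    print(build_transcription_params("en"))
```

Rejecting `"text"`, `"srt"` and `"vtt"` locally avoids a round trip that would either return JSON anyway or fail with HTTP 400.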

Example output — `response_format="json"`

{
  "text": "This is the transcribed text of a speech input.",
  "usage": {
    "type": "duration",
    "seconds": 8
  }
}

Example output — `response_format="verbose_json"`

Returns additional metadata including detected language, duration, and per-segment timestamps:

{
  "text": "This is the transcribed text.",
  "language": "en",
  "duration": "8.0",
  "words": null,
  "segments": [
    {
      "id": 0,
      "avg_logprob": -0.45,
      "text": " This is the transcribed text.",
      "start": 0.0,
      "end": 2.4
    }
  ]
}
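
Because srt output is not available from the platform, subtitles can be assembled on the client from the `segments` of a `verbose_json` response. The sketch below assumes the response has already been parsed into a Python dictionary shaped like the example above; the function names are hypothetical, and the SubRip layout (index, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text, blank line) is the commonly used convention rather than anything defined by this platform.

```python
def format_srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(response: dict) -> str:
    """Build SubRip text from the segments of a verbose_json response."""
    blocks = []
    # "segments" may be missing or null; treat both as "no segments".
    for index, segment in enumerate(response.get("segments") or [], start=1):
        start = format_srt_timestamp(float(segment["start"]))
        end = format_srt_timestamp(float(segment["end"]))
        # Segment text carries a leading space in the example; strip it.
        text = segment["text"].strip()
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)


# The example response from this section, reduced to the fields used here.
example = {
    "text": "This is the transcribed text.",
    "language": "en",
    "duration": "8.0",
    "segments": [
        {"id": 0, "text": " This is the transcribed text.", "start": 0.0, "end": 2.4}
    ],
}

if __name__ == "__main__":
    print(segments_to_srt(example))
    # "duration" is a string in the example, so convert before doing arithmetic.
    print(float(example["duration"]))
```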

Terms of Use and Licensing

The general terms of use apply. The model is provided by OpenAI under the MIT License, and reuse of the generated content is subject to no additional restrictions.
