Speech-to-text with Whisper

whisper-large-v3-turbo is served through the same /v1/audio/transcriptions endpoint as the OpenAI Whisper API, so code written for OpenAI keeps working once base_url points to this endpoint. See the Whisper model page for supported formats, language codes, and inference parameters.

Setup

user@local $ pip install openai
user@local $ export OPENAI_API_KEY="sk-…"

For large-file chunking (files over 25 MB):

user@local $ pip install pydub
# pydub requires ffmpeg
user@local $ brew install ffmpeg          # macOS
user@local $ apt-get install ffmpeg       # Debian/Ubuntu
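
pydub hands decoding and encoding to ffmpeg, so a missing binary usually shows up only when the first file is loaded. A minimal sketch of an up-front check, assuming ffmpeg is expected on PATH; ffmpeg_available is a hypothetical helper name, not part of pydub:

```python
import shutil


def ffmpeg_available(binary: str = "ffmpeg") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(binary) is not None


if not ffmpeg_available():
    print("ffmpeg not found on PATH; pydub will not be able to read most formats")
```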

Basic transcription

from openai import OpenAI

client = OpenAI(base_url="https://llm.aihosting.mittwald.de/v1")

with open("recording.mp3", "rb") as f:
    result = client.audio.transcriptions.create(
        model="whisper-large-v3-turbo",
        file=f,
        language="en",             # Always set explicitly — default is "de"
        response_format="json",
        temperature=1.0,
    )

print(result.text)
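
Because uploads are limited to 25 MB (see "Large files" further down), it can help to check the file size before sending a request. The following is a sketch only: needs_chunking and MAX_UPLOAD_BYTES are hypothetical names, and reading "25 MB" as 25 × 1024 × 1024 bytes is an assumption about how the limit is counted.

```python
import os

# Assumption: the 25 MB limit is counted in binary megabytes
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def needs_chunking(path: str, limit: int = MAX_UPLOAD_BYTES) -> bool:
    """Return True if the file at `path` is larger than the upload limit."""
    return os.path.getsize(path) > limit
```

A caller could use this to choose between a single request and the chunked approach for larger recordings.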

Multiple languages

files_and_languages = [
    ("meeting_de.mp3", "de"),
    ("interview_fr.wav", "fr"),
    ("podcast_en.ogg", "en"),
    ("lecture_ja.flac", "ja"),
]

for filename, lang in files_and_languages:
    with open(filename, "rb") as f:
        result = client.audio.transcriptions.create(
            model="whisper-large-v3-turbo",
            file=f,
            language=lang,
            response_format="json",
            temperature=1.0,
        )
    print(f"[{lang}] {result.text[:120]}")

Large files (> 25 MB)

The API accepts files up to 25 MB. Split larger recordings at silence points with pydub so that each request stays under the limit and words are not cut mid-utterance:

from pydub import AudioSegment
from pydub.silence import split_on_silence
from io import BytesIO


def transcribe_large_file(path: str, language: str = "en") -> str:
    """Transcribe a file of any size by splitting at silence points."""
    audio = AudioSegment.from_file(path)

    # Split at pauses longer than 700 ms with silence below -40 dBFS
    chunks = split_on_silence(
        audio,
        min_silence_len=700,
        silence_thresh=-40,
        keep_silence=300,   # Keep 300 ms of silence at each edge for context
    )

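    # Guard: with no detectable pauses, split_on_silence returns the whole
    # recording as one chunk (and an empty list for all-silent audio), so
    # cut anything longer than 60 seconds into fixed-size pieces
    chunk_ms = 60_000
    chunks = [
        piece[i : i + chunk_ms]
        for piece in (chunks or [audio])
        for i in range(0, len(piece), chunk_ms)
    ]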

    transcripts: list[str] = []
    for chunk in chunks:
        buf = BytesIO()
        chunk.export(buf, format="mp3")
        buf.seek(0)
        buf.name = "chunk.mp3"   # openai SDK reads the .name attribute for MIME type

        result = client.audio.transcriptions.create(
            model="whisper-large-v3-turbo",
            file=buf,
            language=language,
            response_format="json",
            temperature=1.0,
        )
        transcripts.append(result.text)

    return " ".join(transcripts)


full_text = transcribe_large_file("long_interview.mp3", language="de")
print(full_text)
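
Splitting at every pause can produce many short chunks, and each chunk costs one request. One possible refinement is to merge neighbouring chunks up to a target length before uploading. The sketch below is not part of pydub; merge_chunks is a hypothetical helper that relies only on len() and +, which AudioSegment supports (its length is in milliseconds).

```python
def merge_chunks(chunks, max_len):
    """Greedily join adjacent chunks while the combined length stays <= max_len.

    Works on any objects that support len() and +, such as pydub AudioSegments.
    A single chunk that is already longer than max_len is kept unchanged.
    """
    merged = []
    current = None
    for chunk in chunks:
        if current is None:
            current = chunk
        elif len(current) + len(chunk) <= max_len:
            current = current + chunk      # still within the target length
        else:
            merged.append(current)         # close the current group
            current = chunk
    if current is not None:
        merged.append(current)
    return merged
```

Inside transcribe_large_file this could be applied as chunks = merge_chunks(chunks, max_len=60_000) just before the upload loop.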

Transcription + summarisation pipeline

Chain Whisper with a chat model to go directly from audio to a written summary:

def transcribe_and_summarise(audio_path: str, language: str = "en") -> dict:
    # Step 1: transcribe
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(
            model="whisper-large-v3-turbo",
            file=f,
            language=language,
            response_format="json",
            temperature=1.0,
        )
    transcript = result.text

    # Step 2: summarise
    summary_resp = client.chat.completions.create(
        model="Qwen3.6-35B-A3B-FP8",
        messages=[
            {
                "role": "system",
                "content": "Summarise the transcript concisely. Use bullet points for key topics.",
            },
            {"role": "user", "content": transcript},
        ],
        temperature=0.7,
        max_tokens=512,
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
    )

    return {
        "transcript": transcript,
        "summary": summary_resp.choices[0].message.content,
    }


output = transcribe_and_summarise("team_standup.mp3", language="en")
print("Transcript:\n", output["transcript"])
print("\nSummary:\n", output["summary"])
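
The pipeline above sends the whole transcript in a single chat request. For long recordings the transcript may not fit into the chat model's context window; the actual limit is not stated here, so the size used below is a placeholder. As a sketch, a hypothetical split_transcript helper can break the text at word boundaries so that each piece is summarised separately and the partial summaries are combined afterwards:

```python
def split_transcript(text: str, max_chars: int) -> list[str]:
    """Split text into pieces of at most max_chars, breaking only at whitespace.

    A single word longer than max_chars is kept whole in its own piece.
    """
    pieces: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) > max_chars and current:
            pieces.append(current)   # current piece is full, start a new one
            current = word
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces


# Placeholder size; choose a value that suits the chat model in use
parts = split_transcript("one two three four", max_chars=8)
print(parts)
```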

Drop-in replacement for OpenAI

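Existing code written for the OpenAI Whisper API needs a different base_url; the API key and the model name (whisper-large-v3-turbo instead of an OpenAI model such as whisper-1) must match this endpoint as well: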

# Before (OpenAI)
client = OpenAI()

# After (mittwald AI Hosting)
client = OpenAI(base_url="https://llm.aihosting.mittwald.de/v1")

All other code — file handling, response_format, language, temperature — stays identical.
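
If the base URL should not be hard-coded, one option is to read it from the environment. The sketch below assumes a variable named OPENAI_BASE_URL; recent versions of the openai SDK appear to read this variable on their own when base_url is not passed, which is worth verifying for the installed version. client_settings is a hypothetical helper.

```python
import os

MITTWALD_BASE_URL = "https://llm.aihosting.mittwald.de/v1"


def client_settings(env=None) -> dict:
    """Build keyword arguments for OpenAI(...) from environment variables."""
    env = os.environ if env is None else env
    return {
        # Fall back to the mittwald endpoint when no override is set
        "base_url": env.get("OPENAI_BASE_URL", MITTWALD_BASE_URL),
        "api_key": env.get("OPENAI_API_KEY"),
    }


# Usage (requires the openai package and a valid key):
# client = OpenAI(**client_settings())
```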

Timestamps and language detection with verbose_json

response_format="verbose_json" returns per-segment timestamps, detected language, and total duration:

with open("recording.mp3", "rb") as f:
    result = client.audio.transcriptions.create(
        model="whisper-large-v3-turbo",
        file=f,
        language="en",
        response_format="verbose_json",
        temperature=1.0,
    )

print(result.text)
print(f"Language: {result.language}")
print(f"Duration: {result.duration}s")
# Recent openai SDK releases return segments as objects, so use attribute access
for seg in (result.segments or []):
    print(f"  [{seg.start:.1f}s–{seg.end:.1f}s] {seg.text}")
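
Per-segment timestamps are enough to build subtitles. The following sketch converts segments into SRT text. segments_to_srt, srt_timestamp, and _field are hypothetical helpers, and the code accepts segments either as dicts or as objects with start, end, and text fields, since the shape can differ between SDK versions.

```python
def _field(segment, name):
    """Read a field from a segment given as either a dict or an object."""
    if isinstance(segment, dict):
        return segment[name]
    return getattr(segment, name)


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(_field(seg, "start"))
        end = srt_timestamp(_field(seg, "end"))
        text = _field(seg, "text").strip()
        blocks.append(f"{index}\n{start} --> {end}\n{text}")
    return "\n\n".join(blocks) + "\n"
```

With the verbose_json result from above, segments_to_srt(result.segments or []) yields text that can be written to a .srt file.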
