Error handling and retries

This page covers the HTTP error codes your application may receive, how to handle capacity behavior, and how to configure retries correctly for the dedicated endpoint.

HTTP error codes

Code | Meaning | What to do
400 | Bad request: input too long or malformed | Reduce input length or fix request format (see context limit below)
401 | Invalid or missing API key | Check Authorization: Bearer YOUR_API_KEY header
503 | Service temporarily unavailable: endpoint overloaded or restarting | Retry with exponential backoff
500 | Internal server error | Retry with backoff; contact support if it persists
502 | Bad gateway | Retry with backoff
504 | Gateway timeout: request took too long | Reduce input size, lower concurrency, or retry

Context length limit

Requests whose total token count (prompt + completion) exceeds the model's configured context window are rejected with 400. The response uses the OpenAI error object format:

{
  "object": "error",
  "message": "This model's maximum context length is X tokens. However, you requested Y tokens ...",
  "type": "invalid_request_error",
  "param": "messages",
  "code": 400
}

Check the context window for your provisioned model via the /v1/models endpoint, as described in the Getting started guide. Plan your input plus expected output to stay within that limit.
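To react to this error programmatically, the limit and the requested token count can be read out of the message. The following is a minimal sketch, not an official client: the function names are hypothetical, and the regular expression is an assumption based only on the example message above, so a failed match should be treated as "unknown" rather than as proof that the error is something else.

```python
import re
from typing import Optional, Tuple

# Assumed pattern, derived from the example message above. The exact wording
# may differ between server versions.
_CONTEXT_ERROR = re.compile(
    r"maximum context length is (\d+) tokens.*?requested (\d+) tokens",
    re.DOTALL,
)


def parse_context_error(error_body: dict) -> Optional[Tuple[int, int]]:
    """Return (limit, requested) for a context-length error body, else None."""
    # Compare as strings so both 400 and "400" are accepted.
    if str(error_body.get("code")) != "400":
        return None
    match = _CONTEXT_ERROR.search(str(error_body.get("message", "")))
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def tokens_over_limit(error_body: dict) -> Optional[int]:
    """Number of tokens the request must shed to fit, or None if unknown."""
    parsed = parse_context_error(error_body)
    if parsed is None:
        return None
    limit, requested = parsed
    return max(0, requested - limit)
```

A caller could use the returned difference to trim older messages or to lower the requested completion length before sending a corrected request.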

Request body size limit

The maximum request body size is 25 MB. Requests exceeding this are rejected before reaching the model. This limit is only relevant for multimodal inputs (for example, base64-encoded images). Standard text chat requests are far below this limit.
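For multimodal requests it can help to check the serialized size on the client before sending. The sketch below is illustrative only: the helper names are hypothetical, "25 MB" is read conservatively as 25,000,000 bytes (an assumption, since the exact definition is not stated here), and the size your HTTP client actually sends may differ slightly from what `json.dumps` produces.

```python
import json

# Assumption: 25 MB is interpreted conservatively as 25 * 1000 * 1000 bytes.
MAX_BODY_BYTES = 25 * 1000 * 1000


def encoded_body_size(payload: dict) -> int:
    """Size in bytes of the payload once serialized as UTF-8 JSON."""
    return len(json.dumps(payload).encode("utf-8"))


def fits_body_limit(payload: dict, limit: int = MAX_BODY_BYTES) -> bool:
    """True if the serialized payload is within the body size limit."""
    return encoded_body_size(payload) <= limit
```

Keep in mind that base64 encoding inflates binary data by roughly one third, so an image file well under 25 MB on disk can still push a request over the limit.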

What to expect under heavy load

The endpoint queues requests internally when all compute slots are occupied. This means:

  • Requests do not immediately fail when the model is busy — they wait in the queue.
  • Under very high sustained load, queued requests may eventually time out. In that case the connection is reset instead of returning an HTTP error code. This is most visible with streaming responses (see Streaming and mid-stream failures below).
  • The endpoint returns 503 when it cannot accept new requests at all (for example, during a restart or on a true server error).

If you regularly experience high latency or timeouts under your expected load, contact us to review your provisioned capacity.

Retry guidance

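Retry on 500, 502, and 503. A 504 can also be retried, ideally after reducing input size or concurrency. Do not retry on 400 or 401: both indicate a problem with the request itself, so the same request will fail again.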

Recommended retry pattern:

  1. On 5xx: wait, then retry
  2. Start with a 1-second wait
  3. Double the wait on each subsequent attempt (exponential backoff)
  4. Cap the wait at 60 seconds
  5. Stop after 5 attempts
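The steps above can be sketched as a small helper for clients that do not ship their own retry logic. This is a minimal illustration under stated assumptions, not a reference implementation: `send` is a hypothetical callable that performs one request and returns a `(status, body)` tuple, the helper names are made up for this example, and `sleep` is injectable only so the logic can be tested without waiting.

```python
import time
from typing import Callable, Iterator, Tuple

# Status codes treated as retryable in this sketch.
RETRYABLE_STATUSES = frozenset({500, 502, 503})


def backoff_waits(
    max_attempts: int = 5, base: float = 1.0, cap: float = 60.0
) -> Iterator[float]:
    """Yield the wait before each retry: base, 2*base, 4*base, ... up to cap."""
    # max_attempts total attempts means at most max_attempts - 1 retries.
    for retry in range(max_attempts - 1):
        yield min(cap, base * (2 ** retry))


def call_with_retries(
    send: Callable[[], Tuple[int, object]],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, object]:
    """Call send() until it returns a non-retryable status or attempts run out."""
    waits = backoff_waits(max_attempts)
    while True:
        status, body = send()
        if status not in RETRYABLE_STATUSES:
            return status, body  # success, or an error such as 400/401
        wait = next(waits, None)
        if wait is None:
            return status, body  # attempts exhausted; surface the last error
        sleep(wait)
```

With the defaults, the waits are 1, 2, 4, and 8 seconds, so the 60-second cap only comes into play if you raise the attempt limit.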

Most OpenAI-compatible client libraries have built-in retry logic you can configure directly:

import openai

client = openai.OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://your-company.llm.aihosting.mittwald.de/v1",
    max_retries=5,  # number of retries after the initial attempt
)

Streaming and mid-stream failures

For streaming requests ("stream": true), the HTTP 200 status and response headers are sent once the first token is ready. If a failure occurs mid-stream (for example, a timeout or a server restart), the status code can no longer be changed, so the connection closes instead of returning an HTTP error code.

Detect this by checking whether the stream ended with the expected stop reason. Retry the full request if not.
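One way to implement this check is sketched below. It assumes chunks shaped like OpenAI chat completion stream chunks, already parsed into dicts, where the final chunk carries a non-null `finish_reason`. The function names, the set of accepted finish reasons, and the exception types caught are all assumptions for illustration; real HTTP clients raise library-specific exceptions that you would need to map accordingly.

```python
from typing import Callable, FrozenSet, Iterable, Optional, Tuple

# Assumption: these finish reasons mean the server ended the stream on purpose.
ACCEPTED_REASONS = frozenset({"stop", "length"})


def read_stream(chunks: Iterable[dict]) -> Tuple[str, Optional[str]]:
    """Collect streamed text and the last finish reason seen, if any.

    finish_reason stays None when the stream ends, or the connection drops,
    before the server has sent one.
    """
    parts = []
    finish_reason = None
    try:
        for chunk in chunks:
            for choice in chunk.get("choices", []):
                parts.append((choice.get("delta") or {}).get("content") or "")
                finish_reason = choice.get("finish_reason") or finish_reason
    except (ConnectionError, TimeoutError):
        # Mid-stream failure: keep whatever was received so far.
        pass
    return "".join(parts), finish_reason


def stream_with_retry(
    start_stream: Callable[[], Iterable[dict]],
    max_attempts: int = 3,
    accepted: FrozenSet[str] = ACCEPTED_REASONS,
) -> str:
    """Restart the full request until a stream ends with an accepted reason."""
    for _ in range(max_attempts):
        text, finish_reason = read_stream(start_stream())
        if finish_reason in accepted:
            return text
    raise RuntimeError("stream did not complete within the attempt limit")
```

Because the retry restarts the whole request, any partial text already shown to a user should be discarded or replaced rather than appended to.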

Protocol support

The endpoint supports HTTPS only. gRPC is not available.