Error handling and retries
This page covers the HTTP error codes your application may receive, how to handle capacity behavior, and how to configure retries correctly for the dedicated endpoint.
HTTP error codes
| Code | Meaning | What to do |
|---|---|---|
400 | Bad request — input too long or malformed | Reduce input length or fix request format (see context limit below) |
401 | Invalid or missing API key | Check Authorization: Bearer YOUR_API_KEY header |
503 | Service temporarily unavailable — endpoint overloaded or restarting | Retry with exponential backoff |
500 | Internal server error | Retry with backoff; contact support if it persists |
502 | Bad gateway | Retry with backoff |
504 | Gateway timeout — request took too long | Reduce input size, lower concurrency, or retry |
Context length limit
Requests whose total token count (prompt + completion) exceeds the model's configured context window are rejected with 400. The response uses the OpenAI error object format:
{
"object": "error",
"message": "This model's maximum context length is X tokens. However, you requested Y tokens ...",
"type": "invalid_request_error",
"param": "messages",
"code": 400
}
Check the context window for your provisioned model in the Getting started guide via the /v1/models endpoint. Plan your input + expected output to stay within that limit.
Request body size limit
The maximum request body size is 25 MB. Requests exceeding this are rejected before reaching the model. This limit is only relevant for multimodal inputs (for example, base64-encoded images). Standard text chat requests are far below this limit.
What to expect under heavy load
The endpoint queues requests internally when all compute slots are occupied. This means:
- Requests do not immediately fail when the model is busy — they wait in the queue.
- Under very high sustained load, queued requests may eventually time out. The connection closes as a connection reset rather than returning an HTTP error code. This is most visible with streaming responses (see Streaming and mid-stream failures below).
- The endpoint returns
503when it cannot accept new requests at all (for example, during a restart or on a true server error).
If you regularly experience high latency or timeouts under your expected load, contact us to review your provisioned capacity.
Retry guidance
Retry on 503, 502, 500. Do not retry on 400 or 401.
Recommended retry pattern:
- On
5xx: wait, then retry - Start with a 1-second wait
- Double the wait on each subsequent attempt (exponential backoff)
- Cap the wait at 60 seconds
- Stop after 5 attempts
Most OpenAI-compatible client libraries have built-in retry logic you can configure directly:
- Python
- JavaScript / TypeScript
import openai
client = openai.OpenAI(
api_key="YOUR_API_KEY",
base_url="https://your-company.llm.aihosting.mittwald.de/v1",
max_retries=5,
)
import OpenAI from "openai";
const client = new OpenAI({
apiKey: "YOUR_API_KEY",
baseURL: "https://your-company.llm.aihosting.mittwald.de/v1",
maxRetries: 5,
});
Streaming and mid-stream failures
For streaming requests ("stream": true), the HTTP 200 response header is written when the first token is ready. If a failure occurs mid-stream (for example, a timeout or a server restart), the connection closes rather than returning an HTTP error code.
Detect this by checking whether the stream ended with the expected stop reason. Retry the full request if not.
Protocol support
The endpoint supports HTTPS only. gRPC is not available.