OpenAI API compatibility
The dedicated AI Hosting endpoint is OpenAI-compatible. You can point any OpenAI SDK or library at it by changing two values: the base URL and the API key. This page lists what is supported, what is not, and where behavior differs from OpenAI's hosted service.
Switching from OpenAI
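```python
from openai import OpenAI

# Before: OpenAI's hosted service
client = OpenAI(api_key="sk-...")

# After: your dedicated AI Hosting endpoint
client = OpenAI(
    api_key="YOUR_DEDICATED_API_KEY",
    base_url="https://your-company.llm.aihosting.mittwald.de/v1",
)
```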
The model name also changes. Use the ID returned by /v1/models instead of an OpenAI model name like gpt-4o.
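To find the model ID, query GET /v1/models. With the OpenAI SDK configured as above, `client.models.list()` returns the same list. The sketch below uses only the Python standard library so it works without the SDK; it assumes the response follows OpenAI's list shape (`{"object": "list", "data": [{"id": ...}, ...]}`), and the URL and key are the placeholders from above. `model_ids` and `fetch_model_ids` are hypothetical helper names.

```python
import json
import urllib.request

# Placeholders: substitute your own endpoint and dedicated API key.
BASE_URL = "https://your-company.llm.aihosting.mittwald.de/v1"
API_KEY = "YOUR_DEDICATED_API_KEY"


def model_ids(payload: dict) -> list:
    """Extract the model IDs from an OpenAI-style list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str = BASE_URL, api_key: str = API_KEY) -> list:
    """Call GET /v1/models with a bearer token and return the model IDs."""
    request = urllib.request.Request(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return model_ids(json.load(response))
```

Calling `fetch_model_ids()` performs the network request; pass one of the returned IDs as `model` in every request.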
Supported endpoints
| Endpoint | Supported |
|---|---|
| GET /v1/models | ✅ |
| POST /v1/chat/completions | ✅ |
| POST /v1/completions | ✅ (legacy text completions) |
| POST /v1/responses | ✅ |
| POST /v1/embeddings | ❌ (this endpoint serves a generative model, not an embedding model) |
| Assistants API (/v1/assistants, /v1/threads, …) | ❌ |
| Batch API (/v1/batches) | ❌ |
| Fine-tuning API (/v1/fine_tuning) | ❌ |
| Files API (/v1/files) | ❌ |
| Moderations API (/v1/moderations) | ❌ |
| Audio / image endpoints | ❌ |
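When migrating an existing codebase, a small lookup built from this table can flag calls that will fail before you switch the base URL. This is a minimal sketch, assuming calls are identified by HTTP method and path; the audio and image prefixes are OpenAI's standard /v1/audio and /v1/images paths, which the table groups into one row. `endpoint_status` is a hypothetical helper, not part of any SDK.

```python
# Supported (method, path) pairs, taken from the table above.
SUPPORTED = {
    ("GET", "/v1/models"),
    ("POST", "/v1/chat/completions"),
    ("POST", "/v1/completions"),
    ("POST", "/v1/responses"),
}

# Path prefixes the table marks as unsupported.
UNSUPPORTED_PREFIXES = (
    "/v1/embeddings",
    "/v1/assistants",
    "/v1/threads",
    "/v1/batches",
    "/v1/fine_tuning",
    "/v1/files",
    "/v1/moderations",
    "/v1/audio",
    "/v1/images",
)


def endpoint_status(method: str, path: str) -> str:
    """Classify a call as 'supported', 'unsupported', or 'unknown' (not listed)."""
    if (method.upper(), path) in SUPPORTED:
        return "supported"
    # Match the prefix itself or any sub-path, but not e.g. /v1/filesystem.
    if any(path == p or path.startswith(p + "/") for p in UNSUPPORTED_PREFIXES):
        return "unsupported"
    return "unknown"
```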
/v1/chat/completions parameter support
| Parameter | Support | Notes |
|---|---|---|
| model | ✅ | Use the model ID from /v1/models |
| messages | ✅ | |
| stream | ✅ | See Streaming |
| temperature, top_p | ✅ | |
| max_tokens / max_completion_tokens | ✅ | |
| stop | ✅ | |
| n | ✅ | Multiple completions per request |
| presence_penalty, frequency_penalty | ✅ | |
| logprobs, top_logprobs | ✅ | |
| tools, tool_choice | ✅ | See Tool calling |
| parallel_tool_calls | ✅ | false limits the model to one tool call per turn; true (default) allows multiple |
| response_format | ✅ | See Structured outputs |
| seed | ⚠️ | Accepted and passed through; reproducibility is best-effort because GPU computation is not fully deterministic |
| user | ⚠️ | Accepted but ignored |
| logit_bias | ✅ | Token IDs outside the model vocabulary return a validation error |
| stream_options | ✅ | Only valid with stream: true; passing it without streaming returns a validation error. include_usage: true appends a usage chunk at the end of the stream |
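The stream_options row is easiest to see on the wire. With stream: true and include_usage: true, each server-sent event carries a chunk of the reply, and the last chunk before `data: [DONE]` has an empty choices list plus a usage object. The sketch below builds such a request body and folds the decoded event lines back into the full text and token usage; it assumes OpenAI's chunk format and a single choice (n = 1). With the SDK you would instead pass `stream=True, stream_options={"include_usage": True}` to `client.chat.completions.create()` and iterate over the returned chunks. Both function names are hypothetical.

```python
import json
from typing import Iterable, Optional, Tuple


def build_stream_request(model: str, messages: list) -> dict:
    """Body for a streamed chat completion that also reports token usage."""
    return {
        "model": model,
        "messages": messages,
        "stream": True,  # stream_options is rejected unless this is true
        "stream_options": {"include_usage": True},
    }


def collect_stream(lines: Iterable[str]) -> Tuple[str, Optional[dict]]:
    """Join streamed content deltas and capture the trailing usage chunk.

    Assumes n = 1, so every chunk carries at most one choice.
    """
    text_parts = []
    usage = None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank event separators and comment lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                text_parts.append(content)
        if chunk.get("usage"):
            usage = chunk["usage"]  # only the final chunk has a non-null usage
    return "".join(text_parts), usage
```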
Structured outputs
Both JSON mode and JSON schema mode are supported.
JSON object mode constrains the output to valid JSON:

```json
{
  "model": "YOUR_MODEL_ID",
  "messages": [{"role": "user", "content": "Return a JSON object with keys name and age."}],
  "response_format": {"type": "json_object"}
}
```
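JSON object mode guarantees syntactically valid JSON, not a particular set of keys, so decode the reply and check its shape yourself. A minimal sketch working on the raw response dictionary (with the SDK, `response.model_dump()` gives the same structure); `json_mode_request` and `parse_json_reply` are hypothetical helper names. The finish_reason check covers replies cut off by max_tokens, which can leave the JSON incomplete.

```python
import json


def json_mode_request(model: str, prompt: str) -> dict:
    """Chat completion body that asks the server for a JSON object."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {"type": "json_object"},
    }


def parse_json_reply(completion: dict) -> dict:
    """Decode the JSON object in a non-streamed chat completion response."""
    choice = completion["choices"][0]
    # A reply stopped by the token limit can end mid-object, so refuse it.
    if choice.get("finish_reason") == "length":
        raise ValueError("reply was cut off by the token limit")
    value = json.loads(choice["message"]["content"])
    # Valid JSON could still be an array or scalar; insist on an object.
    if not isinstance(value, dict):
        raise ValueError("expected a JSON object")
    return value
```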
JSON schema mode constrains the output to a specific schema:

```json
{
  "model": "YOUR_MODEL_ID",
  "messages": [{"role": "user", "content": "Extract the person's name and age."}],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "person",
      "strict": true,
      "schema": {
        "type": "object",
        "properties": {
          "name": {"type": "string"},
          "age": {"type": "integer"}
        },
        "required": ["name", "age"],
        "additionalProperties": false
      }
    }
  }
}
```
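With strict: true the server constrains generation to the schema, but a cheap defensive check after json.loads keeps downstream code from trusting the shape blindly. The sketch below validates only what the person schema uses (flat string and integer properties, required, additionalProperties: false); it is not a general JSON Schema validator, and `check_flat_object` is a hypothetical name. For arbitrary schemas, the third-party jsonschema package (`jsonschema.validate(instance, schema)`) is the usual choice.

```python
# The schema from the request above.
PERSON_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
    "additionalProperties": False,
}

# Only the JSON Schema types used above. bool is excluded from "integer"
# because Python treats True and False as ints.
_TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
}


def check_flat_object(value: object, schema: dict) -> list:
    """Return a list of problems; an empty list means the value matches."""
    if not isinstance(value, dict):
        return ["top-level value is not an object"]
    problems = []
    properties = schema.get("properties", {})
    for key in schema.get("required", []):
        if key not in value:
            problems.append(f"missing required key {key!r}")
    for key, item in value.items():
        if key not in properties:
            if schema.get("additionalProperties") is False:
                problems.append(f"unexpected key {key!r}")
            continue
        expected = properties[key].get("type")
        check = _TYPE_CHECKS.get(expected)
        if check is not None and not check(item):
            problems.append(f"{key!r} should be of type {expected}")
    return problems
```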
Tool calling
Tool calling (function calling) follows the OpenAI format. Pass your tools in the tools array and set tool_choice to control behavior.
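```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://your-company.llm.aihosting.mittwald.de/v1",
)

# Each tool declares its arguments as a JSON Schema.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City and country"},
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="YOUR_MODEL_ID",
    messages=[{"role": "user", "content": "What is the weather in Berlin?"}],
    tools=tools,
    tool_choice="auto",  # let the model decide whether to call a tool
)

# None if the model answered directly instead of calling a tool.
tool_calls = response.choices[0].message.tool_calls
for call in tool_calls or []:
    # arguments is a JSON-encoded string, e.g. '{"location": "Berlin, Germany"}'
    print(call.id, call.function.name, call.function.arguments)
```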
parallel_tool_calls=False restricts the model to at most one tool call per turn, which simplifies handling in multi-step agent loops.
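The example above stops once the model has requested a tool. To finish the turn, run each requested function locally and send its result back as a message with role tool and the matching tool_call_id, following the OpenAI message format. This is a minimal sketch, assuming the assistant message is available as a plain dict (with the SDK, `response.choices[0].message.model_dump()`); `get_weather` is a hypothetical stand-in and `TOOL_REGISTRY` and `run_tool_calls` are illustrative names.

```python
import json
from typing import Any, Callable, Dict


def get_weather(location: str) -> dict:
    """Hypothetical stand-in for a real weather lookup."""
    return {"location": location, "forecast": "unknown"}


# Map tool names (as declared in `tools`) to local Python callables.
TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def run_tool_calls(assistant_message: dict) -> list:
    """Execute each requested tool and build the `tool` messages to send back."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        name = call["function"]["name"]
        # Arguments arrive as a JSON-encoded string, not a dict.
        arguments = json.loads(call["function"]["arguments"] or "{}")
        func = TOOL_REGISTRY.get(name)
        if func is None:
            output = {"error": f"unknown tool {name}"}
        else:
            output = func(**arguments)
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],  # ties the result to the request
            "content": json.dumps(output),
        })
    return results
```

Append the assistant message and the returned tool messages to messages, then call create() again so the model can answer using the tool output. With parallel_tool_calls=False the list contains at most one entry.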
Configuring client timeouts
LLM requests with long inputs or outputs can take 60 seconds or more. Many HTTP clients and frameworks default to much shorter timeouts. Set your client timeout explicitly to avoid spurious failures.
Python

```python
import httpx
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://your-company.llm.aihosting.mittwald.de/v1",
    # 300 s for read, write, and pool; 10 s to establish the connection
    timeout=httpx.Timeout(300.0, connect=10.0),
)
```

A 300-second (5-minute) read timeout covers most inference workloads. Increase it for very long contexts or under high concurrency.

JavaScript / TypeScript

```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "YOUR_API_KEY",
  baseURL: "https://your-company.llm.aihosting.mittwald.de/v1",
  timeout: 300 * 1000, // 300 seconds in milliseconds
});
```
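Code that calls the endpoint without an SDK needs the same care. A minimal standard-library sketch, assuming a plain JSON POST to /chat/completions with a bearer token; `build_chat_request` and `send` are hypothetical helpers. Note that urllib's timeout applies to each blocking socket operation (connecting and every read) rather than to the request as a whole.

```python
import json
import urllib.request

READ_TIMEOUT_SECONDS = 300  # same budget as the SDK examples above


def build_chat_request(base_url: str, api_key: str, body: dict) -> urllib.request.Request:
    """Prepare a POST /chat/completions request without sending it."""
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = READ_TIMEOUT_SECONDS) -> dict:
    """Send the request and decode the JSON completion."""
    # The timeout limits each socket operation, not the total request time.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For a non-streamed request the server sends nothing until generation finishes, so the first read waits for the whole generation; that single wait is what the 300-second limit has to cover.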