Qwen3.5-122B-A10B-FP8

Description

Qwen3.5-122B-A10B-FP8 is a Mixture-of-Experts (MoE) language model by Alibaba with 122 billion total parameters, of which approximately 10 billion are active per forward pass. Because only a fraction of the parameters is used for each token, it remains computationally efficient while targeting high-quality chat, agentic workflows, and reasoning tasks.

It supports:

Text generation within a chat completion (text to text)
Tool-calling for agentic workflows
Image understanding (vision)
Thinking / reasoning for step-by-step problem solving

The following limitations apply:

Maximum context length: 245,760 tokens
Thinking mode requires at least 128,000 tokens of remaining context to function properly
Images must be submitted as Base64-encoded data URLs (no remote URLs)
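
As a minimal sketch of the Base64 requirement above, the helper below turns raw image bytes into a data URL and wraps it in a chat message. The function names to_data_url and image_message are illustrative, and the image_url message shape is assumed from the standard OpenAI chat format rather than confirmed by this page.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a Base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_message(prompt: str, image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Build a user message with one text part and one inline image part.

    The content-part layout is an assumption based on the OpenAI chat format.
    """
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": to_data_url(image_bytes, mime_type)}},
        ],
    }
```

The resulting dictionary can be passed as one entry of the messages list in a chat completion request.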

Thinking mode is enabled by default. To turn it off, follow Disabling thinking mode below: the enable_thinking parameter must be nested inside chat_template_kwargs.

Using this model from n8n? The built-in OpenAI Chat Model node can't set chat_template_kwargs. See Reasoning models and thinking mode for a community-node workaround.

Disabling thinking mode

The same request is shown for Python, JavaScript, and PHP, in that order:

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.aihosting.mittwald.de/v1",
    api_key="sk-your-api-key-here",
)

response = client.chat.completions.create(
    model="Qwen3.5-122B-A10B-FP8",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    temperature=0.7,
    top_p=0.8,
    max_tokens=32768,
    extra_body={
        "chat_template_kwargs": {"enable_thinking": False},
        #                        ^^^^^^^^^^^^^^^^^^^^^^^^
        # Must be nested here — passing enable_thinking at the top level
        # is silently ignored by the API.
    },
)

print(response.choices[0].message.content)

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://llm.aihosting.mittwald.de/v1",
  apiKey: "sk-your-api-key-here",
});

const response = await client.chat.completions.create({
  model: "Qwen3.5-122B-A10B-FP8",
  messages: [{ role: "user", content: "What is 2 + 2?" }],
  temperature: 0.7,
  top_p: 0.8,
  max_tokens: 32768,
  // @ts-ignore – vLLM extension; must be nested here, not enable_thinking at top level
  chat_template_kwargs: { enable_thinking: false },
} as any);

console.log(response.choices[0].message.content);

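<?php
// composer require openai-php/client guzzlehttp/guzzle

// Load Composer's autoloader so the OpenAI client class is available.
require __DIR__ . '/vendor/autoload.php';
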
$client = OpenAI::factory()
    ->withBaseUri('https://llm.aihosting.mittwald.de/v1')
    ->withApiKey('sk-your-api-key-here')
    ->make();

$response = $client->chat()->create([
    'model' => 'Qwen3.5-122B-A10B-FP8',
    'messages' => [
        ['role' => 'user', 'content' => 'What is 2 + 2?'],
    ],
    'temperature' => 0.7,
    'top_p' => 0.8,
    'max_tokens' => 32768,
    'chat_template_kwargs' => ['enable_thinking' => false],
    // Must be nested here — passing 'enable_thinking' => false at the top
    // level of the request is silently ignored by the API.
]);

echo $response->choices[0]->message->content;

Reading the response

When thinking mode is enabled (default), the model returns two separate fields:

Field	Contents
`choices[0].message.reasoning_content`	Internal chain-of-thought (may be very long)
`choices[0].message.content`	Final answer

If content is empty, the model placed its answer inside the reasoning block. Disable thinking mode to ensure that content is always populated.

print(response.choices[0].message.reasoning_content)  # internal chain-of-thought
print(response.choices[0].message.content)             # final answer
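
Because reasoning_content is a server-side extension and content can be empty, it may help to read both fields defensively. The following is a sketch only: split_response is a hypothetical helper, and the stand-in response object replaces a real API call so the example runs offline.

```python
from types import SimpleNamespace


def split_response(response):
    """Return (reasoning, answer) from a chat completion response object."""
    message = response.choices[0].message
    # reasoning_content is not part of the standard schema, so it may be absent.
    reasoning = getattr(message, "reasoning_content", None) or ""
    answer = message.content or ""
    return reasoning, answer


# Stand-in for a real response, so the sketch runs without network access.
fake = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(reasoning_content="2 + 2 = 4", content="4"))]
)
reasoning, answer = split_response(fake)
if not answer:
    print("content is empty; consider disabling thinking mode")
else:
    print(answer)
```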

Recommended inference parameters

The recommended settings depend on the use case. Do not use greedy decoding (temperature 0): it can cause performance degradation and repetitions.

Thinking mode (default)

General tasks:

Parameter	Value
`temperature`	1.0
`top_p`	0.95
`top_k`	20
`presence_penalty`	1.5

Precise coding / web development:

Parameter	Value
`temperature`	0.6
`top_p`	0.95
`top_k`	20

Qwen recommends presence_penalty=0.0 for precise coding tasks. If this leads to empty content responses or reasoning-only loops, try presence_penalty=1.0 as a more stable setting. Depending on the framework, Qwen permits values up to 2.0, but higher values can cause errors as well.

Non-thinking mode (`enable_thinking: false`)

General tasks:

Parameter	Value
`temperature`	0.7
`top_p`	0.8
`top_k`	20
`presence_penalty`	1.5

Reasoning / math / complex problem solving:

Parameter	Value
`temperature`	1.0
`top_p`	1.0
`top_k`	40
`presence_penalty`	2.0
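
The tables above can be collected into presets, as in this sketch. PRESETS and sampling_kwargs are hypothetical names. Two assumptions are made: top_k is not a named argument of the OpenAI Python client, so it is placed in extra_body, and the server accepts it there in the same way it accepts chat_template_kwargs.

```python
# Values copied from the tables on this page; the coding preset uses the
# presence_penalty=0.0 that Qwen recommends for precise coding tasks.
PRESETS = {
    ("thinking", "general"): {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5},
    ("thinking", "coding"): {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    ("non_thinking", "general"): {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "presence_penalty": 1.5},
    ("non_thinking", "reasoning"): {"temperature": 1.0, "top_p": 1.0, "top_k": 40, "presence_penalty": 2.0},
}


def sampling_kwargs(mode: str, task: str) -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    params = dict(PRESETS[(mode, task)])
    top_k = params.pop("top_k")  # assumed to be accepted only via extra_body
    return {
        **params,
        "extra_body": {
            "top_k": top_k,
            "chat_template_kwargs": {"enable_thinking": mode == "thinking"},
        },
    }
```

A call would then look like client.chat.completions.create(model=..., messages=..., **sampling_kwargs("thinking", "coding")).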

Output length

Set max_tokens according to task complexity to control cost and latency:

Task type	Recommended `max_tokens`
Standard queries	32,768
Complex problems (math, programming contests)	81,920
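
A small guard can keep max_tokens inside the context window. This sketch assumes that prompt and output tokens share the 245,760-token limit and that the caller already knows the prompt's token count. clamp_max_tokens is a hypothetical helper, not part of any client library.

```python
MAX_CONTEXT_TOKENS = 245_760  # maximum context length stated on this page


def clamp_max_tokens(prompt_tokens: int, requested: int = 32_768) -> int:
    """Reduce max_tokens so that prompt + output stays inside the context window."""
    remaining = MAX_CONTEXT_TOKENS - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt alone exceeds the context window")
    return min(requested, remaining)
```

For a 240,000-token prompt and a requested budget of 81,920, the helper returns the 5,760 tokens that are still available.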

Tips for specific tasks

Vision (image to text)

Always disable thinking mode for vision tasks, because thinking adds latency without improving image understanding:

extra_body={"chat_template_kwargs": {"enable_thinking": False}}

Recommended parameters for vision:

Parameter	Value
`temperature`	0.7
`top_p`	0.8
`top_k`	20
`max_tokens`	512–2048 depending on task

For accurate text extraction (OCR) or data reading, use temperature=0.1 instead.

Always resize images to a maximum of 1024 px on the longest edge before encoding them as Base64, because large images significantly increase time to first token (TTFT). See the Python examples or JavaScript examples for a ready-to-use helper.
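
The target size for the 1024 px rule can be computed without any imaging library, as in this sketch. fit_within is a hypothetical helper; the result would be handed to whatever resize routine is in use, for example Pillow's Image.resize, which is an assumption about the reader's toolchain.

```python
def fit_within(width: int, height: int, max_edge: int = 1024) -> tuple[int, int]:
    """Scale (width, height) down so the longest edge is at most max_edge."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height  # already small enough; never upscale
    scale = max_edge / longest
    # Keep the aspect ratio and avoid a zero-sized edge on extreme ratios.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

A 4000 x 3000 photo maps to 1024 x 768, while an 800 x 600 image is left unchanged.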

Math problems

For best results on mathematical tasks, append the following instruction to your prompt:

Please reason step by step, and put your final answer within \boxed{}.
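
With that instruction in place, the final answer can be pulled out of the response text programmatically. The sketch below uses a hypothetical helper, extract_boxed, which returns the contents of the last \boxed{...} and tracks brace depth so that nested LaTeX such as \frac{1}{2} survives intact.

```python
def extract_boxed(text: str):
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # braces never balanced, e.g. truncated output
```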

Multiple-choice questions

To get consistent, parseable output on multiple-choice tasks, add this to your prompt:

Please show your choice in the 'answer' field with only the choice letter, e.g., 'answer': 'C'.
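
A response that follows this instruction can be parsed with a short regular expression. This is a sketch under the assumption that the model emits the pattern with either single or double quotes. extract_choice is a hypothetical helper; it returns the last match in case the model restates its answer.

```python
import re

# Matches 'answer': 'C' as well as "answer": "C", with optional whitespace.
ANSWER_RE = re.compile(r"""['"]answer['"]\s*:\s*['"]([A-Za-z])['"]""")


def extract_choice(text: str):
    """Return the chosen letter in upper case, or None if no answer field is found."""
    matches = ANSWER_RE.findall(text)
    return matches[-1].upper() if matches else None
```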

Terms of use and licensing

The general terms of use apply. The model is provided by Alibaba under the Apache 2.0 License, and reuse of the generated content is not subject to any additional restrictions.
