LLM cascade routing
Serving every request with a large frontier model is accurate but expensive. Routing cheap queries to a small model and reserving the large model for genuinely hard requests typically cuts cost by 40–80% with less than 1% quality loss.
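To see where a saving of this size can come from, the blended cost can be estimated from the share of traffic each tier handles. The sketch below uses made-up relative prices, a made-up traffic mix, and a made-up classifier overhead (all three are assumptions for illustration, not mittwald pricing); substitute your own numbers.

```python
# Hypothetical relative cost per request for each tier (frontier = 1.0).
# Illustrative assumptions only, not real prices.
RELATIVE_COST = {"simple": 0.02, "moderate": 0.25, "complex": 1.00}

# Hypothetical share of traffic routed to each tier; must sum to 1.
TRAFFIC_MIX = {"simple": 0.50, "moderate": 0.30, "complex": 0.20}

# Every request also pays for one classifier call on the smallest model.
CLASSIFIER_COST = 0.002  # assumption


def blended_cost(mix: dict[str, float], cost: dict[str, float], overhead: float) -> float:
    """Average cost per request when traffic is split across tiers."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * cost[tier] for tier, share in mix.items()) + overhead


def savings_vs_frontier(mix: dict[str, float], cost: dict[str, float], overhead: float) -> float:
    """Fraction saved compared with sending every request to the top tier."""
    return 1.0 - blended_cost(mix, cost, overhead) / cost["complex"]


print(f"{savings_vs_frontier(TRAFFIC_MIX, RELATIVE_COST, CLASSIFIER_COST):.1%}")
```

With these assumed numbers the cascade costs 0.287 of the all-frontier baseline, a saving of about 71%. The result is driven almost entirely by how much traffic is simple enough for the lower tiers.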
This guide builds a three-tier cascade on mittwald AI Hosting using only the openai package:
| Tier | Model | When to use |
|---|---|---|
| 1 — fast | Qwen3.5-0.8B | Simple lookups, yes/no, short factual answers |
| 2 — balanced | Qwen3.6-35B-A3B-FP8 | Multi-step reasoning, code, comparison |
| 3 — frontier | Mistral-Medium-3.5-128B | Complex analysis, long documents, 40+ languages |
For tier 3 you can swap in any large model depending on your workload:
| Tier-3 alternative | Strengths |
|---|---|
| Mistral-Medium-3.5-128B | 40+ languages, 256k context |
| Qwen3.5-122B-A10B-FP8 | Thinking mode, vision, strong coding |
| gpt-oss-120b | Text-centric reasoning, 131k context |
See the model pages for parameter details: Qwen3.5-0.8B, Qwen3.6-35B-A3B-FP8, Mistral-Medium-3.5-128B, Qwen3.5-122B-A10B-FP8.
Setup
user@local $ pip install openai
user@local $ export OPENAI_API_KEY="sk-…"
The router
The router sends each query to Qwen3.5-0.8B and asks it to classify the complexity in one word. Because the classifier runs on the smallest model and returns only a single word (simple, moderate, or complex), its cost is negligible, typically less than 0.1% of the total spend.
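import os

from openai import OpenAI

# The API key comes from the OPENAI_API_KEY variable exported in Setup.
client = OpenAI(
    base_url="https://llm.aihosting.mittwald.de/v1",
    api_key=os.environ["OPENAI_API_KEY"],
)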
# Swap the tier-3 model to match your workload:
# "Mistral-Medium-3.5-128B" — 40+ languages, 256k context
# "Qwen3.5-122B-A10B-FP8" — thinking mode, vision, strong coding
# "gpt-oss-120b" — text-centric reasoning, 131k context
TIERS = {
    "simple": "Qwen3.5-0.8B",
    "moderate": "Qwen3.6-35B-A3B-FP8",
    "complex": "Mistral-Medium-3.5-128B",
}
CLASSIFIER_SYSTEM = """\
Classify the complexity of the user's question with exactly one word: simple, moderate, or complex.
simple — factual lookup, yes/no, basic definition, short translation
moderate — multi-step reasoning, code with explanation, structured comparison
complex — research-level analysis, multi-file code review, long-document synthesis,
or anything requiring deep domain knowledge
Reply with only the single classification word and nothing else."""
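def classify(query: str) -> str:
    """Ask the smallest model for a one-word complexity label."""
    resp = client.chat.completions.create(
        model="Qwen3.5-0.8B",
        messages=[
            {"role": "system", "content": CLASSIFIER_SYSTEM},
            {"role": "user", "content": query},
        ],
        temperature=0.0,
        max_tokens=5,
    )
    # content may be None; also drop stray punctuation such as "Simple."
    raw = resp.choices[0].message.content or ""
    label = raw.strip().lower().strip(".,!\"'")
    # Fall back to the middle tier when the reply is not a known label.
    return label if label in TIERS else "moderate"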
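def answer(query: str, model: str, **kwargs) -> str:
    """Send the query to the given model and return the reply text."""
    # Use 0.7 as the default, but let callers override temperature via kwargs
    # without raising a duplicate-keyword error.
    kwargs.setdefault("temperature", 0.7)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
        **kwargs,
    )
    # content may be None; return an empty string so callers always get a str.
    return resp.choices[0].message.content or ""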
def cascade(query: str) -> dict:
    tier = classify(query)
    model = TIERS[tier]
    text = answer(query, model)
    return {"tier": tier, "model": model, "answer": text}
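Routing quality depends on how often the classifier picks the right tier. A minimal sketch for checking this against a small hand-labelled set follows. The function name `routing_report`, the stub classifier, and the sample data are all hypothetical; to evaluate the real router, pass the `classify` function defined above as `classify_fn`.

```python
from typing import Callable

TIER_ORDER = ["simple", "moderate", "complex"]  # cheapest to most capable


def routing_report(
    labelled: list[tuple[str, str]],
    classify_fn: Callable[[str], str],
) -> dict[str, float]:
    """Compare predicted tiers with hand-assigned ones.

    under_routed: query sent to a smaller tier than labelled (risks quality).
    over_routed:  query sent to a larger tier than labelled (wastes money).
    """
    if not labelled:
        raise ValueError("need at least one labelled query")
    exact = under = over = 0
    for query, expected in labelled:
        diff = TIER_ORDER.index(classify_fn(query)) - TIER_ORDER.index(expected)
        if diff == 0:
            exact += 1
        elif diff < 0:
            under += 1
        else:
            over += 1
    n = len(labelled)
    return {"accuracy": exact / n, "under_routed": under / n, "over_routed": over / n}


# Stand-in classifier so the example runs without network access.
# It routes by query length only, which is an assumption for illustration.
def stub_classify(query: str) -> str:
    words = len(query.split())
    return "simple" if words < 8 else "moderate" if words < 15 else "complex"


sample = [
    ("What is the capital of France?", "simple"),
    ("Write a Python function that merges two sorted lists.", "moderate"),
    ("Prove that this scheduler is starvation-free.", "complex"),
    ("Translate 'good morning' into German, please, and keep it informal.", "simple"),
]
print(routing_report(sample, stub_classify))
```

Under-routing is usually the number to watch, since it is the case where a cheaper model answers a question it may not handle well.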
Usage
queries = [
    "What is the capital of France?",  # simple
    "Write a Python function that merges two sorted lists.",  # moderate
    "Analyse the architectural tradeoffs between event-driven and "
    "request-response patterns for a high-volume payment system.",  # complex
]
for q in queries:
    result = cascade(q)
    print(f"[{result['tier']:8s}] {result['model']}")
    print(f" Q: {q[:70]}")
    print(f" A: {result['answer'][:120]}\n")
Example output (answers vary by model and run; long answers are shown abbreviated):
[simple  ] Qwen3.5-0.8B
Q: What is the capital of France?
A: Paris.
[moderate] Qwen3.6-35B-A3B-FP8
Q: Write a Python function that merges two sorted lists.
A: def merge_sorted(a, b): ...
[complex ] Mistral-Medium-3.5-128B
Q: Analyse the architectural tradeoffs between event-driven and request-r
A: Event-driven architectures decouple producers from consumers…
Escalation on short answers
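Sometimes the classifier labels a query as simple even though it needs more depth, and the small model returns an unhelpfully brief answer. A word-count guard escalates such cases automatically. Legitimately short answers (for example "Paris.") also fall below the threshold and will be escalated, so tune min_words to your workload: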
def cascade_with_escalation(query: str, min_words: int = 20) -> dict:
    tier = classify(query)
    model = TIERS[tier]
    text = answer(query, model)
    # Escalate if the answer is suspiciously short and a bigger tier exists.
    # TIERS keeps insertion order, so its keys run from smallest to largest model.
    tiers = list(TIERS.keys())
    current_idx = tiers.index(tier)
    while len(text.split()) < min_words and current_idx < len(tiers) - 1:
        current_idx += 1
        tier = tiers[current_idx]
        model = TIERS[tier]
        text = answer(query, model)
    return {"tier": tier, "model": model, "answer": text}
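Because every escalation is another paid request, it can help to cap how many times the loop may step up. The sketch below is one possible variant, not part of the guide's code: `escalate`, `max_steps`, and the stub `fake_answer` are hypothetical names, and the answer function is passed in so the logic can be exercised offline.

```python
from typing import Callable

TIER_ORDER = ["simple", "moderate", "complex"]  # cheapest to most capable


def escalate(
    query: str,
    start_tier: str,
    answer_fn: Callable[[str, str], str],
    min_words: int = 20,
    max_steps: int = 1,
) -> dict:
    """Answer at start_tier, stepping up at most max_steps tiers
    while the answer has fewer than min_words words."""
    idx = TIER_ORDER.index(start_tier)
    text = answer_fn(query, TIER_ORDER[idx])
    steps = 0
    while (
        len(text.split()) < min_words
        and idx < len(TIER_ORDER) - 1
        and steps < max_steps
    ):
        idx += 1
        steps += 1
        text = answer_fn(query, TIER_ORDER[idx])
    return {"tier": TIER_ORDER[idx], "answer": text, "calls": steps + 1}


# Stub standing in for a real model call: every tier answers "Paris."
def fake_answer(query: str, tier: str) -> str:
    return "Paris."


print(escalate("What is the capital of France?", "simple", fake_answer))
print(escalate("What is the capital of France?", "simple", fake_answer, max_steps=0))
```

With the default cap of one step, a correct but short answer costs at most two answer calls instead of three. To use the real models, pass something like `lambda q, t: answer(q, TIERS[t])` as `answer_fn`.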
System prompt passthrough
To add a system prompt (e.g. for a customer service persona), pass it through to whichever tier handles the query:
SYSTEM = "You are a concise support agent for mittwald. Answer in the user's language."
def cascade_with_system(query: str) -> dict:
    tier = classify(query)
    model = TIERS[tier]
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": query},
        ],
        temperature=0.7,
    )
    return {"tier": tier, "model": model, "answer": resp.choices[0].message.content}
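The passthrough variant repeats the request code from `answer`. One way to avoid the duplication, sketched here with a hypothetical helper `build_messages`, is to assemble the message list separately and include the system message only when one is given.

```python
from typing import Optional


def build_messages(query: str, system: Optional[str] = None) -> list[dict[str, str]]:
    """Return a chat message list, with the system prompt first if provided."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": query})
    return messages


print(build_messages("How do I cancel my contract?", "You are a concise support agent."))
print(build_messages("What is the capital of France?"))
```

`answer` could then accept an optional `system` argument and pass `build_messages(query, system)` as `messages`, so that every cascade variant shares one request path.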