LLM cascade routing

Serving every request with a large frontier model is accurate but expensive. Routing simple queries to a small model and reserving the large model for genuinely hard requests typically cuts cost by 40–80% with less than 1% quality loss.
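
To see where a saving of that size can come from, the arithmetic can be sketched as a weighted average. The traffic mix and relative per-request prices below are made-up placeholders, not mittwald prices or measured data, and the cost of the classifier call is ignored; substitute your own numbers.

```python
def blended_cost(mix: dict, price: dict) -> float:
    """Average cost per request: share of traffic times price, summed over tiers."""
    return sum(share * price[tier] for tier, share in mix.items())


# Placeholder numbers for illustration only (assumptions, not measurements).
mix = {"simple": 0.5, "moderate": 0.3, "complex": 0.2}      # share of requests per tier
price = {"simple": 1.0, "moderate": 10.0, "complex": 40.0}  # relative cost per request

cascade_cost = blended_cost(mix, price)  # 0.5*1 + 0.3*10 + 0.2*40
baseline_cost = price["complex"]         # every request served by tier 3
saving = 1 - cascade_cost / baseline_cost

print(f"cascade: {cascade_cost:.1f}, baseline: {baseline_cost:.1f}, saving: {saving:.0%}")
```

The saving depends almost entirely on how much traffic the cheaper tiers can absorb, so it is worth measuring your own mix before relying on any headline figure.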

This guide builds a three-tier cascade on mittwald AI Hosting using only the openai package:

Tier | Model | When to use
1 (fast) | Qwen3.5-0.8B | Simple lookups, yes/no, short factual answers
2 (balanced) | Qwen3.6-35B-A3B-FP8 | Multi-step reasoning, code, comparison
3 (frontier) | Mistral-Medium-3.5-128B | Complex analysis, long documents, 40+ languages

For tier 3 you can swap in any large model depending on your workload:

Tier-3 alternative | Strengths
Mistral-Medium-3.5-128B | 40+ languages, 256k context
Qwen3.5-122B-A10B-FP8 | Thinking mode, vision, strong coding
gpt-oss-120b | Text-centric reasoning, 131k context

See the model pages for parameter details: Qwen3.5-0.8B, Qwen3.6-35B-A3B-FP8, Mistral-Medium-3.5-128B, Qwen3.5-122B-A10B-FP8.

Setup

user@local $ pip install openai
user@local $ export OPENAI_API_KEY="sk-…"

The router

The router sends each query to Qwen3.5-0.8B and asks it to classify complexity in one word. Because the classifier output is a single word (simple, moderate, or complex) and the smallest model does the work, the cost is negligible, typically less than 0.1% of the total spend.

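import os
from openai import OpenAI

# The key comes from the OPENAI_API_KEY variable exported in Setup.
client = OpenAI(
    base_url="https://llm.aihosting.mittwald.de/v1",
    api_key=os.environ["OPENAI_API_KEY"],
)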

# Swap the tier-3 model to match your workload:
# "Mistral-Medium-3.5-128B" — 40+ languages, 256k context
# "Qwen3.5-122B-A10B-FP8" — thinking mode, vision, strong coding
# "gpt-oss-120b" — text-centric reasoning, 131k context
TIERS = {
    "simple": "Qwen3.5-0.8B",
    "moderate": "Qwen3.6-35B-A3B-FP8",
    "complex": "Mistral-Medium-3.5-128B",
}

CLASSIFIER_SYSTEM = """\
Classify the complexity of the user's question with exactly one word: simple, moderate, or complex.

simple — factual lookup, yes/no, basic definition, short translation
moderate — multi-step reasoning, code with explanation, structured comparison
complex — research-level analysis, multi-file code review, long-document synthesis,
or anything requiring deep domain knowledge

Reply with only the single classification word and nothing else."""


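def classify(query: str) -> str:
    """Ask the tier-1 model for a one-word complexity label."""
    resp = client.chat.completions.create(
        model="Qwen3.5-0.8B",
        messages=[
            {"role": "system", "content": CLASSIFIER_SYSTEM},
            {"role": "user", "content": query},
        ],
        temperature=0.0,
        max_tokens=5,
    )
    # content can be None; treat that like any other unusable label.
    label = (resp.choices[0].message.content or "").strip().lower()
    # Anything unexpected falls back to the middle tier.
    return label if label in TIERS else "moderate"


def answer(query: str, model: str, **kwargs) -> str:
    """Send the query to the chosen model and return its reply text."""
    # Default temperature; callers may override it through kwargs.
    kwargs.setdefault("temperature", 0.7)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
        **kwargs,
    )
    return resp.choices[0].message.content or ""


def cascade(query: str) -> dict:
    """Classify the query, then answer it with the matching tier."""
    tier = classify(query)
    model = TIERS[tier]
    text = answer(query, model)
    return {"tier": tier, "model": model, "answer": text}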
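
Small models do not always follow the one-word instruction exactly. A reply such as "Simple." or "complex" followed by a newline and extra text fails the exact-match check in classify and is routed to the middle tier. A minimal sketch of a more tolerant parser follows. parse_label and VALID_LABELS are hypothetical names introduced here, not part of the guide's code, and the rule of accepting the first recognised word is an assumption: a reply like "not simple" would still be read as simple.

```python
import re
from typing import Optional

VALID_LABELS = ("simple", "moderate", "complex")  # same keys as TIERS


def parse_label(raw: Optional[str], default: str = "moderate") -> str:
    """Return the first valid label found in the model's reply, else the default."""
    if not raw:
        return default
    # Split into lowercase words, which drops punctuation and whitespace.
    for word in re.findall(r"[a-z]+", raw.lower()):
        if word in VALID_LABELS:
            return word
    return default
```

In classify, the final two statements could then be replaced by a single return parse_label(resp.choices[0].message.content).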

Usage

queries = [
    "What is the capital of France?",  # simple
    "Write a Python function that merges two sorted lists.",  # moderate
    "Analyse the architectural tradeoffs between event-driven and "
    "request-response patterns for a high-volume payment system.",  # complex
]

for q in queries:
    result = cascade(q)
    print(f"[{result['tier']:8s}] {result['model']}")
    print(f" Q: {q[:70]}")
    print(f" A: {result['answer'][:120]}\n")

Example output (answers vary between runs; questions are cut at 70 characters and answers at 120):

[simple  ] Qwen3.5-0.8B
 Q: What is the capital of France?
 A: Paris.

[moderate] Qwen3.6-35B-A3B-FP8
 Q: Write a Python function that merges two sorted lists.
 A: def merge_sorted(a, b): ...

[complex ] Mistral-Medium-3.5-128B
 Q: Analyse the architectural tradeoffs between event-driven and request-r
 A: Event-driven architectures decouple producers from consumers…

Escalation on short answers

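Sometimes a query is misclassified as simple and the small model returns an unhelpfully brief answer. A word-count guard escalates such queries automatically. Keep in mind that correct answers can also be short ("Paris."), so with the default min_words of 20 a plain factual lookup escalates through every tier; set min_words to match the answer lengths you expect: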

def cascade_with_escalation(query: str, min_words: int = 20) -> dict:
    tier = classify(query)
    model = TIERS[tier]
    text = answer(query, model)

    # Escalate if answer is suspiciously short and a bigger tier exists.
    # TIERS is declared smallest to largest, and dicts keep insertion order.
    tiers = list(TIERS.keys())
    current_idx = tiers.index(tier)
    while len(text.split()) < min_words and current_idx < len(tiers) - 1:
        current_idx += 1
        tier = tiers[current_idx]
        model = TIERS[tier]
        text = answer(query, model)

    return {"tier": tier, "model": model, "answer": text}
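
The escalation loop can be exercised without any API calls by passing the answer function in as a parameter. The sketch below is one way to do that. escalate, answer_fn, TIER_ORDER, and the stub fake_answer are hypothetical names used for illustration, and the tier order is assumed to run from smallest to largest as in TIERS.

```python
from typing import Callable

TIER_ORDER = ["simple", "moderate", "complex"]  # smallest to largest, as in TIERS


def escalate(query: str, tier: str, answer_fn: Callable[[str, str], str],
             min_words: int = 20) -> dict:
    """Answer with `tier`, moving up one tier at a time while the answer is too short."""
    idx = TIER_ORDER.index(tier)
    text = answer_fn(query, TIER_ORDER[idx])
    while len(text.split()) < min_words and idx < len(TIER_ORDER) - 1:
        idx += 1
        text = answer_fn(query, TIER_ORDER[idx])
    return {"tier": TIER_ORDER[idx], "answer": text}


def fake_answer(query: str, tier: str) -> str:
    """Stub standing in for a model call: only the largest tier gives a long answer."""
    return "word " * 30 if tier == "complex" else "Too short."


result = escalate("Explain the tradeoffs.", "simple", fake_answer)
print(result["tier"])  # complex
```

With a stub like this, a unit test can confirm that the loop stops at the top tier even when every answer stays short, which is the case that costs the most in production.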

System prompt passthrough

To add a system prompt (e.g. for a customer service persona), pass it through to whichever tier handles the query:

SYSTEM = "You are a concise support agent for mittwald. Answer in the user's language."

def cascade_with_system(query: str) -> dict:
    tier = classify(query)
    model = TIERS[tier]
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": query},
        ],
        temperature=0.7,
    )
    return {"tier": tier, "model": model, "answer": resp.choices[0].message.content}
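
If several functions in this guide need the same persona, one option is a small helper that builds the message list with or without a system prompt. build_messages is a hypothetical helper, not part of the openai package; it only assembles the plain role/content dictionaries already used in the calls above.

```python
from typing import Optional


def build_messages(query: str, system: Optional[str] = None) -> list:
    """Return chat messages, putting the system prompt first when one is given."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": query})
    return messages


msgs = build_messages("How do I reset my password?", system="You are a concise support agent.")
print([m["role"] for m in msgs])  # ['system', 'user']
```

Both answer and the escalation loop could then pass messages=build_messages(query, SYSTEM) so that escalated retries keep the same persona.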