LLM cascade routing

Serving every request with a large frontier model is accurate but expensive. Routing cheap queries to a small model and reserving the large model for genuinely hard requests typically cuts cost by 40–80% with less than 1% quality loss.

This guide builds a three-tier cascade on mittwald AI Hosting using only the openai package:

Tier	Model	When to use
1 — fast	`Qwen3.5-0.8B`	Simple lookups, yes/no, short factual answers
2 — balanced	`Qwen3.6-35B-A3B-FP8`	Multi-step reasoning, code, comparison
3 — frontier	`Qwen3.5-122B-A10B-FP8`	Complex analysis, long documents, vision

For tier 3 you can swap in any large model depending on your workload:

Tier-3 alternative	Strengths
`Qwen3.5-122B-A10B-FP8`	Thinking mode, vision, strong coding, 245k context
`gpt-oss-120b`	Text-centric reasoning, 131k context

See the model pages for parameter details: Qwen3.5-0.8B, Qwen3.6-35B-A3B-FP8, Qwen3.5-122B-A10B-FP8.

Setup

user@local $ pip install openai
user@local $ export OPENAI_API_KEY="sk-…"

The router

The router sends each query to Qwen3.5-0.8B and asks it to classify complexity in one word. Because the classifier reply is a single word (simple, moderate, or complex) and the call runs on the smallest model, the added cost is negligible, typically less than 0.1% of the total spend.

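import os
from openai import OpenAI

# The API key comes from the OPENAI_API_KEY environment variable exported in Setup.
client = OpenAI(
    base_url="https://llm.aihosting.mittwald.de/v1",
    api_key=os.environ["OPENAI_API_KEY"],
)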

# Swap the tier-3 model to match your workload:
#   "Qwen3.5-122B-A10B-FP8"   — thinking mode, vision, strong coding, 245k context
#   "gpt-oss-120b"             — text-centric reasoning, 131k context
TIERS = {
    "simple":   "Qwen3.5-0.8B",
    "moderate": "Qwen3.6-35B-A3B-FP8",
    "complex":  "Qwen3.5-122B-A10B-FP8",
}

CLASSIFIER_SYSTEM = """\
Classify the complexity of the user's question with exactly one word: simple, moderate, or complex.

simple   — factual lookup, yes/no, basic definition, short translation
moderate — multi-step reasoning, code with explanation, structured comparison
complex  — research-level analysis, multi-file code review, long-document synthesis,
           or anything requiring deep domain knowledge

Reply with only the single classification word and nothing else."""


def classify(query: str) -> str:
    resp = client.chat.completions.create(
        model="Qwen3.5-0.8B",
        messages=[
            {"role": "system", "content": CLASSIFIER_SYSTEM},
            {"role": "user", "content": query},
        ],
        temperature=0.0,
        max_tokens=5,
    )
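    # message.content can be None, so fall back to an empty string.
    label = (resp.choices[0].message.content or "").strip().lower()
    # Unrecognised labels default to the middle tier.
    return label if label in TIERS else "moderate"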


def answer(query: str, model: str, **kwargs) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
        temperature=0.7,
        **kwargs,
    )
    # message.content can be None; return an empty string in that case.
    return resp.choices[0].message.content or ""


def cascade(query: str) -> dict:
    tier = classify(query)
    model = TIERS[tier]
    text = answer(query, model)
    return {"tier": tier, "model": model, "answer": text}
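Small models do not always follow a "reply with one word" instruction and may add punctuation or wrap the label in a short sentence. The sketch below shows one tolerant way to parse such a reply. `normalize_label` and `VALID_LABELS` are hypothetical names introduced here for illustration; they are not part of the openai package or of the router code above.

```python
import re

# Assumed label set; mirrors the keys of the TIERS mapping in this guide.
VALID_LABELS = ("simple", "moderate", "complex")


def normalize_label(raw, default="moderate"):
    """Return the first valid label found in raw, or default if none is found."""
    if not raw:
        return default
    # Split the reply into lowercase words, ignoring punctuation and digits.
    for word in re.findall(r"[a-z]+", raw.lower()):
        if word in VALID_LABELS:
            return word
    return default
```

Inside classify, the raw reply could be passed through a helper like this before the tier lookup, so that a reply such as "Simple." still routes to tier 1.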

Usage

queries = [
    "What is the capital of France?",                                    # simple
    "Write a Python function that merges two sorted lists.",             # moderate
    "Analyse the architectural tradeoffs between event-driven and "
    "request-response patterns for a high-volume payment system.",       # complex
]

for q in queries:
    result = cascade(q)
    print(f"[{result['tier']:8s}] {result['model']}")
    print(f"  Q: {q[:70]}")
    print(f"  A: {result['answer'][:120]}\n")

Example output (tiers and answers may vary between runs):

[simple  ] Qwen3.5-0.8B
  Q: What is the capital of France?
  A: Paris.

[moderate] Qwen3.6-35B-A3B-FP8
  Q: Write a Python function that merges two sorted lists.
  A: def merge_sorted(a, b): ...

[complex ] Qwen3.5-122B-A10B-FP8
  Q: Analyse the architectural tradeoffs between event-driven and request-response…
  A: Event-driven architectures decouple producers from consumers…

Escalation on short answers

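Some queries are classified as simple even though the small model then returns an unhelpfully brief answer. Add a word-count guard to escalate automatically. Note that a correct but terse answer (such as "Paris.") also falls below the threshold and is escalated, so tune min_words to your workload: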

def cascade_with_escalation(query: str, min_words: int = 20) -> dict:
    tier = classify(query)
    model = TIERS[tier]
    text = answer(query, model)

    # Escalate if the answer is suspiciously short and a bigger tier exists.
    # TIERS is defined smallest to largest, and dicts preserve insertion order.
    tiers = list(TIERS)
    current_idx = tiers.index(tier)
    while len(text.split()) < min_words and current_idx < len(tiers) - 1:
        current_idx += 1
        tier = tiers[current_idx]
        model = TIERS[tier]
        text = answer(query, model)

    return {"tier": tier, "model": model, "answer": text}
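To see how the word-count guard behaves without calling the API, the escalation loop can be isolated and driven with a stub. This is a minimal sketch: `escalate`, `TIER_ORDER`, and `fake_answer` are hypothetical names introduced for illustration, and the stub stands in for the real answer function.

```python
TIER_ORDER = ["simple", "moderate", "complex"]


def escalate(query, start_tier, answer_fn, min_words=20):
    """Walk up TIER_ORDER until answer_fn returns at least min_words words."""
    idx = TIER_ORDER.index(start_tier)
    text = answer_fn(query, TIER_ORDER[idx])
    attempts = [TIER_ORDER[idx]]
    while len(text.split()) < min_words and idx < len(TIER_ORDER) - 1:
        idx += 1
        text = answer_fn(query, TIER_ORDER[idx])
        attempts.append(TIER_ORDER[idx])
    return attempts, text


def fake_answer(query, tier):
    # Stub: every tier gives the same correct one-word answer.
    return "Paris."


attempts, text = escalate("What is the capital of France?", "simple", fake_answer)
print(attempts)  # ['simple', 'moderate', 'complex']
```

The stub shows the cost of the guard: a correct one-word answer is retried on every larger tier, so a single request becomes three model calls. Lowering min_words for questions that expect short answers is one way to limit this.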

System prompt passthrough

To add a system prompt (e.g. for a customer service persona), pass it through to whichever tier handles the query:

SYSTEM = "You are a concise support agent for mittwald. Answer in the user's language."

def cascade_with_system(query: str) -> dict:
    tier = classify(query)
    model = TIERS[tier]
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": query},
        ],
        temperature=0.7,
    )
    return {"tier": tier, "model": model, "answer": resp.choices[0].message.content}
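The function above repeats the request logic from answer. One possible way to avoid that duplication is a small message builder that makes the system prompt optional. `build_messages` is a hypothetical helper introduced here, not something the openai package provides.

```python
from typing import Optional


def build_messages(query: str, system: Optional[str] = None) -> list:
    """Build a chat message list, prepending the system prompt if one is given."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": query})
    return messages
```

The returned list can be passed as the messages argument of client.chat.completions.create, with or without a persona prompt.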

Cost estimate

With a typical support workload (70% simple, 20% moderate, 10% complex) and 100k requests/month at an average of 300 tokens per request:

Scenario	Monthly cost
All requests → Qwen3.5-122B-A10B-FP8	~€30
Three-tier cascade	~€8

The classifier call costs less than €0.10 total and is not shown separately.
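The arithmetic behind an estimate like this is a weighted sum over the traffic mix. The sketch below shows the calculation under stated assumptions: the per-token prices are placeholders chosen for illustration, not mittwald's actual rates, so the result will not match the table. Substitute the figures from your own price list. `monthly_cost`, `PRICES`, and `MIX` are hypothetical names.

```python
def monthly_cost(mix, price_per_million, requests=100_000, tokens_per_request=300):
    """Blended monthly cost in euros for a traffic mix (shares sum to 1)."""
    total_millions = requests * tokens_per_request / 1_000_000
    return sum(
        share * total_millions * price_per_million[tier]
        for tier, share in mix.items()
    )


# Placeholder prices in EUR per million tokens (assumptions, not real rates).
PRICES = {"simple": 0.05, "moderate": 0.30, "complex": 1.00}
# Traffic mix from the support workload described above.
MIX = {"simple": 0.70, "moderate": 0.20, "complex": 0.10}

all_large = monthly_cost({"complex": 1.0}, PRICES)
cascade_cost = monthly_cost(MIX, PRICES)
print(f"all large: {all_large:.2f} EUR, cascade: {cascade_cost:.2f} EUR")
```

Because the complex share is billed at the highest rate, the saving depends mostly on how much traffic the classifier keeps away from tier 3.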
