Qwen3-VL-Reranker-2B

Description

"Qwen3-VL-Reranker-2B" is a vision-language cross-encoder reranking model by Alibaba. Given a query and a list of candidate documents, it scores each (query, document) pair by semantic relevance and returns a ranked list. Both the query and documents can contain images, making it suitable for multimodal RAG pipelines.

It is suitable for:

  • Reranking text candidates retrieved from a vector index (RAG pipelines)
  • Reranking documents that contain images alongside text
  • Improving retrieval precision as a second pass after embedding-based search
  • Scoring query–document relevance for any ranking task

The following limitations apply:

  • Accessible only via /v1/rerank — not via /v1/chat/completions
  • Maximum context length per (query + document) pair: 32,768 tokens
  • Maximum images per request: 5; maximum videos: 1
  • Maximum image size: 1,310,720 pixels — larger images are downscaled proportionally (1 token per 32 × 32 pixel block)
  • No streaming — all scores are returned in a single response
  • Not for text generation, tool calling, or conversation
  • Write instruction values in English for best results, even when querying multilingual documents — the model's training data is predominantly English
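
The image limit above implies a rough per-image token ceiling: 1,310,720 pixels at one token per 32 × 32 block is 1,310,720 / 1,024 = 1,280 tokens. The sketch below estimates image tokens when budgeting the 32,768-token context. The function name and the rounding behaviour (truncating the scaled size, rounding partial blocks up) are assumptions; the service may round differently, so treat the result as an estimate only.

```python
import math

MAX_PIXELS = 1_310_720  # per-image pixel limit from the limitations list
BLOCK = 32              # one token per 32 x 32 pixel block


def estimate_image_tokens(width: int, height: int) -> int:
    """Rough token estimate for one image after proportional downscaling.

    Assumption: the scaled size is truncated to whole pixels and partial
    blocks count as a full token. The real service may round differently.
    """
    pixels = width * height
    if pixels > MAX_PIXELS:
        # Scale both sides by the same factor so the area fits the limit
        scale = math.sqrt(MAX_PIXELS / pixels)
        width = int(width * scale)
        height = int(height * scale)
    return math.ceil(width / BLOCK) * math.ceil(height / BLOCK)


print(estimate_image_tokens(1024, 1024))  # 32 x 32 blocks = 1024
```

Because partial blocks are rounded up here, the estimate for a downscaled image can land slightly above 1,280.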

API usage

Qwen3-VL-Reranker-2B is served via the /v1/rerank endpoint, which follows the Cohere reranking API format.

Basic reranking

import os
import requests

# Candidate passages to score against the query
documents = [
    "Payment is due within 30 days of invoice.",
    "The vendor must deliver goods within 14 business days.",
    "Late payments incur a 2% monthly surcharge.",
]

response = requests.post(
    "https://llm.aihosting.mittwald.de/v1/rerank",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "Qwen3-VL-Reranker-2B",
        "query": "What are the payment terms?",
        "documents": documents,
    },
    timeout=30,
)
response.raise_for_status()

# Print candidates from most to least relevant
for result in sorted(response.json()["results"], key=lambda r: r["relevance_score"], reverse=True):
    print(f"[{result['relevance_score']:.3f}] {documents[result['index']][:80]}")

Response format

{
  "results": [
    { "index": 0, "relevance_score": 0.912 },
    { "index": 2, "relevance_score": 0.874 },
    { "index": 1, "relevance_score": 0.031 }
  ]
}

index is the position in the original documents array. relevance_score is a float between 0 and 1 — higher is more relevant.
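
Mapping the results back to the original texts is a common next step. The sketch below does that for the sample response shown above and also drops low-scoring hits. The helper name `select_documents` and the 0.5 threshold are illustrative assumptions, not part of the API; a suitable cut-off depends on your data.

```python
def select_documents(response_json, documents, top_n=3, min_score=0.0):
    """Return (score, text) pairs, best first, dropping hits below min_score."""
    ranked = sorted(
        response_json["results"],
        key=lambda r: r["relevance_score"],
        reverse=True,
    )
    kept = [r for r in ranked if r["relevance_score"] >= min_score]
    # `index` points into the documents list that was sent with the request
    return [(r["relevance_score"], documents[r["index"]]) for r in kept[:top_n]]


documents = [
    "Payment is due within 30 days of invoice.",
    "The vendor must deliver goods within 14 business days.",
    "Late payments incur a 2% monthly surcharge.",
]
sample = {
    "results": [
        {"index": 0, "relevance_score": 0.912},
        {"index": 2, "relevance_score": 0.874},
        {"index": 1, "relevance_score": 0.031},
    ]
}

selected = select_documents(sample, documents, top_n=2, min_score=0.5)
for score, text in selected:
    print(f"[{score:.3f}] {text}")
```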

Custom instruction

The optional instruction field tells the model which retrieval task to optimise for. Omitting it uses a generic default ("Given a search query, retrieve relevant candidates that answer the query."):

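# Assumes `os` and `requests` are imported as in the basic example and that
# `chunks` is a list of text passages (for example, OCR output split by section).
response = requests.post(
    "https://llm.aihosting.mittwald.de/v1/rerank",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "Qwen3-VL-Reranker-2B",
        "query": "Total amount due on this invoice",
        "documents": chunks,
        "instruction": "Given an OCR-extracted invoice, find sections containing payment amounts or totals.",
    },
    timeout=30,
)
response.raise_for_status()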

Domain-specific instructions can yield a 1–5% precision improvement on specialised corpora (legal, medical, financial). Always write the instruction in English, even when querying non-English documents.
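
One way to keep instructions consistent across an application is to build the request body in a single place and add the instruction field only when a domain-specific one exists. The following is a minimal sketch: the helper `build_rerank_payload` and the `INSTRUCTIONS` table are hypothetical, and the "legal" wording is an illustrative example rather than a recommended prompt.

```python
MODEL = "Qwen3-VL-Reranker-2B"

# Hypothetical per-domain instructions, written in English as advised above.
INSTRUCTIONS = {
    "invoice": "Given an OCR-extracted invoice, find sections containing payment amounts or totals.",
    "legal": "Given a legal question, retrieve contract clauses that answer it.",  # illustrative wording
}


def build_rerank_payload(query, documents, domain=None):
    """Build the JSON body for /v1/rerank.

    If no instruction is known for the domain, the field is left out so
    that the generic default applies.
    """
    payload = {"model": MODEL, "query": query, "documents": list(documents)}
    instruction = INSTRUCTIONS.get(domain)
    if instruction:
        payload["instruction"] = instruction
    return payload


generic = build_rerank_payload("What are the payment terms?", ["Payment is due within 30 days."])
invoice = build_rerank_payload("Total amount due", ["Total: 120 EUR"], domain="invoice")
```

The resulting dictionary can be passed as the json argument of requests.post, as in the examples above.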

Integrating into a RAG pipeline

Use the reranker as a second pass after vector search. Retrieve a wider candidate set from your vector index, then narrow it down:

import math
import os

import requests

# Assumed to exist already: `query` (str), `embed` (returns one vector per
# input text) and `index` (a list of (vector, chunk_text) pairs).

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x**2 for x in a)) * math.sqrt(sum(x**2 for x in b))
    return dot / (norm + 1e-9)  # epsilon avoids division by zero

# Step 1: rough retrieval via embedding similarity (top 10)
q_vec = embed([query])[0]
candidates = sorted(index, key=lambda v_c: cosine(q_vec, v_c[0]), reverse=True)[:10]
candidate_texts = [c for _, c in candidates]

# Step 2: precise reranking (top 3)
resp = requests.post(
    "https://llm.aihosting.mittwald.de/v1/rerank",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={"model": "Qwen3-VL-Reranker-2B", "query": query, "documents": candidate_texts},
    timeout=30,
)
resp.raise_for_status()
ranked = sorted(resp.json()["results"], key=lambda r: r["relevance_score"], reverse=True)
top_chunks = [candidate_texts[r["index"]] for r in ranked[:3]]
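
For unit tests it can help to separate the two-step flow from the network call. The sketch below wraps both steps in one function and takes the reranker as a callable, so an offline stand-in can replace the HTTP request. The names `two_stage_retrieve` and `fake_rerank` are hypothetical, and the word-overlap scorer is only a test double, not how the model scores relevance.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x**2 for x in a)) * math.sqrt(sum(x**2 for x in b))
    return dot / (norm + 1e-9)


def two_stage_retrieve(q_vec, query, index, rerank_fn, first_k=10, final_k=3):
    """index: list of (vector, text) pairs.

    rerank_fn(query, texts) must return a list of dicts with "index" and
    "relevance_score", i.e. the "results" list of a /v1/rerank response.
    """
    # Step 1: wide candidate set by embedding similarity
    candidates = sorted(index, key=lambda vc: cosine(q_vec, vc[0]), reverse=True)[:first_k]
    texts = [text for _, text in candidates]
    # Step 2: narrow down with the reranker
    ranked = sorted(rerank_fn(query, texts), key=lambda r: r["relevance_score"], reverse=True)
    return [texts[r["index"]] for r in ranked[:final_k]]


def fake_rerank(query, texts):
    """Offline stand-in: scores by the share of query words found in the text."""
    words = set(query.lower().split())
    return [
        {"index": i, "relevance_score": len(words & set(t.lower().split())) / len(words)}
        for i, t in enumerate(texts)
    ]


toy_index = [
    ([1.0, 0.0], "payment terms are net 30"),
    ([0.9, 0.1], "delivery within 14 days"),
    ([0.0, 1.0], "office opening hours"),
]
top = two_stage_retrieve([1.0, 0.0], "payment terms", toy_index, fake_rerank, first_k=2, final_k=1)
print(top)
```

In production, rerank_fn would be a small function that posts to /v1/rerank as shown above and returns the "results" list from the JSON response.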

Combining GLM-OCR for text extraction, Qwen3-Embedding-8B for indexing, Qwen3-VL-Reranker-2B for reranking, and a chat model for answering gives a complete document Q&A pipeline.

Terms of use and licensing

The general terms of use apply. The model is provided by Alibaba under the Apache 2.0 License, and reuse of the generated content is not subject to any additional restrictions.