Extracting text from documents with GLM-OCR
GLM-OCR is a document OCR model — built only for reading text, not for chat or image understanding. Use it when you need to digitise PDFs, parse scanned invoices, or pull fields from forms.
Setup
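Install the OpenAI Python library and export your mStudio API key as OPENAI_API_KEY. The client reads this environment variable automatically, so the examples never pass the key explicitly: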
user@local $ pip install openai
user@local $ export OPENAI_API_KEY="sk-…"
All examples below use a shared client:
from openai import OpenAI
client = OpenAI(base_url="https://llm.aihosting.mittwald.de/v1")
Extracting text from a PDF
The most common use case is reading a scanned or digital PDF. The integrated proxy converts each page to an image automatically, so you only send the file as a base64 data URI. Note that the PDF goes in an image_url content part, even though it is not an image.
import base64
from openai import OpenAI
client = OpenAI(base_url="https://llm.aihosting.mittwald.de/v1")
with open("invoice.pdf", "rb") as f:
    pdf_b64 = base64.b64encode(f.read()).decode()
response = client.chat.completions.create(
    model="GLM-OCR",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": f"data:application/pdf;base64,{pdf_b64}"},
            },
            {"type": "text", "text": "Extract all text from this document."},
        ],
    }],
    temperature=0.1,
)
print(response.choices[0].message.content)
Up to 30 pages are processed per request — see Batch-processing large PDFs for larger documents.
Extracting text from images
Scanned pages sent as JPEG, PNG, WebP, or SVG are processed directly by the model. Change the MIME type in the data URI to match your file:
import base64
from openai import OpenAI
client = OpenAI(base_url="https://llm.aihosting.mittwald.de/v1")
# Works with: image/jpeg image/png image/webp image/svg+xml
MIME = {"jpg": "image/jpeg", "jpeg": "image/jpeg", "png": "image/png",
        "webp": "image/webp", "svg": "image/svg+xml"}

def ocr_image(path: str) -> str:
    ext = path.rsplit(".", 1)[-1].lower()
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="GLM-OCR",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:{MIME[ext]};base64,{b64}"}},
                {"type": "text", "text": "Extract all text from this image."},
            ],
        }],
        temperature=0.1,
    )
    return response.choices[0].message.content

print(ocr_image("scan.jpg"))
print(ocr_image("diagram.png"))
print(ocr_image("chart.webp"))
print(ocr_image("drawing.svg"))
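The extension lookup above can also be delegated to Python's standard mimetypes module. The following is a minimal sketch rather than part of the mittwald API: to_data_uri is a hypothetical helper name, and it assumes the file extension reflects the actual content type.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_uri(path: str) -> str:
    """Build a base64 data URI for a file, guessing the MIME type from its extension."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        # Fail loudly rather than sending a file with an unknown type
        raise ValueError(f"Cannot determine MIME type for {path!r}")
    b64 = base64.b64encode(Path(path).read_bytes()).decode()
    return f"data:{mime};base64,{b64}"
```

The returned string can be used as the url value of the image_url content part. Extensions that the local mimetypes database does not know raise a ValueError instead of being sent with a wrong type.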
Reading Office documents (DOCX, PPTX, XLSX)
GLM-OCR accepts Word, PowerPoint, and Excel files directly — no conversion on your side. Only the MIME type in the data URI changes:
import base64
from openai import OpenAI
client = OpenAI(base_url="https://llm.aihosting.mittwald.de/v1")
MIME = {
    "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}

def read_office_doc(path: str) -> str:
    ext = path.rsplit(".", 1)[-1].lower()
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="GLM-OCR",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:{MIME[ext]};base64,{b64}"},
                },
                {"type": "text", "text": "Extract all text from this document."},
            ],
        }],
        temperature=0.1,
    )
    return response.choices[0].message.content

print(read_office_doc("contract.docx"))
print(read_office_doc("slides.pptx"))
print(read_office_doc("report.xlsx"))
Markdown document output
Ask GLM-OCR to preserve document structure (headings, bullet lists, emphasis) when the output will be displayed or stored for RAG. Being explicit about the formatting scheme is required — the model needs concrete instructions to produce heading syntax reliably:
import base64
from openai import OpenAI
client = OpenAI(base_url="https://llm.aihosting.mittwald.de/v1")
with open("report.pdf", "rb") as f:
    pdf_b64 = base64.b64encode(f.read()).decode()
response = client.chat.completions.create(
    model="GLM-OCR",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": f"data:application/pdf;base64,{pdf_b64}"},
            },
            {
                "type": "text",
                "text": (
                    "Extract the text from this document and format it as Markdown. "
                    "Use # for main headings, ## for subheadings, and - for bullet lists."
                ),
            },
        ],
    }],
    temperature=0.1,
)
markdown = response.choices[0].message.content
print(markdown)
# Example output:
# # Annual Report 2024
#
# Revenue: 50,000 EUR
#
# - Q1: 12,000 EUR
# - Q2: 13,000 EUR
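Once the output uses # and ## headings consistently, it can be cut into sections before storage or retrieval. The sketch below is illustrative only: split_sections is a hypothetical helper, and it assumes the model followed the heading scheme requested in the prompt.

```python
import re


def split_sections(markdown: str) -> list[tuple[str, str]]:
    """Split Markdown into (heading, body) pairs at '#' and '##' headings."""
    sections: list[tuple[str, str]] = []
    heading, body = "", []
    for line in markdown.splitlines():
        if re.match(r"#{1,2} ", line):
            # A new heading closes the section collected so far
            if heading or any(ln.strip() for ln in body):
                sections.append((heading, "\n".join(body).strip()))
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    # Flush the final section
    if heading or any(ln.strip() for ln in body):
        sections.append((heading, "\n".join(body).strip()))
    return sections
```

Text that appears before the first heading is returned with an empty heading string, so nothing is dropped.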
Extracting structured data from invoices (KIE)
Key Information Extraction (KIE) lets you pull specific fields out of a document as a JSON object. Describe the schema you want in plain language — the model fills in the values it finds.
The model always wraps its JSON output in markdown code fences regardless of how the prompt is worded — strip them before parsing:
import base64
import json
import re
from openai import OpenAI
client = OpenAI(base_url="https://llm.aihosting.mittwald.de/v1")
with open("invoice.pdf", "rb") as f:
    pdf_b64 = base64.b64encode(f.read()).decode()
response = client.chat.completions.create(
    model="GLM-OCR",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": f"data:application/pdf;base64,{pdf_b64}"},
            },
            {
                "type": "text",
                "text": (
                    "Extract the following fields from this invoice and return them as a JSON object:\n"
                    '{ "invoice_number": "", "date": "", "vendor": "", '
                    '"total_amount": "", "line_items": [] }'
                ),
            },
        ],
    }],
    temperature=0.1,
)
# Strip the markdown code fences the model always adds around JSON
raw = response.choices[0].message.content
clean = re.sub(r"^```[a-z]*\n?", "", raw.strip())
clean = re.sub(r"\n?```$", "", clean)
data = json.loads(clean)
print(data)
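The two substitutions above assume the reply consists of nothing but one fenced block. A slightly more tolerant variant is sketched here; parse_model_json is a hypothetical helper name, and the handling of surrounding text is an assumption about possible replies, not documented model behaviour.

```python
import json
import re
from typing import Any


def parse_model_json(raw: str) -> Any:
    """Parse JSON from a model reply, with or without Markdown code fences."""
    text = raw.strip()
    # Take the content of the first fenced block if one is present
    fenced = re.search(r"```[a-zA-Z]*\s*(.*?)\s*```", text, flags=re.DOTALL)
    if fenced:
        text = fenced.group(1)
    return json.loads(text)
```

json.loads still raises json.JSONDecodeError on malformed output, so callers should be prepared to catch it and retry or log the raw reply.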
The same approach works for any document type. Only the text part of the message changes, so adapt the schema to the fields you need:
# German wage tax statement (Lohnsteuerbescheinigung).
# The prompt reads: "Extract the following fields and return them as JSON:"
"text": (
    "Extrahiere die folgenden Felder und gib sie als JSON zurück:\n"
    '{ "steuernummer": "", "veranlagungszeitraum": "", "bruttoarbeitslohn": "", '
    '"lohnsteuer": "", "solidaritaetszuschlag": "" }'
)
Batch-processing large PDFs
The 30-page limit applies per request. For larger documents, split the PDF into batches of at most 30 pages with pypdf (pip install pypdf), send each batch separately, and concatenate the results:
import base64
from io import BytesIO
from pypdf import PdfReader, PdfWriter
from openai import OpenAI
client = OpenAI(base_url="https://llm.aihosting.mittwald.de/v1")
def _ocr_batch(pdf_bytes: bytes) -> str:
    b64 = base64.b64encode(pdf_bytes).decode()
    resp = client.chat.completions.create(
        model="GLM-OCR",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:application/pdf;base64,{b64}"}},
                {"type": "text", "text": "Extract all text from this document."},
            ],
        }],
        temperature=0.1,
    )
    return resp.choices[0].message.content

def ocr_large_pdf(path: str, batch_size: int = 30) -> str:
    reader = PdfReader(path)
    parts = []
    for start in range(0, len(reader.pages), batch_size):
        # Copy one batch of pages into a new in-memory PDF
        writer = PdfWriter()
        for page in reader.pages[start : start + batch_size]:
            writer.add_page(page)
        buf = BytesIO()
        writer.write(buf)
        parts.append(_ocr_batch(buf.getvalue()))
    return "\n\n".join(parts)

text = ocr_large_pdf("annual_report.pdf")
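To see how a document is divided before any request is sent, the batching arithmetic can be checked on its own. This is a small illustrative helper (batch_ranges is a hypothetical name) that mirrors the range() loop in ocr_large_pdf:

```python
def batch_ranges(page_count: int, batch_size: int = 30) -> list[tuple[int, int]]:
    """Return half-open (start, end) page index ranges covering page_count pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        (start, min(start + batch_size, page_count))
        for start in range(0, page_count, batch_size)
    ]


print(batch_ranges(75))  # [(0, 30), (30, 60), (60, 75)]
```

A 75-page document therefore results in three requests, the last one carrying 15 pages.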
Full pipeline: OCR → embedding → search → answer
Combine three mittwald AI Hosting models into a complete document question-answering pipeline:
- GLM-OCR digitises your documents — using Markdown output mode so headings and structure are preserved, which produces cleaner chunks and better retrieval quality than flat text
- Qwen3-Embedding-8B turns text chunks into vectors
- Qwen3.5-122B-A10B-FP8 answers questions in natural language based on the retrieved context
user@local $ pip install openai pypdf
import base64
import math
import re
from io import BytesIO
from pypdf import PdfReader, PdfWriter
from openai import OpenAI
client = OpenAI(base_url="https://llm.aihosting.mittwald.de/v1")
# ── Step 1: OCR (Markdown mode for clean structure) ───────────────────────────
# Using Markdown output mode instead of plain text produces structured chunks
# that respect document headings — significantly improving retrieval precision.
MIME = {
    "pdf": "application/pdf",
    "jpg": "image/jpeg",
    "jpeg": "image/jpeg",
    "png": "image/png",
    "webp": "image/webp",
    "svg": "image/svg+xml",
    "html": "text/html",
    "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}

def ocr_document(path: str) -> str:
    ext = path.rsplit(".", 1)[-1].lower()
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="GLM-OCR",
        messages=[{"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": f"data:{MIME.get(ext, 'application/pdf')};base64,{b64}"}},
            {
                "type": "text",
                "text": (
                    "Extract the text from this document and format it as Markdown. "
                    "Use # for main headings, ## for subheadings, and - for bullet lists."
                ),
            },
        ]}],
        temperature=0.1,
    )
    return resp.choices[0].message.content

# ── Step 2: Chunk (split at markdown headings, then by word count) ────────────
def chunk_text(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    # Split at top-level headings first to keep sections together
    sections = re.split(r"(?m)^(?=# )", text)
    chunks: list[str] = []
    for section in sections:
        words = section.split()
        i = 0
        while i < len(words):
            chunks.append(" ".join(words[i : i + size]))
            if i + size >= len(words):
                break  # last window reached; skip a tail chunk that only repeats the overlap
            i += size - overlap
    return [c for c in chunks if c.strip()]

# ── Step 3: Embed with Qwen3-Embedding-8B ────────────────────────────────────
def embed(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(model="Qwen3-Embedding-8B", input=texts)
    return [item.embedding for item in resp.data]

# ── Step 4: Retrieve via cosine similarity ────────────────────────────────────
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / (norm + 1e-9)  # small epsilon guards against zero vectors

def retrieve(query_vec: list[float], index: list, top_k: int = 3) -> list[str]:
    scored = [(cosine(query_vec, vec), chunk) for vec, chunk in index]
    # Sort on the score only, so ties never fall back to comparing chunk text
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]

# ── Step 5: Answer with Qwen3.5-122B-A10B-FP8 ────────────────────────────────
# Qwen3.5 has thinking mode enabled by default — disable it for Q&A tasks so
# the answer is returned in message.content rather than reasoning_content.
# See: /docs/v2/platform/aihosting/models/qwen3-5-122b-a10b-fp8
def answer(question: str, chunks: list[str]) -> str:
    context = "\n\n---\n\n".join(chunks)
    resp = client.chat.completions.create(
        model="Qwen3.5-122B-A10B-FP8",
        messages=[
            {
                "role": "system",
                "content": "Answer questions based only on the provided document context. Be concise and precise.",
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}",
            },
        ],
        temperature=0.7,
        top_p=0.8,
        max_tokens=1024,
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
    )
    return resp.choices[0].message.content

# ── Build index ───────────────────────────────────────────────────────────────
# Accepts PDFs, images (jpg/png/webp/svg), DOCX, PPTX, XLSX, HTML — mix freely
documents = ["invoice.pdf", "contract.docx"]
index: list[tuple[list[float], str]] = []
for path in documents:
    text = ocr_document(path)
    chunks = chunk_text(text)
    vectors = embed(chunks)
    index.extend(zip(vectors, chunks))
print(f"Indexed {len(index)} chunks from {len(documents)} documents.")
# ── Ask questions ─────────────────────────────────────────────────────────────
questions = [
    "What is the total invoice amount?",
    "How long is the contract term?",
]
for q in questions:
    [q_vec] = embed([q])
    context_chunks = retrieve(q_vec, index)
    print(f"\nQ: {q}")
    print(f"A: {answer(q, context_chunks)}")
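The retrieval step can be tried without any API call. The sketch below repeats the cosine ranking from Step 4 on hand-made three-dimensional vectors; the vectors and chunk texts are invented stand-ins for real embeddings and exist only to make the ranking visible.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / (norm + 1e-9)


def retrieve(query_vec: list[float], index: list, top_k: int = 3) -> list[str]:
    scored = [(cosine(query_vec, vec), chunk) for vec, chunk in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


# Made-up 3-dimensional vectors standing in for real embeddings
toy_index = [
    ([1.0, 0.0, 0.0], "Total amount: 1,190.00 EUR"),
    ([0.0, 1.0, 0.0], "The contract runs for 24 months."),
    ([0.7, 0.7, 0.0], "Payment is due within 14 days."),
]
query = [0.9, 0.1, 0.0]  # points mostly along the first axis

top = retrieve(query, toy_index, top_k=2)
print(top)
```

The query vector lies closest to the first entry, then to the diagonal third entry, so those two chunks are returned in that order.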
Optional: reranking with Qwen3-VL-Reranker-2B
Vector search retrieves candidates by embedding similarity — a fast approximation. A cross-encoder reranker scores each candidate against the full question text and reorders them by semantic relevance. The improvement is most noticeable on large indexes where many chunks share similar embeddings.
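Qwen3-VL-Reranker-2B is served via the /v1/rerank endpoint. The OpenAI client has no rerank method, so the example calls the endpoint directly with requests (pip install requests). Drop it in between the retrieve() and answer() steps in the pipeline above: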
import os
import requests
def rerank(
    query: str,
    chunks: list[str],
    top_k: int = 3,
    instruction: str | None = None,
) -> list[str]:
    payload = {
        "model": "Qwen3-VL-Reranker-2B",
        "query": query,
        "documents": chunks,
    }
    if instruction:
        payload["instruction"] = instruction
    resp = requests.post(
        "https://llm.aihosting.mittwald.de/v1/rerank",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()
    results = resp.json()["results"]
    ranked = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return [chunks[r["index"]] for r in ranked[:top_k]]
Retrieve a wider candidate set from the vector index first, then let the reranker narrow it down:
# Retrieve 10 candidates, rerank to top 3
candidates = retrieve(q_vec, index, top_k=10)
context_chunks = rerank(
    q,
    candidates,
    top_k=3,
    instruction="Given OCR-extracted documents, find the sections most relevant to the query.",
)
The /v1/rerank response contains a results list with an index (position in the original documents array) and a relevance_score (higher is better) for each entry. The optional instruction parameter tells the reranker what retrieval task to optimise for — omitting it uses a generic default.
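The mapping from results back to chunk texts can be illustrated offline. The payload below is hypothetical and contains only the two fields described above (index and relevance_score); a real response may carry additional fields, and the scores shown are invented.

```python
def top_chunks(response_json: dict, chunks: list[str], top_k: int = 3) -> list[tuple[float, str]]:
    """Map rerank results back to chunk texts, best score first."""
    ranked = sorted(
        response_json["results"],
        key=lambda r: r["relevance_score"],
        reverse=True,
    )
    return [(r["relevance_score"], chunks[r["index"]]) for r in ranked[:top_k]]


# Hypothetical payload shaped like the description above
sample = {"results": [
    {"index": 0, "relevance_score": 0.12},
    {"index": 2, "relevance_score": 0.91},
    {"index": 1, "relevance_score": 0.48},
]}
chunks = ["Delivery terms", "Payment terms", "Contract term: 24 months"]

for score, chunk in top_chunks(sample, chunks, top_k=2):
    print(f"{score:.2f}  {chunk}")
```

Keeping the score next to each chunk makes it easy to apply a minimum-relevance cutoff before passing context to the answering model.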
Full pipeline with n8n and pgvector
n8n is a workflow automation platform that lets you build the same OCR → embed → search → answer pipeline visually — without writing application code. mittwald provides a dedicated n8n hosting guide that covers running n8n together with a pgvector database on mittwald Container Hosting. This is the recommended production setup.
Because mittwald AI Hosting exposes an OpenAI-compatible API, n8n's built-in OpenAI nodes work out of the box with a custom base URL.
Deploy n8n + pgvector on mittwald
The mittwald n8n guide describes deploying a complete stack via mw stack deploy. The essential Docker Compose configuration:
services:
  n8n:
    image: n8nio/n8n:stable
    ports:
      - "5678:5678"
    volumes:
      - n8n_data:/home/node/.n8n
    environment:
      - DB_TYPE=postgresdb
      - DB_POSTGRESDB_HOST=pgvector
      - DB_POSTGRESDB_DATABASE=n8n
      - DB_POSTGRESDB_USER=n8n
      - DB_POSTGRESDB_PASSWORD=<password>
  pgvector:
    image: ankane/pgvector:latest
    environment:
      - POSTGRES_DB=n8n
      - POSTGRES_USER=n8n
      - POSTGRES_PASSWORD=<password>
    volumes:
      - pgvector_data:/var/lib/postgresql/data
volumes:
  n8n_data:
  pgvector_data:
See the mittwald n8n guide for the complete setup instructions including custom domain, SSL, and mStudio configuration.
Alternative: Qdrant
Qdrant is a dedicated vector search engine with a REST and gRPC API. Run it as a container alongside your application:
services:
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333" # REST API
      - "6334:6334" # gRPC
    volumes:
      - qdrant_data:/qdrant/storage
volumes:
  qdrant_data:
Deploy with mw stack deploy and connect from Python using the official client:
# pip install qdrant-client openai
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from openai import OpenAI
import uuid
ai = OpenAI(base_url="https://llm.aihosting.mittwald.de/v1")
qdrant = QdrantClient(host="localhost", port=6333)
# Create collection (run once)
qdrant.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=4096, distance=Distance.COSINE),
)

def embed(texts: list[str]) -> list[list[float]]:
    resp = ai.embeddings.create(model="Qwen3-Embedding-8B", input=texts)
    return [item.embedding for item in resp.data]

def index_chunks(chunks: list[str]) -> None:
    vectors = embed(chunks)
    qdrant.upsert(
        collection_name="documents",
        points=[
            PointStruct(id=str(uuid.uuid4()), vector=vec, payload={"text": chunk})
            for vec, chunk in zip(vectors, chunks)
        ],
    )

def search(question: str, top_k: int = 3) -> list[str]:
    [q_vec] = embed([question])
    result = qdrant.query_points(collection_name="documents", query=q_vec, limit=top_k)
    return [hit.payload["text"] for hit in result.points]
Alternative: ChromaDB
ChromaDB offers a simple HTTP API that works well for prototyping and smaller document collections:
services:
  chromadb:
    image: chromadb/chroma:latest
    ports:
      - "8000:8000"
    volumes:
      - chroma_data:/chroma/chroma
volumes:
  chroma_data:
Connect from Python:
# pip install chromadb openai
import uuid
import chromadb
from openai import OpenAI
ai = OpenAI(base_url="https://llm.aihosting.mittwald.de/v1")
chroma = chromadb.HttpClient(host="localhost", port=8000)
collection = chroma.get_or_create_collection("documents")

def embed(texts: list[str]) -> list[list[float]]:
    resp = ai.embeddings.create(model="Qwen3-Embedding-8B", input=texts)
    return [item.embedding for item in resp.data]

def index_chunks(chunks: list[str]) -> None:
    vectors = embed(chunks)
    collection.add(
        # Unique IDs per chunk; positional IDs ("0", "1", ...) would collide
        # as soon as index_chunks() is called for a second document
        ids=[str(uuid.uuid4()) for _ in chunks],
        embeddings=vectors,
        documents=chunks,
    )

def search(question: str, top_k: int = 3) -> list[str]:
    [q_vec] = embed([question])
    results = collection.query(query_embeddings=[q_vec], n_results=top_k)
    return results["documents"][0]
Credential setup in n8n
Create a new OpenAI API credential in n8n and override the base URL:
| Field | Value |
|---|---|
| API Key | your mStudio AI Hosting key (sk-…) |
| Base URL | https://llm.aihosting.mittwald.de/v1 |
GLM-OCR, Qwen3-Embedding-8B, and Qwen3.5-122B-A10B-FP8 all share this single credential.
Document ingestion workflow
Build this n8n workflow to automatically digitise and index new documents:
[Trigger: new file / webhook / schedule]
↓
[Read Binary File]
↓
[Code node] base64-encode file → data URI
↓
[HTTP Request] POST https://llm.aihosting.mittwald.de/v1/chat/completions
model: GLM-OCR
prompt: "Extract the text and format it as Markdown. Use # for headings and - for lists."
↓
[Code node] split Markdown into 400-word chunks with 50-word overlap
↓
[Embeddings OpenAI node] model: Qwen3-Embedding-8B
↓
[Postgres (pgvector) node] INSERT chunks + embeddings into the vector table
The GLM-OCR call uses an HTTP Request node because the request body must carry the file as a base64 data URI, which a Code node builds first:
// n8n Code node — encode binary to base64 data URI
const binaryData = $input.first().binary.data;
const b64 = binaryData.data; // n8n already base64-encodes binary items
const mime = binaryData.mimeType; // e.g. "application/pdf"
return [{ json: { dataUri: `data:${mime};base64,${b64}` } }];
Question-answering workflow
A second workflow answers user questions on demand:
[Trigger: webhook ?question=How+long+is+the+contract%3F]
↓
[Embeddings OpenAI node] embed the question (Qwen3-Embedding-8B)
↓
[Postgres (pgvector) node] SELECT top-3 chunks by vector similarity
↓
[OpenAI Chat node] model: Qwen3.5-122B-A10B-FP8
system: "Answer only from the provided context."
user: "Context: {{chunks}}\n\nQuestion: {{question}}"
↓
[Respond to Webhook] return the answer as JSON
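The "SELECT top-3 chunks by vector similarity" step relies on pgvector's <=> operator, which orders rows by cosine distance. The sketch below shows the query and the vector literal it expects. The table and column names (document_chunks, content, embedding) are hypothetical, and the %s placeholders follow the psycopg convention; in the n8n Postgres node the parameter syntax differs.

```python
def to_pgvector_literal(vec: list[float]) -> str:
    """Format a list of numbers as a pgvector text literal, e.g. '[0.1,2.0]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


# Hypothetical schema: document_chunks(content text, embedding vector)
TOP_K_SQL = """
SELECT content
FROM document_chunks
ORDER BY embedding <=> %s::vector  -- <=> is pgvector's cosine distance operator
LIMIT %s
"""


def build_query(question_vec: list[float], top_k: int = 3) -> tuple[str, tuple[str, int]]:
    """Return the SQL text and its parameters for a top-k similarity search."""
    return TOP_K_SQL, (to_pgvector_literal(question_vec), top_k)


sql, params = build_query([0.1, 2, -3.5])
print(params)  # ('[0.1,2.0,-3.5]', 3)
```

Smaller distances mean more similar vectors, so the default ascending ORDER BY already returns the best matches first.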
Prompting tips
Tables — preserve structure as Markdown
Without guidance, the model may flatten table rows into plain text. Ask for markdown tables explicitly:
Extract all text from this document.
Preserve table structures as markdown tables where applicable.
Tables — output as HTML
For spreadsheet data that needs to be inserted into a database or further processed, request HTML:
Extract the table from this document and return it as an HTML <table> element with <tr> and <td> tags.
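An HTML table returned this way can be turned into rows with Python's standard library alone. This is a minimal sketch assuming a flat table without nested tables or merged cells; TableParser and parse_table are hypothetical names.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every <tr> into a list of rows."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._row: list[str] | None = None
        self._cell: list[str] | None = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside an open cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html: str) -> list[list[str]]:
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Each inner list is one table row, ready to be passed to a database insert or written out as CSV.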
Formulas — output LaTeX
For documents with mathematical content, request LaTeX so the result can be re-rendered downstream:
Extract all text from this document.
Render mathematical formulas in LaTeX: $...$ for inline math, $$...$$ for display math.
Low-quality scans — flag uncertain words
When working with faded or degraded originals, instruct the model to mark unclear words rather than silently guess:
Extract all legible text from this document.
If a word is unclear, write your best guess followed by [?].
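The [?] markers can then be collected for manual review. The sketch below assumes the model follows the prompt and places [?] directly after the uncertain word, optionally separated by whitespace; find_uncertain is a hypothetical helper name.

```python
import re

# A run of non-space characters followed by the [?] marker
UNCERTAIN = re.compile(r"(\S+)\s*\[\?\]")


def find_uncertain(text: str) -> list[str]:
    """Return every word the model flagged with a trailing [?]."""
    return UNCERTAIN.findall(text)
```

The share of flagged words per page is a simple signal for routing a scan to a human reviewer.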
Multilingual documents
GLM-OCR natively supports Chinese, English, German, French, Spanish, Russian, Japanese, and Korean in a single request — no language parameter needed. For mixed-language documents, instruct the model to preserve the original language of each section:
Extract all text from this document. Keep each text passage in its original language — do not translate.