GLM-OCR

The GLM-OCR guide has runnable examples for all common use cases, including a full pipeline with Qwen3-Embedding-8B, Qwen3.5-122B-A10B-FP8, n8n, and vector database options (pgvector, Qdrant, ChromaDB).

Description

GLM-OCR is a document optical character recognition (OCR) model by Z.ai (ZhipuAI), specialized for accurate text extraction from documents and images. An integrated document proxy on our platform automatically converts PDF, DOCX, PPTX, XLSX, HTML, SVG, and raster image inputs to PNG pages before the model processes them.

It is suitable for:

  • Extracting text from PDF, DOCX, PPTX, XLSX, HTML, and many other document formats
  • Processing scanned documents, invoices, contracts, forms, and reports
  • Text extraction from tables and structured layouts
  • Formula and mathematical expression recognition
  • Key information extraction (KIE) — structured JSON output from forms, receipts, certificates, and cards
  • Retrieval-Augmented Generation (RAG) pre-processing — high-accuracy document parsing for knowledge bases
  • Multilingual documents — Chinese, English, German, French, Spanish, Russian, Japanese, Korean, and others

The following limitations apply:

  • Maximum 30 pages per request — the API returns HTTP 413 if the document exceeds this limit. Split larger documents into batches of 30 pages.
  • Maximum request body: 200 MB
  • Maximum context length: 131,072 tokens (~4,000 tokens per page at typical document density)
  • No tool-calling or function calling support
  • No memory between requests — the model does not remember previous extractions. Each API call is independent; send the document again if you need to ask something new about it.
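The 30-page limit means longer documents have to be sent in several requests. As a minimal sketch of the batching arithmetic only, the hypothetical helper below (`page_batches` is not part of any SDK) computes which page ranges would go into each request. Actually cutting the PDF into those ranges requires a PDF library of your choice and is not shown here.

```python
def page_batches(total_pages: int, batch_size: int = 30) -> list[tuple[int, int]]:
    """Return (start, end) page index ranges, end exclusive.

    Each range covers at most batch_size pages, matching the
    30-pages-per-request limit described above.
    """
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        (start, min(start + batch_size, total_pages))
        for start in range(0, total_pages, batch_size)
    ]


# A 65-page document needs three requests: 30 + 30 + 5 pages.
print(page_batches(65))  # [(0, 30), (30, 60), (60, 65)]
```

Because each request is independent, the extracted text of the batches can simply be concatenated in order afterwards.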

Supported input formats

All content is delivered as a base64-encoded data URI in the image_url field. The proxy automatically detects the format and converts it to per-page PNG images before passing them to the model.

| Format | MIME type for data URI | Notes |
|--------|------------------------|-------|
| PDF | application/pdf | Up to 30 pages per request |
| JPEG | image/jpeg | Handled natively |
| PNG | image/png | Handled natively |
| TIFF | image/tiff | Multi-frame → one page per frame |
| GIF | image/gif | Animated → one page per frame |
| WebP | image/webp | Animated → one page per frame |
| BMP | image/bmp | |
| SVG | image/svg+xml | Rasterised via cairosvg |
| HTML | text/html | Rendered via WeasyPrint |
| DOCX | application/vnd.openxmlformats-officedocument.wordprocessingml.document | Converted via mammoth + WeasyPrint |
| PPTX | application/vnd.openxmlformats-officedocument.presentationml.presentation | One page per slide |
| XLSX | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | One page per sheet, max 2,000 rows |
| XLS | application/vnd.ms-excel | Legacy Excel format |
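Since every format is sent the same way, a small helper can build the data URI from a file path. The following is a sketch, not part of the platform: `to_data_uri` and `MIME_BY_SUFFIX` are hypothetical names, and the mapping from file extensions to the MIME types listed above is an assumption about how your files are named.

```python
import base64
from pathlib import Path

_OOXML = "application/vnd.openxmlformats-officedocument"

# Assumed extension -> MIME type mapping, based on the formats listed above.
MIME_BY_SUFFIX = {
    ".pdf": "application/pdf",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".tif": "image/tiff",
    ".tiff": "image/tiff",
    ".gif": "image/gif",
    ".webp": "image/webp",
    ".bmp": "image/bmp",
    ".svg": "image/svg+xml",
    ".html": "text/html",
    ".docx": f"{_OOXML}.wordprocessingml.document",
    ".pptx": f"{_OOXML}.presentationml.presentation",
    ".xlsx": f"{_OOXML}.spreadsheetml.sheet",
    ".xls": "application/vnd.ms-excel",
}


def to_data_uri(path: str) -> str:
    """Read a file and return it as a base64-encoded data URI."""
    file_path = Path(path)
    suffix = file_path.suffix.lower()
    if suffix not in MIME_BY_SUFFIX:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    payload = base64.b64encode(file_path.read_bytes()).decode("ascii")
    return f"data:{MIME_BY_SUFFIX[suffix]};base64,{payload}"
```

The returned string goes into the `url` value of the `image_url` content part, regardless of whether the file is an image or an office document.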

API usage

GLM-OCR is accessed via the standard chat completions endpoint (/v1/chat/completions) with the model name GLM-OCR. The document proxy intercepts the request, converts the document to PNG pages, and forwards them to the model — no page splitting required on your side.

PDF document extraction

import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.aihosting.mittwald.de/v1",
    api_key="<your-api-key>",
)

# Read the PDF and base64-encode it for the data URI.
with open("document.pdf", "rb") as f:
    pdf_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="GLM-OCR",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    # Documents are also passed in the image_url field.
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:application/pdf;base64,{pdf_b64}",
                    },
                },
                {
                    "type": "text",
                    "text": "Extract all text from this document.",
                },
            ],
        }
    ],
    temperature=0.1,
)

print(response.choices[0].message.content)

Single image extraction

For individual document images (JPEG, PNG):

import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.aihosting.mittwald.de/v1",
    api_key="<your-api-key>",
)

# Read the image and base64-encode it for the data URI.
with open("page.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="GLM-OCR",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        # The MIME type must match the file format.
                        "url": f"data:image/jpeg;base64,{img_b64}",
                    },
                },
                {
                    "type": "text",
                    "text": "Extract all text from this image.",
                },
            ],
        }
    ],
    temperature=0.1,
)

print(response.choices[0].message.content)

GLM-OCR is meant for deterministic extraction. Use a low temperature so that the output stays faithful to the source text:

| Parameter | Value |
|-----------|-------|
| temperature | 0.1 |
| top_p | 1.0 |
| max_tokens | 4096 per page (scale with page count) |
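The recommended values can be derived from the page count. The helper below is a sketch under stated assumptions: `sampling_params` is a hypothetical name, and capping `max_tokens` at the 131,072-token context length from the limitations list is an assumption of this example, not documented platform behaviour.

```python
CONTEXT_LIMIT = 131_072   # maximum context length from the limitations list
TOKENS_PER_PAGE = 4096    # recommended max_tokens per page


def sampling_params(page_count: int) -> dict:
    """Return the recommended request parameters for a document of page_count pages."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    return {
        "temperature": 0.1,
        "top_p": 1.0,
        # Assumption: never request more output tokens than the context allows.
        "max_tokens": min(TOKENS_PER_PAGE * page_count, CONTEXT_LIMIT),
    }


print(sampling_params(3))
# {'temperature': 0.1, 'top_p': 1.0, 'max_tokens': 12288}
```

The resulting dictionary can be unpacked into the request, for example `client.chat.completions.create(model="GLM-OCR", messages=messages, **sampling_params(3))`.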

Output modes

The model's output format is controlled entirely through your prompt — there is no separate API parameter for it.

| Mode | How to activate | Behaviour |
|------|-----------------|-----------|
| Plain text | "Extract all text from this document." | Raw text, no formatting |
| Markdown | "Extract the text and format it as Markdown. Use # for headings and - for lists." | Preserves headings, lists, emphasis; good for RAG |
| JSON (KIE) | "Extract these fields and return them as a JSON object: {…}" | Structured extraction; output always wrapped in ```json ``` fences, strip them before parsing |
| HTML table | "Return the table as an HTML <table> element." | Useful for spreadsheet-like data |
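Because JSON output arrives wrapped in a fenced block, it has to be unwrapped before `json.loads` can read it. The following is a minimal sketch of one way to do that; `parse_json_output` is a hypothetical helper, the field name in the usage example is made up, and accepting unfenced input as well is a defensive assumption rather than documented behaviour.

```python
import json
import re

# Matches an optional json-labelled fence (three backticks) around the payload.
_FENCE_RE = re.compile(
    r"^\s*`{3}(?:json)?\s*(.*?)\s*`{3}\s*$",
    re.DOTALL | re.IGNORECASE,
)


def parse_json_output(text: str):
    """Strip a surrounding json fence, if present, and parse the payload."""
    match = _FENCE_RE.match(text)
    payload = match.group(1) if match else text
    return json.loads(payload)


# Usage with a made-up model response:
FENCE = "`" * 3
raw = f'{FENCE}json\n{{"invoice_no": "A-17"}}\n{FENCE}'
print(parse_json_output(raw))  # {'invoice_no': 'A-17'}
```

If the model returns something that is not valid JSON, `json.loads` raises `json.JSONDecodeError`, which the caller can catch to retry or log the raw output.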

Terms of Use and Licensing

The general terms of use apply. The model is provided by Z.ai under the MIT License, and reuse of the extracted content is not subject to any additional restrictions imposed by the model license.