GLM-OCR

The GLM-OCR guide has runnable examples for all common use cases, including a full pipeline with Qwen3-Embedding-8B, Qwen3.5-122B-A10B-FP8, n8n, and vector database options (pgvector, Qdrant, ChromaDB).

Description

"GLM-OCR" is a document optical character recognition (OCR) model by Z.ai (ZhipuAI), specialized for accurate text extraction from documents and images. An integrated document proxy on our platform automatically converts PDF, DOCX, PPTX, XLSX, HTML, SVG, and raster image formats to PNG pages before the model processes them.

It is suitable for:

Extracting text from PDF, DOCX, PPTX, XLSX, HTML, and many other document formats
Processing scanned documents, invoices, contracts, forms, and reports
Text extraction from tables and structured layouts
Formula and mathematical expression recognition
Key information extraction (KIE) — structured JSON output from forms, receipts, certificates, and cards
Retrieval-Augmented Generation (RAG) pre-processing — high-accuracy document parsing for knowledge bases
Multilingual documents — Chinese, English, German, French, Spanish, Russian, Japanese, Korean, and others

The following limitations apply:

Maximum 30 pages per request — the API returns HTTP 413 if the document exceeds this limit. Split larger documents into batches of 30 pages.
Maximum request body: 200 MB
Maximum context length: 131,072 tokens (~4,000 tokens per page at typical document density)
No tool-calling or function-calling support
No memory between requests — the model does not remember previous extractions. Each API call is independent; send the document again if you need to ask something new about it.
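The 30-page limit means that longer documents have to be split on the client side. The following is a minimal sketch of the batching arithmetic only, assuming you already know the page count. The helper name `page_batches` is hypothetical, and cutting the actual file into those page ranges (for example with a PDF library of your choice) is not shown.

```python
from typing import List, Tuple

MAX_PAGES_PER_REQUEST = 30  # limit stated in the list above


def page_batches(total_pages: int,
                 batch_size: int = MAX_PAGES_PER_REQUEST) -> List[Tuple[int, int]]:
    """Return (first_page, last_page) pairs, 1-indexed and inclusive."""
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batches = []
    for first in range(1, total_pages + 1, batch_size):
        last = min(first + batch_size - 1, total_pages)
        batches.append((first, last))
    return batches


# A 75-page document needs three requests.
print(page_batches(75))  # [(1, 30), (31, 60), (61, 75)]
```

Each range would then be sent as its own request. Because the model keeps no memory between requests, the extracted text of the batches has to be concatenated on your side.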

Supported input formats

All content is delivered as a base64-encoded data URI in the image_url field. The proxy automatically detects the format and converts it to per-page PNG images before passing them to the model.

Format	MIME type for data URI	Notes
PDF	`application/pdf`	Up to 30 pages per request
JPEG	`image/jpeg`	Handled natively
PNG	`image/png`	Handled natively
TIFF	`image/tiff`	Multi-frame → one page per frame
GIF	`image/gif`	Animated → one page per frame
WebP	`image/webp`	Animated → one page per frame
BMP	`image/bmp`
SVG	`image/svg+xml`	Rasterised via cairosvg
HTML	`text/html`	Rendered via WeasyPrint
DOCX	`application/vnd.openxmlformats-officedocument.wordprocessingml.document`	Converted via mammoth + WeasyPrint
PPTX	`application/vnd.openxmlformats-officedocument.presentationml.presentation`	One page per slide
XLSX	`application/vnd.openxmlformats-officedocument.spreadsheetml.sheet`	One page per sheet, max 2,000 rows
XLS	`application/vnd.ms-excel`	Legacy Excel format
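To avoid hard-coding the MIME type for every file, the data URI can be built from the file extension. This is a sketch, not part of the platform API: the MIME types are copied from the table above, while the extension mapping and the helper name `to_data_uri` are assumptions made for illustration.

```python
import base64
from pathlib import Path

# MIME types from the table above, keyed by file extension.
# The choice of extensions is an assumption of this sketch.
MIME_BY_EXTENSION = {
    ".pdf": "application/pdf",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".tif": "image/tiff",
    ".tiff": "image/tiff",
    ".gif": "image/gif",
    ".webp": "image/webp",
    ".bmp": "image/bmp",
    ".svg": "image/svg+xml",
    ".html": "text/html",
    ".htm": "text/html",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".xls": "application/vnd.ms-excel",
}


def to_data_uri(filename: str, data: bytes) -> str:
    """Build a base64 data URI for the image_url field from raw file bytes."""
    suffix = Path(filename).suffix.lower()
    if suffix not in MIME_BY_EXTENSION:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{MIME_BY_EXTENSION[suffix]};base64,{encoded}"


print(to_data_uri("scan.PNG", b"abc"))  # data:image/png;base64,YWJj
```

In practice you would pass `Path(filename).read_bytes()` as `data` and place the returned string in the `url` key of the `image_url` content part.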

API usage

GLM-OCR is accessed via the standard chat completions endpoint (/v1/chat/completions) with the model name GLM-OCR. The document proxy intercepts the request, converts the document to PNG pages, and forwards them to the model — no page splitting required on your side.

PDF document extraction

Examples in Python, JavaScript, and PHP, in that order:

import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.aihosting.mittwald.de/v1",
    api_key="<your-api-key>",
)

with open("document.pdf", "rb") as f:
    pdf_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="GLM-OCR",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:application/pdf;base64,{pdf_b64}",
                    },
                },
                {
                    "type": "text",
                    "text": "Extract all text from this document.",
                },
            ],
        }
    ],
    temperature=0.1,
)

print(response.choices[0].message.content)

import fs from "fs";
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://llm.aihosting.mittwald.de/v1",
  apiKey: "<your-api-key>",
});

const pdfB64 = fs.readFileSync("document.pdf").toString("base64");

const response = await client.chat.completions.create({
  model: "GLM-OCR",
  messages: [
    {
      role: "user",
      content: [
        {
          type: "image_url",
          image_url: { url: `data:application/pdf;base64,${pdfB64}` },
        },
        { type: "text", text: "Extract all text from this document." },
      ],
    },
  ],
  temperature: 0.1,
});

console.log(response.choices[0].message.content);

<?php
// composer require openai-php/client guzzlehttp/guzzle
require __DIR__ . '/vendor/autoload.php'; // Composer autoloader, makes the OpenAI class available

$client = OpenAI::factory()
    ->withBaseUri('https://llm.aihosting.mittwald.de/v1')
    ->withApiKey('<your-api-key>')
    ->make();

$pdfB64 = base64_encode(file_get_contents('document.pdf'));

$response = $client->chat()->create([
    'model' => 'GLM-OCR',
    'messages' => [[
        'role' => 'user',
        'content' => [
            [
                'type' => 'image_url',
                'image_url' => ['url' => "data:application/pdf;base64,{$pdfB64}"],
            ],
            ['type' => 'text', 'text' => 'Extract all text from this document.'],
        ],
    ]],
    'temperature' => 0.1,
]);

echo $response->choices[0]->message->content;

Single image extraction

For individual document images (JPEG, PNG):

Examples in Python, JavaScript, and PHP, in that order:

import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.aihosting.mittwald.de/v1",
    api_key="<your-api-key>",
)

with open("page.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="GLM-OCR",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{img_b64}",
                    },
                },
                {
                    "type": "text",
                    "text": "Extract all text from this image.",
                },
            ],
        }
    ],
    temperature=0.1,
)

print(response.choices[0].message.content)

import fs from "fs";
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://llm.aihosting.mittwald.de/v1",
  apiKey: "<your-api-key>",
});

const imgB64 = fs.readFileSync("page.jpg").toString("base64");

const response = await client.chat.completions.create({
  model: "GLM-OCR",
  messages: [
    {
      role: "user",
      content: [
        {
          type: "image_url",
          image_url: { url: `data:image/jpeg;base64,${imgB64}` },
        },
        { type: "text", text: "Extract all text from this image." },
      ],
    },
  ],
  temperature: 0.1,
});

console.log(response.choices[0].message.content);

<?php
// composer require openai-php/client guzzlehttp/guzzle
require __DIR__ . '/vendor/autoload.php'; // Composer autoloader, makes the OpenAI class available

$client = OpenAI::factory()
    ->withBaseUri('https://llm.aihosting.mittwald.de/v1')
    ->withApiKey('<your-api-key>')
    ->make();

$imgB64 = base64_encode(file_get_contents('page.jpg'));

$response = $client->chat()->create([
    'model' => 'GLM-OCR',
    'messages' => [[
        'role' => 'user',
        'content' => [
            [
                'type' => 'image_url',
                'image_url' => ['url' => "data:image/jpeg;base64,{$imgB64}"],
            ],
            ['type' => 'text', 'text' => 'Extract all text from this image.'],
        ],
    ]],
    'temperature' => 0.1,
]);

echo $response->choices[0]->message->content;

Recommended inference parameters

GLM-OCR is used for extraction, not for creative generation. Use a low temperature so that the output stays faithful to the source text:

Parameter	Value
`temperature`	0.1
`top_p`	1.0
`max_tokens`	4096 per page (scale with page count)
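The values in the table can be collected in a small helper that scales `max_tokens` with the page count. This is a sketch under the assumption that you know how many pages a request contains. The name `inference_params` is hypothetical, and the helper does not check the result against the context length, which the input pages also consume.

```python
TOKENS_PER_PAGE = 4096  # per-page output budget from the table above


def inference_params(page_count: int) -> dict:
    """Return the recommended sampling parameters for a request of page_count pages."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    return {
        "temperature": 0.1,
        "top_p": 1.0,
        "max_tokens": TOKENS_PER_PAGE * page_count,
    }


print(inference_params(3))
# {'temperature': 0.1, 'top_p': 1.0, 'max_tokens': 12288}
```

With the OpenAI Python client used in the examples above, the result can be unpacked into the call: `client.chat.completions.create(model="GLM-OCR", messages=messages, **inference_params(3))`.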

Output modes

The model's output format is controlled entirely through your prompt — there is no separate API parameter for it.

Mode	How to activate	Behaviour
Plain text	`"Extract all text from this document."`	Raw text, no formatting
Markdown	`"Extract the text and format it as Markdown. Use # for headings and - for lists."`	Preserves headings, lists, emphasis — good for RAG
JSON (KIE)	`"Extract these fields and return them as a JSON object: {…}"`	Structured extraction; output always wrapped in ```json ``` fences — strip before parsing
HTML table	`"Return the table as an HTML <table> element."`	Useful for spreadsheet-like data
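Since the output mode is selected purely by the prompt, one way to keep calls consistent is a prompt lookup plus a function that assembles the `messages` list. The sketch below reuses the prompt strings from the table. The mode keys and the name `build_messages` are assumptions, and the JSON (KIE) mode is left out because its prompt depends on the fields you want to extract.

```python
# Prompt strings taken from the table above; the dictionary keys are illustrative.
PROMPTS = {
    "text": "Extract all text from this document.",
    "markdown": (
        "Extract the text and format it as Markdown. "
        "Use # for headings and - for lists."
    ),
    "html_table": "Return the table as an HTML <table> element.",
}


def build_messages(data_uri: str, mode: str = "text") -> list:
    """Build the chat messages for one document and one output mode."""
    if mode not in PROMPTS:
        raise ValueError(f"Unknown output mode: {mode!r}")
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_uri}},
                {"type": "text", "text": PROMPTS[mode]},
            ],
        }
    ]


messages = build_messages("data:image/png;base64,AAAA", mode="markdown")
print(messages[0]["content"][1]["text"])
```

The returned list has the same shape as the `messages` argument in the extraction examples, so it can be passed to `client.chat.completions.create` unchanged.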

Observed model behaviours

JSON is always fence-wrapped. The model wraps JSON responses in markdown code fences regardless of the prompt. Strip both the opening and the closing fence before calling json.loads(), for example with re.sub(r"^```[a-z]*\s*|\s*```$", "", raw.strip()).
Markdown mode needs explicit instructions. A generic "convert to Markdown" prompt does not reliably produce heading syntax — always specify # for main headings, ## for subheadings, - for lists.
HTML input → HTML table output. When the input is an HTML file containing a <table>, the model reproduces it as HTML even when markdown tables are requested. Use a PDF or image source for clean markdown table output.
XLSX output includes the sheet name. For Excel files, the model prepends the worksheet name as the first line of output. Strip or handle that line in post-processing if needed.
PPTX and XLSX require valid Open XML. Corrupted or programmatically generated invalid Office ZIP structures return HTTP 400.
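The fence-wrapping behaviour from the list above can be handled with a small parsing helper. This is a minimal sketch that assumes the response consists of a single fenced JSON object. The function name `parse_fenced_json` is hypothetical, and the pattern uses a quantifier for the three backticks so that the helper also accepts input without fences.

```python
import json
import re

# Matches an opening fence (three backticks plus optional language tag)
# at the start, or a closing fence at the end of the string.
_FENCE_RE = re.compile(r"^`{3}[a-z]*\s*|\s*`{3}$")


def parse_fenced_json(raw: str):
    """Remove surrounding markdown code fences, then parse the rest as JSON."""
    cleaned = _FENCE_RE.sub("", raw.strip())
    return json.loads(cleaned)


fence = "`" * 3
sample = fence + 'json\n{"total": 12.5, "currency": "EUR"}\n' + fence
print(parse_fenced_json(sample))  # {'total': 12.5, 'currency': 'EUR'}
```

Here `raw` would be `response.choices[0].message.content` from a JSON (KIE) request. If the model returns text that is not valid JSON, `json.loads` raises `json.JSONDecodeError`, which the caller should handle.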

Terms of Use and Licensing

The general terms of use apply. The model is provided by Z.ai under the MIT License, and reuse of the extracted content is not subject to any additional restrictions imposed by the model license.
