GLM 4.5: Open-Weights Model Aimed Squarely at Agents and Long Context | by Fariha Batool | Medium

Sitemap

Open in app

Get app

Write

GLM 4.5: Open-Weights Model Aimed Squarely at Agents and Long Context

Fariha Batool

2 min read

Jul 31, 2025

Listen

On 28 July 2025 Z.ai (the company formerly known as Zhipu AI) released two fully open Mixture-of-Experts (MoE) checkpoints under an MIT licence:

GLM-4.5–355B-A32B — 355 B total parameters, 32 B active per token

GLM-4.5-Air-106B-A12B — cost-sensitive sibling, 106 B total / 12 B active

Both appear on Hugging Face and ModelScope in BF16, FP8 and INT formats; all 52 internal evaluation traces are public.

Why this release is notable

1 .Frontier-level, benchmarked quality

Internal runs across 12 public suites place the flagship model third overall (trailing only OpenAI o3 and Grok 4) and the Air variant sixth. Highlights:

Tool use: matches or beats Claude 4 Sonnet on TAU-Bench Retail & Airline, beats Claude 4 Opus on BrowseComp
Reasoning: MMLU-Pro 84.6 %, AIME-24 91 %, MATH-500 98 %
Coding: SWE-bench Verified 64.2 %, Terminal-Bench 37.5 % — without external plugins

2 Agent-centric architecture

128 k context window
Native function calling
Dual thinking / non-thinking modes to balance deep chain-of-thought with fast chat

3 Built-in speculative decoding

A new Multi-Token Prediction (MTP) layer drafts several tokens inside a single forward pass, no external “draft” model. The same pass also verifies the draft, yielding lower latency on mixed CPU-GPU rigs.

4 Transparent release artefacts

Weights, evaluation trajectories, and vLLM / SGLang guides ship together. GLM 4.5 is also only the second frontier model to validate the Muon optimiser at massive scale.

A deliberately broad prompt test

I ran the same wide but imprecise prompt against GLM 4.5 and OpenAI o3 to gauge coverage:

Let’s think step by step. Explain Azure Bot Framework and Custom Engine Agents—include App Registration, Redis, Application Insights, Key Vault, Web App, Tunnel Relay… Be very detailed.

GLM 4.5

Returned a sectioned, exhaustive explanation
Embedded code snippets (C#, Python) for Key Vault, Redis, App Insights
Auto-generated a Mermaid graph of the end-to-end architecture; the Graph TD rendered cleanly, refreshable SVG download included

o3

Produced a shorter outline
Gave a Graph TD block that failed to render in the Mermaid sandbox
No embedded code

For documentation or diagram-heavy workflows, that’s a practical win for GLM 4.5. You can reproduce the test here:

Chat with Z.ai — Free AI for Presentations, Writing & Coding

What’s still missing

Third-party replication:all scores are vendor-reported; no peer-reviewed paper yet
Hardware footprint:the 355 B model still needs multi-GPU servers in BF16, lightweight GGUF/GGML quantisations are not yet published
Dual-use debate:MIT-licensed frontier weights re-ignite questions around bio-risk and misinformation controls

Bottom line

GLM 4.5 blends large-scale MoE, agent-friendly features and built-in speculative decoding while keeping everything under an open license. If the internal numbers hold up, it could become a go-to base for multilingual coding agents and long-context RAG pipelines. Another push in the East-West race toward truly general-purpose, tool-using LLMs.

Get Fariha Batool’s stories in your inbox

Join Medium for free to get updates from this writer.

Remember me for faster sign in

Have you benchmarked it yet? Comments are most welcome!

Written by Fariha Batool

6 followers

·3 following

AI enthusiast turning generative AI ideas into product features. I write bite‑sized notes on AI tools and breakthroughs, usually with a cup of coffee in hand.