Bluesky Thread

Z.ai released a paper very similar to DeepSeek-OCR on the same exact day (a few hours earlier afaict)

Glyph is just a framework, not a model, but they got Qwen3-8B (128k context) to handle over 1 million tokens of context by rendering the text input as images: at several-fold visual compression, each image token carries multiple text tokens' worth of content, so the same window covers far more text

arxiv.org/abs/2510.17800
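
To make the trick concrete, here's a minimal sketch of the rendering step in Python, assuming Pillow; the page size, font, and layout values are illustrative guesses, not Glyph's actual rendering configuration.

from PIL import Image, ImageDraw, ImageFont
import textwrap

def render_text_to_pages(text, chars_per_line=120, lines_per_page=80):
    # Wrap the raw text and paint it onto fixed-size page images;
    # each page is then fed to the VLM as a single image input.
    font = ImageFont.load_default()  # a real TTF would allow denser layouts
    lines = textwrap.wrap(text, width=chars_per_line)
    pages = []
    for start in range(0, len(lines), lines_per_page):
        img = Image.new("RGB", (1024, 1024), "white")
        draw = ImageDraw.Draw(img)
        y = 8
        for line in lines[start:start + lines_per_page]:
            draw.text((8, y), line, fill="black", font=font)
            y += 12  # line height in pixels
        pages.append(img)
    return pages

pages = render_text_to_pages("some very long document text " * 2000)
print(len(pages), "page images")

The payoff is that the VLM's vision encoder charges per image patch rather than per text token, so a dense page of prose costs far fewer input tokens than the same text run through the tokenizer.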
Figure 1.

(Upper) Diagram comparing two paradigms for long-context tasks:
	•	Left path (Plain Text): A long novel (~180K words, e.g. "Jane Eyre") is fed directly as text into an LLM, requiring roughly 240K tokens.
	•	Right path (Rendering): The same text is rendered into images, producing about 80K tokens (a 3× input-token compression) and processed by a VLM (vision-language model) instead of a text-only LLM.

(Lower) Two sets of bar charts:
	•	Left chart (Accuracy): Glyph performs comparably to Qwen3-8B, GLM-4-9B-Chat-1M, and Qwen2.5-7B-Instruct-1M on LongBench and MRCR tasks.
	•	Right chart (Compression/Speedup): Glyph shows a 3.2× KV-cache reduction, 4.8× prefill speedup, and 4.4× higher decoding throughput compared to its text backbone model.

Caption text:

Comparison of two paradigms for long-context tasks: conventional approaches directly feeding plain text into LLMs, and the proposed VLM-based paradigm, Glyph, which renders text as compact images to achieve substantial input-token compression. Glyph attains competitive performance on LongBench and MRCR while offering significant compression and inference speedup over its text backbone model on 128K-token inputs.
3 hours later
it occurs to me that maybe Chinese labs are working together, and this framework might work really well with DeepSeek-OCR
51 likes · 7 reposts
