Bluesky Thread

i think this is the crux of DeepSeek-OCR

1. (text) context gets longer as you add words
2. long context is quadratic (attention cost grows with the square of the token count)
3. you can fit lots of words in an image
4. if you use an encoder-decoder architecture, your tokens encode a ton of information (rough numbers sketched below)
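a back-of-envelope sketch of why that matters; the token counts here are my own assumptions, not figures from the paper:

```python
# toy arithmetic only: attention work grows with the square of sequence length,
# so shrinking the token count pays off quadratically
def attention_cost(num_tokens: int) -> int:
    """Pairwise attention interactions, ignoring constant factors and heads."""
    return num_tokens * num_tokens

text_tokens = 10_000    # a long document tokenized as plain text (assumed)
vision_tokens = 1_000   # the same pages rendered as images, ~10x compression (assumed)

print(attention_cost(text_tokens) // attention_cost(vision_tokens))  # 100 -> quadratic payoff
```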
1. Introduction

Current Large Language Models (LLMs) face significant computational challenges when processing long textual content due to quadratic scaling with sequence length. We explore a potential solution: leveraging visual modality as an efficient compression medium for textual information. A single image containing document text can represent rich information using substantially fewer tokens than the equivalent digital text, suggesting that optical compression through vision tokens could achieve much higher compression ratios.

This insight motivates us to reexamine vision-language models (VLMs) from an LLM-centric perspective, focusing on how vision encoders can enhance LLMs’ efficiency in processing textual information rather than basic VQA [12, 16, 24, 32, 41], which humans excel at. OCR tasks, as an intermediate modality bridging vision and language, provide an ideal testbed for this vision-text compression paradigm, as they establish a natural compression-decompression mapping between visual and textual representations while offering quantitative evaluation metrics.
Tim Kellogg @timkellogg.me
this paper deserves a deep breath and a slow exhaling “what the fuck”

who even talks about compression in OCR models?

who tries to spin an OCR model as a SOTA LLM? isn’t OCR a solved problem? what?

but oddly, i feel like they got something here, idk

they’re talking about unlimited context..
for review — transformers can be encoder-only (often used with embedding models), decoder-only (what LLMs have been for years) or encoder-decoder (both)
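a minimal sketch of the three flavors, using familiar Hugging Face checkpoints purely as examples (these are not the models from the paper):

```python
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

encoder_only    = AutoModel.from_pretrained("bert-base-uncased")     # text in -> embeddings out
decoder_only    = AutoModelForCausalLM.from_pretrained("gpt2")       # autoregressive generation only
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # encode once, then decode against it
```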

we haven't really touched encoder-decoder in a while, for the most part
the only other lab i can think of is twelve labs; they use an encoder-only model to produce video embeddings and then a decoder-only model for reasoning

the encoder-decoder is interesting bc the encoder results can be cached in a vector DB (they're embeddings)
as i understand it, a reasoning model doesn't go back through the encoder (at least, that's not how twelve does it)

so your super-heavy, high-compression encoder model is only run once to fully compress the input
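roughly what that caching looks like; a plain dict stands in for the vector DB and a dummy projection stands in for the heavy encoder (all names here are mine, not DeepSeek-OCR's):

```python
import numpy as np

def heavy_vision_encoder(image: np.ndarray) -> np.ndarray:
    """Stand-in for the expensive encoder: a fixed random projection to a 256-dim embedding."""
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((image.size, 256))
    return image.reshape(-1) @ proj

cache: dict[str, np.ndarray] = {}  # doc_id -> embedding; a real system would use a vector DB

def encode_once(doc_id: str, image: np.ndarray) -> np.ndarray:
    """Run the heavy encoder at most once per document, then reuse the cached embedding."""
    if doc_id not in cache:
        cache[doc_id] = heavy_vision_encoder(image)
    return cache[doc_id]

page = np.ones((32, 32))
first = encode_once("doc-1", page)   # encoder runs here
second = encode_once("doc-1", page)  # cache hit, encoder skipped
assert np.allclose(first, second)
```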
A vertical flowchart on a dark background depicts a multimodal model pipeline.

From top to bottom, the labeled boxes read:

* **images** → input data
* **encoder** → processes images into
* **embeddings** → compact representations
* **decoder** → generates outputs

From the **decoder**, two arrows branch out:

1. one leads to **output text** (the final generated response)
2. another loops downward to **thinking**, which then feeds back into the **decoder**, showing an internal reasoning or iterative refinement cycle before producing text output.
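as a toy loop, that figure might look something like this (placeholder functions, purely illustrative, nothing here is the paper's actual code):

```python
def encode(image):
    """images -> embeddings: pretend a few ints are compact vision tokens."""
    return [ord(c) % 97 for c in image[:4]]

def decode(embeddings, scratchpad):
    """embeddings (+ prior thinking) -> either another thinking step or the final text."""
    if len(scratchpad) < 3:
        return None, scratchpad + [f"thinking step {len(scratchpad) + 1}"]
    return "output text", scratchpad

def run(image):
    embeddings = encode(image)      # the encoder runs once
    text, thoughts = None, []
    while text is None:             # the decoder loops through "thinking" before emitting text
        text, thoughts = decode(embeddings, thoughts)
    return text

print(run("page.png"))  # -> "output text"
```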