i think this is the crux of DeepSeek-OCR
1. (text) context gets longer as you add words
2. long context is quadratic: self-attention cost grows with the square of the token count (rough numbers in the sketch below)
3. you can fit a lot of words in an image
4. if you use an encoder-decoder architecture, each of your vision tokens encodes a ton of information
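a back-of-the-envelope sketch of point 2 vs. point 4 (the token counts here are made-up illustrative numbers, not figures from the DeepSeek-OCR paper):

```python
# toy comparison of self-attention cost: text tokens vs. compressed vision tokens
# the counts below are illustrative assumptions, not DeepSeek-OCR's actual numbers

def attention_pairs(num_tokens: int) -> int:
    """Self-attention compares every token with every other token: O(n^2)."""
    return num_tokens * num_tokens

text_tokens = 6000    # assume a dense page tokenized as plain text
vision_tokens = 600   # assume the same page compressed ~10x into vision tokens

print(f"text:   {attention_pairs(text_tokens):>12,} pairwise interactions")
print(f"vision: {attention_pairs(vision_tokens):>12,} pairwise interactions")
print(f"ratio:  {attention_pairs(text_tokens) / attention_pairs(vision_tokens):.0f}x")
```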
for review: transformers can be encoder-only (often used for embedding models), decoder-only (what LLMs have been for years), or encoder-decoder (both)
we haven't really touched encoder-decoder in a while, for the most part
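for concreteness, a minimal sketch of the three families using common Hugging Face checkpoints (these specific models are just well-known examples of each family, nothing to do with DeepSeek-OCR itself):

```python
# three transformer families, illustrated with common Hugging Face checkpoints
# (example models only; not part of DeepSeek-OCR)
from transformers import BertModel, GPT2LMHeadModel, T5ForConditionalGeneration

encoder_only    = BertModel.from_pretrained("bert-base-uncased")          # embeddings
decoder_only    = GPT2LMHeadModel.from_pretrained("gpt2")                 # autoregressive LM
encoder_decoder = T5ForConditionalGeneration.from_pretrained("t5-small")  # encode once, then decode
```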
the only other lab i can think of is twelve labs; they use an encoder-only model to produce video embeddings and then a decoder-only model for reasoning
the encoder-decoder is interesting bc the encoder results can be cached in a vector DB (they're embeddings)
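a minimal sketch of that caching idea, assuming the encoder just returns one fixed-size embedding per input; the "vector DB" here is a plain numpy matrix with cosine search rather than a real database, and the encoder is a fake stand-in:

```python
# toy "vector DB": cache encoder outputs once, query them later without re-encoding
# encode() is a stand-in for whatever heavy encoder you actually run
import numpy as np

def encode(doc: str, dim: int = 64) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(doc)) % (2**32))  # fake deterministic "embedding"
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

docs = ["page 1 of the report", "page 2 of the report", "an unrelated memo"]
index = np.stack([encode(d) for d in docs])   # run the expensive encoder exactly once

query = encode("page 2 of the report")        # in practice, a query embedding
scores = index @ query                        # cosine similarity (vectors are unit norm)
print(docs[int(np.argmax(scores))])
```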
as i understand it, a reasoning model doesn't go back through the encoder (at least, that's not how twelve does it)
so your super-heavy, high-compression encoder model is only run once to fully compress the input
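a minimal PyTorch sketch of that shape, assuming a generic encoder-decoder stack (not DeepSeek-OCR's actual architecture): the encoder runs once over the input, and the decoder side can be re-run repeatedly against the cached encoder output:

```python
# encode once, decode many times: the heavy encoder output ("memory") is cached and reused
# generic nn.Transformer-style sketch, not DeepSeek-OCR's actual model
import torch
import torch.nn as nn

d_model = 256
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)

src = torch.randn(1, 800, d_model)        # e.g. compressed vision tokens for a whole document
with torch.no_grad():
    memory = encoder(src)                 # expensive pass, run exactly once

for step in range(3):                     # the decoder ("reasoning") side re-runs cheaply
    tgt = torch.randn(1, 32, d_model)     # stand-in for the current decoded/query tokens
    out = decoder(tgt, memory)            # cross-attends to the cached encoder states
    print(step, out.shape)
```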