Bluesky Thread

πŸ‹ Alert! DeepSeek Janus-Pro-7B

It’s multimodal and outperforms DALL-E and Stable Diffusion

Probably the biggest feature is its ability to generate text in an image that actually makes sense

They be cooking, I’m here for whatever is served

huggingface.co/deepseek-ai/...
The chart illustrates two sets of comparisons for large language models (LLMs) in multimodal and text-to-image benchmarks:

Left Panel:

Performance vs. Model Size
	β€’	X-axis: Number of LLM Parameters (in billions).
	β€’	Y-axis: Average performance on four multimodal understanding benchmarks.

Key Observations:
	β€’	Janus-Pro-7B: Achieves the highest average performance (~64) with 7 billion parameters.
	β€’	LLaVA-v1.5-7B: Performs slightly lower (~60), with similar parameters.
	β€’	TokenFlow-XL also shows notable performance at a higher parameter scale (>10B).
	β€’	Smaller models, such as Show-o and Janus-Pro-1B, have significantly reduced performance scores (~46–54).

Right Panel:

Accuracy on Text-to-Image Instruction-Following Benchmarks (GenEval and DPG-Bench)
	•	GenEval:
	•	Top performers: Janus-Pro-7B (80%) and SDXL (~67%).
	•	Lowest performer: PixArt-α (48%).
	•	DPG-Bench:
	•	Top performers: Janus-Pro-7B (84.2%) and SDXL (~83.5%).
	•	Other models, such as Emu3-Gen (~71.1%), trail behind.

Key Takeaways:
	1.	Janus-Pro Family consistently outperforms other models across both understanding and generation tasks, emphasizing its robustness.
	2.	Model size correlates positively with performance in multimodal understanding tasks. However, some smaller models (e.g., LLaVA) deliver competitive results in specific benchmarks.
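The takeaways above can be checked against the numbers read off the chart. Here is a minimal sketch that tabulates those approximate scores (estimates from the figure, not exact values from the Janus-Pro paper) and reports the leader and its margin on each benchmark:

```python
# Approximate right-panel scores read off the chart in the post.
# These are eyeballed estimates, not official numbers.
geneval = {"Janus-Pro-7B": 80.0, "SDXL": 67.0, "PixArt-alpha": 48.0}
dpg_bench = {"Janus-Pro-7B": 84.2, "SDXL": 83.5, "Emu3-Gen": 71.1}

def best(scores):
    """Return the (model, score) pair with the highest score."""
    return max(scores.items(), key=lambda kv: kv[1])

for name, scores in [("GenEval", geneval), ("DPG-Bench", dpg_bench)]:
    model, score = best(scores)
    runner_up = sorted(scores.values(), reverse=True)[1]
    margin = round(score - runner_up, 1)
    print(f"{name}: {model} leads at {score} (+{margin} over runner-up)")
```

Running it shows Janus-Pro-7B on top of both benchmarks, with a wide lead on GenEval and a narrow one over SDXL on DPG-Bench.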
for the nerds, a PDF: github.com/deepseek-ai/...
33 likes 4 reposts
